--- ocmchatlog1 [2009/04/30 12:54] – created megadiscman
+++ ocmchatlog1 [2009/04/30 12:59] (current) – megadiscman
@@ Line 1: / Line 1: @@
-dummy
+<code>
+<ocm geek> I started documenting the bytecode opcodes in our wiki.
+<inquirer> nice
+<inquirer> I am still banging my head against the netmd blocks
+<inquirer> actually, not still, I just picked it up again
+<ocm geek> https://wiki.physik.fu-berlin.de/linux-minidisc/doku.php?id=ocmbytecode
+<inquirer> where/how are the entry points found?
+<ocm geek> The main entry point is at offset 10 in the ocm file. Marten's code should
+           take care of that.
+<ocm geek> Native modules are loaded with opcode 0x75.
+<ocm geek> Their entry point is stored in their header.
+<ocm geek> They are addressed by name, usually. The name is also stored in the header.
+<inquirer> yeah, I think netmd is a bad one to start with, no names to be found
+<inquirer> which one is less cryptic?
+<ocm geek> Bytecode blocks can be loaded as global code with opcode 0x7C
+<ocm geek> If you take apart init.ocm, that's a special .ocm file, you find the native
+           modules making up the bytecode interpreter.
+<ocm geek> init.ocm is made of a special loader bytecode. I don't know whether any of
+           Marten's tools can parse that, but that doesn't really matter: You can do it
+           by hand.
+<inquirer> ah there is a ton of ocm files in OpenMG
+<ocm geek> I know. The only one I looked at yet is init.ocm
+<inquirer> I only looked at the ones in the ocm tar ball so far
+<ocm geek> First byte of init.ocm is "04" which means a blob follows.
+<inquirer> ah, sonicstage 3.4 doesn't have 0301 format
+<ocm geek> Does it have 0303?
+<ocm geek> As I said, init.ocm is special! netmd.ocm starts with 0301 in my SonicStage
+           (I think 3.4) installation.
+<inquirer> right
+<ocm geek> Let me finish explaining the special init.ocm format.
+<inquirer> sure
+<ocm geek> The blob following the 04 is preceded by its length.
+<ocm geek> 83 means that the blob is longer than 127 bytes, and the length itself
+           takes 3 bytes.
+<inquirer> ASN.1
+<ocm geek> The following bytes are 066e29 in my installation.
+<ocm geek> Yeah, right.
+<ocm geek> The OCM stuff uses ASN.1 like serialisation format.
+<ocm geek> And after the length, voila, you find 0301!
+<ocm geek> Because that blob is a standard 0301 bytecode blob.
+<inquirer> After the length I have 03 01 07 00
+<inquirer> yeah
+<inquirer> ah
+<inquirer> my length is different, but that's fin
+<inquirer> fine
+<ocm geek> Past that block, I find 04 82 9c 91
+<ocm geek> This indicates a second blob, this time length 9c91
+<ocm geek> That block is a native code module, name "intrins".
+<ocm geek> It contains the bytecode interpreter core.
+<ocm geek> After the second blob there are just two further bytes 06 and 08.
+<inquirer> I am not sure how I know the length of a 03 01 block
+<ocm geek> You find it beforehand.
+<inquirer> oh
+<ocm geek> If your init.ocm is the same as mine, it's 0x66e29
+<inquirer> right, the file is longer than that
+<inquirer> this is wiki stuff :)
+<ocm geek> The 06 equals the bytecode 75 and loads the last blob as native code module.
+<inquirer> my file is different (bytecode length is 07e834), but same layout
+<inquirer> incl the 06 08 at the end
+<ocm geek> The 08 equals the bytecode 77 and calls the startup function of that module.
+<ocm geek> As the 0301 blob is still on the stack of the boot interpreter, this blob
+           gets passed as parameter to the startup function.
+<ocm geek> As the startup function of "intrins" is the bytecode interpreter, it
+           interprets the big bytecode blob.
+<ocm geek> This big bytecode blob contains further native modules.
+<inquirer> this is because 04 is ipush_str4
+<inquirer> I see
+<ocm geek> Be careful. Boot bytecode is not completely equivalent to standard bytecode.
+<ocm geek> It happens to be the same for opcodes 01..04
+<inquirer> ah yes the 06 08
+<inquirer> I just got lucky here with 04
+<ocm geek> Probably that's on purpose, because the codes 01 to 04 are used for
+           serialization of internal values.
+<inquirer> at this stage in the game, is it still useful for me to get my own
+           disassembled bytecode interpreter?
+<ocm geek> Probably not.
+<ocm geek> I have it, and am currently transferring that knowledge to the Wiki.
+<inquirer> good
+<inquirer> I guess I will focus on the netmd then
+<ocm geek> I just pointed you to the init.ocm because you asked for something that
+           might be less cryptic.
+<inquirer> for the wiki, could you keep the mnemonic in the title? ie. Opcode 02:
+           Immediate BigInt (ipush_str4)
+<inquirer> no big deal of course
+<inquirer> yep, great stuff with the init
+<inquirer> it will help me to recognize the extension format header
+<ocm geek> I didn't really look at Marten's opcode names, but I can put them in.
+<inquirer> they can be added later of course
+<inquirer> the doc is more important ;)
+<ocm geek> You can compare with Marten's scanner.c
+<ocm geek> But beware. The comments indicating indices are *decimal*, while everything
+           I write is *hexadecimal*.
+<inquirer> I realized that, so I switched to opcodes.h
+<inquirer> ha, I should have looked at the perl unpack syntax earlier
+<ocm geek> Yeah. opcodes.h is hexadecimal.
+<ocm geek> But some mnemonics seem to not match my findings.
+<ocm geek> For example "allocMem" in Marten's code is "Store to User Dictionary"
+           according to my analysis.
+<ocm geek> Might be that Sony changed the meaning of the code, or one of us is wrong.
+<inquirer> yeah
+<inquirer> ok, codeblockparsed the binary blob
+<inquirer> how do you invoke gas?  can I use i596-mingw32msvc-as?
+<inquirer> i586...
+<ocm geek> yes.
+<ocm geek> I used that one.
+[...]
+<inquirer> anything better than as' ing the codeblockparser output into a COFF
+           executable?
+<inquirer> i guess it doesn't actually matter what format the asm is wrapped in
+<ocm geek> Yeah. Must be a format IDA is able to read.
+<ocm geek> And IDA Freeware only reads COFF objects.
+<ocm geek> (and Windows/DOS EXE files and drivers)
+<inquirer> cool
+<ocm geek> You might want to try loading the object at a different offset than 0, to
+           help IDA distinguish offsets from numbers. Somehow IDA is unable to know
+           that with objects, *every* offset is tagged as offset.
+<ocm geek> In completely linked executables without reloc info, it is not that ease.
+<ocm geek> easy.
+<ocm geek> You will need info about the import functions to make sense of it.
+<ocm geek> Just a second...
+<inquirer> sonicstage 3.4 netmd.ocm is half as big as the one from maarten
+<ocm geek> https://wiki.physik.fu-berlin.de/linux-minidisc/doku.php?id=ocmsalwrapexports
+<ocm geek> Maarten had a much older sonic stage.
+<ocm geek> Maybe they moved parts out of netmd.ocm into standard DLLs.
+<ocm geek> Or they have rewritten parts from bytecode to native code.
+<inquirer> yeah
+<ocm geek> It was known that the OpenMG virtualization/crypto stuff was very heavy on
+           processing power in early sonic stage versions.
+<inquirer> it's a bit scary how much you know about this VM
+<ocm geek> I talked to Marten on MSN.
+<inquirer> ah, ok :)
+<ocm geek> And I reversed salwrap.dll myself.
+<ocm geek> Not that I ever got completely through it even once.
+<inquirer> at least it starts making sense
+<ocm geek> I will go to sleep now.
+<inquirer> mh ok
+<ocm geek> See you tomorrow.
+<inquirer> or do you have 3 minutes for me?
+<ocm geek> OK.
+<inquirer> let's see if this is something simple
+<inquirer> between the bytecodeblocks there are 63xx0f instructions
+<inquirer> what's their significance?
+<inquirer> it's always 66 BIGLENGTH BYTECODE... 63 XX 0F  and again 66...
+<ocm geek> Ah. I see. 63 is bipush with encrypted operand
+<ocm geek> But Marten's decoder already decrypts it for you.
+<ocm geek> Likewise, 66 is just ipush_str4 with an encrypted operand, which Marten's
+           decoder also decodes.
+<inquirer> yep
+<ocm geek> 0F is store to dictionary.
+<inquirer> so, it keeps pushing stuff
+<ocm geek> store to dictionary pops it.
+<inquirer> I think I am missing the big picture here, is this for constructing a
+           symbol table or something?
+<ocm geek> Every instruction pops the operands it used.
+<ocm geek> Kind of.
+<inquirer> mmmh, ok
+<ocm geek> There are two dictionaries.
+<ocm geek> The system dictionary has 256 entries addressed by smallints between 0 and
+           255.
+<ocm geek> What you see here is bytecode blobs stored into the system dictionary.
+<inquirer> cool
+<inquirer> for now that would be enough if you want to leave me now ;)
+<inquirer> I'll go to bed soon, too
+<ocm geek> So that's a way of exporting them to other OCM modules or perhaps even to
+           salwrap
+<inquirer> but some more info on this big picture would be cool to have in the wiki
+           *hint hint*
+<inquirer> yeah
+<inquirer> makes total sense
+<inquirer> it's kinda weird to have such a dynamic format
+<ocm geek> Someone at some other point might decide to "call the bytecode in system
+           dict at index 77"
+<inquirer> right
+<ocm geek> The system dict probably is quite fixed in purpose.
+<ocm geek> There are magic entries near the end of the dictionary, for example 0xfd
+           points to a blob that represents the jump table of the bytecode interpreter.
+<ocm geek> Probably you won't encounter any access to it, unless you look at init.ocm.
+<inquirer> sweet
+<inquirer> this is really helpful
+<ocm geek> The extension modules loaded in the bytecode part of init.ocm are
+           hotpatching their byte code instructions into the jump table.
+<ocm geek> But after init.ocm is done, the jump table is full, so no sense in
+           accessing it.
+<inquirer> yeah, well, I still don't know how addressing works in this system
+<inquirer> but the opcode description may shed light on that
+<ocm geek> What do you mean by "addressing"?
+<inquirer> things like jumps
+<inquirer> branching
+<ocm geek> There are no jumps and branches in the bytecode.
+<inquirer> oh cool
+<ocm geek> Have you ever programmed PostScript?
+<inquirer> that's easy ;)
+<inquirer> nope
+<inquirer> but I know fortran
+<inquirer> long time ago though
+<ocm geek> OK, but fortran does have jumps.
+<ocm geek> This byte code is much more structured.
+<inquirer> I meant forth
+<inquirer> sorry
+<inquirer> stack based, I forgot about that.  makes sense now
+<ocm geek> Ah. That's something completely different to fortran. I don't really know
+           it, but it might have similar properties to PostScript or this byte code
+           (both stack-based, too)
+<inquirer> it's coming together now
+<ocm geek> Be careful with Marten's CALL_IF instruction (0x33). That is a misnomer.
+<ocm geek> It should be CALL_WHILE.
+<inquirer> ok
+<inquirer> I really have an itch to improve the output of the program, it's quite a
+           mess
+<inquirer> but I have to understand more first, and it might be a waste
+<inquirer> thanks a lot, again
+<ocm geek> Probably he didn't notice that CALL_IF is wrong. The idea is that
+           CALL_WHILE returns to the CALL_WHILE instruction after running the code
+           block, so it gets executed again and again, until top-of-stack is zero.
+<inquirer> makes sense
+<ocm geek> If the return address stored in the interpreter would be the next
+           instruction (as in CALL and CALL_IF_ELSE) it would really be CALL_IF.
+<ocm geek> I also have started a bytecode parsing program before I got in contact with
+           Marten, but that is even more rough.
+<ocm geek> I used Haskell for it.
+<inquirer> there is a problem for me of course because netmd has some recursive
+           decryption
+<inquirer> seems difficult to me to make a static analysis here
+<ocm geek> What do you mean by that?
+<inquirer> as I said, the decrypted bytecode contains encrypted bytecode
+<inquirer> so, you would need to recursively decrypt
+<ocm geek> That seems to be standard practice in OCM bytecode modules, but Marten's
+           dumper doesn't support it currently.
+<inquirer> but for that you need to "run" the bytecode
+<inquirer> yeah
+<ocm geek> Marten's decoder "run"s the crypto setup instruction for the main block.
+<inquirer> yeah
+<ocm geek> So the code to do that is already there.
+<inquirer> yup
+<inquirer> I already modified it to decrypt it, but not all of it
+<ocm geek> But I don't know how well-designed his code is, and how easy you could add
+           decryption of sub-blocks.
+<inquirer> it's modular enough
+<ocm geek> Oh, you already started :)
+<ocm geek> Nice.
+<inquirer> yeah, but only very simple.  I don't catch encryption that isn't at offset
+           0 of a bytecodeblock
+<inquirer> there are some of those
+<ocm geek> Probably they set up other stuff first.
+<ocm geek> You might need to run that too.
+<inquirer> crunching away at it very slowly
+<inquirer> things like 30 80 04 07 02 6c 50 73
+<ocm geek> Strange.
+<ocm geek> 30 is "compare DWORDS for equality".
+<ocm geek> Why would a subblock start with it?
+<inquirer> well, we don't know how the block is used, do we?
+<inquirer> at least not yet
+<ocm geek> Yeah. Maybe it's not a bytecode containing block after all.
+<inquirer> there are others exactly like that
+<inquirer> but the encryption!
+<inquirer> 026c5073
+<inquirer> but yeah, maybe it's processed first
+<ocm geek> Oh, you are right.
+<inquirer> it's a different keyindex though than usual in netmd which is 6c50
+<inquirer> so, who knows
+<inquirer> I also got a block:
+<inquirer> 30 80 02 04 08 20 00 20 02 04 08 20 80 00 02 03 00 80 20 02 01 00  etc
+<inquirer> going on and on in that manner
+<inquirer> with no 02xxyy73 in that block
+<ocm geek> Argh, wait a moment.
+<ocm geek> That might be ASN.1 encoded sequences.
+<ocm geek> If you decode them, you get an array of bytecodes.
+<ocm geek> 30 80 is the ASN tag for sequence of undetermined length.
+<inquirer> good one
+ * inquirer makes a mental note not to forget ASN.1
+<inquirer> cool
+<ocm geek> So your 30 80 04 07 is a sequence, whose first element is a 7 byte blob.
+<inquirer> so, it is code within a data structure
+<ocm geek> the first 4 bytes are setting up cryptography, so just 3 bytes remaining.
+<ocm geek> Exactly.
+<inquirer> you rock
+<ocm geek> That's the nice thing about languages where code blocks are first class
+           data objects: You can put them into any data structure you like.
+<ocm geek> OK. Good night for real now.
+<inquirer> like lisp
+<inquirer> gn!
+<ocm geek> Another hint for the sequences: They are really ASN.1 (including
+           the tags)
+<ocm geek> While normally, you have the bytecode and then untagged ASN.1 like encoded
+           data, inside the sequence instead of bytecodes the real ASN.1 tags are used.
+<ocm geek> That means specifically: All numbers (small numbers and arbitrary precision
+           integers) are encoded with ASN.1 tag 2 (INTEGER).
+<ocm geek> While the bytecode 02 is "16 bit constant", the ASN.1 tag 2 is
+           length-prefixed arbitrary precision integer.
+<inquirer> ok
+<ocm geek> Byte blocks are encoded with ASN.1 tag 4 (OCTET STRING) that happens to
+           coincide with bytecode 4.
+<ocm geek> Nested sequences are encoded as ASN.1 sequences like the top sequence.
+<ocm geek> BTW: your 30 80 02 04 08 20 00 20 02 04 08 20 80 00 02 03... constant you
+           quoted looks like an array Sony's DES implementation uses.
+<ocm geek> I will cross-check.
+<inquirer> this nested encryption is starting to annoy me
+<ocm geek> OK. It doesn't match any of the arrays in the HiMD Transfer Tool for Mac
+           used for DES encryption.
+<inquirer> does this ring  a bell?  4e 20 1d 3f ...
+<inquirer> when I try to disassemble it, I get garbage
+<inquirer> but it is CALL'ed, and I don't know how that works
+<ocm geek> Sorry. I did not yet look at calling bytecode.
+<ocm geek> But you are right. That doesn't seem like executable code.
+<ocm geek> The code in the interpreter looks like it would execute it as is.
+<ocm geek> Could you have something messed up with decryption?
+</code>
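
The length prefixes, the init.ocm layout and the `30 80` sequences discussed in the log can be sketched in Python. This is a minimal sketch under assumptions, not the real loader. The function names `read_length`, `split_boot_stream` and `parse_elements` are made up for illustration. Every boot byte other than 04 is treated as a one-byte opcode (the log only confirms 06 and 08). The `aa bb` blob in the demo is a stand-in, not real module data. The 00 00 end marker for indefinite-length sequences comes from standard BER and is not confirmed for OCM files.

```python
def read_length(data, pos):
    """Decode an ASN.1-style length at data[pos]; return (length, next_pos)."""
    first = data[pos]
    if first < 0x80:                      # short form: the byte is the length
        return first, pos + 1
    count = first & 0x7F                  # long form: 0x83 -> 3 length bytes follow
    if count == 0:
        raise ValueError("indefinite length (0x80) is not handled here")
    end = pos + 1 + count
    if end > len(data):
        raise ValueError("truncated length field")
    return int.from_bytes(data[pos + 1:end], "big"), end


def split_boot_stream(data):
    """Split an init.ocm-like stream into ('blob', bytes) and ('op', byte) items.

    Assumption: anything that is not a 04 blob is a one-byte boot opcode.
    """
    items, pos = [], 0
    while pos < len(data):
        tag = data[pos]
        if tag == 0x04:
            length, start = read_length(data, pos + 1)
            if start + length > len(data):
                raise ValueError("truncated blob")
            items.append(("blob", data[start:start + length]))
            pos = start + length
        else:
            items.append(("op", tag))
            pos += 1
    return items


def parse_elements(data, pos=0, end=None, indefinite=False):
    """Parse real ASN.1 tags 02 (INTEGER), 04 (OCTET STRING), 30 (SEQUENCE).

    Returns (items, next_pos). A truncated indefinite sequence is tolerated.
    """
    end = len(data) if end is None else end
    items = []
    while pos < end:
        if indefinite and data[pos:pos + 2] == b"\x00\x00":
            return items, pos + 2         # BER end-of-contents (assumed for OCM)
        if pos + 1 >= end:
            raise ValueError("truncated element header")
        tag = data[pos]
        if tag == 0x30 and data[pos + 1] == 0x80:
            children, pos = parse_elements(data, pos + 2, end, True)
            items.append(("sequence", children))
            continue
        length, start = read_length(data, pos + 1)
        stop = start + length
        if stop > end:
            raise ValueError("truncated element")
        if tag == 0x02:
            value = int.from_bytes(data[start:stop], "big", signed=True)
            items.append(("integer", value))
        elif tag == 0x04:
            items.append(("blob", data[start:stop]))
        elif tag == 0x30:
            children, _ = parse_elements(data, start, stop, False)
            items.append(("sequence", children))
        else:
            raise ValueError("unexpected tag 0x%02x" % tag)
        pos = stop
    return items, pos


if __name__ == "__main__":
    # Length prefixes quoted in the log: 83 06 6e 29 and 82 9c 91.
    print(read_length(bytes.fromhex("83066e29"), 0))
    print(read_length(bytes.fromhex("829c91"), 0))
    # Tiny made-up stream with the same layout as init.ocm: blob, blob, 06, 08.
    print(split_boot_stream(bytes.fromhex("0404030107000402aabb0608")))
    # The (truncated) sequence quoted near the end of the log.
    quoted = "30 80 02 04 08 20 00 20 02 04 08 20 80 00 02 03 00 80 20 02 01 00"
    print(parse_elements(bytes.fromhex(quoted))[0])
```

Fed a real init.ocm, `split_boot_stream` should yield the two blobs followed by the 06 and 08 opcodes if the file has the layout described in the log. The first blob can then be handed to a regular 0301 bytecode dumper.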