<ocm geek> I started documenting the bytecode opcodes in our wiki.
<inquirer> nice
<inquirer> I am still banging my head against the netmd blocks
<inquirer> actually, not still, I just picked it up again
<ocm geek> https://wiki.physik.fu-berlin.de/linux-minidisc/doku.php?id=ocmbytecode
<inquirer> where/how are the entry points found?
<ocm geek> The main entry point is at offset 10 in the ocm file. Marten's code should take care of that.
<ocm geek> Native modules are loaded with opcode 0x75.
<ocm geek> Their entry point is stored in their header.
<ocm geek> They are addressed by name, usually. The name is also stored in the header.
<inquirer> yeah, I think netmd is a bad one to start with, no names to be found
<inquirer> which one is less cryptic?
<ocm geek> Bytecode blocks can be loaded as global code with opcode 0x7C
<ocm geek> If you take apart init.ocm, that's a special .ocm file, you find the native modules making up the bytecode interpreter.
<ocm geek> init.ocm is made of a special loader bytecode. I don't know whether any of Marten's tools can parse that, but that doesn't really matter: You can do it by hand.
<inquirer> ah there is a ton of ocm files in OpenMG
<ocm geek> I know. The only one I have looked at so far is init.ocm
<inquirer> I only looked at the ones in the ocm tar ball so far
<ocm geek> First byte of init.ocm is "04" which means a blob follows.
<inquirer> ah, sonicstage 3.4 doesn't have 0301 format
<ocm geek> Does it have 0303?
<ocm geek> As I said, init.ocm is special! netmd.ocm starts with 0301 in my SonicStage 3 (I think 3.4) installation.
<inquirer> right
<ocm geek> Let me finish explaining the special init.ocm format.
<inquirer> sure
<ocm geek> The blob following the 04 is preceded by its length.
<ocm geek> 83 means that the blob is longer than 127 bytes, and the length itself takes 3 bytes.
<inquirer> ASN.1
<ocm geek> The following bytes are 066e29 in my installation.
<ocm geek> Yeah, right.
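The length prefix described here matches the long form of ASN.1 BER lengths: a first byte below 0x80 is the length itself, while 0x80 plus n means that n length bytes follow, most significant first. A minimal Python sketch of reading the "04 83 06 6e 29" header from the chat, assuming init.ocm follows that rule exactly; the function names `read_length` and `read_blob_header` are made up for illustration and are not taken from Marten's tools:

```python
def read_length(data: bytes, pos: int) -> tuple[int, int]:
    """Return (length, next_pos) for a BER-style length field at pos."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1              # short form: up to 127 bytes
    count = first & 0x7F                   # long form: number of length bytes
    if count == 0:
        raise ValueError("indefinite length (0x80) is not handled here")
    length = int.from_bytes(data[pos + 1:pos + 1 + count], "big")
    return length, pos + 1 + count


def read_blob_header(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (blob_length, blob_start) for a blob introduced by byte 04."""
    if data[pos] != 0x04:
        raise ValueError("expected 04 (blob follows)")
    return read_length(data, pos + 1)


# Header bytes quoted in the chat: 04, then 83, then 06 6e 29.
length, start = read_blob_header(bytes.fromhex("0483066e29"))
print(hex(length), start)  # 0x66e29 5
```

With a real file, `start` is the offset where the 0301 bytecode blob begins and `start + length` is where the next element starts.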
<ocm geek> The OCM stuff uses an ASN.1-like serialisation format.
<ocm geek> And after the length, voila, you find 0301!
<ocm geek> Because that blob is a standard 0301 bytecode blob.
<inquirer> After the length I have 03 01 07 00
<inquirer> yeah
<inquirer> ah
<inquirer> my length is different, but that's fine
<ocm geek> Past that block, I find 04 82 9c 91
<ocm geek> This indicates a second blob, this time length 9c91
<ocm geek> That block is a native code module, name "intrins".
<ocm geek> It contains the bytecode interpreter core.
<ocm geek> After the second blob there are just two further bytes 06 and 08.
<inquirer> I am not sure how I know the length of a 03 01 block
<ocm geek> You find it beforehand.
<inquirer> oh
<ocm geek> If your init.ocm is the same as mine, it's 0x66e29
<inquirer> right, the file is longer than that
<inquirer> this is wiki stuff :)
<ocm geek> The 06 equals the bytecode 75 and loads the last blob as native code module.
<inquirer> my file is different (bytecode length is 07e834), but same layout
<inquirer> incl the 06 08 at the end
<ocm geek> The 08 equals the bytecode 77 and calls the startup function of that module.
<ocm geek> As the 0301 blob is still on the stack of the boot interpreter, this blob gets passed as parameter to the startup function.
<ocm geek> As the startup function of "intrins" is the bytecode interpreter, it interprets the big bytecode blob.
<ocm geek> This big bytecode blob contains further native modules.
<inquirer> this is because 04 is ipush_str4
<inquirer> I see
<ocm geek> Be careful. Boot bytecode is not completely equivalent to standard bytecode.
<ocm geek> It happens to be the same for opcodes 01..04
<inquirer> ah yes the 06 08
<inquirer> I just got lucky here with 04
<ocm geek> Probably that's on purpose, because the codes 01 to 04 are used for serialization of internal values.
<inquirer> at this stage in the game, is it still useful for me to get my own disassembled bytecode interpreter?
<ocm geek> Probably not.
<ocm geek> I have it, and am currently transferring that knowledge to the Wiki.
<inquirer> good
<inquirer> I guess I will focus on the netmd then
<ocm geek> I just pointed you to the init.ocm because you asked for something that might be less cryptic.
<inquirer> for the wiki, could you keep the mnemonic in the title? i.e. Opcode 02: Immediate BigInt (ipush_str4)
<inquirer> no big deal of course
<inquirer> yep, great stuff with the init
<inquirer> it will help me to recognize the extension format header
<ocm geek> I didn't really look at Marten's opcode names, but I can put them in.
<inquirer> they can be added later of course
<inquirer> the doc is more important ;)
<ocm geek> You can compare with Marten's scanner.c
<ocm geek> But beware. The comments indicating indices are *decimal*, while everything I write is *hexadecimal*.
<inquirer> I realized that, so I switched to opcodes.h
<inquirer> ha, I should have looked at the perl unpack syntax earlier
<ocm geek> Yeah. opcodes.h is hexadecimal.
<ocm geek> But some mnemonics seem to not match my findings.
<ocm geek> For example "allocMem" in Marten's code is "Store to User Dictionary" according to my analysis.
<ocm geek> Might be that Sony changed the meaning of the code, or one of us is wrong.
<inquirer> yeah
<inquirer> ok, codeblockparsed the binary blob
<inquirer> how do you invoke gas? can I use i586-mingw32msvc-as?
<ocm geek> yes.
<ocm geek> I used that one.
[...]
<inquirer> anything better than as'ing the codeblockparser output into a COFF executable?
<inquirer> i guess it doesn't actually matter what format the asm is wrapped in
<ocm geek> Yeah. Must be a format IDA is able to read.
<ocm geek> And IDA Freeware only reads COFF objects.
<ocm geek> (and Windows/DOS EXE files and drivers)
<inquirer> cool
<ocm geek> You might want to try loading the object at a different offset than 0, to help IDA distinguish offsets from numbers.
<ocm geek> Somehow IDA is unable to know that with objects, *every* offset is tagged as offset.
<ocm geek> In completely linked executables without reloc info, it is not that easy.
<ocm geek> You will need info about the import functions to make sense of it.
<ocm geek> Just a second...
<inquirer> sonicstage 3.4 netmd.ocm is half as big as the one from maarten
<ocm geek> https://wiki.physik.fu-berlin.de/linux-minidisc/doku.php?id=ocmsalwrapexports
<ocm geek> Maarten had a much older sonic stage.
<ocm geek> Maybe they moved parts out of netmd.ocm into standard DLLs.
<ocm geek> Or they have rewritten parts from bytecode to native code.
<inquirer> yeah
<ocm geek> It was known that the OpenMG virtualization/crypto stuff was very heavy on processing power in early sonic stage versions.
<inquirer> it's a bit scary how much you know about this VM
<ocm geek> I talked to Marten on MSN.
<inquirer> ah, ok :)
<ocm geek> And I reversed salwrap.dll myself.
<ocm geek> Not that I ever got once completely through it.
<inquirer> at least it starts making sense
<ocm geek> I will go to sleep now.
<inquirer> mh ok
<ocm geek> See you tomorrow.
<inquirer> or do you have 3 minutes for me?
<ocm geek> OK.
<inquirer> let's see if this is something simple
<inquirer> between the bytecodeblocks there are 63xx0f instructions
<inquirer> what's their significance?
<inquirer> it's always 66 BIGLENGTH BYTECODE... 63 XX 0F and again 66...
<ocm geek> Ah. I see. 63 is bipush with encrypted operand
<ocm geek> But Marten's decoder already decrypts it for you.
<ocm geek> Likewise, 66 is just ipush_str4 with an encrypted operand, which Marten's decoder decodes.
<ocm geek> 0F is store to dictionary.
<inquirer> so, it keeps pushing stuff
<ocm geek> store to dictionary pops it.
<inquirer> I think I am missing the big picture here, is this for constructing a symbol table or something?
<ocm geek> Every instruction pops the operands it used.
<ocm geek> Kind of.
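The repeating "66 BIGLENGTH BYTECODE... 63 XX 0F" pattern can be pictured with a toy stack model. This is only a sketch under assumptions: operands are taken as already decrypted, XX is treated as the dictionary key (the byte pushed last, so popped first), and the instruction names, keys and blob contents are placeholders rather than real opcodes or data.

```python
def run(instructions):
    """Toy model of push/push/store: returns (leftover_stack, dictionary)."""
    stack = []
    dictionary = {}
    for op, operand in instructions:
        if op in ("push_blob", "push_byte"):   # stand-ins for 66 and 63
            stack.append(operand)
        elif op == "store":                    # stand-in for 0F
            key = stack.pop()                  # assumption: key is on top
            value = stack.pop()                # the blob pushed before it
            dictionary[key] = value
        else:
            raise ValueError(f"unknown toy instruction {op!r}")
    return stack, dictionary


# Two rounds of the pattern, with placeholder blobs and keys.
program = [
    ("push_blob", b"block A"), ("push_byte", 0x10), ("store", None),
    ("push_blob", b"block B"), ("push_byte", 0x11), ("store", None),
]
leftover, table = run(program)
print(leftover, table)
```

The stack ends up empty after each round because the store consumes both values it used, which is why the pattern can repeat indefinitely without the stack growing.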
<inquirer> mmmh, ok
<ocm geek> There are two dictionaries.
<ocm geek> The system dictionary has 256 entries addressed by smallints between 0 and 255.
<ocm geek> What you see here is bytecode blobs stored into the system dictionary.
<inquirer> cool
<inquirer> for now that would be enough if you want to leave me now ;)
<inquirer> I'll go to bed soon, too
<ocm geek> So that's a way of exporting them to other OCM modules or perhaps even to salwrap
<inquirer> but some more info on this big picture would be cool to have in the wiki *hint hint*
<inquirer> yeah
<inquirer> makes total sense
<inquirer> it's kinda weird to have such a dynamic format
<ocm geek> Someone at some other point might decide to "call the bytecode in system dict at index 77"
<inquirer> right
<ocm geek> The system dict probably is quite fixed in purpose.
<ocm geek> There are magic entries near the end of the dictionary, for example 0xfd points to a blob that represents the jump table of the bytecode interpreter.
<ocm geek> Probably you won't encounter any access to it, unless you look at init.ocm.
<inquirer> sweet
<inquirer> this is really helpful
<ocm geek> The extension modules loaded in the bytecode part of init.ocm are hotpatching their byte code instructions into the jump table.
<ocm geek> But after init.ocm is done, the jump table is full, so no sense in accessing it.
<inquirer> yeah, well, I still don't know how addressing works in this system
<inquirer> but the opcode description may shed light on that
<ocm geek> What do you mean by "addressing"?
<inquirer> things like jumps
<inquirer> branching
<ocm geek> There are no jumps and branches in the bytecode.
<inquirer> oh cool
<ocm geek> Have you ever programmed PostScript?
<inquirer> that's easy ;)
<inquirer> nope
<inquirer> but I know fortran
<inquirer> long time ago though
<ocm geek> OK, but fortran does have jumps.
<ocm geek> This byte code is much more structured.
<inquirer> I meant forth
<inquirer> sorry
<inquirer> stack based, I forgot about that. makes sense now
<ocm geek> Ah. That's something completely different to fortran. I don't really know it, but it might have similar properties to PostScript or this byte code (both stack-based, too)
<inquirer> it's coming together now
<ocm geek> Be careful with Marten's CALL_IF instruction (0x33). That is a misnomer.
<ocm geek> It should be CALL_WHILE.
<inquirer> ok
<inquirer> I really have an itch to improve the output of the program, it's quite a mess
<inquirer> but I have to understand more first, and it might be a waste
<inquirer> thanks a lot, again
<ocm geek> Probably he didn't notice that CALL_IF is wrong. The idea is that CALL_WHILE returns to the CALL_WHILE instruction after running the code block, so it gets executed again and again, until top-of-stack is zero.
<inquirer> makes sense
<ocm geek> If the return address stored in the interpreter were the next instruction (as in CALL and CALL_IF_ELSE) it would really be CALL_IF.
<ocm geek> I also started a bytecode parsing program before I got in contact with Marten, but that is even rougher.
<ocm geek> I used Haskell for it.
<inquirer> there is a problem for me of course because netmd has some recursive decryption
<inquirer> seems difficult to me to make a static analysis here
<ocm geek> What do you mean by that?
<inquirer> as I said, the decrypted bytecode contains encrypted bytecode
<inquirer> so, you would need to recursively decrypt
<ocm geek> That seems to be standard practice in OCM bytecode modules, but Marten's dumper doesn't support it currently.
<inquirer> but for that you need to "run" the bytecode
<inquirer> yeah
<ocm geek> Marten's decoder "run"s the crypto setup instruction for the main block.
<inquirer> yeah
<ocm geek> So the code to do that is already there.
<inquirer> yup
<inquirer> I already modified it to decrypt it, but not all of it
<ocm geek> But I don't know how well-designed his code is, and how easily you could add decryption of sub-blocks.
<inquirer> it's modular enough
<ocm geek> Oh, you already started :)
<ocm geek> Nice.
<inquirer> yeah, but only very simple. I don't catch encryption that isn't at offset 0 of a bytecodeblock
<inquirer> there are some of those
<ocm geek> Probably they set up other stuff first.
<ocm geek> You might need to run that too.
<inquirer> crunching away at it very slowly
<inquirer> things like 30 80 04 07 02 6c 50 73
<ocm geek> Strange.
<ocm geek> 30 is "compare DWORDS for equality".
<ocm geek> Why would a subblock start with it?
<inquirer> well, we don't know how the block is used, do we?
<inquirer> at least not yet
<ocm geek> Yeah. Maybe it's not a bytecode containing block after all.
<inquirer> there are others exactly like that
<inquirer> but the encryption!
<inquirer> 026c5073
<inquirer> but yeah, maybe it's processed first
<ocm geek> Oh, you are right.
<inquirer> it's a different keyindex though than usual in netmd which is 6c50
<inquirer> so, who knows
<inquirer> I also got a block:
<inquirer> 30 80 02 04 08 20 00 20 02 04 08 20 80 00 02 03 00 80 20 02 01 00 etc
<inquirer> going on and on in that manner
<inquirer> with no 02xxyy73 in that block
<ocm geek> Argh, wait a moment.
<ocm geek> That might be ASN.1 encoded sequences.
<ocm geek> If you decode them, you get an array of bytecodes.
<ocm geek> 30 80 is the ASN tag for sequence of undetermined length.
<inquirer> good one
* inquirer makes a mental note not to forget ASN.1
<inquirer> cool
<ocm geek> So your 30 80 04 07 is a sequence, whose first element is a 7 byte blob.
<inquirer> so, it is code within a data structure
<ocm geek> the first 4 bytes are setting up cryptography, so just 3 bytes remaining.
<ocm geek> Exactly.
<inquirer> you rock
<ocm geek> That's the nice thing about languages where code blocks are first class data objects: You can put them into any data structure you like.
<ocm geek> OK. Good night for real now.
<inquirer> like lisp
<inquirer> gn!
<ocm geek> Another hint for the sequences: They are really ASN.1 (including the tags)
<ocm geek> Normally you have the bytecode followed by untagged ASN.1-like encoded data, but inside the sequence the real ASN.1 tags are used instead of bytecodes.
<ocm geek> That means specifically: All numbers (small numbers and arbitrary precision integers) are encoded with ASN.1 tag 2 (INTEGER).
<ocm geek> While the bytecode 02 is "16 bit constant", the ASN.1 tag 2 is length-prefixed arbitrary precision integer.
<inquirer> ok
<ocm geek> Byte blocks are encoded with ASN.1 tag 4 (OCTET STRING) that happens to coincide with bytecode 4.
<ocm geek> Nested sequences are encoded as ASN.1 sequences like the top sequence.
<ocm geek> BTW: your 30 80 02 04 08 20 00 20 02 04 08 20 80 00 02 03... constant you quoted looks like an array Sony's DES implementation uses.
<ocm geek> I will cross-check.
<inquirer> this nested encryption is starting to annoy me
<ocm geek> OK. It doesn't match any of the arrays in the HiMD Transfer Tool for Mac used for DES encryption.
<inquirer> does this ring a bell? 4e 20 1d 3f ...
<inquirer> when I try to disassemble it, I get garbage
<inquirer> but it is CALL'ed, and I don't know how that works
<ocm geek> Sorry. I have not yet looked at calling bytecode.
<ocm geek> But you are right. That doesn't seem like executable code.
<ocm geek> The code in the interpreter looks like it would execute it as is.
<ocm geek> Could you have something messed up with decryption?
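The tagged sequences discussed above (30 80 for a sequence of undetermined length, tag 2 for INTEGER, tag 4 for OCTET STRING, nested sequences allowed) can be walked with a small recursive parser. The following Python sketch makes several assumptions: lengths follow plain BER rules, an indefinite-length sequence ends with the two zero bytes that BER uses as end-of-contents, and integers are signed big-endian as in standard ASN.1. The names `read_length` and `parse_element` are made up, and the trailing "00 00" in the sample is added here because the fragment quoted in the chat is cut off with "etc".

```python
def read_length(data, pos):
    """Return (length or None for indefinite, next_pos)."""
    first = data[pos]
    if first == 0x80:
        return None, pos + 1                   # indefinite length
    if first < 0x80:
        return first, pos + 1                  # short form
    count = first & 0x7F                       # long form
    return int.from_bytes(data[pos + 1:pos + 1 + count], "big"), pos + 1 + count


def parse_element(data, pos=0):
    """Parse one tagged element at pos; return (value, next_pos)."""
    tag = data[pos]
    length, pos = read_length(data, pos + 1)
    if tag == 0x30:                            # SEQUENCE, possibly nested
        items = []
        end = None if length is None else pos + length
        while True:
            if end is None and data[pos:pos + 2] == b"\x00\x00":
                return items, pos + 2          # assumed end-of-contents marker
            if end is not None and pos >= end:
                return items, pos
            item, pos = parse_element(data, pos)
            items.append(item)
    if length is None:
        raise ValueError("indefinite length only expected on sequences")
    body = data[pos:pos + length]
    if tag == 0x02:                            # INTEGER, arbitrary precision
        return int.from_bytes(body, "big", signed=True), pos + length
    if tag == 0x04:                            # OCTET STRING (byte block)
        return body, pos + length
    raise ValueError(f"unhandled tag {tag:#04x}")


# Fragment quoted in the chat, plus an assumed 00 00 terminator.
sample = bytes.fromhex(
    "30 80 02 04 08 20 00 20 02 04 08 20 80 00 02 03 00 80 20 02 01 00 00 00"
)
items, end = parse_element(sample)
print([hex(i) for i in items])
```

An OCTET STRING element comes back as raw bytes, so a block such as the 7-byte blob in "30 80 04 07 ..." would still need its crypto setup handled before it can be read as bytecode.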