I started documenting the bytecode opcodes in our wiki. nice I am still banging my head against the netmd blocks actually, not still, I just picked it up again https://wiki.physik.fu-berlin.de/linux-minidisc/doku.php?id=ocmbytecode where/how are the entry points found? The main entry point is at offset 10 in the ocm file. Marten's code should take care of that. Native modules are loaded with opcode 0x75. Their entry point is stored in their header. They are addressed by name, usually. The name is also stored in the header. yeah, I think netmd is a bad one to start with, no names to be found which one is less cryptic? Bytecode blocks can be loaded as global code with opcode 0x7C If you take apart init.ocm, that's a special .ocm file, you find the native modules making up the bytecode interpreter. init.ocm is made of a special loader bytecode. I don't know whether any of Marten's tools can parse that, but that doesn't really matter: You can do it by hand. ah there is a ton of ocm files in OpenMG I know. The only one I looked at yet is init.ocm I only looked at the ones in the ocm tar ball so far First byte of init.ocm is "04" which means a blob follows. ah, sonicstage 3.4 doesn't have 0301 format Does it have 0303? As I said, init.ocm is special! netmd.ocm starts with 0301 in my SonicStage 3 (I think 3.4) installation. right Let me finish explaining the special init.ocm format. sure The blob following the 04 is preceded by its length. 83 means that the blob is longer than 127 bytes, and the length itself takes 3 bytes. ASN.1 The following bytes are 066e29 in my installation. Yeah, right. The OCM stuff uses an ASN.1-like serialisation format. And after the length, voila, you find 0301! Because that blob is a standard 0301 bytecode blob. After the length I have 03 01 07 00 yeah ah my length is different, but that's fin fine Past that block, I find 04 82 9c 91 This indicates a second blob, this time length 9c91 That block is a native code module, name "intrins". 
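Editor's sketch of the length prefix discussed above. It assumes the blob header follows the usual ASN.1 BER length rules (short form below 0x80, otherwise the low 7 bits give the number of length bytes), which matches the two examples quoted in the log. The function names `read_length` and `read_blob_header` are made up for illustration and do not come from Marten's tools.

```python
def read_length(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a BER-style length at data[pos].

    Returns (length, position of the first byte after the length field).
    """
    first = data[pos]
    if first < 0x80:
        # Short form: the byte itself is the length (0..127).
        return first, pos + 1
    count = first & 0x7F  # long form: number of length bytes that follow
    if count == 0:
        raise ValueError("indefinite length is not valid for a blob here")
    length = int.from_bytes(data[pos + 1:pos + 1 + count], "big")
    return length, pos + 1 + count


def read_blob_header(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Expect the blob marker 04 and return (blob length, offset of blob body)."""
    if data[pos] != 0x04:
        raise ValueError(f"expected blob marker 04, got {data[pos]:02x}")
    return read_length(data, pos + 1)


# The two headers quoted in the log:
print(read_blob_header(bytes.fromhex("0483066e29")))  # length 0x66e29, body at offset 5
print(read_blob_header(bytes.fromhex("04829c91")))    # length 0x9c91, body at offset 4
```

With a real init.ocm, the bytes at the returned body offset should be the 03 01 signature mentioned above.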
It contains the bytecode interpreter core. After the second blob there are just two further bytes 06 and 08. I am not sure how I know the length of a 03 01 block You find it beforehand. oh If your init.ocm is the same as mine, it's 0x66e29 right, the file is longer than that this is wiki stuff :) The 06 equals the bytecode 75 and loads the last blob as native code module. my file is different (bytecode length is 07e834), but same layout incl the 06 08 at the end The 08 equals the bytecode 77 and calls the startup function of that module. As the 0301 blob is still on the stack of the boot interpreter, this blob gets passed as parameter to the startup function. As the startup function of "intrins" is the bytecode interpreter, it interprets the big bytecode blob. This big bytecode blob contains further native modules. this is because 04 is ipush_str4 I see Be careful. Boot bytecode is not completely equivalent to standard bytecode. It happens to be the same for opcodes 01..04 ah yes the 06 08 I just got lucky here with 04 Probably that's on purpose, because the codes 01 to 04 are used for serialization of internal values. at this stage in the game, is it still useful for me to get my own disassembled bytecode interpreter? Probably not. I have it, and am currently transferring that knowledge to the Wiki. good I guess I will focus on the netmd then I just pointed you to the init.ocm because you asked for something that might be less cryptic. for the wiki, could you keep the mnemonic in the title? ie. Opcode 02: Immediate BigInt (ipush_str4) no big deal of course yep, great stuff with the init it will help me to recognize the extension format header I didn't really look at Marten's opcode names, but I can put them in. they can be added later of course the doc is more important ;) You can compare with Marten's scanner.c But beware. The comments indicating indices are *decimal*, while everything I write is *hexadecimal*. 
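Editor's sketch of the init.ocm boot sequence walked through above (blob, blob, 06, 08). It only models the three boot codes named in the log. Everything else is an assumption: the event names, the idea that 06 pops the module blob off the boot stack, and the tiny synthetic input, which is not real file content.

```python
def read_length(data: bytes, pos: int) -> tuple[int, int]:
    """BER-style length: returns (length, offset just past the length field)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1
    count = first & 0x7F
    return int.from_bytes(data[pos + 1:pos + 1 + count], "big"), pos + 1 + count


def scan_boot_code(data: bytes) -> list:
    """Walk boot bytecode and report what each step would do."""
    events = []
    stack = []  # blobs pushed so far
    pos = 0
    while pos < len(data):
        op = data[pos]
        if op == 0x04:  # push the blob that follows
            length, body = read_length(data, pos + 1)
            stack.append(data[body:body + length])
            events.append(("push_blob", length))
            pos = body + length
        elif op == 0x06:  # load the last blob as a native code module
            module = stack.pop()  # assumption: the module blob is consumed
            events.append(("load_native", len(module)))
            pos += 1
        elif op == 0x08:  # call the module's startup function
            # Whatever is still on the stack is what the startup function sees.
            events.append(("call_startup", [len(blob) for blob in stack]))
            pos += 1
        else:
            raise ValueError(f"boot code {op:02x} is not covered by this sketch")
    return events


# Synthetic stand-in: a 4-byte "bytecode" blob, a 2-byte "native" blob, then 06 08.
demo = bytes.fromhex("0404 03010700 0402 aabb 06 08".replace(" ", ""))
for event in scan_boot_code(demo):
    print(event)
```

Run against a real init.ocm, this should report two blobs followed by the load and the call, with the 0301 blob left as the startup argument.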
I realized that, so I switched to opcodes.h ha, I should have looked at the perl unpack syntax earlier Yeah. opcodes.h is hexadecimal. But some mnemonics seem to not match my findings. For example "allocMem" in Marten's code is "Store to User Dictionary" according to my analysis. Might be that Sony changed the meaning of the code, or one of us is wrong. yeah ok, codeblockparsed the binary blob how do you invoke gas? can I use i596-mingw32msvc-as? i586... yes. I used that one. [...] anything better than as' ing the codeblockparser output into a COFF executable? i guess it doesn't actually matter what format the asm is wrapped in Yeah. Must be a format IDA is able to read. And IDA Freeware only reads COFF objects. (and Windows/DOS EXE files and drivers) cool You might want to try loading the object at a different offset than 0, to help IDA distinguish offsets from numbers. Somehow IDA is unable to know that with objects, *every* offset is tagged as offset. In completely linked executables without reloc info, it is not that ease. easy. You will need info about the import functions to make sense of it. Just a second... sonicstage 3.4 netmd.ocm is half as big as the one from maarten https://wiki.physik.fu-berlin.de/linux-minidisc/doku.php?id=ocmsalwrapexports Maarten had a much older sonic stage. Maybe they moved parts out of netmd.ocm into standard DLLs. Or they have rewritten parts from bytecode to native code. yeah It was known that the OpenMG virtualization/crypto stuff was very heavy on processing power in early sonic stage versions. it's a bit scary how much you know about this VM I talked to Marten on MSN. ah, ok :) And I reversed salwrap.dll myself. Not that I ever got once completely through it. at least it starts making sense I will go to sleep now. mh ok See you tomorrow. or do you have 3 minutes for me? OK. let's see if this is something simple between the bytecodeblocks there are 63xx0f instructions what's their significance? 
it's always 66 BIGLENGTH BYTECODE... 63 XX 0F and again 66... Ah. I see. 63 is bipush with encrypted operand But Marten's decoder already decrypts it for you. As 66 is just ipush_str4 with encrypted operand, that Marten's decoder decodes. yep 0F is store to dictionary. so, it keeps pushing stuff store to dictionary pops it. I think I am missing the big picture here, is this for constructing a symbol table or something? Every instruction pops the operands it used. Kind of. mmmh, ok There are two dictionaries. The system dictionary has 256 entries addressed by smallints between 0 and 255. What you see here is bytecode blobs stored into the system dictionary. cool for now that would be enough if you want to leave me now ;) I'll go to bed soon, too So that's a way of exporting them to other OCM modules or perhaps even to salwrap but some more info on this big picture would be cool to have in the wiki *hint hint* yeah makes total sense it's kinda weird to have such a dynamic format Someone at some other point might decide to "call the bytecode in system dict at index 77" right The system dict probably is quite fixed in purpose. There are magic entries near the end of the dictionary, for example 0xfd points to a blob that represents the jump table of the bytecode interpreter. Probably you won't encounter any access to it, unless you look at init.ocm. sweet this is really helpful The extension modules loaded in the bytecode part of init.ocm are hotpatching their byte code instructions into the jump table. But after init.ocm is done, the jump table is full, so no sense in accessing it. yeah, well, I still don't know how addressing works in this system but the opcode description may shed light on that What do you mean by "addressing"? things like jumps branching There are no jumps and branches in the bytecode. oh cool Have you ever programmed PostScript? that's easy ;) nope but I know fortran long time ago though OK, but fortran does have jumps. 
This byte code is much more structured. I meant forth sorry stack based, I forgot about that. makes sense now Ah. That's something completely different to fortran. I don't really know it, but it might have similar properties to PostScript or this byte code (both stack-based, too) it's coming together now Be careful with Marten's CALL_IF instruction (0x33). That is a misnomer. It should be CALL_WHILE. ok I really have an itch to improve the output of the program, it's quite a mess but I have to understand more first, and it might be a waste thanks a lot, again Probably he didn't notice that CALL_IF is wrong. The idea is that CALL_WHILE returns to the CALL_WHILE instruction after running the code block, so it gets executed again and again, until top-of-stack is zero. makes sense If the return address stored in the interpreter would be the next instruction (as in CALL and CALL_IF_ELSE) it would really be CALL_IF. I also have started a bytecode parsing program before I got in contact with Marten, but that is even more rough. I used Haskell for it. there is a problem for me of course because netmd has some recursive decryption seems difficult to me to make a static analysis here What do you mean by that? as I said, the decrypted bytecode contains encrypted bytecode so, you would need to recursively decrypt That seems to be standard practice in OCM bytecode modules, but Marten's dumper doesn't support it currently. but for that you need to "run" the bytecode yeah Marten's decoder "run"s the crypto setup instruction for the main block. yeah So the code to do that is already there. yup I already modified it to decrypt it, but not all of it But I don't know how well-designed his code is, and how easy you could add decryption of sub-blocks. it's modular enough Oh, you already started :) Nice. yeah, but only very simple. I don't catch encryption that isn't at offset 0 of a bytecodeblock there are some of those Probably they set up other stuff first. 
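Editor's sketch of the CALL_WHILE point made earlier in this part of the log: the only difference between a "call if" and a "call while" is the return address the interpreter stores. This is a toy model, not the real interpreter. It assumes the instruction pops its condition from the stack (in line with "every instruction pops the operands it used") and that the code block is responsible for leaving a fresh condition behind. `run_conditional_call` and `countdown` are invented names.

```python
def run_conditional_call(block, stack: list, return_to_self: bool) -> int:
    """Execute one conditional-call instruction sitting at pc 0.

    block:          callable that manipulates the stack (stands in for a code block)
    return_to_self: True models CALL_WHILE (return address is the instruction itself),
                    False models a genuine CALL_IF (return address is the next instruction)
    Returns how many times the block ran.
    """
    pc = 0
    runs = 0
    while pc == 0:
        condition = stack.pop()  # the instruction consumes its operand
        if condition == 0:
            pc = 1  # fall through to the next instruction
        else:
            block(stack)
            runs += 1
            pc = 0 if return_to_self else 1  # the stored return address
    return runs


def countdown(stack: list) -> None:
    """Toy block: decrement the counter and leave a copy as the next condition."""
    n = stack.pop() - 1
    stack.append(n)  # the counter itself
    stack.append(n)  # the condition for the next test


loop_stack = [3, 3]  # counter, then condition on top
print(run_conditional_call(countdown, loop_stack, return_to_self=True), loop_stack)

once_stack = [3, 3]
print(run_conditional_call(countdown, once_stack, return_to_self=False), once_stack)
```

In the first call the block runs three times and the loop ends once the condition reaches zero; in the second it runs exactly once, which is what the name CALL_IF would suggest.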
You might need to run that too. crunching away at it very slowly things like 30 80 04 07 02 6c 50 73 Strange. 30 is "compare DWORDS for equality". Why would a subblock start with it? well, we don't know how the block is used, do we? at least not yet Yeah. Maybe it's not a bytecode containing block after all. there are others exactly like that but the encryption! 026c5073 but yeah, maybe it's processed first Oh, you are right. it's a different keyindex though than usual in netmd which is 6c50 so, who knows I also got a block: 30 80 02 04 08 20 00 20 02 04 08 20 80 00 02 03 00 80 20 02 01 00 etc going on and on in that manner with no 02xxyy73 in that block Argh, wait a moment. That might be ASN.1 encoded sequences. If you decode them, you get an array of bytecodes. 30 80 is the ASN tag for sequence of indefinite length. good one * inquirer makes a mental note not to forget ASN.1 cool So your 30 80 04 07 is a sequence, whose first element is a 7 byte blob. so, it is code within a data structure the first 4 bytes are setting up cryptography, so just 3 bytes remaining. Exactly. you rock That's the nice thing about languages where code blocks are first class data objects: You can put them into any data structure you like. OK. Good night for real now. like lisp gn! Another hint for the sequences: They are really ASN.1 (including the tags) While normally, you have the bytecode and then untagged ASN.1-like encoded data, inside the sequence instead of bytecodes the real ASN.1 tags are used. That means specifically: All numbers (small numbers and arbitrary precision integers) are encoded with ASN.1 tag 2 (INTEGER). While the bytecode 02 is "16 bit constant", the ASN.1 tag 2 is length-prefixed arbitrary precision integer. ok Byte blocks are encoded with ASN.1 tag 4 (OCTET STRING) that happens to coincide with bytecode 4. Nested sequences are encoded as ASN.1 sequences like the top sequence. BTW: your 30 80 02 04 08 20 00 20 02 04 08 20 80 00 02 03... 
constant you quoted looks like an array Sony's DES implementation uses. I will cross-check. this nested encryption is starting to annoy me OK. It doesn't match any of the arrays in the HiMD Transfer Tool for Mac used for DES encryption. does this ring a bell? 4e 20 1d 3f ... when I try to disassemble it, I get garbage but it is CALL'ed, and I don't know how that works Sorry. I did not yet look at calling bytecode. But you are right. That doesn't seem like executable code. The code in the interpreter looks like it would execute it as is. Could you have something messed up with decryption?
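Editor's sketch of the sequence encoding described earlier in the log: 30 for a sequence (30 80 meaning indefinite length), tag 02 for integers, tag 04 for byte blocks, with nesting allowed. It assumes plain BER rules, including the 00 00 end marker for indefinite-length sequences, which the log does not confirm for OCM files. Because the excerpts quoted in the chat are cut off, the decoder also stops quietly at the end of the input. `decode_sequence` and its helpers are invented names.

```python
def read_length(data: bytes, pos: int):
    """Return (length or None for indefinite, offset just past the length field)."""
    first = data[pos]
    if first == 0x80:
        return None, pos + 1
    if first < 0x80:
        return first, pos + 1
    count = first & 0x7F
    return int.from_bytes(data[pos + 1:pos + 1 + count], "big"), pos + 1 + count


def decode_items(data: bytes, pos: int, end):
    """Decode elements from pos up to end, or up to a 00 00 marker if end is None."""
    items = []
    while pos < len(data) and (end is None or pos < end):
        if end is None and data[pos:pos + 2] == b"\x00\x00":
            return items, pos + 2  # end-of-contents marker
        tag = data[pos]
        length, body = read_length(data, pos + 1)
        if tag == 0x30:  # nested SEQUENCE
            sub_end = None if length is None else body + length
            value, pos = decode_items(data, body, sub_end)
        elif length is None:
            raise ValueError("indefinite length only makes sense for sequences")
        elif tag == 0x02:  # INTEGER, two's complement, arbitrary precision
            value = int.from_bytes(data[body:body + length], "big", signed=True)
            pos = body + length
        elif tag == 0x04:  # OCTET STRING, e.g. a bytecode block
            value = data[body:body + length]
            pos = body + length
        else:
            raise ValueError(f"tag {tag:02x} is not covered by this sketch")
        items.append(value)
    return items, pos


def decode_sequence(data: bytes) -> list:
    """Decode a buffer that starts with a sequence and return its elements."""
    items, _ = decode_items(data, 0, len(data))
    return items[0]


# The integer block quoted in the log (truncated there, so no end marker):
quoted = bytes.fromhex("30800204082000200204082080000203008020020100")
print([hex(n) for n in decode_sequence(quoted)])
```

The same function applied to a block beginning 30 80 04 07 02 6c 50 73 would return the 7-byte blob as a `bytes` element, ready to be handed to the decryption step.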