Modified UTF-8 (MUTF-8) #42

Offroaders123 · 2023-12-15T11:01:36Z

Tonight, we (Minecraft Manipulator) were discussing the NBT spec's use of the Java version of UTF-8, which is called MUTF-8, or Modified UTF-8. I have been meaning to look into the details about this and how it relates to NBT, so I'm glad that it came up in conversation again!

Found this JS implementation, which I will try to use as inspiration for adding MUTF-8 support to NBTify, as it is essential that strings serialize the same way they do in the base game.

mutf-8 - GitHub
What does it mean to say "Java Modified UTF-8 Encoding"? - Stack Overflow
Modified UTF-8 - Wikipedia
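
For context, here is a minimal sketch of where Modified UTF-8 departs from standard UTF-8. It differs in two places: U+0000 is written as the two bytes `0xC0 0x80` rather than a single `0x00`, and characters outside the Basic Multilingual Plane are written as two 3-byte sequences (one per UTF-16 surrogate) rather than one 4-byte sequence. The function name `encodeMUTF8` is made up for illustration. It is not the API of the `mutf-8` package or of NBTify, and it skips the length prefix that NBT puts in front of each string.

```typescript
// Hypothetical helper: encode a JavaScript string as Modified UTF-8 bytes.
function encodeMUTF8(text: string): Uint8Array {
  const bytes: number[] = [];
  for (let i = 0; i < text.length; i++) {
    // Work on UTF-16 code units, so surrogate halves are NOT combined.
    const unit = text.charCodeAt(i);
    if (unit >= 0x0001 && unit <= 0x007f) {
      // Plain ASCII (except NUL) is a single byte, same as UTF-8.
      bytes.push(unit);
    } else if (unit <= 0x07ff) {
      // Two-byte form. This branch also covers U+0000, giving 0xC0 0x80.
      bytes.push(0xc0 | (unit >> 6), 0x80 | (unit & 0x3f));
    } else {
      // Three-byte form, used for the rest of the BMP and for each surrogate half.
      bytes.push(
        0xe0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3f),
        0x80 | (unit & 0x3f)
      );
    }
  }
  return Uint8Array.from(bytes);
}

// "A" is identical in both encodings, NUL and emoji are not.
console.log(encodeMUTF8("A"));    // 1 byte
console.log(encodeMUTF8("\0"));   // 2 bytes instead of 1
console.log(encodeMUTF8("😀"));   // 6 bytes instead of 4
```

Strings made up only of non-NUL BMP characters come out byte-for-byte the same in both encodings, which is why the difference is easy to miss.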

Made a huge ton of progress on parsing values from the DB! Have been working all day pretty much, so I just kept going with it, into the night as well. Already had to charge my computer again, it was 100% this morning, turn off on me when I forgot it was so low, that was around 9. Yeah so I went bonkers with this one, a little out the window.

A bunch of stuff has changed. I basically went all in on associating what values should be parsed as, with the specific type it's stored with. That was what I realized today I think, that it's not too much different of a parsing algorithm structure than to that of `NBTReader.readTag()`, where you can do a `switch` against all of the `case`s to validate/verify against, and handle each one specific to the criteria it needs. Then you return those results out of that discerner function.

Next I can organize things a bit from here, to make it more tangible and organized, and also add return types, as I want to narrow the surface area of where types are used and defined, which has helped a lot with my code as of lately. If you check it in more places, the errors can't leak out as far because you trapped them in the scope of the function block you are writing.

I got parsing for most of the different types added, I'm very happy to learn that a lot of them are just plain NBT data, so it's actually not too bad of a deal to parse each one. It's just how they are stored in the database is what you have to refactor to look and work a bit easier on your own end. The ones that still return plain `Buffer` values are the ones I still need to figure out how to parse. It's more a matter of just looking at the docs and figuring it out. It was more of a bonus that so many of the other files just use NBT, yeah that was a good thing.

So diffing was less of a concern for this one, it pretty much just rehashed the whole project all over, so I wasn't too concerned about the different stages and everything.

I also haven't completed the parsing for the `SuffixKey`-based values, hence why I keep around the `key` property when `readKey()` handles it. That is a placeholder just so I can debug the data when I come back to work on this again.

https://minecraft.wiki/w/Bedrock_Edition_level_format#Chunk_key_format
https://minecraft.wiki/w/Bedrock_Edition_level_format/Other_data_format
https://stackoverflow.com/questions/75108373/how-to-check-if-a-node-js-buffer-contains-valid-utf-8 (Could use this, didn't quite apply here but is nice to know about)
https://wiki.vg/Bedrock_Edition_level_format
https://i.imgur.com/5ljYxry_d.webp
https://github.com/mmccoo/minecraft_mmccoo/blob/master/parse_bedrock.cpp
https://learn.microsoft.com/en-us/minecraft/creator/documents/actorstorage
https://www.reddit.com/r/DevinTownsend/comments/b7i98h/singularity_one_of_dts_best/ (YES)

Tonight, we (MM) were discussing the NBT spec's use of the Java UTF-8 version, which is called MUTF-8, or Modified UTF-8. Found this JS implementation, which I will try to use for inspiration to add support for NBTify, as it is inherent that strings will serialize correctly in the same fashion to how they do in the base game as well.

https://github.com/sciencesakura/mutf-8
https://stackoverflow.com/questions/7921016/what-does-it-mean-to-say-java-modified-utf-8-encoding
https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
Offroaders123/NBTify#42 (NBTify hook :P)

This is just a temporary thing to test things out. I need to add another test which includes MUTF-8-specific strings in an NBT file, because right now all of my tests pass either way, even though NBTify wasn't MUTF-8 compliant, meaning that its serialization isn't 1:1 with what the NBT spec expects.

This part is temporary because I'm debating whether I should bundle `mutf-8` into NBTify and start building it for the browser, rather than going the no-dependency route where I only transpile with TS. If I bundle for the browser (ESM), then I can use regular dependencies in Node, and put everything together for use elsewhere. If I do that, maybe I will rethink CJS support as well, since it's not that hard to simply enable nowadays. I've wanted to try building/transpiling with plain ESBuild for some time now as well, since it will allow for a minified output for the web, meaning you can still get tree-shaking when installing locally and using with bundlers, but you can use it feature-complete in the browser as well.

https://wiki.vg/NBT#Specification
#42
sciencesakura/mutf-8#17
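
On the point about tests passing either way: a test string only exercises MUTF-8 handling if it contains U+0000 or a character outside the Basic Multilingual Plane (stored in JavaScript as a surrogate pair). Any other string encodes to the same bytes under both schemes. A small check like the one below could help pick fixture strings. `differsUnderMUTF8` is a hypothetical name for illustration and is not part of NBTify.

```typescript
// Hypothetical helper: returns true when a string would encode differently
// under Modified UTF-8 than under standard UTF-8.
function differsUnderMUTF8(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    // U+0000 becomes 0xC0 0x80 in MUTF-8 instead of a single 0x00 byte.
    if (unit === 0x0000) return true;
    // Surrogate code units are each encoded as their own 3-byte sequence in MUTF-8.
    if (unit >= 0xd800 && unit <= 0xdfff) return true;
  }
  return false;
}

console.log(differsUnderMUTF8("Hello, world")); // false
console.log(differsUnderMUTF8("Book 😀"));       // true
```

A fixture that only contains strings for which this returns `false` cannot tell a MUTF-8 writer apart from a UTF-8 one.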

- Binary Empty ListTag Type Persistence #40
- Runtime `typeof` Validation, now type-validated in TS too! Essentially, my API type validation code is now type-checked as well, since I want to ensure the runtime checks properly match that of their TS types/shapes. This is done for function parameters, and things like that, where NBTify is designed to throw when passed in incorrect values or primitive types.
- Options object spreading / defining. This moves things away from destructuring values of API options objects in the function definition into the body itself, in favor of allowing the options to be passed in as a single object again in other places in the body, rather than having to build the same options object again later (restructuring? lol, hehe). This helps aid in not forgetting to add newly added config options to the restructuring call, because I had actually accidentally done this before, since TS won't catch it because the config objects allow config types to be optional. Now you would have to explicitly set it to `undefined` if you didn't want the value to be passed in from the parent options parameter. Here's a better demo for this crazy talk:

```ts
export interface Hi {
  nice: number;
  haha: number;
}

export interface HiOptions extends Partial<Hi> {}

export function hiLegacy({ nice, haha }: HiOptions = {}): Hi {
  if (nice === undefined){
    return hi({ nice: 10, haha });
  }
  nice satisfies Hi["nice"];

  if (haha === undefined){
    // forgot `nice`, no errors, eek!
    return hi({ haha: 25 });
  }
  haha satisfies Hi["haha"];

  return { nice, haha };
}

export function hi(options: HiOptions = {}): Hi {
  const { nice, haha } = options;

  if (nice === undefined){
    return hi({ ...options, nice: 10 });
  }
  nice satisfies Hi["nice"];

  if (haha === undefined){
    return hi({ ...options, haha: 25 });
  }
  haha satisfies Hi["haha"];

  return { nice, haha };
}
```

- RootName + Bedrock Level Primitive
- NBTReader/NBTWriter State Handling

Another thing tested during this update period was MUTF-8 support, but that's not quite solved out yet, so I'm going to hold it forward until another release.

#42

Clouseaupolice...!

@ts-check

This was an interesting one! Thankfully I found an issue page about MUTF-8 handling on the repo for Twoolie/NBT, the Python project. It gave me some insight and a file to test against. I wrote my own script to slim it down a bunch, and dedupe the tags that are used multiple times. It's crazy how big just book text can get!

I used this actual version of NBTify in this commit to write the new content to the file. That's also why I diffed it out: I wanted to make sure that when I slimmed it down, the content coming out of it was actually what it was supposed to be. When using older NBTify, it didn't work correctly, because MUTF-8 handles things differently than standard UTF-8.

```js
// @ts-check

import { readFile, writeFile } from "node:fs/promises";
import * as NBT from "./NBTify/src/index.ts";

const data = await readFile("./hotbar.nbt");
const trimmed = data.subarray(0x000BAE96, 0x000CA7C2);
console.log(trimmed);

/** @type {NBT.NBTData<any>} */
const hotbar = await NBT.read(data);

const book = hotbar.data[0][1].tag.BlockEntityTag.Items[12];
console.log(book);

const mutf8Demo = await NBT.write(book);
console.log(mutf8Demo);

const demoDiff = mutf8Demo.subarray(1, -2);
console.log(Buffer.compare(trimmed, demoDiff));

await writeFile("./alien-book.nbt", mutf8Demo);
```

#42
#44
twoolie/NBT#144 (comment)
twoolie/NBT#144

I'm still not sure if I'm going to use the dependency itself or if I should just embed it into NBTify on its own. I think I may just use it as a dependency, as I've been trying to get more used to not reinventing the wheel for everything, unless that has benefits. The MUTF-8 library already does everything I need it to, and it's ESM TypeScript, so I'm not sure what other reason I have to not just use it, it's great! Eventually I want to move my compression handling into a separate module too, so I will have to use module resolution for that down the road either way. I say heck to it! Let's do it :)

Gonna look into if there's anything I'm forgetting before doing that, though. I really like having the ability to use projects like these (NBTify) without needing a transpilation or build step. Modern CDNs seem to handle this nicely, so we'll see.

Offroaders123 · 2024-05-14T10:50:28Z

I think we're nearly there! I'm leaning towards getting used to adding dependencies to the project, which I haven't done yet for modules I share. It was new to me initially to do that with my apps, but now I think I'm going to start doing it with my npm packages too.

Offroaders123 · 2024-09-07T05:11:52Z

Realized I hadn't closed this, it's now in the stable build on npm (I think it was sometime last week I published it), and the CDN is linked properly. We made it!! 🙂 🚀

Offroaders123 · 2024-09-13T22:09:12Z

Reopening this, as I recently found out that only Java Edition NBT uses MUTF-8 encoding for strings, with LCE (Legacy Console Edition) and Bedrock using plain UTF-8 instead.

This makes the format a bit harder to deduce automatically, especially since Java and LCE both use big endian, so endianness alone can't tell those two apart (endianness itself is easy to detect because of the errors that occur when trying to read one as the other).
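
One possible heuristic, sketched here under assumptions and not something NBTify is confirmed to do, is to look at the raw bytes of a string payload for sequences that are only legal in one of the two encodings. A 4-byte lead byte (`0xF0` to `0xF4`) or a raw `0x00` never appears in MUTF-8, while `0xC0 0x80` and an encoded surrogate (`0xED` followed by `0xA0` to `0xBF`) are invalid in standard UTF-8. The names `guessStringEncoding` and `StringEncodingHint` are hypothetical.

```typescript
// Hypothetical result type: which encoding the payload bytes point to, if any.
type StringEncodingHint = "mutf-8" | "utf-8" | "ambiguous";

// Hypothetical helper: inspect the bytes of ONE string payload (without its
// length prefix) for sequences that are only legal in one of the encodings.
function guessStringEncoding(bytes: Uint8Array): StringEncodingHint {
  for (let i = 0; i < bytes.length; i++) {
    const byte = bytes[i]!;
    const next = bytes[i + 1];
    // 4-byte sequences only exist in standard UTF-8.
    if (byte >= 0xf0 && byte <= 0xf4) return "utf-8";
    // MUTF-8 never emits a raw NUL byte.
    if (byte === 0x00) return "utf-8";
    // Overlong NUL is a MUTF-8 signature and is invalid in standard UTF-8.
    if (byte === 0xc0 && next === 0x80) return "mutf-8";
    // Encoded surrogate halves are a MUTF-8 signature as well.
    if (byte === 0xed && next !== undefined && next >= 0xa0 && next <= 0xbf) {
      return "mutf-8";
    }
  }
  // Everything else is byte-identical in both encodings.
  return "ambiguous";
}

console.log(guessStringEncoding(Uint8Array.of(0xf0, 0x9f, 0x98, 0x80)));             // "utf-8"
console.log(guessStringEncoding(Uint8Array.of(0xed, 0xa0, 0xbd, 0xed, 0xb8, 0x80))); // "mutf-8"
console.log(guessStringEncoding(Uint8Array.of(0x48, 0x69)));                         // "ambiguous"
```

The catch is that most files will come back `"ambiguous"`, since typical strings never hit either special case, so a check like this can only confirm a guess, not replace an explicit option.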

Offroaders123 added the bug label Dec 15, 2023

Offroaders123 self-assigned this Dec 15, 2023

Offroaders123 mentioned this issue May 15, 2024

AssemblyScript WASM Build sciencesakura/mutf-8#23

Open

Offroaders123 closed this as completed Sep 7, 2024

Offroaders123 reopened this Sep 13, 2024
