Modified UTF-8 (MUTF-8) #42

Offroaders123 opened this issue Dec 15, 2023 · 3 comments

@Offroaders123
Owner

Tonight, we (Minecraft Manipulator) were discussing the NBT spec's use of the Java version of UTF-8, which is called MUTF-8, or Modified UTF-8. I have been meaning to look into the details about this and how it relates to NBT, so I'm glad that it came up in conversation again!

I found this JS implementation, which I'll use as inspiration for adding MUTF-8 support to NBTify, since it's essential that strings serialize the same way they do in the base game.

mutf-8 - GitHub
What does it mean to say "Java Modified UTF-8 Encoding"? - Stack Overflow
Modified UTF-8 - Wikipedia
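
For reference, here's a minimal sketch of where MUTF-8 and standard UTF-8 actually differ. `encodeMUTF8()` and `hex()` are hypothetical helpers written for this comment, not the API of NBTify or the `mutf-8` package. The two differences it shows are that U+0000 is written as the two bytes `C0 80`, and that characters outside the BMP are written as two 3-byte surrogates rather than one 4-byte sequence.

```ts
// Hypothetical helper for illustration; not the API of NBTify or the mutf-8 package.
export function encodeMUTF8(value: string): Uint8Array {
  const bytes: number[] = [];

  // Iterate UTF-16 code units (not code points), so surrogate pairs stay split.
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);

    if (unit >= 0x0001 && unit <= 0x007F) {
      // One byte, identical to standard UTF-8.
      bytes.push(unit);
    } else if (unit <= 0x07FF) {
      // Two bytes. This branch also covers U+0000, which becomes C0 80.
      bytes.push(0xC0 | (unit >> 6), 0x80 | (unit & 0x3F));
    } else {
      // Three bytes. Each surrogate half is encoded separately (6 bytes per pair).
      bytes.push(0xE0 | (unit >> 12), 0x80 | ((unit >> 6) & 0x3F), 0x80 | (unit & 0x3F));
    }
  }

  return Uint8Array.from(bytes);
}

// Formats bytes as space-separated lowercase hex pairs.
const hex = (bytes: Uint8Array): string =>
  Array.from(bytes, byte => byte.toString(16).padStart(2, "0")).join(" ");

console.log(hex(encodeMUTF8("\0")));                     // c0 80
console.log(hex(new TextEncoder().encode("\0")));        // 00
console.log(hex(encodeMUTF8("\u{1F600}")));              // ed a0 bd ed b8 80
console.log(hex(new TextEncoder().encode("\u{1F600}"))); // f0 9f 98 80
```

Everything else (ASCII and the rest of the BMP) comes out byte-for-byte the same in both encodings.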

@Offroaders123 Offroaders123 added the bug Something isn't working label Dec 15, 2023
@Offroaders123 Offroaders123 self-assigned this Dec 15, 2023
Offroaders123 added a commit to Offroaders123/Bedrock-LevelDB that referenced this issue Dec 15, 2023
Made a huge ton of progress on parsing values from the DB! Have been working pretty much all day, so I just kept going with it into the night as well. Already had to charge my computer again: it was at 100% this morning, and it turned off on me around 9 when I forgot it was so low.

Yeah, so I went bonkers with this one, a little out the window. A bunch of stuff has changed. I basically went all in on associating what each value should be parsed as with the specific type it's stored under. What I realized today is that the parsing algorithm isn't structured much differently from `NBTReader.readTag()`: you `switch` over all of the `case`s to validate against, handle each one according to the criteria it needs, and return the result out of that discerner function. Next I can organize things a bit from here to make it more tangible, and also add return types, since I want to narrow the surface area of where types are used and defined, which has helped a lot with my code lately. If you check types in more places, errors can't leak out as far, because they get trapped in the scope of the function you're writing.

I got parsing added for most of the different types, and I'm very happy to learn that a lot of them are just plain NBT data, so it's actually not too bad to parse each one. The main work is reshaping how they're stored in the database into something that's easier to look at and work with on your own end. The ones that still return plain `Buffer` values are the ones I still need to figure out how to parse, which is mostly a matter of looking at the docs. It was a nice bonus that so many of the other files just use NBT.

So diffing was less of a concern for this one, it pretty much just rehashed the whole project all over, so I wasn't too concerned about the different stages and everything.

I also haven't completed the parsing for the `SuffixKey`-based values, hence why I keep around the `key` property when `readKey()` handles it. That is a placeholder just so I can debug the data when I come back to work on this again.

https://minecraft.wiki/w/Bedrock_Edition_level_format#Chunk_key_format
https://minecraft.wiki/w/Bedrock_Edition_level_format/Other_data_format
https://stackoverflow.com/questions/75108373/how-to-check-if-a-node-js-buffer-contains-valid-utf-8 (Could use this, didn't quite apply here but is nice to know about)
https://wiki.vg/Bedrock_Edition_level_format
https://i.imgur.com/5ljYxry_d.webp
https://github.com/mmccoo/minecraft_mmccoo/blob/master/parse_bedrock.cpp
https://learn.microsoft.com/en-us/minecraft/creator/documents/actorstorage
https://www.reddit.com/r/DevinTownsend/comments/b7i98h/singularity_one_of_dts_best/ (YES)

Tonight, we (MM) were discussing the NBT spec's use of Java's version of UTF-8, which is called MUTF-8, or Modified UTF-8.
Found this JS implementation, which I'll use as inspiration for adding MUTF-8 support to NBTify, since it's essential that strings serialize the same way they do in the base game.
https://github.com/sciencesakura/mutf-8
https://stackoverflow.com/questions/7921016/what-does-it-mean-to-say-java-modified-utf-8-encoding
https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

Offroaders123/NBTify#42 (NBTify hook :P)
Offroaders123 added a commit that referenced this issue Dec 29, 2023
This is just a temporary thing to test things out. I need to add another test which includes MUTF-8-specific strings into an NBT file, because right now all of my tests are passing either way, even though NBTify wasn't MUTF-8 compliant, meaning that it's not 1:1 in serialization with what's expected in the NBT spec.
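
As a rough sketch of why the existing tests can't tell the difference: a string only serializes differently under MUTF-8 if it contains U+0000 or a UTF-16 surrogate (i.e. a character outside the BMP). `differsInMUTF8()` is a hypothetical helper, not part of NBTify, that could be used to check whether a test fixture actually exercises the encoding.

```ts
// Hypothetical helper for illustration; not part of NBTify or the mutf-8 package.
// Returns true when the string's MUTF-8 bytes would differ from its UTF-8 bytes.
export function differsInMUTF8(value: string): boolean {
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);

    // U+0000 is written as C0 80 in MUTF-8, but as 00 in UTF-8.
    if (unit === 0x0000) return true;

    // Surrogates are written as 3 bytes each in MUTF-8, whereas UTF-8
    // combines a pair into a single 4-byte sequence.
    if (unit >= 0xD800 && unit <= 0xDFFF) return true;
  }

  return false;
}

console.log(differsInMUTF8("Hello, world!")); // false
console.log(differsInMUTF8("caf\u00E9"));     // false
console.log(differsInMUTF8("a\0b"));          // true
console.log(differsInMUTF8("\u{1F600}"));     // true
```

Any fixture whose strings all return `false` here will round-trip identically with or without MUTF-8 support.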

And this part is temporary because I'm debating whether I should bundle `mutf-8` into NBTify and start building it for the browser, rather than going the no-dependency route where I only transpile with TS. If I bundle for the browser (ESM), then I can use regular dependencies in Node and put everything together for use elsewhere. If I do that, maybe I'll rethink CJS support as well, since it's not that hard to enable nowadays. I've wanted to try building/transpiling with plain ESBuild for some time now as well, since it allows for a minified output for the web: you still get tree-shaking when installing locally and using bundlers, and you can use it feature-complete in the browser as well.

https://wiki.vg/NBT#Specification

#42

sciencesakura/mutf-8#17
Offroaders123 added a commit that referenced this issue Jan 3, 2024
- Binary Empty ListTag Type Persistence #40
- Runtime `typeof` Validation, now type-validated in TS too! Essentially, my API type validation code is now type-checked as well, since I want to ensure the runtime checks properly match that of their TS types/shapes. This is done for function parameters, and things like that, where NBTify is designed to throw when passed in incorrect values or primitive types.
- Options object spreading / defining. This moves the destructuring of API options objects out of the function signature and into the body, so the original options can be passed along as a single object elsewhere in the body, rather than having to build the same options object again later (restructuring? lol, hehe). This helps avoid forgetting to add newly added config options to the restructuring call, which I had actually done by accident before; TS won't catch it, because the config objects allow their properties to be optional. Now you have to explicitly set a value to `undefined` if you don't want it passed in from the parent options parameter. Here's a better demo for this crazy talk:

```ts
export interface Hi {
  nice: number;
  haha: number;
}

export interface HiOptions extends Partial<Hi> {}

export function hiLegacy({ nice, haha }: HiOptions = {}): Hi {
  if (nice === undefined){
    return hi({ nice: 10, haha });
  }

  nice satisfies Hi["nice"];

  if (haha === undefined){
    // forgot `nice`, no errors, eek!
    return hi({ haha: 25 });
  }

  haha satisfies Hi["haha"];

  return { nice, haha };
}

export function hi(options: HiOptions = {}): Hi {
  const { nice, haha } = options;

  if (nice === undefined){
    return hi({ ...options, nice: 10 });
  }

  nice satisfies Hi["nice"];

  if (haha === undefined){
    return hi({ ...options, haha: 25 });
  }

  haha satisfies Hi["haha"];

  return { nice, haha };
}
```

- RootName + Bedrock Level Primitive
- NBTReader/NBTWriter State Handling

Another thing tested during this update period was MUTF-8 support, but that's not quite sorted out yet, so I'm going to hold it back until another release. #42

Clouseaupolice...!
Offroaders123 added a commit that referenced this issue May 14, 2024
This was an interesting one! Thankfully I found an issue page about MUTF-8 handling on the repo for Twoolie/NBT, the Python project. It gave me some insight and a file to test against. I wrote my own script to slim it down a bunch, and dedupe the tags that are used multiple times. It's crazy how big just book text can get!

I used this actual version of NBTify in this commit to write the new content to the file. That's also why I diffed it: I wanted to make sure that when I slimmed it down, the content coming out of it was actually what it was supposed to be. With older NBTify it didn't work correctly, because MUTF-8 handles some things differently from standard UTF-8.

```js
// @ts-check

import { readFile, writeFile } from "node:fs/promises";
import * as NBT from "./NBTify/src/index.ts";

const data = await readFile("./hotbar.nbt");

const trimmed = data.subarray(0x000BAE96, 0x000CA7C2);
console.log(trimmed);

/** @type {NBT.NBTData<any>} */
const hotbar = await NBT.read(data);

const book = hotbar.data[0][1].tag.BlockEntityTag.Items[12];
console.log(book);

const mutf8Demo = await NBT.write(book);
console.log(mutf8Demo);

const demoDiff = mutf8Demo.subarray(1, -2);
console.log(Buffer.compare(trimmed, demoDiff));

await writeFile("./alien-book.nbt", mutf8Demo);
```
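
To illustrate the decoding side, here's a minimal sketch of a MUTF-8 decoder. `decodeMUTF8()` is a hypothetical function written for this note, not the code NBTify or `mutf-8` actually uses. It only accepts 1-, 2-, and 3-byte sequences, and it keeps each decoded surrogate half as its own UTF-16 code unit, which is what lets the 6-byte form of an astral character come back out as the original string.

```ts
// Hypothetical decoder for illustration; not the implementation used by NBTify or mutf-8.
export function decodeMUTF8(bytes: Uint8Array): string {
  let result = "";
  let i = 0;

  // Reads one continuation byte (10xxxxxx) and returns its 6 payload bits.
  const continuation = (): number => {
    const byte = bytes[i++];
    if (byte === undefined || (byte & 0xC0) !== 0x80) {
      throw new Error(`Invalid continuation byte at offset ${i - 1}`);
    }
    return byte & 0x3F;
  };

  while (i < bytes.length) {
    const lead = bytes[i++]!;

    if (lead < 0x80) {
      // 0xxxxxxx: single byte.
      result += String.fromCharCode(lead);
    } else if ((lead & 0xE0) === 0xC0) {
      // 110xxxxx 10xxxxxx: two bytes, including C0 80 for U+0000.
      const low = continuation();
      result += String.fromCharCode(((lead & 0x1F) << 6) | low);
    } else if ((lead & 0xF0) === 0xE0) {
      // 1110xxxx 10xxxxxx 10xxxxxx: three bytes, including surrogate halves.
      const high = continuation();
      const low = continuation();
      result += String.fromCharCode(((lead & 0x0F) << 12) | (high << 6) | low);
    } else {
      // MUTF-8 never produces 4-byte sequences or stray continuation bytes.
      throw new Error(`Invalid lead byte at offset ${i - 1}`);
    }
  }

  return result;
}

const astral = decodeMUTF8(Uint8Array.of(0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80));
console.log(astral === "\u{1F600}"); // true

const nul = decodeMUTF8(Uint8Array.of(0xC0, 0x80));
console.log(nul === "\0"); // true
```

A plain UTF-8 decoder treats those same surrogate bytes as invalid input, which would explain the kind of mismatch older versions ran into with strings like these.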

#42
#44
twoolie/NBT#144 (comment)
twoolie/NBT#144

I'm still not sure whether I'm going to use the dependency itself or just embed it into NBTify on its own. I think I may just use it as a dependency, as I've been trying to get more used to not reinventing the wheel for everything unless that has benefits. The MUTF-8 library already does everything I need it to, and it's ESM TypeScript, so I'm not sure what other reason I'd have to not just use it, it's great! Eventually I want to move my compression handling into a separate module too, so I'll have to use module resolution for that down the road either way. I say heck to it! Let's do it :) Gonna look into whether there's anything I'm forgetting before doing that, though. I really like having the ability to use projects like these (NBTify) without needing a transpilation or build step. Modern CDNs seem to handle this nicely, so we'll see.
@Offroaders123
Owner Author

I think we're nearly there! I'm leaning towards getting used to adding dependencies to the project, I haven't done that for modules I share yet. It was new initially to do that with my apps, but now I think I'm going to start doing it with my npm packages too.

@Offroaders123
Owner Author

Realized I hadn't closed this, it's now in the stable build on npm (I think it was sometime last week I published it), and the CDN is linked properly. We made it!! 🙂 🚀

@Offroaders123
Owner Author

Reopening this, as I recently found out that only Java Edition NBT uses MUTF-8 encoding for the format, with LCE and Bedrock using plain UTF-8 instead.

This makes things a bit more complex to deduce, especially since Java and LCE both use big endian (endianness is easy to detect because of the errors that occur in trying to read one as the other).
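
As a sketch of the endianness half of this, here's one way the check could look. `guessEndian()` is a hypothetical heuristic, not NBTify's actual detection logic, and it assumes uncompressed NBT with no extra header in front of the root tag. It reads the root name length both ways and sees which interpretation fits in the remaining bytes.

```ts
export type Endian = "big" | "little";

// Hypothetical heuristic for illustration; not NBTify's actual detection logic.
// Assumes uncompressed NBT that starts directly at the root tag.
export function guessEndian(data: Uint8Array): Endian | null {
  // Layout: 1-byte tag ID, then an unsigned 16-bit root name length, then the name.
  if (data.length < 3) return null;

  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const remaining = data.length - 3;

  // A wrong-endian read of a short name usually yields a length past the end of the data.
  const bigFits = view.getUint16(1, false) <= remaining;
  const littleFits = view.getUint16(1, true) <= remaining;

  // Both plausible (e.g. an empty root name) or neither: can't tell from this alone.
  if (bigFits === littleFits) return null;

  return bigFits ? "big" : "little";
}

// Root CompoundTag named "hello", followed by an End tag.
const name = [0x68, 0x65, 0x6C, 0x6C, 0x6F];
console.log(guessEndian(Uint8Array.of(0x0A, 0x00, 0x05, ...name, 0x00))); // "big"
console.log(guessEndian(Uint8Array.of(0x0A, 0x05, 0x00, ...name, 0x00))); // "little"
```

Note that a `"big"` result still doesn't say whether the strings are MUTF-8 (Java) or plain UTF-8 (LCE), so that part needs some other signal.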
