Extending RSV to support Base64-encoded binary data out-of-the-box #1

Open
CC007 opened this issue Jan 7, 2024 · 4 comments

@CC007

CC007 commented Jan 7, 2024

Idea

From a comment on your YouTube video by Rik Schaaf (me) (https://www.youtube.com/watch?v=tb_70o6ohMA&lc=Ugzsfj_OUAK4s_IYaNZ4AaABAg):

What about extending the RSV format to support Base64-encoded binary data by prefixing a value with \xFB? (I chose FB because B is easy to remember for Binary/Base64, and because the byte is still invalid in UTF-8, which prevents collisions, according to your table at 4:41.)
This would make it much cheaper to represent numbers (with more than 2 digits) and dates (for example as timestamps).
It would also allow lossless transfer of floating-point values (which is a problem when just using strings, since a decimal string representation doesn't always map losslessly to the binary representation).
It could even allow the encoding of an image as a bitmap or any other binary data.
This extension would turn the Array<Array<String | null>> data structure into Array<Array<String | Array | null>> instead (where the inner Array holds the decoded bytes).
With this addition, you could even embed an RSV file within an RSV file, because the inner RSV file would be Base64-encoded, preventing any collisions with the special characters.

Example

So:

[
 [1234567890, "Hello", "🌍", null]
]

Would translate to:

251 | 83, 90, 89, 67, 48, 103, 61, 61 | 255 | 72, 101, 108, 108, 111 | 255 | 240, 159, 140, 141 | 255 | 254 | 255 | 253
\FB | B64: SZYC0g==                   | \FF |        "Hello"         | \FF |        "🌍"        | \FF | \FE | \FF | \FD
    | Hex: 499602D2                   |
    | Dec: 1234567890                 |

So in essence, without a prefix you get UTF-8-encoded data, and with the \FB prefix you get Base64-encoded data (which is ASCII- and UTF-8-compatible, to my knowledge).
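
To make the proposed rule concrete, here is a minimal Python sketch of an encoder and decoder. It assumes the RSV terminator bytes shown in the example above (\FF ends a value, \FE marks null, \FD ends a row) and treats any `bytes` value as binary data that gets the \FB prefix. The function names `encode_rows` and `decode_rows` are made up for illustration and are not part of any RSV library; the big-endian byte order for the integer is also an assumption.

```python
import base64

# Terminator bytes as used in the example; BINARY is the proposed new prefix.
VALUE_END, NULL, ROW_END, BINARY = b"\xff", b"\xfe", b"\xfd", b"\xfb"


def encode_rows(rows):
    """Encode rows of str | bytes | None, using the proposed \\xFB prefix for bytes."""
    out = bytearray()
    for row in rows:
        for value in row:
            if value is None:
                out += NULL
            elif isinstance(value, (bytes, bytearray)):
                # Base64 output is plain ASCII, so it can never contain F8-FF.
                out += BINARY + base64.b64encode(value)
            else:
                out += value.encode("utf-8")
            out += VALUE_END
        out += ROW_END
    return bytes(out)


def decode_rows(data):
    """Inverse of encode_rows; binary values come back as bytes."""
    rows = []
    for raw_row in data.split(ROW_END)[:-1]:  # drop the empty tail after the last \xFD
        row = []
        for raw in raw_row.split(VALUE_END)[:-1]:
            if raw == NULL:
                row.append(None)
            elif raw.startswith(BINARY):
                row.append(base64.b64decode(raw[1:]))
            else:
                row.append(raw.decode("utf-8"))
        rows.append(row)
    return rows


# The integer from the example, stored as 4 big-endian bytes (an assumption).
number = (1234567890).to_bytes(4, "big")
encoded = encode_rows([[number, "Hello", "🌍", None]])
decoded = decode_rows(encoded)
```

Note that the decoder hands back raw bytes for the first value; turning them into the integer 1234567890 again is still up to the program reading the file.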

What is this addition trying to do

The advantage of this encoding addition is that non-Unicode data could also be represented without risk of collisions, including the RSV special characters themselves.

Another advantage is that some data types can be stored more efficiently, like numbers and dates.

What is this addition NOT trying to do (but what could be added in a separate issue)

This is not a change to add the data types themselves to RSV. This additional special character only signifies the encoding, not the datatype, so you wouldn't know if the data represents an integer, timestamp, float, etc., just like you wouldn't know this with the current implementation. This is still left to the program that is using the RSV file.

If the data type had to be derivable from the binary data, the Base64 value could be prefixed (after the \FB) by a string surrounded by non-Base64 characters to signify the data type, like (i32) for 32-bit integers.
Example:

251 | 40, 105, 51, 50, 41 | 83, 90, 89, 67, 48, 103, 61, 61 | 255 | 253
\FB |  type: 32-bit int   |         value: SZYC0g==         | \FF | \FD

...which would represent a single integer (int32) value that equals 1234567890.
Or you could use a simpler but more restrictive typing system that uses a single non-Base64 character to define the type, followed by a single character for the size.

251 | 35, 52 | 83, 90, 89, 67, 48, 103, 61, 61 | 255 | 253
\FB |   #4   |             SZYC0g==            | \FF | \FD

...where # defines an integer and 4 defines a size of 4 bytes (32 bit): 1234567890

251 | 126, 52 | 81, 69, 107, 80, 50, 119, 61, 61 | 255 | 253
\FB |   ~4    |             QEkP2w==             | \FF | \FD

...where ~ defines a floating point value and 4 defines a size of 4 bytes (32 bit): 3.141592...
This is out of scope for this issue though.
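
For illustration only, the single-character tag scheme could be sketched in Python like this. The tag table `FORMATS`, the helper names, and the big-endian byte order are all assumptions made for the sketch; none of this is defined by RSV or by this proposal.

```python
import base64
import struct

# Hypothetical tag table: type character + size character -> struct format.
# '#' = integer, '~' = floating point; big-endian is assumed throughout.
FORMATS = {"#4": ">i", "#8": ">q", "~4": ">f", "~8": ">d"}


def encode_typed(tag, value):
    """Build one field: \\xFB, the two-character tag, then the Base64 payload."""
    payload = struct.pack(FORMATS[tag], value)
    return b"\xfb" + tag.encode("ascii") + base64.b64encode(payload)


def decode_typed(field):
    """Split a field produced by encode_typed back into (tag, value)."""
    if field[:1] != b"\xfb":
        raise ValueError("not a binary field")
    tag = field[1:3].decode("ascii")
    (value,) = struct.unpack(FORMATS[tag], base64.b64decode(field[3:]))
    return tag, value


int_field = encode_typed("#4", 1234567890)           # b"\xfb#4SZYC0g=="
float_field = encode_typed("~4", 3.141592653589793)  # b"\xfb~4QEkP2w=="
```

Because the tag has a fixed length of two characters, the size digit can safely be a character that also appears in the Base64 alphabet.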

Considerations

With this addition, the name isn't really accurate anymore, so would this be RBSV (Rows of Binary or String Values)?

@CC007

CC007 commented Jun 9, 2024

To my knowledge, if you know that the resulting binary is 8-bit aligned, you can also skip the = padding characters at the end of the Base64 string.

So you would get 83, 90, 89, 67, 48, 103 instead of 83, 90, 89, 67, 48, 103, 61, 61
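
A small Python sketch of that idea: strip the padding when writing and restore it when reading, since the missing padding length follows from the string length. The helper names are hypothetical.

```python
import base64


def b64_encode_unpadded(data: bytes) -> bytes:
    """Base64-encode and drop the trailing '=' padding."""
    return base64.b64encode(data).rstrip(b"=")


def b64_decode_unpadded(text: bytes) -> bytes:
    """Restore the padding to a multiple of 4 characters, then decode."""
    return base64.b64decode(text + b"=" * (-len(text) % 4))


short = b64_encode_unpadded((1234567890).to_bytes(4, "big"))  # b"SZYC0g"
restored = b64_decode_unpadded(short)
```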

@CC007

CC007 commented Jun 9, 2024

Base64 is a 6-bit encoding scheme, but since only F8-FF are reserved, you could get away with a 7-bit encoding, like ASCII (with the input padded to a multiple of 7 bits, just as Base64 pads to a multiple of 6 bits).

The only thing would be that you can't cleanly view the characters, which also hinders the ability to copy. I don't know if that's an important consideration though.
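
As a rough sketch of what such a 7-bit scheme could look like (this is one possible packing, not an existing standard; the function names are invented): the input bits are zero-padded to a multiple of 7 and emitted as bytes in the range 00-7F, and the decoder discards the leftover padding bits.

```python
def encode_7bit(data: bytes) -> bytes:
    """Pack the input into 7-bit groups, one group per output byte (00-7F)."""
    nbits = len(data) * 8
    pad = -nbits % 7  # zero bits appended to reach a multiple of 7
    bits = int.from_bytes(data, "big") << pad
    count = (nbits + pad) // 7
    return bytes((bits >> (7 * (count - 1 - i))) & 0x7F for i in range(count))


def decode_7bit(text: bytes) -> bytes:
    """Inverse of encode_7bit; assumes every input byte is below 0x80."""
    bits = 0
    for b in text:
        bits = (bits << 7) | b
    nbits = len(text) * 7
    # Fewer than 7 padding bits were added, so dropping nbits % 8 recovers the input.
    return (bits >> (nbits % 8)).to_bytes(nbits // 8, "big")


packed = encode_7bit((1234567890).to_bytes(4, "big"))
```

For the 4-byte integer from the earlier example, this yields 5 bytes instead of the 6 unpadded Base64 characters, at the cost of possibly containing control characters.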

@zacharysyoung

I would like to see the RSV spec kept as minimal as possible, so that it might someday replace CSV, which itself is very simple (and I think that's one of its strongest attributes).

I like that RSV has a formal spec. It may not be complete, and it may need improvement, but it sure beats the raft of different conventions that plague CSV. I also believe that RSV not being plain text, and therefore not being editable by humans in a dumb text editor, will remove validation errors that people can introduce.

@CC007, I'd like to see something very close to RSV as it's currently defined be formalized. Once we have a very simple and strong base, other formats/encodings, like you've proposed, can be built on top of it.

@zacharysyoung

The author also addressed this concern:

Not having special data types makes the format both simple and universal, because you can represent every data type as string, without worrying about precision (int32, int64, ...) or aspects like endianness (little/big). It also makes writing code to read and write RSV documents really easy.
