Lite^3, a JSON-compatible zero-copy serialization format

152

https://lite3.io/design_and_limitations.html

See also Show HN: Lite³ – A JSON-Compatible Zero-Copy Serialization Format in 9.3 kB of C - https://news.ycombinator.com/item?id=45992832 (no comments, but a good writeup)

Author here,

First of all, hello Hacker News :)

Many of the comments seem to address the design of key hashing. The reason for using hashed keys inside B-tree nodes instead of the string keys directly is threefold:

1) The implementation is simplified.

2) When performing a lookup, it is faster to compare fixed-sized elements than it is to do variable length string comparison.

3) The key length is unlimited.

I should say the documentation page is out of date regarding hash collisions. The format now supports probing thanks to a PR merged yesterday. So inserting colliding keys will actually work.

It is true that databases and other formats do store string keys directly in the nodes. However as a memory format, runtime performance is very important. There is no disk or IO latency to 'hide behind'.

Right now the hash function used is DJB2. It has the interesting property of somewhat preserving the lexicographical ordering of the key names. So hashes for keys like "item_0001", "item_0002" and "item_0003" are actually more likely to also be placed sequentially inside the B-tree nodes. This can be useful when doing a sequential scan on the semantic key names, otherwise you are doing a lot more random access. Also DJB2 is so simple that it can be calculated entirely by the C preprocessor at compile time, so you are not actually paying the runtime cost of hashing.
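To illustrate the ordering property described above, here is a minimal sketch of the classic DJB2 hash in C. The 32-bit hash width and the function name `djb2` are assumptions made for this example; this is not taken from the Lite³ source.

```c
#include <stdint.h>

/* Classic DJB2: start at 5381, then hash = hash * 33 + byte.
 * The 32-bit width is an assumption for this sketch. */
static uint32_t djb2(const char *key)
{
    uint32_t hash = 5381u;
    for (const unsigned char *p = (const unsigned char *)key; *p != '\0'; p++)
        hash = hash * 33u + *p;   /* unsigned arithmetic wraps modulo 2^32 */
    return hash;
}
```

Keys that share a prefix and differ only in their final byte get hashes that differ by exactly the difference of those bytes. So "item_0001", "item_0002" and "item_0003" hash to three consecutive integers, which is why they tend to land next to each other in a structure ordered by hash.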

We will be doing a lot more testing before DJB2 is finalized in the spec, but might later end up with a 'better' hash function such as XXH32.

Finally, TRON/Lite³ compared to other binary JSON formats (BSON, MsgPack, CBOR, Amazon Ion) is different in that:

1) none of the formats mentioned provide direct zero-copy indexed access to the data

2) none of the formats mentioned allow for partial mutation of the data without rewriting most of the document

This last point 2) is especially significant. For example, JSONB in Postgres is immutable. When replacing or inserting one specific value inside an object or array, with JSONB you will rewrite the entire document as a result of this, even if it is several megabytes large. If you are performing frequent updates inside JSONB documents, this will cause severe write amplification. This is the case for all current Postgres versions.

TRON/Lite³ is designed to blur the line between memory and serialization format.

1 month ago by eliasdejong

Lite^3 is a clever encoding for JSON data that is indexed as-encoded and is mutable in place.

Perhaps I should have posted this URI instead: https://lite3.io/design_and_limitations.html

Lite^3 deserves to be noticed by HN. u/eliasdejong (the author) posted it 23 days ago but it didn't get very far. I'm hoping this time it gets noticed.

1 month ago by cryptonector

This is cool, but the headline makes it sound like the wire format is JSON-compatible, which is not the case. I'm not really sure why there is a focus on JSON here at all. It's the least interesting part of this, and the same could be said for almost every serialization format.

1 month ago by bawolff

This is super interesting!

Apache Arrow is trying to do something similar, using FlatBuffers to serialize with zero-copy and zero-parse semantics, and an index structure built on top of that.

Would love to see comparisons with Arrow

1 month ago by lsb

The docs mention that space for overwritten variable-sized values in the buffer is not reclaimed:

    The overridden space is never recovered, causing buffer size
    to grow indefinitely.

Is the garbage at least zeroed? Otherwise it seems like it could "leak" overwritten values when sending whole buffers via memcpy.
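For illustration, here is a sketch of what scrubbing on overwrite could look like. It assumes a hypothetical append-only buffer in which an overwrite appends the new value and leaves the old bytes behind; `struct doc`, `doc_append` and `doc_overwrite` are invented names, and the thread does not say whether Lite³ does anything like this.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical append-only document buffer (not the Lite³ layout). */
struct doc {
    uint8_t data[256];
    size_t  used;      /* number of bytes currently occupied */
};

/* Append len bytes; return their offset, or (size_t)-1 if they do not fit. */
static size_t doc_append(struct doc *d, const void *src, size_t len)
{
    if (len > sizeof d->data - d->used)
        return (size_t)-1;
    size_t off = d->used;
    memcpy(d->data + off, src, len);
    d->used += len;
    return off;
}

/* Replace the value at old_off: append the new bytes, then zero the old
 * ones so they cannot leak when the whole buffer is copied out. The space
 * is still not reclaimed; only its contents are scrubbed. */
static size_t doc_overwrite(struct doc *d, size_t old_off, size_t old_len,
                            const void *src, size_t len)
{
    size_t off = doc_append(d, src, len);
    if (off == (size_t)-1)
        return off;
    memset(d->data + old_off, 0, old_len);
    return off;
}
```

The cost is one extra memset per overwrite, proportional to the size of the old value.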

1 month ago by al2o3cr

The hash collision limitation for keys is the most questionable part of the design. Usually that's handled by forcing key lookup to verify that what you looked up matches what you tried to look up. Resolving this perf hit is probably doable by having an extra table of conflicting hashes.
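As a rough sketch of the "verify the key on lookup" approach, the code below compares the fixed-width hash first and only falls back to a string comparison when the hashes match, probing to the next slot on a mismatch. It uses a flat open-addressing table with linear probing purely for brevity. That layout, the table size, and the names `struct slot`, `table_put` and `table_get` are all assumptions for this example and do not reflect how Lite³ lays out its B-tree nodes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT 8u   /* hypothetical table size; must be a power of two */

struct slot {
    uint32_t    hash;   /* fixed-width hash, compared first (cheap) */
    const char *key;    /* full key, compared only on a hash match; NULL = empty */
    int         value;
};

static uint32_t djb2(const char *key)
{
    uint32_t hash = 5381u;
    for (const unsigned char *p = (const unsigned char *)key; *p != '\0'; p++)
        hash = hash * 33u + *p;
    return hash;
}

/* Insert or update; returns 0 on success, -1 if the table is full. */
static int table_put(struct slot *t, const char *key, int value)
{
    uint32_t h = djb2(key);
    for (uint32_t i = 0; i < SLOT_COUNT; i++) {
        struct slot *s = &t[(h + i) & (SLOT_COUNT - 1u)];
        if (s->key == NULL || (s->hash == h && strcmp(s->key, key) == 0)) {
            s->hash = h;
            s->key = key;
            s->value = value;
            return 0;
        }
    }
    return -1;
}

/* Returns a pointer to the stored value, or NULL if the key is absent. */
static const int *table_get(const struct slot *t, const char *key)
{
    uint32_t h = djb2(key);
    for (uint32_t i = 0; i < SLOT_COUNT; i++) {
        const struct slot *s = &t[(h + i) & (SLOT_COUNT - 1u)];
        if (s->key == NULL)
            return NULL;                  /* empty slot ends the probe chain */
        if (s->hash == h && strcmp(s->key, key) == 0)
            return &s->value;             /* hash and full key both match */
    }
    return NULL;
}
```

The string comparison only runs when two 32-bit hashes are equal, so the common path stays a fixed-width integer compare, and a genuine collision costs one extra probe instead of a failed insert.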

1 month ago by tarasglek

GLTF is like this too (or PLY)? The main difference is the format of their headers? Just by reading the header you can parse the binary data. I'm surprised BSON and any of the other binary JSON formats they list don't support reading the memory layout in a header.

1 month ago by koolala

It would be interesting to use lite3 for blob storage in or with sqlite.

1 month ago by mhalle

This is nice, but please don't clickbait headlines with straight-up lies. This is not JSON-compatible.

1 month ago by Jean-Papoulos

So it's not really a serialization format; it's a compact, modifiable, untyped tree that one can therefore send to another machine with the same architecture, or deserialise into native language-specific data structures.

Don't get me wrong, I find this type of data structure interesting and useful, but it's misleading to call it "serialization", unless my understanding is wrong.

1 month ago by rixed

I'm suspicious of their FlatBuffers performance comparison.

1 month ago by IshKebab

The benchmarks are flawed: verification is not generally used after serialization with FlatBuffers. Deserialization with FlatBuffers is a simple reinterpret_cast, so it makes no sense for it to take 41.69ms.

It's just dishonest.

1 month ago by yIt9R8