I’m migrating data into Yjs and don’t need to keep all the updates, since there are no concurrent editors, just the migration script. How can I minimize the data I’m storing? Currently, I use Buffer.from(encodeStateAsUpdate(doc)) and store the binary in SQL. However, the migration took hundreds or thousands of operations on the Y.Doc, and the stored binary ended up around 5x larger than the original JSON. How can I optimize this?
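For reference, the shape of what I’m doing (simplified; sourceJson and the table layout are placeholders):

```js
import * as Y from 'yjs'

// Placeholder for the legacy data being migrated.
const sourceJson = { title: 'hello', count: 42 }

const doc = new Y.Doc()
const root = doc.getMap('root')
for (const [key, value] of Object.entries(sourceJson)) {
  root.set(key, value) // in practice, hundreds or thousands of these calls
}

// Encode the whole document state as one update and store the binary,
// e.g. INSERT INTO documents (id, state) VALUES (?, ?) with state as LONGBLOB.
const state = Buffer.from(Y.encodeStateAsUpdate(doc))
```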
Maybe you can try the Y.mergeUpdates API, which merges multiple binary updates into one: Document Updates | Yjs Docs
Usually, the encoded Y.Doc will be about 1.4x the size of the original document after long editing sessions. In some cases (especially when using the v2 encoding, Y.encodeStateAsUpdateV2), the encoded doc might be smaller than the JSON document.
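A minimal sketch of both options (collecting per-operation updates via the doc’s 'update' event is just one way to get them):

```js
import * as Y from 'yjs'

const doc = new Y.Doc()

// Collect every incremental update the migration produces.
const updates = []
doc.on('update', update => updates.push(update))

doc.getMap('root').set('a', 1)
doc.getMap('root').set('b', 2)

// Merge the collected updates into one compact update. This operates on
// the binary updates directly, without instantiating another Y.Doc.
const merged = Y.mergeUpdates(updates)

// Alternatively, re-encode the final doc with the v2 format, which is
// often smaller than v1.
const v2 = Y.encodeStateAsUpdateV2(doc)

console.log(merged.byteLength, v2.byteLength)
```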
It could also be an encoding issue. I’ve consulted for several companies that encode the Uint8Array / Buffer using JSON encoding (because of an implicit transformation). E.g. instead of storing the binary object, they store the JSON array "[0,1,...]", which results in an overhead of about 5x the original size. Some databases support binary data; otherwise you should use base64 encoding. There is a special section about encoding in the docs that @doodlewind mentioned.
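A sketch of the difference in Node (the Buffer base64 round trip is one common approach for text columns):

```js
import * as Y from 'yjs'

const doc = new Y.Doc()
doc.getMap('root').set('hello', 'world')
const update = Y.encodeStateAsUpdate(doc)

// Anti-pattern: JSON-encoding the Uint8Array ("[0,1,...]") inflates the
// payload to roughly 5x the binary size.
const asJson = JSON.stringify(Array.from(update))

// Better for text columns: base64, which adds only ~33% overhead.
const asBase64 = Buffer.from(update).toString('base64')
console.log(update.byteLength, asJson.length, asBase64.length)

// Decoding back into a fresh doc:
const doc2 = new Y.Doc()
Y.applyUpdate(doc2, Buffer.from(asBase64, 'base64'))
```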
Thanks for the response. I am generating a Buffer from the result of encodeStateAsUpdate(doc) and plonking that straight into MySQL as a LONGBLOB (binary).
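Roughly like this (the connection setup and documents table are illustrative, using the mysql2 client):

```js
import * as Y from 'yjs'
import mysql from 'mysql2/promise'

// Hypothetical table:
// CREATE TABLE documents (id VARCHAR(64) PRIMARY KEY, state LONGBLOB)
const conn = await mysql.createConnection({
  host: 'localhost',
  user: 'root',
  database: 'app'
})

const doc = new Y.Doc()
doc.getMap('root').set('hello', 'world')

// mysql2 sends Buffer parameters as binary, so no base64/JSON round trip.
const state = Buffer.from(Y.encodeStateAsUpdate(doc))
await conn.execute(
  'INSERT INTO documents (id, state) VALUES (?, ?)',
  ['doc-1', state]
)
```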
I’m sure this is not relevant, but will the update be smaller if I use the constructor syntax of the shared types when migrating? I.e. new Y.Map([lots of data]) rather than calling map.set lots and lots of times?
No, there shouldn’t be any difference. However, if your encoded Y.Doc is larger than the JSON document right after initialization, something seems wrong. Yjs encodes things more efficiently than JSON, though it needs to keep part of the history after deletions.
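If you want to verify this yourself, here is a quick size comparison (a sketch; the entries are placeholders):

```js
import * as Y from 'yjs'

const entries = [['a', 1], ['b', 2], ['c', 3]]

// Variant 1: pass the entries to the Y.Map constructor.
const doc1 = new Y.Doc()
doc1.getMap('root').set('m', new Y.Map(entries))

// Variant 2: call set() once per entry.
const doc2 = new Y.Doc()
const m = new Y.Map()
doc2.getMap('root').set('m', m)
for (const [k, v] of entries) m.set(k, v)

// Both encodings should come out essentially the same size.
console.log(
  Y.encodeStateAsUpdate(doc1).byteLength,
  Y.encodeStateAsUpdate(doc2).byteLength
)
```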