encodeStateAsUpdate encodes the state into a single binary blob (i.e. an “update”—although it contains the entire history of edits). Due to the binary compression algorithm, this will always be smaller than the live, decoded state. It will also always be smaller than the same history split across multiple binary updates (e.g. if they are stored incrementally).
encodeStateAsUpdate is always safe and will not destroy the history. Because it preserves the full history, it can only reduce the size of the document so much. But that's sort of a given with CRDTs.
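For concreteness, here is a minimal sketch of that single blob in practice (the 'content' key is just an example name):

import * as Y from 'yjs';

const doc = new Y.Doc();
doc.getText('content').insert(0, 'hello');
doc.getText('content').insert(5, ' world');

// one binary blob containing the full history of `doc`
const fullUpdate = Y.encodeStateAsUpdate(doc);

// any other Doc can be brought up to date by applying that blob
const copy = new Y.Doc();
Y.applyUpdate(copy, fullUpdate);
console.log(copy.getText('content').toString()); // 'hello world'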
Sometimes minimizing the total size by compressing the entire history into a single update is not ideal: encodeStateAsUpdate can use a lot of memory, and retrieving the entire Doc history from the database as a single blob can be rather slow. In that case, storing updates incrementally allows a batch read to return just the updates that are needed.
const persistedYdoc = await mdb.getYDoc(docName);
// get the state vector so we can just store the diffs between client and server
const persistedStateVector = Y.encodeStateVector(persistedYdoc);
// better to just get the differences and save those:
const diff = Y.encodeStateAsUpdate(ydoc, persistedStateVector);
// store the new data in the db (if there is any: an empty update encodes as all-zero bytes)
if (diff.reduce((sum, byte) => sum + byte, 0) > 0) {
  mdb.storeUpdate(docName, diff);
}
@toantd90 Storing individual updates (diffs) will take up more space in the database than storing one big update, because each stored update carries its own encoding overhead. The downside of the single big update is that you have to load the entire blob into memory in order to retrieve any data.
Consider two approaches to storage when the document changes:
1. Encode the entire Doc state and replace the existing database blob with the entire compressed update. This can be debounced to avoid churn (see the sketch after this list).
2. Encode and append an encoded diff of the changes. This can also be debounced to avoid churn, and results in smaller incremental updates. This is the approach taken in your code example.
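A sketch of approach (1); replaceDocument and the 1000 ms wait below are illustrative stand-ins, not part of any provider's API:

import * as Y from 'yjs';

// trailing-edge debounce: only the last call within `wait` ms actually runs
const debounce = (fn, wait) => {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), wait);
  };
};

// `replaceDocument` stands in for whatever write overwrites the stored blob
const persistFullState = debounce((docName, ydoc) => {
  const fullState = Y.encodeStateAsUpdate(ydoc); // the entire history as one blob
  return replaceDocument(docName, fullState);
}, 1000);

ydoc.on('update', () => persistFullState(docName, ydoc));

Debouncing is safe in this variant because every write contains the complete state, so skipping intermediate writes loses nothing.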
I was merely pointing out that the total size of the data is smaller in (1), while the incremental size of the data is smaller in (2). There is a tradeoff there that one has to factor into the architecture.
The obvious synthesis is to use approach (2) for realtime updates, and then periodically compact blocks of incremental updates by re-encoding them as a single update to reduce the overall size, as in (1) (as shown in the CodeSandbox demo). This is not standardized, but it can be seen in the implementation of y-indexeddb and some other providers.
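A sketch of that compaction step; getStoredUpdates and replaceAllUpdates are hypothetical stand-ins for your own storage layer:

import * as Y from 'yjs';

// `getStoredUpdates` and `replaceAllUpdates` are hypothetical storage helpers
async function compact(docName) {
  const updates = await getStoredUpdates(docName); // Uint8Array[] of stored incremental updates
  if (updates.length < 2) return; // nothing worth compacting

  // re-encode the accumulated history as a single update (approach (1))...
  const tmp = new Y.Doc();
  updates.forEach((u) => Y.applyUpdate(tmp, u));
  const merged = Y.encodeStateAsUpdate(tmp);
  tmp.destroy();

  // ...and replace the individual rows with that one blob
  await replaceAllUpdates(docName, merged);
}

Yjs also exposes Y.mergeUpdates, which combines encoded updates without constructing a Doc; as I understand it, though, it cannot garbage-collect deleted content the way loading into a Doc does, so the result tends to be somewhat larger.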
That said, this optimization strategy (combining approaches (1) and (2)) does not solve the fundamental problem of too much data, too fast, which can easily arise in a model that stores the entire history of changes (i.e. the CRDT model itself). I think the technology is still evolving to handle non-word-processor use cases that involve greater volumes of data. For now, balancing these concerns and optimizing storage size and memory usage sits in userland.
This will only store the last update, since debounce ignores earlier calls. Dropping updates will result in an invalid Doc.
Instead, you want to accumulate updates during the debounce period, along these lines (a sketch; Y.mergeUpdates combines the collected updates, and mdb.storeUpdate stands in for your persistence call):
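// sketch: collect every update during the debounce window, then persist them
// as a single merged update on the trailing edge
let pendingUpdates = [];
let timer = null;

ydoc.on('update', (update) => {
  pendingUpdates.push(update); // keep every update; none are dropped
  clearTimeout(timer);
  timer = setTimeout(() => {
    const merged = Y.mergeUpdates(pendingUpdates); // one update equivalent to all of them
    pendingUpdates = [];
    mdb.storeUpdate(docName, merged);
  }, 1000);
});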
@dmonad I am attempting to clear the document history automatically when no user is connected to the document. I know it's not recommended, but it helps with memory issues.
However, it appears that the cloned document is malfunctioning and cannot be properly constructed. Do you have any advice that could help with this issue?
let clonedYDoc = new Doc();
oldYDoc.share.forEach((value, key) => {
  clonedYDoc.share.set(key, value);
});
return clonedYDoc;