Clear document history and reject old updates

Correct.

encodeStateAsUpdate encodes the state into a single binary blob (i.e. an “update”—although it contains the entire history of edits). Due to the binary compression algorithm, this will always be smaller than the live, decoded state. It will also always be smaller than the same history split across multiple binary updates (e.g. if they are stored incrementally).

encodeStateAsUpdate is always safe and does not destroy the history. For that same reason, it can only reduce the size of the document so much. But that’s sort of a given with CRDTs.

Sometimes minimizing the total size by compressing the entire history into a single update is not ideal. encodeStateAsUpdate has the potential to use a lot of memory. For example, retrieving the entire Doc history from the database could be rather slow. In that case, storing updates incrementally would allow a batch read to return just the updates that are needed.


@raine I am using GitHub - MaxNoetzold/y-mongodb-provider: Mongodb database adapter for Yjs as a persistence layer. I faced the issue of stored documents growing past 16 MB (MongoDB’s BSON document size limit). Will storing only the diff reduce the size?

const persistedYdoc = await mdb.getYDoc(docName);
// get the state vector so we can just store the diffs between client and server
const persistedStateVector = Y.encodeStateVector(persistedYdoc);

// better just get the differences and save those:
const diff = Y.encodeStateAsUpdate(ydoc, persistedStateVector);

// store the new data in db (if there is any: empty update is an array of 0s)
if (diff.some((byte) => byte > 0)) {
  mdb.storeUpdate(docName, diff);
}

@toantd90 Storing individual updates (diffs) will take up more space in the database than storing one big update, because each stored update carries its own encoding overhead. The downside of the single big update is that you have to load the entire blob into memory in order to retrieve any data.

You can see a simple example here: View in CodeSandbox.

FYI y-mongodb-provider automatically chunks documents that exceed 16MB.


@raine Thank you very much for your reply. The “automatically chunks documents that exceed 16MB” is super helpful.

I have the code below, which runs after the code above. It is still using encodeStateAsUpdate to applyUpdate to the Yjs doc.

applyUpdate(ydoc, encodeStateAsUpdate(persistedYdoc));

It is unclear to me why we need to save all updates even when there is no change in the doc. Could you please explain it to me?

What I want to do here is reduce the size of the MongoDB collection, as I may not need all the updates.

I even tried debouncing the updates to MongoDB:

ydoc.on(
  'update',
  debounce(
    (update) => {
      mdb.storeUpdate(docName, update);
    },
    CALLBACK_DEBOUNCE_WAIT,
    {
      maxWait: CALLBACK_DEBOUNCE_MAXWAIT,
    }
  )
);

Could you please let me know your thoughts on it?

Consider two approaches to storage when the document changes:

  1. Encode the entire Doc state and replace the existing database blob with the entire compressed update. This can be debounced to avoid churn.
  2. Encode and append a diff of the changes since the last store. This can also be debounced to avoid churn, and it results in smaller incremental writes. This is the approach taken in your code example.

I was merely pointing out that the total size of the data is smaller in (1), while the incremental size of each write is smaller in (2). There is a tradeoff there that one has to factor into one’s architecture.

The obvious synthesis is to use approach (2) for realtime updates, and then periodically compact blocks of incremental updates by re-encoding them into a single update, reducing the overall size as in (1) (as shown in the CodeSandbox demo). This is not standardized, but it can be seen in the implementation of y-indexeddb and some other providers.

That said, this optimization strategy (combining approaches (1) and (2)) will not solve the fundamental problem of too much data, too fast, which can easily result when working within a model that stores the entire history of changes (i.e. the CRDT model itself). I think the technology is still evolving to handle use cases beyond word processors that involve greater volumes of data. For now, balancing these concerns and optimizing storage size and memory usage sits in userland.

This will only store the last update, since debounce discards the arguments of the earlier calls. Dropping updates will result in an invalid Doc.

Instead, you want to accumulate updates during the debounce period, like this:

const storeUpdateThrottled = throttleConcat(
  (updates: Uint8Array[]) => {
    if (updates.length === 0) return
    return db.storeUpdate(docName, Y.mergeUpdates(updates))
  },
  1000,
)

Reference: em/server/ThoughtspaceExtension.ts at 835f840684c8320e8438c77e1287e13ccb5b33db · cybersemics/em · GitHub

This relies on the primitives throttleConcat and throttleReduce. I apologize for the additional levels of abstraction.
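For readers who don’t want to dig through the linked repo, here is a minimal sketch of the idea behind throttleConcat (not the repo’s actual implementation): it buffers the argument of every call and flushes the whole batch once per interval, so no update is ever dropped.

```javascript
// Minimal throttleConcat-style helper (illustrative sketch, not the em repo's code):
// buffers each call's argument and invokes f once per interval with the full batch.
function throttleConcat(f, ms) {
  let buffer = []
  let timer = null
  return (value) => {
    buffer.push(value)
    if (timer === null) {
      timer = setTimeout(() => {
        const batch = buffer
        buffer = []
        timer = null
        f(batch)
      }, ms)
    }
  }
}
```

Unlike debounce, every argument reaches the wrapped function, just grouped into batches.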


@raine, Is there a way to make it work with yMap?

Regarding @jamis0n’s warning: I don’t allow clients to keep data, so it shouldn’t be an issue.

@dmonad I am attempting to clear the document history automatically when no user is connected to the document. I know it’s not recommended, but it’s a helpful feature for memory issues.

However, it appears that the cloned document is malfunctioning and cannot be properly constructed. Do you have any advice that could help with this issue?

  let clonedYDoc = new Doc();

  oldYDoc.share.forEach((value, key) => {
    clonedYDoc.share.set(key, value);
  });

  return clonedYDoc;

A Yjs type can only be integrated once. It can’t be integrated in two different documents.

You need to figure out an algorithm to copy the data to a new Yjs type before integrating it into the cloned doc.