Clear document history and reject old updates

We’re using Yjs + Tiptap + Hocuspocus for our document editor, and every time Hocuspocus detects that all websocket connections have dropped, we save the current Y.Doc binary to our database (and load it on cold start for the next connecting client).

This is working great! However, we’ve recently seen long-running documents hit 5 MB+ Y.Doc sizes when their content is only around 75 KB.

I’d like to be able to clear out old history on cold start (i.e. when loading a document that hasn’t been edited for a while). I know this can be achieved by creating a brand new Y.Doc from the original content without the history.

The challenge with this approach is that it’s possible a disconnected client still exists with old in-memory state containing the document updates that have now been deleted; when that state is sent to the server, the deleted content will be re-added to the doc.

Is there any way to instruct the server-side document to reject updates from before a certain time? Or, if each update has a clock value, to reject updates before that clock value?

Also open to other approaches for keeping the document size down for long-lived docs; this was the first one I thought of.

Thanks in advance!


We spoke to Kevin on this topic yesterday and here is my summary:

Documents getting really large is most likely due to code that produces unnecessary operations. It could be an issue in an editor binding or in your application code. This was the case for us. With “normal” operations Yjs is very efficient and optimized. We identified two approaches to pruning documents to remove old unnecessary history.

Identifying the cause of the excessively large documents

This can be done by inspecting the update: Y.logUpdate(Y.encodeStateAsUpdate(yDoc)). It’s perhaps not trivial to read these messages, but we got the hang of it and we managed to identify a bunch of unnecessary operations in our case. Fixing them will mitigate most of the issues that we are seeing with our large YDocs.
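
For reference, a minimal way to run that inspection on a stored document (storedBinary is a placeholder for however you load the persisted Y.Doc update):

import * as Y from 'yjs'

// Rebuild the document from the persisted binary (storedBinary is a placeholder)
const ydoc = new Y.Doc()
Y.applyUpdate(ydoc, storedBinary)

// Print a readable breakdown of every struct in the full document update
Y.logUpdate(Y.encodeStateAsUpdate(ydoc))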

Prune documents: Approach A: Isolated “sessions”

Initialize a new Y.Doc from a JSON snapshot and give it a new documentName (perhaps ${documentId}:${sessionId}). This new Y.Doc will have no history, so it will be as small as possible. Then keep track of the active session id in your system. Make sure that new connections always connect to the active session id and that existing connections are told to connect to the new session. Edits to old sessions should be refused.
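
A rough sketch of the bookkeeping (documentId, applySnapshot, loadSnapshotJson and saveActiveSession are placeholders for your own code, not real APIs):

import * as Y from 'yjs'
import { randomUUID } from 'crypto'

// Mint a new session and derive the Hocuspocus document name from it
const sessionId = randomUUID()
const sessionDocumentName = `${documentId}:${sessionId}`

// Build a fresh Y.Doc from a JSON snapshot of the old content.
// applySnapshot stands in for whatever maps your JSON back into shared types.
const freshDoc = new Y.Doc()
applySnapshot(freshDoc, loadSnapshotJson(documentId))

// Record the active session so new connections are pointed at sessionDocumentName
// and edits addressed to older session names can be refused.
saveActiveSession(documentId, sessionId)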

Prune documents: Approach B: Clearing YMap keys

If you have a root YMap in which you put all your data, you can completely delete and reinitialize its keys; everything under that level will be garbage-collected fairly efficiently, though not 100%, because tombstones are retained. This is perhaps simpler than Approach A since it doesn’t require reloading the document and keeping track of session ids.
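
A minimal sketch, assuming a root Y.Map and a placeholder helper (buildContentFromJson) that rebuilds the shared type from plain JSON:

// Take a plain JSON copy, delete the key (its old subtree becomes tombstones),
// and reinitialize it from the copy inside a single transaction.
const root = ydoc.getMap('root')
ydoc.transact(() => {
  const plain = root.get('content').toJSON()
  root.delete('content')
  root.set('content', buildContentFromJson(plain)) // placeholder helper
})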

Additional notes

Q: Why cannot some unnecessary operations be optimized away?
A: Y.Map doesn’t make use of Yjs’ optimizations if you write key-value entries in alternating order. Always writing the same entry doesn’t significantly increase the size of the document, but writing key1, then key2, then key1, then key2 (alternating order) breaks Yjs’ optimization. As a consequence, Kevin has started exploring a more optimized “YKeyValue” type, similar to a YMap; it’s still early and not yet feature complete: GitHub - yjs/y-utility: Utility features for Yjs. I think it will be very interesting to follow its development.
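
To illustrate the pattern with a toy example (not taken from a real editor binding):

import * as Y from 'yjs'

const doc = new Y.Doc()
const ymap = doc.getMap('state')

for (let i = 0; i < 10000; i++) {
  // Repeatedly writing a single key stays compact, but alternating between
  // key1 and key2 like this defeats Yjs' run-length optimization.
  ymap.set('key1', i)
  ymap.set('key2', i)
}

// The encoded document is far larger than the two live entries would suggest
console.log(Y.encodeStateAsUpdate(doc).byteLength)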


I’m using this to prune history: implement clone for y documents by filipot · Pull Request #354 · yjs/yjs · GitHub
It has worked really well for me. It clones all shared types into a new Y.Doc (though here you might want to edit the existing Y.Doc).
A positive is that it works for all shared types and not just YMap (I think).

@philip I’m trying your approach with a Y.Doc using y-prosemirror (where the root node is a yxmlFragment, e.g. doc.getXmlFragment('prosemirror')).

It doesn’t seem to work as expected (the duplicated prosemirror key is empty).

Any ideas?

I made a prosemirror repl for you, where it works: Testing y-prosemirror and cloneDoc • REPL • Svelte

(Also, I’m using Tiptap and it’s working for me there too, but I think I had to do .getXmlFragment('default') or something.)

@philip - I’ve implemented your approach in Hocuspocus (the Tiptap collaborative backend), and while the cloned doc approach does work, if an old client reconnects with its old state, that old state overwrites the cloned state.

@dmonad - Do you have any guidance for y-prosemirror users to compact/delete the history from the root yxmlFragment (doc.getXmlFragment('default')) in a way that would be safe for older clients to reconnect and apply that compression to their doc?


I don’t understand this. If existing connections switch to the new doc, their in-memory state will merge into the new session, so the new session will blow up just like the old one.

In that case, existing connections need to discard their old document and get the new document from the provider. Ideally, session switching doesn’t happen while there are online clients.


Ideally, session switching doesn’t happen while there are online clients.

So the case I’m specifically thinking about, for example, is someone closing their laptop and then opening it again a long time later (after we’ve compressed the doc).

The server sees a connecting client and must do one of two things:

  • Accept any updates sent by that client and relay new ones to them correctly (about the compressed doc)
    OR
  • Reject any updates sent by that client and force them to reconnect with a clean state

For practical purposes I would go with the second option. This implies that the session approach is not suitable for use cases that allow offline edits.


I agree that forcing a client to clean its state is the ideal approach.

However, I’m not sure how to accomplish that on the server/Hocuspocus side.

I would probably just store the active session id in a normal database (perhaps as a column on the table where you store the ydoc byte arrays) and then have the rest of your system read this. Depending on how frequently you need to update it, you could just poll or do something more clever. E.g. if using Redis, you could keep the active session id there.
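
As a rough sketch of how that lookup could plug into the Hocuspocus server hooks (getActiveSessionId is a placeholder for your database/Redis read, and the exact hook signature may differ between Hocuspocus versions):

import { Server } from '@hocuspocus/server'

const server = Server.configure({
  async onConnect({ documentName }) {
    // Assuming Approach A's naming scheme: documentName is `${documentId}:${sessionId}`
    const [documentId, sessionId] = documentName.split(':')
    const activeSessionId = await getActiveSessionId(documentId) // placeholder lookup

    // Throwing rejects the connection, so clients holding an old session
    // never get to sync their outdated state into the compacted doc.
    if (sessionId !== activeSessionId) {
      throw new Error('outdated session')
    }
  },
})

server.listen()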

This added complexity is why I don’t like this approach


This added complexity is why I don’t like this approach

100% agree here. I just don’t know of any other way to handle this with Hocuspocus+Tiptap connected clients that may go offline for extended periods of time.

Even if you code your frontend client to be defensive and refresh itself after a certain amount of time, it feels risky/brittle to rely on clients to not send “bad”/“outdated” updates to the server.

A versioning API built into YJS could solve this problem as well as other “outdated” client issues like schema changes.

const doc = new Y.Doc();
Y.applyUpdate(doc, Y.encodeStateAsUpdate(<doc from database, compressed>))

// Increment the version number to disallow outdated client updates
doc.setVersion(doc.version + 1);

I don’t think it makes sense for Yjs to add support for this. You should handle this in application code.


The application code in this case is the Hocuspocus message receiver: https://github.com/ueberdosis/hocuspocus/blob/main/packages/server/src/MessageReceiver.ts#L111-L113

The check we want to do on the incoming message then feels very similar to the readOnly check, except we want to identify the update as “outdated”.

I’m going to look into what information can be gleaned from the update to determine that (either an update number or a timestamp).
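
For anyone following along: yjs’ update API can at least report which clocks an incoming update covers. A rough sketch, assuming Y.parseUpdateMeta (which returns { from, to } maps of clientID -> clock) and a placeholder compactionClocks map recorded when the doc was compacted:

import * as Y from 'yjs'

// meta.from / meta.to map each clientID to the clock range the update covers
const meta = Y.parseUpdateMeta(update)

let outdated = false
for (const [clientId, fromClock] of meta.from) {
  const recordedClock = compactionClocks.get(clientId) ?? 0
  if (fromClock < recordedClock) {
    // The update references state that predates the compaction,
    // so it could be flagged as outdated and dropped.
    outdated = true
  }
}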

How can I do this? I don’t need to consider offline editing. I want to reinitialize the document when it reaches a specified size, without keeping any history information.

I think you can do this with the following code:

import { Doc, applyUpdate, encodeStateAsUpdate } from 'yjs'

const ydocWithoutHistory = new Doc()
applyUpdate(ydocWithoutHistory, encodeStateAsUpdate(sourceYdoc))

@sumbad encodeStateAsUpdate still includes the entire history (just in a single binary update).

To create a new history, you have to convert to raw delta or json and then set it on the new Doc. This will vary depending on your shared type (including nested shared types), but clearing the history on a basic YText would look like this:

Demo: View in CodeSandbox

import * as Y from 'yjs'

// simulate a YText with some history
const doc = new Y.Doc()
const ytext = doc.getText()
ytext.insert(0, 'abc') // insert 'abc'
ytext.format(1, 2, { bold: true }) // format 'bc'
ytext.delete(2, 1) // delete 'c'
console.log('ytext', ytext.toJSON())

// get the raw state as a delta with no history
const delta = ytext.toDelta()

// duplicate YText with no history
const doc2 = new Y.Doc()
const ytext2 = doc2.getText()
ytext2.applyDelta(delta)
console.log('ytext2', ytext2.toJSON()) // 'ab'

However, please keep in mind @jamis0n’s original warning:

There can always be offline clients, so you can’t safely wipe the history (just like you can’t safely do a hard reset in a git repo after the history has been shared).


@raine, thank you very much for explaining! I tested the encodeStateAsUpdate function today and got a result smaller than the original document with history, but larger than a document without history (i.e. one built fresh from the content). This confirms your statements. So the encodeStateAsUpdate approach is more about compressing history, is that right?

And one more question, returning to @jamis0n’s warning: is encodeStateAsUpdate more suitable when we just want to reduce a document’s size, since it still retains the history info?