My app consists of a list of text pages which a user can edit. I built a simple version of this with a single Yjs document, and each update is stored on my server.
It occurred to me that if a user wanted to delete one of their documents, I can’t delete it from the server, because I just have a list of updates and I don’t know which one corresponds to which page.
If I squashed those all into one update, along with an update that deletes a page, and then ran garbage collection on it, could I be sure the page data was actually removed?
My understanding is that GC only works on shared types, not items within lists, which should be ok because the page mainly consists of a Y.Text object. Maybe even a subdocument.
This potential limitation made me wonder if it is more responsible to store a list of updates for each page as a separate entry in my database. That is, each page has its own list of updates, rather than the entire app state being a single list of updates.
Does anyone have a recommendation for this case? Much appreciated.
You should not filter updates depending on whether you still need them. You must always apply them to the document. Otherwise, you might not be able to apply future updates.
Deletion only marks items as deleted. In many cases we can compress state. When a parent is deleted, we can often compress the state of its children to a few bytes.
If you expect a lot of large documents, you could try separating your “pages” in subdocuments (see documentation - basically separate documents).
You should also not need to store all updates separately. You should simply apply them to a document. You could also do mergedUpdate = Y.mergeUpdates(updates) from time to time. However, Y.mergeUpdates doesn’t garbage collect, so Y.applyUpdate(ydoc, mergedUpdate) should still be performed at regular intervals.
The easiest, most robust solution is to simply apply updates to a Yjs document as you receive them. From time to time you can encode the document to a binary state and store it in a database.
Thank you for the information. I’m still left wondering though - if a user wants to delete something, I’d like to know whether or not I can actually remove that data in a way that it can’t be recovered. In the case of subdocuments, if I merge all updates into a single document, will deleted subdocuments become unrecoverable after garbage collection?
I think the y-indexeddb provider records individual updates, but squashes together any updates past a certain number (like 500). Is there a reason to follow that model rather than merge every incoming update into a single document?
Thanks, your help is much appreciated!
If you delete something, and didn’t disable garbage-collection, then others can’t recover that information. While Yjs preserves metadata, the actual content is always deleted. (If you have an UndoManager, the local state of the client that deleted content will still show the deleted content, but it will be removed in the next session).
Since subdocuments are separate Yjs documents, you can fully delete them if that is something you want to do.
Storing the complete Yjs document on every single keystroke seems wasteful. Iterating through the whole document is not free. Hence we store incremental updates and squash them from time to time which is far more performant and less noticeable.
Ahh ok, beautiful. I think I get it now.
- Deleted stuff always becomes unrecoverable once merged into a single update (if saved as a list of separate updates, you could play those back up to before it was deleted)
- Squashing into a single update on every keystroke is expensive
- Iterating through a long list of updates is also expensive
- So, to balance this: save an ongoing list, and squash it down whenever that list reaches some limit
Is this a good understanding?
I have one other concern, but I’m not sure if it’s valid. If a garbage/corrupt/malicious update is sent to the server and gets merged into the document (or the server attempts to merge it), can that corrupt the whole document? Or does the library reject that update?
Yep, that sounds about right.
If you have a malicious participant that understands the Yjs message format, they can certainly manipulate the document so that other clients lose data. That is why untrusted participants shouldn’t have write access.
If a message causes an exception on another client, then that update won’t be propagated. I can’t guarantee that maliciously designed messages won’t corrupt the document forever. This doesn’t just happen because of a bug: it requires malicious intent and a lot of knowledge of how Yjs manages CRDT state.