Y-websocket - re-init from different data source

lucien · September 30, 2020, 1:55pm

Hi!

As part of a new project, we are looking into realtime collaboration, with a goal similar to Figma: being able to collaborate on design documents.

We’ve run some tests with y-websocket (with y-mongodb) and we think this would be a viable solution for us. However, it seems that the yjs transactions collection/table (either in memory or persisted) is the leading source of data and that a document should always be initialized through this data.

We currently persist the actual data inside the ydocument into a separate collection when all connections in a specific room close. Ideally we’d like to make this data the leading source when a new session starts, after which the transactions/updates will become the source again.

So to clarify, we’d like the situation to be like this:

1. user opens design document
2. load “real” data from our separate collection
3. let yjs take over with transactions
4. user closes design document
5. save “real” data (from content of ydoc)
6. clear transactional data
7. repeat when needed
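
A rough sketch of this lifecycle, assuming a flat design document held in a single Y.Map named `design`. The two in-memory maps `realData` and `updateLog`, and the functions `openDocument` and `closeDocument`, are hypothetical stand-ins for the separate collection, the transactions collection, and the server hooks; none of them are part of y-websocket or y-mongodb.

```typescript
import * as Y from 'yjs'

// Hypothetical shape of the "real" data: a flat record of primitive values.
type DesignJson = Record<string, string | number>

// Stand-ins for the two collections (assumptions, not a real persistence layer).
const realData = new Map<string, DesignJson>()
const updateLog = new Map<string, Uint8Array[]>()

// Steps 1-3: seed a fresh doc from the "real" data, then log Yjs updates.
function openDocument(id: string): Y.Doc {
  const doc = new Y.Doc()
  const design = doc.getMap<string | number>('design')
  doc.transact(() => {
    for (const [key, value] of Object.entries(realData.get(id) ?? {})) {
      design.set(key, value)
    }
  })
  updateLog.set(id, [])
  // Attached after seeding, so only edits made during the session are logged.
  doc.on('update', (update: Uint8Array) => {
    updateLog.get(id)?.push(update)
  })
  return doc
}

// Steps 4-6: save the content of the doc, then clear the transactional data.
function closeDocument(id: string, doc: Y.Doc): void {
  realData.set(id, doc.getMap('design').toJSON() as DesignJson)
  updateLog.delete(id)
  doc.destroy()
}
```

In this sketch only the first connection to a room would call `openDocument`, and `closeDocument` would run once the last connection has gone.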

Other users who join would sync over the websocket rather than initialize from the “real” data, I assume?

Would this be possible to do?

canadaduane · September 30, 2020, 2:24pm

Yes, it’s possible, with a couple of caveats that I’m aware of. We do something similar in Relm: when a designer (of a 3D world) wants to truncate the history of a relm, we export a snapshot of the current state of the YDoc and then import it into a new YDoc.

To accomplish this, we needed the ability to “get” the current snapshot of the YDoc, and we did so by exporting a new function, getYDoc, in a custom version of the y-websocket code: https://github.com/relm-us/relm/blob/main/server/yws.js#L51

(I could wrap that up in a PR for y-websocket if it’s valuable to you and @dmonad finds it an acceptable change).

The other piece that’s a little tricky is that Yjs doesn’t keep track of the schema of your data. In other words, you might know that your YDoc consists of a y-array with a bunch of y-maps containing y-text; however, the YDoc itself doesn’t track how that maps to, say, a JSON export. So you’d need to hard-code or otherwise track the schema of the YDoc so that when you import it, you can put all the data you exported into the right Y types.
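
To make the schema point concrete, here is a minimal sketch, assuming a YDoc with one y-array named `layers` whose entries are y-maps with a plain `name` and a y-text `notes`. The names `LayerJson`, `exportLayers`, and `importLayers` are made up for illustration and are not how Relm does it. The export side is generic, but the import side has to know which fields become which Y types.

```typescript
import * as Y from 'yjs'

// Hypothetical JSON shape of one entry. After export, nothing in the JSON
// says that `notes` used to be a Y.Text rather than a plain string.
interface LayerJson {
  name: string
  notes: string
}

// Export needs no schema: toJSON() flattens nested Y types into plain values.
function exportLayers(doc: Y.Doc): LayerJson[] {
  return doc.getArray('layers').toJSON() as LayerJson[]
}

// Import is where the schema is hard-coded: each field is put back
// into the Y type the application expects.
function importLayers(data: LayerJson[]): Y.Doc {
  const doc = new Y.Doc()
  const layers = doc.getArray<Y.Map<unknown>>('layers')
  doc.transact(() => {
    for (const item of data) {
      const layer = new Y.Map<unknown>()
      layer.set('name', item.name)                // stays a plain string
      layer.set('notes', new Y.Text(item.notes))  // becomes collaborative text again
      layers.push([layer])
    }
  })
  return doc
}
```

If `notes` were set as a plain string on import, the document would still export to the same JSON, but concurrent edits to that field would overwrite each other instead of merging.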

dmonad · October 1, 2020, 2:36pm

@lucien Ideally, you store the Yjs document alongside your JSON representation. This will introduce some overhead because you are storing the same data twice, but there are a lot of advantages to keeping the Yjs metadata around:

A client might not realize that it disconnected (in some cases it takes a while before the client notices, e.g. over 3G, Starbucks Wi-Fi, …). You won’t be able to apply its edits after the server document is destroyed.
A nice feature of Yjs is that you can store your data offline using y-indexeddb. This improves load time and ensures that users never lose any data unless the server AND the client lose all their data.
When you introduce the feature that you described, and that @canadaduane implemented, you need to think about more special cases. That is a lot of developer overhead in exchange for losing some essential features.
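
A minimal sketch of storing both representations, assuming the design lives in a Y.Map named `design`. The `StoredDesign` type and the functions `persist`, `restore`, and `lateClientDemo` are hypothetical names for illustration. The demo shows the first advantage: because the Yjs state was kept, an edit made by a client that had not noticed the disconnect can still be merged after the server document was destroyed and restored.

```typescript
import * as Y from 'yjs'

// Hypothetical storage record: the readable JSON plus the full Yjs state.
interface StoredDesign {
  json: unknown
  yjsState: Uint8Array
}

function persist(doc: Y.Doc): StoredDesign {
  return {
    json: doc.getMap('design').toJSON(),
    yjsState: Y.encodeStateAsUpdate(doc),
  }
}

function restore(stored: StoredDesign): Y.Doc {
  const doc = new Y.Doc()
  Y.applyUpdate(doc, stored.yjsState)
  return doc
}

function lateClientDemo(): unknown {
  // The server and a client start out in sync.
  const server = new Y.Doc()
  server.getMap('design').set('title', 'Draft')
  const client = new Y.Doc()
  Y.applyUpdate(client, Y.encodeStateAsUpdate(server))

  // The room empties: the server persists the document and destroys it.
  const stored = persist(server)
  server.destroy()

  // The client has not noticed the disconnect and keeps editing.
  client.getMap('design').set('color', 'red')

  // Later the document is restored and the client sends only what is missing.
  const restored = restore(stored)
  const missing = Y.encodeStateAsUpdate(client, Y.encodeStateVector(restored))
  Y.applyUpdate(restored, missing)
  return restored.getMap('design').toJSON()
}
```

In this sketch the `json` field is only there for other consumers that want a readable copy; the session itself is always restored from `yjsState`.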

Even ShareDB recommends against deleting the history, ever!

github.com/share/sharedb/blob/master/docs/faq.md

# FAQ

## Is it possible to completely delete documents from the db?

No, it is not possible to use the ShareDB API to fully delete data. In addition, the operation log is kept forever by default.

Maintaining persistence of snapshots and ops means that ShareDB can correctly deal with all cases where ops have been removed. Permanently removing the snapshot document could result in a corrupt state in some edge cases by not maintaining the current document version, which must be incremented on each commit. As well, if you delete ops and then a client reconnects needing those ops, you will break that client and it will be unable to submit any pending changes or bring itself up to date from its current state. If your use case calls for complete deletion of operations, you'll need to ensure that no clients will ever need them again or deal with the error appropriately.

You can currently delete from your persistent datastore directly. For example, if you're using MongoDB you can delete the data by connecting to Mongo directly, not via the ShareDB API. If you do delete snapshot data, be sure that you delete not just the document snapshot but all operations associated with that document. Having operations with no corresponding snapshot would result in a corrupt state.

In @canadaduane’s case, it really makes sense to restart the session without any associated metadata. If you have a document that really receives millions upon millions of changes every day (e.g. a game world that allows thousands of users to concurrently move & rotate 3D objects), then you should think about buying into the complexity (and the restrictions!) of restarting sessions. If you are building a collaborative application that only receives a couple of million changes in its entire lifetime, then you don’t need to think about this feature. You can always implement it later. Build it first, and improve later.

@canadaduane I’m hesitant right now to make it part of y-websocket because I don’t want to give the impression that implementing this should be the norm. This feature won’t play nicely with other features I have planned (e.g. autoscaling of y-websocket). But I would appreciate it if you would write a tutorial on how you implemented this feature. Initially I only planned to build collaborative apps, but you built a whole 3D world with Yjs. It would be interesting to hear more about the challenges you ran into and the solutions you came up with.

lucien · October 2, 2020, 9:06am

Thanks both for the detailed responses! We’ll have to give it some more thought and perhaps try to simulate a few environments to see what would work best for us.

In any case, thanks for all the hard work on Yjs.