How to use Y.PermanentUserData?

zlv-thisF · July 21, 2020, 2:56am

Hello everyone,

I am building a collaborative editor based on ProseMirror and Yjs, without y-websocket (https://github.com/yjs/y-websocket/blob/master/src/y-websocket.js).

On the server side, I have a Java server that just broadcasts updates from all clients and is in charge of data persistence, and a Node RPC server with some pure functions that generate an encodedState or stateVector from an empty ydoc plus the updates that come from the Java server.

This works nicely except for versions. If I use Y.PermanentUserData on the client side, as the ProseMirror versions demo does, clients cannot sync with each other anymore.

I have read the discussion here: Sync protocol over websockets. It seems the problem is that the whole document is not synced before updates are broadcast, because each client gets its own userMapping once Y.PermanentUserData is added.

So I dug into https://github.com/yjs/yjs-demos/blob/master/prosemirror-versions/prosemirror-versions.js and tried to sync the whole document following syncStep1 and syncStep2. However, I still cannot sync the clients…

Is there any potential risk in my server setup, or is the sharedDoc of the wsProvider on the server side necessary? How can I sync each client’s permanentUserData when the server only broadcasts updates and calls the pure Node server over RPC for data persistence?

dmonad · July 21, 2020, 2:42pm

You are probably not broadcasting the local state to the other clients. If there are missing updates from one client, newer updates can’t be applied. Try broadcasting the local state after an initial connection.
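
To illustrate why one missing update blocks everything after it, here is a minimal sketch in plain JavaScript. It is a simplified model, not the Yjs implementation: it assumes each update carries a `client` id and a `clock`, and that a receiver can only apply the update it expects next from that client. `Replica`, `receive`, and `fullState` are hypothetical names.

```javascript
// Simplified model (an assumption, not Yjs internals): every update is
// { client, clock, text } and the clocks of one client must be applied in order.
class Replica {
  constructor () {
    this.applied = []          // updates applied so far
    this.pending = []          // updates waiting for a missing predecessor
    this.nextClock = new Map() // client id -> next expected clock
  }

  // Queue the update, then apply everything that has become applicable.
  receive (update) {
    this.pending.push(update)
    let progressed = true
    while (progressed) {
      progressed = false
      for (let i = 0; i < this.pending.length; i++) {
        const u = this.pending[i]
        const expected = this.nextClock.get(u.client) || 0
        if (u.clock <= expected) {
          if (u.clock === expected) {
            this.applied.push(u)
            this.nextClock.set(u.client, expected + 1)
          }
          // Applied now or already known (duplicate): remove from the queue.
          this.pending.splice(i, 1)
          progressed = true
          break
        }
      }
    }
  }

  // Everything this replica knows; broadcasting it fills gaps elsewhere.
  fullState () {
    return this.applied.slice()
  }
}

// Client 1 made two edits; a late joiner only hears the second one.
const alice = new Replica()
alice.receive({ client: 1, clock: 0, text: 'a' })
alice.receive({ client: 1, clock: 1, text: 'b' })

const lateJoiner = new Replica()
lateJoiner.receive({ client: 1, clock: 1, text: 'b' }) // parked: clock 0 is missing
const appliedBeforeSync = lateJoiner.applied.length    // 0

// After connecting, alice broadcasts her whole local state once.
alice.fullState().forEach(u => lateJoiner.receive(u))
const appliedAfterSync = lateJoiner.applied.length     // 2
```

With Yjs itself, the full local state would presumably come from `Y.encodeStateAsUpdate(ydoc)` and be applied on the other side with `Y.applyUpdate(ydoc, update)`; how it is transported is up to your server.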

tommoor · October 14, 2020, 6:25pm

Is there any documentation at all on how to use PermanentUserData beyond the addition to Y.Doc above?

How can I parse this data later to determine attribution of changes?

dmonad · October 14, 2020, 7:23pm

I need to communicate this better. Everything that is not documented in the Yjs readme is not stable.

The PermanentUserData feature works, but I plan to improve it and I don’t want to give the impression that I will support this implementation in the future. I think that with the addition of Sets in Yjs we can represent PermanentUserData much more efficiently.

PermanentUserData covers two use cases:

Tracking insertions: Users produce changes using their client_ids (random integers that we use to generate unique identifiers for each change). PermanentUserData associates client_ids with their respective user name. This is currently a mapping from user name to an array of client_ids. We could achieve slight improvements by using a Y.Set instead of a Y.Map.
Tracking deletions: Deletions are not associated with client_ids. We use encoded DeleteSets (ranges of deletions, efficiently encoded) to track deletions. Deletions are tracked by associating a user name with a Y.Array of DeleteSets, one for each deletion (typically just 5-8 bytes, binary encoded). Deletions occur very frequently and we need to make sure that they are stored efficiently. It is currently not ideal in my opinion, as each deletion is then associated with an insertion in the PermanentUserMapping field. This is not too bad, and most other CRDTs represent deletions like this anyway. But we could do much better with the introduction of Y.Sets. DeleteSets are always mergeable and will eventually converge when all created DeleteSets are merged. I want to enable such a feature in Y.Set as well (specifically, the ability to implement state-based CRDTs on top of the Yjs encoding format).
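
As a rough illustration of why delete sets are mergeable and converge, here is a sketch in plain JavaScript. It uses a simplified model rather than the Yjs binary encoding: a delete set is assumed to be an object mapping a client id to a list of `{ clock, len }` ranges, and `mergeDeleteSetsSketch` is a hypothetical helper name.

```javascript
// Simplified model (an assumption, not the Yjs encoding): a delete set is a
// plain object mapping a client id to a list of { clock, len } ranges.
function mergeDeleteSetsSketch (sets) {
  const byClient = new Map()
  for (const ds of sets) {
    for (const [client, ranges] of Object.entries(ds)) {
      if (!byClient.has(client)) byClient.set(client, [])
      byClient.get(client).push(...ranges)
    }
  }
  const merged = {}
  for (const [client, allRanges] of byClient) {
    // Copy the ranges so the inputs stay untouched, then sort by start clock.
    const ranges = allRanges
      .map(r => ({ clock: r.clock, len: r.len }))
      .sort((a, b) => a.clock - b.clock)
    const out = []
    for (const r of ranges) {
      const last = out[out.length - 1]
      if (last && r.clock <= last.clock + last.len) {
        // Overlapping or adjacent: extend the previous range.
        last.len = Math.max(last.len, r.clock + r.len - last.clock)
      } else {
        out.push(r)
      }
    }
    merged[client] = out
  }
  return merged
}

const dsA = { 1: [{ clock: 0, len: 2 }], 2: [{ clock: 5, len: 1 }] }
const dsB = { 1: [{ clock: 2, len: 3 }] }

// Merging in either order yields the same set of deleted ranges.
const mergedAB = mergeDeleteSetsSketch([dsA, dsB])
const mergedBA = mergeDeleteSetsSketch([dsB, dsA])
// mergedAB[1] is [{ clock: 0, len: 5 }], mergedAB[2] is [{ clock: 5, len: 1 }]
```

Because the merge is a union of ranges, it does not matter in which order or how often the sets arrive, which is the state-based CRDT property mentioned above.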

So in short, the feature is working and you can certainly use it right now. I propose that you copy the PermanentUserData implementation if you want to use it now. Future releases won’t break this feature, but I will certainly implement a V2 of the same API.

Now that I’m writing this I think the best approach would be to outsource PermanentUserData to a separate package that is versioned. You can use the v1 version (the current release) right now. I’m already planning a V2 release with improved encoding that you can use in the future. The encodings will be incompatible.

In order to use PermanentUserData, you need to be familiar with the Yjs document model. It basically only tracks client-ids and DeleteSets and associates them with users. If you want to find out who made a particular change (e.g. who deleted a range, or inserted specific content), you search for that meta-information in PermanentUserData.
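
The lookup described above can be sketched in plain JavaScript. This is a simplified model and not the Yjs data layout: `inserts`, `deletes`, `whoInserted`, and `whoDeleted` are hypothetical names, and an item id is assumed to be a `{ client, clock }` pair.

```javascript
// Simplified model (an assumption, not the Yjs data layout):
// inserts: user name -> client ids that user has edited with
// deletes: user name -> list of { client, clock, len } deleted ranges
const inserts = new Map([
  ['alice', [101, 102]],
  ['bob', [201]]
])
const deletes = new Map([
  ['bob', [{ client: 101, clock: 3, len: 2 }]]
])

// The inserting user is the one who owns the client id of the item.
function whoInserted (id) {
  for (const [user, clientIds] of inserts) {
    if (clientIds.includes(id.client)) return user
  }
  return null
}

// The deleting user is the one whose recorded ranges contain the item id.
function whoDeleted (id) {
  for (const [user, ranges] of deletes) {
    const hit = ranges.some(r =>
      r.client === id.client && id.clock >= r.clock && id.clock < r.clock + r.len
    )
    if (hit) return user
  }
  return null
}

const itemId = { client: 101, clock: 4 }
const insertedBy = whoInserted(itemId)                  // 'alice'
const deletedBy = whoDeleted(itemId)                    // 'bob'
const untouched = whoDeleted({ client: 101, clock: 9 }) // null
```

The prosemirror-versions demo appears to perform the equivalent lookups through `getUserByClientId` and `getUserByDeletedId` on the PermanentUserData instance; since those methods are undocumented, treat them as subject to change.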

If you want to understand the Yjs model, I suggest watching the tour through Yjs: https://www.youtube.com/watch?v=0l5XgnQ6rB4. After that it will be much clearer how you can do such a thing. At the moment there is no convenient API to calculate attribution of changes; you’d need to handle that yourself (unless you use y-prosemirror, which already implements this feature).

tommoor · October 15, 2020, 6:24pm

Thanks for that. I think what I was hoping for is exactly the APIs that are missing. For a v2, would you consider tracking timestamps along with edits?

dmonad · October 16, 2020, 10:30pm

Merging data that aggressively is not possible if each update is associated with a timestamp. I would even argue that timestamps are useless in real-world applications. If you want to implement something like a “timeslider” (similar to PiratePad), you could associate a timestamp with each update message and restore the state by applying all updates before a certain point in time. But I recommend associating snapshots (an efficient method to restore document state) with timestamps instead, if needed. Snapshots should be created when a user leaves a session (ending an editing session) and when users merge offline edits (allowing the user to review the merged document). A timeslider (and the whole concept of time) is meaningless when you consider offline editing, which is the main use case I want to solve.
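
The timeslider idea (replaying timestamped updates up to a point in time) can be sketched as follows. This is only an illustration under stated assumptions: `restoreAt` is a hypothetical helper, the log format `{ timestamp, update }` is made up for the example, and the document type is injected through `createDoc` and `applyUpdate` so that the sketch runs without any library.

```javascript
// Hypothetical helper: rebuild a document from all updates recorded at or
// before `time`. `createDoc` and `applyUpdate` are injected so the sketch
// does not depend on any library.
function restoreAt (log, time, createDoc, applyUpdate) {
  const doc = createDoc()
  log
    .filter(entry => entry.timestamp <= time)
    .sort((a, b) => a.timestamp - b.timestamp)
    .forEach(entry => applyUpdate(doc, entry.update))
  return doc
}

// Toy stand-ins for a real document: the "doc" is an array of strings and
// an "update" is one string to append.
const createDoc = () => []
const applyUpdate = (doc, update) => { doc.push(update) }

const log = [
  { timestamp: 1000, update: 'Hello' },
  { timestamp: 2000, update: ' world' },
  { timestamp: 3000, update: '!' }
]

const atT2 = restoreAt(log, 2000, createDoc, applyUpdate).join('') // 'Hello world'
const atT3 = restoreAt(log, 3000, createDoc, applyUpdate).join('') // 'Hello world!'
```

With Yjs, one would presumably pass `() => new Y.Doc()` and `Y.applyUpdate`, with each `update` being the binary payload received from the document’s `update` event. As noted above, the timestamps only reflect when a server saw an update, so the result is of limited meaning once offline edits are involved.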

tommoor · October 18, 2020, 5:58pm

This makes sense, I can understand that limitation.

tommoor · October 19, 2020, 12:28am

In the same vein, what do you think about allowing passing a clientId to new Y.Doc instances? Or perhaps it would be better to use sessionStorage internally to Y.Doc, you could then make sure that all docs created in the same browser tab receive the same clientId – reducing the amount of data churn.

tommoor · October 19, 2020, 11:57pm

I realize now you can’t do this because the clock is reset for each clientID. So, to be clear, is there no concept of time at all in Yjs? I’m trying to determine who was the last client/user to touch a fragment, and as far as I can tell it’s not possible…

It doesn’t have to be perfect, if changes are made offline it could be the order merged.

@dmonad mentioned because I edited this response.

dmonad · October 28, 2020, 2:46pm

You should only reuse a client-id if you are absolutely sure that no one else with the same client-id is producing conflicting changes. If you can assure that, you can set ydoc.clientID = myclientid (wait, don’t do it!). But in most cases, it is a really bad idea with unintentional side-effects. In the best case, you save a little bit of data (please do some tests, the overhead really is insignificant, especially with the v2 encoder). In the worst case, your document gets corrupted with no way to recover.

If you use session-storage, you must ensure that no other window is currently using your client-id (or at least that there are no conflicts). This might be true for most applications that only produce edits when a user opens a window and then performs changes. But at some point in the future, you might associate a timeout to a change and your document gets corrupted.

I don’t recommend it. There is a huge potential for corrupted documents.

Nope, there is no concept of time in Yjs. There are alternative shared editing solutions that use time to order messages, but they have an increased potential for conflicts because computers cannot be expected to have synchronized clocks.

Right, to determine who last touched a fragment you would need to store this information in an additional field.