Appropriate way to load initial data / fallback to current yjs doc data

JoeRoddy · June 10, 2022, 7:21pm

Hi! I’m wondering what the correct workflow and API are for determining whether my user has connected to the room and whether they should be the first user to populate a document with data.

Basically, if this is the first user to arrive in the room, I want to go fetch the data from my database, but if there are other users already there, I want to default to the content that is in the document.

I can check whether there are other users by calling awareness.getStates(), but that is empty while the user is still connecting. Right now I’m just using a setTimeout: after 1000 ms, I check awareness, and if there are other users, the user accepts the data from Yjs; otherwise I load from my database. Obviously this is super janky. A user on a bad network connection might think they’re the only current user, load from the DB, and potentially overwrite any unsaved changes.

Just wondering the right way to do this, thanks!

JoeRoddy · June 10, 2022, 7:32pm

Does the provider expose a callback I can use to do something once the user has established their connection? I’ve looked through the docs and I can’t find anything like that.

dmonad · June 13, 2022, 8:04am

This question gets asked here quite often. My opinion on this is that the creator of the document should populate the Yjs document. You avoid a lot of nasty bugs by simply storing the Yjs document in your database instead of recreating it every time a user loads it.

JoeRoddy · June 13, 2022, 1:23pm

Firstly, thanks so much for the reply and for working on this awesome library!!

My opinion on this is that the creator of the document should populate the Yjs document.

But say there are 2 users able to edit a document, ‘creator’ and an invited user, let’s call him ‘user2’.

If user2 enters the document, how does he know if the creator is even in the document? If the creator isn’t present, user2 must be the one to populate the document from the database, no?

You avoid a lot of nasty bugs by simply storing the Yjs document in your database instead of recreating it every time a user loads it.

I am storing the document to my db, but it’s via quill.getContents() rather than persisting the yjs doc itself. Then when I load, I set the initial content via quill.setContents(). Maybe that’s my issue then?

JoeRoddy · June 13, 2022, 4:15pm

To clarify, this is specifically for webrtc. I’m sure this problem would be easier to solve if I had a server.

artknight · June 16, 2022, 6:04am

I agree with @JoeRoddy that in an all-online scenario the initial data is always available from the DB, so the first user to the document (think of it as a meeting that gets joined by attendees with equal permissions) should be populating the doc with the initial value. All others will just get synced once they join.

@dmonad Btw, storing the doc in the DB is almost pointless b/c all changes are persisting to the DB via the onChange event on the editor.

dmonad · June 16, 2022, 7:30am

To clarify, this is specifically for webrtc. I’m sure this problem would be easier to solve if I had a server.

That’s fine. But I assume that you store your documents somewhere (e.g. in a Postgres database). I propose that instead of storing the text representation (quill.getContents()), you store the Yjs document instead (Y.encodeStateAsUpdate(ydoc)). You can still store the text representation for indexing, but the Yjs document should be the source of truth.

I am storing the document to my db, but it’s via quill.getContents() rather than persisting the yjs doc itself. Then when I load, I set the initial content via quill.setContents(). Maybe that’s my issue then?

Yes, I highly recommend storing the encoded Yjs state (Y.encodeStateAsUpdate(ydoc)) instead. For this approach you don’t need to determine the “first user”. The encoded Yjs document can be loaded as the initial state for all clients that load the document.

In the terminology that I used, the “creator” of a document is the user that initially creates a document (e.g. by clicking a button “create new document”). You want to retain the Yjs history because, as I said, you will avoid a lot of nasty issues that are related to two clients joining simultaneously or one client disconnecting very often.

artknight · June 16, 2022, 2:24pm

@dmonad Ok, I followed your advice and implemented the following. Note that I am still seeing the value getting inserted twice for the second user joining the doc. Please let me know where I messed up.

    getBase64YJSValue(notes){
        let dbdata = this.getDBData(),
            __ydoc = new YJS.Y.Doc(),
            __ytext = __ydoc.getText('codemirror');

        __ytext.insert(0, notes || dbdata.notes);

        return UTILS.bufferToBase64(YJS.Y.encodeStateAsUpdate(__ydoc));
    }

    enableNotesSync(){
        let dbdata = this.getDBData(),
            __ydoc = new YJS.Y.Doc();

        if (dbdata.notes_yjs.length)
            YJS.Y.applyUpdate(__ydoc, UTILS.base64ToBuffer(dbdata.notes_yjs));

        let __provider = new YJS.WebsocketProvider(ws_channel, ws_room, __ydoc),
            __ytext = __ydoc.getText('codemirror');

        let __binding = new YJS.CodemirrorBinding(__ytext, __codemirror, __provider.awareness);
    }

    saveNotes(notes){
        let notes_yjs = this.getBase64YJSValue(notes);

        //ajax call to the server to save the value
        new Ajax(...)...
    }

Please note that the saveNotes method is triggered by the onChange event on the Codemirror editor.

===>>> The line that duplicates the value is

if (dbdata.notes_yjs.length)
	YJS.Y.applyUpdate(__ydoc, UTILS.base64ToBuffer(dbdata.notes_yjs));

It gets run only once when the editor is loaded!

artknight · June 16, 2022, 2:39pm

@dmonad It almost feels like on subsequent page refreshes the Doc retains the value, and then when the editor is opened the value gets appended to the cached value?

Do I need to reset the Doc somehow each time it gets enabled?

JoeRoddy · June 16, 2022, 2:56pm

@dmonad

Thanks so much Kevin!

This solves the content overwrite bug perfectly: if another user joins and they set the content, it handles it appropriately and ignores the update.

I’m still left wondering if there is an easy answer to my original question though? My document is stored in Azure blob storage, so I would preferably like to only read the document once per session for $$ reasons, since the subsequent reads are totally pointless.

Is there any callback event from the presence system that can tell me when I can trust the state is accurate? It initially says it’s just the client user in the room, even if there are actually a ton of users.

It would almost be better if the room state was null until we can trust that the representation of users in the room is accurate. That, or some callback like in this pseudocode:

doc.on('connectionEstablished', () => {
  if (awareness.getStates().size === 1) {
    populateInitialData()
  }
})

If this is not possible bc of limitations with WebRtc or something, I totally understand, and will move forward with a setTimeout being good enough. Just wanted to make sure there’s not a Yjs api I’m missing that will let me achieve this out of the box.

dmonad · June 17, 2022, 6:14am

@artknight The content gets duplicated because you don’t retain the editing history. Whenever you call getBase64YJSValue, you re-populate the Yjs document:

__ytext.insert(0, notes || dbdata.notes);

I explained in my previous comments that you should only initialize the document once and then re-use the generated Yjs document. Think of it like a Git repository. The git repository contains editing traces from the past expressed as commits. If you create a fresh git repository with the same content and then merge with the other git repository that contains other editing traces, then you need to merge all changes. Yjs merges automatically without removing insertions (i.e. duplication in this case).

You said before that you don’t want to store the Yjs document. That’s fine, but then you need to find another way to get around the duplication issue. I really can’t help you then.

@JoeRoddy

I’m still left wondering if there is an easy answer to my original question though? My document is stored in Azure blob storage, so I would preferably like to only read the document once per session for $$ reasons, since the subsequent reads are totally pointless.

Reads should be very inexpensive. However, if you really care about this (you probably shouldn’t), then you can wait a few seconds before awareness populates. Something like:

// I don't recommend using this code!
setTimeout(() => {
  if (ydoc.store.clients.size === 0) {
    // nobody populated the ydoc, time to request content from server
  }
}, 3000)

If this is not possible bc of limitations with WebRtc or something, I totally understand, and will move forward with a setTimeout being good enough. Just wanted to make sure there’s not a Yjs api I’m missing that will let me achieve this out of the box.

No, there is no Yjs API that could make this more efficient. Ideally, every client requests the content from the server, if you have one.

artknight · June 17, 2022, 9:20am

@dmonad
Yes, that is exactly how I solved the issue (forgot to update my post here)! Instead of calling getBase64YJSValue, I am just re-using the ydoc that was created originally. Thank you!

So just to give you some more context here. Normally, the ydoc gets generated when a meeting is created, and that ydoc is stored in the DB and is sent to every user who is connected to this meeting. (That works perfectly now!) There is another use case, however, where the meeting is created through the scheduling page and initially there is no ydoc saved in the DB. I am fixing that state by quickly creating and storing a ydoc in the DB when that meeting gets viewed by the first participant, inserting any notes that were added during external scheduling, with this code:

//this happens when the meeting is created through the scheduling page
if (!dbdata.notes_yjs.length && dbdata.notes.length){
    __provider.on('status', options => {
        if (options.status==='connected'){
            setTimeout(() => {
                if (__ytext.toString()==='')
                    __ytext.insert(0, dbdata.notes);

                this.saveNotes();
            }, 2000);
        }
    });
}

I am not sure if there is a better way to handle this edge case. I do not really like using a timeout as it is not 100% reliable!

dmonad · June 23, 2022, 12:37pm

Personally, I’d simply let the server initialize the note if there is no existing content in the database.

Even better, create the Yjs document (with the initial content, if there is any) when a user clicks the “new document” button.

The use of a timeout feels very error-prone for this…

artknight · June 23, 2022, 2:09pm

I totally agree, however, when the conference is created (at that particular end-point) the initial content already exists, so saving a default template is not really an option b/c the initial content would not be included.

Is there a way to generate the template with the initial content on the server-side? (java, node, etc…)

dmonad · June 29, 2022, 10:35am

Yes, you can create the content on the server. Yjs also works in node. y-crdt (on GitHub) is a compatible framework that works in Rust, Python, Ruby, WASM, …

janostik · August 24, 2022, 10:47am

There is one more use case which is related to this topic of initial data load. We have an existing system where we stored rich text data, meaning users could come in and create a document, and we stored it in the SQL database (HTML / Markdown / …).

If we now want to introduce collaborative editing, we’ll run into similar issues. We already have content which exists, and if multiple users access this document, these issues need to get resolved. Based on this discussion (and others…), I gather we’ll need to update the y-websocket server to check whether the document exists in LevelDB (or some other persistence layer) and, if not, fetch it from our server and convert it into a Yjs update (which should be a better solution than initialising the docs on the client).

If the above is correct, I have couple of questions…

Could there be a concern about race conditions? Say 2 users open the document at the same time. The one with the fast connection starts making changes, and the second user then overrides those with his initial load?

When/where should we call the function that stores a rendered snapshot of the data? The y-websocket server with its persistence will be the single source of truth for the documents that are being edited. But at some point we need these documents to be indexed into Elasticsearch or our database. To do this we need the ProseMirror state. Is it enough for clients to periodically send rendered content to our server? Or should this be done on the y-websocket side? What would be the best event I can hook into, and is there a function that I can use to convert into ProseMirror state within Node?

dmonad · August 25, 2022, 5:52pm

Yes, the above approach looks correct.

Good point. A document must only be initialized once. There are a few approaches to ensure that only a single entity may initialize a document.

Option 1: Use consistent hashing to define which server should be responsible for initializing a certain document (e.g. by GUID): listOfServerIps[hash(GUID) % listOfServerIps.length]. Note that this approach is pretty complicated to implement correctly (especially if you want to support dynamic scaling).

Option 2: Rely on a locking approach. There is, for example, the redis-redlock approach. Personally, I’m not a big fan of locks in practice. They seem expensive and can also fail. However, if this is only used ONCE to initialize a document because you need to migrate somehow, then I think this might be the best/easiest approach.
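
The formula in Option 1 can be sketched in plain JavaScript as follows. The FNV-1a-style string hash is an arbitrary choice, and the `responsibleServer` helper, the server IPs, and the document GUID are all made up for illustration.

```javascript
// FNV-1a-style 32-bit string hash (non-cryptographic, an arbitrary choice here).
function hash (str) {
  let h = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i)
    // Multiply by the FNV prime and keep the result as an unsigned 32-bit integer.
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h
}

// Hypothetical helper: pick the one server that may initialize a given document.
function responsibleServer (guid, listOfServerIps) {
  return listOfServerIps[hash(guid) % listOfServerIps.length]
}

const servers = ['10.0.0.1', '10.0.0.2', '10.0.0.3']
const owner = responsibleServer('doc-1234', servers)
console.log(owner)
```

Every server computes the same owner for a given GUID as long as they all agree on the server list. Adding or removing a server changes the assignment for most GUIDs, which is part of why dynamic scaling makes this hard to get right.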

Either approach seems fine. You could try to polyfill the required DOM features in your Node.js server. This is what I do to run y-prosemirror tests in Node.js.