Distributed offline editing with couch/pouchdb

samw · January 4, 2021, 10:35am

Hi,

Yjs looks super exciting for an (early, experimental) project I’m working on. I’m building a distributed, offline-enabled editor with ProseMirror, using CouchDB/PouchDB as the datastore. Has anyone experimented with using Yjs with PouchDB as both the datastore and the communication channel?

samw · January 6, 2021, 8:45am

So I’m starting to build a PouchDB provider and plug-in for Yjs (I’m using the IndexedDB one as a starting point). My plan is that a Y.Doc will manage the whole Pouch document and be included in it as a binary attachment. A top-level Y.Map is then also exported to JSON at save time to build the Pouch document, which can be queried and indexed through the standard PouchDB API. So the user never changes the Pouch document directly, only the Y.Doc, and with that we get conflict-free merges in PouchDB.
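A minimal sketch of that document layout, assuming the top-level Y.Map is named 'data'. The `PouchDocSketch` shape, the `ydocState` field, and the `buildPouchDoc` helper are all hypothetical. The sketch only builds a plain object; PouchDB expects attachment data as base64 or a Blob/Buffer, so the raw bytes would still need converting before they are written.

```typescript
import * as Y from 'yjs'

// Hypothetical shape of the stored document: queryable JSON fields plus the
// full Yjs state kept as binary.
interface PouchDocSketch {
  _id: string
  data: Record<string, unknown> // JSON export of the top-level Y.Map
  ydocState: Uint8Array // would become a binary attachment
}

function buildPouchDoc (id: string, ydoc: Y.Doc): PouchDocSketch {
  return {
    _id: id,
    data: ydoc.getMap('data').toJSON(),
    ydocState: Y.encodeStateAsUpdate(ydoc)
  }
}

// Edits go through the Y.Doc only; the JSON is derived at save time.
const ydoc = new Y.Doc()
ydoc.getMap('data').set('title', 'Meeting notes')
const pouchDoc = buildPouchDoc('doc-1', ydoc)

// Loading is the reverse: apply the stored state to a fresh Y.Doc.
const restored = new Y.Doc()
Y.applyUpdate(restored, pouchDoc.ydocState)
```

Because the JSON is always regenerated from the Y.Doc, it can be treated as a read-only index of the binary state.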

When fetching a document from Pouch we also fetch all conflicting versions and merge their Y.Docs. Then we watch the PouchDB changes feed and continue to merge in changes as they arrive.
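The merge step could look roughly like the sketch below. It assumes each revision (the winner and its conflicts) carries a full Yjs state as bytes; fetching those revisions from PouchDB is left out, and `mergeStates` is a hypothetical helper name.

```typescript
import * as Y from 'yjs'

// Merge any number of stored Y.Doc states into one document.
// The order in which the states are applied does not change the result.
function mergeStates (states: Uint8Array[]): Y.Doc {
  const merged = new Y.Doc()
  for (const state of states) {
    Y.applyUpdate(merged, state)
  }
  return merged
}

// Simulate two replicas that start from a shared state and then edit offline.
const base = new Y.Doc()
base.getMap('data').set('title', 'Draft')
const baseState = Y.encodeStateAsUpdate(base)

const replicaA = new Y.Doc()
Y.applyUpdate(replicaA, baseState)
replicaA.getMap('data').set('author', 'sam')

const replicaB = new Y.Doc()
Y.applyUpdate(replicaB, baseState)
replicaB.getMap('data').set('status', 'review')

// Both offline edits survive the merge.
const mergedDoc = mergeStates([
  Y.encodeStateAsUpdate(replicaA),
  Y.encodeStateAsUpdate(replicaB)
])
const mergedJson = mergedDoc.getMap('data').toJSON()
```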

This method obviously only merges whole document versions, which is great for offline editing but not for real-time collaboration, where incremental updates are better. Combining it with the websocket or WebRTC provider for real-time collaboration seems like it would work well. PouchDB revision keys are deterministic hashes of the document, so if you combine a PouchDB provider with a websocket provider to merge in incremental updates, the resulting document will have the same revision number, fully sidestepping the PouchDB/CouchDB sync (this is also important for the normal PouchDB sync). We would probably want to ‘pause’ or slow down the save to PouchDB while a real-time collaborative provider is in use, to avoid too many full-document syncs in the background.

I’m also considering an option for a pre-save extractor function. This would allow you to extract additional data (say, from the y-prosemirror Y.XmlFragment) and add it to the main Pouch JSON doc for indexing and searching.

I intend to open-source and contribute back anything that I can get working on this.

This plan should work well when you have a document open, but the area where I have a question is when documents are not open. PouchDB will continue to sync in the background (while the app is open), and we can easily watch the changes feed for new version conflicts in unopened documents. When using this with ProseMirror, do we need to have the document open in ProseMirror (with access to a browser DOM) when doing a merge, or can we just naively merge the Y.Docs? Would this cause a problem with the ProseMirror schema?

Anyway, I also just wanted to say Yjs is amazing. Having spent the last couple of days reading up on everything, I’m seriously impressed with what Kevin has achieved!

dmonad · January 6, 2021, 1:29pm

Thanks @samw!

That would be really neat! It combines the advantages of Yjs with the indexing of a proper database.

You can just merge the Y.Docs. I also want to point you to “Differential updates” (Issue #263, yjs/yjs on GitHub), which will allow you to merge document updates without loading a Yjs document, i.e. Y.mergeUpdates([update1, update2, yjsDocState1, yjsDocRemoteState2, ..])
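Y.mergeUpdates is available in later Yjs releases. A small sketch of how it could be used for background merges follows; the two documents stand in for stored revisions, and the final Y.Doc exists only to check the result.

```typescript
import * as Y from 'yjs'

// Two independent states, standing in for two stored revisions.
const docA = new Y.Doc()
docA.getText('body').insert(0, 'hello')

const docB = new Y.Doc()
docB.getMap('data').set('title', 'Draft')

// Merge the binary states directly, without instantiating a Y.Doc.
const mergedUpdate: Uint8Array = Y.mergeUpdates([
  Y.encodeStateAsUpdate(docA),
  Y.encodeStateAsUpdate(docB)
])

// Only for verification: load the merged update into a document.
const check = new Y.Doc()
Y.applyUpdate(check, mergedUpdate)
```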

Regarding schema conflicts: when you load an invalid document with y-prosemirror (e.g. it has two headlines instead of the one specified in the schema), y-prosemirror will automatically correct the Yjs document, usually by removing the invalid node.

Awesome! I’m looking forward to hearing about your progress.

samw · January 8, 2021, 12:04pm

Hi Kevin,

Thanks for getting back to me. I’m making good progress: I have a simple version running (using TipTap v2, so I need to check whether I can show it yet) that works with open “foreground” documents for both real-time and offline (conflicting) edits. The next job is a background conflict manager to merge conflicting edits in documents that are not actively open at the time (for example, when you go back online after editing multiple documents).

Y.mergeUpdates looks like a useful optimisation, although it will only apply if you are not extracting any values from the Y.Doc into the PouchDB JSON, as that needs to happen after every update.

I have a question on ‘origins’. I’m trying to filter updates so that only local updates and conflict merges are saved back to PouchDB. If I don’t, and you have a document open from a number of synced user DBs at once, you create a cascade of database updates as each user receives and then saves the same update. So far I have done this by setting the origin to ‘remote’ when doing an applyUpdate, and filtering on that in doc.on('update'). Is this the correct way to do it? Is there a standard origin naming convention?

(As an aside, with the undoManager you can include origins in trackedOrigins but how do you exclude only remote changes?)

As I said before, this works well, but when multiple people are actively working on a document at once they should probably use the Websocket or WebRTC provider, as it is more efficient and has awareness support (which I don’t think should be built into a PouchDB provider). One thought I had: it would be useful if the awareness protocol could nominate a ‘host’ or ‘primary user’ who is responsible for saving updates back to the database, reducing the number of database transactions and syncs. Has anyone tried this before?

Anyway, hope to have something to show soon.

dmonad · January 9, 2021, 3:11pm

Right, although it would make sense to debounce the value extraction, as it is potentially an expensive task to perform on every keystroke.
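One way to debounce the extraction is sketched below. The 500 ms delay and the `extractAndSave` stand-in are assumptions for illustration, and the scheduler is injectable only so that the timing can be tested without real timers.

```typescript
// A cancel handle and a scheduler; the default scheduler uses setTimeout.
type Cancel = () => void
type Schedule = (fn: () => void, ms: number) => Cancel

const timeoutSchedule: Schedule = (fn, ms) => {
  const id = setTimeout(fn, ms)
  return () => clearTimeout(id)
}

// Returns a wrapper that runs `fn` once, `waitMs` after the most recent call.
function debounce<A extends unknown[]> (
  fn: (...args: A) => void,
  waitMs: number,
  schedule: Schedule = timeoutSchedule
): (...args: A) => void {
  let cancel: Cancel | null = null
  return (...args: A) => {
    if (cancel !== null) cancel() // drop the previously scheduled run
    cancel = schedule(() => {
      cancel = null
      fn(...args)
    }, waitMs)
  }
}

// Hypothetical stand-in for "export the Y.Map to JSON and write it to PouchDB".
const saves: string[] = []
const extractAndSave = (reason: string): void => { saves.push(reason) }

// In a provider, scheduleSave would be called from the ydoc 'update' handler.
const scheduleSave = debounce(extractAndSave, 500)
```

With this in place a burst of keystrokes results in a single extraction after typing pauses.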

There is no naming convention. I actually prefer to set the provider object as the origin. You can potentially have two providers accessing the same document. In this case, you want to know if “this” provider performed the change or any other object.

For generic information about whether a transaction was created remotely, you can create a remote transaction.

// the fourth parameter of Y.transact is the `local` flag;
// passing false marks the transaction as remote
Y.transact(ydoc, () => {
  Y.applyUpdate(ydoc, update)
}, provider, false)

Then you can check whether an update was created remotely:

// as of yjs@13.4.12 the transaction is passed as the fourth argument
ydoc.on('update', (update, origin, doc, transaction) => {
  !transaction.local // => true iff update was created remotely
})

Note that ydoc.transact doesn’t have a third parameter.

I recommend marking transactions as remote when the update was created remotely; this is useful meta-information. But for filtering updates (so you don’t store the same update again when you receive it from Pouch), I recommend setting origin = pouchProvider and then performing an identity check, origin === this.provider, when you decide whether to store the update.
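A sketch of that identity check, with a hypothetical `PouchProviderSketch` class whose `persisted` array stands in for writes to PouchDB. It also records the transaction's `local` flag, which in current Yjs is false for updates applied through Y.applyUpdate.

```typescript
import * as Y from 'yjs'

// Hypothetical provider skeleton; `persisted` stands in for database writes.
class PouchProviderSketch {
  ydoc: Y.Doc
  persisted: Uint8Array[] = []

  constructor (ydoc: Y.Doc) {
    this.ydoc = ydoc
    ydoc.on('update', (update: Uint8Array, origin: unknown) => {
      // Identity check: skip updates this provider applied itself, since
      // they came from the database and are already stored there.
      if (origin === this) return
      this.persisted.push(update)
    })
  }

  // Called with data read from the database, e.g. via the changes feed.
  applyFromDatabase (update: Uint8Array): void {
    Y.applyUpdate(this.ydoc, update, this)
  }
}

const localDoc = new Y.Doc()
const provider = new PouchProviderSketch(localDoc)

// Record the local flag of every transaction that produced an update.
const localFlags: boolean[] = []
localDoc.on('update', (_update: Uint8Array, _origin: unknown, _doc: Y.Doc, transaction: Y.Transaction) => {
  localFlags.push(transaction.local)
})

// A local edit: persisted, transaction.local is true.
localDoc.getText('body').insert(0, 'typed locally')

// An update coming from the database: applied but not persisted again.
const remoteDoc = new Y.Doc()
remoteDoc.getText('body').insert(0, 'from the database ')
provider.applyFromDatabase(Y.encodeStateAsUpdate(remoteDoc))
```

Using the provider instance as the origin keeps the check unambiguous even when several providers share one document.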

I recommend whitelisting the origins that you want to track instead of tracking everything that was not created remotely. You might be able to filter on the transaction’s local/remote flag, although I don’t recommend that.
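For the undo manager, whitelisting could look like the sketch below. The two origin objects are hypothetical placeholders for the local editor binding and the PouchDB provider.

```typescript
import * as Y from 'yjs'

const doc = new Y.Doc()
const text = doc.getText('body')

// Hypothetical origin objects: one for the local editor, one for the provider.
const localEditor = { name: 'local-editor' }
const pouchProvider = { name: 'pouch-provider' }

// Only transactions whose origin is in this set end up on the undo stack.
const undoManager = new Y.UndoManager(text, {
  trackedOrigins: new Set<unknown>([localEditor])
})

// A local edit, tagged with the tracked origin.
doc.transact(() => {
  text.insert(0, 'local')
}, localEditor)

// A remote edit, applied with the provider as origin, so it is not tracked.
const remote = new Y.Doc()
remote.getText('body').insert(0, 'remote')
Y.applyUpdate(doc, Y.encodeStateAsUpdate(remote), pouchProvider)

// Reverts only the local insert; the remote text stays in the document.
undoManager.undo()
```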

Not that I know of. The awareness protocol has only weak guarantees that the clients can elect a “primary user”. You probably want to use something like etcd for electing a primary user. If possible, you should avoid this concept.

Nice

samw · January 9, 2021, 4:32pm

That’s brilliant, thanks for the guidance.

I had missed the remote flag on Y.transact.

Edit:

You can ignore the below; I think I can do it by using the provider as the origin in combination with the remote flag. For a conflict rev, mark the transaction as local rather than remote.

========

The problem I’m facing with using the provider as the origin is that with PouchDB there are two origins we want to keep track of. The first is a standard revision. It is already persisted to the db and could come from ourselves, in which case we ignore it completely (we don’t reapply it to the Y.Doc, which is easy by tracking the PouchDB _rev), or from another node/user, in which case we update the Y.Doc. However, we don’t want to save it back to the db as it’s already there, so we filter it out in the update handler based on origin.

We then also have conflict revisions, these are created by pouchdb/couchdb when two nodes sync, commonly after one has been offline. This is where combining PouchDB with Yjs really shines, as we can just naively update the doc with all conflicting revisions. However this time we do want to save back to the db in order to ‘resolve’ the conflict.

So far I have done this with two string origins, ‘pouchdb-rev’ (not saved) and ‘pouchdb-conflict’ (saved). Could these just be static properties on the provider?
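One possible shape for this, purely as an illustration: the class name, the use of symbols instead of strings, and the `shouldPersist` helper are all assumptions rather than anything from the actual provider.

```typescript
// Hypothetical origin markers kept as static properties.
class PouchOrigins {
  // An ordinary revision that is already in the database: apply, do not save.
  static readonly revision = Symbol('pouchdb-rev')
  // A merged conflict revision: apply and save, to resolve the conflict.
  static readonly conflict = Symbol('pouchdb-conflict')
}

// Decide in the update handler whether an update should be written back.
function shouldPersist (origin: unknown): boolean {
  return origin !== PouchOrigins.revision
}

const persistRevision = shouldPersist(PouchOrigins.revision) // false
const persistConflict = shouldPersist(PouchOrigins.conflict) // true
const persistLocalEdit = shouldPersist(null) // true, local edits have no origin here
```

Symbols avoid accidental clashes with origin strings used by other providers, at the cost of not being serialisable.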

samw · January 9, 2021, 5:35pm

It doesn’t look like the local flag is passed to the update handler:

https://github.com/yjs/yjs/blob/97d97147105043de8232f71f80447eba920347cb/src/utils/Transaction.js#L343

if (!transaction.local && transaction.afterState.get(doc.clientID) !== transaction.beforeState.get(doc.clientID)) {
  doc.clientID = generateNewClientId()
  logging.print(logging.ORANGE, logging.BOLD, '[yjs] ', logging.UNBOLD, logging.RED, 'Changed the client-id because another client seems to be using it.')
}
// @todo Merge all the transactions into one and provide send the data as a single update message
doc.emit('afterTransactionCleanup', [transaction, doc])
if (doc._observers.has('update')) {
  const encoder = new DefaultUpdateEncoder()
  const hasContent = writeUpdateMessageFromTransaction(encoder, transaction)
  if (hasContent) {
    doc.emit('update', [encoder.toUint8Array(), transaction.origin, doc])
  }
}
if (doc._observers.has('updateV2')) {
  const encoder = new UpdateEncoderV2()
  const hasContent = writeUpdateMessageFromTransaction(encoder, transaction)
  if (hasContent) {
    doc.emit('updateV2', [encoder.toUint8Array(), transaction.origin, doc])
  }
}
transaction.subdocsAdded.forEach(subdoc => doc.subdocs.add(subdoc))

dmonad · January 10, 2021, 11:36am

Right. I added the transaction as the fourth parameter to the event handler in yjs@13.4.12

samw · January 14, 2021, 10:00pm

Hi!

So this is what I have so far. It’s missing centralised conflict handling for documents that are not currently open. That’s next, but I need to work on some other parts of my project for a bit.

https://gist.github.com/samwillis/1465da23194d1ad480a5548458864077

y-pouchdb.ts

import * as Y from 'yjs'
import * as mutex from 'lib0/mutex.js'
import { Observable } from 'lib0/observable.js'
import PouchDB from 'pouchdb';


// This is the name of the top level Y.Map that is used to construct the main pouchDB 
// JSON document.
const topDataYMapName = 'data';

This file has been truncated.

hanspagel · February 23, 2021, 11:30pm

Just came here by accident, but felt the need to chime in quickly.

Impressive work, Sam! And don’t hesitate to share your tiptap v2 related code here (or anywhere else).