What is the correct way to apply document migrations?

Gin-Quin · December 20, 2023, 11:57am

We are working with documents that have a set of fields. We have Websocket and IndexedDB providers. We want to work offline-first, i.e. the document is first loaded from IndexedDB, then synced with the server if an internet connection is available.

Sometimes, we have migrations and need to update the fields of a document. For example, adding a new field.

It seems like a trivial task, but there is the risk of data being erased:

1. A user loads a document.
2. They receive it first from IndexedDB.
3. The latest migration is applied and the document gets a new field, let’s say foo, holding an empty string. The set operation that assigns the empty string to foo is performed client-side, under that client’s ID.
4. The document is synced with the server. But the server document may already have had the migration applied, with a non-empty foo value written by another client ID.
5. The value of foo now ends up as either the server value (the right one) or the empty value (from the migration), depending on which client ID has priority for the mutation.
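The steps above can be illustrated with a toy model. This is not how Yjs is implemented; it is a minimal sketch that assumes concurrent writes to one key are ordered by client ID and that the higher ID wins. `Write` and `resolveConcurrent` are hypothetical names introduced only for this illustration.

```typescript
// Toy model only, NOT Yjs internals. Each write to a key carries the ID of
// the client that produced it. Assumption: among concurrent writes, the
// write with the higher client ID wins.
interface Write {
  clientId: number;
  value: string;
}

// Pick the surviving write out of two concurrent writes to the same key.
function resolveConcurrent(a: Write, b: Write): Write {
  return a.clientId > b.clientId ? a : b;
}

// A real value that is already stored on the server.
const userEdit: Write = { clientId: 1200, value: "written by a user" };

// The same migration default, produced offline by two clients whose
// randomly generated IDs happen to fall on either side of the user's ID.
const migrationLowId: Write = { clientId: 7, value: "" };
const migrationHighId: Write = { clientId: 98000, value: "" };

// With the low ID, the user's value survives.
console.log(JSON.stringify(resolveConcurrent(userEdit, migrationLowId).value));
// With the high ID, the empty migration default replaces it.
console.log(JSON.stringify(resolveConcurrent(userEdit, migrationHighId).value));
```

Because client IDs are normally random, which of the two outcomes a given user sees is effectively random as well.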

That’s very dangerous. To put it simply, a migration can randomly erase data. Our workaround is to wait until the server is synced before applying any migration, but then we lose the offline-first approach.
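A minimal sketch of that workaround, for reference. The `SyncSource` interface is an assumption modelled on a websocket provider that exposes a `synced` flag and emits a `sync` event with a boolean; check it against the provider you actually use. `migrateAfterSync` is a hypothetical helper name.

```typescript
// Assumed minimal shape of a provider that reports server sync.
// This interface is an assumption made for the sketch.
interface SyncSource {
  synced: boolean;
  on(event: "sync", handler: (isSynced: boolean) => void): void;
}

// Run `migrate` exactly once, as soon as the provider reports a completed sync.
function migrateAfterSync(provider: SyncSource, migrate: () => void): void {
  if (provider.synced) {
    // Already in sync with the server: safe to migrate right away.
    migrate();
    return;
  }
  let done = false;
  provider.on("sync", (isSynced) => {
    // Ignore "sync lost" notifications and any repeated "synced" notifications.
    if (isSynced && !done) {
      done = true;
      migrate();
    }
  });
}
```

The cost is visible in the code: while the client is offline, `migrate` never runs, which is exactly the loss of offline-first behaviour described above.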

Ideally, there should be a way to distinguish “migration updates” from “user updates”, and to make sure that user updates always take precedence over migration updates.

Is this possible?

I’ve read that you can indicate a custom client ID before applying a migration (clientId = 0), but also that this is a very dangerous thing to do, and that this can lead to terrible situations of corrupted data.

braden · December 20, 2023, 6:20pm

Migrations are a huge pain and I’ve yet to find a one-size-fits-all solution.

Usually I just try to design around them so that the business logic itself is idempotent or, where LWW is problematic for a multi-user conflict, so that the conflict is as minimally destructive as possible.
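One way to read “design around them”, sketched under assumptions: instead of writing a default value during a migration, apply the default when reading, so the migration produces no update that could conflict with a real one. `FieldStore`, `FIELD_DEFAULTS`, and `readField` are hypothetical names, and the store interface is only an assumed stand-in for a Y.Map-like object.

```typescript
// Assumed structural stand-in for a Y.Map-like store; a plain Map also fits.
interface FieldStore {
  get(key: string): unknown;
}

// Hypothetical defaults for fields that were added after the first release.
const FIELD_DEFAULTS: Record<string, string> = { foo: "" };

// Read a string field, falling back to its default when no client has
// written it yet. Nothing is written, so nothing can overwrite a peer's value.
function readField(store: FieldStore, key: string): string {
  const value = store.get(key);
  return typeof value === "string" ? value : (FIELD_DEFAULTS[key] ?? "");
}
```

Reading is trivially idempotent: running it on any number of clients, online or offline, never changes the document.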

I often version fields (“tags_v3”, etc.) as well, and use the code that encapsulates my Yjs stuff to hide legacy fields away indefinitely, rather than delete them.
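A minimal sketch of what such an encapsulation could look like. Only the key “tags_v3” comes from the post; the legacy key `tags_v2`, its comma-separated format, the `KeyValueStore` interface, and the `TagsField` class are all hypothetical and exist only for illustration.

```typescript
// Assumed minimal store shape; a Y.Map is expected to fit, and so does a plain Map.
interface KeyValueStore {
  get(key: string): unknown;
  set(key: string, value: unknown): unknown;
}

// Hypothetical wrapper: callers see one `tags` field, while storage keeps
// every historical key untouched.
class TagsField {
  private readonly store: KeyValueStore;

  constructor(store: KeyValueStore) {
    this.store = store;
  }

  // Prefer the newest representation; legacy keys are read but never
  // deleted or rewritten.
  get(): string[] {
    const v3 = this.store.get("tags_v3");
    if (Array.isArray(v3)) return v3.map(String);
    const v2 = this.store.get("tags_v2"); // hypothetical legacy format: "a,b,c"
    if (typeof v2 === "string") return v2.split(",").filter((t) => t.length > 0);
    return [];
  }

  // Writes always target the current key only.
  set(tags: string[]): void {
    this.store.set("tags_v3", [...tags]);
  }
}
```

Since old clients keep writing to the old key and new clients to the new one, the two versions never compete for the same entry.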

Migrations are, imo, the biggest glaring flaw with Yjs, and with pretty much every local-first solution. I’ve lost many days to designing around the migration problem. I’d wager that supporting local-first back compat in any framework makes dev take about 50% longer across the board. Yjs is amazing, but I think its DX is held back a bit by its peer-to-peer, multi-master nature. That is wonderful if your goal is to build a truly p2p app. But if there was a server-authoritative version of the Yjs algorithm, I’d jump on it in a heartbeat.

Gin-Quin · December 22, 2023, 1:31am

Thanks for sharing. It’s indeed an important problem to solve if you want a local-first solution.

There is this post that explains it’s possible to create updates that never have precedence over other updates (which is exactly what we are looking for): Initial offline value of a shared document - #6 by dmonad

But there is this big warning: “However, once a client initializes state slightly differently, you will break all documents.”

Does anyone know what “initializes state slightly differently” means? I’d like to understand why and how a document can be broken by a custom client ID.

Could it be dangerous if all document changes applied in a migration function were done with a client ID of zero?

I’d like to implement something like this:

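import * as Y from "yjs"

function applyDocumentMigration(doc: Y.Doc) {
  // Remember the random client ID so it can be restored afterwards
  const clientID = doc.clientID
  doc.clientID = 0

  try {
    const version = doc.getMap().get("version")

    // Cases fall through on purpose so every pending migration runs in order
    switch (version) {
      case 0:
        // ... apply migrations from v0 to v1
        // falls through
      case 1:
        // ... apply migrations from v1 to v2
    }
  } finally {
    // Restore the original ID even if a migration throws
    doc.clientID = clientID
  }
}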

Gin-Quin · December 22, 2023, 2:06am

I’ve got more information about this danger. From another post:

And from the FAQ:

So, it seems safe if the migration function is the only place where the client ID is manually set. I think an even safer version would be to use a client ID equal to the document version. This ensures that two conflicting updates from different migrations will never share the same client ID:

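import * as Y from "yjs"

function applyDocumentMigration(doc: Y.Doc) {
  // Remember the random client ID so it can be restored afterwards
  const clientID = doc.clientID

  try {
    const version = doc.getMap().get("version")

    // Cases fall through on purpose; each step writes under its own client ID
    switch (version) {
      case 0:
        doc.clientID = 0
        // ... apply migrations from v0 to v1
        // falls through
      case 1:
        doc.clientID = 1
        // ... apply migrations from v1 to v2
    }
  } finally {
    // Restore the original ID even if a migration throws
    doc.clientID = clientID
  }
}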

What do you think? Am I about to do something horribly wrong? Even though I think it’s safe, the repeated warnings make me a bit anxious.

MentalGear · February 17, 2024, 4:04pm

I’m also wondering about this. @Gin-Quin, may I ask whether this worked for you?

@braden Have you heard of share-db, crdx and Derby Framework? share-db (which uses OT, which I think has to be centrally managed) seems a bit more aligned with the central authoritative server approach. Do let me know what you think!

MentalGear · February 19, 2024, 12:10pm

At first I was thinking of doing the data migration on the server and on the clients, depending on their own version of a Y.Doc. But then the clients and the server would carry the overhead of constantly updating the common data structure.

It seems like the only approach is to treat it like a public API that never allows properties to be deprecated.

Even though I didn’t like versioning fields, the more I think about it, the more it seems like the best solution so far. One can still wrap the Y.Doc in an abstract class for stable access.

Maybe there are other lessons learned from other users here?