Random data loss in some documents

We have y-websocket set up with y-leveldb (using dynamodb-leveldown) for persistence, and we use ProseMirror on the frontend.

Recently there have been instances of people losing parts of their documents. When this happens, the bottom few lines of the document are usually missing when the document is opened.

I don’t fully understand why it’s happening, but my hunch is that this is a race condition and that DynamoDB plays a role in it. I also suspect that writes to DynamoDB are randomly timing out or getting lost.

Has anyone here seen similar behaviour before?

I have a solution in mind that circumvents this problem, and I’d like to hear any critique on it:

  1. Introduce a mechanism to store ProseMirror JSON snapshots of documents from the frontend in a Postgres database (these also act as backups, and are saved every 5 seconds or so)
  2. Remove DynamoDB persistence, so that the docs are kept in memory only
  3. When a doc is first loaded into memory, initialize it from the latest snapshot in Postgres (ProseMirror JSON)
  4. When the last person leaves the doc, persist a snapshot, just to be sure

I understand I’ll lose the undo ability when documents are re-opened, but this seems to be the safest solution I have for now to prevent data loss in any manner.


Hi @manan-jadhav,

While DynamoDB exposes an interface similar to LevelDB (making it compatible with levelup), it doesn’t have the same correctness guarantees. When two users concurrently insert content into DynamoDB, they might overwrite each other’s data.

Internally, y-leveldb stores document updates in a list where the keys for incremental updates look like this: ${documentname}-${clock}. Here, clock is always one larger than the previous clock for that document.

LevelDB and IndexedDB can guarantee exclusive access to a database (e.g. through transactions). We can read the previous clock entry and create a new entry with an incremented clock in the same transaction. However, DynamoDB doesn’t give us this kind of exclusive access. There is a race condition where two different processes check the clock simultaneously, see the same value, and generate entries with the same key that overwrite each other.
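
To make the race concrete, here is a minimal sketch that replays the interleaving described above against a plain in-memory object. It does not use y-leveldb or DynamoDB; the names store, readLatestClock and writeUpdate are hypothetical and only mimic the ${documentname}-${clock} key scheme.

```typescript
// Hypothetical in-memory stand-in for a key-value store without transactions.
const store: Record<string, string> = {};

// Return the highest clock stored for a document, or -1 if there is none.
function readLatestClock(doc: string): number {
  const prefix = doc + "-";
  let latest = -1;
  Object.keys(store).forEach((key) => {
    if (key.indexOf(prefix) === 0) {
      latest = Math.max(latest, Number(key.slice(prefix.length)));
    }
  });
  return latest;
}

// Write an update under the key `${doc}-${clock}`, replacing any existing value.
function writeUpdate(doc: string, clock: number, update: string): void {
  store[`${doc}-${clock}`] = update;
}

// Two processes both read the clock before either of them writes.
const clockSeenByA = readLatestClock("doc1");
const clockSeenByB = readLatestClock("doc1");

// Both computed the same next clock, so the second write replaces the first.
writeUpdate("doc1", clockSeenByA + 1, "update from A");
writeUpdate("doc1", clockSeenByB + 1, "update from B");

const storedKeys = Object.keys(store);
const survivingUpdate = store["doc1-0"];
```

After this runs, the store holds a single key and the update from A is gone, which is the kind of silent overwrite that later shows up as missing content.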

Once you have data loss, some parts of the document can’t sync anymore because the content they reference is missing.

A better solution for this problem would be to ensure that only one process accesses the database at a time.
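
As a rough illustration of serialized access, the sketch below chains all writes for a document onto one promise so that each read-then-write pair finishes before the next one starts. Everything here (updates, delay, appendUpdate, enqueue) is a hypothetical in-memory stand-in rather than y-leveldb or DynamoDB API, and it only serializes writers inside a single server process; several server processes would still need some external coordination.

```typescript
// Hypothetical in-memory stand-in for a store without transactions.
const updates: Record<string, string> = {};

// Hypothetical helper that simulates network latency.
function delay(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Unsafe on its own: reads the next clock, waits, then writes.
async function appendUpdate(doc: string, update: string): Promise<number> {
  const prefix = doc + "-";
  const nextClock = Object.keys(updates).filter(
    (key) => key.indexOf(prefix) === 0
  ).length;
  await delay(5);
  updates[prefix + nextClock] = update;
  return nextClock;
}

// Tail of the pending work for each document.
const tails: Record<string, Promise<unknown>> = {};

// Run `task` only after all earlier tasks for the same document have settled.
function enqueue<T>(doc: string, task: () => Promise<T>): Promise<T> {
  const previous = tails[doc] || Promise.resolve();
  const run = previous.then(task);
  // Store a tail that never rejects, so one failed task does not block later ones.
  tails[doc] = run.catch(() => undefined);
  return run;
}
```

Two concurrent calls such as enqueue("doc1", () => appendUpdate("doc1", "update from A")) and enqueue("doc1", () => appendUpdate("doc1", "update from B")) then receive different clocks, because the second read only happens after the first write has completed.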

Furthermore, you can prevent potential issues in your architecture by setting { resyncInterval: 1000 } as an option in the y-websocket client to ensure that clients sync at regular intervals. Once data loss is detected, a client might be able to sync up the missing content (no guarantee).

I also had misplaced authentication logic, which meant there was a delay between handleUpgrade and the Yjs listeners being set up on the websocket. This led to client-side syncState1 events being missed by the server in some cases.

Thank you @dmonad for the resyncInterval tip - it made me realize that, because of the issue described above, clients were never reaching the synced state.
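
For anyone hitting the same thing, one possible way to avoid dropping early messages is to start receiving synchronously and buffer anything that arrives before setup has finished. This is only a sketch under that assumption, not necessarily the fix used here; MessageBuffer is a hypothetical helper and not part of y-websocket or ws.

```typescript
// Hypothetical helper: holds messages until the real handler is attached.
class MessageBuffer<T> {
  private pending: T[] = [];
  private handler: ((message: T) => void) | null = null;

  // Call this from the socket's message event, registered before any async work.
  receive(message: T): void {
    if (this.handler) {
      this.handler(message);
    } else {
      this.pending.push(message);
    }
  }

  // Call this once authentication and listener setup are done.
  // Buffered messages are replayed in arrival order before new ones are handled.
  attach(handler: (message: T) => void): void {
    this.handler = handler;
    const queued = this.pending;
    this.pending = [];
    queued.forEach((message) => handler(message));
  }
}
```

The idea is to call receive from a message listener that is registered right after the upgrade, and to call attach only after the asynchronous authentication step has finished.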
