Detecting delayed awareness updates

We have recently faced some load issues in our sync servers (using Hocuspocus, multiple instances synced via Redis) and noticed an interesting behavior: during an incident, users would report seeing “ghost cursors” from their past selves, “replaying” the changes they made some time before.

We have not been able to reproduce this, but my understanding of what happens is:

  • user is connected to a sync server instance (S1) and sending awareness updates (from client id C1)
  • server faces heavy load and there’s lag consuming awareness updates from other instances (S2, S3, …)
  • user reconnects to a different server, say S2 (e.g. because they reloaded the app), and the client id is now C2
  • server S2 finally consumes the updates that originated in C1 (synced via S1) and sends them to C2
  • C2 “replays” old awareness states

We also received reports of this happening in collaborative sessions (with multiple users), so some naive app-level user id filtering wouldn’t suffice.

Given the scenario above, and without any knowledge of the awareness protocol, my first instinct would be to tag awareness updates with timestamps and then conditionally drop them in the receiving client based on a configurable grace period. However, it seems that the awareness protocol actually uses a state-based CRDT, in which case that doesn’t seem achievable, at least with the current implementation.

Is that understanding correct? Could there be an alternative solution?

Awareness updates already contain a timestamp (as in, an actual unix timestamp). However, you can’t trust users’ clocks. It is very common for a user’s time to be off by more than 1h. I’ve worked a lot with timestamps and I learned to only use them as “relative” markers. So, regardless of when an update was created, the Awareness instance keeps it around for about 30 seconds after receiving it and then marks the user as offline.
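To illustrate the “relative marker” idea, here is a minimal sketch that expires presence based on the local time of receipt rather than on any clock reading supplied by the sender. It is not the y-protocols implementation: `PresenceTracker`, `receive`, `prune`, `online` and `OUTDATED_TIMEOUT_MS` are hypothetical names, and the 30 second value just mirrors the figure mentioned above.

```typescript
// Hypothetical sketch: expire presence by local receipt time, never by the sender's clock.
const OUTDATED_TIMEOUT_MS = 30000; // "about 30 seconds", as described above

interface PresenceEntry<T> {
  state: T;
  receivedAt: number; // local clock reading taken when the update arrived
}

class PresenceTracker<T> {
  private entries = new Map<number, PresenceEntry<T>>();

  // Record an update. `now` always comes from the local clock (injectable for tests).
  receive(clientId: number, state: T, now: number = Date.now()): void {
    this.entries.set(clientId, { state, receivedAt: now });
  }

  // Remove every client not heard from within the timeout; return the removed ids.
  prune(now: number = Date.now()): number[] {
    const removed: number[] = [];
    this.entries.forEach((entry, clientId) => {
      if (now - entry.receivedAt >= OUTDATED_TIMEOUT_MS) {
        this.entries.delete(clientId);
        removed.push(clientId);
      }
    });
    return removed;
  }

  // Ids of clients currently considered online.
  online(): number[] {
    return Array.from(this.entries.keys());
  }
}
```

Because only differences between local clock readings are compared, a sender whose clock is off by an hour is handled the same as any other. The flip side is that a stale update delivered late still looks fresh on arrival, which is why this alone does not prevent the ghost cursors described in the question.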

If Hocuspocus chokes on updates, it should maybe discard awareness updates. It could also make sure to only distribute awareness updates if the user that produced the update is still around. Servers should perform some kind of tracking of users and update awareness state when a user goes offline.
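A rough sketch of the “only distribute if the user is still around” idea could look like the following. None of this is Hocuspocus API: `AwarenessEntry` and `filterRelayable` are hypothetical names, and the sketch assumes the server has already decoded an incoming awareness update into per-client entries and has access to a set of client ids that are still connected on any instance (for example, tracked through the same Redis used for syncing).

```typescript
// Hypothetical decoded form of one client's part of an awareness update.
interface AwarenessEntry {
  clientId: number;
  clock: number;
  state: unknown | null; // null means "this client went offline"
}

// Keep only the entries that are safe to relay:
// - removals (state === null) always pass, so offline notices still propagate
// - everything else passes only if its client is still known to be connected
function filterRelayable(
  entries: AwarenessEntry[],
  connected: ReadonlySet<number> // assumed to cover all server instances
): AwarenessEntry[] {
  return entries.filter(
    (entry) => entry.state === null || connected.has(entry.clientId)
  );
}
```

In the scenario from the question, S2 would drop the delayed states from C1 as long as C1 has already been removed from the connected set by the time S2 gets around to processing them.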

The Awareness CRDT is extremely fault-tolerant and works even if the server just replays messages. But it might make sense to add some additional logic for these cases.