Y-prosemirror persistence

dzonekl · November 30, 2020, 3:57pm

Hello,
Thank you for Yjs !
Our team, working at a large media company in the Benelux, is looking into a collaboration ‘backend’ for ProseMirror. We are going to do a POC and want to test how we can persist the document being edited to a store of our choosing. I read about the leveldb integration; what would it take to build an integration for a different store? Or alternatively, can y-prosemirror emit events that we can use to persist a document?

Thank You for some info on this, of course we will have learned a lot after trying.

dmonad · November 30, 2020, 6:08pm

Hi @dzonekl,

Yjs is designed to make it pretty easy to persist data in custom databases. As this is realtime data, there are a couple of caveats when designing shared editing backends (as you would have with any other backend).

The current documentation for the document update format, that you would need to persist in your custom database, is here: https://github.com/yjs/yjs#document-updates
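As a rough sketch of what persisting that update format could look like: a Yjs document emits an `update` event carrying a binary `Uint8Array`, and those updates can be appended to any store. The names `UpdateStore`, `MemoryUpdateStore`, and `persistUpdates` below are hypothetical, and `UpdateSource` only models the `on('update', ...)` shape of `Y.Doc`, so the snippet runs without Yjs installed. The in-memory store is a stand-in for a real database adapter, which would be asynchronous.

```typescript
// Models only the part of Y.Doc this sketch relies on:
// doc.on('update', (update: Uint8Array, origin: any) => { ... })
interface UpdateSource {
  on(event: 'update', handler: (update: Uint8Array, origin: unknown) => void): void;
}

// Hypothetical append-only store; a real adapter (DynamoDB, Postgres, ...)
// would be asynchronous and write to a table keyed by document name.
interface UpdateStore {
  append(docName: string, update: Uint8Array): void;
  load(docName: string): Uint8Array[];
}

class MemoryUpdateStore implements UpdateStore {
  private rows = new Map<string, Uint8Array[]>();

  append(docName: string, update: Uint8Array): void {
    const list = this.rows.get(docName) || [];
    list.push(update.slice()); // copy so later mutation cannot corrupt the log
    this.rows.set(docName, list);
  }

  load(docName: string): Uint8Array[] {
    return (this.rows.get(docName) || []).map((u) => u.slice());
  }
}

// Append every incremental update the document emits to the store.
function persistUpdates(docName: string, doc: UpdateSource, store: UpdateStore): void {
  doc.on('update', (update) => {
    store.append(docName, update);
  });
}
```

With Yjs itself, `doc` would be a `Y.Doc`, and restoring a document would mean creating a fresh `Y.Doc` and calling `Y.applyUpdate(doc, update)` for each entry returned by `load`. Since Yjs updates are commutative and idempotent, replay order and duplicate entries should not affect the result.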

I’m currently in the process of writing up my experience with building shared editing backends in the new Yjs documentation website: https://docs.yjs.dev/tutorials/persisting-the-document-to-a-central-database. I haven’t tackled the Tutorials section yet.

y-leveldb is built with levelup. Since levelup natively supports different databases (mongo, postgres, …), many users simply use y-leveldb as a generic adapter. But if you are serious about building a custom shared editing backend, then you should probably invest a bit more time.

I write open-source shared editing backends in exchange for funding. So there is always that option if you want to contribute back.

dzonekl · December 1, 2020, 8:16am

Hi dmonad,
Thanks for the pointers, we will also consider y-leveldb as an adapter.
Thank you also for your offer to help. I think what we will do is build the proof of concept to learn about Yjs. We want to see if we can fit it nicely into our AWS infra, perhaps even running the server side on Lambda (although I can imagine that keeping the in-memory state of Yjs in a lambda that comes and goes is perhaps not a good idea). We also want to test resilience and how it scales.

From there, if we decide yjs is fit for our goals, we can evaluate if your help would benefit us moving quickly forward.

Rgds Christophe

flow · December 1, 2020, 8:54pm

Hi @dzonekl,

that sounds really interesting!

Actually I’m also in the middle of a prototype (frontend prosemirror prototype and in the process of defining backend infrastructure).
My goal is to be able to run completely cloud native on AWS. I really like the Serverless-first approach.

So far I found the following possible solutions for the client connections:
One way would be to use the API Gateway (with WebSockets), or there would be the AppSync service (with the subscription feature through WebSockets), although with AppSync I have to figure out how to get Yjs working with GraphQL.

For computing I also thought about lambdas. Although I really like the concept I’m not sure if they are the best fit, because of the same reasons you mentioned. Although, to know it for sure, I think I just have to try and test it…

For the store options I also thought about the leveldb driver. The level community offers different storage options. Here I could find packages for AWS S3 and DynamoDB.
But to really use them I have to take a closer look at those community packages. I am not sure whether the best approach is to understand level and then those community integrations, or whether it makes more sense to create a custom Yjs provider for DynamoDB, S3, or another AWS database offering (or perhaps use the y-mongodb provider with an AWS database that supports the MongoDB API).

At the moment I fear the impact that Yjs always needs the complete document in the cache to be able to update it. This might lead to sacrifices in scaling or could be a dealbreaker for serverless.
The newly introduced subdocuments feature might compensate for this a bit through the ability to split data and load it asynchronously.

Those are my thoughts & ideas so far. I’m really curious which approach your team will take and sharing your learnings would be much appreciated!

Oh, potentially interesting articles I found about when you really need the big league of real-time collaboration with websockets:

@dmonad Some time ago you posted an open source service solution that stores websocket connections and calls backend services via HTTP. I forgot the name and cannot find the link anymore. Do you still know the name of the product or have the GitHub link?

dzonekl · December 2, 2020, 9:28am

Hi @dmonad

We have several services already which use the Websocket functionality of the AWS API gw, and have good experience with this. This effectively converts websocket connect/disconnect and websocket messages into HTTP calls to the integration. I guess y-websocket server would need to be adapted to that…

When it comes to persistence with DynamoDB or S3: DynamoDB would be a candidate for us as well. It’s especially interesting when combined with DynamoDB Streams events. Alternatively, I read that we can also use HTTP callbacks for persistence. This could be an easy way forward for us: calling a service that does the persistence. BTW, we would not only need to persist but also trigger downstream logic (like propagating editing events to other services).

What we are after is a resilient system with load distribution. I have no clue yet about how much caching/memory would be required to host thousands of collab sessions. I read about alternatives to keep instances in sync. That is something we need to understand better. When it comes to using Serverless, I can imagine we could use “provisioned concurrency”, effectively not killing the lambda, but we would need to bring it in sync and flush state when done… The resilience part concerns us the most right now.

flow · December 2, 2020, 1:16pm

Hi @dzonekl,

great to hear your thoughts! I’m absolutely with you that the target should be a resilient system that scales. Good to hear that you also consider using DynamoDB.

I actually did not really get how the provisioned concurrency of lambdas fits in.
I assume that there are many collaborations happening in different prosemirror documents, where each document state is stored in its own Y.Doc.
If that is the case, there must be a differentiation between the different cached states. I think caching every document in the lambdas might not be ideal.

Another concern that I have is the fact that Yjs needs a one-to-many architecture for collaborative real-time editing. That means I have to somehow get all the relevant client connection IDs and send them the changes. A blog post on AWS does it via a third lambda function that iterates over all the connections and sends them the changes. As one of the links I posted earlier (the first one in my last post) states, this does not scale well.

Another thing I keep thinking about is (if I understood it correctly) the fact that it could be possible to apply changes without loading the complete Y.Doc instance. This could be a huge benefit! (But it would also be specific to the provider.) To fully get an idea of this feature, though, I have to dig deeper into the core of Yjs & CRDTs.
The related post:

As far as I found there are two approaches to realise a one-to-many architecture (only a short high level overview):

Most common approach (and also supported with current yjs providers):

Sticky websocket connection where each client is connected to a host.
To be able to scale horizontally there is the redis PubSub mechanism that distributes changes to all the other hosts who can then send the changes to their connected clients.
The problem here is scaling down, because of the long-lived sticky sessions.
For that approach there are currently the y-websocket and y-redis packages.
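To make the first approach concrete, here is a minimal sketch of the fan-out between hosts. It is not how y-websocket or y-redis are implemented: `PubSub` is a hypothetical in-memory stand-in for a broker such as Redis, and `Host`, `connect`, and `receive` are illustrative names. Each host forwards an incoming update to its own clients and publishes it, and every other host relays it to the clients it holds.

```typescript
type Listener = (message: Uint8Array, fromHost: string) => void;

// In-memory stand-in for a pub/sub broker (hypothetical; Redis in practice).
class PubSub {
  private channels = new Map<string, Listener[]>();

  subscribe(channel: string, listener: Listener): void {
    const list = this.channels.get(channel) || [];
    list.push(listener);
    this.channels.set(channel, list);
  }

  publish(channel: string, message: Uint8Array, fromHost: string): void {
    (this.channels.get(channel) || []).forEach((l) => l(message, fromHost));
  }
}

type Send = (update: Uint8Array) => void;

// One server process holding sticky client connections.
class Host {
  private id: string;
  private bus: PubSub;
  // docName -> clientId -> send function (a websocket send in practice)
  private rooms = new Map<string, Map<string, Send>>();

  constructor(id: string, bus: PubSub) {
    this.id = id;
    this.bus = bus;
  }

  connect(docName: string, clientId: string, send: Send): void {
    let room = this.rooms.get(docName);
    if (!room) {
      room = new Map<string, Send>();
      this.rooms.set(docName, room);
      // First local client for this document: start listening to other hosts.
      this.bus.subscribe(docName, (message, fromHost) => {
        if (fromHost !== this.id) this.broadcast(docName, message, null);
      });
    }
    room.set(clientId, send);
  }

  // Called when a local client sends an update.
  receive(docName: string, fromClient: string, update: Uint8Array): void {
    this.broadcast(docName, update, fromClient); // local peers
    this.bus.publish(docName, update, this.id);  // peers on other hosts
  }

  private broadcast(docName: string, update: Uint8Array, except: string | null): void {
    const room = this.rooms.get(docName);
    if (!room) return;
    room.forEach((send, clientId) => {
      if (clientId !== except) send(update);
    });
  }
}
```

The sketch leaves out unsubscribing and disconnects, which is where the downscaling problem mentioned above shows up: a host can only be removed once its long-lived connections are gone.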

A new rising approach that not many people have done so far:

A client Websocket connection that is handled by a gateway that holds the client connections and converts the requests into HTTP messages and sends them to the backend.
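A rough sketch of what the backend side of this second approach could look like, assuming the gateway delivers connect, disconnect, and message events as separate calls. Everything here is hypothetical: `GatewayEvent` is a simplified event shape (not the actual API Gateway payload), `MemoryRegistry` stands in for a table of connection IDs (for example in DynamoDB), and `post` stands in for whatever call the gateway offers to push data back to a connection.

```typescript
// Simplified, hypothetical event shape; real gateway payloads differ.
type GatewayEvent =
  | { type: 'connect'; connectionId: string; docName: string }
  | { type: 'disconnect'; connectionId: string }
  | { type: 'message'; connectionId: string; body: Uint8Array };

// Stand-in for a persistent connection table shared by all handler instances.
class MemoryRegistry {
  private docByConnection = new Map<string, string>();

  add(connectionId: string, docName: string): void {
    this.docByConnection.set(connectionId, docName);
  }

  remove(connectionId: string): void {
    this.docByConnection.delete(connectionId);
  }

  docOf(connectionId: string): string | undefined {
    return this.docByConnection.get(connectionId);
  }

  peers(docName: string): string[] {
    const ids: string[] = [];
    this.docByConnection.forEach((doc, id) => {
      if (doc === docName) ids.push(id);
    });
    return ids;
  }
}

// Injected so the handler itself holds no connection state.
type PostToConnection = (connectionId: string, body: Uint8Array) => void;

function handleGatewayEvent(
  event: GatewayEvent,
  registry: MemoryRegistry,
  post: PostToConnection
): void {
  switch (event.type) {
    case 'connect':
      registry.add(event.connectionId, event.docName);
      return;
    case 'disconnect':
      registry.remove(event.connectionId);
      return;
    case 'message': {
      const docName = registry.docOf(event.connectionId);
      if (docName === undefined) return; // unknown connection, nothing to do
      // One-to-many: relay the update to every other connection on the document.
      for (const peer of registry.peers(docName)) {
        if (peer !== event.connectionId) post(peer, event.body);
      }
      return;
    }
  }
}
```

The loop over `peers` is the part whose scalability is in question: its cost grows with the number of connections per document, so large sessions may need batching or a dedicated fan-out component.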

I personally think that the second approach will be the future and the one I want to try, although there is still a lot to explore to get it running.
Happy to share my findings along the road and to collaborate on the topic.

dmonad · December 2, 2020, 1:55pm

The nice thing about Yjs is that it integrates nicely into existing architectures. Usually, I suggest something like a central database to store changes, and a pubsub server to communicate changes to the backends that serve clients. If DynamoDB supports events (notifying backends when something changed), then DynamoDB might be a really nice option to store Yjs updates.

There are definitely many architectures that need to be explored. I appreciate it if you’d share your experiences with your approach. There is not really “the” way to do it yet. Rather, there are many options that all might work in different use-cases. This is something I like about CRDTs & Yjs. Infinite scaling possibilities.

Legendkeeper.com will likely sponsor my work on differential updates (computing updates without loading them to memory). It isn’t finalized yet, but I hope to begin next week with the work. I made a short write-up of my plans here: Differential updates · Issue #263 · yjs/yjs · GitHub

@flow I don’t think I posted a service like that. I’d love to have a service solution like that. The feature you are describing sounds similar to the HTTP callback solution in y-websocket.