Handling Eventual Consistency in CQRS
I am not sure how many of you have faced this when implementing CQRS in your APIs but it has been a tricky situation for us to handle the latency and errors between two different data sources while syncing them. What I mean is with CQRS, we have to use separate data source for Commands (writes) and Reads and when we do that, we have to sync the data between the two data sources whenever the write happens to Command data source.
Now there are two broader ways to write to both sources when a write Command is instructed.
- Write to the primary database and start writing to the READ database and return to the caller without waiting for the success/failure of the write to the READ database OR
- Write to the primary and READ database and return to the caller only when both of the succeeds. If any one of the fails, return appropriate error code instead of 200.
Now lets us call the first approach async and second one as sync way of writing to the READ database.
For async approach, we have multiple options to implement. We can just write and wait for the primary database while we use async library for writing to the READ database and return the response without waiting for the success/failure of the second operation. Another approach is that we can event based architecture where we just raise the event that a write has happened in primary database and somewhere on our bus there is a listener that listens to that event and write the data to the READ database. There are trade offs for both of the approaches which are out of scope of this discussion.
For sync approach, we have to wait for both writes and return to the caller with 200 only when both succeeds. If anyone of them fails, we have to return error code. We might implement this using transactions.
Now there are pros and cons of both async and sync approaches. Async way inherits a latency in new data being available to the caller to read. New data would be available for read but it might take some time, usually few milliseconds to seconds. I guess this is what is called eventual consistency. By nature of this approach, caller would never know if the data it wrote is available to read or not. It might present stale data to the caller if the write to the READ database has failed. Another problem that we faced is that for certain operations on the front end, we need to make a read call (GET/LIST) immediately after the write operation (POST/PUT/PATCH) e.g. a user can join a project when he/she is invited to the project and on the invitation accept screen we send a write Command to the API to add the user as member of the project and on success we take user to the project’s dashboard which first checks if the user is part of the project or not. Because there is a latency in the data being available for read to the caller, we intermittently faced problems where user was not able to see the project’s dashboard even after accepting the invite. As you might already guessed it what has happened here, we have to add artificial delay after the write Command to give enough time to thew new data to be available in READ database. Today we have multiple such use cases where are adding these artificial delays until we fix async nature. The same situation becomes more problematic when the write to the READ database didn’t go through and user is never able to see the project’s dashboard until support team fixes it.
Sync approach on the other hand does not have any of these problems, but it increases the overall latency of the Command operations of the API. I am not sure how much impact can be for a high volume API and it might not have significant cost for a relatively low volume API. I would like to know more details from readers about this aspect.
My Solution
What I did to resolve this problem is modified the CQRS pattern usage for our application. We modified our GET by ID endpoints of our API resources to fetch it from the Command database i.e. from the source of truth. This won’t have any performance hit on the API endpoint because we are fetching the resource by their primary key. This solution provides a good source of truth to the consumers of our API. There might be some latency in the new or updated records in our LIST endpoints but our GET endpoint would always return the source of truth.
This modification has served our purpose and apps are working pretty well with APIs with less number of workarounds (e.g. artificial delays are write operations). However, I am willing to know what are the side effects of this approach by deviating from the standard CQRS pattern, please let me know in comments.