Why? Imagine you're doing a rolling update. Some of your replicas are running the newer version of SQLite and some are running the older version. They may not execute the same high-level query in exactly the same way. For example, in the absence of an ORDER BY clause, the order of SELECT results is unspecified. So imagine someone issues a mutation that depends on this: "insert ... select ... limit". (Maybe a contrived example, but it can happen.) Now the databases start to diverge, not only in underlying bytes and implementation-defined ordering, but in actual row data.
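A minimal sketch of the hazard (hypothetical schema; the go-sqlite3 driver is an assumption): the first INSERT below is unsafe to replicate statement-by-statement, because SQLite is free to return the SELECT's rows in any order, while the ORDER BY variant is deterministic.

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", ":memory:")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        db.Exec(`CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)`)
        db.Exec(`CREATE TABLE archive (id INTEGER, payload TEXT)`)

        // Unsafe by statement: without ORDER BY, two replicas (or two
        // SQLite versions) may pick different rows for the LIMIT.
        db.Exec(`INSERT INTO archive SELECT * FROM events LIMIT 100`)

        // Deterministic: ORDER BY pins exactly which rows are taken.
        db.Exec(`INSERT INTO archive SELECT * FROM events ORDER BY id LIMIT 100`)
    }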
I worked on a major distributed system that originally replicated high-level changes and switched to replicating low-level changes for this reason. We had a system for detecting when replicas didn't match, and replication of high-level changes was the biggest reason for diffs. (Hardware error was the second biggest reason; we added a lot of checksumming because of that.)
One of rqlite's big limitations is that it resyncs the entire DB at startup time. Being able to start with a "snapshot" and then incrementally replicate changes would be a big help.
I'm not sure I follow why it's a "big limitation"? Is it causing you long start-up times? I'm definitely interested in improving this, if it's an issue. What are you actually seeing?
Also, rqlite does do log truncation (as per the Raft spec), so after a certain number of log entries (8192 by default), node restarts work exactly as you suggest. The SQLite database is restored from a snapshot, and any remaining Raft log entries are applied to the database.
We're storing a few GB of data in the SQLite DB. Rebuilding that when rqlite restarts is a slow and intensive process compared to just reusing the file on disk.
Our particular use case means we'll end up restarting 100+ replica nodes all at once, so the way we're doing things makes it more painful than necessary.
Try setting "-raft-snap" to a lower number, maybe 1024, and see if it helps. You'll have far fewer log entries to apply on startup. However, the node will snapshot more often, and writes are blocked during snapshotting. It's a trade-off.
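For example (hedging on the exact invocation -- check rqlited -h for your version):

    rqlited -raft-snap 1024 ~/node.1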
It might be possible to always restart using some sort of snapshot, independent of Raft, but that would add significant complexity to rqlite. The fact that the SQLite database is built from scratch on startup, from the data in the Raft log, means rqlite is much more robust.
We need that SQLite file to never go away. Even a few seconds is bad. And since our replicas are spread all over the world, it's not feasible to move 1GB+ of data from the "servers" fast enough.
Is there a way for us to use that sqlite file without it ever going away? We've thought about hardlinking it elsewhere and replacing the hardlink when rqlite is up, but haven't built any tooling to do that.
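For what it's worth, the tooling could be tiny. A hedged sketch in Go (paths and names hypothetical): keep a second hardlink pointing at the database's inode, and atomically refresh it once rqlite is back up, so readers always have some file to open.

    package main

    import (
        "log"
        "os"
    )

    // refreshLink points linkPath at the current database inode. Because a
    // hardlink shares the inode, the data stays reachable via linkPath even
    // if rqlite unlinks and rebuilds dbPath.
    func refreshLink(dbPath, linkPath string) error {
        tmp := linkPath + ".tmp"
        os.Remove(tmp) // best-effort cleanup of a previous attempt
        if err := os.Link(dbPath, tmp); err != nil {
            return err
        }
        // rename(2) is atomic on POSIX: linkPath never disappears.
        return os.Rename(tmp, linkPath)
    }

    func main() {
        if err := refreshLink("/var/lib/rqlite/db.sqlite", "/var/lib/rqlite/db.sqlite.stable"); err != nil {
            log.Fatal(err)
        }
    }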
Today the rqlite code deletes the SQLite database (if present) and then rebuilds it from the Raft log. It keeps things simple, and ensures the node can always recover, regardless of the prior state of the SQLite database -- basically the Raft log is the only thing that matters, and it is guaranteed to be the same on every node.
The fundamental issue here is that Raft can only guarantee that the Raft log is in consensus, so that is all rqlite can rely on. It's always possible that the copy of SQLite under a single node ends up in a different state than all the other nodes, because the change to the Raft log and the corresponding change to SQLite are not atomic. Blowing away the SQLite database means a restart fixes this.
If this is important -- and what you ask sounds reasonable for the read-only case that rqlite can support -- I guess the code could rebuild the SQLite database in a temporary place, wait until that's done, and then quickly swap any existing SQLite file with the rebuilt copy. That would minimize the time the file is not present. But the file has to go away at some point.
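A sketch of that swap step, assuming POSIX rename semantics (paths hypothetical). Since rename(2) over an existing file is atomic, new opens see the rebuilt copy and existing readers keep their old file descriptor, so the path itself never has a missing window:

    package main

    import (
        "log"
        "os"
    )

    func main() {
        // Rebuild the database at a temporary path first (elided here),
        // then rename it over the live path. rename(2) over an existing
        // file is atomic on POSIX filesystems.
        if err := os.Rename("/var/lib/rqlite/db.sqlite.rebuild", "/var/lib/rqlite/db.sqlite"); err != nil {
            log.Fatal(err)
        }
    }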
Alternatively rqlite could open any existing SQLite file and DROP all data first. At least that way the file wouldn't disappear, but the data in the database would wink out of existence and then come back. WDYT?
You start with an empty db and replay up to N; or, if you have a db that matches log3, start with that and only replay log4+?
log1: empty db
log2..N: changes
log3: snapshot at N
log4..M: changes
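A hedged sketch of that recovery rule (all types and names hypothetical, not rqlite's actual code): restore the newest snapshot if one exists, then replay only the entries after its index; otherwise replay the whole log against an empty database.

    package main

    import "fmt"

    type Entry struct {
        Index int
        SQL   string
    }

    type Snapshot struct {
        Index int // log index the snapshot captures
    }

    // recoverDB starts from a snapshot when one exists and replays only
    // the log entries after it; otherwise it replays the whole log.
    func recoverDB(snap *Snapshot, entries []Entry) []string {
        var applied []string
        from := 0
        if snap != nil {
            from = snap.Index
            applied = append(applied, fmt.Sprintf("restore snapshot@%d", snap.Index))
        }
        for _, e := range entries {
            if e.Index > from {
                applied = append(applied, e.SQL)
            }
        }
        return applied
    }

    func main() {
        entries := []Entry{{1, "INSERT ..."}, {2, "UPDATE ..."}, {3, "DELETE ..."}}
        // With a snapshot at index 2, only entry 3 is replayed.
        fmt.Println(recoverDB(&Snapshot{Index: 2}, entries))
    }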
Something to think about -- thanks.
The db is snapshotted at a preset time (easy way: lock the db for writes while the snapshot is in progress, preferably on a read replica) and then restored again (preferably elsewhere) to verify that the restore did indeed work as expected (and isn't corrupt).
The latest log sequence number could also be stored in the same db (in a separate table) so that it is captured along with the db-snapshot, atomically.
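A sketch of that idea (the _replication_state table and the accounts change are hypothetical; the go-sqlite3 driver is assumed): committing the data change and the sequence-number bump in one transaction means any snapshot of the file captures a consistent position marker.

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "app.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Hypothetical bookkeeping table living inside the same database.
        db.Exec(`CREATE TABLE IF NOT EXISTS _replication_state (k TEXT PRIMARY KEY, v INTEGER)`)

        tx, err := db.Begin()
        if err != nil {
            log.Fatal(err)
        }
        // The application change and the log position commit atomically,
        // so a file snapshot can never see one without the other.
        tx.Exec(`UPDATE accounts SET balance = balance - 10 WHERE id = 1`)
        tx.Exec(`INSERT OR REPLACE INTO _replication_state (k, v) VALUES ('applied_lsn', ?)`, 42)
        if err := tx.Commit(); err != nil {
            log.Fatal(err)
        }
    }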
Neat! This is reminiscent of replaying the event log in Smalltalk.
Hmm. Not having dug into your solution much, is it safe to say that the physical replication logs have something like logical checkpoints? If so, would it make sense to keep physical logs for only a relatively short rolling window, and keep logical logs (i.e. only the interleaved logical checkpoints) for longer?
One benefit of using physical logs is that you end up with a byte-for-byte copy of the original data, so it's easy to validate that your recovery is correct. With a logical log you'd need to iterate over all the records in your database to validate it.
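That validation can be as simple as comparing digests of the original and recovered files -- a sketch, with hypothetical file names:

    package main

    import (
        "crypto/sha256"
        "fmt"
        "io"
        "os"
    )

    // fileSHA256 returns the hex digest of a file's contents.
    func fileSHA256(path string) (string, error) {
        f, err := os.Open(path)
        if err != nil {
            return "", err
        }
        defer f.Close()
        h := sha256.New()
        if _, err := io.Copy(h, f); err != nil {
            return "", err
        }
        return fmt.Sprintf("%x", h.Sum(nil)), nil
    }

    func main() {
        orig, _ := fileSHA256("original.db")
        recovered, _ := fileSHA256("recovered.db")
        // Physical logs make this check meaningful: a correct recovery
        // is byte-for-byte identical to the source.
        fmt.Println("match:", orig == recovered)
    }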
However, all that being said, Litestream runs as a separate daemon process so it actually doesn't have access to the SQL commands from the application.
The main reason I would want both is to be certain that I was restoring the DB to a consistent state, even if the application is doing things stupidly (e.g. not using transactions correctly, or at all).
I suppose timestamps would be enough to view the logs interleaved or side-by-side in some fashion.
This package is part of Gazette, and uses a Gazette journal (known as a "recovery log") to power raw bytestream replication & persistence.
On top of journals, there's a recovery log "hinting" mechanism that is aware of file layouts on disk, and keeps metadata around the portions of the journal which must be read to recover a particular on-disk state (e.g. what are the current live files, and which segments of the log hold them?). You can read and even live-tail a recovery log to "play back" / maintain the on-disk file state of a database that's processing somewhere else.
Then there's a package providing RocksDB with a RocksDB environment that's configured to transparently replicate all database file writes into a recovery log. Because RocksDB is a continuously compacted LSM-tree and we're tracking live files, it regularly deletes files, which allows chunks of the recovery log journal to be dropped that would otherwise have to be read or stored in order to recover the full database.
For the SQLite implementation, SQLite journals and WALs are well-suited to recovery logs and their live-file tracking, because they're short-lived, ephemeral files. The SQLite page DB is another matter, however, because it's a very long-lived and randomly written file. Naively tracking the page DB means you must replay the _entire history_ of page mutations that have ever occurred.
This implementation solves that by using a SQLite VFS which actually uses RocksDB under the hood for the SQLite page DB, and regular files (recorded to the same recovery log) for SQLite journals/WALs. In effect, we're leveraging RocksDB's regular compaction mechanisms to remove old versions of SQLite pages that would otherwise have to be tracked, read, and replayed.
I'm seeing files shrink down to 14% of their original size (for example, a 1.7MB WAL compressed to 264KB). However, your exact compression ratio will vary depending on your data.
So how does that compare to logical replication? (Also I imagine packet size plays a role, since you have to flush the stream quite frequently, right? 1000 bytes isn't much more expensive than 431)
Logical replication would produce significantly smaller sizes, although the size cost isn't a huge deal on S3. Data transfer into S3 is free, and so are DELETE requests. The data only stays on S3 for as long as your Litestream retention specifies, so if you're retaining for a day, you're only keeping one day's worth of WAL changes on S3 at any given time.
I could see myself using dqlite in the future, so I'd like some more user reports from the trenches. Can anyone shed some light on this?
"Since the HA direction needs etcd anyway.. I'm proposing dropping support for sqlite as the default embedded non-HA option and switch to embedded etcd as the default. This will reduce overall effort of maintainability of two entirely different datastores."
I understand what you're saying, but I don't think it's a compelling objection. Obviously, differences between versions -- even patch versions -- can result in subtle, unintended differences in how the code works for a given program. But there is no reason to think a system that operates at a lower level ("replace key/block X, which should have contents C_x, with C_x'") is less vulnerable to this kind of issue than one that operates at a higher level, i.e. statement-based replication, which rqlite uses. In fact, I would argue that the system operating at the higher level of abstraction is less vulnerable, precisely because it doesn't have to care about the subtle low-level changes.
Another downside is that you might also propagate data corruption in case of bugs in the DB software (e.g. memory corruption) or hardware defects.
It certainly requires care to ensure all the replicas have software capable of understanding the new format before it's actually written, but it can be done. Likewise after writing the new format, you want to have a roll-back plan.
In SQLite's case, https://sqlite.org/formatchng.html says: "Since 2004, there have been enhancements to SQLite such that newer database files are unreadable by older versions of the SQLite library. But the most recent versions of the SQLite library should be able to read and write any older SQLite database file without any problems." I don't believe upgrading SQLite automatically starts using any of those enhancements; you'd have to do a schema change like "PRAGMA journal_mode=WAL;" first.
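Right -- the newer on-disk features are opt-in. A quick way to see this (a sketch; the go-sqlite3 driver is an assumption, but journal_mode is the real pragma):

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "app.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        var mode string
        // A fresh database stays in the legacy, maximally compatible
        // "delete" mode until you explicitly opt in to WAL, which
        // pre-2010 SQLite libraries can't read.
        db.QueryRow(`PRAGMA journal_mode`).Scan(&mode)
        fmt.Println("journal_mode before:", mode)

        db.QueryRow(`PRAGMA journal_mode=WAL`).Scan(&mode)
        fmt.Println("journal_mode after:", mode)
    }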
> Another downside is that you might also propagate data corruption in case of bugs in the DB software (e.g. memory corruption) or hardware defects.
This happens regardless.
Statement-based (high-level) replication is very useful for something like "insert into tbl0 select col0 from tbl1 order by col1", as you only need to send the query, not the individual row data.
If I want replicated SQLite, I'll look at dqlite or litestream instead, which sound more compatible with my design sensibilities. (Thanks, tyingq and benbjohnson!)
> rqlite uses Raft to achieve consensus across all the instances of the SQLite databases, ensuring that every change made to the system is made to a quorum of SQLite databases, or none at all.
What I understood from this sentence is that, if we have three instances, rqlite will make sure that every change is written to at least two. But what if two changes are written to two different pairs of instances? Then the three instances will have three different versions of the data. For example, change X is written to instances A and B, and change Y is written to B and C. Now A has X, B has X and Y, and C has Y only. How do you decide who is right?
Raft is based on having a leader decide what the next COMMIT is going to be, so B could never have X and Y at the same time (they could both be queued but other mechanisms could reject them).
Also, data is not considered committed until more than half the cluster has acknowledged it (at which point the leader knows it and handles things going forward); leader election also works in a similar way, IIRC.
As others mentioned, the visualization on https://raft.github.io/ is really good (you can interact with it to create commits and control machine downtime).
That would mean in 3 servers you need 2.5 aka 3 machines to commit a change. Then 4/5, 5/7, 6/9, 7/11. And I think it's a wash anyway, because as the number of servers goes up the fraction you need for quorum goes down, but the odds of falling behind or failing outright go up too. Not to mention the time during which 1/n machines are down due to an upgrade gets longer the more machines you have, increasing the chances of a double fault.
> That would mean in 3 servers you need 2.5 aka 3 machines to commit a change. Then 4/5, 5/7, 6/9, 7/11.
No, the requirement isn't 1/2 + 1. Any strict majority of the cluster is enough to elect a leader. So you need 2/3, or 3/4, or 3/5, and so on.
> Not to mention the time during which 1/n machines are down due to an upgrade gets longer and longer the more machines you have, increasing the chances of double fault.
Generally, this is not the case. If individual machine failures are random and equally probable, and if each machine is down on average less than 50% of the time, then adding more machines makes things better, not worse. (This is a basic property of the binomial distribution.)
Of course, if you have a single point of failure somewhere -- e.g. a network switch -- this assumption can be violated, but that's true regardless of how many machines you have.
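To make the binomial point concrete, here's a small sketch assuming independent failures with probability p = 0.1 per machine: the probability of losing a majority falls as the cluster grows.

    package main

    import (
        "fmt"
        "math"
    )

    // choose computes the binomial coefficient C(n, k).
    func choose(n, k int) float64 {
        c := 1.0
        for i := 0; i < k; i++ {
            c *= float64(n-i) / float64(i+1)
        }
        return c
    }

    // pMajorityDown is the probability that more than half of n machines
    // are down at once, given each is down independently with probability p.
    func pMajorityDown(n int, p float64) float64 {
        total := 0.0
        for k := n/2 + 1; k <= n; k++ {
            total += choose(n, k) * math.Pow(p, float64(k)) * math.Pow(1-p, float64(n-k))
        }
        return total
    }

    func main() {
        for _, n := range []int{3, 5, 7, 9} {
            fmt.Printf("n=%d: P(no quorum) = %.5f\n", n, pMajorityDown(n, 0.1))
        }
        // The probability shrinks as n grows, for any p < 0.5.
    }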
The parent post is talking about the number that can go down while maintaining quorum, and you're talking about the number that need to remain up to maintain quorum. So you're both correct.
> That would mean in 3 servers you need 2.5 aka 3 machines to commit a change.
That seems wrong. You need N//2 + 1, where "//" is floor division, so in a 3-node cluster you need 3//2 + 1 = 1 + 1 = 2 nodes to commit a change.
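Go's integer division already floors, so a quick sanity check:

    package main

    import "fmt"

    // quorum is the minimum number of nodes that must agree: a strict
    // majority, i.e. floor(n/2) + 1.
    func quorum(n int) int { return n/2 + 1 }

    func main() {
        for _, n := range []int{3, 4, 5, 7, 9, 11} {
            fmt.Printf("cluster of %2d -> quorum %d\n", n, quorum(n))
        }
        // Prints 2, 3, 3, 4, 5, 6 -- matching the 2/3, 3/4, 3/5 figures above.
    }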
'Simple majority' is based on the number of the machines that the leader knows about. You can only change the membership by issuing a write. Write quorum and leadership quorum are two different things, and if I've got it right, they can diverge after a partition.
I'm also thinking of double faults, because the point of Raft is to get past single fault tolerance.
After a permanent fault (broken hardware) in a cluster of 5, the replacement quorum member can't vote for writes until it has caught up. It can vote for leaders, but it can't nominate itself. Catching up leaves a window for additional faults.
It's always 3/5 for writes and elections; the difference is that the ratio of original machines that have to confirm a write can go to 100% of survivors, instead of the 3/4 of reachable machines. Meaning network jitter and packet loss slow down writes until the cluster recovers, and an additional partition can block writes altogether, even with 3/5 surviving the partition.
Your example can't happen, because if (for instance) A is the leader, then C will not apply change Y without contacting the leader, which will tell it to apply X first.
If you want more details about how this handles all the edge cases -- for instance, what happens if the leader crashes -- the Raft paper is quite accessible: https://raft.github.io/raft.pdf
Surprisingly easy to understand, and a cool visual.
Edit: if you have some time, http://thesecretlivesofdata.com/raft/ seems to be a good step-by-step explanation of how it works in detail.
Alice, Bob and Charlie have a synced copy of the same database
Charlie goes on a plane and adds loads of records without a connection to the other databases
Alice and Bob make no changes
Charlie comes home and syncs
Will Charlie lose all his changes, as his database is different to Alice and Bob's?
What happens if Alice, Bob and Charlie all make changes offline and then resync?
This is a fundamentally necessary tradeoff to provide strong consistency, as described by the CAP theorem.
Here is a real-life scenario that I had to deal with in the past. A technician (carpenter) goes to a client's home to repair furniture in the middle of nowhere, so no internet. He adds the necessary paperwork -- pictures, declarations, the contract (signed and scanned) -- to the Clients table. This company employed hundreds of such technicians throughout the many counties of Germany, each with a laptop running this app, which was the backbone for getting paid and doing the work. And it was not uncommon for more than one carpenter to go to a client's home and do the repairs. Since each carpenter was paid according to his own work, each of them would create entries in their local Clients table, and when they got back to HQ their work was manually uploaded to the central DB, and only after that did they get paid. I automated that (that was the job: to eliminate the thousands of hours that carpenters were wasting on manual uploads).
So given the above scenario, how is this system going to handle that? Same table, and even the same client details in the Clients table -- just different rows for different carpenters (foreign key to the Carpenters table).
You usually replicate for failure tolerance and performance (this project aims only at the former).
I don't think it is. You're describing a use case that is distributed but explicitly does not want to enforce consistency—you want offline workers to all be able to keep working, and you're enforcing consistency after the fact and outside of the database itself.
Sometimes you had conflicts that needed resolution, but those weren't that frequent for our use case.
In the second case (where all three make changes offline), Alice-Bob consensus overrides Charlie.
SQLite has a great post on “When to Use” (and not use) SQLite.
Would be great if you included these same use cases in the README docs and made it clear whether rqlite can address them.
Would be good to hear from someone who has used it what the pros and cons of such a setup are.
One minor caveat ("Here be Dragons") I have (with respect to my own future adoption/production use), however:
>"Does rqlite support transactions?
It supports a form of transactions. You can wrap a bulk update in a transaction such that all the statements in the bulk request will succeed, or none of them will. However, the behaviour of rqlite is undefined if you send explicit BEGIN, COMMIT, or ROLLBACK statements. This is not because they won't work -- they will -- but if your node (or cluster) fails while a transaction is in progress, the system may be left in a hard-to-use state. So until rqlite can offer strict guarantees about its behaviour if it fails during a transaction, using BEGIN, COMMIT, and ROLLBACK is officially unsupported. Unfortunately this does mean that rqlite may not be suitable for some applications."
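For reference, the supported form looks something like this -- a sketch against rqlite's HTTP execute endpoint (the ?transaction parameter is my understanding of the API; check the docs):

    package main

    import (
        "fmt"
        "log"
        "net/http"
        "strings"
    )

    func main() {
        // Both statements succeed or neither does; no explicit BEGIN/COMMIT.
        body := `[
            "INSERT INTO foo(name) VALUES('fiona')",
            "INSERT INTO foo(name) VALUES('declan')"
        ]`
        resp, err := http.Post(
            "http://localhost:4001/db/execute?transaction",
            "application/json",
            strings.NewReader(body),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }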
PS: Distributed transactions are extremely difficult to get exactly right -- so I'm not trying to criticize all of the hard work and effort that everyone has put into this (again, it's a great idea, and I think it has a terrific future).
But Distributed Transactions -- are what differentiate something like rqlite from, say, something like CockroachDB (https://www.cockroachlabs.com/docs/stable/architecture/life-...).
Of course, CockroachDB is a paid product with an actual company with many years of experience backing it, whereas rqlite, as far as I can tell, at this point in time (someone correct me if I am wrong), appears to be a volunteer effort...
Still, I think that rqlite despite this -- has a glorious and wonderful future!
Again, a brilliant and laudable effort, suitable for many use cases presently, and I can't wait to see what the future holds for this Open Source project!
Maybe in the future some code-ninja will step up to the plate and add fully guaranteed, safe, distributed transactions!
Until then, it looks like a great idea coupled with a great software engineering effort!
As I said, "Favorited!".
> Comdb2 is a relational database built in-house at Bloomberg L.P. over the last 14 years or so.
rqlite is not 14 years old.