scottlamb 45 days ago [-]
It looks like this layers Raft on top of SQLite. I don't like it when systems replicate high-level changes like "update users set salary = salary + 1000 where ...;". Instead, I prefer that they replicate low-level changes like "replace key/block X, which should have contents C_x, with C_x'".

Why? Imagine you're doing a rolling update. Some of your replicas are running the newer version of SQLite and some are running the older version. They may not execute the high-level query in exactly the same way. For example, in the absence of an "order by" clause, the order of select results is unspecified. So imagine someone makes a mutation that depends on this: "insert ... select ... limit". (Maybe a dumb example, but it can happen anyway.) Now the databases start to diverge, not only in underlying bytes and implementation-defined ordering but in actual row data.

I worked on a major distributed system that originally replicated high-level changes and switched to replicating low-level changes for this reason. We had a system for detecting when replicas didn't match, and replication of high-level changes was the biggest reason for diffs. (Hardware error was the second biggest reason; we added a lot of checksumming because of that.)

benbjohnson 45 days ago [-]
I just open-sourced a streaming replication tool for SQLite called Litestream that does physical replication (raw pages) instead of logical replication (SQL commands). Each approach has its pros and cons. Physical replication logs tend to be larger than logical logs but I agree that you avoid a lot of issues if you do physical replication.

https://github.com/benbjohnson/litestream

mrkurt 45 days ago [-]
I've been following this, and am anxious for the direct-to-sqlite replication.

One of rqlite's big limitations is that it resyncs the entire DB at startup time. Being able to start with a "snapshot" and then incrementally replicate changes would be a big help.

otoolep 44 days ago [-]
rqlite creator here.

I'm not sure I follow why it's a "big limitation"? Is it causing you long start-up times? I'm definitely interested in improving this, if it's an issue. What are you actually seeing?

Also, rqlite does do log truncation (as per the Raft spec), so after a certain number of log entries (8192 by default) node restarts work exactly as you suggest. The SQLite database is restored from a snapshot, and any remaining Raft log entries are applied to the database.

mrkurt 44 days ago [-]
Ah, ok that's some nuance I didn't know about!

We're storing a few GB of data in the sqlite DB. Rebuilding that when rqlite restarts is a slow and intensive process compared to just reusing the file on disk.

Our particular use case means we'll end up restarting 100+ replica nodes all at once, so the way we're doing things makes it more painful than necessary.

otoolep 44 days ago [-]
But how do you know it's intensive? Are you watching disk IO? Is there a noticeable delay when the node starts before it's ready to receive requests?

Try setting "-raft-snap" to a lower number, maybe 1024, and see if it helps. You'll have much fewer log entries to apply on startup. However the node will perform a snapshot more often, and writes are blocked during the snapshotting. It's a trade-off.

It might be possible to always restart using some sort of snapshot, independent of Raft, but that would add significant complexity to rqlite. The fact that the SQLite database is built from scratch on startup, from the data in the Raft log, means rqlite is much more robust.

mrkurt 44 days ago [-]
Oh, we're reading the sqlite files directly. rqlite is really just a mechanism for us to propagate read-only data to a bunch of clients.

We need that sqlite file to never go away. Even a few seconds is bad. And since our replicas are spread all over the world, it's not feasible to move 1GB+ data from the "servers" fast enough.

Is there a way for us to use that sqlite file without it ever going away? We've thought about hardlinking it elsewhere and replacing the hardlink when rqlite is up, but haven't built any tooling to do that.

otoolep 44 days ago [-]
Hmmmm, that's a different issue. :-)

Today the rqlite code deletes the SQLite database (if present) and then rebuilds it from the Raft log. It makes things so simple, and ensures the node can always recover regardless of the prior state of the SQLite database -- basically the Raft log is the only thing that matters, and that is guaranteed to be the same on each node.

The fundamental issue here is that Raft can only guarantee that the Raft log is in consensus, so rqlite can rely on that. It's always possible that the copy of SQLite under a single node ends up in a different state than all the other nodes, because the change to the Raft log and the corresponding change to SQLite are not atomic. Blowing away the SQLite database means a restart fixes this.

If this is important -- and what you ask sounds reasonable for the read-only case that rqlite can support -- I guess the code could rebuild the SQLite database in a temporary place, wait until that's done, and then quickly swap any existing SQLite file with the rebuilt copy. That would minimize the time the file is not present. But the file has to go away at some point.

Alternatively rqlite could open any existing SQLite file and DROP all data first. At least that way the file wouldn't disappear, but the data in the database would wink out of existence and then come back. WDYT?
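
For the first option, a rough sketch of the rebuild-then-swap idea in Go (the paths and the rebuild hook are hypothetical, not rqlite code):

    package main

    import "os"

    // Sketch only: rebuild the SQLite database into a temporary file in the
    // same directory, then atomically swap it into place.
    func swapInRebuiltDB(dbPath string, rebuild func(path string) error) error {
        tmp := dbPath + ".rebuild"
        if err := rebuild(tmp); err != nil { // e.g. replay the Raft log into tmp
            return err
        }
        // os.Rename is atomic when source and target are on the same filesystem,
        // so dbPath always points at either the old file or the new one.
        return os.Rename(tmp, dbPath)
    }

One caveat: readers that already have the old file open keep reading the old copy until they reopen it, which may actually be the "never goes away" behaviour asked for above.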

e12e 44 days ago [-]
This may be hopelessly naive - but is the rebuilt sqlite db expected to be identical? Could a sha256 hash of the file be written into the raft log at intervals, and be used as a "valid" starting point?

Ie:

    log 1: empty db
    log 2..N: changes
    log N+1: snapshot at N
    log N+2..M: changes
You start with an empty db and replay up to N; or, if you have a db that matches the snapshot at N, start with that and only replay from N+2 on?
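
Computing such a digest is simple enough; a rough sketch (whether a rebuilt db really is byte-for-byte identical across SQLite versions is the real open question):

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "io"
        "os"
    )

    // Sketch: hash the on-disk SQLite file so the digest could be recorded in
    // the Raft log as a checkpoint marker.
    func dbDigest(path string) (string, error) {
        f, err := os.Open(path)
        if err != nil {
            return "", err
        }
        defer f.Close()
        h := sha256.New()
        if _, err := io.Copy(h, f); err != nil {
            return "", err
        }
        return hex.EncodeToString(h.Sum(nil)), nil
    }
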
otoolep 44 days ago [-]
That's not a bad idea -- it might have promise. Hashing large database files after every few log entries might be a performance hit... not sure.

Something to think about -- thanks.

ignoramous 35 days ago [-]
Snapshotting at intervals is how most point-in-time restores scale up.

The db is snapshotted at a preset time (easy way: lock the db for writes while the snapshot is in progress, preferably on a read replica) and then restored again (preferably elsewhere) to see if the restore did indeed work as expected (and isn't corrupt).

The latest log sequence number could also be stored in the same db (in a separate table) so that it is captured along with the db-snapshot, atomically.
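
A hedged sketch of that last point in Go, assuming the mattn/go-sqlite3 driver and a hypothetical replication_state table: the data change and the log sequence number land in one transaction, so a snapshot of the file can never capture one without the other.

    package main

    import (
        "database/sql"

        _ "github.com/mattn/go-sqlite3" // assumed SQLite driver
    )

    // Sketch: apply a data change and record the latest log sequence number in
    // the same transaction, so any snapshot of the file carries a matching LSN.
    func applyWithLSN(db *sql.DB, lsn int64, stmt string, args ...interface{}) error {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        if _, err := tx.Exec(stmt, args...); err != nil {
            tx.Rollback()
            return err
        }
        if _, err := tx.Exec(`INSERT INTO replication_state(id, lsn) VALUES (1, ?)
                ON CONFLICT(id) DO UPDATE SET lsn = excluded.lsn`, lsn); err != nil {
            tx.Rollback()
            return err
        }
        return tx.Commit()
    }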

mrkurt 44 days ago [-]
Rebuilding and then moving it in place sounds pretty nice to me.
m_sahaf 44 days ago [-]
> Today the rqlite code deletes the SQLite database (if present) and then rebuilds it from the Raft log.

Neat! This is reminiscent of the replay of event log in Smalltalk.

jlongster 44 days ago [-]
I built an app (https://actualbudget.com/) that uses a local sqlite db and syncs changes, and that's exactly how it works. It takes quite a different approach though, using CRDTs to represent changes and those are synced around. When a fresh client comes into play, it downloads the latest sqlite snapshot from a server and then syncs up.
pdimitar 43 days ago [-]
Consider adding family sharing, or just general making/joining of a group with shared expenses and accounts. My wife and I currently use such an app, but it lacks features and ergonomics.
CuriouslyC 45 days ago [-]
This is tangential, but depending on your sql needs, CouchDB's replication story is amazing, and you can replicate to the browser using PouchDB. There is an optional SQL layer, but obviously the good replication story comes with some trade-offs.
benbjohnson 45 days ago [-]
That can be painful for sure. Litestream will do a snapshot on startup if it detects that it can't pick up from where it left off in the WAL. That can happen if Litestream is shut down and another process performs a checkpoint. But generally a restart will just use the existing snapshot & continue with the WAL replication.
webmaven 45 days ago [-]
> I just open-sourced a streaming replication tool for SQLite called Litestream that does physical replication (raw pages) instead of logical replication (SQL commands). Each approach has its pros and cons. Physical replication logs tend to be larger than logical logs but I agree that you avoid a lot of issues if you do physical replication.

Hmm. Not having dug into your solution much, is it safe to say that the physical replication logs have something like logical checkpoints? If so, would it make sense to only keep physical logs on a relatively short rolling window, and logical logs (ie. only the interleaved logical checkpoints) longer?

benbjohnson 45 days ago [-]
I suppose you could save both the physical and logical logs if you really needed long-term retention. SQLite databases (and b-trees in general) tend to compress well, so the physical logging isn't as bad as it sounds. You could also store a binary diff of each physical page, which would shrink it even further.

One benefit to using physical logs is that you end up with a byte-for-byte copy of the original data so it makes it easy to validate that your recovery is correct. You'd need to iterate all the records in your database to validate a logical log.

However, all that being said, Litestream runs as a separate daemon process so it actually doesn't have access to the SQL commands from the application.

webmaven 38 days ago [-]
So, no logical checkpoints; logical logging would have to be a separate process.

The main reason I would want both is to be certain that I was restoring the DB to a consistent state, even if the application is doing things stupidly (eg. not using transactions correctly. Or at all).

I suppose timestamps would be enough to view the logs interleaved or side-by-side in some fashion.

paulryanrogers 44 days ago [-]
FYI, in PostgreSQL (Pgpool anyway) and MySQL that would be query replication. Logical replication often means the row data; binary replication often uses block-level data.
jgraettinger1 45 days ago [-]
Here's another approach to the problem [0]:

This package is part of Gazette [1], and uses a gazette journal (known as a "recovery log") to power raw bytestream replication & persistence.

On top of journals, there's a recovery log "hinting" mechanism [2] that is aware of file layouts on disk, and keeps metadata about the portions of the journal which must be read to recover a particular on-disk state (e.g. what are the current live files, and which segments of the log hold them?). You can read and even live-tail a recovery log to "play back" / maintain the on-disk file state of a database that's processing somewhere else.

Then, there's a package providing RocksDB with an environment that's configured to transparently replicate all database file writes into a recovery log [3]. Because RocksDB is a continuously compacted LSM-tree and we're tracking live files, it regularly deletes files, which allows for "dropping" chunks of the recovery log journal that would otherwise have to be read or stored in order to recover the full database.

For the SQLite implementation, SQLite journals and WALs are well-suited to recovery logs & their live-file tracking, because they're short-lived, ephemeral files. The SQLite page DB is another matter, however, because it's a very long-lived and randomly written file. Naively tracking the page DB means you must replay the _entire history_ of page mutations which have occurred.

This implementation solves this by using a SQLite VFS which actually uses RocksDB under the hood for the SQLite page DB, and regular files (recorded to the same recovery log) for SQLite journals / WALs. In effect, we're leveraging RocksDB's regular compaction mechanisms to remove old versions of SQLite pages which would otherwise have to be tracked / read & replayed.

[0] https://godoc.org/go.gazette.dev/core/consumer/store-sqlite

[1] https://gazette.readthedocs.io/en/latest/

[2] https://gazette.readthedocs.io/en/latest/consumers-concepts....

[3] https://godoc.org/go.gazette.dev/core/consumer/store-rocksdb

hinkley 45 days ago [-]
Do you use compression? And if so, how does that affect the relative amount of network traffic vs. logical replication?
benbjohnson 45 days ago [-]
Yes, Litestream uses LZ4 compression. I originally used gzip but the compression speed was pretty slow. B-tree pages tend to compress well because they're typically only 50-75% full: they need spare space to insert new records, and pages split when they get full.

I'm seeing files shrink down to 14% of their size (1.7MB WAL compressed to 264KB). However, your exact compression will vary depending on your data.
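
Not Litestream's actual code, but the general shape of stream-compressing a WAL segment with LZ4 in Go might look like this (assuming the github.com/pierrec/lz4 package):

    package main

    import (
        "io"
        "os"

        "github.com/pierrec/lz4/v4" // assumed LZ4 package
    )

    // Sketch: stream-compress a WAL segment before shipping it off-box.
    func compressWAL(srcPath, dstPath string) error {
        src, err := os.Open(srcPath)
        if err != nil {
            return err
        }
        defer src.Close()

        dst, err := os.Create(dstPath)
        if err != nil {
            return err
        }
        defer dst.Close()

        zw := lz4.NewWriter(dst)
        if _, err := io.Copy(zw, src); err != nil {
            return err
        }
        return zw.Close() // flushes the final frame
    }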

hinkley 44 days ago [-]
Ah, that makes sense. Most inserts don't split pages, so you're around n worst-case pages, but once in a while you get 2n pages where most of them are half full, and so compress better.

So how does that compare to logical replication? (Also I imagine packet size plays a role, since you have to flush the stream quite frequently, right? 1000 bytes isn't much more expensive than 431)

benbjohnson 44 days ago [-]
Litestream defaults to flushing out to S3 every 10 seconds, but that's mainly because of PUT costs. Each request costs $0.000005, so it comes to about $1.30 per month. If you flushed every second then it'd cost you $13/month.

Logical replication would have significantly smaller sizes, although the size cost isn't a huge deal on S3. Data transfer in to S3 is free and so are DELETE requests. The data only stays on S3 for as long as your Litestream retention specifies, so if you're retaining for a day then you're just keeping one day's worth of WAL changes on S3 at any given time.
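
For reference, the arithmetic behind those figures, as a sketch assuming S3 Standard's $0.005 per 1,000 PUT requests (actual pricing varies by region):

    package main

    import "fmt"

    // Rough cost estimate: number of PUTs per 30-day month times per-PUT price.
    func main() {
        const perPut = 0.000005           // $0.005 per 1,000 PUTs
        const secondsPerMonth = 30 * 24 * 3600
        for _, interval := range []float64{10, 1} {
            puts := secondsPerMonth / interval
            fmt.Printf("flush every %.0fs: %.0f PUTs/month, about $%.2f\n",
                interval, puts, puts*perPut)
        }
    }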

tyingq 45 days ago [-]
Dqlite replicates at the VFS layer of sqlite, which sounds like what you're looking for. https://github.com/canonical/dqlite
hinkley 45 days ago [-]
I haven’t gotten a straight answer out of the k3s people about why they dumped dqlite, just some comment about bugs.

I could see myself using dqlite in the future, so I’d like some more user reports from the trenches. Can anyone shed some light on this?

tyingq 45 days ago [-]
The initial issue seems to be saying that it's because they need to have etcd anyway, so consolidating on that removes a dependency. Which fits their simplicity goal. Though the issue appears to have been created by a user, not a maintainer.

"Since the HA direction needs etcd anyway.. I'm proposing dropping support for sqlite as the default embedded non-HA option and switch to embedded etcd as the default. This will reduce overall effort of maintainability of two entirely different datastores."

https://github.com/k3s-io/k3s/issues/845

hinkley 45 days ago [-]
I accepted that reason for them, but as I don’t benefit directly from switching to etcd, I’d rather know about what started the conversation than how it was concluded.
merb 44 days ago [-]
dqlite support was flaky and went through their translation layer, which probably added complexity.
otoolep 44 days ago [-]
rqlite creator here.

I understand what you're saying, but I don't think it's a compelling objection. Obviously, differences between versions -- even patched versions -- can result in subtle, unintended differences in how the code works for a given program. But there is no reason to think a system that operates at a lower level ("replace key/block X, which should have contents C_x, with C_x'") is less vulnerable to this kind of issue than one that operates at a higher level, i.e. statement-based replication, which rqlite uses. In fact I would argue that the system that operates at the higher level of abstraction is less vulnerable, i.e. less likely to care about the subtle changes.

lrossi 45 days ago [-]
If you replicate low level changes, you might not be able to do a live upgrade/downgrade if the version change affects the on disk format.

Another downside is that you might also propagate data corruption in case of bugs in the DB software (e.g. memory corruption) or hardware defects.

scottlamb 45 days ago [-]
> If you replicate low level changes, you might not be able to do a live upgrade/downgrade if the version change affects the on disk format.

It certainly requires care to ensure all the replicas have software capable of understanding the new format before it's actually written, but it can be done. Likewise after writing the new format, you want to have a roll-back plan.

In SQLite's case, https://sqlite.org/formatchng.html says: "Since 2004, there have been enhancements to SQLite such that newer database files are unreadable by older versions of the SQLite library. But the most recent versions of the SQLite library should be able to read and write any older SQLite database file without any problems." I don't believe upgrading SQLite automatically starts using any of those enhancements; you'd have to do a schema change like "PRAGMA journal_mode=WAL;" first.

> Another downside is that you might also propagate data corruption in case of bugs in the DB software (e.g. memory corruption) or hardware defects.

This happens regardless.

xd 45 days ago [-]
In MySQL/MariaDB this is what's known as non-deterministic behaviour, so row-based or mixed replication is used to mitigate it.

Statement-based (high-level) replication is very useful for e.g. "insert into tbl0 select col0 from tbl1 order by col1", as you only need to send the query, not the individual row data.

ttul 45 days ago [-]
Sounds like you need to submit a patch!
scottlamb 45 days ago [-]
lol, this is an unhelpful reflex answer to any criticism of open source software. What I'm describing is a redesign. There's no point in submitting a patch for that. It'd be much more effort than a few lines of code, and it wouldn't be accepted. Open source means that anyone can fork. It doesn't mean that maintainers will automatically merge patches replacing their software with completely different software. The only way that will happen is if the maintainers decide for themselves it needs to be redesigned, and that starts with discussion rather than a patch. It's also a long shot compared to just finding different software that already has the design I prefer.

If I want replicated SQLite, I'll look at dqlite or litestream instead, which sound more compatible with my design sensibilities. (Thanks, tyingq and benbjohnson!)

monadic3 45 days ago [-]
Frankly your bad-faith commentary isn't helping the conversation either. I sincerely appreciate your cleaning up the tone at the end.
mech422 45 days ago [-]
So it looks like you can now distribute SQLite at the:

Stmt level: https://github.com/rqlite/rqlite

VFS Level: https://github.com/canonical/dqlite

Block Level: https://github.com/benbjohnson/litestream

Really cool enhancements to an awesome project!

blackbear_ 45 days ago [-]
I know nothing of consensus algorithms and distributed systems so bear with me please.

> rqlite uses Raft to achieve consensus across all the instances of the SQLite databases, ensuring that every change made to the system is made to a quorum of SQLite databases, or none at all.

What I understood from this sentence is that, if we have three instances, rqlite will make sure that every change is written to at least two. But what if two changes are written to two different pairs of instances? Then the three instances will have three different versions of the data. For example, change X is written to instances A and B, and change Y is written to B and C. Now A has X, B has X and Y, and C has Y only. How do you decide who is right?

whizzter 45 days ago [-]
The semantics of Raft have a "simple" (compared to the harder-to-understand Paxos) forward motion of events that is supposed to guarantee that you won't get into weird states regardless of whether any particular node(s) go down (I think it can survive (N/2)-1 dead machines in a cluster of N).

Raft is based on having a leader decide what the next COMMIT is going to be, so B could never have X and Y at the same time (they could both be queued but other mechanisms could reject them).

Also data is not considered committed until more than half the cluster has acknowledged it (at which point the leader will know it and handle going forward), leader election also works in a similar way iirc.

As others mentioned, the visualization on https://raft.github.io/ is really good (you can interact with it to create commits and control downtime of machines).

hinkley 45 days ago [-]
It’s 1/2 + 1 isn’t it? So if the leader goes down at the exact moment of quorum, you can still get quorum again.

That would mean in 3 servers you need 2.5 aka 3 machines to commit a change. Then 4/5, 5/7, 6/9, 7/11. And I think it’s a wash anyway, because as the servers go up the fraction you need for quorum goes down, but the odds of falling behind or failing outright go up too. Not to mention the time during which 1/n machines are down due to an upgrade gets longer and longer the more machines you have, increasing the chances of double fault.

teraflop 44 days ago [-]
> It’s 1/2 + 1 isn’t it?

> That would mean in 3 servers you need 2.5 aka 3 machines to commit a change. Then 4/5, 5/7, 6/9, 7/11.

No, the requirement isn't 1/2 + 1. Any strict majority of the cluster is enough to elect a leader. So you need 2/3, or 3/4, or 3/5, and so on.

> Not to mention the time during which 1/n machines are down due to an upgrade gets longer and longer the more machines you have, increasing the chances of double fault.

Generally, this is not the case. If individual machine failures are random and equally probable, and if each machine is down on average less than 50% of the time, then adding more machines makes things better, not worse. (This is a basic property of the binomial distribution.)

Of course, if you have a single point of failure somewhere -- e.g. a network switch -- this assumption can be violated, but that's true regardless of how many machines you have.
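
To make the binomial point concrete, a rough sketch (the 1% per-node downtime is a hypothetical figure): it computes the chance that a majority of n nodes is down at once, which shrinks as n grows whenever per-node downtime is below 50%.

    package main

    import (
        "fmt"
        "math"
    )

    // Probability that k or more of n independent nodes are down, each down
    // with probability p at any instant (binomial tail).
    func pAtLeast(n, k int, p float64) float64 {
        total := 0.0
        for i := k; i <= n; i++ {
            c := math.Gamma(float64(n+1)) /
                (math.Gamma(float64(i+1)) * math.Gamma(float64(n-i+1)))
            total += c * math.Pow(p, float64(i)) * math.Pow(1-p, float64(n-i))
        }
        return total
    }

    func main() {
        p := 0.01 // hypothetical: each node is down 1% of the time
        for _, n := range []int{3, 5, 7} {
            majority := n/2 + 1
            fmt.Printf("n=%d: P(quorum lost) = %.2e\n", n, pAtLeast(n, majority, p))
        }
    }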

hinkley 44 days ago [-]
If the leader is down (the scenario you clipped out in your response) you need a strict majority with an even number of machines.
hinkley 44 days ago [-]
This is right for the wrong reason. See follow-up down-thread.
simtel20 44 days ago [-]
> It’s 1/2 + 1 isn’t it?

The parent post is talking about the number that can go down while maintaining quorum, and you're talking about the number that need to remain up to maintain quorum. So you're both correct.

However:

> That would mean in 3 servers you need 2.5 aka 3 machines to commit a change.

That seems wrong. You need N//2 + 1, where "//" is floor division, so in a 3-node cluster you need 3//2 + 1 = 1 + 1 = 2 nodes to commit a change.
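
In code, that's simply (a trivial sketch):

    package main

    import "fmt"

    // quorum returns the minimum number of nodes that must agree: floor(n/2) + 1.
    func quorum(n int) int {
        return n/2 + 1
    }

    func main() {
        for _, n := range []int{3, 4, 5, 7, 9} {
            fmt.Printf("cluster of %d -> quorum %d (tolerates %d failures)\n",
                n, quorum(n), n-quorum(n))
        }
    }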

hinkley 44 days ago [-]
I think I see the problem.

'Simple majority' is based on the number of the machines that the leader knows about. You can only change the membership by issuing a write. Write quorum and leadership quorum are two different things, and if I've got it right, they can diverge after a partition.

I'm also thinking of double faults, because the point of Raft is to get past single fault tolerance.

[edit: shortened]

After a permanent fault (broken hardware) in a cluster of 5, the replacement quorum member can't vote for writes until it has caught up. It can vote for leaders, but it can't nominate itself. Catching up leaves a window for additional faults.

It's always 3/5 for writes and elections; the difference is that the ratio of original machines that have to confirm a write can go to 100% of survivors, instead of 3/4 of reachable machines. Meaning network jitter and packet loss slow down writes until it recovers, and an additional partition can block writes altogether, even with 3/5 surviving the partition.

teraflop 45 days ago [-]
In brief: at any point in time, one of the replicas is a "leader" which controls the order in which operations are committed. The changes occur in a defined sequence, and other replicas may lag behind the leader, but cannot be inconsistent with it.

Your example can't happen, because if (for instance) A is the leader, then C will not apply change Y without contacting the leader, which will tell it to apply X first.

If you want more details about how this handles all the edge cases -- for instance, what happens if the leader crashes -- the Raft paper is quite accessible: https://raft.github.io/raft.pdf

hinkley 45 days ago [-]
TL;DR: Raft updates are serialized (as in sequential).
edoceo 45 days ago [-]
Raft consensus: https://raft.github.io/

Surprisingly easy to understand, and a cool visual.

jasonjayr 45 days ago [-]
IIRC, a monotonic counter is involved. The odd one out will realize it's behind the highest sequence number and discard its updates to resync with the majority consensus.

Edit: http://thesecretlivesofdata.com/raft/ if you have some time seems to be a good step by step explanation on how it works in detail.

adsharma 45 days ago [-]
The visual is indeed cool. I also thought it'd be nice to use a chat-like interface to learn Raft.

    Alice: /set a 1
    Alice: /set b 2
    Bob: /status

Etc

https://github.com/adsharma/zre_raft https://github.com/adsharma/raft

Bug reports welcome

mshenfield 45 days ago [-]
Going from a local db to one over a network has at least one risk. The SQLite docs give developers the okay to write "n+1"-style queries (https://www.sqlite.org/np1queryprob.html). When the db is on the same file system as the application this pattern is fine, but as soon as you add a network call it becomes a potential bottleneck.
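
To illustrate, a hedged sketch with a hypothetical posts/comments schema, using database/sql: each iteration of the second loop is a separate query, which is cheap against a local file but becomes one network round trip per row against a remote database.

    package main

    import "database/sql"

    // The "n+1" pattern: one query for the parent rows, then one query per row.
    // Table and column names are hypothetical.
    func commentCountsNPlusOne(db *sql.DB) (map[int64]int, error) {
        rows, err := db.Query(`SELECT id FROM posts`)
        if err != nil {
            return nil, err
        }
        var ids []int64
        for rows.Next() {
            var id int64
            if err := rows.Scan(&id); err != nil {
                rows.Close()
                return nil, err
            }
            ids = append(ids, id)
        }
        rows.Close()
        if err := rows.Err(); err != nil {
            return nil, err
        }

        counts := make(map[int64]int, len(ids))
        for _, id := range ids { // one query -- and one round trip -- per post
            var n int
            err := db.QueryRow(`SELECT COUNT(*) FROM comments WHERE post_id = ?`, id).Scan(&n)
            if err != nil {
                return nil, err
            }
            counts[id] = n
        }
        return counts, nil
    }

Over a network you'd collapse this into a single joined query with GROUP BY.
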
hermitcrab 45 days ago [-]
So if:

Alice, Bob and Charlie have a synced copy of the same database

Charlie goes on a plane and adds loads of records without a connection to the other databases

Alice and Bob make no changes

Charlie comes home and syncs

Will Charlie lose all his changes, as his database is different to Alice and Bob's?

What happens if Alice, Bob and Charlie all make changes offline and then resync?

teraflop 45 days ago [-]
As per the description, all updates must be replicated to a quorum of instances. If Charlie is on a plane without the ability to contact a quorum, he can't add records in the first place.

This is a fundamentally necessary tradeoff to provide strong consistency, as described by the CAP theorem.

unnouinceput 45 days ago [-]
In which case Charlie will have an additional local DB to record those records, and when he gets back will use another protocol/method/system/whatever to add those new records? How about if everybody goes and adds records to the same table?

Here is a real-life scenario that I had to deal with in the past. A technician (carpenter) goes to a client's home to repair furniture in the middle of nowhere, so no internet. He adds the necessary paperwork -- pictures, declarations, the signed and scanned contract -- to the Clients table. This company employed hundreds of such technicians throughout the many counties of Germany, each with a laptop running this app, which was the backbone for getting paid and doing the work. And it was not uncommon to have more than one carpenter go to a client's home and do the repairs. Since each carpenter was paid according to his own work, each of them would create entries in their local Clients table, and when they got back to HQ their work was manually uploaded to the central DB; only after that did they get paid. I automated that (that was the job: to eliminate the thousands of hours the carpenters were wasting on manual uploads).

So given the above scenario, how is this system going to achieve that? Same table, and same client details even in table Clients, just different rows for different carpenters (foreign key to table Carpenters).

renewiltord 45 days ago [-]
This tool does not handle that problem. It is not meant to. It's for simultaneously available replicas. And this is the rare moment where thinking about replication vs synchronization as different is worthwhile.

You usually replicate for failure tolerance and performance (this project only aims for the former).

vorpalhex 44 days ago [-]
As other commenters have mentioned, this tool is not intended for that kind of case. You want a tool like PouchDB, which handles this kind of setup but has a different set of tradeoffs (it's eventually consistent, not strongly consistent).
wtallis 45 days ago [-]
> So given the above scenario, how is this system going to achieve that?

I don't think it is. You're describing a use case that is distributed but explicitly does not want to enforce consistency—you want offline workers to all be able to keep working, and you're enforcing consistency after the fact and outside of the database itself.

fipar 45 days ago [-]
I have not read the full description of this project yet, but it does mention the use of Raft for consensus, so in your example I would expect Charlie to not be able to add any records while disconnected, because, if my understanding is correct:

- Charlie would either be the leader, but then without getting confirmation of writes from enough followers he would not be able to do any writes himself, or

- Charlie would be a follower, and while disconnected would obviously get no writes from the leader.

mrkurt 45 days ago [-]
It doesn't work that way. rqlite is effectively a leader/follower model that uses raft for leader consensus. Writes can only happen online, and only to the leader.
hermitcrab 45 days ago [-]
Ok, thanks!
NDizzle 45 days ago [-]
What's messed up is that I was doing this kind of thing with Lotus Domino in the late 90s. I'm sure others were doing it before me, too.

Sometimes you had conflicts that needed resolution, but those weren't that frequent for our use case.

adsharma 45 days ago [-]
Charlie doesn't have to lose the data he saved on the plane. Don't know what the rqlite implementation does.

In the second case, Alice-Bob consensus overrides Charlie

alberth 44 days ago [-]
@otoolep

SQLite has a great post on “When to Use” (and not use) SQLite.

Would be great if you included these same use cases in the README docs and made it clear whether rqlite can address them.

https://www.sqlite.org/whentouse.html

ClumsyPilot 45 days ago [-]
I think microk8s uses this to form a cluster, and k3s used to use it but moved back to etcd.

Would be good to hear from someone who used it what are the pros and cons of such a setup

fasteo 45 days ago [-]
AFAIK, microk8s uses a similar - but not this - form of distributed sqlite. Specifically, it uses dqlite[1] "a C library that implements an embeddable and replicated SQL database engine with high-availability and automatic failover."

[1] https://github.com/canonical/dqlite

tyingq 45 days ago [-]
Probably worth mentioning that Canonical initially made dqlite to be the backing store for LXD. It uses the sqlite VFS as the client entry point, so it's a very easy transition for an existing sqlite app, just recompile with the new header.
ericlewis 45 days ago [-]
Expensify had a version of something like this back in like 2013/14 I think.
moderation 45 days ago [-]
The project is BedrockDB [0] and has been previously discussed [1].

0. https://bedrockdb.com/

1. https://news.ycombinator.com/item?id=12739771

ericlewis 44 days ago [-]
Nice! Though the blockchain part is new to me. Interesting they kept growing this.
peter_d_sherman 44 days ago [-]
First of all, great idea, and a brilliant and highly laudable effort!

Favorited!

One minor caveat ("Here be Dragons") I have (with respect to my own future adoption/production use), however:

https://github.com/rqlite/rqlite/blob/master/DOC/FAQ.md

>"Does rqlite support transactions?

It supports a form of transactions. You can wrap a bulk update in a transaction such that all the statements in the bulk request will succeed, or none of them will. However the behaviour of rqlite is undefined if you send explicit BEGIN, COMMIT, or ROLLBACK statements. This is not because they won't work -- they will -- but if your node (or cluster) fails while a transaction is in progress, the system may be left in a hard-to-use state. So until rqlite can offer strict guarantees about its behaviour if it fails during a transaction, using BEGIN, COMMIT, and ROLLBACK is officially unsupported. Unfortunately this does mean that rqlite may not be suitable for some applications."

PDS: Distributed transactions are extremely difficult to get exactly right -- so I'm not trying to criticize all of the hard work and effort that everyone has put into this (again, it's a great idea, and I think it has a terrific future).

But Distributed Transactions are what differentiate something like rqlite from, say, something like CockroachDB (https://www.cockroachlabs.com/docs/stable/architecture/life-...).

Of course, CockroachDB is a pay-for product with an actual company with many years of experience backing it, whereas rqlite, as far as I can intuit, at this point in time (someone correct me if I am wrong), appears to be a volunteer effort...

Still, I think that rqlite despite this -- has a glorious and wonderful future!

Again, a brilliant and laudable effort, suitable for many use cases presently, and I can't wait to see what the future holds for this Open Source project!

Maybe in the future some code-ninja will step up to the plate and add fully guaranteed, safe, distributed transactions!

Until then, it looks like a great idea coupled with a great software engineering effort!

As I said, "Favorited!".

fnord123 45 days ago [-]
FoundationDB and Comdb2 also use sqlite as a storage engine. Curious that they decided to implement yet another one.

https://www.foundationdb.org/

http://comdb2.org/

tyingq 45 days ago [-]
Rqlite appears to predate comdb2.
rapsey 44 days ago [-]
Literally the first sentence.

> Comdb2 is a relational database built in-house at Bloomberg L.P. over the last 14 years or so.

rqlite is not 14 years old.

tyingq 44 days ago [-]
I was looking at the GitHub repo history. Was it publicly visible sooner than that would imply?
tyingq 44 days ago [-]
Answering my own question, Comdb2 was made available to the public on 1 January 2016, well after rqlite launched.
Conlectus 44 days ago [-]
One thing that jumps out at me after reading a lot of Jepsen analyses -- does rqlite assume that partitions form equivalence relations? That is, that every node belongs to one and only one partition group? This is not always the case in practice.
yjftsjthsd-h 44 days ago [-]
So the case of A can talk to B, B can talk to C, but A can't talk to C? (Making sure that I understand how you can be in multiple partitions)
Conlectus 44 days ago [-]
That's right, yes. This is apparently possible (if not common) given the networking setups of a lot of data centers.
jchrisa 45 days ago [-]
I'm curious how this relates to the Calvin protocol as implemented by FaunaDB. They both use Raft, but FaunaDB and Calvin have additional details about how transactions are retried and aborted. https://fauna.com/blog/consistency-without-clocks-faunadb-tr...
f430 44 days ago [-]
Could you use this to build a decentralized p2p app? If so, what gotchas and limitations are there?
otoolep 44 days ago [-]
No, rqlite is not suitable for that kind of application. All writes must go through the leader.
foolinaround 45 days ago [-]
We currently use browsers on several devices (both laptops and android) and rely on google sync currently. Maybe this could be used to sync bookmarks, history etc across my devices but still keep my data local to me?
JoachimSchipper 45 days ago [-]
This uses Raft, so a quorum of devices would need to be online at the same time. That's not what you want for browser sync.
szszrk 45 days ago [-]
rqlite is mentioned here quite often, multiple times last year. I don't think this entry brings anything new.