Hacker News new | comments | show | ask | jobs | submitlogin
Design Review: Key-Value Storage (mozilla.github.io)
305 points by espeed 4 months ago | hide | past | web | 85 comments | favorite

The page writes that LMDB "is exceptionally fast for our kinds of load", and then links to an in-memory microbenchmark: http://www.lmdb.tech/bench/inmem/

Aren't they interested in persistence of the key-value data? In my experience, once data is persisted to disk or SSD, LMDB is way slower from alternatives because it needs to operate in synchronous mode to avoid corruption (effectively flushing to disk after after every transaction committed). If operated in the non-default MDB_NOSYNC mode (which is the mode chosen in the above benchmarks), then there is a high probability to be left with an unreadable database file after a crash, thus losing all your data.

It is not fair to compare with other databases in sync mode, since they might operate safe but faster in async mode. For example sqlite with PRAGMA journal_mode=WAL and PRAGMA synchronous=NORMAL can operate in semi-asynchronous mode (fsync()ing sporadically) without fear of corruption in case of crash, because it keeps a WAL journal and is able to properly roll-back after a crash. This should be much faster than LMDB's default-and-safe synchronous mode, that msync()s on every value written.

Concidentally, I'm in the middle of testing out LMDB as replacement for SQLite - we've been running fine for a year with journal_mode=WAL, but synchronos=NORMAL without issues, and just added last week LMDB, but with MDB_NOSYNC... hmm... so far our real testing showed (for our use case) almost same results, except some 95-99% percentiles where SQLite goes much slower (but due to other factors). In synthetic benchmarking (suiting our needs) the difference is miniscule..

If you are using MDB_NOSYNC in production, keep in mind that the trade-off for the boost in performance is that a system crash can lead to an unreadable database.

Yup, we had SQLITE in the same fashion (synchrnous=OFF, and other non-safe settings, no WAL also) and had occasional crashes, then someone had to manually delete the database (in our case it's for a desktop app, a tool, so no server-like requirements, yet we want it to be stable).

Would remove the MDB_NOSYNC then, and see how it goes...

We have only ever compared LMDB in synchronous mode to other DBs in synchronous mode, and LMDB in asynch mode to other DBs in asynch mode. Come on, that's too obvious.

And LMDB beats the crap out of SQLite, in any mode. http://www.lmdb.tech/bench/microbench/

Replacing SQLite's Btree engine with LMDB makes the SQLite footprint smaller, faster, and more reliable too. https://github.com/LMDB/sqlightning

> And LMDB beats the crap out of SQLite, in any mode. http://www.lmdb.tech/bench/microbench/

Alright, I went through this page to see why my experience is different. While it mentions that they enabled SQLite's WAL journal, it also mentions that the synchronous writes were performed with PRAGMA synchronous=FULL.

I believe that if they set PRAGMA synchronous=NORMAL, they will get an ACI-reliable database that is way faster than LMDB with write-to-disk workloads.

Most of the measurements on that page are not really useful to me; they do not persist the data (db on tmpfs) or they do not care about reliability, using dangerous settings. Only few measurements are both persisting the data and writing safely, but the SQLite configuration is sub-optimal for them.

It's not that simple. What you mean by saying "(a)synchronous mode" is very different from database to database. See my above comment. Is SQLite synchronous or asynchronous if you configure it as journal_mode=WAL and PRAGMA synchronous=NORMAL?

For my purposes as an application developer, I care about comparing databases operating in safe mode i.e. a system crash should never cause total data loss. According to my experience SQLite's safe mode is many times faster than LMDB's safe mode, with a write workload. While LMDB is thrashing the disk and achieves only a handful of write transactions per second.

SQLite's safe mode is not comparable to LMDB's; SQLite is vulnerable to silent data loss in a crash.


LMDB is not.

Isn't this from the paper "all filesystems are not created equal"? [1] If you search the tables for "sqlite-wal" you will see that it shows zero vulnerabilities.

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14...

Ah, the earlier result was for SQLite-rollback, their default mode. IMO any DB should default to its safest mode.

I have been comparing multiple key-value stores and lmdb has been the simplest and fastest to use with or without MDB_NOSYNC. It beats almost every kv store in writes.. and in reads it is untouchable by anything by a large magnitude.

Agreed for the read workloads, I have the same experience. It was designed primarily for an LDAP directory after all, an application that is very read-heavy.

Where on this page does it say that LMDB was run with MDB_NOSYNC? Tests were run with the DB on a tempfs and then again (lower down on the page) on ext4.

Section 4:

"using a 512GB Samsung 830 SSD and an ext4 partition.

The actual drive characteristics should not matter because the test datasets still fit entirely in RAM and are all using asynchronous writes"

How does this compare to sstable?

Didn't they review badger from dgraph.


Here is the comparison back in 2017: https://blog.dgraph.io/post/badger-lmdb-boltdb/

It supports concurrent ACID transactions with serializable snapshot isolation (SSI) guarantees.

Hopeful this goes better than WebSQL. WebSQL ran into standardization problems because they started with an existing implementation (SQLite) instead of a spec as their base implementation--meaning anyone making a greenfield implementation would need to implement all the quirks of SQLite to be compatible with other implementations. Sadly it died as a standard and we were stuck with either local storage or indexdb.

There is no intent to standardize this or expose it to the web. This is purely for use internal to Firefox. Of course that may end up meaning that it is used as the internal storage for some web feature.

The first sentence on that page says "We propose the standardization of a simple key-value storage...usable from JS, Java, Rust, Swift, and C++"

I assumed this meant an API callable from webasm/js. Did I miss something?

Ah, found my error:

"Not-yet or never goals for this proposal are:

Standardization via a standards body as a web API."

So this is "internal" stuff I guess.

Poor wording perhaps. There are a bunch of places in Firefox that use key-value stores, this proposes standardising on one type of store.

I'm really surprised LSM trees didn't get more commentary. 'All are targeted at server workloads' - sure, but they're also incredibly popular and appear to be as close to the 'one-size-fits-all solution for storage' as we've found.

Also typically more complicated and require a separate compaction process. They're good for writing lots of data, but not so great for random reads.

Author of Badger here. Our design of separating keys and values has gotten us incredibly fast writes, while still keeping the read latencies neck-to-neck against B+ trees. Worth checking out: https://github.com/dgraph-io/badger

Badger looks nicely done! Did you end up needing to change much in implementing Badger from what was described in the WiscKey paper?

Badger is really good, i'm using it.

This looks very interesting. Do you have any builtin compression or tips on compressing data?

Nice, why the choice of go ?

In case someone involved in this reads this thread: the document does not specify which LMDB version you tested. I suggest you run your tests with the `mdb.master` branch, i.e. the work towards a future 1.0, and not the stable 0.9 branch. The answers to several of your interrogations will depend on that: with `mdb.master` you can use the VL32 mode which greatly improves usage on 32 bit platforms, and Windows support is much better.

Regarding NFS: I have recently started testing LMDB on NFS v4 and had no issues so far, but with a single process using the database. AFAIK the warning at http://www.lmdb.tech/doc/ is only for multiple processes using the DB concurrently. I am still not entirely sure there won't be any mmap-related issues, but so far so good.

Regarding "being careful": this is a very important point. The LMDB API does not hold your hand, it lets you do dangerous things which will result in corruption of your database, which you will discover too late. I suggest writing a wrapper around the API to ensure you are using it correctly. (I wish there was a compile flag like LUA_USE_APICHECK [1] for LMDB, which could help detect problems like this, but there isn't.)

[1] https://www.lua.org/manual/5.3/manual.html#4

The main reason for that warning about NFS is that people will try it, see that it seems to work, and then get careless and try to use the same DB from two different hosts at once. It's inevitable when you're working on files living on networked filesystems, and it cannot work. NFS doesn't offer any cache coherency guarantees, and the mutexes used for synchronizing writers only work on the host that created them.

Not sure what you're talking about re: the API letting you corrupt your database.

If we're using lmdb for something that's largely or entirely read only, how viable is using it on NFS with (shudder) lock files or something like that?

If it's 100% read-only, you could probably use it safely. Make sure its filesystem on the NFS server is mounted read-only, and obviously all the clients' NFS mounts must also be read-only. As for "largely read only" - if there are any writers at all, all bets are off.

LMDB automatically detects read-only filesystems, and turns off its locking in that case, so it should perform as well as anyone could expect NFS to perform.

Things like closing DBIs, transactions and cursors in the wrong order, for instance.

That won't corrupt the DB. It might corrupt the heap of your current process, but the DB won't be harmed.

C doesn't have destructors. If you free stuff in the wrong order, you lose. That's nothing to do with LMDB's API. Even if LMDB cleaned everything up internally, you'd still have dangling pointers in your app space. No language or API can prevent that.

The only thing you can do in the API that might possibly corrupt the DB is to use MDB_RESERVE and then store a value bigger than the space you reserved. That generally causes a SEGV, and you'll discover very quickly that you have a bug in your code. LMDB will fail fast, every time, and every failure will be a bug in your own code. Makes debugging very quick.

Glad to hear that.

To be honest the only real corruption issues I had with LMDB in practice were dues to me using MDB_NOSYNC, and broken mmap behavior on some Android devices (on external storage).

I have a layman question if somebody could please answer. I have never in my entire life seen databases fail. But db failures and issues seem to be brought up all the time. Now I understand that part if this maybe the cost function associated with them. But I'm sure there's also something that I have no clue about. So my questions are:

1) what kind of problems do databases actually face.

2) what kind of scenarios create those problems.

3) how does a programmer go about testing them?

The easiest scenario to imagine is a hardware failure or power outage. The database was in the middle of doing something, and then was prevented by a hard drive dying or the lights going out. One way to test such a thing is to literally unplug the computer to see how it handles the failure.

So, let's say you have a client/server application... the client is telling the server (database) to write some records to the database. In the middle of the write, you pull the plug. Some questions you'd want to know: what does the database look like when it restarts? Can we read it? What is the current state? Did any of the new data get written? What does the client think was written? If there was an uncommitted database transaction, was the database left unaltered?

It's just as important to test the client in these scenarios. While the server may have crashed, what does the client think happened? Was it waiting for an ACK or "OK" message? Did it get the message? If the update failed, what does the client do in that situation?

Things can get even more complicated if you're thinking of replication across different servers. If one of the servers fails, how does the replication work? Do sessions fail over to other servers? How many servers are required? If there was a corrupted record, did it propagate or was it scrubbed?

Thank you for explaining so well and clearly!

To you and others, are there any other scenarios too that happen in production?

Enough material here to scare anyone about databases


Some things I have seen in production:

The disk become inoperative during a write, this can be either silent or writes start to return errors. Again, how does the database look like after the problem is solved.

A large operation exceeds the capacity of the server to deal with intermediary state. It runs out of memory, disk, or in some not great DBs it loses control of some locks and gets deadlocked. Can it recover with only the partial log data?

Disks lie about data being written, what happens if one of the problems happen between the disk saying the data was written and it actually getting written?

And, of course, when you move beyond a single server things get way more complex.

You'd be surprised by how often I've seen a database fail in prod simply because it ran out of disk space. In both cases monitoring software was running but misconfigured.

We ship a consumer application with databases in it (multiple). Even the ACID ones fail all the time, maybe 1-3% of our userbase has had a corruption at some point.

Can you talk about the reasons of the failure?

This doc says LevelDB has no transacion, thats not true, they have batch writes and LEVEL-DB is not implemented in Go. Its implemented in CPP

> LEVEL-DB is not implemented in Go. Its implemented in CPP

There is more than one implementation of LevelDB. This one (https://github.com/syndtr/goleveldb), in Go, is used in major projects such as https://github.com/ethereum/go-ethereum.

Transactions let you perform any series of SQL commands with various expectations around data safety and locking guarantees depending on isolation level.

Batch writes provide a tiny subset of the full possibilities of transactions. While sufficient in many cases, that cannot be generalized to "LevelDB supports transactions".

That's just your pet definition of transaction. That is not universally accepted.

Atomicity and consistency. The classic case is balance transfer. Batch writes only cope with this to the degree that you can model your computation purely in database primitives that can execute server-side as part of the batch. If your computation involves anything else, such as external data or a computation that can't be modelled, then microbatches aren't good enough: you risk lost updates and inconsistent reads.

Universal acceptance is not a valid criterion for human beings at scale.

Schizophrenic people don’t accept facts like “this wall exists”, yet we still reach agreement as a society that a wall exists and don’t try to walk through walls because they exist.

You are of course welcome to hold any belief you wish, but believing something that puts you in direct contradiction to an entire industry significantly raises the bar of proof you must provide in order to convince others to listen to you.

Your reply does not provide that proof, and thus your argument is not persuasive.

I would love to see an example of someone actually knowledgeable about databases who has a different definition for transaction.

Atomicity, consistency, isolation, durability.


But that's just like, the industry's opinion, man.

Maybe I completely misunderstood the parents description of a transaction, but that was exactly what they were saying.

This is a great example of how to conduct a technical design review.

I see their have some questions (like how good is the windows and android support) that are not answered? or only internally? I think will be good to see what them found.

Windows and Android are fully supported, have been for years. As are iOS and MacOSX and all the BSDs.

Thats great to know, and I suspect that. But what I'm saying is that them ask some valid questions but where are the results of their findings?

Because if I read this, I could conclude LMDB could have troubles in that areas...

Curious about how this compares to TiKV...

It really sounds like they should take a look at this:


"One appealing aspect of LMDB is its relative ease of use from multiple processes, above and beyond its basic capabilities as yet-another-fast-key-value-store."

simdb is only lock free and thread safe. While LMDB is benchmarked at around 10k writes, this should be able to do millions of mixed reads and writes with 4 modern cores. LMDB seems to use a separate lock file to sync multiple threads/processes. The only catch here is that the keys aren't sorted, which doesn't seem to be a requirement of theirs.

LMDB is awesome, can't tell how many times I've wished to have access to it from the browser...

How does this compare to Redis?

Redis isn't embeddable, but a standalone server application, and thus it's not really the same space.

Thank you for asking this question. I don't know a lot about databases, but I would have thought redis, as a major key-value store, would have been part of this evaluation. So your answers were helpful.

It actually stores data.

Bring back Mork!


Mork actually sucked even as a key-value store. It's only decent if your requirements are a) only lookup on a fixed, autoincrement integer ID, b) the only operation you're likely to do is load an entire record or store an entire record at once, and c) parallelism is not in your vocabulary.

Disclaimer: I'm one of the last people to make functional changes to mork.

That's not a "Disclaimer". It's actually the polar opposite... a "Claimer"? ... I think "Source" is the best word to use there.

Hah! :) I've been meaning to call out other misuses of "Disclaimer" vs. "Disclosure" before, but I usually don't because it's just tedious.

NB: NB works for me.

Seems like an instance of reinventing the wheel ...

From the article:

"We propose ‘buying’, not building, the core of such a solution, and wrapping it in idiomatic libraries that we can use on all platforms.

We propose that LMDB is a suitable core (see Appendix A for options considered): it is compact (32KB of object code), well-tested, professionally maintained, reliable, portable, scales well, and is exceptionally fast for our kinds of load. We have engineers at Mozilla with prior experience with LMDB, and their feedback is entirely positive."

How so? The project is building upon an existing key-value store (LMDB) to replace some custom data stores in Firefox using flat text files, JSON files, Kinto, and SQLite.

Curious comment. My reading of it is exactly the opposite of reinventing any wheels.

Why do they make this proposal? What does that mean?

It's a proposal for making use of a new storage engine to store Firefox internal data.

I thought that the issues like this would be blocking:


Who knows how it should work on 32-bit systems?

And isn't endianess also a problem? And doesn't SQLite solve both problems by default? And isn't also possible to configure SQLite to be very fast, if one know what he's doing?

Btw I'd expected that the "notes here" https://docs.google.com/document/d/1bwbpqPb58a0GcEyB4W424pyf... are conclusions, but the document seems to be not publicly accessible.

According to @hyc in https://monero.stackexchange.com/questions/2606/are-the-lmdb..., "As of v0.10.0, yes the LMDB files are cross-compatible between 32 and 64bit architectures. They have always been cross-compatible between OSs. They are still byte-order dependent but almost everyone uses little-endian CPUs these days so it's not much of an issue."

So the endianness answer is, from your link: "They are still byte-order dependent but almost everyone uses little-endian CPUs these days so it's not much of an issue." Which just means "we solve the problem by ignoring it completely."

And what about the limitations on the 32-bit system? Isn't there also needed to use memory space of RAM for the complete size of the database, that's how that database works if I understood? Which makes LMDB effectively unsuitable for 32-bit systems:


"your user will need to either enable PAE on their system, or upgrade to 64-bit CPU. If neither of these is an option in your application, then you cannot use a memory mapped file larger than your available address space"

In short it still looks that LMDB is designed effectively only for one-endianness and only for 64-bit systems, which is still limiting for many use cases.

Again, it seems that for LMDB is again "solving" the problem by ignoring it. Which is OK if it fits your use case... But shouldn't Firefox work properly on 32-bit systems? Or did they completely decide that they don't want to target any 32-bit platform anymore?

Moreover, there are use-case scenarios where the memory mapped file approach can suffer from problems of unnecessarily reading the page that will anyway be completely overwritten. My conclusion is... if LMDB "works for you" fine, but do properly your research first. I still believe SQLite covers much more use cases and is safer starting point for many of them, including Firefox use cases, until I read that they really decided to reduce these.

The endianness answer is "we thought about it, and it's a non-problem 99% of the time, and not worth further thought." You can't use a DB concurrently from multiple hosts, so it's unnecessary to support both endiannesses at once. If you need to copy a DB between different endian hosts, export it with mdb_dump and reimport it with mdb_load on the other machine. Don't be an ass.

There are no limitations on 32bit systems using MDB_VL32 in the 1.0 branch, you're spewing crap without any idea what you're talking about. Whoever's writing on stackoverflow doesn't know what they're talking about either.

> Moreover, there are use-case scenarios where the memory mapped file approach can suffer from problems of unnecessarily reading the page that will anyway be completely overwritten.

That's only true if you use a writable mmap, and the DB is larger than RAM. The default for LMDB is not to use a writable mmap, so this isn't an issue that affects most LMDB users. It is also already documented, so you choose it at your own risk.

> Again, it seems that for LMDB is again "solving" the problem by ignoring it. Which is OK if it fits your use case... But shouldn't Firefox work properly on 32-bit systems? Or did they completely decide that they don't want to target any 32-bit platform anymore?

Yes and no, respectively. Yes, Firefox should work properly on 32-bit systems; and no, we didn't completely decide that we don't want to target 32-bit platforms anymore.

Rather, Firefox plans to use LMDB where it fits one of its many use cases for persistent storage (and not use it where it doesn't).

The StackOverflow thread you referenced describes a blockchain program that expects to use "a few GB of lmdb diskspace." Whereas Firefox's use cases for LMDB are sized in the range of a few KiB to a few MiB. Firefox has no plans to start storing a blockchain.

Claimer: I'm the engineer integrating LMDB into Firefox.

The last issue is merely one of performance, not capability.

I’ve successfully used LMDB in 32-bit land, though it takes some effort. I had to scan available virtual memory for the largest contiguous chunk and use that.

Bigger issues are growing the database size and performance in low memory. No transactions can be running to increase the map size, which is usually hard to coordinate.

Also, in low memory situations, LMDB’s performance suffers tremendously. It can be 100x slower, and commits can take seconds. You won’t run into it unless you are really hammering it, and the developer usually won’t notice because they usually have lots of RAM.

Fwiw, we've benchmarked on Raspberry Pis with slow SDcards. You want to talk about low RAM situations and slow I/Os. The reality is still the same though - every other DB engine is many times slower under the same conditions.

Even when the DB is 5x or 50x larger than RAM... http://www.lmdb.tech/bench/hyperdex/

re: 32bit land, You should look into MDB_VL32 in mdb.master as other posters have suggested.

If any single commit takes multiple seconds, you've probably written too much data in a single transaction. Above about 512MB it starts preemptively flushing intermediate pages out, to keep the in-RAM footprint limited. Some of those intermediate pages will need to be read back in if your transactions are hitting all over the DB. This is where the main slowdown comes.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact