Optimizing MongoDB: Lessons Learned at Localytics

I recently gave a presentation at MongoNYC 2011 titled “Optimizing MongoDB: Lessons Learned at Localytics.” The goal was to run through the tips and tricks we’ve picked up as we’ve grown on MongoDB: the tactics we’ve used to keep our data and index sizes down, the gotchas we hit, and how we dealt with them. Here I’ll elaborate on the finer details and address some points that were raised by others.

Before I get into the details, here’s a high-level view of what’s covered: optimizing your data, fragmentation, migration, hardware and EC2, and the “working set in RAM” controversy.

Here are the slides from the talk, if you'd like to peruse them:

Optimizing MongoDB: Lessons Learned at Localytics

Optimize your data

This section is all about basic tips to decrease the size of your documents. When storing many documents, these small wins can really add up.

One thing I’d like to elaborate on is the idea of prefix indexes. The concept already exists in other databases, but in MongoDB we have to emulate it ourselves. The idea is that if you have a large field with low cardinality (i.e., there aren’t many possible values), you can save a ton of space by indexing only its first few bytes. This is done by splitting the large field into two fields: the first is the prefix, which is what you index; the other holds the remainder (the suffix). When you do a find, be sure to specify both the prefix and the suffix in the query.

How many bytes you index depends on how many distinct values you expect. BSON (MongoDB’s internal format) can represent 32-bit and 64-bit integers much more efficiently than arbitrary-length binary strings, so I usually pick a 4- or 8-byte prefix packed into an integer.
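As a rough sketch of how this can look in Python with pymongo (the collection and field names here are illustrative, not from our schema):

```python
import hashlib
import struct

from pymongo import MongoClient

coll = MongoClient().mydb.events  # hypothetical database/collection names

# Index only the small integer prefix, not the full value.
coll.create_index("key_prefix")

def split_value(raw: bytes):
    """Pack the first 4 bytes into an int32 (the indexed prefix) and keep
    the rest as an unindexed suffix."""
    prefix = struct.unpack(">i", raw[:4])[0]
    return prefix, raw[4:]

raw = hashlib.sha1(b"some long, low-cardinality value").digest()
prefix, suffix = split_value(raw)
coll.insert_one({"key_prefix": prefix, "key_suffix": suffix})

# Query on both fields: the prefix hits the index, and the suffix filters
# out the occasional collision among values sharing the same first 4 bytes.
doc = coll.find_one({"key_prefix": prefix, "key_suffix": suffix})
```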

As another note, for UUIDs, MD5s, SHA hashes, etc., I said to use the Binary data type instead of hex strings, which saves a ton of space. One finer detail is that I specified Binary subtype 0. Some language drivers still default to subtype 2, but subtype 2 is deprecated and takes a few more bytes in BSON form. Something I didn’t say, which Brendan of Scala driver fame points out, is that you should use subtypes 3 and 5 for UUIDs and MD5s respectively.
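As a quick illustration in pymongo (field names made up), storing the raw digest as Binary roughly halves the bytes on disk, and in any index on that field, compared to a hex string:

```python
import hashlib

from bson.binary import Binary

digest = hashlib.md5(b"some payload").digest()  # 16 raw bytes

doc_hex = {"md5": digest.hex()}       # 32-character hex string: twice the size
doc_bin = {"md5": Binary(digest, 0)}  # 16 bytes, Binary subtype 0
doc_md5 = {"md5": Binary(digest, 5)}  # subtype 5, the designated MD5 subtype
```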

Fragmentation

In this section, I detail how fragmentation happens, and how it also arises during migrations. As @ikai said, “pay attention in your OS class.” Fragmentation was what nailed Foursquare late last year. Thankfully, a great post-mortem was provided.

Fortunately, you can deal with fragmentation fairly easily. The simplest way is to regularly compact your database. Until now you’d use “mongod --repair”, but 1.9+ adds a faster in-place compact command. Both methods hold a lock for the duration of the compaction, so you need to compact your slaves, swap them in as primaries, and repeat. This might sound like a lot of work, but it’s really pretty easy.
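For reference, here is roughly what kicking off the 1.9+ in-place compaction looks like from pymongo; the host and names are placeholders, and depending on your version you may need extra options, so treat this as a sketch rather than a recipe:

```python
from pymongo import MongoClient

# Connect directly to a slave (placeholder host). Compaction blocks the node
# for its duration, so run it off the primary and then swap roles.
client = MongoClient("mongodb://secondary.example.com:27017", directConnection=True)
db = client.mydb

db.command("compact", "events")  # compact a single collection in place
```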

10gen is aware of the need for online compaction (i.e., doesn’t lock for the entire duration of compaction). You can be sure they are working on solutions. Online compaction will be a huge plus, and address one of the larger criticisms of MongoDB.

You can also avoid fragmentation with careful padding of records (which MongoDB will attempt to do automatically), but it means you need a good idea of how big your records tend to get.
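If you do pad manually, one common trick (a sketch with placeholder names, not the only way to do it) is to insert the document with a throwaway filler field sized to the growth you expect, then immediately remove it so the record keeps the extra slack:

```python
from pymongo import MongoClient

coll = MongoClient().mydb.events  # placeholder names

# Insert with ~1KB of filler so the record is allocated large enough for the
# fields we expect to add later, then drop the filler. The slack stays with
# the record, so future growth doesn't force it to be moved (and fragment).
coll.insert_one({"_id": 42, "data": "small for now", "padding": "x" * 1024})
coll.update_one({"_id": 42}, {"$unset": {"padding": ""}})
```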

At the end of the day, every database needs regular maintenance, whether they do it automatically or not, and you have to plan ahead for how this affects system load.

Migration

As noted in the presentation, fragmentation can also happen as a result of chunk migrations in a sharded system. Fortunately, this is easily avoided by either pre-splitting and moving or by choosing better shard keys.

Pre-splitting is pretty obvious, but why a better shard key helps might not be. Avoiding fragmentation with a better shard key usually means choosing one that has insert time as its first component (ObjectId has this property innately). In the absence of deletes, MongoDB keeps appending data to its files, so if your shard key is prefixed on time, a chunk will contain data inserted around the same time and therefore stored near each other on disk. When chunks get moved, larger contiguous ranges of disk get moved with them, meaning less fine-grained page fragmentation.

The other implication of insert time based shard keys is that you don’t add to old chunks, only the latest chunk. This means the newest chunks are hot, but it also means you only split (but not necessarily move) what’s hot in RAM. You can avoid constantly having one hot chunk by increasing the coarseness of your time prefix. For example, if you use a day granularity, then you’ll have one hot chunk at the start of a new day, but it will quickly split and move such that the rest of the day sees even spread across the shards.

There is an advanced tactic for avoiding hot spots with time-based shard keys: pre-create chunks and move them while they’re empty. Do this periodically before the start of each new time interval. In other words, if your shard key is {day: X, id: Y}, then before the start of each day pre-split enough chunks for the upcoming day and balance them across shards with moveChunk.
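A rough sketch of that nightly job, assuming a {day: X, id: Y} shard key on a hypothetical analytics.events namespace (the shard names, namespace, and split points are all placeholders):

```python
from datetime import date, timedelta

from pymongo import MongoClient

admin = MongoClient("mongodb://mongos.example.com:27017").admin  # via mongos

NS = "analytics.events"              # hypothetical sharded namespace
SHARDS = ["shard0000", "shard0001"]  # hypothetical shard names
tomorrow = (date.today() + timedelta(days=1)).isoformat()

# Before the new day starts, carve tomorrow's key range into empty chunks and
# spread them across the shards so writes never pile onto a single hot chunk.
for i, shard in enumerate(SHARDS):
    point = {"day": tomorrow, "id": i * (2**32 // len(SHARDS))}
    admin.command("split", NS, middle=point)
    admin.command("moveChunk", NS, find=point, to=shard)
```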

Hardware/EC2

Here I share some tricks and lessons learned about hardware and running in EC2.

Some tricks, such as pre-fetching and “shard per core,” are really only necessary today because of MongoDB’s simple concurrency model. In the future they will become unnecessary, because 10gen is working on large improvements to both the granularity of its locks and when they yield. This is, in my opinion, an area that will have a huge impact on MongoDB’s performance. 10gen is very aware of this, and at MongoNYC it was stated that they expect to deliver major improvements to concurrency in every release over the next year. It’s high on their to-do list.
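To make the pre-fetching trick concrete (a sketch with made-up names, not our production code): read the document before updating it, so any page faults happen on the read path and the write lock is held only for an in-memory change.

```python
from pymongo import MongoClient

coll = MongoClient().mydb.events  # placeholder names

def warmed_increment(doc_id):
    # Touch the document with a plain read first so its pages fault into RAM
    # outside the write lock...
    coll.find_one({"_id": doc_id})
    # ...then the update only has to modify data that's already in memory.
    coll.update_one({"_id": doc_id}, {"$inc": {"hits": 1}})
```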

Other lessons, such as those about EC2 and EBS, are really just observations/data to help you frame your approach. Be sure to check out these references:

Working Set in RAM Controversy

I’d like to address the reaction the slides received on Twitter, particularly the “Working Set in RAM” slide. This slide involved a graph from a benchmark we ran to understand how performance degrades when your working set grows beyond RAM. We spun up an instance on EC2, attached 16 EBS volumes, installed MongoDB, and hammered it with clients. The workload was an infinite loop: write a new 500-byte doc, read back one existing doc chosen uniformly at random, repeat. The graph shows ops/sec over time. It dips when data plus indexes grow beyond memory, taking about two hours to bottom out.
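For clarity, the workload was essentially this loop (a simplified reconstruction, not the actual benchmark code):

```python
import os
import random

from pymongo import MongoClient

coll = MongoClient().bench.docs  # hypothetical benchmark collection
n = 0

while True:
    # Write a new ~500-byte document...
    coll.insert_one({"_id": n, "payload": os.urandom(500)})
    n += 1
    # ...then read back one existing document chosen uniformly at random, so
    # the effective working set is the entire data set.
    coll.find_one({"_id": random.randrange(n)})
```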

It’s a rather pathological case, because the working set is the whole DB and reads are spread uniformly across it. In the real world, this is rarely the case. With most workloads you tend to read the more recently written data, meaning that even if you have tons of data, you probably only read the most recently written portion the vast majority of the time.

For example, you might have 100GB of data, but 90% of the time you read things written in the last 10GB. In that case, you’d still be able to service most requests from RAM with a modest setup. However, what we tested was a much less forgiving scenario. Consequently, the graph taken by itself could misrepresent the actual outcome for “normal” workloads.

Extreme scenarios aside, the real point of the “Working Set in RAM” slide is that, as with any database, when your working set is in RAM you will achieve orders of magnitude more throughput than when it is not. That’s all I wanted to say with that slide. Some critics, however, picked out the exact numbers (which, in hindsight, I should have either left out or explained more thoroughly) and used them as evidence against MongoDB. “MongoDB only gets 1K ops/sec when hitting disk? That’s terrible, I can do 10K ops/sec in [insert conventional RDBMS] on my hardware.”

MongoDB can also do that. The test I showed was on EC2 with EBS, a rather slow IO system. Given commodity hardware with a RAID array of high RPM drives, I’ll show you 10K+ ops/sec on the same disk-bound test. Most comments attacking the 1k ops/sec shown in those graphs are really complaining about EC2/EBS, not MongoDB, even though they might want to blame MongoDB.

Case in point, for workloads identical to the test shown, MySQL will do roughly the same ops/sec directly on EC2 or using Amazon RDS. I’ve also tried several other databases with similar workloads. You’ll have a hard time doing random disk-bound work on EC2 regardless of the DB.

The bottom line is that the controversy around that graph and its disk-bound throughput number really isn’t about MongoDB vs. [insert your favorite DB]; it’s about the scenario, the hardware, and the workloads a particular database is designed for. Many of the counter-points involving other databases were actually referencing different tests and/or different hardware. It’s important that we recognize the distinction and know what we’re actually faulting. Declaring that your DB crushes MongoDB when you’re really talking about different workloads or hardware is especially irresponsible without context. It sets unrealistic expectations for those new to these systems and databases.

I highly recommend checking out the comments in this Hacker News post, Scaling with MongoDB (or how Urban Airship abandoned it for PostgreSQL). There are some very insightful comments about working set, disk-bound workloads, and how MongoDB compares to other databases. I’ve commented with some of my own extended thoughts in the post, but the threads involving FooBarWidget and rbranson are particularly enlightening.

Conclusion

Overall, MongoDB has been great. I had no intention of coming across as negative, and was in fact rather shocked when people cited my presentation as a reason to abandon MongoDB. While my presentation might have included some cons, every database will present similar challenges. They all require tweaking (and so does your data model). In the end, we achieved the performance we needed at the cost we needed.

I’m still open to new ideas, though, so I encourage discussion of other approaches. MongoDB isn’t perfect and could especially use improvements to its concurrency and compaction. It’s important that we approach these databases scientifically, putting them through the wringer with relevant tests. Claims of throughput and performance should be taken with a grain of salt: what were the test parameters, and what was the hardware? Read up on these databases, understand their internals, and design accordingly. Don’t take numbers in marketing material entirely on faith.

That said, there are reasons beyond performance for choosing a database. Apart from MongoDB itself, 10gen has provided us rock-solid support. Even though it’s an open source database, it’s nice knowing that there is a company dedicated to its advancement. Many new solutions, especially in NoSQL land, don’t have that kind of force backing them. Some do, but 10gen is definitely up there.

On top of that, the community has been stellar. MongoNYC 2011 was good proof of that. There’s also documentation, language support, 3rd party services, feature set, ease of use… the list keeps going.

Another factor has been stability. I’ve done rigorous stability tests on many databases, and among the new wave of databases, MongoDB (1.6+) was one of the best (also huge props to Riak, equally stable if not better). While it’s hard to beat the stability of tried and true DBs like MySQL and PostgreSQL, it’s been worth the risk to get the benefit of integrated sharding and auto-failover that’s easy to use. At MongoNYC 2011, Harry Heymann of Foursquare cited this as his number one reason for choosing MongoDB over sticking with their existing RDBMS solution. Check out the video of Harry’s “MongoDB at Foursquare” presentation.

We are very happy with MongoDB and 10gen. I’m just trying to help other people by sharing what I’ve learned.