Posted on | February 3, 2011 | 11 Comments
I think the number one benefit of Redis was bringing data structures to the head of table in the battle for the one true datastore . That said it can be more than a little painful to watch it go through the normal growing pains associated with a promising project. At RGB, we looked at Redis to solve the activity stream like many before us. After getting this all going in EC2 there was some frustration around the uncertainty of persistence in the face of failures that might occur. We are a shop that already decided that the premium for AWS RDS was enough to remove most of the pain points associated with MySQL. We have fast moving data that grows against the stream of world news. When we learned that Redis VM wasn’t performing at scale in production environments it really spooked us. When antirez said he was looking down the barrel of implementing his own BTree (not even the best solution for modern storage backends) from scratch I started to get upset. Like angry upset. When the news started to float about the filesystem datastore (one file per key) I started to look for other solutions. Redis is great, however we don’t have the operational resources to baby sit any piece of our architecture. Redis works great until it hits the wall; that wall being over 80% of your machine’s RAM. A lot of different solutions are being looked at to remedy Redis, but based on our collective experience on all this stuff few will disagree that it’s early days. I think the best option is for Redis to adopt a pluggable persistence strategy and support an open leader. A couple of solid candidates for embedding are:
- Berkley DB
What follows are ideas about how maybe we can go the other way and apply the well behaved deterministic data structure operations found in Redis to the databases that are industry proven and do the most important thing really well: read/write data really well.
What else has been done in the space?
In the comments I saw many rock star engineers try to talk antirez off the ledge. Justin Sheehy of Basho pretty much offered up bitcask to solve the persistence problem, but no bite. Don’t get me wrong, I love Redis and the future of Redis Cluster is really bright, but this “not made here” / “our community can wait for our two man team” can only go so far before the Ah Ha moment of Redis is co-opted by other players in the space. This brings me to my first major point. The idea of datastructures in the DB is not new:
- MySQL Stored Routines Library (AKA ”Arrays,Sets,Hashes,Queues,and Stacks” implemented efficiently in stored procedures )
- Written by Giuseppe Maxia, QA Director at Continuent, back in ’05
- Not as fast as Redis, but it does what it needs to do (support large datasets with limited memory while keeping your data safe)
- With some minor improvements things could be tuned for massive vertical scale (think RDS and big box environments)
- Even simple modulo sharding on the keys could allow the system to better scale on many cpu/core servers by minimizing IO locks. Wih this strategy the simple tables used to support the datasets are replicated to create partitions. Native MySQL hash partitions might also be a good bet, but just having a separate table means the opportunity to shard at the client level opens up. With client side key sharding we can now host partitions on different databases
- The libary supports array_merge and could be extended easily for diffs for full set support.
So should everyone just move back to MySQL? Hell no, of course not. It’s just proof that antirez has options about where to spend his time. Playing around in the mud with core peristence doesn’t save time. Primary key and single integer secondary key lookups are fast on any database. These types of ranged searches would be limited in nature to just collecting elements of a list. This is fast everywhere. What is slow are the arbitrarily complex index interactions that come with the relational world. Simple data structures make things faster regardless of the technology used in the implementation. So what alternatives do we have for persistence? Given the example above I think SQLite could be a great implementation leveraging the power of stored procedures to implement atomic datasets. Redis waste the RAM it has on core storage instead of using it effectively as a cache and working memory for arbitrarily large datasets (only limited by node resources). Redis should be only concerned with the client, network and caching layers and leave persistence to more mature and capable backends. Redis could make those backends more capable by orchestrating partitions, backups and bulk data moves once Redis Cluster comes online.
SQLite is really only one suggestion. Using BerkleyDB or Innostore could remove the slowness and pain of interacting with the data through SQL. I believe Membase is currently dealing with the pain of slow performance using SQLite once internal cache is exhausted. Indeed, something like Innostore or bitcask might be a better bet.
Alas, the benefits of Redis are clear and co-optable. For example, Basho’s Riak has a great node/link model for representing data. They added a Solr abstraction on top of this model to court users trying to solve the scalable search pain point. What really stops them from adopting a simple data structures interface for the majority of developers who learned how to think that way in their freshman year? 6.001 (Data Structures) was the first CS class I took at MIT.