A Lazy Sequence

Impedance mismatch with CouchDB

01 December 2010

‘At PARC we had a slogan: “Point of view is worth 80 IQ points.” It was based on a few things from the past like how smart you had to be in Roman times to multiply two numbers together; only geniuses did it. We haven’t gotten any smarter, we’ve just changed our representation system. We think better generally by inventing better representations; that’s something that we as computer scientists recognize as one of the main things that we try to do.’ — Alan Kay

CouchDB has proved itself to be a great platform for the core features of this site. Couch holds clumps of loosely structured JSON data in documents and allows the developer to create a variety of static views using map and reduce functions. I am going to examine some of the choices made and tradeoffs I have encountered since choosing to build this site on top of Couch.

The content on this site is organized into “fragments” that each have a nominal type and a set of fields, some of which are optional, and which share consistent names where possible. For example, every fragment that has chronological information has a timestamps field that is a list[1] of date stamp strings. A view returns a potentially heterogeneous collection of these fragments, which are then passed through to controllers in the Clojure application with very little modification[2]. Clojure’s multimethods are used to act polymorphically on these fragments[3].

This loose structure and late-bound polymorphism allow for simple core logic. New fragment types can be trivially added and will, for example, fall into all the appropriate heterogeneous collections. The majority of these benefits are things that would be quite ugly (or at least require a lot of work) to achieve with a more traditional SQL RDBMS[4]. I specifically looked at Couch as a solution to the ugliness I had previously had to deal with when attempting to implement this site with SQL and Python/Django.

However, despite the ease with which the core logic can be implemented, I have run into a number of issues that make me exceedingly hesitant to recommend Couch to another developer, or to consider using it on another project of my own. There are four major points that I will examine in more detail:

  • Limits on querying.
  • Reduce function limits.
  • Decomposing and reusing view logic.
  • Cached views.

Limits on querying

Couch does not have a query model in the same sense as an SQL DB. There is no general querying facility that can crosscut the data in any way you wish. On the up side, you don’t ever have to manage indexes; Couch is smart enough to determine these for you.

In place of the familiar queries, Couch supports the definition of views. A view consists of a map function and, optionally, a related reduce function, and can be queried with a small number of arguments[5]. While ad hoc view creation is available, it is intended only for development and does not scale to a deployed system. The map function can emit as many key-value pairs for a document as it chooses, and the view is then ordered by those keys. As you might expect, map works on only one document at a time and without any context.
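
As a rough illustration of the map side, here is a minimal sketch of a view over fragments, run against an in-memory array instead of a real Couch instance. In Couch `emit` is a global supplied by the view server; here it is passed in as a parameter so the sketch runs on its own. The `buildView` helper, the sample documents, and the date format are all assumptions made for illustration.

```javascript
// Hypothetical map function: one row per fragment, keyed by its published date.
// In CouchDB `emit` is a global; here it is a parameter so the sketch runs standalone.
function mapByPublished(doc, emit) {
  if (!doc.timestamps || doc.timestamps.length === 0) return;
  emit(doc.timestamps[0], doc.type);
}

// Hypothetical stand-in for what Couch does with a view: run map over every
// document in isolation, collect the emitted rows, and order them by key.
function buildView(docs, mapFn) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const sampleDocs = [
  { _id: "a", type: "blog", timestamps: ["2010-12-01"] },
  { _id: "b", type: "link", timestamps: ["2010-11-15", "2010-11-20"] },
  { _id: "c", type: "note" }, // no timestamps, so map emits nothing
];

const publishedView = buildView(sampleDocs, mapByPublished);
```

Note that the plain string comparison used for sorting here is a simplification; Couch orders keys by its own collation rules.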

To get the most out of Couch, you need to learn to reorganize your thinking in terms of map functions generating the most appropriate keys for your purposes. Have a look at this suggested mechanism for supporting paginated results for an example of some of the curious edge cases this system requires.
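
One commonly described approach to pagination is to request one row more than the page size and use that extra row as the starting point of the next page. The sketch below imitates this against an in-memory list of rows that is already sorted. `queryView` is a hypothetical stand-in for a view request using `startkey`, `startkey_docid`, and `limit`, and the whole thing may not match the linked suggestion in every detail.

```javascript
// Hypothetical stand-in for querying a view: rows are already sorted by (key, id).
function queryView(rows, { startkey, startkey_docid, limit }) {
  let start = 0;
  if (startkey !== undefined) {
    start = rows.findIndex(
      (r) =>
        r.key > startkey ||
        (r.key === startkey && (startkey_docid === undefined || r.id >= startkey_docid))
    );
    if (start === -1) return [];
  }
  return rows.slice(start, start + limit);
}

// Fetch pageSize + 1 rows; the extra row is not shown, it only tells us
// where the next page begins (and whether there is one).
function fetchPage(rows, pageSize, cursor = {}) {
  const result = queryView(rows, { ...cursor, limit: pageSize + 1 });
  const page = result.slice(0, pageSize);
  const extra = result[pageSize];
  const next = extra ? { startkey: extra.key, startkey_docid: extra.id } : null;
  return { page, next };
}

const sortedRows = [
  { id: "a", key: "2010-11-01" },
  { id: "b", key: "2010-11-15" },
  { id: "c", key: "2010-11-15" }, // duplicate key: the doc id breaks the tie
  { id: "d", key: "2010-12-01" },
];

const firstPage = fetchPage(sortedRows, 2);
const secondPage = fetchPage(sortedRows, 2, firstPage.next);
```

The duplicate key in the sample data is the kind of edge case that makes the document id necessary: a key alone would not say where the second page starts.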

Reduce function limits

Reduce functions theoretically provide a way to increase the sophistication of these views. Unfortunately, due to implementation details bleeding through to the user, reduce functions have been saddled with a significant impediment: there is a maximum size to the data returned from any reduce function. This is because Couch stores the results of the reduce in the nodes of the B-tree for each level of that view. This limitation allows Couch to easily recalculate only the parts of a view that need to be recalculated when a document is revised.

Like map, reduce must also be referentially transparent. It does, however, come with one extra piece of information: you know whether it is rereducing (i.e. reducing the results of an earlier reduce rather than the results of a map).
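
To make the rereduce distinction concrete, here is a minimal sketch of a counting reduce function. Couch calls reduce with `(keys, values, rereduce)`; the function name and the hand-simulated calls at the bottom are only for illustration.

```javascript
// Count the rows in a view. CouchDB calls reduce as (keys, values, rereduce).
function countReduce(keys, values, rereduce) {
  if (rereduce) {
    // values are counts returned by earlier reduce calls; keys is null.
    return values.reduce((total, n) => total + n, 0);
  }
  // values are the values emitted by map, one per row.
  return values.length;
}

// Simulate Couch reducing two groups of rows separately and then combining them.
const partialA = countReduce([["k1", "a"], ["k2", "b"]], ["x", "y"], false);
const partialB = countReduce([["k3", "c"]], ["z"], false);
const rowCount = countReduce(null, [partialA, partialB], true);
```

Because the result is a single small number at every level, a reduce like this stays well within the size limit described above; the trouble starts when the result grows with the input.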

I have run into issues surrounding the extent of what can be achieved with views, and reduce in particular, while trying to implement some of the more interesting areas around tagging such as calculating related tags. It seems that I will have to build an out-of-DB system to do the actual processing and then store it back as a document.
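
A minimal sketch of what such an out-of-DB step might look like, assuming that "related" simply means "appears on the same fragment" and that each fragment carries a `tags` array. Both are assumptions for illustration, as is the `_id` of the document the result is stored in.

```javascript
// Hypothetical: count how often each pair of tags appears on the same fragment.
function relatedTags(docs) {
  const related = {};
  for (const doc of docs) {
    const tags = doc.tags || [];
    for (const a of tags) {
      for (const b of tags) {
        if (a === b) continue;
        related[a] = related[a] || {};
        related[a][b] = (related[a][b] || 0) + 1;
      }
    }
  }
  return related;
}

const taggedFragments = [
  { _id: "a", tags: ["couchdb", "clojure"] },
  { _id: "b", tags: ["couchdb", "clojure", "sql"] },
  { _id: "c", tags: ["sql"] },
];

// The result could then be written back to Couch as one document (hypothetical _id).
const relatedDoc = { _id: "related-tags", related: relatedTags(taggedFragments) };
```

The result is a structure that grows with the number of tags, which is exactly the shape of output a reduce function is not allowed to return.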

Decomposing and reusing view logic

Of all my issues, this is probably the most significant. There are two factors:

  • The JavaScript[6] for map and reduce functions is stored in Couch as a string, and is not able to reference any external library of functions beyond the standard library and some utilities provided by Couch.
  • Views only operate on the original pool of documents, and are not able to be a view of views.
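
The first factor can be sketched as follows. Only the source text of the map function ends up in the design document, so a helper defined alongside it is left behind. The helper, the design document name, and the view name are all hypothetical.

```javascript
// A shared helper we would like every view to reuse.
const isPublished = (doc) => !(doc.type == "blog" && doc.status != 1);

// A map function that leans on that helper.
const mapByPublished = function (doc) {
  if (!isPublished(doc)) return;
  emit(doc.timestamps[0], null);
};

// Only the function's own source text goes into the design document.
// The design document and view names are hypothetical.
const designDoc = {
  _id: "_design/fragments",
  views: { by_published: { map: mapByPublished.toString() } },
};

const storedSource = designDoc.views.by_published.map;
// `storedSource` mentions isPublished but does not carry its definition,
// so the name would presumably be unresolved when the view is built.
```

The practical consequence is that the body of a helper like `isPublished` has to be pasted into every map function that needs it.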

It is fairly obvious that having to copy and paste common code across various view functions is error prone and time consuming. A bit sad for 2010. For example, to avoid emitting unpublished blog posts many of my views have:

if (doc.type == "blog" && doc.status != 1) return;
if (!doc.timestamps || doc.timestamps.length == 0) return;

While SQL isn’t famous for its reuse and composability, the reborn ClojureQL project provides a very clean alternative for accessing SQL DBs from Clojure.

Cached views

In general, cached views are very handy; a lot of Couch’s power comes from them. However, this does mean that views that depend on the current time are not sound[7].

For instance I cannot have a view that filters out fragments that have publish dates in the future. It is another case where I have had to push logic out of the DB and into my application. This is doubly an issue as some of the logic (see above) does already live in the database views.
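
The application-side filter is simple enough; the annoyance is where it has to live. A minimal sketch follows, written in JavaScript to match the view code even though the application itself is Clojure, and assuming the timestamps are ISO 8601 date strings. The function name and sample rows are hypothetical.

```javascript
// Hypothetical application-side filter: drop fragments whose published date
// (the first timestamp) is later than "now". Assumes ISO 8601 date strings.
function dropFuture(fragments, now = new Date()) {
  return fragments.filter((f) => {
    if (!f.timestamps || f.timestamps.length === 0) return false;
    return new Date(f.timestamps[0]) <= now;
  });
}

const viewRows = [
  { _id: "a", timestamps: ["2010-11-15"] },
  { _id: "b", timestamps: ["2011-01-01"] }, // scheduled for the future
  { _id: "c" },                             // no chronological information
];

// Passing `now` explicitly keeps the function itself referentially transparent.
const visible = dropFuture(viewRows, new Date("2010-12-01T12:00:00Z"));
```

The same comparison placed inside a map function would be evaluated when the view index is updated, not when the view is queried, so its answer would be frozen at whatever "now" happened to be at that moment.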

Update: David Nolen suggests that Filtered Replication may handle this.

Conclusions

My experience using SQL for the core of my site was unpleasant and the result was difficult to maintain; however, no aspect of it was impossible. CouchDB, on the other hand, has proved wonderful for the core of my site but very difficult, even nearly impossible, for other aspects.

This leaves me in the unpleasant position of choosing between three options: staying with the existing solution and creating bespoke out-of-DB processes that maintain some data and views, such as related tags, and store the results back as documents; switching over to an SQL system that is a weaker fit for the core but more generally applicable; or building some Frankenstein “data-mullet” setup.

See Also

  1. CouchDB – The Definitive Guide
  2. Clojure’s Solutions to the Expression Problem – Chris Houser’s Clojure Conj presentation.

Footnotes

  1. The first timestamp is the published date, any others are modifications.
  2. I deserialize timestamps and remove any fragments that are published in the future.
  3. E.g. for templating or url creation.
  4. I am well aware that SQL and/or relational gurus may object on this point.
  5. For more detail about what you can query on, see HTTP View API: Querying Options.
  6. While I am using clutch for my Clojure → CouchDB access, I am not using the Clojure view engine. It may be the case that this would solve my problem.
  7. Observant readers will realize these shenanigans break referential transparency and are thus verboten in Couch. While the API in general protects you from writing referentially opaque map and reduce functions, time just happens to be one area of the JavaScript standard library that lets unsuspecting programmers shoot themselves in the foot.