Cloudant’s Rather Awesome dbcopy for re-reducing the output of existing map/reduce data.

I love Cloudant, simple as that: their implementation of CouchDB as a DDaaS (as opposed to a DBaaS) has no real competitor. As CouchDB develops, that may of course become a problem, since it gives one company dominance in a market, but for now we’re good.

One of their best features is the dbcopy option, which copies the output of a map/reduce view to either the same database or another database entirely. Why would you want to do this when you can just create another view? Simple, really: you already have the data you need, and you don’t want to rebuild a view across a huge dataset. The output of a map/reduce view is often a single number, or statistics on those numbers, and for CouchDB, iterating over large documents is significantly slower than iterating over numbers or the output of _stats. As an example, a view on our database recently took a week to build when I added dbcopy to it, versus near-immediately for the equivalent view on the dbcopy database (based on a constant request to update the view, as I have set up for our database).

Or, to put it another way, imagine having two books: in one, each page is filled with a story; the other summarises each page’s story with a bullet point. How much quicker can you read the second book? A LOT quicker! Yes, the meat of the story is lost, but so what? You don’t want the meat, you want the analysis.

Right, so, how is this done? Let’s use an example. (I’ve abstracted this from a use case at Media Skunk Works; it may look a bit convoluted at times, but I can’t give away the real use case!)

Let’s imagine you have a database of users who search your site for certain keywords; during registration you ask for their country, so you have that information.

In your map function, you emit [country, user, keyword] as the key, so you can find all keywords for a user in a country, all users in a country, and all countries. But what if you want to find all users who searched for a particular keyword?
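A minimal sketch of that map function; the document field names (country, user, keyword) are illustrative, not the real schema, and the stub emit plus sample call are here only so the sketch runs outside CouchDB:

```javascript
// Stub standing in for CouchDB's built-in emit(), so this runs standalone.
var emitted = [];
function emit(key, value) { emitted.push([key, value]); }

function map(doc) {
  // Composite key: query by country, by country+user, or the full triple
  // (CouchDB view keys collate left-to-right, so prefixes are queryable).
  if (doc.country && doc.user && doc.keyword) {
    emit([doc.country, doc.user, doc.keyword], 1);
  }
}

// Sample document standing in for a real search record.
map({ country: "UK", user: "alice", keyword: "couchdb" });
```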

You could add a new view, but if you instead add "dbcopy": "user-stats" to your view definition (where "user-stats" is the database the reduce output is copied to), you can use that new database to re-work the data.
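In the design document, that looks something like the following sketch; the design doc and view names are made up, and the map body is a stringified version of the view above:

```json
{
  "_id": "_design/searches",
  "views": {
    "by_country_user_keyword": {
      "map": "function (doc) { emit([doc.country, doc.user, doc.keyword], 1); }",
      "reduce": "_count",
      "dbcopy": "user-stats"
    }
  }
}
```

Note that dbcopy works on the reduced output, so the view needs a reduce function for there to be anything to copy.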

So, in the “user-stats” database, add a view that does the following: “if(doc.key.length === 3) emit([doc.key[2], doc.key[1]], 1)”.
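Spelled out as a full map function, that looks like the sketch below. The shape of the copied documents (the original view row carried in doc.key and doc.value) is the assumption here, and the stub emit plus sample call exist only so it runs standalone:

```javascript
// Stub standing in for CouchDB's built-in emit(), so this runs standalone.
var emitted = [];
function emit(key, value) { emitted.push([key, value]); }

function map(doc) {
  // doc.key holds the original [country, user, keyword] triple
  // copied over by dbcopy; doc.value holds the reduced value.
  if (doc.key && doc.key.length === 3) {
    // Re-key as [keyword, user] so rows can be grouped by keyword.
    emit([doc.key[2], doc.key[1]], 1);
  }
}

// Example dbcopy document (shape assumed for illustration).
map({ key: ["UK", "alice", "couchdb"], value: 1 });
```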
And voilà: your new database now has a lookup that outputs [keyword, user] = 1 for each document, allowing you to re-group and re-stat the data by keyword and user. The bonus is that this is a huge performance gain. Not only that, you could also add a view with emit([doc.key[2], doc.key[0]], 1) to see what is searched for from each country, with minimal performance cost; in fact, zero performance cost to your primary database.


