Optimizing a Big RethinkDB Query, and a Correction

April 18 2015 Rethinkdb Mistake

Yesterday I wrote a post about how much I liked RethinkDB and people seemed to like it a lot. Particularly the claim I made about optimizing a fuzzy text search over 200,000 records down to 9ms.

That claim was wrong.

I'll walk through the various bubble and squeak below, but for now know that a) RethinkDB secondary indexes don't work on non-deterministic values and b) the actual figure is 629ms give or take. That's still impressive, to me.

Either way, I made a gigantic error. I have nothing to gain by falsifying a claim like this; RethinkDB isn't paying me and it doesn't serve me one bit to lie. What it does is suggest that I don't investigate things thoroughly and all I can say is that I spent days and days on this - and found the right things, I just didn't relay them properly. Sigh.

Here's what I found, and the investigations I did. I'll show you the numbers and how I got to them, and then we'll discuss that fuzzy text thing.

Optimizing a Big Query

First, if you want to play along I just pushed the data generator up to Github. Follow the instructions and you'll have two tables with 100,000 records.

The query we need to write is a relatively simple one: get the shipping status for a given company. This means we need to pull company information from the sales table and the shipping status from the fulfillments table and, for our purposes, let's use Stark Group as our test company (it's a fake name generated in the data generator).

If we reason our way through this query, we might start by pulling the sales information we want:

r.db("music").table("sales").filter(function(sale){
    return sale("customer")("company").match("Stark")
  })

This is a fuzzy text match. This is where I screwed up in my post yesterday. I didn't run this exact query when I created the example, what I did run was this query:

r.db("music").table("sales").getAll("Stark Group", {index : "company"})

This came back lightning fast because I have a secondary index on the customer.company field. For some reason which I just can't figure out, in my mind I recalled "fuzzy text match" when I wrote that post yesterday, which isn't the case at all.

Anyway - without any indexing the first query comes back in about 580 to 780ms on my Macbook Pro; not too bad really. Let's continue with the example...

Building Up Our Filters and Joins

The next step is to restrict the data we want and then join it over to the fulfillments table. For the rest of this example I'll show the work I did do, as opposed to the fuzzy match - which I'll address later on.

r.db("music").table("sales")
  .filter({customer : {company : "Stark Group"}})

Note that here I'm showing a different way to run a filter - I tend to like using callbacks (or "lambdas") but some complained that it looked noisy. Here's a less-noisy way of doing things.

Now let's add a join to fulfillments:

r.db("music").table("sales")
  .filter({customer : {company : "Stark Group"}})
  .innerJoin(r.db("music").table("fullfillments"), function(sale, fulfillment){
    return sale("sale_id").eq(fulfillment("sale_id"))
  })

This works, but it's not exactly correct, and it takes forever to run. An inner join in RethinkDB is designed for sequences of data, which we have here, but there's another join we can use since we're working with primary and foreign keys (for lack of a better name) - an eqJoin.

You might be wondering why I chose to show this example then? Simple: if you come from a SQL background, like I do, an "inner join" seems the obvious choice. I made this mistake :).

If you run the query above it takes quite a while. We need to use an eqJoin and it needs to be optimized.

Understanding Joins

When you use joins in RethinkDB, you connect two sequences of data based on some type of equality. Currently, RethinkDB supports 3 kinds of joins:

Obviously I could have made a better choice in the case of this query.

When you use an eQjoin in RethinkDB you join a table or sequence on the "left" to a table/sequence on the right based on a primary key or secondary index. This can be confusing, so let's see an example:

r.db("music").table("fullfillments")
  .eqJoin("sale_id",r.db("music").table("sales"))
  .count()

Note that I need to start with the "child" table, then join the "parent" table. Seems a bit backwards. Let's tweak this for better readability. I'll start by creating a secondary index:

r.db("music").table("fullfillments").indexCreate("sale_id")

I don't need to specify anything else here since I'm building an index on a top-level key. Given that the index is based on sale_id, I can now do an eqJoin in this way:

r.db("music").table("sales")
  .eqJoin("id", r.db("music").table("fullfillments"), {index : "sale_id"})

Equality joins in RethinkDB are optimized and much more efficient than and innerJoin. You can see this when I plug it into our query:

eqjoined

2 seconds - much better!

Order Of Operations

Exactly when you apply the filters and joins is very important. Here, I'm filtering the sales table first, and then applying the join. This restricts the results and makes the join operation much easier. If we flip that around, we mess things up pretty good.

This is how it's done with SQL - you join up your tables/views/data and then filter/order it. With RethinkDB you chain together your functions to do what you need: grouping, filtering, shaping and reducing your data. How you do this is quite important when optimizing.

To see this, just move that eQjoin one line up... you'll be waiting a while.

That 9ms Claim

Let's reduce the noise in this query and rely on indexing a bit more. First, I'll create an index for the company name called "company":

r.db("music").table("sales")
  .indexCreate("company", r.row("customer")("company"))

I can now use this in my query, allowing me to ditch the filter:

r.db("music").table("sales")
  .getAll("Stark Group", {index : "company"})
  .eqJoin("id", r.db("music").table("fullfillments"), {index : "sale_id"})
  .map({
    company : r.row("left")("customer")("company"),
    email : r.row("left")("customer")("email"),
    status : r.row("right")("status")
  })

11ms_query

This goes off in 11ms (plus or minus a few ms). I'm able to use getAll() here because I'm querying by company name.

It should be noted here that RethinkDB, like many database systems out there, implements a caching scheme to optimize query results. This result is fast enough that I'm quite certain it's working over in-memory results. I don't think this disqualifies my findings, but I do think it's worth noting.

Applying a Fuzzy Match

What happens if I use a fuzzy search? The result time grows, of course. The good news: it's still pretty dang fast. The bad news is that it was pretty negligent of me to make that mistake:

fuzzy_search

Hope you find this exercise helpful - I'll do better next time.