Elasticsearch and Joining

March 12, 2013

by Felix Hürlimann

With the success of elasticsearch, people, including us, start to explore the possibilities and mightiness of the system. Including border cases for which the underlying core, Lucene, never was originally intended or optimized for. One of the many requests that come up pretty quickly is the whish for joining data across types or indexes, similar to an SQL join clause that combines records from two or more tables in a database. Unfortunately full join support is not (yet?) available out of the box. But there are some possibilities and some attempts to solve parts of issue. This post is about summarizing some of the ideas in this field.

To illustrate the different ideas, let’s work with the following example: we would like to index documents and comments with a one to many relationship between them. Each comment has an author and we would like to answer the question: Give me all documents that match a certain query and a specific author has commented on it.

Compound documents

The first attempt to model this in elasticsearch is to flatten out the document and comments into pairs. This comes at the price of increased storage requirement and potentially massive duplication of content in the index. Creating new comments is relatively simple but any change in the document itself requries a re-index of many records.

Index:

curl -XPUT http://localhost:9200/documents/document/1 -d '{
    "title": "Habemus Papam",
    "content": "In Rome, white smoke rose from the chimney atop of the Sistine Chapel.",
    "comment_text": "Oh my god ...",
    "comment_author": "John Doe",
    "comment_date": "2013-03-11T13:13:13"
}'

curl -XPUT http://localhost:9200/documents/document/2 -d '{
    "title": "Habemus Papam",
    "content": "In Rome, white smoke rose from the chimney atop of the Sistine Chapel.",
    "comment_text": "Finally!",
    "comment_author": "Jane Roe",
    "comment_date": "2013-03-10T14:14:14"
}'

The query then is straight forward:

curl -XPOST http://localhost:9200/documents/document/_search -d  '{
    "query": {
        "bool" : {
            "must" : [{
                "match" : {
                     "title" : "papam"
                 }
             },
             {
                 "match": {
                     "comment_author": "John Doe"
                 }
             }]
        }
    }
}'

Subsequent searches

Another approach is to model the documents and comments in different indices (or types) and execute two subsequent queries. The advantage here is that we get rid of the duplicate content in the index at the price of an extra request (including network overhead and parsing the response from the first request).

Index:

curl -XPUT http://localhost:9200/documents/document/1 -d '{
    "title": "Habemus Papam",
    "content": "In Rome, white smoke rose from the chimney atop of the Sistine Chapel."
}'

curl -XPUT http://localhost:9200/comments/comment/1 -d '{
    "text": "Oh my god ...",
    "author": "John Doe",
    "date": "2013-03-11T13:13:13",
    "document": 1
}'

curl -XPUT http://localhost:9200/comments/comment/2 -d '{
    "text": "Finally!",
    "author": "Jane Roe",
    "date": "2013-03-10T14:14:14",
    "document": 1
}'

Query:

curl -XPOST http://localhost:9200/comments/comment/_search -d  '{
    "query": {
        "match" : {
            "author" : "John Doe"
        }
    }    
}'

curl -XPOST http://localhost:9200/documents/document/_search -d  '{
    "query": {
        "match": {
            "title" : "papam"
        }
    },
    "filter": {
        "ids" : {
            "values" : ["1"]
           }
    }
}'

One potential improvement here might be to store the relations in an external different storage system, but I am not going into this as I would like to solve the problem in elasticsearch only.

Object type array combined with script

A next attempt is to reduce the number of requests to one and still keep the content deduplicated. A straight forward approach is to model the comments as an object type array into the document:

Index:

curl -XPUT http://localhost:9200/documents/document/1 -d '{
    "title": "Habemus Papam",
    "content": "In Rome, white smoke rose from the chimney atop of the Sistine Chapel.",
    "comments": [
        {
            "text": "Oh my god ...",
            "author": "John Doe",
            "date": "2013-03-11T13:13:13"
        },
        {
            "text": "Finally!",
            "author": "Jane Roe",
            "date": "2013-03-10T14:14:14"    
        }
    ]
}'

Query (naive attempt):

curl -XPOST http://localhost:9200/documents/document/_search -d  '{
    "query": {
        "bool" : {
            "must" : [{
                "match" : {
                     "title" : "papam"
                 }
             },
             {
                 "match": {
                     "comments.author": "John Doe"
                 }
             }]
        }
    }
}'

This seems to work, right? But there is a problem with that, try to do the same query with “John Roe” as author and you see that document 1 is returned although there is no such author. The problem is that the author query is executed against all comment authors and not restricted to one entry in the list. There is a little hack that could fix this:

curl -XPOST http://localhost:9200/documents/document/_search -d  '{
    "query": {
        "custom_score" : {
            "query" : {
                "match": {
                    "title" : "papam"
                }
            },
            "script": "for (comment: _source.comments) { if (comment.author == author) { return _score }; return -1;}",
            "params": {"author": "John Doe"}
        }
    },
    "min_score": 0.0
}'

But I do not recommend to use this, rather read on.

Index time join

Elasticsearch offers you another type that solves the above problem which is called nested type. Internally, nested objects are indexed as additional documents. Together with the main document the nested documents are indexed into one block, i.e. locality is guaranteed, as all comments are shared with their documents. Which makes this approach quite efficient. At the price that all comments (and the document) need to be re-indexed upon any change in them.

To use nesting, the type of the comments need to be set accordingly in the mapping, then the document itself looks exactly as in the previous example:

Mapping:

curl -XPUT http://localhost:9200/documents
curl -XPUT http://localhost:9200/documents/document/_mapping -d '{
    "document" : {
        "properties" : {
            "comments" : {
                "type" : "nested"
            }
        }
    }
}'

Index:

curl -XPUT http://localhost:9200/documents/document/1 -d '{
    "title": "Habemus Papam",
    "content": "In Rome, white smoke rose from the chimney atop of the Sistine Chapel.",
    "comments": [
        {
            "text": "Oh my god ...",
            "author": "John Doe",
            "date": "2013-03-11T13:13:13"
        },
        {
            "text": "Finally!",
            "author": "Jane Roe",
            "date": "2013-03-10T14:14:14"    
        }
    ]
}'

And the query uses a nested filter to fulfill our requirement:

curl -XPOST http://localhost:9200/documents/document/_search -d '{
    "query": {
        "filtered": {
            "query": {
                "match": {
                    "title" : "papam"
                }
            },
            "filter" : {
                "nested" : {
                    "path" : "comments",
                    "query" : {
                        "match" : {
                            "comments.author": {
                                "query" : "John Doe",
                                "operator" : "and"
                            }
                        }
                    }
                }
            }
        }
    }
}'

Query time join

Another possibility that elasticsearch offers you is to join at query time by modelling the document/comment in a parent/child relationship. For that you define the document type to be the parent of the comment type in the mapping:

Mapping:

curl -XPUT http://localhost:9200/documents
curl -XPUT http://localhost:9200/documents/comment/_mapping -d '{
    "comment" : {
        "_parent" : {
            "type" : "document"
        }
    }
}'

Then you can index (and update) the documents and comments independently by linking them with the additional ‘parent’ parameter:

curl -XPUT http://localhost:9200/documents/document/1 -d '{
    "title": "Habemus Papam",
    "content": "In Rome, white smoke rose from the chimney atop of the Sistine Chapel."
}'

curl -XPUT 'http://localhost:9200/documents/comment/1?parent=1' -d '{
    "text": "Oh my god ...",
    "author": "John Doe",
    "date": "2013-03-11T13:13:13"
}'

curl -XPUT 'http://localhost:9200/documents/comment/2?parent=1' -d '{
    "text": "Finally!",
    "author": "Jane Roe",
    "date": "2013-03-10T14:14:14"
}'

For the query you have several option, you can ‘join’ them with the ’has_child’ filter, as used in the example below, but also the opposite ’has_parent‘ clause exists.

curl -XPOST http://localhost:9200/documents/document/_search -d '{
    "query": {
        "match": {
            "title": "papam"
        }
    },
    "filter": {
        "has_child": {
            "type": "comment",
            "query" : {
                "match": {
                    "author": {
                        "query" : "John Doe",
                        "operator" : "and"
                    }
                }
            }
        }
    }
}'

The advantage of this approach is to have separate update cycles for documents and comments, which comes at the cost of slightly less performant queries. But this is especially worth it, if you have big records that you want to join that are not static. One important thing to note is that the ‘has_child’ clause only returns parents, no child data is returned. Therefore an extra request might be needed if you want to show them both at the same time.

Conclusion

Some join-style queries are possible with elasticsearch. Nested types are fast and efficient but lack some flexibility. Parent/child types are more flexible and real useful for true independantly updatable documents, at the cost of some query performance. The challenge is to model your data in a way that it fits into one of these two options.

If you have more complex requirements for join, a new feature introdcued in the latest beta release may can help you. It introduces another feature that allows for a kind of join by looking up filter terms in another index or type. This allows then e.g. for queries like ‘Show me all comments from documents that relate to this document and the author is ‘John Doe’.