none
Crazy high RU cost on query with join - bad index/query optimization? RRS feed

  • Question

  • After evaluating DocumentDB some time ago and deciding it was not for me at that time, i re-evaluated it over the last couple of days and tried some query scenarios. While much has improved and I nearly was convinced DocumentDB could be my solution for the future, I found the following scenario that seems like a showstopper.

    Consider a collection containing documents like this:

    "Article": {
          "UserId": 1,
          "Tags": [
            "state/unread"
          ],
          "FeedTags": [
            "organize/label/IT News"
          ],
          "DocumentType": "UserArticle",
          "id": "UA38876157",
          "ArticleId": 38876157,
          "Author": null,
          "FeedId": 2498,
          "PublishedEpoch": 1460311140,
          "Title": "Awesome article title"
    }

    So I store news articles from different feeds (FeedId) for different users (UserId). I want to store all articles from feeds a user is subscribed to in a collection. I planned to have multiple collections partitioned by UserID, so some users will live side-by-side in each collection.

    Now, while most query scenarios look good performance- and RU-wise, I especially wanted to test the "Tags" collection where, like in the above document example, I store the unread information as a tag. It means a lot of the same values for this field, which is not ideal for indexing, but as I limit the amount of unread items very strictly, it should be doable. But the question is - when I have multiple users on the same collection, and all of them have those unread tags, how does DocumentDB perform here and optimize the query? Turns out, it performs very bad. What I did to test this is:

    • I added articles for 2 users to a collection, a total of several 10.000 articles
    • For user A, I added a unread tag, like in the document example above, to many articles, about 10.000 of them
    • For user B, I just added one single unread tag to a single document, leaving all other documents untagged

    And then I executed the following query for user B, the one with only a single unread tag:

    SELECT top 50 c FROM c
    join t in c.Tags
    where c.UserId = 1
    and c.DocumentType = "UserArticle"
    and t = "state/unread"
    order by c.ArticleId desc

    This causes a request charge of 1.628 RUs, which is insanely high compared to 30 to 60 RUs i have usually for other query scenarios.

    So it seems like DocumentDB is not able to reduce the query by UserId and then applying the index search for tags only on the resulting subset of data. Due to the nested kind of this data and the inner-document-join DocumentDB utilizes, I expected it would be able to do something like that.

    Comparing it to my current SQL Server solution, where this really is a piece of cake scenario with 1:n relationship between an Articles and a Tag table, I mainly see the difference in the UserId that is also available on the Tag table that helps SQL server to narrow down the index search.

    How can I tackle this in DocumentDB? Would it help to also add the UserID to each tag? So instead of having:

          "Tags": [
            "state/unread"
          ]

    Using something like:

    "Tags": [
        {
            "Tag": "state/unread",
            "UserId": 1
        }
    ]

    and include both values in the condition, like: where t.tag = "state/unread" and t.UserId = 1?

    As I said, somehow I expected, due to the nested nature of DocumentDB documents,  DocumentDB would handle this fine. But it does not seem like this.

    What is a good approach to solve such a scenario without having to deal with such high RU costs?

    Thanks :)



    Thursday, April 21, 2016 11:38 PM

Answers

All replies

  • Bernhard, can you please include an activity ID of the query? We'll investigate why the query reports a high number of request units, and suggest a workaround (if necessary).
    Thursday, April 21, 2016 11:51 PM
  • Bernhard, can you please include an activity ID of the query? We'll investigate why the query reports a high number of request units, and suggest a workaround (if necessary).

    Hi Aravind!

    Thanks for the (very) fast response :)

    I executed the query again right now and here is the activity ID for it: d2caad39-8be6-423a-be3d-51e750e7038f

    Thursday, April 21, 2016 11:55 PM
  • Can you please email me at arramac at microsoft dot com? I think we understand the root cause, but would like to get some additional data to dig deeper.

    If possible, we'd like to get some sample documents as well as the indexing policy of your collection.

    Friday, April 22, 2016 9:49 PM
  • Hi Bernhard,

    Did you try using LINQ and seeing if the performance is better? Assuming you have C# Classes that map to those documents you could do something like:

    var client = new DocumentClient("your_database_uri", "your_database_key");
    var enumerableResults = client.CreateDocumentQuery<Article>("your_collection_uri", new FeedOptions() { MaxItemCount = 50 }).Where(x=>x.Tags.Contains("state/unread") && x.DocumentType == "UserArticle");

    Tuesday, April 26, 2016 12:41 PM
  • After an offline investigation/discussion, we found that increasing the indexing policy for Strings to Range with precision to -1, the queries completed with low RUs and latency.

    Monday, May 9, 2016 2:54 AM