none
Performance of Range Queries on Partition Keys and Row Keys RRS feed

  • Question

  • Hello community,

    When I do range queries on Windows Azure Table based on Partition Key and Row Key, does it ALWAYS result in a full table scan?

    I will try to explain my question using an example. Let's say my table stores blog entries. The entries have reversed ticks generated from the time the blog entry was posted as the row key. As for the partition key, all the blogs from 2007 have the letter 'A' as the partition key, those from 2008 have letter 'B' and so on (I know this is very crude, but I wanted to make the example very simple).

     

    RowKey = (DateTime.MaxValue.Ticks - postedOn.Ticks).ToString();
    PartititionKey = ((char) (65 + postedOn.Year - 2007)).ToString();
    

    If I now want to select 10 most recent blogs posted within this year (2010), the query would probably touch only one partition (partition 'D') of the table, right?

    Now, what if I want to select "all entries that were posted in this year and the last year", my where clause would look like 'where PartitionKey >= 'C'. Does anyone know if the query would scan all entities also within partitions 'A' and 'B' as well? This can affect performance if the number of partitions and the entities within those partition is very large. Hence, I am eager to know the answer if anyone knows about how the REST queries are actually translated behind the scenes.

    Thanks and regards,

    Saksham 

     

     

    • Moved by DanielOdievichEditor Tuesday, September 28, 2010 10:13 PM forum migration (From:Windows Azure)
    Wednesday, May 5, 2010 10:28 AM

Answers

  • Hi Saksham,

     

    I'll try to explain using examples here.

     

    Query1: PartitionKey == "1" and RowKey == "B"

    This provides the entire index which enables it return the single row

     

    Query2: PartitionKey == "1" and RowKey > "B" and RowKey < "D"

    Looks up partition key 1 and returns rows where rowkey ranges from B through D. The amount of processing depends on the size of the range B-D.

     

    Query3: PartitionKey > "1" and PartitionKey < “3”

    Looks up partition key 1 through 3 and returns all rows. So it does scan the partitions ranging from 1 through 3 but not the entire table.

     

    Query4: SomeProperty == “Foo”

    Results in the entire table being scanned

     

    Query5: RowKey > “A” and RowKey < “F”

    This results in scanning all partitions too since the partition key has not been used

     

    Query6: (PartitionKey > "1" and PartitionKey < “3”) or (PartitionKey > “5” and PartitionKey < “8”)

    We do not optimize here at this time, but we have had future feature requests for that, so this will result also in a full  table scan too.

    So here it is more efficient to break this down into two sets of queries:

    #1 (PartitionKey > "1" and PartitionKey < “3”)

    #2 (PartitionKey > “5” and PartitionKey < “8”)

     

     

    Query7: PartitionKey == "A" and (RowKey == "A" or RowKey == “F”)

    This results in scanning the entire Partition "A".

     

    2 through 7 may return continuation tokens which should be used in subsequent requests to retrieve the entire result.

     

    Hope that helps,

    Jai

    • Proposed as answer by ants_super Friday, May 7, 2010 4:01 AM
    • Marked as answer by Saksham Gautam Friday, May 7, 2010 9:54 AM
    • Edited by Jai Haridas Friday, July 2, 2010 3:32 AM
    Thursday, May 6, 2010 3:36 PM

All replies

  • -- When I do range queries on Windows Azure Table based on Partition Key and Row Key, does it ALWAYS result in a full table scan?

    No. The Azure Table Service would use the PartitionKey to identify which partition servers to send the query to and would then use the RowKey as an index to identify the data. Provided your range query identifies only a subset of the data this would not result in a table scan.

    It is important to remember that any query not identifying a single entity by PartitionKey and RowKey may result in a continued query wherein only some of the results are returned along with a continuation token used to indicate the start point for a subsequent query to retrieve the rest of the data in an iterative process. There have been pathological cases reported where the first call to a query returned zero records and a continuation token.

    Wednesday, May 5, 2010 5:01 PM
    Answerer
  • According to this PDC09 deep dive (28 mins in), if you use any predicate other than = in a key, it will result in a scan.

    I think the only way to avoid this in your scenario is to run asyncronous queries for each partition key, then union the results together on completion.

    Anthony.

    Wednesday, May 5, 2010 10:07 PM
  • Hi,

    I'm a bit confused now. Range query implies that we are using an operator other than the '='. Two answers to the thread and one answer is a 'Yes' and another one a 'No'. :)

    Anthony, the asynchronous query for each partition works fine if I know each and every partition in my table storage. In some cases, I might not know up front what my partitions are. A bad table design I know, but I just want to highlight that there might be cases when one does need range queries based on partition keys.

    Neil, when you said "The Azure Table Storage Table Service would use the PartitionKey to identify which partition servers to send the query...", did you mean that (going back to the example in my original post) if my query has > operator on partition key the Azure Table infrastructure is smart enough to ignore some partitions? Could you kindly confirm?

    Many thanks,
    Saksham

    Thursday, May 6, 2010 10:12 AM
  • Hi Saksham,

     

    I'll try to explain using examples here.

     

    Query1: PartitionKey == "1" and RowKey == "B"

    This provides the entire index which enables it return the single row

     

    Query2: PartitionKey == "1" and RowKey > "B" and RowKey < "D"

    Looks up partition key 1 and returns rows where rowkey ranges from B through D. The amount of processing depends on the size of the range B-D.

     

    Query3: PartitionKey > "1" and PartitionKey < “3”

    Looks up partition key 1 through 3 and returns all rows. So it does scan the partitions ranging from 1 through 3 but not the entire table.

     

    Query4: SomeProperty == “Foo”

    Results in the entire table being scanned

     

    Query5: RowKey > “A” and RowKey < “F”

    This results in scanning all partitions too since the partition key has not been used

     

    Query6: (PartitionKey > "1" and PartitionKey < “3”) or (PartitionKey > “5” and PartitionKey < “8”)

    We do not optimize here at this time, but we have had future feature requests for that, so this will result also in a full  table scan too.

    So here it is more efficient to break this down into two sets of queries:

    #1 (PartitionKey > "1" and PartitionKey < “3”)

    #2 (PartitionKey > “5” and PartitionKey < “8”)

     

     

    Query7: PartitionKey == "A" and (RowKey == "A" or RowKey == “F”)

    This results in scanning the entire Partition "A".

     

    2 through 7 may return continuation tokens which should be used in subsequent requests to retrieve the entire result.

     

    Hope that helps,

    Jai

    • Proposed as answer by ants_super Friday, May 7, 2010 4:01 AM
    • Marked as answer by Saksham Gautam Friday, May 7, 2010 9:54 AM
    • Edited by Jai Haridas Friday, July 2, 2010 3:32 AM
    Thursday, May 6, 2010 3:36 PM
  • Hi Jai,

    This is very interesting as everything I've seen so far states that range queries and OR queries on partition keys will result in a full table scan. Is this behaviour documented anywhere that you know of?

    I asked the question about OR's in a partition query earlier, which confirmed that it resulted in a table scan. So just confirming that OR's will scan all partitions, but ranges won't?

    Cheers,
    Anthony.

    Thursday, May 6, 2010 9:48 PM
  • Hi Anthony,

     

    “OR” in queries is not optimized at present but as I mentioned above, we have had feature requests for that.

     

    Query 6 is not optimized at present because of the OR condition on PartitionKey and this query will result in a table scan.

     

    Query 2 and 3 are range queries (2 ranges over the RowKey and 3 over PartitionKey), but will not result in table scans.

     

    Let me know if this answers your question. We will also be posting more details on our blog in the near future.

     

    Thanks,

    Jai

     

    • Proposed as answer by ants_super Friday, May 7, 2010 4:01 AM
    Friday, May 7, 2010 3:43 AM
  • Fantastic thanks!

    Friday, May 7, 2010 4:01 AM
  • Thanks a bunch Jai. Will stay tuned in to the Azure Storage Blog for more optimizations.

    Friday, May 7, 2010 10:10 AM
  • PartitionKey == "A" and SomeProperty == "Foo"

    That results in a partition scan, correct?

    (PartitionKey > "1" and PartitionKey < "6") and SomeProperty == "Foo"

    That results in a scan of all partitions between "1" and "6", correct?

    Wednesday, March 13, 2013 9:16 PM
  • Hi Saksham,

     

    I'll try to explain using examples here.

     

    Query1: PartitionKey == "1" and RowKey == "B"

    This provides the entire index which enables it return the single row

     

    Query2: PartitionKey == "1" and RowKey > "B" and RowKey < "D"

    Looks up partition key 1 and returns rows where rowkey ranges from B through D. The amount of processing depends on the size of the range B-D.

     

    Query3: PartitionKey > "1" and PartitionKey < “3”

    Looks up partition key 1 through 3 and returns all rows. So it does scan the partitions ranging from 1 through 3 but not the entire table.

     

    Query4: SomeProperty == “Foo”

    Results in the entire table being scanned

     

    Query5: RowKey > “A” and RowKey < “F”

    This results in scanning all partitions too since the partition key has not been used

     

    Query6: (PartitionKey > "1" and PartitionKey < “3”) or (PartitionKey > “5” and PartitionKey < “8”)

    We do not optimize here at this time, but we have had future feature requests for that, so this will result also in a full  table scan too.

    So here it is more efficient to break this down into two sets of queries:

    #1 (PartitionKey > "1" and PartitionKey < “3”)

    #2 (PartitionKey > “5” and PartitionKey < “8”)

     

     

    Query7: PartitionKey == "A" and (RowKey == "A" or RowKey == “F”)

    This results in scanning the entire Partition "A".

     

    2 through 7 may return continuation tokens which should be used in subsequent requests to retrieve the entire result.

     

    Hope that helps,

    Jai

    Hi. I know this is old post, but I have a question what about PartitionKey >= 1 and PartitionKey < 5 and RowKey >= 11 and RowKey < 22? Will it hit index, can it be efficient? And one more question about Query5, will it hit index despite scanning all partitions?
    Thursday, February 15, 2018 12:11 PM
  • @Adil Mammadov: Could you elaborate more on the statement “Will it hit index”? In the meantime, you can refer this Stack Overflow thread for more details. Also, it would be better to create a new forum thread since the current one is from 2010 and a lot has changed in Azure since then.

    -----------------------------------------------------------------------------------------------------

    Do click on "Mark as Answer" on the post that helps you, this can be beneficial to other community members.

    Monday, February 19, 2018 5:30 AM