MSDN > 論壇首頁 > SQL Azure — Getting Started > How do we know there are more than 500 results?
發問發問
 

已答覆How do we know there are more than 500 results?

  • Wednesday, 25 June, 2008 8:49Jamie ThomsonMVP使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    In the Tips and Tricks thread (http://forums.msdn.microsoft.com/en-US/ssdsgetstarted/thread/67c0570d-acb6-4dff-9b9b-bc0ee37e9c27) Tony said:

    "If a query has more than 500 entities in the result set then you need to run the query multiple times to get all the results (500 at a time)."

    Is there any way of knowing that there is exactly 500 results or more than 500 results? If the total number of results is divisible by 500 then at some point I'll make a request that doesn't return anything and I'd rather not make another request when I don't have to.


    Also, what if the results change between requests? Imagine the scenario where the first GET ultimately returns more than 500 results but then someone DELETEs all of the data prior to the second GET. In that case the I don't get the correct result. Whatever happened to ACID? I know its very much an edge case (probably a ridiculous one) but you get the point.

    -Jamie
    http://jamiethomson.spaces.live.com/ | http://blogs.conchango.com/jamiethomson

解答

  • Wednesday, 25 June, 2008 18:27Tony PetrossianMSFT, 版主使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     已答覆
    Jamie,

    Currently there is no way to know how many entities are in the container unless you implement some accounting for your container and start keeping a count.  This would work in most cases where the create/delete rates are low and you don't mind the overhead of tracking.  The only other way is to get all the results until you get less than 500 or zero entities returned.  As you said this would cause an extra query when the number of entities are multiple of 500. 

    Regarding the second part of your question -- ACID has gone ACiD in this case to provide better performance and scalability.  In any database implementation, you need to decide what level of Isolation is required for your query.  if you want a guarantee that the data under a query is not changing while you are executing a query then you would need to set your Isolation to serializable or maybe repeatable read.  Most of these guaranties come at a heavy performance price because they have to lock the table/index or range that you are querying.  This mean attempts to insert while your query is running will fail or be held back.

    In our case we don't offer the option of isolating your multiple read queries against any creates so there is a possibility that new entities will be created in the first 500 entity set while you are query the second 500. In another word, no transaction support across multiple requests. 
    Tonyp

所有回覆

  • Wednesday, 25 June, 2008 18:27Tony PetrossianMSFT, 版主使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     已答覆
    Jamie,

    Currently there is no way to know how many entities are in the container unless you implement some accounting for your container and start keeping a count.  This would work in most cases where the create/delete rates are low and you don't mind the overhead of tracking.  The only other way is to get all the results until you get less than 500 or zero entities returned.  As you said this would cause an extra query when the number of entities are multiple of 500. 

    Regarding the second part of your question -- ACID has gone ACiD in this case to provide better performance and scalability.  In any database implementation, you need to decide what level of Isolation is required for your query.  if you want a guarantee that the data under a query is not changing while you are executing a query then you would need to set your Isolation to serializable or maybe repeatable read.  Most of these guaranties come at a heavy performance price because they have to lock the table/index or range that you are querying.  This mean attempts to insert while your query is running will fail or be held back.

    In our case we don't offer the option of isolating your multiple read queries against any creates so there is a possibility that new entities will be created in the first 500 entity set while you are query the second 500. In another word, no transaction support across multiple requests. 
    Tonyp
  • Wednesday, 25 June, 2008 19:00Dave Robinson - SQL AzureMSFT, 擁有者使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    Ok,

    I am just thinking out loud here, but what if we did something like pass the hash of the resultset to you in the header. That way on subsequent calls you can compare the previous hash to the newly received hash and you would be able to identify whether the resultset changed during your recursive calls...But then you would need to manage in your code when the resultset changed...

    Jamie, you mention that this might be an edge case, which I agree, but just looking to get your thoughts around it.

    -Dave
  • Wednesday, 25 June, 2008 20:53Jamie ThomsonMVP使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    David Robinson - MSFT said:

    Ok,

    I am just thinking out loud here, but what if we did something like pass the hash of the resultset to you in the header. That way on subsequent calls you can compare the previous hash to the newly received hash and you would be able to identify whether the resultset changed during your recursive calls...But then you would need to manage in your code when the resultset changed...

    Jamie, you mention that this might be an edge case, which I agree, but just looking to get your thoughts around it.

    -Dave

    Regarding the resultset changing between GETs....
    I like the hash idea. At least give us some information that we can act upon - don't jsut let something occur without us ever kowing about it.

    here's another edge case although probably a more realistic one. we get the results back in EntityID order, right? So I GET the first batch of 500 results but inbetween the first GET and the second GET someone PUTs an entity into the first batch of 500 (I'm assuming that its possible to do that - I don't know whether these IDs are auto generated sequentially or not - for argumnt's sake let's say they're not). Everything after the inserted entity "shuffles down one" so the entityID that was last in batch #1 becomes first in batch #2 and I end up getting it again on the second GET. hence I've got the same entity twice.

    Is that a possibility? Because if it is it feels really really bad.

    Like you  I'm thinking out loud here really. This thread may reach a logical conclusion at some point :)


    Regarding the empty last GET...
    Is it possible to put a flag (or similar) in each batch that says either "I'm the last one" or "There's more to come"?

    Lastly...will that value of 500 be configurable?

    Thanks for the replies so far guys. Like I said, these are edge cases, but in my experience those are the things that cause you the most headaches.
    -Jamie
    http://jamiethomson.spaces.live.com/ | http://blogs.conchango.com/jamiethomson
  • Thursday, 26 June, 2008 2:14Mike Amundsen 使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    FWIW, here's some of my thoughts on how to approach paging in this kind of environment.

    Paging Metadata
    First, when the query is returned, I suggest some meta-data be returned as well. For example:
    <x:Paging records="1200" returned="500" pages="24" this-page="3" />

    You might also consider making these into an HTTP Header (if you don't want to 'dirty' the returned XML, or someday you no longer return XML)
    X-SSDS-Paging: records:1200;returned:500;pages:24;this-page:3

    Hashing and ETags
    Second, the "hash" idea can be handled via the existing HTTP "ETag" header. Upon generating XML for the return, hash the contents (I use MD5) to produce a value that is returned as the ETag. I've been poking Jeff Currier for ETags on list queries, so it would be cool if this hash also solves some paging issues.

    Isolated Paged Collections
    Also, the 'shifting' collection issue is not isolated to cloud services, right? Any paged requests can fall prey to this kind of thing. Currently, one way we solve this in our current project is to make the request with a "return-with-paging" flag that prompts the biz layer to make the full request and set that aside (cache the full set, actually) and, along with the first page, return a key (guid) for future requests. All future requests are against the keyed collection (not the live data). The keyed collection is kept in memory with a sliding expiration (say 30 secs) to help w/ clean ups. Any requests with that key to expired collections, just generates a new direct query and keyed collection again.

    Controlling Page-Size
    Finally, I'll put my plug in for allow us to set the page size in our queries. 500 is fine if you have small records. But of the entities are large documents, 500 might be quite a bit for apps to use efficiently (500 resumes?).




    Mike Amundsen [http://amundsen.com/blog/]
  • Thursday, 26 June, 2008 3:20Dave Robinson - SQL AzureMSFT, 擁有者使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    Great input,

    Some of the items discussed here as Mike points out are not isolated to cloud services. All this is really valuable feedback, and Jeff and I spent some time talking about this today. Some of the items we are pointing out in this thread are edge cases and some we plan to address in future sprints.

    The last thing I want you guys to think is we monitor these forums and give out canned answers. When we say that your feedback drives the product features, we really mean it. This is not some marketing spin, this is the absolute truth.

    You guys rock,
    Dave
  • Thursday, 26 June, 2008 6:26Tony PetrossianMSFT, 版主使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     


     here's another edge case although probably a more realistic one. we get the results back in EntityID order, right? So I GET the first batch of 500 results but inbetween the first GET and the second GET someone PUTs an entity into the first batch of 500 (I'm assuming that its possible to do that - I don't know whether these IDs are auto generated sequentially or not - for argumnt's sake let's say they're not). Everything after the inserted entity "shuffles down one" so the entityID that was last in batch #1 becomes first in batch #2 and I end up getting it again on the second GET. hence I've got the same entity twice.

    Is that a possibility? Because if it is it feels really really bad.


    More thinking out load here:

    The case you are describing is not possible  -- The entity ID is something that you assign so they are not auto generated or sequential.  The query returns the entities in sorted order of the entity id strings.

    Lets consider the following scenario and for simplicity lets assume the page size if 5 instead of 500:

    1. you have 10 entities in a container with IDs "a",  "b",  "c", "d", "e", "f", "g", "h", "i", "j" 
    2. you run a query to get the first page  (e.id > "")
    3. you get 5 entities "a", "b", "c", "d", "e" (the first page)
    4. someone inserts an entity in between the range of the first page -- for example an entity with the id "aa"
    5. then you run your second query to get the second page with the predicate e.id > "e" -- because the id of the last entity in the first page was "e"
    6. the second page returns "f", "g", "h", "i", "j"  -- you are not going to get the entity with the id = "e" again because you had e.id > "e"

    The situation here is that you don't get the entity with the ID "aa" because in the sort order "aa" is in between "a" and "b".  The only way for you to find out about the entity with the ID "aa" is to rerun the first query again. 

    If you repeat the above scenario and change "aa" to "ee" then you will get the new entity in your second page because "ee" > "e" and the shuffle down happens over the range that you have not seen yet so there are no duplications.


    I think it is important to note that when you run four queries to get 2000 entities each query runs independently without knowledge of any query that came before it or after it.  This is not like running a single query and returning the result one page at a time.

    Here is how you should think about the paging:

    from e in entities where e.Id >  ""        select e"  is just like saying    select top 500 * from table where id > ""
    from e in entities where e.Id > "0500" select e"  is just like saying    select top 500 * from table where id > "0500"
    from e in entities where e.Id > "1000" select e"  is just like saying    select top 500 * from table where id > "1000"
    from e in entities where e.Id > "1500" select e"  is just like saying    select top 500 * from table where id > "1500"


    Let me know if this helps.

    Thanks


    Tonyp
  • Thursday, 26 June, 2008 7:13Mike Amundsen 使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    Tony Petrossian said:

    <snip>
    I think it is important to note that when you run four queries to get 2000 entities each query runs independently without knowledge of any query that came before it or after it.  This is not like running a single query and returning the result one page at a time.

    Here is how you should think about the paging:

    from e in entities where e.Id >  ""        select e"  is just like saying    select top 500 * from table where id > ""
    from e in entities where e.Id > "0500" select e"  is just like saying    select top 500 * from table where id > "0500"
    from e in entities where e.Id > "1000" select e"  is just like saying    select top 500 * from table where id > "1000"
    from e in entities where e.Id > "1500" select e"  is just like saying    select top 500 * from table where id > "1500"
    </snip>


    OK, this is sinking in a bit more.

    So 'paging' is not really what's happening. we get a max of X recs from a query and then, if we like, we can make an additional request with a filter that uses the s:Id of the last record as a boundary.

    from e in entities where e["last-updated"]>"2008-06-25" && e.Id> "0500"

    I see the reasoning.  You don't have deal with executing the full query, paging the result set, etc.  yeah, ok.

    But I'd *still* like to see an indicator that more rows are available. 
    <s:more-data="true" />



    Mike Amundsen [http://amundsen.com/blog/]
  • Thursday, 26 June, 2008 9:54Jamie ThomsonMVP使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    Tony,
    It helps a lot. Thank you. I don't think there's any real issue here - I'm just playing devil's advocate really. Your illustration of the scenario is very useful as is thrashing our thoughts around I'm sure you'll agree. I agree with Mike that using the last ID is dead easy.

    Let's be honest. Making a request when there is no more data can only happen when there are more than 499 records in the resultset (which won't happen often) and of those there's a 1 in 500 chance of it happening. On the other hand, if we DO want to return more than 500 results then we are going to have to code to allow for it therefore just for brevity of code if nothing else it might be useful to know if there is any more data to be returned or not. Maybe when more people are on board you should put it to a vote :)

    A GET that returns no data isn't a huge problem when your previous requests collectively returned a multiple of 500.

    I'm enjoying discussing it anyway!

    -Jamie


    http://jamiethomson.spaces.live.com/ | http://blogs.conchango.com/jamiethomson
  • Thursday, 26 June, 2008 16:47Tony PetrossianMSFT, 版主使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    I agree with both of you that it would be useful to know if a query that you run had capped results or not.  This does not have to be related to paging but just any query.

    I'll talk to Dave and the devs.

    Thanks much!
    Tonyp
  • Friday, 27 June, 2008 20:56Dave Robinson - SQL AzureMSFT, 擁有者使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    In answer to Jamie's question regarding the 500 entity number (as seen on his blog posting). We are running through some use cases and trying to come up with the best possible number here. We also are looking into making this number adjustable with a pre-defined max...

    HTH,
    Dave
  • Friday, 27 June, 2008 23:07Jamie ThomsonMVP使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    David Robinson - MSFT said:

    In answer to Jamie's question regarding the 500 entity number (as seen on his blog posting). We are running through some use cases and trying to come up with the best possible number here. We also are looking into making this number adjustable with a pre-defined max...

    HTH,
    Dave



    Blimey. You actually READ it? :)
    http://jamiethomson.spaces.live.com/ | http://blogs.conchango.com/jamiethomson
  • Saturday, 28 June, 2008 0:17Tony PetrossianMSFT, 版主使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    Some thinking here....

    Lets assume (hypothetical here) that I have an application that needs to display results to a user in a web page that can reasonably hold 50 results.  Most of the time the user looks at the first page and 10% of the time they want to see the second page and 3%  of the time they want to see the 3rd page and only 0.1% of the time they go beyond 4th page.    Here is an example, how often do you go passed the 10th page of the google search results?  
     
    For this application I really don't want to pay the cost of fetching all the results and sending them to me unless I need to display them...

    Now let's assume I can configure the page size and set it to anything between 1 and 1000.   For this application I just need the first 50 -- but I want to know if there are more than 50 results so that I can give the user the option to go to the second page.  Or maybe I want to know there is a second page so I can get it and cache it incase the user wants to see it.    I would configure the page size request to 51 for the query

    • If I get 50 or less then there is only one page and I display it
    • If I get 51 then I display the first 50 and use the 51 result as my indication that there is more than 1 page -- I don't know how much more but I know there is a second page.  You need to write some logic to repeat this pattern and maybe you re-query from the 51st position or reuse the 51 row from the last query and get the next 50....but you get the idea..

     


    Tonyp
  • Saturday, 28 June, 2008 1:13Mike Amundsen 使用者勳章使用者勳章使用者勳章使用者勳章使用者勳章
     
    Tonyp:

    The 50/51 thing is an interesting workaround.  Of course it assumes you can control the 'page size' in the query.

    I actually have started to map out a similar pattern where the SSDS Proxy Server can get the first 500 records and then cache the results locally, delivering the first 50 (or some other number) to the client. Subsequent requests from the client go the proxy which determines if it's ok to use the local cache or go get more records from the server. This can even be done completely on the client (pull 500 and then show 50 at a time, until it's time to ask for more).

    Again, these approaches always assume some code standing between the client and the data. We also need to keep in mind cases where I might not have full control over the clietn apps that access the data. Mashups and other 'Web 2.0 API' scenarios can exist where individuals are scripting their own clients against the shared data. In these cases, the ability of SSDS to provide these 'page-size' and 'more-record' meta data elements will be key.





    Mike Amundsen [http://amundsen.com/blog/]