none
replacing Group By with row_number(partition by ...)

    Question

  • Hello,
    I have an indexed view and I query this view to get counts on how many products each seller has in a specific category. This query works perfect and produces results in 45ms. I want to join additional tables and apply additional WHERE filters to the result set. I am not able to do it because I had to get products_pid out of With clause to be able to join other tables after CTE. I couldn't add products_pid to the select list because it would require me to have in the Group By. After reading articles in this forum, I found out that Group By can be replaced by row_number() over (partition by...). I re-wrote my query using row_number(). But this query takes 1100 ms to execute. Is this the write way to get rid of Group By? What am I doing wrong here and why it takes so long to execute?
    Thank you,
    Eugene.

    Original query:
    set statistics time on
    GO
    with keys as (
    select getproductssellers.productslocation_cid,
    count(distinct getproductssellers.products_pid) as CNT
    from GetProductsSellers WITH (NOEXPAND)
    WHERE getproductssellers.products_subcategory=26
    group by getproductssellers.productslocation_cid
    )
    select customer.businessname as Seller, productslocation_cid, CNT
    from keys
    inner join customer on productslocation_cid=customer.cid
    order by CNT desc

    New query:
    set statistics time on
    GO
    with keys as (
    select getproductssellers.products_pid, getproductssellers.productslocation_cid,
    count(getproductssellers.products_pid) over (partition by productslocation_cid) as CNT,
    row_number() over(partition by productslocation_cid order by products_pid) as test
    from GetProductsSellers WITH (NOEXPAND)
    WHERE getproductssellers.products_subcategory=26
    --group by getproductssellers.productslocation_cid
    )
    select customer.businessname as Seller, productslocation_cid, CNT
    from keys inner join customer on productslocation_cid=customer.cid
    where test=1
    order by CNT desc

    Data in the View:
    Products_subcategory Productslocation_cid Products_pid record_count
    16 1 8869 1
    16 1 8870 1
    16 1 12073 1
    16 12 12677 1
    16 12 14313 1
    16 12 18829 1
    26 12 21365 1
    26 1 22191 1
    26 12 24539 1
    26 12 24541 1

    Results:
    Seller ProductsLoaction_cid CNT
    StoreA 1 1
    StoreB 12 3

    Friday, September 24, 2010 2:46 PM

All replies

  • It looks like your second query is the same as your original query that can be slightly changed this way:

     

    Original query:
    set statistics time on 
    GO
    with keys as (
    select getproductssellers.productslocation_cid, 
    count(distinct getproductssellers.products_pid) as CNT,
    min(Products_Pid) as Products_Pid
    from GetProductsSellers WITH (NOEXPAND)
    WHERE getproductssellers.products_subcategory=26
    group by getproductssellers.productslocation_cid
    )
    

     

    I think the execution difference may be that the new query will need to return this big result set back also will need to order every record, while getting just MIN should be quick enough.

    In addition, COUNT(..) over (partition ...)

    and COUNT(distinct ...)

    are not the same. In a sense, DENSE_RANK() is close to count(distinct) -- we need to get max (dense_rank) value to get count distinct.

    Try this variation instead.


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Friday, September 24, 2010 3:13 PM
    Moderator
  • Thanks Naomi! I tried with DENSE_RANK(), but it gives me count as 1 for all records. Maybe it's because I am not getting MAX from DENSE_RANK(), but I am not sure where to use max on it. Wouldyou help me please?

    set statistics time on
    GO
    with keys as (
    select   getproductssellers.products_pid, getproductssellers.productslocation_cid,
    dense_rank()  over (partition by productslocation_cid order by products_pid) as CNT,
    row_number() over(partition by productslocation_cid order by products_pid) as test
    from GetProductsSellers WITH (NOEXPAND)
    WHERE getproductssellers.products_subcategory=287
    --group by getproductssellers.productslocation_cid
    )
    select customer.businessname as Seller, productslocation_cid, CNT
    from keys inner join customer on productslocation_cid=customer.cid
    where test=1
    order by CNT desc

    Friday, September 24, 2010 4:38 PM
  • I think you misunderstood my suggestion. I suggested to use your original query and just add MIN(Products_PID) as Products_pID to it. This is what your ROW_NUMBER query is doing.

    You would not be able to change your second query to use dense_rank instead of the count or I don't see a simple way to do so. My point was that your new query is not the same as the original as COUNT(*) over (...) will not give you a distinct count.


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Friday, September 24, 2010 4:44 PM
    Moderator
  • I believe your original query is fine... I don't think u can do any thing better than that..

    Probably Look at the indexes and execution plan why its delaying??

     

    Can you tell CID is covered by index or not and also how many records the query returning etc??

    Friday, September 24, 2010 4:49 PM
  • Now I understand. Thank you! But if I use min(products_pid) in the select list, it passes only 1 products_pid out of WITH clause and my further JOINs on products_pid will join only on pid from other tables. What should I do here?
    Friday, September 24, 2010 4:54 PM
  • Ramireddy,

    Are you talking about original query or new query? Which one is fine?

    Friday, September 24, 2010 4:58 PM
  • set statistics time on
    GO
    with keys as (
    select   getproductssellers.productslocation_cid,
    count(distinct getproductssellers.products_pid) as CNT
    from GetProductsSellers WITH (NOEXPAND)
    WHERE getproductssellers.products_subcategory=287
    group by getproductssellers.productslocation_cid
    )
    select  customer.businessname as Seller, productslocation_cid, CNT
    from keys
    --inner join containstable(products, txtcat, '8GB')
    --as key_tbl on products_pid=key_tbl.[KEY]
    inner join customer on productslocation_cid=customer.cid
    --inner join productslocation on products_pid=productslocation.pid
    --inner join productslocationattr on productslocation.laid=productslocationattr.laid
    --LEFT OUTER JOIN Address ON ProductsLocation.aID = Address.aID
    --WHERE
    --((ProductsLocation.aID = -1) OR (Address.City LIKE 'jh'
    --AND Address.State LIKE 'jh' )) 
    ----and products.subcategory=287
    --and productslocationattr.postdate>'9/1/2010'
    --and productslocationattr.productActive='True'
    order by CNT desc
    Friday, September 24, 2010 5:16 PM
  • I think you need to do:

    ;with cnts as (
    
    select  getproductssellers.productslocation_cid, 
    count(distinct getproductssellers.products_pid) as CNT
    from GetProductsSellers WITH (NOEXPAND)
    WHERE getproductssellers.products_subcategory=287
    group by getproductssellers.productslocation_cid
    ), cte_All as (select getproductssellers.productslocation_cid,
    getproductssellers.Products_pID, 
    cnts.CNT
    from GetProductsSellers WITH (NOEXPAND) inner join cnts
    on GetProductSellers.ProductsLocation_cID =
    cnts.ProductsLocation_cID
    WHERE getproductssellers.products_subcategory=287)
    
    and then use cte_All instead of the Keys in your query - it will have a distinct count and also the necessary IDs you can use to join with the other tables.


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Friday, September 24, 2010 5:31 PM
    Moderator
  • Naomi, thank you for the solution. Unfortunately, it will not work in this case because Count is done in the beginning and all the JOINs and conditions I am going to apply at the end will not be reflected in the count. So far, I came up with this query. The problem here is that I have to bring all the pid from the subquery to compare them with pid in the Count View. If the result set is large it takes most of the time of this query. Is there a trick or another way of matching with PID that I can use here?

    set statistics time on
    GO
    select  customer.businessname as Seller, getproductssellers.productslocation_cid as CID,
    count(distinct getproductssellers.products_pid) as CNT
    from GetProductsSellers WITH (NOEXPAND)
    inner join customer on getproductssellers.productslocation_cid=customer.cid
    WHERE products_subcategory=287
    and products_pid in
    (
    select  products_pid
    from _dta_mv_0_9987 WITH (NOEXPAND)
    --inner join containstable(products, txtcat, 'video')
    --as key_tbl on _dta_mv_0_9987.products_pid=key_tbl.[KEY]
    --inner join customer on keys.cid=customer.cid
    --inner join productslocation on keys.CID=productslocation.cid
    --inner join productslocationattr on productslocation.laid=productslocationattr.laid
    LEFT OUTER JOIN Address ON ProductsLocation_aID = Address.aID
    WHERE
    ((ProductsLocation_aID = -1) OR (Address.City LIKE 'jh'
    AND Address.State LIKE 'jh' ))  and
    products_subcategory=287
    and productslocationattr_postdate>'09/1/2010'
    and productslocationattr_productActive='True'
    )
    group by getproductssellers.productslocation_cid, customer.businessname
    order by CNT desc

    Monday, September 27, 2010 4:55 PM
  • Any  progress?
    Kalman Toth, SQL Server & Business Intelligence Training; SQL 2008 GRAND SLAM
    Friday, October 01, 2010 8:04 PM
    Moderator
  • No, this is what I have so far.
    Tuesday, October 12, 2010 1:00 AM