none
How can I optimize this distinct count?

    Question

  • The table looks like this:
    IpScan:

    ipNum: bigint
    scanDate: datetime
    scanValue: int

    with 18 milions rows.

    How can I optimize this query:
    (It takes 60 seconds to run)

    select scanValue,
    count(*) count,
    count(distinct ipNum) distinctcount
    from IpScan
    group by scanValue
    order by scanValue
    


    the result looks like this:

    scanValue    count         distinctcount
    0                 108           84
    1                 240           182
    2                 1003         681
    4                 1388813    834866
    5                 15821923   8703358
    6                 7               6
    7                 38             27
    8                 8488         5798
    9                 807116      560491
    • Edited by drealecs Tuesday, February 23, 2010 6:43 PM
    Tuesday, February 23, 2010 3:48 PM

Answers

  • Any progress?

    Just a note: Distinct is an "expensive" operation, if you can, avoid it.



    Kalman Toth, SQL Server & BI Training, SSAS OLAP, SSIS, SSRS; http://www.SQLUSA.com
    • Marked as answer by drealecs Tuesday, February 23, 2010 6:54 PM
    • Unmarked as answer by drealecs Tuesday, February 23, 2010 6:54 PM
    • Marked as answer by drealecs Wednesday, February 24, 2010 6:58 AM
    Tuesday, February 23, 2010 6:48 PM
    Moderator
  • I shoud make a view with count(*) as countVal group by ipNum and scanValue, create an index on this view, and make a select with sum(countVal) and count(countVal) group by scanValue
    But later, I have to go away for an hour now.
    Do you think it might work?
    • Marked as answer by drealecs Wednesday, February 24, 2010 5:54 AM
    Tuesday, February 23, 2010 7:15 PM
  • Success!
    Less than 1 second.

    creating the view:

    create view dbo.SubIpScan with schemabinding
    as select ipNum,scanValue,count_big(*) scans
    from dbo.IpScan group by ipNum,scanValue
    

    Creating the clustered index

    create unique clustered index SubIpScan_cluster on SubIpScan(scanValue,ipNum)
    

    Creating indexes

    create index SubIpScan_scanValue on SubIpScan(scanValue)
    
    create index SubIpScan_ipNum on SubIpScan(ipNum)
    
    create index SubIpScan_scans on SubIpScan(scans)
    

    Runing the new query:

    select scanValue,sum(scans) count,count(*) distinctcount
    from SubIpScan with (noexpand)
    group by scanValue
    order by scanValue
    • Marked as answer by drealecs Wednesday, February 24, 2010 7:39 AM
    Wednesday, February 24, 2010 7:38 AM

All replies

  • You could try putting an index on [scanValue]

    http://sqlblog.com/blogs/jamie_thomson/ | @jamiet | About me
    Tuesday, February 23, 2010 3:52 PM
  • You can try alternative query (SQL Server 2005 an up), but doubtfully more performant

    select ScanValue, max(Count) as [Count], max(Rank) as DistinctCount from (select ScanValue, count(*) over (partition by ScanValue) as [Count], dense_rank() over (partition by ScanValue order by IPNum) as Rank from Scans) X group by ScanValue
    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Tuesday, February 23, 2010 4:00 PM
    Moderator
  • Hello,

    Well a GROUP BY with a DISTINCT is a double whammy.

    Try removing the GROUP BY. It's not needed.

    Adam


    Dibble and dabble but please don't babble.
    Tuesday, February 23, 2010 4:06 PM
  • He counts distinct IPNum - so we do need Group BY
    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Tuesday, February 23, 2010 4:16 PM
    Moderator
  • GROUP BY ipnum itself to get the DISTINCT COUNT. Then add WITH ROLLUP to get the total count:

    SELECT
            scanValue,
            COUNT(ipnum)    distinctcount,
            COUNT(*)        count
            GROUPING(ipnum)    ipnum_all
    FROM
            IpScan
    GROUP BY
            scanValue,
            ipNum
    WITH ROLLUP
    Tuesday, February 23, 2010 4:16 PM
    Moderator
  • Ah,

    So why use the DISTINCT?

    Edit: The GROUP BY should create the distinction.

    Adam
    Dibble and dabble but please don't babble.
    Tuesday, February 23, 2010 4:17 PM
  • You could try putting an index on [scanValue]

    http://sqlblog.com/blogs/jamie_thomson/ | @jamiet | About me

    I have not clustered indexes on all 3 columns.
    My table could get to 30-40 million rows; I have to reduce the time a lot.
    Suggest more options, please. I'll test them and tell you the timings.
    Tuesday, February 23, 2010 6:35 PM
  • Ah,

    So why use the DISTINCT?

    Edit: The GROUP BY should create the distinction.

    Adam
    Look at the results. You understand why I need distinct and group
    Tuesday, February 23, 2010 6:37 PM
  • Any progress?

    Just a note: Distinct is an "expensive" operation, if you can, avoid it.



    Kalman Toth, SQL Server & BI Training, SSAS OLAP, SSIS, SSRS; http://www.SQLUSA.com
    • Marked as answer by drealecs Tuesday, February 23, 2010 6:54 PM
    • Unmarked as answer by drealecs Tuesday, February 23, 2010 6:54 PM
    • Marked as answer by drealecs Wednesday, February 24, 2010 6:58 AM
    Tuesday, February 23, 2010 6:48 PM
    Moderator
  • Hello,

    I see...

    Well, with that many rows, with that simple of a query, you're not going to do any better.

    You can 'try' an indexed view but I'm not sure if that'll help.

    Adam
    Dibble and dabble but please don't babble.
    Tuesday, February 23, 2010 6:50 PM
  • I shoud make a view with count(*) as countVal group by ipNum and scanValue, create an index on this view, and make a select with sum(countVal) and count(countVal) group by scanValue
    But later, I have to go away for an hour now.
    Do you think it might work?
    • Marked as answer by drealecs Wednesday, February 24, 2010 5:54 AM
    Tuesday, February 23, 2010 7:15 PM
  • You could try putting an index on [scanValue]

    http://sqlblog.com/blogs/jamie_thomson/ | @jamiet | About me

    I have not clustered indexes on all 3 columns.
    Sorry, something may have got lost in translation there. Do you mean you have a CI across all 3 columns? I would have thought just putting the CI on [scanValue] would be worth a try.

    http://sqlblog.com/blogs/jamie_thomson/ | @jamiet | About me
    Tuesday, February 23, 2010 7:20 PM
  • 60 Seconds is going to be tough to reduce.

    All I can say is, you don't know until you try.

    Best of luck,

    Adam
    Dibble and dabble but please don't babble.
    Tuesday, February 23, 2010 7:23 PM
  • Yes, the time was reduced even without a view to 15 sec:

    select scanValue,sum(cnt),count(*)
    from (select ipNum,scanValue,count(*) cnt
    from IpScan group by ipNum,scanValue) IpScan2
    group by scanValue
    order by scanValue
    I'll try with a view as soon as I'll get to work.
    Wednesday, February 24, 2010 5:50 AM
  • Success!
    Less than 1 second.

    creating the view:

    create view dbo.SubIpScan with schemabinding
    as select ipNum,scanValue,count_big(*) scans
    from dbo.IpScan group by ipNum,scanValue
    

    Creating the clustered index

    create unique clustered index SubIpScan_cluster on SubIpScan(scanValue,ipNum)
    

    Creating indexes

    create index SubIpScan_scanValue on SubIpScan(scanValue)
    
    create index SubIpScan_ipNum on SubIpScan(ipNum)
    
    create index SubIpScan_scans on SubIpScan(scans)
    

    Runing the new query:

    select scanValue,sum(scans) count,count(*) distinctcount
    from SubIpScan with (noexpand)
    group by scanValue
    order by scanValue
    • Marked as answer by drealecs Wednesday, February 24, 2010 7:39 AM
    Wednesday, February 24, 2010 7:38 AM
  • lol not to be greedy, but if the view was successful, why was everyone's else's marked as answered except for mine?

    Edit: ...and you removed the DISTINCT. Guess I deserve no credit. lol

    Adam
    Dibble and dabble but please don't babble.
    Wednesday, February 24, 2010 3:18 PM