Minimize table records by just selecting top 3 of each group

  • Question

  • I have a large table (800,000+ records) from which I want to get the top 3 records for each [subject].

    I tried a subquery, but it has taken over an hour to process:

    SELECT [Subject], [Date], [ID]
    FROM [Table]
    WHERE [Table].[ID] IN
       (SELECT TOP 3 [ID]
        FROM [Table] AS Dupe
        WHERE Dupe.[Subject] = [Table].[Subject]
        ORDER BY Dupe.[Date] DESC, Dupe.[ID] DESC)
    ORDER BY [Table].[Subject], [Table].[Date], [Table].[ID];

    Would breaking the larger table down into smaller tables and then running the query be a more efficient solution? If so, I'd like to break it down to 7,000 records per table, but keep all the records for each [subject] in the same table (so no subject is split across different tables).

    What is the best way to get the top N of the entire table?


    • Edited by JHarding08 Wednesday, January 2, 2019 9:59 PM
    Wednesday, January 2, 2019 9:57 PM

Answers

  • What is the best way to get the top N of the entire table?

    Hi JHarding08,

    I do not believe that splitting a large table into smaller tables will dramatically improve performance, though a small improvement can be seen.

    I believe more in an efficient use of Keys.

    For instance, with the query:

       (SELECT TOP 3 [ID]
        FROM [Table] AS Dupe
        WHERE Dupe.[Subject] = [Table].[Subject]
        ORDER BY Dupe.[Date] DESC, Dupe.[ID] DESC)

    If Dupe.Date is not a Key field, then all records must be searched to find which 3 are the top. On the other hand, if Dupe.Date is a Key field, the top 3 are already within easy reach.
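
    In Access such a Key is created as an index. A minimal sketch in Access DDL, assuming the [Table], [Subject] and [Date] names used above (the index name idxSubjectDate is just an illustration):

    CREATE INDEX idxSubjectDate
    ON [Table] ([Subject] ASC, [Date] DESC);

    With a composite index like this, the database engine can go straight to the rows for one subject in descending date order instead of scanning all 800,000+ records.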

    Imb.

    • Marked as answer by JHarding08 Thursday, January 3, 2019 8:53 PM
    Wednesday, January 2, 2019 10:15 PM

All replies

  • Dupe.Date is not a key field, so any three records can be the top 3. I will remove it and try the query again.
    Wednesday, January 2, 2019 10:18 PM
  • Dupe.Date is not a key field, so any three records can be the top 3. I will remove it and try the query again.

    Hi JHarding08,

    And what about the field Dupe.Subject?

    Imb.

    Wednesday, January 2, 2019 10:34 PM
  • That isn't a key field; it is the field to group on.
    Wednesday, January 2, 2019 10:58 PM
  • That isn't a key field; it is the field to group on.

    Hi JHarding08,

    And …

    It is an important field for retrieving the data, so why not make it a Key field?

    Imb.

    Thursday, January 3, 2019 8:10 AM
  • Are the dates distinct per subject? If so, probably the most efficient solution would be to restrict the query on the COUNT of rows where the date value is >= the date value per subject. This would be most efficiently done by a JOIN of two instances of the table, but could also be done by means of a correlated subquery.

    Neither would return an updatable result table, however, so if you do want to be able to edit data in the result table, e.g. if the query is used as a form's RecordSource, then you'd need to use the VBA DCount function as below:

    SELECT *
    FROM [Table]
    WHERE DCOUNT("*","[Table]","Subject = """ & [Subject] & """ AND [Date] >=#" & Format([Date],"yyyy-mm-dd") & "#") <=3;
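
    For reference, the JOIN of two instances of the table mentioned above might look like the sketch below. It assumes the dates are distinct per subject and uses the same hypothetical [Table], [Subject], [Date] and [ID] names; unlike the DCount version, its result table is not updatable:

    SELECT T1.[Subject], T1.[Date], T1.[ID]
    FROM [Table] AS T1 INNER JOIN [Table] AS T2
        ON (T1.[Subject] = T2.[Subject]) AND (T2.[Date] >= T1.[Date])
    GROUP BY T1.[Subject], T1.[Date], T1.[ID]
    HAVING COUNT(*) <= 3
    ORDER BY T1.[Subject], T1.[Date] DESC;

    Each row's count is the number of rows in its subject with an equal or later date, i.e. its rank in descending date order, so HAVING COUNT(*) <= 3 keeps the latest three per subject.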

    Domain aggregation functions are notoriously slow, however.  For optimal performance it is essential that the Subject and Date columns are indexed non-uniquely.  If performance is still unacceptable, as I suspect it might be, then you could add a SeqNo column of long integer number data type to the table, again indexing the column non-uniquely, and fill it with sequential values in descending date order per subject with:

    UPDATE [Table]
    SET SeqNo = DCOUNT("*","[Table]","Subject = """ & [Subject] & """ AND [Date] >=#" & Format([Date],"yyyy-mm-dd") & "#");

    You'd then simply restrict a query as follows:

    SELECT *
    FROM [Table]
    WHERE SeqNo <=3;
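
    Before the UPDATE can be run, the SeqNo column has to be added and indexed. A minimal sketch in Access DDL, under the same naming assumptions (each statement runs as a separate query):

    ALTER TABLE [Table] ADD COLUMN SeqNo LONG;

    CREATE INDEX idxSeqNo ON [Table] (SeqNo);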

    I've assumed in the above that the Subject column is of text data type. If it's a number data type, then the literal quotes delimiters, each represented by a contiguous pair of quotes characters, would of course be removed from the expression.

    When inserting new rows into the table the SeqNo value can be computed per subject group in code.  You'll find an illustration of how to do this in CustomNumber.zip in my public databases folder at:

    https://onedrive.live.com/?cid=44CC60D7FEA42912&id=44CC60D7FEA42912!169

    Note that if you are using an earlier version of Access you might find that the colour of some form objects such as buttons shows incorrectly and you will need to amend the form design accordingly.

    If you have difficulty opening the link, copy the link (NB, not the link location) and paste it into your browser's address bar.

    If the dates are not distinct per subject, then the primary key ID column can be brought into play as a tie breaker when computing the sequential numbers.  You'll find examples in RowNumbering.zip in my same OneDrive folder.

    Ken Sheridan, Stafford, England



    • Edited by Ken Sheridan Thursday, January 3, 2019 11:20 AM Typo corrected.
    Thursday, January 3, 2019 11:10 AM
  • The actual data is a flat file import from https://gbif.org/

    There are 44 columns in the import file: https://drive.google.com/file/d/11PkIMhw1GHaqBpsWbWt_v4Y9b0dUWacC/view?usp=sharing

    The main purpose of this task is to get the top 3 of each species. From the data, the primary key looks to be gbifID, since its values are all unique.

    I would like to keep all the columns and just restrict the results to three rows per species.

    A working table was created where taxonrank = "SPECIES", as sketched below.
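
    A make-table query along these lines would produce it; this is just a sketch, with tbl_Import standing in for whatever the flat-file import table is actually called:

    SELECT *
    INTO tbl_Working
    FROM tbl_Import
    WHERE taxonrank = "SPECIES";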

    • Edited by JHarding08 Thursday, January 3, 2019 5:00 PM
    Thursday, January 3, 2019 4:56 PM
  • You have not confirmed whether the dates are distinct per species or not. If they are, I think your best option would be to assign subsets of sequential integer values per species to a column in the way I described. It would take a long time to execute, but it's a once-only task, though if further rows are added to the table from further imports it would be necessary to renumber the sequences by executing the UPDATE statement again.

    If the dates are not distinct per species then you'd need to bring the gbifID candidate key into play as the tie breaker when computing the sequences.


    Ken Sheridan, Stafford, England

    Thursday, January 3, 2019 5:56 PM
  • There really isn't a date that I can use for that function. The only unique values per record seem to be gbifid and occurenceid.
    Thursday, January 3, 2019 6:23 PM
  • The query runs quickly using:

    SELECT tbl_Working.species, tbl_Working.gbifid, tbl_Working.[name]
    FROM tbl_Working
    WHERE tbl_Working.gbifid IN
       (SELECT TOP 5 gbifid
        FROM tbl_Working AS Dupe
        WHERE Dupe.species = tbl_Working.species
        ORDER BY Dupe.gbifid DESC);

    But as soon as I try to ORDER BY tbl_Working.species, it lags.

    name is a concatenation of species_gbifId_license

    This is the working table - https://drive.google.com/file/d/1djwpbn2DNN7-saoR8FLozblntdyj8qSh/view?usp=sharing

    • Edited by JHarding08 Thursday, January 3, 2019 6:41 PM
    Thursday, January 3, 2019 6:25 PM
  • There really isn't a date that I can use for that function. The only unique values per record seem to be gbifid and occurenceid.

    As I said earlier, if the dates are not distinct within each subset, then a candidate key can be used as the tie breaker. Below is the expression from my RowNumbering demo to which I referred you earlier, slightly amended to number the rows in descending date order per customer. You'd just need to substitute your table name for Transactions, the name of your key for TransactionID, the name of your species column for CustomerID, and the name of your date column for TransactionDate, to insert sequential numbers for each subset in descending date order.

    Remember that columns of text data type need the value delimited with literal quotes characters in the expression. A literal quote character can be represented by a single quote ' or a contiguous pair of double quotes "". Date literals are delimited with the # character and must be in US date format or an otherwise internationally unambiguous format; I've used the ISO standard for date notation of YYYY-MM-DD below.

    DCOUNT("*","Transactions","CustomerID = " & CustomerID
        & " AND TransactionDate >= #" & Format(TransactionDate,"yyyy-mm-dd") & "#"
        & " AND (TransactionID >= " & TransactionID
        & " OR TransactionDate <> #" & Format(TransactionDate,"yyyy-mm-dd") & "#)")

    Once each subset of rows is sequentially numbered, it is then a trivial task to return the latest n rows per species.
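
    To make that concrete for your working table, ordering on gbifid alone since there is no usable date, the whole operation might reduce to the sketch below. It assumes gbifid is a number column, species is text, and a SeqNo long integer column has been added and indexed as described earlier (one statement per query):

    UPDATE tbl_Working
    SET SeqNo = DCOUNT("*","tbl_Working",
        "species = """ & [species] & """ AND gbifid >= " & [gbifid]);

    SELECT *
    FROM tbl_Working
    WHERE SeqNo <= 3;

    The DCount counts, for each row, the rows of the same species with an equal or higher gbifid, which is exactly that row's position in descending gbifid order.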

    Be sure to index the columns, including the sequential number column, appropriately before inserting the sequential numbers.  Indexing is crucial to good performance.

    Ken Sheridan, Stafford, England


    • Edited by Ken Sheridan Thursday, January 3, 2019 9:25 PM Typo corrected
    Thursday, January 3, 2019 9:24 PM
  • There really isn't a date that I can use for that function. The only unique values per record seem to be gbifid and occurenceid.

    Hi JHarding08,

    Primary keys must have unique values. Secondary (or alternate) keys can be unique, or have duplicate values.
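
    In Access DDL terms that distinction looks like the sketch below, against your working table (the index names are just illustrations, one statement per query):

    CREATE UNIQUE INDEX pkGbifID ON tbl_Working (gbifid) WITH PRIMARY;

    CREATE INDEX idxSpecies ON tbl_Working (species);

    The first makes gbifid the unique primary key; the second is a non-unique secondary key on species, which is the field your query groups and orders on.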

    Imb.

    Thursday, January 3, 2019 11:10 PM