[SOLVED] Removing duplicate records from a table?

    Question


  • Forgive my FoxPro ignorance, but I am primarily a PHP / MySQL developer. I manage a web application that accepts data from .DBF files generated by a FoxPro application. I have noticed a lot of duplicate records in the tables I receive, and I imagine many more tables in this application also contain duplicates. I would like to present a solution for removing these duplicate entries but do not have sufficient FoxPro knowledge to write one. So instead I will present the algorithm I would like to create; maybe someone can tell me whether this is possible in FoxPro (or suggest a simpler approach).

    THE_TABLES = #get all DBFs in application directory#
    if THE_TABLES is empty
      end program
    end if

    foreach THE_TABLES as TABLE
      NRECORDS = #count records in table#
      NDISTINCT = #count distinct records in table#
      if NRECORDS = NDISTINCT
        skip
      end if
     
      #create TMP_TABLE with same structure as TABLE#
      #select distinct from TABLE into TMP_TABLE#
      NTEMP = #count records in TMP_TABLE#
      if NTEMP = NDISTINCT
        #empty TABLE#
        #select * from TMP_TABLE into TABLE#
      end if
      #destroy TMP_TABLE#
    next

    The tables in this application are "free tables" if that makes a difference.  I would also be curious if there would be any special actions to take in case of tables that contain memo or blob data.
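
    For readers more at home outside FoxPro, the whole-record dedup step in the pseudocode above can be sketched in plain Python (illustrative only; the tuples below stand in for DBF rows, and a real script would need a DBF reader library):

```python
# Sketch of the pseudocode above: count distinct vs. total, and if they
# differ, replace the table's rows with the distinct ones (order preserved).

def dedupe_rows(rows):
    """Return rows with exact duplicates removed, preserving first-seen order."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:       # a record is a duplicate only if ALL fields match
            seen.add(row)
            result.append(row)
    return result

table = [("Acme", "NY"), ("Best", "LA"), ("Acme", "NY")]
if len(table) != len(set(table)):   # NRECORDS <> NDISTINCT
    table = dedupe_rows(table)      # replace contents with the distinct rows
print(table)
```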

    I appreciate any help provided.
    Wednesday, March 19, 2008 7:36 PM

Answers

  • It starts with deciding what counts as a duplicate. From your pseudocode I take it you consider records duplicates when all fields are the same. The next thing to consider is whether you can get exclusive locks on the tables; presumably you can?

     

    Then this would be a simple solution, assuming you want to do that for all tables in a given folder (the code is a slow implementation, on the assumption that the tables may contain data types such as memo or blob):

     

    Code Snippet

    local lcFolder, ix, lcField, jx, llCanIndex, oPrevious, oCurrent
    local array aTables[1]

    lcFolder = getdir("","Select data folder")
    if !empty(m.lcFolder)
      for ix = 1 to adir(aTables, addbs(m.lcFolder) + "*.dbf")
          use ( addbs(m.lcFolder) + aTables[m.ix,1] ) exclusive
          llCanIndex = .F.
          for jx = 1 to fcount()
              if !(type(field(m.jx)) $ "MQG")  && skip memo, blob, general fields
                  lcField = field(m.jx)
                  llCanIndex = .t.
                  index on &lcField tag dummy
                  exit
              endif
          endfor
          if m.llCanIndex
              scatter name oPrevious memo blank
              scan
                  scatter name oCurrent memo
                  if compobj(oPrevious, oCurrent) && duplicate of previous record
                      delete
                  else
                      scatter name oPrevious memo
                  endif
              endscan
              delete tag dummy  && remove the temporary index tag
          endif
          pack
          use
      endfor
    endif
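
    The index-then-scan pass above amounts to a dedup over sorted rows; a rough Python sketch of the same adjacent-comparison idea (illustrative, not FoxPro):

```python
# Adjacent-comparison dedup: assumes rows are already sorted so that
# identical rows sit next to each other (what INDEX ON arranges above).
def drop_adjacent_duplicates(sorted_rows):
    previous = object()          # sentinel equal to nothing, like the blank SCATTER
    kept = []
    for row in sorted_rows:
        if row == previous:      # duplicate of the previous row: skip it (DELETE)
            continue
        kept.append(row)
        previous = row
    return kept

rows = sorted([("b", 2), ("a", 1), ("a", 1), ("b", 2)])
print(drop_adjacent_duplicates(rows))    # [('a', 1), ('b', 2)]
```

    One caveat: sorting on a single field only groups rows loosely; full-record duplicates are guaranteed to be adjacent only when the sort key covers every compared field.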

     

     

    Another implementation, ignoring the possibility of memo or blob fields:

     

    Code Snippet

    local lcFolder, ix
    local array aTables[1]

    lcFolder = getdir("","Select data folder")
    if !empty(m.lcFolder)
      for ix = 1 to adir(aTables, addbs(m.lcFolder) + "*.dbf")
          use ( addbs(m.lcFolder) + aTables[m.ix,1] ) exclusive alias RemDupe
          select distinct * from RemDupe into cursor myCursor nofilter
          select RemDupe   && make the table current again before emptying it
          zap
          insert into RemDupe select * from myCursor
          use in myCursor
          use in RemDupe
      endfor
    endif

     

     

    PS: This code is off the top of my head and untested. Be sure to back up your data before trying it.
    Wednesday, March 19, 2008 8:45 PM

All replies

  • In my experience, the tricky part about dealing with duplicates is deciding what constitutes a duplicate. Are you looking for absolutely identical records or for records that are identical from the POV of a human reader? The former you can automate, the latter you can only partially automate.

    Yes, you can write code like you've outlined, but you can actually do it without copying all the data into another table and back. You can just use SQL DELETE to delete the records, once you know how to identify them, and then PACK the table.

    Tamar
    Wednesday, March 19, 2008 8:25 PM
    Answerer


  • Thank you for your response.

    I'm looking for identical records where the columns are identical across the board.

    Unless FoxPro's SQL DELETE behaves differently from MySQL's, I don't believe I could use it to delete the duplicate records, as there is no WHERE clause I could build that would separate them from the record I want to keep.

    Again, thank you for your time.
    Wednesday, March 19, 2008 8:39 PM

  • That should give me enough to investigate this on my own, so thank you for that.

    Any future readers should consider this topic solved unless I resurrect it.

    Thanks again.
    Wednesday, March 19, 2008 8:50 PM
  • Hi,

    Use ADIR() to get the list of files, then use a SELECT statement to eliminate the duplicate records:

    CLOSE ALL                && close all open databases, tables, views, etc.
    CLEAR                    && clear the screen

    SET DATE BRITISH         && date format DD/MM/YYYY
    SET CENTURY ON           && show 2008 instead of 08
    SET TALK OFF             && suppress command output to the screen
    SET DELETED ON           && ignore records marked for deletion
    = ADIR(abc, "*.dbf")     && place the DBF names into array abc (see Help)
    B = ALEN(abc, 1)         && number of files found

    PRIVATE DBNAME           && variable to hold each DBF name
    STORE "" TO DBNAME
    FOR I = 1 TO B                                        && loop through the array
        DBNAME = abc(I, 1)                                && store the DBF name
        SELECT 0
        USE &DBNAME EXCLUSIVE ALIAS MAINFILE              && open the DBF
        SELECT * FROM MAINFILE GROUP BY KEYFLD INTO DBF X && group by the key field(s) to eliminate duplicates; result goes to X.DBF
        SELECT X
        USE                                               && close the temporary DBF

        SELECT MAINFILE                                   && select the original file
        ZAP                                               && completely empty it (see Help for ZAP)
        APPEND FROM X                                     && append the de-duplicated data from X.DBF
        USE                                               && close the main file
    NEXT                                                  && repeat for all DBFs
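
    Note that grouping by a key field keeps one row per key value, not one row per distinct record; a Python sketch of that behavior (KEYFLD is the author's placeholder for the key field(s), and "keep the first row" is an assumption here, since VFP's GROUP BY picks unspecified values for non-grouped fields):

```python
# Keep the first row seen for each key value, roughly mirroring
# SELECT ... GROUP BY KEYFLD. The key here is the first column; a real
# script would use whatever key fields the table actually has.
def first_per_key(rows, key_index=0):
    kept = {}                                 # key value -> first row seen
    for row in rows:
        kept.setdefault(row[key_index], row)  # later rows with the same key are dropped
    return list(kept.values())                # insertion order preserved (Python 3.7+)

rows = [("A", 1), ("A", 2), ("B", 3)]
print(first_per_key(rows))    # [('A', 1), ('B', 3)]
```

    Unlike whole-record dedup, this also discards rows that differ outside the key (the ("A", 2) row above), so it only suits tables where the key alone defines a duplicate.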

     

    Thursday, March 20, 2008 5:50 AM
  • In fact, you can use SQL DELETE for stuff like this. You need a subquery that builds the list of duplicated records, and you need some way to distinguish the one to keep from the one to delete. That last part may be the sticking point in your case.

    Here's an example (from my book, Taming VFP's SQL) of a case where you're not matching every field:

    DELETE Cust ;
        FROM Cust ;
            JOIN ( ;
                SELECT CustomerID, Dups.* ;
                    FROM Cust ;
                        JOIN ( ;
                            SELECT NVL(Address, "") AS Address, NVL(City, "") AS City, ;
                                   NVL(Region, "") AS Region, NVL(Country, "") AS Country, ;
                                   COUNT(*) AS nCount ;
                                FROM Cust ;
                                GROUP BY Address, City, Region, Country ;
                                HAVING nCount > 1) Dups ;
                            ON NVL(Cust.Address, "") = Dups.Address ;
                            AND NVL(Cust.City, "") = Dups.City ;
                            AND NVL(Cust.Region, "") = Dups.Region ;
                            AND NVL(Cust.Country, "") = Dups.Country) DupIDs ;
                ON Cust.CustomerID > DupIDs.CustomerID ;
                AND NVL(Cust.Address, "") = DupIDs.Address ;
                AND NVL(Cust.City, "") = DupIDs.City ;
                AND NVL(Cust.Region, "") = DupIDs.Region ;
                AND NVL(Cust.Country, "") = DupIDs.Country


    In fact, if you have a primary key, that's all you'd need to distinguish them.
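
    The rule the DELETE implements — within each group of rows matching on the compared fields, keep the row with the lowest CustomerID — can be sketched outside SQL as well (illustrative Python, with tuples standing in for Cust records):

```python
# For each group of rows that match on the duplicate-defining fields,
# keep only the row with the smallest primary key (CustomerID here);
# every other row in the group is a deletable duplicate.
def ids_to_delete(rows):
    """rows are (customer_id, address, city, region, country) tuples."""
    lowest = {}                      # compared fields -> smallest id seen
    for cid, *fields in rows:
        key = tuple(fields)
        if key not in lowest or cid < lowest[key]:
            lowest[key] = cid
    # any row whose id is not its group's smallest gets deleted
    return sorted(cid for cid, *fields in rows if cid != lowest[tuple(fields)])

rows = [
    (1, "Main St", "Rome", "RM", "IT"),
    (2, "Main St", "Rome", "RM", "IT"),   # same fields as row 1, higher ID: delete
    (3, "High St", "Pisa", "PI", "IT"),
]
print(ids_to_delete(rows))    # [2]
```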

    Tamar
    Thursday, March 20, 2008 8:33 PM
    Answerer