Optimization challenge! Checksum calc takes too long ...

Question
-
Hi!
I have to find all duplicate files on a large set of different volumes with very many folders and files.
Where filename, filelength and timestamp (in different folders) are identical, we're assuming duplication. But where only the name and length are a match, we don't know.
To try and match these unknown files I calculate a "rotate left 31-bit checksum" as below, but with so many large files to process, every instruction adds a relatively huge amount to the total processing time.
Q1: Can anyone optimize the code below? (i.e. shorten the loop-time)
Q2: Is there maybe a better (faster) way to create a reasonably robust checksum?
Code snippet:
Dim sBuf() As Byte ' file buffer
' Each variable needs its own "As Long"; otherwise all but the last are declared as Variant
Dim i As Long, fLen As Long, curCRC As Long, bit30 As Long, bit31 As Long ' counter, length, CRC & store for bits 30 & 31

' At this point the entire binary file is already in sBuf(), and fLen is the number of bytes read.
curCRC = 0
For i = 0 To fLen - 1 ' sBuf() is assumed to be zero-based, so the last byte is at index fLen - 1
    ' save 30th and 31st bit of current CRC
    bit30 = curCRC And &H20000000
    bit31 = curCRC And &H40000000
    ' shift left bits 1 - 29
    curCRC = (curCRC And &H1FFFFFFF) * 2
    If bit30 <> 0 Then
        ' put old bit 30 back as bit 31 (shift it left)
        curCRC = curCRC Or &H40000000
    End If
    If bit31 <> 0 Then
        ' put old bit 31 back as bit 1 (rotate it left to bit 1 skipping bit 32)
        curCRC = curCRC Or &H1
    End If
    curCRC = curCRC Xor sBuf(i) ' update the simple CRC value
Next i

All input gratefully received :)
- Edited by MicroXand Wednesday, January 20, 2016 2:49 PM
Wednesday, January 20, 2016 2:48 PM
All replies
-
Hi MicroXand,
Without the complete code, I am not sure how you have implemented this. But based on your description, do you want to compare files according to their properties? In my opinion, looping through the files in the folders would waste a lot of time. I suggest you create two tables with the same structure.
The tables could have fields such as folder, filename, filelength, timestamp and filepath, which you fill with the information from the files. When you want to compare the files, you can then query these two tables. I suggest you give it a try.
Best Regards,
Edward
Thursday, January 21, 2016 3:24 AM -
Hi Edward, thanks for your input.
I already have all the available file info in tables, and 'real' duplicates are then easy to find, as we are assuming that files that are identical with regard to name, timestamp and length are simple copies.
My problem comes with files that have the same name and length, but different timestamps - are these true copies or not? Therefore I need an extra evaluation that is not available from file info, e.g. a file checksum.
(Actually the same problem arises where users may have stored two copies without making changes, then we could have files that are only identical with regard to filelength and possibly extension).
BTW, I found a 15% improvement by removing the "<> 0" from the 30/31 bit-testing. Unless someone comes up with a better replacement for the whole checksum generation then I guess I'll have to go with this. We have also started looking at tools that are already 'out there'.
Interestingly, "*2" on a long is exactly as efficient as long + long, and removing endifs by putting the if/then on one line (i.e. if x then y instead of if x then \n y \n endif) makes no difference.
Regards. MX.
The original Xand
- Edited by MicroXand Thursday, January 21, 2016 7:41 AM
Thursday, January 21, 2016 7:40 AM -
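For readers weighing the micro-optimizations above: the two saved bits and the two If tests amount to a single rotate-left within 31 bits, which can be written as one expression. The sketch below is in C rather than VBA and is only an illustration, not the code used in this thread. The names rotl31, step_branchy, step_branchless and checksum31 are made up for the example, and it assumes the running value never uses bit 32 (true when it starts at 0 and only bytes are XORed in).

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 31-bit value left by one position. Bit 32 is never used,
   mirroring the way the VBA code stays clear of the sign bit.
   Assumes v < 2^31. */
uint32_t rotl31(uint32_t v)
{
    return ((v << 1) | (v >> 30)) & 0x7FFFFFFFu;
}

/* Branching version: a direct transcription of the VBA loop body. */
uint32_t step_branchy(uint32_t crc, uint8_t byte)
{
    uint32_t bit30 = crc & 0x20000000u;
    uint32_t bit31 = crc & 0x40000000u;

    crc = (crc & 0x1FFFFFFFu) * 2u;     /* shift bits 1 - 29 left */
    if (bit30) crc |= 0x40000000u;      /* old bit 30 becomes bit 31 */
    if (bit31) crc |= 0x1u;             /* old bit 31 wraps to bit 1 */
    return crc ^ byte;
}

/* Branch-free version: one rotate, one XOR. */
uint32_t step_branchless(uint32_t crc, uint8_t byte)
{
    return rotl31(crc) ^ byte;
}

/* Checksum over a buffer that is already in memory. */
uint32_t checksum31(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++)
        crc = step_branchless(crc, buf[i]);
    return crc;
}
```

Both step functions return the same value for any 31-bit input, so the branch-free form can replace the original loop body without changing the resulting checksums.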
Where filename, filelength and timestamp (in different folders) are identical, we're assuming duplication. But where only the name and length are a match, we don't know.
Hi MicroXand,
Why not just compare the contents of the two files with If (Get_file(file1) <> Get_file(file2)) Then ..., where Get_file is:
Function Get_file(cur_filename As String) As String
    Dim cur_file As Integer
    cur_file = FreeFile
    Open cur_filename For Binary As #cur_file
    Get_file = Input(LOF(cur_file), #cur_file)
    Close #cur_file
End Function
Imb.
- Proposed as answer by Edward8520Microsoft contingent staff Friday, January 29, 2016 7:35 AM
Thursday, January 21, 2016 8:36 AM -
Hi Imb-hb,
Comparing strings is a great idea when there is enough free memory to read in two files, but many of our files are far too large for this.
The big advantage of string compare is that there is then no doubt at all about the files being equal.
What I have actually done is to write a routine in C that opens the files and calculates the checksums. This then feeds the results back via a text file for Access to continue processing.
Both the file-handling and the bit manipulation loops are a lot faster that way.
I'll close the question here.
The original Xand
- Marked as answer by MicroXand Friday, January 29, 2016 8:04 AM
Friday, January 29, 2016 8:04 AM -
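The C routine itself was not posted. As a rough idea of what such a routine could look like, here is a minimal sketch that reads a file in fixed-size chunks and applies the same 31-bit rotate-and-XOR step to each byte, so the whole file never has to fit in memory. The function name file_checksum31, the 64 KB chunk size and the return-code convention are assumptions made for this example, not details from the post.

```c
#include <stdint.h>
#include <stdio.h>

/* Compute the 31-bit rotate-left/XOR checksum of a file, reading it in
   fixed-size chunks. Returns 0 on success and stores the checksum in
   *out; returns -1 if the file cannot be opened or read. */
int file_checksum31(const char *path, uint32_t *out)
{
    unsigned char buf[65536];           /* chunk size is an arbitrary choice */
    uint32_t crc = 0;
    size_t n;
    int failed;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            /* rotate left by one within 31 bits, then fold in the byte */
            crc = ((crc << 1) | (crc >> 30)) & 0x7FFFFFFFu;
            crc ^= buf[i];
        }
    }

    failed = ferror(fp);                /* distinguish EOF from a read error */
    fclose(fp);
    if (failed)
        return -1;

    *out = crc;
    return 0;
}
```

A caller could loop over the candidate paths and print each path with its checksum to a text file, which matches the hand-off to Access described in the post.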
The big advantage of string compare is that there is then no doubt at all about the files being equal.
I'll close the question here.
Hi Xand,
Even after closing the question ...
When file size and/or free memory is a problem, you can loop through the two files, reading chunks of 1000 or 10,000 bytes (or whatever size you want), until there is a difference.
When there is no difference found, then the files are equal.
Imb.
Friday, January 29, 2016 1:22 PM
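The chunked comparison described above can be sketched as follows. It is written in C to match the routine mentioned earlier in the thread, and it is a minimal illustration only: the function name files_equal, the 8192-byte chunk size and the return codes are assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

/* Compare two files chunk by chunk, stopping at the first difference.
   Returns 1 if the contents are identical, 0 if they differ,
   and -1 if either file cannot be opened or read. */
int files_equal(const char *path_a, const char *path_b)
{
    unsigned char buf_a[8192], buf_b[8192];   /* chunk size is arbitrary */
    int result = 1;
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");

    if (fa == NULL || fb == NULL) {
        result = -1;
    } else {
        for (;;) {
            size_t na = fread(buf_a, 1, sizeof buf_a, fa);
            size_t nb = fread(buf_b, 1, sizeof buf_b, fb);

            if (ferror(fa) || ferror(fb)) {
                result = -1;
                break;
            }
            /* A short read on only one side means the lengths differ. */
            if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
                result = 0;
                break;
            }
            if (na == 0)
                break;      /* both files ended without a difference */
        }
    }

    if (fa != NULL) fclose(fa);
    if (fb != NULL) fclose(fb);
    return result;
}
```

Memory use is fixed at two chunks regardless of file size, and files that differ early are rejected after a single read.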