CSV change detection

  • Question

  • Hi, I am seeking to load all the rows/cells from a CSV file on a forward-only basis, but also to monitor that file for changes, down to either row- or cell-level granularity.

    The first is fairly straightforward; the second sounds straightforward, but on further thought is potentially complicated.

    I would like to be informed of rows being deleted, added, or updated in a CSV file so that I don't have to completely reload all the CSV data again to know what has been changed since the last time I read through it. A few ways I have thought of so far:

    - Cache the last read version of the file. When the file changes, compare the rows of new CSV file with the rows of the old one. Use this comparison to generate the change list. But, this requires you to always keep the last CSV file cached; not necessarily a bad thing, but a requirement. Obviously, this requires some way of stating a row key (so that we know which rows to compare when looking through the two files). Probably this could either be a surrogate key (row index) or a combination of cell(s) on a single row.

    - I think other approaches exist, such as the one used by Microsoft's (now fairly inactive) Sync Framework, which generated change metadata and stored it in a local database. That metadata was then used in some way to generate the granular change list when the file changed.
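The first approach above (cache the last-read version and compare keyed rows) can be sketched as follows. This is an illustrative Python sketch, not tied to any particular library; the key column name `id` is a hypothetical choice standing in for whatever surrogate or composite key you decide on:

```python
import csv

def load_keyed(path, key_field="id"):
    """Load a CSV snapshot into a dict keyed by the chosen key column."""
    with open(path, newline="") as f:
        return {row[key_field]: row for row in csv.DictReader(f)}

def diff_snapshots(old_rows, new_rows):
    """Compare two keyed snapshots; return (added, removed, updated) key sets."""
    old_keys, new_keys = set(old_rows), set(new_rows)
    added = new_keys - old_keys
    removed = old_keys - new_keys
    # A key present in both snapshots whose row content differs is an update.
    updated = {k for k in old_keys & new_keys if old_rows[k] != new_rows[k]}
    return added, removed, updated
```

The same shape translates directly to a .NET dictionary comparison; the point is that once rows are keyed, the change list falls out of three set operations.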

    I assume this has been done before. Could anyone suggest best practices, or perhaps some well-used libraries that have already invented the wheel in this case?

    Thank you


    • Edited by Markos101 Thursday, September 28, 2017 4:49 PM

All replies

  • How large of a CSV are you loading?
    Since it is a "file" operation, it may be enough to load the file (or portions of it) into RAM and use the set operators on it.
    The .Except() method is very powerful when applied in both directions.
    You could compare the file as strings, or as a custom class if you only want certain fields to be unique.

    This is a classic "audit" of a file, where you keep a "master" set aside for comparison with incoming data.

    Does the entire row/string of the CSV count as the key?
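    The two-direction set comparison described above can be sketched like this (shown in Python for brevity, where the set difference operator plays the role of LINQ's .Except(); treating the entire row/string as the key, as asked):

```python
def audit(master_lines, incoming_lines):
    """Classic file 'audit': compare a kept-aside master set against
    incoming data, in both directions, using whole rows as keys."""
    master, incoming = set(master_lines), set(incoming_lines)
    only_in_incoming = incoming - master  # rows that are new (or modified)
    only_in_master = master - incoming    # rows deleted (or pre-modification)
    return only_in_incoming, only_in_master
```

    Note that with whole-row keys, a modified row shows up on both sides, as a "delete" of the old text and an "add" of the new text.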


    Thursday, September 28, 2017 10:14 PM
  • A CSV is just a plain text file, so you should not expect any "fancy" change-tracking capabilities, and you should always expect to have to read the whole content to compare once the timestamp has changed. (And note that on Vista or later you may need to enable the relevant option in the registry before you can rely on the timestamp at all.)

    So your first suggestion is practically the only way. And if the file size is significantly large, you'll want to cache the previous version of the file somewhere on disk instead.

    While you do need to load both files and read the whole content, you can use diff-like libraries to help you focus on the modified rows. Note that the quality of diff tools varies when there are both inserts and deletes, or when rows are moved around. The diff capability of the old Visual SourceSafe, for example, was not quite satisfactory: it could fail to recognize a moved block and treat it as an insert plus a delete of a large block (including the unchanged lines), resulting in more "change" entries than there should be.

    Use .Except() with explicit key-field comparison, as THines01 suggested, to be sure. But then again, since you have no way to guarantee that the key fields are not duplicated if the user edits the file with an external editor, comparison becomes trickier without a unique-key requirement (with one, you can simply reject the file if duplicate key records are found).
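    The "reject on duplicate keys" check can be done in a single forward-only pass before attempting any comparison. A minimal Python sketch, again assuming a hypothetical key column named `id`:

```python
from collections import Counter

def find_duplicate_keys(rows, key_field="id"):
    """Scan rows (e.g. from csv.DictReader) and return any duplicated keys,
    so the file can be rejected before a keyed comparison is attempted."""
    counts = Counter(row[key_field] for row in rows)
    return [key for key, n in counts.items() if n > 1]
```

    An empty result means the file is safe to compare by key; anything else is grounds to reject the file outright, as suggested above.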

    Friday, September 29, 2017 1:31 AM
  • Hello, and thank you for your replies.

    Unfortunately, I cannot assume that the files will always be small enough to load into RAM for a relatively fast in-memory comparison, which is why I suggested a forward-only (one row at a time) way of filtering through the CSV file.

    The problem is that, while I don't have to load the whole CSV file into memory if I read it in a forward-only fashion, the order of the keys in the two files can obviously differ; in that case, a single forward-only pass over both files is not enough to make a comparison. Possibly, I could cache only the keys of the rows of each file in memory, use those to work out the added, removed, and still-present rows, and then scan the files again to work out any changes to the rows that remain. This assumes, of course, that (as mentioned) the keys are unique and that the full set of keys is not large enough to cause memory problems.
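    One way to sketch this keys-only idea in Python: instead of caching full rows, keep only each row's key plus a short content digest from a forward-only pass, so that memory use is bounded by the number of keys rather than the file size. This is an illustrative sketch under those assumptions (hypothetical key column `id`; SHA-1 chosen arbitrarily as a compact digest):

```python
import csv
import hashlib

def key_digests(path, key_field="id"):
    """Forward-only pass: record each row's key and a content digest,
    never holding more than one full row in memory at a time."""
    digests = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Sort field names so column order differences don't matter.
            body = "\x1f".join(row[k] for k in sorted(row))
            digests[row[key_field]] = hashlib.sha1(body.encode()).hexdigest()
    return digests

def classify(old_path, new_path, key_field="id"):
    """Return (added, removed, updated) key sets from the two digest maps."""
    old = key_digests(old_path, key_field)
    new = key_digests(new_path, key_field)
    added = set(new) - set(old)
    removed = set(old) - set(new)
    updated = {k for k in set(old) & set(new) if old[k] != new[k]}
    return added, removed, updated
```

    A second forward-only pass over the new file, filtered by the `updated` set, would then yield the changed rows themselves without ever holding both files in memory.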

    The other alternative is to treat every change as a removal of all the rows in the old CSV file followed by an add of all the rows in the new one. That is rather brute force, but it would certainly work every time.

    Indeed, there does not seem to be a completely solid, deterministic way of doing this.

    Friday, September 29, 2017 7:08 AM