FileSystemWatcher reliability

    Question

  • Hi,

    On my project we are seeking to replace a time- and resource-consuming file-system polling step, which checks every file in a rather large directory structure for changes since the last poll, with callback events from a FileSystemWatcher running in a resident program.  We had it deployed for a while when some strange problems arose which, our logs indicated, could only be the result of FileSystemWatcher missing events.  To confirm that this was the problem, I wrote a test program for the FileSystemWatcher.  The results show that no matter which documented steps are taken to ensure reliability, there is always a small number of file system modifications for which the FileSystemWatcher raises no event at all (a non-zero false-negative rate).  Unfortunately, for our application to work correctly, there must be absolutely no false negatives. 

    I have taken all the steps the documentation recommends to remedy the unreliability, to no avail: the miss rate stays consistent regardless of InternalBufferSize - I have tried increasing it all the way to 1 MB.  Additionally, no Error event is raised when a file system change is missed, which would at least allow our application to know the file system is dirty.

    My test procedure involves performing 1000 sets of 10 random, sequentially executed, common changes to a file system: creating files/directories, deleting them, modifying files, and moving/renaming them.  One program was written to host the monitor, and another was written to tell the first program to spawn the monitor and then randomly modify the file system.  The event callbacks I am concerned with are Changed, Deleted, and Renamed (a sketch of the monitor wiring follows the results below).  The results for the test cases I have run are:

    (1) In the test case with a 5% chance of deleting a file, a 5% chance of deleting a directory structure, a 5% chance of renaming a file, a 10% chance of creating a directory, a 25% chance of creating a file, and a 50% chance of modifying an existing file, the false-negative rate for the Changed, Deleted, and Renamed events ran between 0.15% and 0.35%.  File names for creation were random, between 1 and 32 alphanumeric characters.  This was tested with InternalBufferSize at 8 KB, 32 KB, and 1 MB.  With the 1 MB InternalBufferSize, the Error event was never raised (I have not yet tried using the Error event with a smaller buffer size).  The monitored directory the test was run on was initially empty or sparsely populated.

    (2) In the test case which *only* modifies existing files (the most important event for our application to catch), the false-negative rate ran between 0.04% and 0.09%.  The InternalBufferSize was tested at both 8 KB and 1 MB, and the Error event was never raised.  The monitored directory was initially sparsely populated.
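
    For reference, the monitor side is wired up roughly as in the sketch below.  This is a minimal illustration rather than the actual test program: the class, field, and handler names are placeholders, and it relies on the default NotifyFilter (FileName | DirectoryName | LastWrite).

    Code Snippet
    //Minimal monitor sketch (C++/CLI, compiled with /clr); names are illustrative only
    using namespace System;
    using namespace System::IO;
    using namespace System::Collections::Generic;

    ref class ChangeMonitor
    {
    public:
        static List<String^>^ changedFiles;      //what the test harness reads back
        static FileSystemWatcher^ watcher;       //kept in a static field so it stays alive

        static void Start(String^ path)
        {
            changedFiles = gcnew List<String^>();

            watcher = gcnew FileSystemWatcher(path);
            watcher->IncludeSubdirectories = true;       //watch the whole tree
            watcher->InternalBufferSize = 8 * 1024;      //8 KB / 32 KB / 1 MB in the tests

            watcher->Changed += gcnew FileSystemEventHandler(&ChangeMonitor::OnChanged);
            watcher->Deleted += gcnew FileSystemEventHandler(&ChangeMonitor::OnChanged);
            watcher->Renamed += gcnew RenamedEventHandler(&ChangeMonitor::OnRenamed);

            watcher->EnableRaisingEvents = true;         //start monitoring
        }

        static void OnChanged(Object^ sender, FileSystemEventArgs^ e)
        {
            changedFiles->Add(e->FullPath);              //record for later comparison
        }

        static void OnRenamed(Object^ sender, RenamedEventArgs^ e)
        {
            changedFiles->Add(e->FullPath);              //new name of the renamed item
        }
    };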

    In addition, on the chance that this was a strange threading issue, I also tried placing lock() blocks around the List<string> in which the app stores the filenames from the change events, inside the Changed, Deleted, and Renamed delegate implementations.  This had no effect on the test results.
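
    For what it's worth, the synchronized handler looked roughly like this (a sketch only; in C++/CLI, Monitor::Enter/Exit stands in for C#'s lock statement):

    Code Snippet
    //Changed/Deleted handler with explicit locking around the shared list
    static void OnChanged(Object^ sender, FileSystemEventArgs^ e)
    {
        System::Threading::Monitor::Enter(changedFiles);    //equivalent of C# lock(changedFiles)
        try
        {
            changedFiles->Add(e->FullPath);
        }
        finally
        {
            System::Threading::Monitor::Exit(changedFiles);
        }
    }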

    I would greatly appreciate any insight or advice.  Cheers.
    Thursday, December 06, 2007 12:48 AM

All replies

  • Yes, even the documentation says it can overflow and lose track, so it's not a 100% solution.

    You might want to consider every now and then looking for all files with a last access, creation, or write time above some threshold and including them in your list.  It's not the best, and it's a bit of a polling approach, which I hate.
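
    Something along these lines, perhaps (a rough sketch only; the root path and the last-sweep timestamp come from wherever your app tracks them):

    Code Snippet
    //Periodic fallback sweep: collect anything written or created since the last sweep.
    using namespace System;
    using namespace System::IO;
    using namespace System::Collections::Generic;

    static List<String^>^ FindRecentlyWritten(String^ root, DateTime lastSweepUtc)
    {
        List<String^>^ suspects = gcnew List<String^>();

        //recurse over the whole tree - expensive, which is why this is only a safety net
        for each (String^ file in Directory::GetFiles(root, "*", SearchOption::AllDirectories))
        {
            if (File::GetLastWriteTimeUtc(file) > lastSweepUtc ||
                File::GetCreationTimeUtc(file) > lastSweepUtc)
            {
                suspects->Add(file);    //treat as changed since the last sweep
            }
        }

        return suspects;
    }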

    Monday, December 10, 2007 5:18 AM
  • The problem is not that it overflows (even though it should never overflow under the test scenario); the problem is that the Error event is never raised.  If *any* file system change occurs, it is the contractual responsibility of FileSystemWatcher to *always* raise some event, and to *never* let a change pass silently. 

    If the FileSystemWatcher is not meant to replace polling as a method for a program to keep exact track of a directory structure, then its documentation is entirely misleading, because that is the exact purpose the documentation appears to target.

    Wednesday, December 12, 2007 2:19 AM
  • I agree with your disappointment with the FileSystemWatcher class.

     

    Interestingly, the documentation for InternalBufferSize states:
    Increasing the buffer size can prevent missing file system change events. Note that an instance of the FileSystemWatcher class does not raise an Error event when an event is missed or when the buffer size is exceeded, due to dependencies with the Windows operating system.

     

    Quite poor really.

     

    What I'm wondering about is the statement "If there are many changes in a short time, the buffer can overflow. This causes the component to lose track of changes in the directory, and it will only provide blanket notification."

     

    This I understand. If Windows is unable to keep track of all the individual changes within the directory, all it can do is give "blanket notification". Fair enough.

     

    In your extensive testing, was "blanket notification" given in extreme cases, or nothing at all? I couldn't find a definition of "blanket notification"; I'm assuming it means a single notification that something changed at the directory level rather than a notification for each file.

     

    Thanks
    Ronny

    Monday, April 14, 2008 2:08 AM
  • You're right... It's simply not reliable.

     

    Try to consider why you need to do this sort of thing in the first place. Is there any other way of notifying your application of the need to do something to a file? Where are the files coming from? Are they not part of a greater application structure, or is this purely a filesystem-related application? If they are part of a greater architecture, how can it be modified to include a direct, deterministic route from the process which creates the files to the process that handles them? Even under the most primitive architectures, you can use batch files to launch processes, use database triggers, abstract the filesystem interaction behind a service, or pass messages between applications through interprocess communication, a shared queue, a flag in the registry, a file on disk, or whatnot... There are lots of ways to get two programs to coordinate actions.

     

    All of these involve varying degrees of coupling and engineering, but if reliability is your primary concern, then the system will need to be engineered for a deterministic linkage of some sort between the process that generates the files and the process that handles them.

     

    Depending on the kernel and the filesystem for that coordination indicates that there is a broken link in the design somewhere. That's a bigger problem. Of course, if your program is only concerned and involved with filesystem operations, then this doesn't apply, but that's rarely the case.

     

    Anyhow, for a better understanding of what FileSystemWatcher is doing, you'll need to learn about the ReadDirectoryChangesW Win32 API call (from kernel32.dll)...

     

    ReadDirectoryChangesW

    http://msdn2.microsoft.com/en-us/library/aa365465(VS.85).aspx

     

    The FileSystemWatcher class just wraps an asynchronous call to that function, which calls back into a .NET method on each change event, and maintains a queue of the results those callbacks contain.

     

    You could try implementing your own version of this, and manage the queue and interpretation of the callback data yourself. I would be surprised if this produced better results though, since the underlying issue is that the Win32 calls themselves are not very reliable.
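
    For the curious, a bare-bones synchronous use of that API looks something like the sketch below (native Win32, minimal error handling).  It blocks until changes arrive, and note that it can itself miss detail if its buffer overflows between calls:

    Code Snippet
    #include <windows.h>
    #include <stdio.h>

    //Minimal synchronous ReadDirectoryChangesW loop. FileSystemWatcher does essentially
    //the same thing asynchronously, with InternalBufferSize playing the role of 'buffer'.
    void WatchDirectory(const wchar_t* path)
    {
        HANDLE dir = CreateFileW(path,
                                 FILE_LIST_DIRECTORY,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                                 NULL,
                                 OPEN_EXISTING,
                                 FILE_FLAG_BACKUP_SEMANTICS,    //required to open a directory
                                 NULL);
        if (dir == INVALID_HANDLE_VALUE)
            return;

        DWORD buffer[16 * 1024];    //64 KB, DWORD-aligned as the notify records require
        DWORD bytesReturned = 0;

        //Blocks until changes are available, then delivers a batch of
        //FILE_NOTIFY_INFORMATION records packed into the buffer.
        while (ReadDirectoryChangesW(dir, buffer, sizeof(buffer),
                                     TRUE,                              //watch the subtree
                                     FILE_NOTIFY_CHANGE_FILE_NAME |
                                     FILE_NOTIFY_CHANGE_DIR_NAME |
                                     FILE_NOTIFY_CHANGE_LAST_WRITE,
                                     &bytesReturned, NULL, NULL))       //synchronous call
        {
            if (bytesReturned == 0)
                continue;    //nothing usable in the buffer (e.g. it overflowed)

            FILE_NOTIFY_INFORMATION* info = (FILE_NOTIFY_INFORMATION*)buffer;
            for (;;)
            {
                //FileName is not null-terminated; FileNameLength is in bytes
                wprintf(L"action %lu: %.*ls\n",
                        info->Action,
                        (int)(info->FileNameLength / sizeof(WCHAR)),
                        info->FileName);

                if (info->NextEntryOffset == 0)
                    break;
                info = (FILE_NOTIFY_INFORMATION*)((BYTE*)info + info->NextEntryOffset);
            }
        }

        CloseHandle(dir);
    }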

     

    Hope that helps,

    Troy

    Monday, April 14, 2008 8:04 PM
  • So I actually did finally resolve this issue myself with a workaround, after lots of extensive testing, including mucking around with the sample driver in the IFS Kit (Installable File System Kit).  I used the IFS sample driver to test its reliability against FileSystemWatcher, and discovered that both had approximately the same error rates under the test scenarios I posted earlier - quite unnerving, as a driver running in kernel space should not exhibit the same problems a user-space callback does; that indicates the kernel was not making the appropriate calls to the driver!  Before I go into the explanation, here's the code snippet which gets the FileSystemWatcher to report back all changed files:

    Code Snippet
    //requires <windows.h> and, for PtrToStringChars, <vcclr.h>; compiled with /clr
    //drive should be 'C', 'D', etc...
    //This flushes the *volume* C:, not the *directory* C:\
    static void FlushVolume(__wchar_t drive)
    {
        //construct fully qualified volume path, e.g. "\\.\C:"
        String^ volumeStr = String::Format("\\\\.\\{0}:", drive);

        //pin the managed string and use its internal null-terminated buffer directly
        pin_ptr<const wchar_t> wcVolumeStr = PtrToStringChars(volumeStr);

        HANDLE volume = CreateFileW(    wcVolumeStr,
                        GENERIC_WRITE,
                        FILE_SHARE_WRITE,
                        NULL,
                        OPEN_EXISTING,
                        0,
                        NULL);

        if( volume == INVALID_HANDLE_VALUE)
        {
            DWORD result = GetLastError();
            throw gcnew Exception("Could not open handle to " + volumeStr + ", error: " + GetErrorString(result));
        }

        //The documented way to flush an entire volume
        if( !FlushFileBuffers(volume) )
        {
            DWORD result = GetLastError();
            throw gcnew Exception("Could not flush " + volumeStr + ", error: " + GetErrorString(result));
        }

        if( !CloseHandle(volume) )
        {
            DWORD result = GetLastError();
            throw gcnew Exception("Could not close " + volumeStr + ", error: " + GetErrorString(result));
        }
    }


    Running this code snippet before checking the changed lists resulted in 100% accuracy in my tests.  What appears to be going on is that the I/O Manager (internal to the kernel) is queuing up disk-write requests in an internal buffer, and the actual changes are not physically committed until some condition is met - I believe this is the "write-behind caching" feature.  The problem appears to be that the user-space callback via FileSystemWatcher/ReadDirectoryChangesW does not occur when disk-write requests are inserted into the queue, but rather when they leave the queue and are physically committed to disk.  From what I can infer through observation, the lifetime of an item in the queue seems to depend on (1) whether more writes are being inserted into the queue, and (2) whether another application requests a read on an item that is queued for a write.  For all intents and purposes of a user-space program, how this queue behaves can be considered non-deterministic.  The code above explicitly flushes the queue for a particular volume, forcing all the callbacks to occur - see the documentation for FlushFileBuffers and then put two and two together with the documentation on the I/O Manager to see how I came up with this line of reasoning. 
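
    In the test harness, the check then becomes, roughly (a sketch only; the drive letter, list names, and verification step are placeholders rather than the actual harness code):

    Code Snippet
    //Before reading the accumulated change lists, flush the volume so that any write
    //requests still sitting in the I/O Manager's queue generate their callbacks.
    FlushVolume(L'C');    //drive letter of the volume hosting the monitored directory

    //hypothetical verification step: compare what the watcher recorded against
    //the changes the driver program actually made
    VerifyChangeLists(changedFiles, deletedFiles, renamedFiles);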

    Of course, FlushFileBuffers could be doing something entirely different which is resulting in success in my test runs, but at least this explanation seems somewhat plausible to me.
    Monday, April 14, 2008 9:15 PM
  •  

    Hey IWasHere, you have to get a gold star for your post. Going into kernel mode to do some investigation... excellent! And finding a plausible answer as well...

     

    For my application, I still need to have a snapshot of the file system to make periodic comparisons, but having an understanding of what might be going on under the hood is really useful.

     

    Many thanks
    Ronny

    Friday, May 02, 2008 1:28 AM