Very strange phenomenon: output file filled or truncated with binary zeros. What could cause that?

  • Question

  • Very strange phenomenon: output file filled or truncated with binary zeros

    I have recently encountered a very strange phenomenon which I'm at a loss to account for.  I have had reports in the last few weeks from 4 users who say that they have lost data from the file that my application writes to.  I have been able to see two of the files.  One of them is about 34KB and viewed in a binary editor it contains nothing but binary zeros.  The other is about 40MB and if I view it in a binary editor, it is fine up to offset 02600000 (hex), but from 02600000 onwards it contains nothing but binary zeros.

    I don't know for sure that all 4 have experienced the same thing, but it sounds as if 3 of them have had very similar truncation, and the 4th has lost everything.

    The way my app saves is like this: the app writes to a temporary file.  If any error occurs, the save is aborted. If the save succeeds, the original file is deleted and the temporary file is renamed to replace it.  None of the users reported any problems when they saved.  The problem occurred the next time they tried to view their data.
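
    In outline, the sequence is equivalent to this sketch (not my actual code - the real thing is MFC-based with full error handling, and WriteAllDataTo is just a hypothetical stand-in for the writing step):

    #include <windows.h>

    BOOL WriteAllDataTo(LPCTSTR lpszPath);  // hypothetical: writes the whole document

    // Sketch of the save sequence described above (illustrative only).
    BOOL SaveSafely(LPCTSTR lpszTarget, LPCTSTR lpszTemp)
    {
        if (!WriteAllDataTo(lpszTemp))      // any error aborts the save...
        {
            DeleteFile(lpszTemp);           // ...leaving the original untouched
            return FALSE;
        }
        if (!DeleteFile(lpszTarget))        // save succeeded: delete the original
            return FALSE;
        return MoveFile(lpszTemp, lpszTarget);  // rename the temp to replace it
    }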

    I estimate the chance that this was caused by a bug in my code as approaching zero.  I am a very experienced coder and I have checked it meticulously and can find no way that my code could have done this.  The code uses CFile::Write to write to the file.  The truncated file actually split a keyword that was being written out.  Low-level buffering code might conceivably have done that, and my code does do its own buffering, but this part of the code has not been changed in over 10 years, and has behaved faultlessly throughout that time.  If an error did occur, I can think of no way that it would result in binary zeros being output.  Also, why do zeros start at hex 02600000?  Surely it can't be a coincidence that it starts at a round number that ends in 5 zeros?  But this is a round number only in hex terms, and my code doesn't use hex for buffer sizes (or any sizes). All sizes are specified in decimal values.

    So I ask myself - why, after years of working fine, have I had 4 reports like this in a window of less than 3 weeks?  Why binary zeros?  And in the truncation case, why did the zeros start on a (hex) round number?

    I have not been able to confirm that they're all using the same version of my software, but it looks like it.  That version has been around for just under a year.  Why problems now?  I only know the Windows versions for two of them: one had Windows 7 and another had Windows 10.  One user said they had recently run a defrag on their hard drive.  Could that be relevant?  An attempt to defrag an SSD maybe? (I don't know if they have SSDs.)  A virus?  A weird new fault in a service patch, miraculously affecting both Windows 7 and Windows 10?

    Any ideas anyone?


    Simon

    Thursday, February 11, 2016 1:33 PM

All replies

  • My first advice would be to set aside any conviction you might have that it couldn't
    possibly be a bug in your code. I've heard that too many times over the years (decades)
    to accept it at face value. The problem with adopting that stance is that it potentially
    blinds you to what is actually going on in your code. You "see" what you want to see or
    expect to see, not what's actually there.

    As to what can cause a file to have binary zeros at the end, my first guess would be
    a premature end to reading of the input file. If the program doesn't check the state
    of the input file after each read, but just assumes validity and controls the writing
    based on the size of the input file - well, let's see what happens in a simulation:

    #include "stdafx.h"
    #include <iostream>
    #include <fstream>
    using namespace std;
    
    void MyExit() {system("pause");}
    
    int _tmain(int argc, _TCHAR* argv[])
    {
        atexit(MyExit);
        ifstream ifs("readme.txt", ios::binary);
        if(!ifs)
            {
            cout << "Input file ope failed.\n";
            return -1;
            }
    
        ofstream ofs("readme.out", ios::binary);
        if(!ofs)
            {
            cout << "Output file ope failed.\n";
            return -1;
            }
    
        ifs.seekg(0, ios::end);
        long flen = ifs.tellg();
        ifs.seekg(0, ios::beg);
    
        char ch;
        int reccnt = 0;
        //while(reccnt != flen) // A
        //    {                 // A
        //    ifs.get(ch);      // A
        while(!ifs.get(ch).eof() && reccnt != flen) // B
            {                                       // B
            ++reccnt;
            ofs.put(ch);
            ch = 0x00;
            //if(reccnt == flen / 2) ifs.setstate(ios::failbit); // break it
            }
    
        ofs.close();
        ifs.close();
        return 0;
    }
    

    Whether A or B is used to control the I/O, the results are the same. While the failbit is
    not set for the input file, the output file has the same size and contents as the input
    file - a "perfect" copy. But if something causes get() to fail, all bets are off. This is
    simulated by intentionally setting the fail state halfway through the copy. We then get
    an output file which is the same size as the input file, but contains a bunch of
    binary zeros at the end.

    Moral: Make no assumptions.

    In standard C and C++ file handling routines, if an input file is opened in "text" (translate)
    mode then any embedded 0x1A characters will be treated as an EOF character. Thus it's
    important to open in binary mode when dealing with files which might contain such a character.
    I don't know if CFile emulates this behaviour or not, or whether it just handles newlines
    when in "text" mode and ignores 0x1A.

    Another issue that can cause file routines to misbehave is corruption of the file objects
    by overwrites caused by bad pointers or buffer overruns. Such a bug is not directly related
    to the file handling code in the program. It's caused by a bug elsewhere in the code. Note
    as well that such a bug could have been there for years without manifesting itself by
    affecting the file functions. But changes elsewhere in the code can result in shifting or
    rearranging the memory layout so that the bug - for example a buffer overrun - is now
    breaking something different than before.
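
    For instance (a deliberately contrived sketch - real overruns are rarely this obvious,
    and what actually breaks depends entirely on the memory layout):

    #include <cstring>
    
    // Contrived sketch: an overrun of one field silently tramples an adjacent
    // file-related pointer instead of crashing. Any unrelated edit to the
    // program can rearrange the layout and change what the overrun destroys.
    struct AppState
    {
        char           szTitle[8];   // some unrelated buffer...
        unsigned char* pIoBuffer;    // ...next to a pointer the file code uses
    };
    
    void Corrupt(AppState& state)
    {
        // Writes 16 bytes into an 8-byte field: undefined behaviour that may
        // quietly zero pIoBuffer, redirecting every subsequent buffered write.
        memset(state.szTitle, 0, 16);
    }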

    - Wayne

    Thursday, February 11, 2016 6:23 PM
  • Hi Wayne - thanks for the ideas and for taking the trouble to respond so thoughtfully.  Much appreciated.  If I were in your position, I would probably think it's most likely to be a coding problem too - mainly because it's hard to think of any other plausible explanation.  But it doesn't look credible from where I'm sitting.  I didn't start with any conviction that it couldn't possibly be my code.  Actually it was almost the opposite: I pretty much assumed it must be my code, because I couldn't think of a plausible alternative.  So I looked long and hard at the code to think how it might have happened (e.g. as a result of bugs elsewhere in the code, bad pointers, buffer overruns, etc).  And although I could imagine any number of ways in which there could be a bug like that, I totally failed to come up with any way that it would give this result.

    I liked your input file idea, incidentally.  But it doesn't work for me because there is no input file.  My program is writing out data stored in memory.  There is no file read, nor anything that looks at all like the equivalent of one.  Somehow, if you're right, a huge number of zero-byte writes has to happen, without my program ever crashing or having any reason to think that anything has failed.  It's very tightly written code, with error-checking everywhere.  It is much-used and has no bugs at all that I'm aware of.  There are no previously unexplained errors or strange symptoms.  Nothing.

    Even if the error detection failed, where did all the zero bytes come from?  Bear in mind that in each case the program never crashed or appeared to misbehave in any way - until the next time the users came to load the data into memory again.  The lack of any other symptoms suggests that the problem may have happened after the save, and not during it.

    Incidentally, even your input file idea, nice though it is, wouldn't account for the fact that the zeros start at a hex number ending in five zeros.  That's a lot of zeros.  Seems like it must be a clue.  It also doesn't account for the fact that error reports suddenly all arrived at about the same time, after a very long period of no problems at all.


    Simon

    Thursday, February 11, 2016 7:54 PM
  • One further thought: 100000 as a hex number is 1,048,576 decimal.  This is a quote from Wikipedia on the megabyte page: "A common usage has been to designate one megabyte as 1,048,576 bytes (2 to the 20th power bytes), a measurement that conveniently expresses the binary multiples inherent in digital computer memory architectures."  It's at an exact multiple of that number that the problem arises in the one example of the truncation that I have seen: 0x02600000 = 39,845,888 = 38 x 1,048,576, so the zeros start at precisely 38 megabytes in the binary sense.  What could possibly cause that if it's not a coincidence?  The only thing I could think of was a serious error in a defrag process.

    Simon

    Thursday, February 11, 2016 8:17 PM
  • Have the clients reported data corruption in files other than yours?
    Thursday, February 11, 2016 8:26 PM
  • They haven't mentioned that, no.  But I haven't had that much information from any of them.  It took me a while to decide that there was (probably) a pattern to this.  I've since sent more emails to them, asking more questions, but haven't had anything useful yet.

    Simon

    Thursday, February 11, 2016 9:35 PM
  • >the zeros start at a hex number ending in five zeros.
    >That's a lot of zeros.  Seems like it must be a clue.

    In my experience, a single sample does not a pattern make.

    Are these corrupt files on computers at different sites? Or the same? Different customers
    at different locations aren't likely to have done the same thing resulting in the same
    kind of corruption at or about the same time - such as a defrag that misfired. Or a power
    spike. Or a malware infection. etc. If the only thing the four clients have in common is
    that they are all using your program, that seems to focus the likelihood on it as being
    the source. On the other hand, if they are all at the same physical location - or connected
    physically in some way such as via a network - that opens up other possibilities.

    - Wayne
    Thursday, February 11, 2016 9:39 PM
  • The pattern wasn't the hex number.  The pattern was 3 clients reporting symptoms that strongly suggested truncation, and a 4th with a file that was effectively empty (except for binary zeros), which could be seen as a limiting case of truncation - all out of the blue and very close together in time.  Of the two files that I saw, both were filled with binary zeros where there should have been data.  In one, the zeros filled the entire file.  That's already very odd.  The fact that the binary zeros started at byte 0 and byte 0x02600000 respectively can't be called a pattern.  But it still looks very interesting, and seems most likely to be a clue of some kind.

    It may help to appreciate that the write my app does is a very simple one.  It just writes every byte of the file (a new file) from start to end.  The first couple of dozen bytes of the file are fixed and always the same.  There just isn't that much to go wrong.  And yet even these bytes disappear completely in the empty file.  How?  It's easy to say that there must be something I've missed.  But I have been coding for a long time.  I'm very experienced.  And it just isn't that complicated.  It's not the code.  There's something else going on here.

    Incidentally Wayne, your most recent post makes a bad probability error.  The customers are all entirely unrelated to one another.  You seem to be saying that this means that the chances of the 4 of them having exactly the same underlying problem (unless it's my code) are remote.  If I had picked them at random out of a large domain, what you say might be true.  But I didn't.  They picked themselves (out of a large domain) by reporting a very similar problem to me within a very tight time window.  Consequently, whatever the cause is (defragging, virus, my app - whatever), the chance that it is common to all of them is actually pretty high.  See Gerd Gigerenzer's 'Reckoning with Risk' if you aren't convinced.


    Simon

    Friday, February 12, 2016 8:08 AM
  • I've had another report of what appears to be the same problem.  A user has sent us a file which has somehow become filled with binary zeros - nothing but binary zeros.  In this case, though, the user isn't even using the latest version of our software.  They are using a version that is about 4 years old.  This is the very first report we have had of this problem with that version of the software.

    This person is running Windows 8.1 (of the other people who reported this problem, one we know was  running Windows 7 and one was running Windows 10).

    This person said the problem arose when they closed their laptop lid - while the app was still running, I think.  I presume this will have caused hibernation to occur.  Could that be the problem?  Could it be related to hibernation?  Any thoughts anyone?  Malware?
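
    If an interrupted write-back were the culprit, I suppose one defensive measure would be to force the data onto the disk before closing the file - something like this sketch (an assumption on my part, not code we currently have; FlushFileBuffers is the Win32 call that asks the OS to commit its cached writes for a handle):

    #include <afx.h>   // CFile (pulls in the Win32 headers)
    
    // Sketch: push a just-written file out of the OS write cache before the
    // handle is closed, so that sleep/hibernation immediately afterwards
    // cannot leave allocated-but-unwritten (zero) blocks behind.
    void FlushToDisk(CFile& file)
    {
        file.Flush();                              // flush any library-level buffer
        ::FlushFileBuffers((HANDLE)file.m_hFile);  // commit the OS cache to disk
    }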


    Simon

    Wednesday, March 2, 2016 12:09 PM
  • >I've had another report of what appears to be the same problem.  A user has sent us a file which has somehow become filled with binary zeros

    Is there any common factor with the customers reporting this problem -
    such as they use the same popular Anti-Virus product perhaps?

    Dave

    Wednesday, March 2, 2016 12:25 PM
  • We've been looking for a common factor, but we haven't found one yet.  Our information is pretty incomplete, it has to be said.

    Simon

    Wednesday, March 2, 2016 4:03 PM
  • The person says that they do not use an Anti-Virus product because their PC is not connected to the internet!  They say they contact me via email from a Mac.

    I thought it might help focus minds if I included a little of my code.  This is how writing occurs (this is slightly simplified, but really only very slightly - there's hardly anything to simplify):

    #define IO_BUFF_SIZE 25000
    
    static CFile *g_pFile  = NULL;
    static BYTE  *g_lpBuff = NULL;  // Pointer to buffer
    static int    g_iPos   = -1;    // Current pos in buffer
    
    static BOOL ioFlushBuffer(void);
    
    void IoOpenFileForWrite(LPCTSTR lpFileName)
    {
        g_iPos = 0;
        g_pFile = new CFile();
        CFileException fileException;
        g_pFile->Open(lpFileName, CFile::modeCreate | CFile::modeWrite | CFile::shareExclusive, &fileException);
        g_lpBuff = (BYTE *)malloc(IO_BUFF_SIZE);
    }
    
    void IoWriteLineW(const WCHAR *lpString, int iStrLen, BOOL bAddLineTerminator)
    {
        ASSERT(g_pFile && g_lpBuff);
        ASSERT(lpString && iStrLen >= 0 && iStrLen < IO_BUFF_SIZE);
        ASSERT(iStrLen == lstrlen(lpString));
    
        // Copy the string into the buffer as UTF-16, flushing whenever full.
        int iPos = 0;
        WCHAR ch;
        while (iPos < iStrLen && (ch = *(lpString+iPos)))
        {
            if (g_iPos >= IO_BUFF_SIZE)
                ioFlushBuffer();
            ++iPos;
            *((WCHAR *)(g_lpBuff+g_iPos)) = ch;
            g_iPos += 2;
        }
    
        if (bAddLineTerminator)
        {
            if (g_iPos >= IO_BUFF_SIZE)
                ioFlushBuffer();
            *((WCHAR *)(g_lpBuff+g_iPos)) = '\r';
            g_iPos += 2;
            if (g_iPos >= IO_BUFF_SIZE)
                ioFlushBuffer();
            *((WCHAR *)(g_lpBuff+g_iPos)) = '\n';
            g_iPos += 2;
        }
    
        if (g_iPos >= IO_BUFF_SIZE)
            ioFlushBuffer();
    }
    
    static BOOL ioFlushBuffer(void)
    {
        g_pFile->Write(g_lpBuff, g_iPos);  // throws CFileException on failure
        g_iPos = 0;
        return TRUE;
    }

    BOOL IoCloseFile(bool bFlushBuffer /* = true */)
    {
        BOOL bOK = TRUE;
    
        if (g_pFile == NULL)
            return TRUE;
    
        ASSERT(g_pFile && g_lpBuff);
    
        if (g_iPos && bFlushBuffer)
            bOK = ioFlushBuffer();
    
        try
        {
            g_pFile->Close();
            delete g_pFile;
            free(g_lpBuff);
        }
        catch(...)
        {
            ASSERT(FALSE);
            bOK = FALSE;
        }
    
        g_pFile = NULL;
        g_lpBuff = NULL;
    
        return bOK;
    }


    All calls to all of these functions are enclosed within a try/catch block, with appropriate error-handling.  Yes, g_lpBuff could be overwritten by a wild write somewhere in my code.  So it could be pointing to hyperspace if that happened.  But (a) would you expect it to be pointing to an area filled with zeros if that happened?  Seems unlikely.  (b) And wouldn't it crash when the buffer was freed?  I've had no reports of crashes by any of the people reporting these problems, or anyone else.  Not for years.

    None of the ASSERTs have ever been activated in any of our testing.  Obviously that's debug only though.

    The initial few bytes that get written to each file are constant data.  But in 2 out of the 3 cases where I've seen the files, everything is getting zeroed.  Just a reminder: I have 2 files which are 100% zero bytes, and one file which is 100% zeros from hex 02600000 onwards.

    I'm really struggling even to come up with plausible candidates for what could be going wrong.  I can't help thinking that that hex 02600000 address has got to be a clue.  It bears no relation to my own 25000 (decimal) buffer size.  Right now, my best hunch is that some low-level code that CFile sits on top of is doing this (some MFC DLL, or a DLL that it in turn sits on top of?).  But why now?  My code (both versions) has been working well for years (big pool of users).  Why all these versions of Windows?  And why isn't it a well-known thing?  None of the options seem remotely plausible.  But something must explain it.

    I'm using Visual Studio 2010 with MFC incidentally. Mine is not a .net app.  Any of the great brains at Microsoft fancy a challenge?  Someone must have an idea of something that could explain that hex 02600000 truncation start.


    Simon

    Thursday, March 3, 2016 10:54 AM
  • The person says that they do not use an Anti-Virus product because their PC is not connected to the internet!

    By that, do they mean that they've not installed any 3rd-party AV?
    Have they disabled any built-in AV (Defender, or whatever it's called
    today)?

    I thought it might help focus minds if I included a little of my code.

    Nothing immediately leaps out at me that would explain your problem as
    you've not shown how you call those methods, but generally it makes my
    hair stand on end - globals, malloc/free, your own caching :(

    Have you run comprehensive tests on debug builds with all the compiler
    run-time checking options enabled?

    Dave

    Thursday, March 3, 2016 2:51 PM
  • I doubt if they've disabled Windows Defender, so they probably are using that, whether they're aware of it or not.

    The code is ancient, originally ported from C I think.  It has probably barely been touched in more than 15 years.  And until a couple of weeks ago, we never needed to look at it.

    You asked about calls to these functions.  There are a grand total of 4 calls to the 'IoWriteLineW' function in the entire program, and they are all from one function (this old code is a very old, low-level layer which has a more recent layer on top of it), and are all in this form:

    IoWriteLineW(sLine, sLine.GetLength(), TRUE);

    where sLine is a parameter (CString &sLine).  The last parameter is FALSE in one of the calls.

    That's it.  (OK, to be absolutely honest, one of the CStrings is a const CString *, and yes, the code does check that the pointer is non-NULL before dereferencing it.)

    Of course we do loads of debug checking.  We use ASSERTs everywhere.  We normally use 'Default' for Basic Runtime Checks, but switching that to 'Both' and enabling exception-breaking for runtime exceptions, produces nothing in tests.  Switching 'Smaller Type Check' to 'Yes' caused loads of complaints about GDI header files, so I gave up in the end and switched it back to 'No'.  Any other checks you think we should do?

    But in any case, this code has been running absolutely fine for a long, long time.  Suddenly, we get these weird symptoms.  Why now?  What kind of bug in the code could explain all these zero bytes?   What could explain the zero bytes that start at hex 02600000?


    Simon

    Thursday, March 3, 2016 4:08 PM
  • I doubt if they've disabled Windows Defender, so they probably are using that, whether they're aware of it or not.

    I'd suggest they disable it - and see if the issue ever occurs again.

    But in any case, this code has been running absolutely fine for a long, long time.  Suddenly, we get these weird symptoms.  Why now?  What kind of bug in the code could explain all these zero bytes?   What could explain the zero bytes that start at hex 02600000?

    You have to suspect something environmental, which is why I suggested
    AV. In my experience they're almost always the issue in these
    unexplainable file handling problems.

    Dave

    Thursday, March 3, 2016 4:31 PM
  • It sounds like a line of enquiry that's worth pursuing, I agree.  I'll try to find out what AV they all use.


    Simon

    Thursday, March 3, 2016 5:03 PM
  • Hello from the distant future!

    I wanted to chime in here and say that users of our apps have, over the years, occasionally sent us "corrupted" save files that turned out to be filled with zeros. After each of these reports I spent time studying and bulletproofing the code, but even after all this, we still occasionally get these reports.

    In my code, I go so far as to save the file to a temporary location first, and then read it back in to ensure that it contains data, before copying it over the existing save file. Because of this, the only spot in my code that could be causing the zeroes is these lines (where the fully verified temp file is copied over the user's old save file):

    if (CopyFile(tempFile, lpszPathName, FALSE) != FALSE)
    {
        DeleteFile(tempFile);
        // report success in errorlog
    }
    else
    {
        // report error
    }

    So unless CopyFile (a Win32 function) is failing, I don't see how my code could be responsible for the zeros.
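
    For completeness, the read-back verification ahead of that copy is essentially the following (a sketch with illustrative names, not my exact code; assumes the usual MFC includes):

    // Sketch: re-open the temp file and confirm it is non-empty and begins
    // with the expected header bytes before letting it replace the user's
    // existing save file. (Names and details are illustrative only.)
    bool VerifyTempFile(LPCTSTR tempFile, const BYTE* expectedHeader, UINT headerLen)
    {
        BYTE buf[64];
        ASSERT(headerLen <= sizeof(buf));
    
        CFile file;
        if (!file.Open(tempFile, CFile::modeRead | CFile::shareDenyWrite))
            return false;
    
        UINT nRead = file.Read(buf, headerLen);
        file.Close();
    
        return nRead == headerLen &&
               memcmp(buf, expectedHeader, headerLen) == 0;
    }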

    Like you, I've also suspected overzealous virus scanners, as an effective quarantine system would presumably zero out the original file while keeping it the same size. However, I've never been able to pin anything down. Did you ever have any more progress on this?

    Tuesday, October 22, 2019 7:15 PM
  • Hi kmeboe

    No I haven't made any more progress on this.  In fact, I haven't had any more reports of it happening either.  We had a sudden flurry of them, and then nothing since.  I didn't change the code (wouldn't have known what changes to make).  It just stopped happening.  It's a mystery.


    Simon

    Monday, November 18, 2019 10:18 AM
  • Is your temporary location the common %TEMP% directory (the user's AppData\Local\Temp)?

    How do you generate the filename? Can it be something else conflicting with your temp file names (overwriting)?

    Do you create these files in sharing mode?
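
    (For reference, a collision-proof way to name temp files is GetTempFileName with uUnique = 0, which generates a unique name and creates the empty file in one step - a minimal sketch, assuming a plain Win32 build:)

    #include <windows.h>
    
    // Sketch: obtain a guaranteed-unique temp file in the user's %TEMP%
    // directory. GetTempFileName with uUnique = 0 creates the file itself,
    // so no other process can race for the same name.
    BOOL MakeUniqueTempFile(TCHAR (&szTempName)[MAX_PATH])
    {
        TCHAR szTempPath[MAX_PATH];
        if (!GetTempPath(MAX_PATH, szTempPath))
            return FALSE;
        return GetTempFileName(szTempPath, TEXT("sav"), 0, szTempName) != 0;
    }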

    -- pa

    Tuesday, November 19, 2019 2:07 AM