Parsing files larger than 4gb

    Question

  • Hi all.

    Not sure if this is the best forum for this question, but hopefully someone can help...

    The scenario:
    I have data files (for visualization purposes) that are far larger than 4gb. I need to parse the data in these files during the init. of an application. There is no need to have access to all the data at once in memory, rather I would like to read in a segment of a file, parse it and then read the next segment, until the EOF and all data has been parsed (= model created).

    The platform:
    Development is done using VS2008, VC++, no-mfc if possible. The target platform is 32/64Bit Windows XP/Vista.

    The solution (first attempt):
    I was hoping that using memory mapped files to access segments of the file on disk would work, since theoretically they allow sequential access to even very large files without the need to hog system memory. I built a prototype using a combination of CreateFile, CreateFileMapping and MapViewOfFile.

    This works for files < 4Gb, but right now I am struggling to find a way to create a FileMapping that starts at (or for that matter is larger than) a position beyond the first 4 Gb.

    I was thinking I could do something like the following:
        1. CreateFile to get a handle to the whole file (Seems to work for any size that the OS can handle).
        2. CreateFileMapping of a segment of the file.
        3. MapViewOfFile to create a view of the currently mapped segment of the file.
        4. Read/parse the data visible/accessible in the view.
        5. UnmapViewOfFile.
        6. Repeat from (2) until the entire file has been read/parsed.

    The Problem:
    The problem is that I am having trouble understanding how to create a file mapping of a portion of the file that lies beyond the first 4 gb. If anyone has experience with this and perhaps could offer some ideas or some sample code I would very much appreciate it.

    Also, is this indeed the recommended way to read/parse files larger than 4Gb on a windows box?

    Thank you very much.
    All Watched Over by Machines of Loving Grace...
    • Edited by globbe Friday, November 14, 2008 2:16 AM
    Friday, November 14, 2008 2:09 AM

Answers

  • The problem you're experiencing must be a coding error on your part, as CreateFileMapping and MapViewOfFile both work in 64-bit offsets (high and low DWORDs), so they both work natively with files larger than 4GB. Can you post some of your code?

    EDIT: To clarify, the dwMaximumSizeHigh parameter of CreateFileMapping and the dwFileOffsetHigh parameter of MapViewOfFile are what let you work beyond the 4GB mark, as they are the upper 32 bits of a 64-bit size and a 64-bit offset, respectively.
    • Edited by ildjarn Friday, November 14, 2008 2:56 AM
    • Marked as answer by globbe Monday, November 17, 2008 12:20 AM
    Friday, November 14, 2008 2:54 AM

All replies

  • Thank you very much for your fast reply. No doubt that this is an error on my part. The thing is that I am a bit lost as to how to handle the dwMaximumSizeHigh and dwFileOffsetHigh. If you could provide a sample call to CreateFileMapping that would map a file larger than 4Gb that would be very helpful. I guess I need to brush up on my Win32 coding...

    BTW: Does this mean that setting both the dwMaximumSizeHigh and dwMaximumSizeLow to 0(zero) would map the entire file, even if it is say 1-2Tb?

    Thank you!
    All Watched Over by Machines of Loving Grace...
    • Edited by globbe Friday, November 14, 2008 3:27 AM
    Friday, November 14, 2008 3:03 AM
  • For a view to be mapped successfully (MapViewOfFile), there must be a contiguous region of free space in your address space equal to the size of the view. In a 32-bit process that rules out mapping a multi-gigabyte file in one piece.

    Friday, November 14, 2008 9:18 AM
  • Again, it would be easier if you posted your code for us to correct than for us to re-explain everything that is already sufficiently explained in the documentation. This also ensures you've actually written some code and aren't just trying to get other people to write it for you.

    To clarify a few things:

    The way the high and low DWORD offsets work is to create what is effectively a 64-bit value. Take 0x1234567890ABCDEF as a 64-bit value in hex representation: 0x12345678 is the high DWORD and 0x90ABCDEF is the low DWORD. The easiest way to deal with these offsets is to use a ULARGE_INTEGER -- use QuadPart to treat it as a 64-bit number for your purposes, and use HighPart and LowPart when calling MapViewOfFile.
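
    To make the split concrete, here is a minimal portable sketch of the arithmetic. OffsetParts, splitOffset, joinOffset and demoSplit are hypothetical names used only for illustration; on Windows, ULARGE_INTEGER does the same job through its QuadPart, HighPart and LowPart members.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two DWORD arguments Win32 expects.
struct OffsetParts {
    std::uint32_t high;  // upper 32 bits (e.g. dwFileOffsetHigh)
    std::uint32_t low;   // lower 32 bits (e.g. dwFileOffsetLow)
};

// Split a 64-bit byte offset into its high and low 32-bit halves.
inline OffsetParts splitOffset(std::uint64_t offset) {
    OffsetParts parts;
    parts.high = static_cast<std::uint32_t>(offset >> 32);
    parts.low  = static_cast<std::uint32_t>(offset & 0xFFFFFFFFu);
    return parts;
}

// Recombine the halves; the inverse of splitOffset.
inline std::uint64_t joinOffset(OffsetParts parts) {
    return (static_cast<std::uint64_t>(parts.high) << 32) | parts.low;
}

// Usage example with the value quoted in the post.
inline void demoSplit() {
    const OffsetParts parts = splitOffset(0x1234567890ABCDEFULL);
    assert(parts.high == 0x12345678u);
    assert(parts.low == 0x90ABCDEFu);
    assert(joinOffset(parts) == 0x1234567890ABCDEFULL);
}
```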

    When calling CreateFileMapping for read-only purposes (unless you're passing SEC_LARGE_PAGES), you generally pass 0 for both dwMaximumSizeHigh and dwMaximumSizeLow, so the entire file is mapped regardless of size (even if it's > 4GB). Now even though the entire file is mapped, you're only going to look at one small chunk of it at a time. This is accomplished by calling MapViewOfFile, passing a 64-bit offset split across dwFileOffsetHigh and dwFileOffsetLow, and telling it how big a chunk you want to look at with dwNumberOfBytesToMap. The offset must be a multiple of the system's allocation granularity (typically 64KB). E.g., to look at the first 64KB of the file, you would pass 0 for dwFileOffsetHigh and dwFileOffsetLow, and 0x10000 for dwNumberOfBytesToMap. Or to look at bytes 2199078830080 through 2199078895615 you would pass 0x200 for dwFileOffsetHigh, 0x3500000 for dwFileOffsetLow, and 0x10000 for dwNumberOfBytesToMap.
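
    MapViewOfFile only accepts offsets that are a multiple of the allocation granularity, so a view that should contain an arbitrary byte has to start a little earlier. The sketch below works out the three arguments for that case. ViewRequest and planView are hypothetical names, and the 64KB default is an assumption; real code should take the value from GetSystemInfo (dwAllocationGranularity).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the values to hand to MapViewOfFile.
struct ViewRequest {
    std::uint32_t offsetHigh;  // pass as dwFileOffsetHigh
    std::uint32_t offsetLow;   // pass as dwFileOffsetLow
    std::uint64_t bytesToMap;  // pass as dwNumberOfBytesToMap
    std::uint64_t skip;        // bytes from the view start to the wanted byte
};

// Plan a view that covers [wantedOffset, wantedOffset + wantedBytes),
// rounded down to the granularity and clamped to the end of the file.
// The 0x10000 default is an assumption, not a value read from the system.
inline ViewRequest planView(std::uint64_t wantedOffset,
                            std::uint64_t wantedBytes,
                            std::uint64_t fileSize,
                            std::uint64_t granularity = 0x10000) {
    // Granularity must be a nonzero power of two for the mask below.
    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);
    assert(wantedOffset < fileSize);

    const std::uint64_t base = wantedOffset & ~(granularity - 1);
    const std::uint64_t skip = wantedOffset - base;

    // Never ask for bytes past the end of the file.
    const std::uint64_t available = fileSize - wantedOffset;
    if (wantedBytes > available) wantedBytes = available;

    ViewRequest request;
    request.offsetHigh = static_cast<std::uint32_t>(base >> 32);
    request.offsetLow  = static_cast<std::uint32_t>(base & 0xFFFFFFFFu);
    request.bytesToMap = skip + wantedBytes;
    request.skip       = skip;
    return request;
}
```

    With the numbers from the example above, planView(2199078830080, 0x10000, fileSize) yields 0x200, 0x3500000 and 0x10000 for any fileSize large enough to contain that range. In a 32-bit process dwNumberOfBytesToMap is a 32-bit SIZE_T, so bytesToMap has to stay well below 4GB there.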
    Friday, November 14, 2008 4:47 PM
  • Ildjarn. Again, thank you for your reply. I assure you that I am not trying to get you to write any code for me and I have read the documentation back and forth. To be honest, your first reply was all that was needed to put me in the right direction and the problem has been solved. Below is the call that I ended up with.

    I'm sorry that I could not answer you before you wrote your second reply. I am probably in a different time-zone (GMT+9) from you. However, because it was a very good and accessible explanation of the high and low DWORD offsets, I am sure this will be of value to anyone else battling these issues.

    Anyway, I can't post the entire code but here are a few snippets:

    // Open the file for reading
    _fileHandle = CreateFile(fileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);

    // Create the file mapping (entire file: both size DWORDs are 0)
    _fileMapHandle = CreateFileMapping(_fileHandle, NULL, PAGE_READONLY, 0, 0, NULL);

    // Create the view (called in a loop, reading forward only).
    // ViewSize is a DWORD, MStart is a LARGE_INTEGER.
    _pointerToData = (BYTE *)MapViewOfFile(_fileMapHandle, FILE_MAP_READ, MStart.HighPart, MStart.LowPart, ViewSize);
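
    For anyone who wants to see how snippets like these could fit into a complete loop, here is a rough sketch with error checks and cleanup. It is not the actual application code: windowSize, ParseFn and parseLargeFile are hypothetical names, the parser callback is a placeholder, and it is written for a compiler that ships <cstdint> (newer than VS2008). The Win32 part only compiles on Windows; the small helper is portable.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Size of the window that starts at `offset`: viewSize, or whatever is
// left when the end of the file is closer than that.
inline std::uint32_t windowSize(std::uint64_t fileSize, std::uint64_t offset,
                                std::uint32_t viewSize) {
    assert(viewSize != 0 && offset < fileSize);
    const std::uint64_t remaining = fileSize - offset;
    return remaining < viewSize ? static_cast<std::uint32_t>(remaining)
                                : viewSize;
}

#ifdef _WIN32
// Hypothetical parser callback: return false to stop early.
typedef bool (*ParseFn)(const BYTE* data, DWORD size, void* context);

// Maps the file once, then walks it front to back one view at a time.
// viewSize must be a multiple of the allocation granularity so that
// every view offset stays aligned.
inline bool parseLargeFile(const wchar_t* fileName, DWORD viewSize,
                           ParseFn parse, void* context) {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    if (viewSize == 0 || viewSize % si.dwAllocationGranularity != 0)
        return false;

    HANDLE file = CreateFileW(fileName, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (file == INVALID_HANDLE_VALUE) return false;

    bool ok = false;
    LARGE_INTEGER size;
    // Sizes 0,0 map the whole file; the call fails for an empty file.
    HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (mapping != NULL && GetFileSizeEx(file, &size)) {
        const std::uint64_t fileSize = static_cast<std::uint64_t>(size.QuadPart);
        ok = true;
        for (std::uint64_t offset = 0; ok && offset < fileSize;
             offset += viewSize) {
            ULARGE_INTEGER start;
            start.QuadPart = offset;
            const DWORD bytes = windowSize(fileSize, offset, viewSize);
            void* view = MapViewOfFile(mapping, FILE_MAP_READ,
                                       start.HighPart, start.LowPart, bytes);
            if (view == NULL) { ok = false; break; }
            ok = parse(static_cast<const BYTE*>(view), bytes, context);
            UnmapViewOfFile(view);  // release before mapping the next window
        }
    }
    if (mapping != NULL) CloseHandle(mapping);
    CloseHandle(file);
    return ok;
}
#endif
```

    One thing a sketch like this leaves open: a record may straddle two windows, so the parser has to carry leftover bytes from one call to the next (or the windows have to overlap).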

    Thank you.


    All Watched Over by Machines of Loving Grace...
    Monday, November 17, 2008 12:20 AM
  • If you are only reading records from a large file, ReadFileEx and WriteFileEx are probably a better choice.
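
    The general idea of reading a file piece by piece without mapping it can be sketched in portable C++ as below. Note that this swaps in std::ifstream for the Win32 calls named above, so it shows plain synchronous chunked reads rather than ReadFileEx's asynchronous completion routines. readInChunks is a hypothetical name and the chunk size is up to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <vector>

// Reads the file front to back in chunks of at most chunkSize bytes and
// hands each chunk to onChunk(const char* data, std::size_t size).
// Returns the total number of bytes delivered, or -1 if the file could
// not be opened.
template <typename OnChunk>
long long readInChunks(const char* path, std::size_t chunkSize,
                       OnChunk onChunk) {
    assert(chunkSize != 0);
    std::ifstream in(path, std::ios::binary);
    if (!in) return -1;

    std::vector<char> buffer(chunkSize);
    long long total = 0;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();  // may be short at EOF
        if (got <= 0) break;
        onChunk(buffer.data(), static_cast<std::size_t>(got));
        total += got;
    }
    return total;
}
```

    Only one buffer of chunkSize bytes is held in memory at a time, which matches the original requirement of never loading the whole file.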
    I am a professional developer and a vegan. I also am a skilled web developer. I also study economics and play chess.
    Monday, November 17, 2008 2:25 AM