document size in office binary format

  • Question

  • I want to determine the size of a .doc, .xls, or .ppt document from its binary format. Please let me know how I can locate this information, or any method to calculate it.

    Wednesday, July 17, 2013 6:59 AM

All replies

  • Hello Haider

    Thank you for contacting Microsoft Support. A support engineer will be in touch shortly to assist you further.

    Thanks.


    Tarun Chopra | Escalation Engineer | Open Specifications Support Team

    Wednesday, July 17, 2013 4:12 PM
  • Hi Haider,

    I'm not sure I completely understand what you're asking.  The containing file system will provide the properties of the file, including its size in bytes.  Can you clarify what "size" you are trying to determine?  Each format will have different meanings for "size" and will require understanding the format to determine, for example, the number of characters, words, paragraphs, slides, cells, etc.

    Best regards,
    Tom Jebo
    Escalation Engineer
    Microsoft Open Specifications

    Wednesday, July 17, 2013 7:03 PM
    Moderator
  • Hi Tom,

    Thanks for the reply.

    By size I mean the total number of bytes that the file occupies, i.e. the size of the file. To be precise, I am looking for the value that the C function fstat reports. For example, a bitmap file stores the size of the bitmap in its header, and reading that many bytes from disk yields the whole bitmap. I am writing code to recover Office files from a formatted volume. I can recognize an Office file from its signature, but I want to know how many bytes I should read to get the complete file (obviously this applies only to non-fragmented files, i.e. files stored in contiguous sectors).

    I hope this will help you to get what I am looking for.

    Thursday, July 18, 2013 5:44 AM
  • Hi Haider,

    I understand now.  These binary Office files (.doc, .xls, and .ppt) are all compound files per [MS-CFB].  This is a FAT-like structure within a file.  Unfortunately, there is no one offset containing a value that tells you how many bytes make up the compound file.  That's kind of part of the dynamic, file-system-within-a-file nature of compound files. 

    First, I must say that this is a non-supported scenario, which I'm sure you probably already know.  Having said that, based on my knowledge of compound files, I will make an attempt to describe what you would need to do. 

    Although there is no single value, there are a few values that you can use as a start and then the rest will require some algorithmic approach.  If you look at [MS-CFB] 2.1 "Compound File Sector Numbers and Types", you will see a table (the second one) that has a list of the valid sector types in a compound file.  These are what you will be essentially counting.  Several of them are already counted for you in the header:

    In [MS-CFB] 2.2 "Compound File Header", the following are counted in fields:

    1) number of Directory sectors (always zero in version 3 files, which do not support this field)

    2) number of FAT sectors

    3) number of mini FAT sectors

    4) number of DIFAT sectors

    That makes those four easy to compute.  Just take those numbers and multiply by the size of a sector (determined by inspecting the Major Version and Sector Shift values in the header: 512 bytes for version 3 with a Sector Shift of 0x0009, or 4096 bytes for version 4 with a Sector Shift of 0x000C).
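As a rough illustration of the header step described above, here is a minimal Python sketch. The field offsets are taken from the header layout in [MS-CFB] 2.2; the function names `parse_cfb_header` and `counted_sector_bytes` and the dictionary keys are hypothetical, chosen only for this example. It is a sketch under those assumptions, not a supported or complete parser.

```python
import struct

# Signature found in the first 8 bytes of every compound file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header(header: bytes) -> dict:
    """Read the sector size and the sector counts stored in a CFB header."""
    if len(header) < 512 or header[:8] != CFB_SIGNATURE:
        raise ValueError("not a compound file header")
    # 0x1A: Major Version, 0x1C: Byte Order, 0x1E: Sector Shift (2 bytes each)
    major_version, _byte_order, sector_shift = struct.unpack_from("<HHH", header, 0x1A)
    # 0x28: Number of Directory Sectors, 0x2C: Number of FAT Sectors
    num_dir, num_fat = struct.unpack_from("<II", header, 0x28)
    # 0x40: Number of Mini FAT Sectors, 0x44: First DIFAT Sector, 0x48: Number of DIFAT Sectors
    num_minifat, _first_difat, num_difat = struct.unpack_from("<III", header, 0x40)
    return {
        "major_version": major_version,
        "sector_size": 1 << sector_shift,  # 512 for shift 9, 4096 for shift 12
        "num_directory_sectors": num_dir,  # zero in version 3 files
        "num_fat_sectors": num_fat,
        "num_minifat_sectors": num_minifat,
        "num_difat_sectors": num_difat,
    }


def counted_sector_bytes(info: dict) -> int:
    """Bytes occupied by the four sector types that the header counts."""
    sectors = (
        info["num_directory_sectors"]
        + info["num_fat_sectors"]
        + info["num_minifat_sectors"]
        + info["num_difat_sectors"]
    )
    return sectors * info["sector_size"]
```

Passing the first 512 bytes of a candidate file to `parse_cfb_header` and the result to `counted_sector_bytes` gives the contribution of items 1 to 4 to the total.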

    You can also add in the size of two sectors for the:
    5) header
    6) range lock sector

    Now comes the more difficult part.  To determine the amount of user-defined data in the streams, you need to:
    a) scan each Directory sector, looking at all the directory entries (they are fixed 128-byte records)
    b) for storage and unallocated directory entries, do nothing
    c) for stream and the root storage directory entries, there is a size field ([MS-CFB] 2.6.1 Compound File Directory Entry); add the accumulated sizes to the total.  For each of these stream sizes, I would round up to the next multiple of the sector size.
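Steps a) to c) could be sketched as follows in Python. The offsets (object type at 0x42, stream size at 0x78 within each 128-byte entry) follow the directory entry layout in [MS-CFB] 2.6.1; the function names are hypothetical. One caveat that is an observation about the format rather than part of the outline: streams below the mini stream cutoff live inside the mini stream, whose size is what the root entry reports, so summing every entry this way may overestimate.

```python
import struct

ENTRY_SIZE = 128   # directory entries are fixed-size records
TYPE_STREAM = 2    # object type values from [MS-CFB] 2.6.1
TYPE_ROOT = 5


def round_up(size: int, sector_size: int) -> int:
    """Round size up to the next multiple of sector_size."""
    return (size + sector_size - 1) // sector_size * sector_size


def stream_bytes_in_directory_sector(
    directory_sector: bytes, sector_size: int = 512, major_version: int = 3
) -> int:
    """Sum the sector-aligned sizes of stream and root entries in one sector."""
    total = 0
    last_start = len(directory_sector) - ENTRY_SIZE
    for offset in range(0, last_start + 1, ENTRY_SIZE):
        object_type = directory_sector[offset + 0x42]
        if object_type not in (TYPE_STREAM, TYPE_ROOT):
            continue  # storage and unallocated entries contribute nothing
        (size,) = struct.unpack_from("<Q", directory_sector, offset + 0x78)
        if major_version == 3:
            # Only the low 32 bits of the size are meaningful in version 3.
            size &= 0xFFFFFFFF
        total += round_up(size, sector_size)
    return total
```

Calling this once per Directory sector and adding the results gives the stream-data term of the estimate.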

    Now another slightly difficult part.  To determine the total size of unallocated/free sectors, you'll have to scan all the FAT sectors, looking for the reserved sector number 0xFFFFFFFF (FREESECT).  For each of these, add the size of a sector.
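The free-sector scan could look roughly like this in Python, assuming the caller has already read one FAT sector into memory. The function name `free_sector_bytes` is hypothetical. Free entries may describe sectors that were never written to disk, so this term may overestimate the file size.

```python
import struct

FREESECT = 0xFFFFFFFF  # reserved FAT value marking an unallocated sector


def free_sector_bytes(fat_sector: bytes, sector_size: int = 512) -> int:
    """Return the bytes attributed to free sectors listed in one FAT sector."""
    count = len(fat_sector) // 4  # each FAT entry is a 32-bit little-endian value
    entries = struct.unpack_from(f"<{count}I", fat_sector, 0)
    free = sum(1 for entry in entries if entry == FREESECT)
    return free * sector_size
```

Summing the result over all FAT sectors gives the unallocated-sector term.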

    That should give you the total size of the compound file.  There should be no reason to understand the individual Office file formats (i.e. .doc, .xls, .ppt) since they are constrained within the CFB format.  I realize that this was a very terse, high-level description and, needless to say, it requires familiarity with the CFB format.

    I can give you more details as you get deeper into this, but I recommend that you read [MS-CFB] first to understand what I've laid out.

    Thanks,
    Tom


    Thursday, July 18, 2013 7:45 AM
    Moderator
  • Hello Tom,

    Thanks for the answer. My observation differs slightly from your answer. I have seen that the size matches if I do the following:

    (number of mini FAT sectors + number of FAT sectors + number of DIFAT sectors (my version is 3.xx, so I am not considering the number of directory sectors)) * 512

    + 512 for header

    + 512 for rangelock

    + size field of directory structure (aligning it to 512 boundary), if directory is root or stream

    In the above calculation the sector size is taken as 512.
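The calculation above can be written out as a short Python sketch. The function name `estimate_cfb_size` and the sample numbers in the usage note are made up for illustration; they do not come from wdf_arch.doc or any other real file.

```python
def align(size: int, boundary: int = 512) -> int:
    """Round size up to the next multiple of boundary."""
    return (size + boundary - 1) // boundary * boundary


def estimate_cfb_size(
    num_minifat: int,
    num_fat: int,
    num_difat: int,
    stream_sizes: list,
    sector_size: int = 512,
) -> int:
    """Estimate a version 3 compound file's size from its header and directory.

    stream_sizes holds the size field of every root and stream directory entry.
    """
    metadata = (num_minifat + num_fat + num_difat) * sector_size
    header = sector_size       # one sector for the header
    range_lock = sector_size   # one sector for the range lock
    data = sum(align(size, sector_size) for size in stream_sizes)
    return metadata + header + range_lock + data
```

With the hypothetical inputs `estimate_cfb_size(1, 1, 0, [1000, 5000])`, the terms are 1024 for the counted sectors, 512 for the header, 512 for the range lock, and 1024 + 5120 for the aligned streams, for a total of 8192 bytes.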

    This sum gives the size, but it does not include the size of the reserved FAT sectors that you mentioned adding in your mail. This observation makes me unsure whether I am doing the right thing. I have also done the calculation manually using the OffVis tool (Microsoft Office Visualization Tool) on the document (wdf_arch.doc) at the link http://msdn.microsoft.com/en-us/library/windows/hardware/gg463314, and that too gives me the same result.

    Could you please confirm?

    Thanks for your help.

    -Haider.

    Monday, July 22, 2013 10:55 AM
  • Hi Haider,

    >> but it does not include the size of the reserved FAT sector that you have mentioned in your mail to add.

    By this are you referring to the "...unallocated/free sectors, you'll have to scan all the FAT sectors, looking for the reserved sector number, 0xFFFFFFFF..." from my outline?  If so, then I wouldn't worry too much about that.  They may or may not be written out; however, I included them because I don't think it will hurt to overestimate the size of the file slightly (unless you run off the end of a physical disk sector or something).

    Please clarify which part of my outline you are confused by.

    Tom

    Friday, July 26, 2013 10:58 PM
    Moderator