Developing a tool to recognise MS Office file types ( .doc, .xls, .mdb, .ppt )

    Question

  • Dear Sir/Madam,

    I need to develop a tool that can identify the correct Microsoft Office file type (.doc, .xls, .mdb, .ppt). The tool will take an MS Office file (.doc, .xls, .mdb, .ppt) as input and return the correct file type.
    The identification should be based on the contents, not on the extension, which can be renamed. How can I solve this problem? Can the binary format of MS Office files be helpful in this regard? Is there any other solution?

    In addition, OpenOffice opens the correct application for each of these MS Office files, which means OpenOffice has some way to differentiate between MS Office file types. Is there an OpenOffice library that could help me? Is there anything else you would suggest?

    Thanks. :0)
    Saleem
    Thursday, June 25, 2009 9:11 AM

Answers

  • Hello Saleem,

     

    The Binary File Specifications may help you reach your goal, but the coding effort would be non-trivial.  The binary specifications for DOC, XLS, and PPT are available at http://msdn.microsoft.com/en-us/library/cc313105.aspx (there is no binary specification for MDB).

     

    A simpler, but less reliable, way to accomplish your goal of determining the file type based on the contents is to examine the container specification, which is the Compound File Binary (CFB) specification, found here, http://msdn.microsoft.com/en-us/library/cc546605.aspx

     

    One possible algorithm is to search beyond the CFB header for byte patterns that are consistent with a particular type of document.  The CFB header always begins with the byte pattern “D0 CF 11 E0 A1 B1 1A E1”.  First verify that you have a CFB file (it starts with that header pattern), then search for a byte pattern that is specific (but not unique) to a type of file.  For example, the byte sequence “57 6F 72 64 2E 44 6F 63 75 6D 65 6E 74 2E”, which is the ASCII encoding of “Word.Document.”, indicates that the file is probably a DOC file.  Likewise, the byte sequence “4D 69 63 72 6F 73 6F 66 74 20 45 78 63 65 6C 00”, which is “Microsoft Excel” followed by a null byte, indicates that the file is probably an XLS file.

    However, DOC, XLS, and PPT documents may contain any of the other document types embedded in them.  For example, a DOC file may contain an embedded XLS and/or PPT document.  In that case you would still find the byte sequences above, so a match is not 100% reliable evidence that the file itself is a DOC (or other) file.  Another problem is that the contents of a CFB file can become fragmented, so to improve your results you would likely want to parse the file according to the CFB specification (following the FAT/mini FAT chains) rather than simply reading it byte by byte.
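    The header check and pattern search described above could be sketched roughly as follows in Python.  This is only a minimal illustration: the function name `guess_by_pattern` and the `PATTERNS` table are made up for this example, and the function returns every matching type because embedded documents can produce more than one hit.

```python
# Signature that begins every CFB header.
CFB_SIGNATURE = bytes.fromhex("D0 CF 11 E0 A1 B1 1A E1")

# Byte patterns that suggest (but do not prove) a document type.
PATTERNS = {
    "doc": bytes.fromhex("57 6F 72 64 2E 44 6F 63 75 6D 65 6E 74 2E"),        # "Word.Document."
    "xls": bytes.fromhex("4D 69 63 72 6F 73 6F 66 74 20 45 78 63 65 6C 00"),  # "Microsoft Excel" + NUL
}


def guess_by_pattern(data):
    """Return the candidate types whose pattern occurs after the signature.

    An empty list means the data is not CFB or no pattern was found.
    Several entries mean the result is ambiguous (e.g. embedded documents).
    """
    if not data.startswith(CFB_SIGNATURE):
        return []
    return [
        kind
        for kind, pattern in PATTERNS.items()
        if data.find(pattern, len(CFB_SIGNATURE)) != -1
    ]
```

    To use it, read the whole file in binary mode (`open(path, "rb").read()`) and pass the bytes in.  A result such as `["doc", "xls"]` is exactly the ambiguous case described above, where one document type is embedded in another.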

     

    A slightly better algorithm is to use the extension in combination with the previous algorithm to obtain a greater degree of confidence concerning the file type.
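    One way to combine the extension with a content-based guess is sketched below.  The function name and the "high"/"low"/"none" confidence labels are hypothetical, chosen only for illustration; `candidates` is assumed to be the list of types suggested by a content scan such as the byte-pattern search.

```python
import os


def guess_with_extension(path, candidates):
    """Combine the file extension with content-based candidates.

    Returns a (type, confidence) tuple. The confidence labels are
    illustrative only, not part of any specification.
    """
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    if ext in candidates:
        # Extension and content agree.
        return ext, "high"
    if candidates:
        # Content suggests a type, but the extension disagrees or is missing.
        return candidates[0], "low"
    return None, "none"
```

    A mismatch between extension and content is not necessarily an error, but it is a signal that the file was renamed or that the pattern came from an embedded document.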

     

    The most reliable way is, of course, also the most challenging: after verifying that the CFB header exists, attempt to parse the file (perhaps just the top level of structures) as the type implied by its extension.  This is where you would need to refer to the Binary Specifications for each document type (the first link above) and parse the structures accordingly.  Most applications map the extension to a document type and then attempt to parse the file as that type (errors would imply an incorrect file type or a corrupt document).
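    As a rough sketch of parsing only the top level, the Python below reads the CFB header, builds the FAT, follows the directory chain, and lists the entries directly under the root storage.  Several things here are assumptions that do not come from this thread: the marker stream names in `MARKER_STREAMS` are taken from the per-format specifications, the function names are made up, and DIFAT sectors beyond the 109 entries in the header are ignored, so very large files are not handled.

```python
import struct

CFB_SIGNATURE = bytes.fromhex("D0 CF 11 E0 A1 B1 1A E1")
NOSTREAM = 0xFFFFFFFF            # "no sibling/child" marker in directory entries
MAX_REGULAR_SECTOR = 0xFFFFFFFA  # larger values are special markers

# Assumed marker streams (per-format specs, not stated in this thread).
MARKER_STREAMS = {
    "WordDocument": "doc",
    "Workbook": "xls",
    "Book": "xls",  # older Excel versions
    "PowerPoint Document": "ppt",
}


def top_level_names(data):
    """Return the names of the entries directly under the root storage."""
    if not data.startswith(CFB_SIGNATURE):
        raise ValueError("not a CFB file")
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    sector_size = 1 << sector_shift
    num_fat_sectors, first_dir_sector = struct.unpack_from("<II", data, 44)

    def sector(n):
        # Sector 0 starts right after the header-sized first block.
        start = (n + 1) * sector_size
        return data[start:start + sector_size]

    # Build the FAT from the DIFAT entries stored in the header.
    fat = []
    for sect in struct.unpack_from("<109I", data, 76)[:num_fat_sectors]:
        if sect > MAX_REGULAR_SECTOR:
            break
        fat.extend(struct.unpack("<%dI" % (sector_size // 4), sector(sect)))

    # Follow the FAT chain of the directory stream.
    directory = b""
    sect, seen = first_dir_sector, set()
    while sect < len(fat) and sect not in seen:
        seen.add(sect)
        directory += sector(sect)
        sect = fat[sect]

    def entry(i):
        off = i * 128
        (name_len,) = struct.unpack_from("<H", directory, off + 64)
        name = directory[off:off + max(name_len - 2, 0)].decode("utf-16-le")
        left, right, child = struct.unpack_from("<III", directory, off + 68)
        return name, left, right, child

    count = len(directory) // 128
    if count == 0:
        return []
    # Walk the sibling tree that hangs off the root entry's child pointer.
    names, stack, visited = [], [entry(0)[3]], set()
    while stack:
        i = stack.pop()
        if i == NOSTREAM or i >= count or i in visited:
            continue
        visited.add(i)
        name, left, right, _ = entry(i)
        names.append(name)
        stack.extend((left, right))
    return sorted(names)


def classify_cfb(data):
    """Return 'doc', 'xls', 'ppt', or None based on top-level entry names."""
    for name in top_level_names(data):
        if name in MARKER_STREAMS:
            return MARKER_STREAMS[name]
    return None
```

    Because only the children of the root storage are inspected, a stream that belongs to an embedded document in a sub-storage is not mistaken for the main document.  Truncated or corrupt input may raise `struct.error`, which a real tool would want to catch.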

     

    I hope this helps.


    Regards,
    Mark Miller
    Escalation Engineer
    US-CSS DSC PROTOCOL TEAM

    Tuesday, June 30, 2009 7:54 PM
  • Hi saleem,

    I apologize for the long delay; I was away for a while.  I hope you have resolved your follow-up question, as unfortunately I am not aware of a unique string for XLA or PPT add-in files.  One approach may be to create a blank document and then an add-in document of the type you are searching for, and compare the two with a hex editor and/or WinDiff.
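    If a hex editor or WinDiff is not convenient, the same comparison could be approximated with a short Python script.  This is only a sketch: the function names are made up, and it simply lists printable ASCII and UTF-16LE strings that occur in the add-in file but not in the blank file.  Any string it reports is a candidate to investigate, not a confirmed unique marker.

```python
import re


def extract_strings(data, min_len=6):
    """Collect printable ASCII and UTF-16LE strings of at least min_len chars."""
    ascii_runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    utf16_runs = re.findall(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len, data)
    found = {run.decode("ascii") for run in ascii_runs}
    found.update(run.decode("utf-16-le") for run in utf16_runs)
    return found


def strings_only_in(candidate, baseline, min_len=6):
    """Return strings present in candidate but absent from baseline, sorted."""
    return sorted(extract_strings(candidate, min_len) - extract_strings(baseline, min_len))
```

    Both arguments are the raw bytes of the two files, for example the add-in file as `candidate` and the blank document as `baseline`.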

    Regards,
    Mark Miller
    Escalation Engineer
    US-CSS DSC PROTOCOL TEAM
    • Marked as answer by Chris Mullaney Friday, September 4, 2009 11:34 PM
    Friday, August 28, 2009 12:36 PM

All replies

  • Hi Saleem:
    I have alerted the Protocol Documentation Team to your question about MS Office file type. A member of the team will be in touch soon.
    Regards, Obaid Farooqi
    Thursday, June 25, 2009 6:01 PM
    Moderator
  • Hi obaid,

    Thanks for your help. I would appreciate it if the team member could contact me at the following email address: imagine2060@gmail.com.

    Hope to get a reply. :-)

    Regards,
    Saleem


    Sunday, June 28, 2009 1:07 PM
  • Hello Mark,

    Thanks for the detailed explanation. I'm in the process of developing the tool using strings (magic numbers). I have a few questions and would like to hear your thoughts on them.

    I've finished with MS Word and am now trying to find unique strings within Excel files (xls, xlt, xla). I couldn't find any string that is unique to xla (Excel add-in) files. Is there anything unique in xla files that can be used to identify them? :) In addition, do you know of any string that is unique to ppa (PowerPoint add-in) files?

    After completing the identification of the earlier Office files, I will move on to identifying Office 2007 files.

    Looking forward to hearing from you. :)

    Regards,
    saleem

    Wednesday, July 22, 2009 1:25 PM
  • Hello Saleem,

    I'm glad the information I provided was helpful.  I will get back to you soon on your follow up question.

    Regards,
    Mark Miller
    Escalation Engineer
    US-CSS DSC PROTOCOL TEAM
    Thursday, July 30, 2009 3:55 PM
  • This link should help you a lot:

    http://stackoverflow.com/a/35625330/964053

    You can also look here if you are interested in other solutions:

    http://stackoverflow.com/questions/2897328/is-there-any-library-to-access-ole-structured-storage-from-c

    Cheers! ;)


    • Edited by Kostas0 Thursday, February 25, 2016 4:48 PM
    Thursday, February 25, 2016 4:47 PM