Microsoft Extensions to TIFF?

  • Question

  • This is a wishlist for a new spec, rather than a query on the existing specs.

    Microsoft products sometimes put extra tags into TIFF documents, and of course older Microsoft Office versions produced MDI (which I think of as a TIFF variant) using the Microsoft Office Document Imaging Writer virtual printer.

    Some initial investigations indicate three new kinds of image compression:
     - MODI_BLC  34718
     - MODI_PTC  34720
     - MODI_VECTOR   34719

    MODI_VECTOR appears to basically be Enhanced Metafile.

    MODI_BLC and MODI_PTC are not understood.
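One way to spot these values is to read the Compression field (tag 259) from the first image file directory. The following is a minimal sketch for a standard TIFF header only. The helper name `first_ifd_compression` and the `MODI_COMPRESSION` table are illustrative and not part of libtiff or any other library, and MDI files may need a different header check.

```python
import struct

# Compression values reported in this thread for MODI output.
MODI_COMPRESSION = {34718: "MODI_BLC", 34719: "MODI_VECTOR", 34720: "MODI_PTC"}


def first_ifd_compression(data: bytes):
    """Return the Compression (tag 259) value of the first IFD, or None."""
    if data[:2] == b"II":
        endian = "<"  # little-endian TIFF
    elif data[:2] == b"MM":
        endian = ">"  # big-endian TIFF
    else:
        raise ValueError("no TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF header")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry)
        # Compression is a single SHORT (type 3) stored inline in the entry.
        if tag == 259 and field_type == 3 and count == 1:
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return None
```

A caller could then look the result up in `MODI_COMPRESSION` and fall back to the normal decoder path when the value is not one of the three listed above.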

    In addition to the compression methods, there are unknown tags (fields). These unknown properties appear to occur in both TIFF and MDI files.

    37679 - appears on every page, and looks like the text version of the document contents. The contents are 0x01 0x00, followed by a 4-byte length (a LONG) which is 6 bytes less than the actual length of this field (i.e. it is the remaining length), followed by the UTF-8 text version. Each phrase is delimited by a space followed by a newline (0x20 0x0a aka ' \n'). The end is 0x0d 0x00.
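The layout described for tag 37679 could be read roughly as follows. This is a sketch based only on the observations above: the function name `parse_modi_text` is hypothetical, and the byte order of the length field is assumed to be little-endian, which the observations do not confirm.

```python
import struct


def parse_modi_text(payload: bytes) -> list:
    """Split the raw bytes of tag 37679 into phrases, per the observed layout."""
    if payload[:2] != b"\x01\x00":
        raise ValueError("unexpected leading bytes")
    # Assumption: the 4-byte length is little-endian.
    (remaining,) = struct.unpack_from("<I", payload, 2)
    body = payload[6:6 + remaining]
    # Drop the observed 0x0d 0x00 terminator if it is present.
    if body.endswith(b"\r\x00"):
        body = body[:-2]
    text = body.decode("utf-8", errors="replace")
    # Phrases are delimited by a space followed by a newline.
    return [phrase for phrase in text.split(" \n") if phrase]
```

Decoding with `errors="replace"` keeps the sketch from failing on pages where the text turns out not to be clean UTF-8.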

    37680 - only appears to occur on the first page, always appears to be length 4096, always starts with 0xd0 0xcf 0x11 0xe0 0xa1 0xb1 0x1a 0xe1, then a string of zeros, and then varies. Perhaps some kind of metadata dictionary? It is located at the end of the file, and there are 16-bit wide characters that look like "Root Entry", "CONTENTS" (sometimes more than once, even if only one page), "prop2" (sometimes more than once), "prop3" (sometimes more than once), "DICT", "Summary Information", "Owner" and some names. There might be some random stuff / fill in there too. There also appears to be a consistent string, "AuvsxjatP0udlw1Aaq5eubr5h" (this might not be ASCII though - there is always a 0x05 0x00 on the front of it).
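The eight leading bytes match the header signature of an OLE2 compound file (Compound File Binary format), which would be consistent with names such as "Root Entry" and "Summary Information". A quick check, plus a scan for the 16-bit strings, might look like the sketch below. The function names are hypothetical, and the string scan is a heuristic for inspection, not a compound-file parser.

```python
import re

# Header signature of the Compound File Binary (OLE2) format.
CFB_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_compound_file(payload: bytes) -> bool:
    """True if the tag data starts with the compound-file signature."""
    return payload.startswith(CFB_SIGNATURE)


def utf16_strings(payload: bytes, min_chars: int = 4) -> list:
    """List runs of printable ASCII characters stored as UTF-16LE."""
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(payload)]
```

If the signature check passes, a proper compound-file reader is likely a better next step than extending the string scan.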

    37681 - appears on every page, always starts with 0x02 0x00 (+ 0x00, 0x00?), then varies. Possibly the thumbnail image?

    Would it be possible to get some clarification / confirmation on the compression methods and unknown tags (including any additional tags not yet found)? A spec would be ideal, but given that MDI isn't so common and the preference to move to XPS, perhaps just some notes here?


    • Changed type Steve Smegner Wednesday, November 19, 2008 4:43 AM
    Friday, October 10, 2008 9:45 AM

All replies

  • Hi Brad,

    Thanks for your post.

    We'll let you know as soon as we have news or questions.

    Regards,

    SEBASTIAN CANEVARI - MSFT SEE Protocol Documentation Team
    Friday, October 10, 2008 2:01 PM
  • Sebastian,

    Is there any news, or even a timeframe for this?

    Brad
    Thursday, October 30, 2008 3:47 PM
  • Brad,
    Could you describe your use case scenario?

    Steve Smegner

    Application Development Consulting Group

    Thursday, November 20, 2008 8:45 AM
  • Hi Steve,

    The idea is to provide better support for TIFF and MDI files produced by Microsoft Office Document Imaging on other platforms. I'm particularly interested in Okular.

    Right now, we have TIFF support, and the various pages display fine. That is done using libtiff (http://www.remotesensing.org/libtiff/). I'd like to provide the users with whatever support we can (just as for the .snp case).

    There are essentially two aspects to this:

    1. Support for the Microsoft-unique TIFF tags (37679, 37680 and 37681 are the ones I know of). I do have initial support for the text extraction part (just implemented - see http://websvn.kde.org/?view=rev&revision=886464 for the actual code changes), but not for the other two tags.

    2. Display of MDI files in the same way we currently display TIFF files. That requires knowledge about the three MDI-specific codecs (per my original request).

    The bigger concept here is that given that TIFF is an industry standard format, I'd like to see Microsoft document its extensions to that format.
    Thursday, November 20, 2008 10:24 AM
  • Greetings Brad,

    I wanted to let you know that we have not forgotten this request. Due to the holidays and the deprecated nature of the MDI formats we are still tracking down the nature of these compression tags. My sources are back from vacation and the holidays and I hope to have an update for you very soon. Thanks for your patience.

    Steve Smegner
    Application Development Consulting Group

    Friday, January 9, 2009 4:26 PM
  • Steve,

    Thanks for the continuing work on this, and for the status update.

    Much appreciated.

    Brad
    Friday, January 9, 2009 9:58 PM
  • Hi Brad,

    I am on the Open Specification Protocols Documentation team, and have taken ownership of this issue.  I have followed this through to conclusion where Steve left off with our Product Group.

    We do not have standalone documentation of the MDI file format and don’t currently have plans to create any since the format is considered obsolete and we no longer recommend using it.  You may want to review this page: http://office.microsoft.com/en-us/help/HP062193601033.aspx.  Saving files in the TIFF format would be the more portable option.

    Regards,
    Mark Miller
    Escalation Engineer
    US-CSS DSC PROTOCOL TEAM
    Thursday, February 5, 2009 3:43 PM
  • Hi Mark,

    I appreciate that MDI is obsolete, however (as you point out) TIFF is not. My original wishlist concerned both MDI and TIFF, which might have confused things. So let's exclude MDI, and only deal with TIFF files as produced by contemporary Microsoft applications.

    There are private tags (fields) in TIFF files produced by those tools, as noted in my original request:
    37679, 37680, 37681.

    Is documentation of those tags available under the Interoperability Principles? I can understand that they may not be (given that they are explicitly private tags); I'd just prefer not to have to figure them out using a binary editor...

    Brad
    Friday, February 6, 2009 12:48 AM
  • Hi Brad,

    I apologize for the delay.

     

    I can now confirm the Product Group will document the TIFF tags, and the expected timeframe for the documentation is the end of August.

     

    Regards,

    Mark Miller

    Escalation Engineer

    US-CSS DSC PROTOCOL TEAM

    Friday, February 27, 2009 2:50 PM
  • It is now the end of August.

    Please tell me where I can find this documentation.

    Thanks!

    Monday, August 31, 2009 4:52 PM
  • Hello Phil,

     

    I checked on the status of the documents with our Product Group and they are not yet ready.  I apologize for the delay.  The documentation for the TIFF tags turned out to be much more involved and complex than expected.  The Product Group informs me that the documentation should be ready by the end of the year.

     

    Having said this, if you can provide more specifics on what you are trying to accomplish or need for TIFF tag details we may be able to assist you in the interim.

     

    Regards,
    Mark Miller
    Escalation Engineer
    US-CSS DSC PROTOCOL TEAM

     

    Wednesday, September 9, 2009 2:22 PM
  • Hi Mark,

    Thanks for the offer, but what I really need is the documentation so I can add support for this TIFF information to my metadata extraction utility.  I am particularly interested in the details of tag 37680 (0x9330) if indeed this is a "metadata dictionary".

    - Phil

    Thursday, September 10, 2009 1:54 PM
  • Mark,

    Can you advise which private TIFF tags are used in Microsoft products (by number, and if possible, the name of the tag)?

    Can you confirm that 37679 (if present) is always the text version of the page content, per my original post?

    Can you advise whether 37680 is some kind of metadata dictionary? I recognise that the documentation for the tag may not yet be available.

    Can you advise whether 37681 is some kind of thumbnail? I recognise that the documentation for the tag may not yet be available.

    Brad
    Saturday, September 26, 2009 6:31 AM
  • Hi Brad,

    I'll research this and respond asap.

    Regards,
    Mark Miller
    Escalation Engineer
    US-CSS DSC PROTOCOL TEAM
    Saturday, September 26, 2009 2:47 PM
  • Hi Brad,

    The Product Group is addressing your request for details of these TIFF tags and hopefully I will have that information for you soon.
     

    Regards,

    Mark Miller

    Escalation Engineer

    US-CSS DSC PROTOCOL TEAM

    Friday, October 2, 2009 6:02 PM
  • Brad,

    We are still investigating this inquiry.

    Dominic Salemno
    Senior Support Escalation Engineer
    Friday, October 16, 2009 3:39 AM
  • Hi Brad,

    The product group is still working on your request, and I will respond as soon as they do.

    Regards,
    Mark Miller
    Escalation Engineer
    US-CSS DSC PROTOCOL TEAM
    Thursday, November 5, 2009 10:23 PM
  • Hi Brad,

    I have information for you regarding your forum post on Saturday, September 26, 2009.

    Can you please send me an Email Address that will allow me to send you files?

    Regards,

    Mark Miller

    Escalation Engineer

    US-CSS DSC PROTOCOL TEAM

    Monday, November 16, 2009 7:11 PM
  • Mark,

    How is this progressing? Are we still expecting to hit the end of 2009?

    Brad
    Thursday, December 17, 2009 12:48 AM
  • Hi Brad,

    Mark is out of the office until the new year so I thought I'd update you on what's happening wrt the TIFF document.  I'm working on getting an ETA date for the document but in the meantime, I wanted to make sure you were aware of Mark's blog posted 10 days ago which contains some of the pertinent information.

    I will post back here shortly to let you know what I find out about ETA on the TIFF document.

    Saturday, December 19, 2009 4:33 PM
  • Brad,

     

    In reviewing our request for full TIFF documentation with our product team, it appears that we had a miscommunication some time ago.   In the documentation that we’ve already published we have detailed the three additional tags that Microsoft Office uses.   Beyond those tags,  the Adobe specification contains the full specification as the TIFF format originated there.

     

    Let me know if you have what you need.

     

    Best regards,

    Tom Jebo

    Senior Support Escalation Engineer

    Microsoft Open Specification Documentation Support

    • Marked as answer by Tom Jebo MSFT Tuesday, December 29, 2009 7:18 PM
    • Unmarked as answer by Brad Hards Wednesday, December 30, 2009 3:47 AM
    Tuesday, December 29, 2009 7:10 PM
  • Tom,

    I don't think that is meant to be the full description. It isn't a bad description of 37680, but it doesn't really describe the format of 37679 or 37681 at all.

    Let's look at Tag 37681. The document you've pointed to says that this tag "contains positioning information which describes where the text from Tag 37679 appears on the page and information about the position of other objects such as images, tables, and hyphens. The information in this tag is used by the MODI application to enable its text selection feature."

    There is no description of how the contents of Tag 37681 relate to the contents of Tag 37679. There is no description of the positioning convention.  There is no description of how the contents of the tag are to be interpreted as position / locations / extents. There is no description of how "other objects such as images, tables, and hyphens" are encoded. It just isn't there.

    Also, I think the sample code would be better if it were packaged in a neutral format, such as zip.  Packaging it as a Windows .exe was an unusual choice for interoperability...

    Brad
    Wednesday, December 30, 2009 3:33 AM
  • This is rather humorous (in a sad way).  Now, over a year after the original request, we finally get a reference to some documentation which at least mentions the TIFF tags in question.

    However, I have a similar problem to Brad in that the documentation is incomplete.

    From my point of view (metadata extraction), I need to know the TIFF format and have a name for the tag.  For example:

    37679 - ASCII - DocumentText

    37680 - UNDEFINED - OLEPropertySetStorage

    37681 - SHORT - DocumentTextPosition

    And some basic details about the format of 37679 and 37681 could be useful (as Brad points out).
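For a metadata extractor, the example naming above could be held in a small lookup table. This is only a sketch: the names are the suggestions made in this post, not official Microsoft names, and `PROPOSED_TAGS` and `describe` are hypothetical identifiers. The numeric type codes are the standard TIFF 6.0 field types.

```python
# TIFF 6.0 field type codes.
ASCII, SHORT, UNDEFINED = 2, 3, 7

# Proposed (unofficial) names and types, as suggested in this thread.
PROPOSED_TAGS = {
    37679: (ASCII, "DocumentText"),
    37680: (UNDEFINED, "OLEPropertySetStorage"),
    37681: (SHORT, "DocumentTextPosition"),
}


def describe(tag: int) -> str:
    """Format a tag number as hex plus its proposed name."""
    _field_type, name = PROPOSED_TAGS.get(tag, (None, "Unknown"))
    return f"0x{tag:04X} {name}"
```

The hex form is included because tag 37680 is also referred to as 0x9330 elsewhere in this thread.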

    Thanks.



    Monday, January 4, 2010 5:30 PM
  • Hi Brad,

     

    Thank you for your follow up on the TIFF tag documentation.  I will work with our Product team to provide answers for your follow up questions and comments.

     

    Regards,

    Mark Miller

    Escalation Engineer

    US-CSS DSC PROTOCOL TEAM

    Wednesday, January 6, 2010 7:59 PM
  • Hi Brad,

    We are still working on this and I will update you as soon as possible.

    Regards,

    Mark Miller

    Escalation Engineer

    US-CSS DSC PROTOCOL TEAM

    Saturday, January 23, 2010 12:22 AM
  • The topic on this thread is being handled offline. We will report back any results on the thread when they are available.

    Thank you. Chris
    Wednesday, March 3, 2010 5:40 PM
  • Hi,

    I just wanted to add that I am also interested in the results of this.

    Despite being a deprecated format there are lots of historical documents out there in MDI format that need to be accessible now and in the future.

     

    Regards, Jon

    Friday, May 7, 2010 10:49 AM
  • Hi everyone – Below is a brief update in response to the issues and questions identified on this thread. We will provide further updates on this thread as more information becomes available.

     

    TIFF Extensions:

    We will provide documentation for the tag with layout information. We are starting work on this now and should be able to deliver it by early fall.  We discussed this at length with the involved technical people and we strongly recommend that if anyone wants to write out similar OCR layout information that they develop a more modern XML-based format rather than perpetuating the existing binary stream.

     

    MDI Format:

    We do not anticipate documenting this format, nor do we believe there is code that could be efficiently converted to a platform agnostic libtiff type implementation. We are investigating whether we can release a tool to provide bulk conversion from MDI to TIFF or XPS.

     

    Thanks – Chris

    • Edited by Chris Mullaney Wednesday, June 30, 2010 7:56 PM formatting
    Monday, June 21, 2010 5:42 PM
  • Hi Brad,

    I am on the Open Specification Protocols Documentation team, and have taken ownership of this issue.  I have followed this through to conclusion where Steve left off with our Product Group.

    We do not have standalone documentation of the MDI file format and don’t currently have plans to create any since the format is considered obsolete and we no longer recommend using it.  You may want to review this page: http://office.microsoft.com/en-us/help/HP062193601033.aspx.  Saving files in the TIFF format would be the more portable option.

    Regards,
    Mark Miller
    Escalation Engineer
    US-CSS DSC PROTOCOL TEAM

    This is what I'm looking for, Many thanks to your description!
    Thursday, September 16, 2010 11:41 AM
  • Sorry to drag this back up, but is there yet any documentation regarding how to read 37681?  I read the Tif Format Guidance document and the code sample (which seems to have dealt mostly with 37680) but unless I'm missing it, I don't see any specifics about 37681.  Just need the text positions so I can convert a bunch of older scanned documents to a new format.

     

     

     

    Monday, February 7, 2011 2:10 AM
  • Hi, James,

     

    Thank you for your question.  We are researching this for you and will post a response as soon as we can.

     


    Bryan S. Burgin Senior Escalation Engineer Microsoft Protocol Open Specifications Team
    Monday, February 7, 2011 8:24 PM
    Moderator
  • Hi James,

    I will investigate this issue and follow up with you.

    Regards,

    Mark Miller

    Escalation Engineer

    US-CSS DSC PROTOCOL TEAM

    Thursday, February 10, 2011 3:40 PM
  • I appreciate that, thanks very much
    Friday, February 11, 2011 5:40 PM
  • Hi James,

    Thank you for your patience.  I am still pursuing the definitive answer to this question and will update on this forum soon.

    Regards,

    Mark Miller

    Escalation Engineer

    US-CSS DSC PROTOCOL TEAM

    Tuesday, March 8, 2011 9:21 PM
  • Hi James,

    I have not forgotten about this.  I am still working to get an answer.

    Regards,

    Mark Miller

    Escalation Engineer

    US-CSS DSC PROTOCOL TEAM

    Monday, April 4, 2011 1:56 PM
  • Hi James,

    Just wanted you to know this has not been forgotten.  Our Product Group is still working on this request.

    Regards,

    Mark Miller

    Escalation Engineer

    US-CSS DSC PROTOCOL TEAM

    Thursday, May 26, 2011 1:37 PM