Microsoft Extensions to TIFF?
-
Friday, October 10, 2008 9:45 AM
This is a wishlist for a new spec, rather than a query on the existing specs.
Microsoft products sometimes put extra tags into TIFF documents, and of course older Microsoft Office versions produced MDI (which I think of as a TIFF variant) using the Microsoft Office Document Imaging Writer virtual printer.
Initial investigation indicates three new kinds of image compression:
- MODI_BLC 34718
- MODI_PTC 34720
- MODI_VECTOR 34719
MODI_VECTOR appears to basically be Enhanced Metafile.
MODI_BLC and MODI_PTC are not understood.
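As a rough sketch, the three values above could be recognised when reading the TIFF Compression field (tag 259). The names are the ones used in this post, not names from any published specification, and `describe_compression` is a hypothetical helper.

```python
# Compression values reported in this post. The names are the ones used
# above; they are not taken from a published specification.
MODI_COMPRESSION = {
    34718: "MODI_BLC",
    34719: "MODI_VECTOR",
    34720: "MODI_PTC",
}


def describe_compression(value: int) -> str:
    """Return a label for a TIFF Compression (tag 259) value."""
    name = MODI_COMPRESSION.get(value)
    if name is None:
        return f"not a MODI-specific value ({value})"
    return f"{name} ({value})"
```

This only labels the value; it says nothing about how to decode the image data.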
In addition to the compression methods, there are unknown tags (fields). These unknown properties appear to occur in both TIFF and MDI files.
37679 - appears on every page, looks like the text version of the document contents. The content is 0x01 0x00, followed by a length (4 bytes, aka long) which is 6 bytes less than the actual length of this field (i.e. it is the remaining length), followed by the UTF-8 text version. Each phrase is delimited by a space followed by a newline (0x20 0x0a aka ' \n'). The end is 0x0d 0x00.
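A minimal sketch of decoding that layout, based only on the observations above. Two things are assumptions rather than observed facts: that the 4-byte length is little-endian, and that the 0x0d 0x00 terminator is counted inside that length. `parse_modi_text` is a hypothetical name.

```python
import struct


def parse_modi_text(payload: bytes) -> list:
    """Split a tag 37679 payload into phrases, following the layout described above."""
    if len(payload) < 6 or payload[:2] != b"\x01\x00":
        raise ValueError("payload does not start with 0x01 0x00")
    # Assumption: the 4-byte length field is little-endian.
    (remaining,) = struct.unpack_from("<I", payload, 2)
    if remaining != len(payload) - 6:
        raise ValueError("length field does not match the payload size")
    body = payload[6:]
    # The post reports a 0x0d 0x00 terminator; drop it if present.
    if body.endswith(b"\x0d\x00"):
        body = body[:-2]
    text = body.decode("utf-8", errors="replace")
    # Phrases are delimited by a space followed by a newline.
    return [phrase for phrase in text.split(" \n") if phrase]
```

The length check is deliberately strict, so a file that breaks the assumptions fails loudly instead of yielding garbage text.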
37680 - only appears to occur on the first page, always appears to be length 4096, always starts with 0xd0 0xcf 0x11 0xe0 0xa1 0xb1 0x1a 0xe1, then a string of zeros, and then varies. Perhaps some kind of metadata dictionary? It is located at the end of the file, and there are 16-bit wide characters that look like "Root Entry", "CONTENTS" (sometimes more than once, even if only one page), "prop2" (sometimes more than once), "prop3" (sometimes more than once), "DICT", "Summary Information", "Owner" and some names. There might be some random stuff / fill in there too. There also appears to be a consistent string "AuvsxjatP0udlw1Aaq5eubr5h" (this might not be ASCII though - there is always a 0x05 0x00 on the front of it).
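The eight leading bytes match the signature of the OLE compound file (structured storage) format, and "Root Entry" is the usual name of the root directory entry in such files. That suggests the tag embeds a compound file, though this is an inference from the observed bytes and not a confirmed description. A small check, assuming the 16-bit characters are UTF-16LE (both helper names are hypothetical):

```python
# Signature that begins every OLE compound file (structured storage).
COMPOUND_FILE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_compound_file(payload: bytes) -> bool:
    """True if the tag 37680 payload starts with the compound file signature."""
    return payload[:8] == COMPOUND_FILE_MAGIC


def has_entry_name(payload: bytes, name: str) -> bool:
    """True if `name` occurs in the payload as UTF-16LE (an assumption about the encoding)."""
    return name.encode("utf-16-le") in payload
```

If both checks pass, an existing compound-file reader is probably a better next step than a hand-written parser.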
37681 - appears on every page, always starts with 0x02 0x00 (+ 0x00, 0x00?), then varies. Possibly the thumbnail image?
Would it be possible to get some clarification / confirmation on the compression methods and unknown tags (including any additional tags not yet found)? A spec would be ideal, but given that MDI isn't so common and the preference is to move to XPS, perhaps just some notes here?
- Changed Type Steve Smegner Wednesday, November 19, 2008 4:43 AM
All Replies
-
Friday, October 10, 2008 2:01 PM Moderator
Hi Brad,
Thanks for your post.
We'll let you know as soon as we have news or questions.
Regards,
SEBASTIAN CANEVARI - MSFT SEE Protocol Documentation Team
-
Thursday, October 30, 2008 3:47 PM
Sebastian,
Is there any news, or even a timeframe for this?
Brad
-
Thursday, November 20, 2008 8:45 AM
Brad,
Could you describe your use case scenario?
Steve Smegner
Application Development Consulting Group
-
Thursday, November 20, 2008 10:24 AM
Hi Steve,
The idea is to provide better support for TIFF and MDI files produced by Microsoft Office Document Imaging on other platforms. I'm particularly interested in Okular.
Right now, we have TIFF support, and the various pages display fine. That is done using libtiff (http://www.remotesensing.org/libtiff/). I'd like to provide the users with whatever support we can (just as for the .snp case).
There are essentially two aspects to this:
1. Support for the Microsoft-specific TIFF tags (37679, 37680 and 37681 are the ones I know of). I do have initial support for the text extraction part (just implemented - see http://websvn.kde.org/?view=rev&revision=886464 for the actual code changes), but not for the other two tags.
2. Display of MDI files in the same way we currently display TIFF files. That requires knowledge about the three MDI-specific codecs (per my original request).
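For the first aspect, locating the private tags needs nothing beyond the standard TIFF directory layout. The sketch below walks the IFD chain of a classic TIFF file and reports where tags 37679, 37680 and 37681 occur. It is an illustration under stated assumptions, not the Okular code: `find_private_tags` is a hypothetical name, and MDI and BigTIFF files are not handled.

```python
import struct

MS_PRIVATE_TAGS = (37679, 37680, 37681)


def find_private_tags(data: bytes) -> list:
    """Return (page, tag, field_type, count) for each private tag in a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"  # little-endian file
    elif data[:2] == b"MM":
        endian = ">"  # big-endian file
    else:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    found = []
    seen = set()  # guards against a looping IFD chain
    page = 0
    while ifd_offset and ifd_offset not in seen:
        seen.add(ifd_offset)
        (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
        for i in range(entry_count):
            # Each IFD entry is 12 bytes: tag, type, count, value-or-offset.
            entry_at = ifd_offset + 2 + 12 * i
            tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry_at)
            if tag in MS_PRIVATE_TAGS:
                found.append((page, tag, field_type, count))
        next_at = ifd_offset + 2 + 12 * entry_count
        (ifd_offset,) = struct.unpack_from(endian + "I", data, next_at)
        page += 1
    return found
```

Reading the tag payloads themselves is left out; in practice libtiff can hand back private-tag data once the tags are registered with it.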
The bigger concept here is that given that TIFF is an industry standard format, I'd like to see Microsoft document its extensions to that format.
-
Friday, January 09, 2009 4:26 PM
Greetings Brad,
I wanted to let you know that we have not forgotten this request. Due to the holidays and the deprecated nature of the MDI formats we are still tracking down the nature of these compression tags. My sources are back from vacation and the holidays and I hope to have an update for you very soon. Thanks for your patience.
Steve Smegner
Application Development Consulting Group
-
Friday, January 09, 2009 9:58 PM
Steve,
Thanks for the continuing work on this, and for the status update.
Much appreciated.
Brad
-
Thursday, February 05, 2009 3:43 PM Moderator
Hi Brad,
I am on the Open Specification Protocols Documentation team, and have taken ownership of this issue. I have followed this through to conclusion where Steve left off with our Product Group.
We do not have standalone documentation of the MDI file format and don’t currently have plans to create any since the format is considered obsolete and we no longer recommend using it. You may want to review this page: http://office.microsoft.com/en-us/help/HP062193601033.aspx. Saving files in the TIFF format would be the more portable option.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
- Proposed As Answer by Mark Miller_DSCMicrosoft Employee, Moderator Thursday, February 05, 2009 3:43 PM
- Marked As Answer by Chris MullaneyMicrosoft Employee Thursday, February 05, 2009 5:41 PM
- Unmarked As Answer by Brad Hards Friday, February 06, 2009 12:48 AM
-
Friday, February 06, 2009 12:48 AM
Hi Mark,
I appreciate that MDI is obsolete; however (as you point out), TIFF is not. My original wishlist concerned both MDI and TIFF, which might have confused things. So let's exclude MDI, and only deal with TIFF files as produced by contemporary Microsoft applications.
There are private tags (fields) in TIFF files produced by those tools, as noted in my original request:
37679, 37680, 37681.
Is documentation of those tags available under the Interoperability Principles? I can understand that it may not be (given that they are explicitly private tags), but I'd prefer not to have to figure them out using a binary editor...
Brad
-
Friday, February 27, 2009 2:50 PM Moderator
Hi Brad,
I apologize for the delay.
I can now confirm the Product Group will document the TIFF tags, and the expected timeframe for the documentation is the end of August.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
- Proposed As Answer by Mark Miller_DSCMicrosoft Employee, Moderator Friday, February 27, 2009 2:50 PM
- Marked As Answer by Mark Miller_DSCMicrosoft Employee, Moderator Friday, February 27, 2009 2:50 PM
-
Monday, August 31, 2009 4:52 PM
It is now the end of August. Please tell me where I can find this documentation. Thanks!
-
Wednesday, September 09, 2009 2:22 PM Moderator
Hello Phil,
I checked on the status of the documents with our Product Group and they are not yet ready. I apologize for the delay. The documentation for the TIFF tags turned out to be much more involved and complex than expected. The Product Group informs me that the documentation should be ready by the end of the year.
Having said this, if you can provide more specifics on what you are trying to accomplish or need for TIFF tag details we may be able to assist you in the interim.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
-
Thursday, September 10, 2009 1:54 PM
Hi Mark,
Thanks for the offer, but what I really need is the documentation so I can add support for this TIFF information to my metadata extraction utility. I am particularly interested in the details of tag 37680 (0x9330) if indeed this is a "metadata dictionary".
- Phil
-
Saturday, September 26, 2009 6:31 AM
Mark,
Can you advise which private TIFF tags are used in Microsoft products (by number, and if possible, the name of the tag)?
Can you confirm that 37679 (if present) is always the text version of the page content, per my original post?
Can you advise whether 37680 is some kind of metadata dictionary? I recognise that the documentation for the tag may not yet be available.
Can you advise whether 37681 is some kind of thumbnail? I recognise that the documentation for the tag may not yet be available.
Brad
-
Saturday, September 26, 2009 2:47 PM Moderator
Hi Brad,
I'll research this and respond asap.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
-
Friday, October 02, 2009 6:02 PM Moderator
Hi Brad,
The Product Group is addressing your request for details of these TIFF tags and hopefully I will have that information for you soon.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
-
Friday, October 16, 2009 3:39 AM
Brad,
We are still investigating this inquiry.
Dominic Salemno
Senior Support Escalation Engineer
-
Thursday, November 05, 2009 10:23 PM Moderator
Hi Brad,
The product group is still working on your request, and I will respond as soon as they do.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
-
Monday, November 16, 2009 7:11 PM Moderator
Hi Brad,
I have information for you regarding your forum post on Saturday, September 26, 2009.
Can you please send me an Email Address that will allow me to send you files?
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
-
Thursday, December 17, 2009 12:48 AM
Mark,
How is this progressing? Are we still expecting to hit the end of 2009?
Brad
-
Saturday, December 19, 2009 4:33 PM Moderator
Hi Brad,
Mark is out of the office until the new year, so I thought I'd update you on what's happening with the TIFF document. I'm working on getting an ETA for the document, but in the meantime I wanted to make sure you were aware of Mark's blog post from 10 days ago, which contains some of the pertinent information.
I will post back here shortly to let you know what I find out about the ETA on the TIFF document.
-
Tuesday, December 29, 2009 7:10 PM Moderator
Brad,
In reviewing our request for full TIFF documentation with our product team, it appears that we had a miscommunication some time ago. In the documentation that we’ve already published we have detailed the three additional tags that Microsoft Office uses. Beyond those tags, the Adobe specification contains the full specification as the TIFF format originated there.
Let me know if you have what you need.
Best regards,
Tom Jebo
Senior Support Escalation Engineer
Microsoft Open Specification Documentation Support
- Marked As Answer by Tom Jebo MSFTModerator Tuesday, December 29, 2009 7:18 PM
- Unmarked As Answer by Brad Hards Wednesday, December 30, 2009 3:47 AM
-
Wednesday, December 30, 2009 3:33 AM
Tom,
I don't think that is meant to be the full description. It isn't a bad description of 37680, but it doesn't really describe the format of 37679 or 37681 at all.
Let's look at Tag 37681. The document you've pointed to says that this tag "contains positioning information which describes where the text from Tag 37679 appears on the page and information about the position of other objects such as images, tables, and hyphens. The information in this tag is used by the MODI application to enable its text selection feature."
There is no description of how the contents of Tag 37681 relate to the contents of Tag 37679. There is no description of the positioning convention. There is no description of how the contents of the tag are to be interpreted as positions / locations / extents. There is no description of how "other objects such as images, tables, and hyphens" are encoded. It just isn't there.
Also, I think the sample code would be better if it were packaged in a neutral format, such as zip. Packaging it as a Windows .exe was an unusual choice for interoperability...
Brad
-
Monday, January 04, 2010 5:30 PM
This is rather humorous (in a sad way). Now, over a year after the original request, we finally get a reference to some documentation which at least mentions the TIFF tags in question.
However, I have a similar problem to Brad in that the documentation is incomplete.
From my point of view (metadata extraction), I need to know the TIFF format and have a name for the tag. For example:
37679 - ASCII - DocumentText
37680 - UNDEFINED - OLEPropertySetStorage
37681 - SHORT - DocumentTextPosition
And some basic details about the format of 37679 and 37681 could be useful (as Brad points out).
Thanks.
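To make the request concrete, this is roughly the kind of table a metadata extractor needs. The names and field types are the examples suggested in this post, not official ones; the numeric type codes are the standard TIFF 6.0 field types, and `tag_name` is a hypothetical helper.

```python
# TIFF 6.0 field type codes.
TIFF_ASCII, TIFF_SHORT, TIFF_UNDEFINED = 2, 3, 7

# Names and types as suggested in this post; they are placeholders,
# not taken from Microsoft documentation.
PROPOSED_TAGS = {
    37679: ("DocumentText", TIFF_ASCII),
    37680: ("OLEPropertySetStorage", TIFF_UNDEFINED),
    37681: ("DocumentTextPosition", TIFF_SHORT),
}


def tag_name(tag: int) -> str:
    """Return the proposed name for a tag, or a generic label if it is unknown."""
    name, _field_type = PROPOSED_TAGS.get(tag, (f"Unknown_0x{tag:04X}", None))
    return name
```

With an official name and type per tag, the table could be filled in without guesswork.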
-
Wednesday, January 06, 2010 7:59 PM Moderator
Hi Brad,
Thank you for your follow up on the TIFF tag documentation. I will work with our Product team to provide answers for your follow up questions and comments.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
-
Saturday, January 23, 2010 12:22 AM Moderator
Hi Brad,
We are still working on this and I will update you as soon as possible.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
-
Wednesday, March 03, 2010 5:40 PM
The topic on this thread is being handled offline. We will report back any results on the thread when they are available.
Thank you.
Chris
-
Friday, May 07, 2010 10:49 AM
Hi,
I just wanted to add that I am also interested in the results of this.
Despite being a deprecated format there are lots of historical documents out there in MDI format that need to be accessible now and in the future.
Regards, Jon
-
Monday, June 21, 2010 5:42 PM
Hi everyone – Below is a brief update in response to the issues and questions identified on this thread. We will provide further updates on this thread as more information becomes available.
TIFF Extensions:
We will provide documentation for the tag with layout information. We are starting work on this now and should be able to deliver it by early fall. We discussed this at length with the involved technical people and we strongly recommend that if anyone wants to write out similar OCR layout information that they develop a more modern XML-based format rather than perpetuating the existing binary stream.
MDI Format:
We do not anticipate documenting this format, nor do we believe there is code that could be efficiently converted to a platform agnostic libtiff type implementation. We are investigating whether we can release a tool to provide bulk conversion from MDI to TIFF or XPS.
Thanks – Chris
- Edited by Chris MullaneyMicrosoft Employee Wednesday, June 30, 2010 7:56 PM formatting
-
Thursday, September 16, 2010 11:41 AM
Hi Brad,
I am on the Open Specification Protocols Documentation team, and have taken ownership of this issue. I have followed this through to conclusion where Steve left off with our Product Group.
We do not have standalone documentation of the MDI file format and don’t currently have plans to create any since the format is considered obsolete and we no longer recommend using it. You may want to review this page: http://office.microsoft.com/en-us/help/HP062193601033.aspx. Saving files in the TIFF format would be the more portable option.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
This is what I'm looking for. Many thanks for your description!
-
Monday, February 07, 2011 2:10 AM
Sorry to drag this back up, but is there yet any documentation regarding how to read 37681? I read the Tif Format Guidance document and the code sample (which seems to have dealt mostly with 37680) but unless I'm missing it, I don't see any specifics about 37681. Just need the text positions so I can convert a bunch of older scanned documents to a new format.
-
Monday, February 07, 2011 8:24 PM Moderator
Hi, James,
Thank you for your question. We are researching this for you and will post a response as soon as we can.
Bryan S. Burgin
Senior Escalation Engineer
Microsoft Protocol Open Specifications Team
-
Thursday, February 10, 2011 3:40 PM Moderator
Hi James,
I will investigate this issue and follow up with you.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
-
Friday, February 11, 2011 5:40 PM
I appreciate that, thanks very much.
-
Tuesday, March 08, 2011 9:21 PM Moderator
Hi James,
Thank you for your patience. I am still pursuing the definitive answer to this question and will update on this forum soon.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
-
Monday, April 04, 2011 1:56 PM Moderator
Hi James,
I have not forgotten about this. I am still working to get an answer.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM
-
Thursday, May 26, 2011 1:37 PM Moderator
Hi James,
Just wanted you to know this has not been forgotten. Our Product Group is still working on this request.
Regards,
Mark Miller
Escalation Engineer
US-CSS DSC PROTOCOL TEAM

