none
MSOffice files corrupted when writing in/out of VARBINARY field

    Question

  • Hello,

    I'm having a problem retrieving MSOffice files from a VARBINARY field in a SQL Server 2008 database. 

    Here's the table setup...

    CREATE TABLE TestDb.dbo.Attachments (
      AttachmentID INT NOT NULL
      ,CourseID INT NULL
      ,Creator INT NULL
      ,Date DATETIME
      ,DisplayOnWeb VARCHAR(2)
      ,Filename VARCHAR(200)
      ,Extension VARCHAR(12)
      ,Contents VARBINARY(MAX)
      ) ON [PRIMARY];

    ...and here's the code I use to write the files in...

    DECLARE @Command NVARCHAR(2048);
    SET @Command = N'UPDATE TestDb.dbo.Attachments '
                  + 'SET Contents = (SELECT * FROM OPENROWSET(BULK 'C:\Attachments\orig06.doc', SINGLE_BLOB) AS Contents) '
                  + 'WHERE AttachmentID = 6';
    EXECUTE sp_executesql @Command;

    ...and here's what I use to write files back out...

    DECLARE @Command NVARCHAR(2048);
    SET @Command = N'BCP "SELECT Contents FROM TestDb.dbo.Attachments WHERE AttachmentID = 6" QUERYOUT '
                  + 'C:\test_query_output\output06.doc -n -ServerName\SQLServerName -Usa -Pmypassword ';
    EXECUTE master..xp_cmdshell @Command;

    So, I run this in a process for several hundred files including DOC, DOCX, XLS, XLSX, and PDF.  The PDF files open fine.  All the MSOffice files won't open because they're corrupted.  When I open up the old and new files in a text editor the only difference is a series of characters added at the front of the new file.

    Here are two versions of the same DOCX file opened in Notepad++.  The original version starts with the characters "PK"...

    The version copied out of SQL Server has eight characters that have been added in front of the "PK" characters...


    Here's an original XLSX file.  Again, the file starts with the characters "PK"...

    And here's the file copied out of SQL Server, again with an additional eight characters at the beginning...


    The same thing happens with PDF files, but they can be read just fine in Adobe Reader.  Here's an original file...

    ...and here's the same file copied out of SQL Server...

    I also compared the contents of the VARBINARY field for the different file types, and they start with the same series of characters for both DOCX and XLSX files (0x504B030414000600080000002100...).  That suggests to me that the extra characters are added during the export process and not the import process.  I've tried tinkering with some of the BCP switches, substituting "-n" with "-w", "-c", and "-C RAW", but none of these made any difference and the documentation states I should be using "-n" with binary files.

    I know some respondents might say I should be doing this using FILESTREAM, and I will look into that later.  However, the current SQL Server documentation I'm looking at suggests that if most files are under 1MB then I should still be using a VARBINARY field.  And in any case, I suspect this is some kind of encoding error and I'd like to learn what the solution is since it could show up when I'm using other technologies.

    Thanks very much for your suggestions, Bill J


    Bill Jones wjones20@emich.edu







    • Edited by Bill in MI Wednesday, May 08, 2013 2:08 PM
    Wednesday, May 08, 2013 2:03 PM

All replies

  • If I recall correctly, I read about similar problem in Jacob Sebastian blog on www.beyondrelational.com Unfortunately, last few days I am having troubles searching this site. I am not sure if it's a local problem or problem with the site. I suggest to try searching Jacob's blog there on 'varbinary openrowset' keywords.

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    Wednesday, May 08, 2013 3:56 PM
  • Naomi,

    Thanks for your suggestion.  Here's the post I think you're referring to...

    http://beyondrelational.com/modules/2/blogs/28/posts/10402/why-does-excel-2007-prompt-for-repair-when-opening-a-document-stored-in-sql-server-2008-filestream-c.aspx

    It describes a very similar problem with these differences: 1) He's using the FILESTREAM datatype, 2) the extra characters get added at the end instead of the beginning, and 3) he isolated the problem to .NET scripting he'd created to perform the export. 

    I'm going to use his test of copying the field contents directly from the database to a file and see if that will load normally. More later...

    Thanks, Bill


    Bill Jones wjones20@emich.edu

    Wednesday, May 08, 2013 4:19 PM
  • Yes, that was the blog I was thinking of.

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    Wednesday, May 08, 2013 4:40 PM