Determine if data is GZip'd

    Question

  • We have a web service that stores arbitrary data that users send us as a Base-64 string (files, images, etc.; we don't care what).  I have read that the first 3 bytes of a GZip'd file are "1F 8B 08", so when I get the string, I would like to know whether the data was previously compressed.  If it has NOT been compressed, I will convert it back to bytes, GZip compress it, convert the GZip bytes back to Base-64, and then store the Base-64 string.  This should substantially reduce our storage size, so I really need to do this.

    What I can't figure out is how to determine the real BYTE values behind the Base-64 string so I can check whether they match that signature.  If the data was previously compressed, I will just store the original Base-64 string from the client and save considerable overhead.

    So my question is "How can I determine the Byte Values of the first 3 bytes of the base-64 string?"

    Friday, February 18, 2011 7:57 PM

Answers

  • Yes, the gzip file header starts with the bytes "1F 8B 08", although the last byte (08, the compression method) may take other values in the future.

    Since Base64 works by encoding every 3 unencoded bytes as 4 encoded characters, checking for the gzip file header is very easy: the string will start with "H4sI" (the Base64-encoded version of "1F 8B 08").

    As mentioned previously, however, this isn't foolproof. What if you get an uncompressed stream that happens to start with those bytes?  It's probably not all that likely, but it could still happen.

    ShaneB

    Friday, February 18, 2011 11:47 PM
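
The prefix check described in this answer can be sketched roughly as follows. This is a minimal illustration in Python rather than the .NET code the thread is about; the names `GZIP_MAGIC` and `looks_gzipped` are made up for the example, and it assumes the client sends standard Base64 with no leading whitespace.

```python
import base64

# gzip header bytes: ID1, ID2, and CM = 8 (deflate)
GZIP_MAGIC = bytes([0x1F, 0x8B, 0x08])


def looks_gzipped(b64_text: str) -> bool:
    """Return True if the Base64 text decodes to data that starts with 1F 8B 08."""
    head = b64_text[:4]  # 4 Base64 characters carry exactly 3 bytes
    if len(head) < 4:
        return False
    try:
        first_three = base64.b64decode(head, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return first_three == GZIP_MAGIC
```

Because the three header bytes fill the first four characters exactly, this is equivalent to testing `b64_text.startswith("H4sI")`; decoding just makes the byte comparison explicit.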

All replies

  • I don't think this scenario is possible, because the only way you know that file content is compressed is the .gz file extension. To the computer, data is always just data when you look at it from a "file content" perspective. The other problem: what if your original (uncompressed) file had "1F 8B 08" as its original data?
    Balaji Baskar
    http://codesupport.wordpress.com
    Friday, February 18, 2011 9:40 PM
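
If a false positive on the signature is a real concern, one stricter option is to try decompressing the payload, since a gzip stream also carries a CRC-32 and length trailer. The following is only a sketch in Python, with the hypothetical name `is_valid_gzip`; it costs a full decompression per upload, which may or may not be acceptable.

```python
import base64
import gzip
import zlib


def is_valid_gzip(b64_text: str) -> bool:
    """Return True only if the decoded payload decompresses cleanly as gzip."""
    try:
        data = base64.b64decode(b64_text, validate=True)
        if not data:
            return False  # an empty payload is not a gzip stream
        gzip.decompress(data)  # checks header, deflate data, and CRC trailer
    except (ValueError, OSError, EOFError, zlib.error):
        # ValueError: bad Base64; OSError: bad gzip header or CRC;
        # EOFError: truncated stream; zlib.error: corrupt deflate data
        return False
    return True
```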
  • Even if you try to zip a file that has already been zipped, a good compression scheme outputs a file that is about the same size as what you started with.
    Mark the best replies as answers. "Fooling computers since 1971."

    http://rudedog2.spaces.live.com/default.aspx

    Friday, February 18, 2011 9:49 PM
    Moderator
  • Our scenario is:

    a) Customer chooses to store a "file" in our system

    b) Customer converts this file (probably a binary file of some sort) to a Base-64 string and sends it to our web service

    c) We store the file in our database. However, if the Base-64 string is big (say, bigger than even 1K), we can get a lot of benefit out of GZipping the string prior to putting it into our DB.  Then, when the customer requests it back, we un-GZip it prior to returning it.

    d) We have Extension Methods to String (Compress() and UnCompress()) that do the GZipping just fine.  What I DO NOT want to do is GZip a "chunk of data" that was already GZip'd - it would not do a whole lot for us.

    I was hoping to be able to look at the base-64 string that was sent to me, and figure out if the first 3 bytes were of the signature that GZip uses, and if so SKIP the compression step and just store it as is.

    However, my problem is that I don't know how to interpret the "first 3 bytes" of the Base-64 string to see if they are indeed the magic signature.

    So my question remains: how do I find the BYTE values of the first few bytes of a chunk of data that is represented as a Base-64 string?

    Thanks.

    Friday, February 18, 2011 10:38 PM
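
The store-and-return flow in steps a) through d) might look something like the sketch below. It is written in Python for illustration, not taken from the poster's .NET extension methods; `prepare_for_storage`, `restore_from_storage`, and the `compressed_by_us` flag are hypothetical names. The flag is an assumption of this sketch: without it, a payload the client had already gzipped could not be told apart from one compressed on the server when it is read back.

```python
import base64
import gzip

GZIP_B64_PREFIX = "H4sI"  # Base64 encoding of the bytes 1F 8B 08


def prepare_for_storage(b64_text: str) -> tuple[str, bool]:
    """Return (text_to_store, compressed_by_us)."""
    if b64_text.startswith(GZIP_B64_PREFIX):
        return b64_text, False  # already looks gzipped: store as is
    raw = base64.b64decode(b64_text)
    packed = base64.b64encode(gzip.compress(raw)).decode("ascii")
    if len(packed) >= len(b64_text):
        return b64_text, False  # compression did not help: store as is
    return packed, True


def restore_from_storage(stored_text: str, compressed_by_us: bool) -> str:
    """Undo prepare_for_storage and return the Base64 text the client sent."""
    if not compressed_by_us:
        return stored_text
    raw = gzip.decompress(base64.b64decode(stored_text))
    return base64.b64encode(raw).decode("ascii")
```

The size comparison is an extra safeguard: small or already-compressed payloads (JPEG, ZIP, and so on) can grow when gzipped, so they are stored unchanged.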
  • Thank you.  This indeed tells me what to look for, and by examining one in my debugger I did see H4sI - That is what I will look for, but still, is there some link someplace that tells me how to interpret the base-64 characters into byte hex values?  That was my question. 
    Friday, February 18, 2011 11:54 PM
  • Wikipedia explains how each group of 3 bytes is converted to 4 Base64 characters.

    http://en.wikipedia.org/wiki/Base64

    ShaneB

    Saturday, February 19, 2011 12:02 AM
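
To make the 4-characters-to-3-bytes conversion concrete, here is a small hand-rolled decoder in Python. It is only a sketch for illustration (in practice a library Base64 decoder would be used), and `ALPHABET` and `decode_quad` are names invented for the example. Each character maps to a 6-bit value; four of them concatenate into 24 bits, which split into three bytes.

```python
# Standard Base64 alphabet: index in this string = 6-bit value of the character
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def decode_quad(quad: str) -> bytes:
    """Decode exactly four unpadded Base64 characters into three bytes."""
    if len(quad) != 4:
        raise ValueError("expected exactly four Base64 characters")
    value = 0
    for ch in quad:
        # shift the accumulated bits left and append this character's 6 bits
        value = (value << 6) | ALPHABET.index(ch)
    return value.to_bytes(3, "big")  # 24 bits -> 3 bytes


# H=7, 4=56, s=44, I=8 -> 000111 111000 101100 001000 -> 1F 8B 08
print(decode_quad("H4sI").hex())  # prints 1f8b08
```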