System.IO.Compression not as good as compressed folder

  • Question

  • I'm getting much better compression when I make a compressed folder (Windows XP) than when I use DeflateStream or GZipStream.  I thought these were the same algorithms used by PKZIP and by compressed folders.  Why is the compression so much worse?

    DeflateStream: 3,544 KB -> 1,261 KB
    GZipStream: 3,544 KB -> 1,261 KB
    Windows XP: 3,544 KB -> 804 KB

    So how can I get the same compression ratio as Windows XP?

    Thanks,

    Jeremy
    Saturday, December 24, 2005 5:03 AM

Answers

  • Chris, thanks for filing the bug report.  Looks like Microsoft will fix this in the future.  For now, I'll investigate using a 3rd party solution.

    -Jeremy
    Wednesday, January 4, 2006 4:36 PM

All replies

  • Apparently you can't.  What you see in the Deflate/GZipStream classes is all there is as far as the .NET Framework is concerned.
     
    However, this result is curious.  I too thought XP was using simple ZIP compression.  The two .NET compression streams don't expose compression quality parameters like stand-alone GZip/Zip applications, but even so the difference looks very big to me.  It's more like ZIP vs RAR than different ZIP settings.
    Saturday, December 24, 2005 8:54 AM
  • It looks like the .NET compression routines are broken.  Here are some new results:

    Original: 372K
    Deflate: 540K  <--- Expanded!
    Gzip: 540K
    Windows XP: 353K
    WinZip: 353K

    On the off chance I'm doing something wrong, here's the code to save all three versions (file is a byte[]):

                File.WriteAllBytes("c:\\data_txt.txt", file);  // Save original file

                // Save Deflate-compressed version of the file (third argument: leaveOpen = false)
                DeflateStream deflate = new DeflateStream(File.OpenWrite("c:\\data_zip.txt"),
                                                          CompressionMode.Compress, false);
                deflate.Write(file, 0, file.Length);
                deflate.Close();

                // Save GZip-compressed version of the file
                GZipStream gzip = new GZipStream(File.OpenWrite("c:\\data_gzip.txt"),
                                                 CompressionMode.Compress, false);
                gzip.Write(file, 0, file.Length);
                gzip.Close();

    -Jeremy
    Thursday, December 29, 2005 2:51 AM
  • Holy moly, you're right!  I could reproduce the effect with a file containing random bytes, as listed below.
     
    Random bytes never compress well, of course, but both .NET compression streams actually expanded the file size by a whopping 50%. The stand-alone WinRAR program only adds a few header bytes which is what should happen in this case.
     
    Original file -- 100,000 bytes
     DeflateStream -- 153,829 bytes
    GZipStream -- 153,847 bytes
    WinRAR Zip -- 100,142 bytes (regardless of quality setting)
     
     I did several runs with different random values, always deleting the files in between just to be sure.  The results were virtually identical.
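 
     As a rough sanity check on what "should" happen here: RFC 1951 lets an encoder fall back to "stored" (uncompressed) blocks of up to 65,535 bytes, each costing only about five bytes of overhead, so 100,000 incompressible bytes should grow to roughly 100,010 bytes plus container headers -- which matches the WinRAR figure above.  The 50% growth from the .NET streams suggests they never fall back to stored blocks at all.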
     
    using System;
    using System.IO;
    using System.IO.Compression;
     
    namespace CompressionTest {
     
        public class MainClass {
     
            public static void Main() {
     
                // Save original file with random bytes
                byte[] file = new byte[100000];
                Random random = new Random();
                random.NextBytes(file);
                File.WriteAllBytes("data_txt.txt", file);
     
                // Save Deflate-compressed version of the file
                DeflateStream deflate = new DeflateStream(
                    File.OpenWrite("data_zip.txt"),
                    CompressionMode.Compress);
                deflate.Write(file, 0, file.Length);
                deflate.Close();
     
                // Save GZip-compressed version of the file
                GZipStream gzip = new GZipStream(
                    File.OpenWrite("data_gzip.txt"),
                    CompressionMode.Compress);
                gzip.Write(file, 0, file.Length);
                gzip.Close();
            }
        }
    }
    Thursday, December 29, 2005 3:21 PM
  • This is very curious, by the way. I've been using GZipStream for months to compress XML files, and that works very well. The compression rate is about 10:1 or better -- nothing to complain about.
     
    Possibly Microsoft neglected to add detection code for poorly compressible data that should simply be stored. What data did you put in your file? Your WinZip/XP compression rates don't look so great, either -- was that random test data, too, or perhaps some binary data that's close to random data?
    Thursday, December 29, 2005 3:23 PM
  • Well, it looks like all the Microsofties who would know about this issue are on holiday so I filed a bug report:
     
     
    Happy new year everyone!
    Saturday, December 31, 2005 5:36 PM
  • GZipStream is just a wrapper around DeflateStream, so you'll always see similar performance from these (with GZipStream being slightly less efficient in some cases due to the compatibility bits it adds).

    Standalone compression utilities like PKZIP perform file-based compression, which is subtly different from stream compression.  When compressing a file, more analysis is possible (since all bits are known at the start), and memory allocation (mainly dictionary size) can be optimized.  File-based utilities can even pick the most efficient algorithm based on an analysis of the input bits.

    Stream compression, on the other hand, has to 'take each bit as it comes' and is more restricted memory-wise, mostly because the working set of the stream compressor has to be predictable (and small, especially for general-purpose classes like DeflateStream).

    So the short answer to your question, possibly a bit disappointing, is: "don't use stream compression if you want the smallest possible output".  Fortunately, there are several third-party compression toolkits (just Google "ZIP toolkit .NET"), many of which offer much better-performing compression algorithms than Deflate, which helps even when compressing streams.

    '//mdb

    P.S. It's perfectly normal for the data size to increase after compression if the input is random or otherwise unsuitable for the algorithm used. File-based compression utilities will opt to just store the original file in this case: pure stream compressors can't do that for obvious reasons. Of course, you can always look at the original data size and the compressed stream size, and decide which one to persist (setting a flag somewhere to indicate the format, of course...) yourself.
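
    For what it's worth, here is a minimal sketch of that last idea: compress into a temporary buffer, compare sizes, and persist whichever is smaller behind a one-byte format flag.  The helper name and the flag layout are purely illustrative, and it assumes the whole input fits in memory:

        using System.IO;
        using System.IO.Compression;

        // Illustrative helper: keep the Deflate output only if it is actually smaller.
        // The first byte written is a flag: 1 = deflated payload follows, 0 = raw payload.
        static void WriteCompressedOrStored(string path, byte[] data)
        {
            byte[] compressed;
            using (MemoryStream ms = new MemoryStream())
            {
                using (DeflateStream deflate = new DeflateStream(ms, CompressionMode.Compress, true))
                {
                    deflate.Write(data, 0, data.Length);
                }
                compressed = ms.ToArray();
            }

            using (FileStream output = File.Create(path))
            {
                bool useCompressed = compressed.Length < data.Length;
                output.WriteByte(useCompressed ? (byte)1 : (byte)0);   // format flag
                byte[] payload = useCompressed ? compressed : data;
                output.Write(payload, 0, payload.Length);
            }
        }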

    Monday, January 2, 2006 9:49 AM
  • That sounds like a reasonable explanation for the current state, but it's no excuse for not doing better. I don't see a reason why compression streams can't buffer internally. The analysis buffer wouldn't have to be bigger than a few kilobytes to determine whether compression makes sense or the data should simply be stored.
    Monday, January 2, 2006 12:20 PM
    > like PKZIP perform file-based compression, which is subtly different from stream compression.

    I don't buy it.  The file was written with one function call, so it should compress the same as PKZIP, since it's supposed to be the same algorithm.  Even if it were broken into chunks of, say, 256 bytes, the stream would only be expanded by 1% (about 3 bytes per chunk), not a whopping 50%.

    Re: System.IO.Compression broken.

    -Jeremy

    Tuesday, January 3, 2006 4:48 AM
  • > it should compress the same as PKZIP since it's supposed to be the same algorithm

    PKZIP and other file-based compression utilities can and will store files using wildly different algorithms or compression parameters. For example, here's a header dump (with the CRC and Attribute columns removed to save space) of a test ZIP file I just created using WinZip:

     Length  Method   Size  Ratio    Date     Time   Name
     ------  ------   ----- -----    ----     ----   ----
     156000  DeflatN  81904  48%  12/17/2005  15:45  newcodes.txt
      82026  Stored   82026   0%  12/18/2005  14:47  newcodes.zip
     156000  DeflatF  83125  47%  12/17/2005  15:45  newcodes2.txt

    Newcodes.txt and newcodes2.txt are the exact same plaintext file, both compressed using the Deflate algorithm. Still, there is a noticeable difference in compression ratio between the default N(ormal) Deflate configuration and the F(ast) version I forced via the command line.

    Results for a simple stream-based compressor (such as the one included in the .NET framework) will typically be in the 'DeflatF' range. This is a fact of life for stream compressors: to keep memory usage predictable and acceptable, they can't buffer too much of the stream, making look-ahead optimizations less effective.

    You'll also see that the ZIP file I added was 'Stored' instead of compressed. In this case, WinZip noticed that the file expanded after running it through the compression algorithm, and decided to discard the compressed version and store the original file. Note that this is not a function of the Deflate (or any other) algorithm, but an explicit check the programmer of the ZIP utility put in place.

    You can do the same with the System.IO.Compression streams: wrap them up in a class of your own, and persist either the plaintext or the compressed stream based on the final result. The fact that the (very basic) .NET stream compressor doesn't implement this functionality itself isn't a defect: you would need to do the exact same thing when using, say, zlib.

    Of course you're free to petition Microsoft for more full-featured compression (even though just going the third-party route sounds a lot better to me...), but the behavior of the current System.IO.Compression classes has always been as expected for me (including being able to supply streams to other Deflate implementations...).

    To prove there is a bug in DeflateStream, you would need to demonstrate significant differences in the output, for the same input file, between DeflateStream and another RFC 1951 implementation, e.g. zlib. However, since such a bug would also cause major interoperability issues, and MS most likely used an RFC 1951 reference implementation for DeflateStream anyway, I doubt there are any issues here.

    '//mdb
    Tuesday, January 3, 2006 9:25 AM
  • Now you're quibbling over semantics. Yes, as a naive implementation of a zip algorithm DeflateStream is correct. But such a naive implementation that neglects to check for incompressible data is not what a user of this class expects when literally every other available implementation does perform this check -- including Microsoft's very own Windows XP folder compression!
     
    Nor can I see how using a third-party library would be "better" than having this functionality built into the standard library -- why then have a standard library in the first place?  You say that a stream-based compressor cannot use "too much" memory to perform look-ahead optimization -- but is it not correct that DeflateStream performs no look-ahead optimization whatsoever since it does not check for incompressible data at all? Or that only a few kilobytes of buffer space would be required to perform this check?
     
    If this functionality is too difficult to implement in the stream itself, then Microsoft should provide a wrapper stream that performs after-the-fact checking when the stream has been closed, and add a warning to the documentation that a temporary file will be created for the original data. That would be fine with me, too.
     
    The present state of System.IO.Compression reminds me of the original version of Math.Round with its insane "banker's rounding". That was technically "correct", too, but it wasn't what (almost) everyone expected -- so Microsoft had to fix it in version 2.0.
    Tuesday, January 3, 2006 10:27 AM
  • Chris, thanks for filing the bug report.  Looks like Microsoft will fix this in the future.  For now, I'll investigate using a 3rd party solution.

    -Jeremy
    Wednesday, January 4, 2006 4:36 PM
  • Does the problem still exist in .NET Framework 3.0?
    Wednesday, November 8, 2006 8:36 PM
  • I'm interested in compressing a folder to a file as well and would like to know if this is possible in .NET 3 and if the compression ratio issue has been fixed.

    However, if XP's compression algorithm is better and seeing as how there doesn't (yet) seem to be a simple folder-compression command in the .NET API, couldn't we just call a Shell command and have XP itself compress a source folder to a zip file? Mind you, I don't actually know what Shell command would do this for us, but I would think it's worth looking into at least.

    Ideas?
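
    (For readers on later versions of the Framework: .NET 4.5 eventually added built-in ZIP support -- ZipArchive, plus the ZipFile helper in System.IO.Compression.FileSystem -- so folder-to-ZIP compression no longer needs a shell command there. A minimal sketch, with made-up paths:)

        // Requires .NET Framework 4.5+ and a reference to System.IO.Compression.FileSystem.
        using System.IO.Compression;

        class ZipFolderExample
        {
            static void Main()
            {
                // Compresses every file under the source folder into a single .zip archive.
                ZipFile.CreateFromDirectory("c:\\source_folder", "c:\\archive.zip");
            }
        }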

    Friday, November 24, 2006 5:54 PM
  • The problem does not exist in NetFX 3.0 because it didn't exist in version 2.0 either. The DeflateStream object applies the Deflate algorithm to data on a stream; the decision whether or not to use the output of the Deflate algorithm has to be made outside the DeflateStream object. For example, if you write code to read and write ZIP files (you can find an example of this at http://blogs.msdn.com/dotnetinterop/archive/2006/04/05/567402.aspx), it is up to you to determine what compression algorithm you will use for each stream inside the archive, including no compression at all (which the example code from the link above does NOT do). I repeat: DeflateStream is doing its job, and anyone complaining that it gives worse results than a ZIP utility doesn't understand the difference between the ZIP format itself and the compression of data streams inside a ZIP file.

    As it happens, the compression algorithm in DeflateStream does appear to have a bug, but it is something entirely different. Please don't make MS staff waste their time simply because you do not understand what you're talking about.

    Wednesday, November 29, 2006 12:25 PM
  • There's nothing to fix! DeflateStream is doing its job just fine. If anything, what's missing is a class in the Framework that can handle ZIP files (in which case the decision to store a copy of an original file instead of a compressed version would be made within the code for that class).
    Wednesday, November 29, 2006 12:28 PM
  • I have compressed a file with both a compressed folder and System.IO.Compression.DeflateStream.

    After discounting all the ZIP file header information, the compressed data is 1,466 bytes with System.IO.Compression and 1,316 bytes with the compressed folder.

    This would imply that the System.IO compression is not as good!

    Monday, March 12, 2007 1:20 PM
  • The DEFLATE algorithm supports different compression levels - it is likely that the Compressed Folders are specifying a higher level than DeflateStream. Other things that can vary are the size of the window, the amount of memory used, etc. All of these settings have an effect on how the Deflate algorithm works, and since neither XP nor NetFX allow us to check those settings, we can't really compare them.
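
    (For readers on later versions: .NET Framework 4.5 eventually added overloads that take an explicit CompressionLevel, which at least exposes this one knob. A minimal sketch -- not applicable to the 2.0/3.x versions discussed in this thread:)

        using System.IO;
        using System.IO.Compression;

        // .NET Framework 4.5+ only: choose the compression level explicitly.
        byte[] data = File.ReadAllBytes("data.txt");
        using (FileStream output = File.Create("data.dfl"))
        using (DeflateStream deflate = new DeflateStream(output, CompressionLevel.Optimal))
        {
            deflate.Write(data, 0, data.Length);
        }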

    Anyway, all of this is just to say, once again, that the compression algorithm in NetFX is _not_ broken.

    D.

    Monday, March 12, 2007 1:29 PM
  • I'm not saying that the compression is *broken*, just that it's not *as good*. Both are using level 8 in this instance, I believe.

    You can use 010 Editor (a hex editor), which has a template for parsing ZIP files, to view the compression level in the file header.

    Both of the examples I gave are using COMP_DEFLATE (8).

    It may be a stream implementation vs. a file implementation; however, I believe that according to the Deflate spec it shouldn't matter. I'll play around with my compression buffer settings to see if I can make a difference. Or maybe someone who knows exactly how the Deflate algorithm works can explain the difference to me?

    Once again, I agree that the System.IO.Compression namespace is not *broken*; however, I am intrigued by the compression differences.

    Monday, March 12, 2007 3:07 PM
  • The 8 you see in the ZIP header indicates the compression algorithm, not the Deflate compression level. AFAIK, that info is not stored at all, since the decompression algorithm doesn't care.

    Again, even if they were both using the same algorithm and the same level, differences such as the window size and the amount of memory used could have an effect on the compression ratio. If you have a look at the documentation for the zlib implementation of Deflate you'll see what I'm talking about.

    Monday, March 12, 2007 4:11 PM
  • Folks,

     

    I have a specific question....

    Is it possible, using System.IO.Compression, to unzip a file that has multiple files within it?

    I am not sure whether that is possible with the API the .NET Framework provides.

    Monday, October 15, 2007 9:34 PM
  •  Michiel de Bruijn wrote:
    Standalone compression utilities like PKZIP perform file-based compression, which is subtly different from stream compression.  When compressing a file, more analysis is possible (since all bits are known at the start), and memory allocation (mainly dictionary size) can be optimized.  File-based utilities can even pick the most efficient algorithm based on an analysis of the input bits.

    Stream compression, on the other hand, has to 'take each bit as it comes' and is more restricted memory-wise, mostly because the working set of the stream compressor has to be predictable (and small, especially for general-purpose classes like DeflateStream).


    It is likely that stream compression is less efficient than file compression. But one can suppose that if GZipStream worked on large chunks of data, it would reach a better compression ratio.

    I needed to serialize a big list of objects into a file.
    The file I got was (too) large: about 843 KB.
    So I tried to use GZipStream. I started doing it this way:
                    using (Stream fic = File.Open(fileName, FileMode.Create, FileAccess.Write))
                    {
                        IFormatter frm = (IFormatter)new BinaryFormatter();
                        System.IO.Compression.GZipStream zipper =
                            new System.IO.Compression.GZipStream(
                                fic,
                                System.IO.Compression.CompressionMode.Compress,
                                true);
                        frm.Serialize(zipper, myObjectToSerialize);
                        zipper.Close();
                        fic.Close();
                    }
    The result was a 723 KB file: a poor compression ratio!

    So I tried to use a MemoryStream between the "serializer" and the "zipper", so I could send the whole stream to the zipper in one go:
                    using (Stream fic = File.Open(fileName, FileMode.Create, FileAccess.Write))
                    {
                        MemoryStream ms = new MemoryStream();
                        IFormatter frm = (IFormatter)new BinaryFormatter();
                        frm.Serialize(ms, myObjectToSerialize);
                        System.IO.Compression.GZipStream zipper =
                            new System.IO.Compression.GZipStream(
                                fic,
                                System.IO.Compression.CompressionMode.Compress,
                                true);
                        ms.WriteTo(zipper);

                        zipper.Close();
                        fic.Close();
                    }

    The result was a 276 KB file: a much better compression ratio.

    With a file compression tool, I reached 192 KB in ZIP format at the best compression setting. So GZipStream is not as good, but not ridiculous either.

    Of course, this solution cannot be used if the stream to compress is too large to fit in a MemoryStream...
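
    If the 2.0-era GZipStream really does lose compression because of the many small writes a formatter produces, a BufferedStream placed between the formatter and the zipper might recover most of the benefit without holding the whole payload in memory. This is only an untested sketch of that idea, with an arbitrary 64 KB buffer:

        using System.IO;
        using System.IO.Compression;
        using System.Runtime.Serialization;
        using System.Runtime.Serialization.Formatters.Binary;

        static void SerializeCompressed(string fileName, object graph)
        {
            using (Stream fic = File.Open(fileName, FileMode.Create, FileAccess.Write))
            using (GZipStream zipper = new GZipStream(fic, CompressionMode.Compress))
            using (BufferedStream buffered = new BufferedStream(zipper, 64 * 1024))
            {
                IFormatter frm = new BinaryFormatter();
                // The formatter's many small writes are batched into 64 KB chunks
                // before they reach the GZipStream.
                frm.Serialize(buffered, graph);
            }
        }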
    Wednesday, January 30, 2008 4:35 PM
  • I do not believe you.

    I used the .NET 2.0 GZipStream to compress an array of UTF-8 bytes and then saved those bytes to a file.

    I then used a third-party tool using GZip to compress the same bytes and saved the output to a file.

    In both cases the only input was an array of bytes and the output was an array of bytes.

    The result was that the framework generated output three times the size of the third-party tool's.

    Both used GZip and had the exact same input.

    Thursday, March 20, 2008 11:59 AM
  • I will say this once more: the DEFLATE compression algorithm takes a number of parameters that can have an enormous impact on the amount of compression it provides. If you took the time to read up a bit (maybe RFC 1951), or at least played around with the options in your compression tool, you would be able to verify this yourself.

     

    You're obviously smart enough to understand that there are people who know more about this topic than you do, so why don't you just give it a rest and trust that they're telling you the truth?

    Thursday, March 20, 2008 6:02 PM
  • There is definitely something wrong with it, IMHO. I had a data file 139 MB in size. Compressed with Info-ZIP it's 99 MB. Compressed using DeflateStream it's 150 MB. I only made one Write() call to DeflateStream with the entire byte[]. If I'm passing in all the data at once, it shouldn't have any look-ahead problems; it already has all the data. Maybe technically the algorithm is correct, but it doesn't seem to be used very intelligently. Also, saying "just check the input and output sizes and use whichever is smaller" doesn't cut it, because speed matters too.
    Saturday, August 2, 2008 5:22 PM
  • @All: I am facing similar problems of files expanding instead of compressing, and I am quite amazed that you people reported this problem almost two years back; since then several new versions of the .NET Framework have been released, the latest offering being .NET Framework 4.0, and you still have the same problem...
    Sunday, September 5, 2010 5:43 AM
  • There is definitely something wrong with it, IMHO. I had a data file 139 MB in size. Compressed with Info-ZIP it's 99 MB. Compressed using DeflateStream it's 150 MB. I only made one Write() call to DeflateStream with the entire byte[]. If I'm passing in all the data at once, it shouldn't have any look-ahead problems; it already has all the data. Maybe technically the algorithm is correct, but it doesn't seem to be used very intelligently. Also, saying "just check the input and output sizes and use whichever is smaller" doesn't cut it, because speed matters too.


    To try and clear this up:

    What Michiel de Bruijn is saying is that a stream is not a file.

    A stream can be written to memory, a file, a port, or plenty of other places. Because of this, it can't compress the same way a file compression tool does, since such a tool only compresses whole files; otherwise you couldn't use it for, say, streaming XML over the network to a remote server.

    So yes, Microsoft could make this routine compress files better, but only by removing the other functionality it needs in order to be a stream.

    Plus, buffering the data is infeasible: how much do you buffer?

    The whole stream, since that is the only way to maximise compression?

    That's not going to happen, because the only time a stream knows it's finished is when it's closed. This would mean that nothing gets written until you close your stream, forcing massive overheads on your transmission medium and making communication over networks impossible.

    Without the stream knowing where it is destined, you can't even buffer the data at the packet level, since the packet size depends on the hardware interface layer and the device drivers.

    So what Microsoft has given us is a compressing stream, not a file compressor; as such, comparing one to the other is neither right nor feasible.

    So if you need a very small compressed file, you need a dedicated file compressor, which Microsoft has not provided, meaning you either write your own or obtain one from a third party.


    Definition of a Beginner
    Someone that is unaware of the rules that need to be obeyed

    Definition of an Expert
    Someone that knows when and which rules to ignore
    Thursday, November 18, 2010 11:50 AM
  • Regardless of the reason why the compression is poor, I still believe Microsoft should revisit and improve this class, even if it's as simple as exposing compression settings through the class.

    And no, it doesn't matter that it's implemented as a stream. Data is data, just a list of bytes. No matter how the bytes are read, you should be able to compress them equally well.

    Thursday, December 9, 2010 3:36 PM
  • I agree with Kratz.  I have the same problem with the GZipStream class, i.e. it increases the size instead of decreasing it, and when it does decrease it, it does so only minimally.

    I understand and appreciate that compressing a stream is different from compressing a file, but I have a few comments and observations.  I have been doing development longer than most people, on every operating system out there (and some that are not out there anymore) and in just about every major professional language. 

    It is not a difficult challenge to program a stream compression solution to NOT increase the size of the original input stream; be that as it may, it is also NOT a difficult challenge to add a property or two to allow the consumer of the class to control the behavior.  I am shocked that after over six years Microsoft has not answered the challenge with either A) a file compression class or B) an enhanced GZipStream class that provides more control or better compression.  The GZipStream class as a general class is unusable for any real-world application expected to work with an unpredictable data set.  It is only usable in the most controlled and narrow environments.

    I find it curious that the J# GZIPOutputStream class behaves exactly as I would want.  It never increases the size, shrinks the original stream 99% of the time, AND compresses files much more significantly.  And from a class consumer's perspective it behaves exactly like the GZipStream class.  So the following code using the GZipStream class:

     

              // Open the file as a FileStream object.
              infile = new FileStream(fd.FileName, FileMode.Open, FileAccess.Read, FileShare.Read);
              byte[] buffer = new byte[infile.Length];
              // Read the file to ensure it is readable.
              int count = infile.Read(buffer, 0, buffer.Length);
              if (count != buffer.Length)
              {
                infile.Close();
                Console.WriteLine("Test Failed: Unable to read data from file");
                return;
              }
              infile.Close();
              string strOutFile = Path.GetDirectoryName(fd.FileName) + "\\" +Path.GetFileNameWithoutExtension(fd.FileName) + ".gz";
              outfile = new FileStream(strOutFile,FileMode.CreateNew);
              // Use the newly created memory stream for the compressed data.
              GZipStream compressedzipStream = new GZipStream(outfile, CompressionMode.Compress, true);
              compressedzipStream.Write(buffer, 0, buffer.Length);
              // Close the stream.
              compressedzipStream.Close();
              outfile.Close();
    
    

     is equivalent to:

              // Open the file as a FileStream object.
              infile = new FileStream(fd.FileName, FileMode.Open, FileAccess.Read, FileShare.Read);
              byte[] buffer = new byte[infile.Length];
              sbyte[] sbuffer = new sbyte[infile.Length];
              // Read the file to ensure it is readable.
              int count = infile.Read(buffer, 0, buffer.Length);
              if (count != buffer.Length)
              {
                infile.Close();
                Console.WriteLine("Test Failed: Unable to read data from file");
                return;
              }
              infile.Close();
              string strOutFile = Path.GetDirectoryName(fd.FileName) + "\\" + Path.GetFileNameWithoutExtension(fd.FileName) + ".gz";
              outfile = new FileOutputStream(strOutFile);
              // Use the newly created memory stream for the compressed data.
              GZIPOutputStream compressedzipStream = new GZIPOutputStream(outfile, buffer.Length);
              for ( int i = 0; i < buffer.Length; ++ i)
              {
                sbuffer[i] = (sbyte)buffer[i];
              }
              compressedzipStream.write(sbuffer, 0, sbuffer.Length);
              // Close the stream.
              compressedzipStream.close();
              outfile.close();
    
    

     

    So the GZipStream example is an unusable solution.  It fails to give good compression on random data, quite frequently increasing the overall size.  But the second, J# GZIPOutputStream example, which is virtually identical code, works exactly as one would expect.  Now, I understand the underlying implementations are completely different, but the developer who wrote the J# implementation should be re-engaged to implement the same solution in the GZipStream class.  If someone gave me the task of implementing a GZipStream class and I delivered this one, I would have been asked to "fix" it before it ever got released.

    I disagree with closing this "bug" out with "As designed".  That is a cop-out, and it should be reopened and fixed.

    I would be interested in whether and when this is planned to be addressed in .NET.  For now, my C# .NET 3.5.1 solution relies on the J# Redistributable, and it works perfectly. 

     

    Thursday, June 9, 2011 4:44 PM