A Question About Binary Serialization And The Ultimate File Size

    Question

  • Hi folks,

    I have a class library which I'm currently building. In that, one of the classes will be serialized (binary) and all of that is working fine, but I'm disappointed with the size of the resultant binary file.

    I have (among many other things) three somewhat large text (string) properties. You can think of them as:

    First One: A paragraph
    Second One: A section of several paragraphs
    Third One: The whole chapter!

    Obviously that's hyperbolic, but if we take it as given, binary serialization of those properties will write the paragraph, the several paragraphs, and the chapter ALL from that one instance of the class.

    Suffice it to say that the resultant binary file isn't small - and that's just one of what may be thousands of instances of the class.

    I had a thought about that which is the crux of my question (hoping some of you have had to deal with this before):

    If I convert each of the three texts (strings) to a byte array, then those byte arrays will be serialized - not the raw text - and it seems to me that would make for a more compact resultant file.

    Would it?

    Would I run into an issue with serialization or deserialization of those byte arrays?

    I realize that some will reply "well, try it and find out!", and indeed I could, but I'm hoping that some of you have faced this issue in the past and might offer what your ultimate solution was. It adds a layer of complexity to everything which, if it's a futile exercise, I'd just as soon avoid from the start. ;-)

    Thanks in advance! :)


    Please call me Frank :)

    Tuesday, February 11, 2014 7:19 PM


All replies

  • Text data can be packed using UUENCODE and unpacked using UUDECODE, which will reduce the file size by 30% to 40%.


    jdweng

    Tuesday, February 11, 2014 7:37 PM
  • Text data can be packed using UUENCODE and unpacked using UUDECODE, which will reduce the file size by 30% to 40%.


    jdweng

    Thanks Joel,

    I'm not familiar with those. Could you explain more please?


    Please call me Frank :)

    Tuesday, February 11, 2014 7:43 PM
  • I have no idea, and this article, "BinaryFormatter vs. Manual Serializing" at Code Project, may or may not have anything to do with what you're talking about. :(

    Please BEWARE that I have NO EXPERIENCE and NO EXPERTISE and probably onset of DEMENTIA which may affect my answers! Also, I've been told by an expert, that when you post an image it clutters up the thread and mysteriously, over time, the link to the image will somehow become "unstable" or something to that effect. :) I can only surmise that is due to Global Warming of the threads.

    Tuesday, February 11, 2014 7:44 PM
  • I have no idea and this article BinaryFormatter vs. Manual Serializing at code project may or may not have anything to do with what you're talking about. :(

    I am doubtful about that given strings, though.

    Of course I might very well stand to be corrected! ;-)


    Please call me Frank :)

    Tuesday, February 11, 2014 7:49 PM
  • I have an idea about all of this - I'm confident enough that I'm going to call this one "done".

    Thanks everyone! :)


    Please call me Frank :)

    Tuesday, February 11, 2014 7:56 PM
  • To reduce the size, you can also introduce an intermediate stream (e.g. DeflateStream) that compresses and decompresses the serialized data. See an example: http://social.msdn.microsoft.com/Forums/vstudio/en-US/2cfcbac4-2b21-4247-a774-c54037ea6da9. This will compress not only your texts, but also data associated with serializer.
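
    A minimal sketch of that wiring (untested; the module and method names are just placeholders, not from Frank's library):

        Imports System.IO
        Imports System.IO.Compression
        Imports System.Runtime.Serialization.Formatters.Binary

        Module CompressedStorage
            ' Serialize through a DeflateStream so the bytes reach the
            ' disk already compressed.
            Public Sub SaveCompressed(obj As Object, path As String)
                Dim formatter As New BinaryFormatter()
                Using fs As New FileStream(path, FileMode.Create)
                    Using ds As New DeflateStream(fs, CompressionMode.Compress)
                        formatter.Serialize(ds, obj)
                    End Using
                End Using
            End Sub

            ' Reverse the process for deserialization.
            Public Function LoadCompressed(path As String) As Object
                Dim formatter As New BinaryFormatter()
                Using fs As New FileStream(path, FileMode.Open)
                    Using ds As New DeflateStream(fs, CompressionMode.Decompress)
                        Return formatter.Deserialize(ds)
                    End Using
                End Using
            End Function
        End Module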


    Tuesday, February 11, 2014 8:06 PM
  • Hi Frank,

    To address the main question first: I don't know, but I'm curious, so someone is going to have to try it and find out, LOL.  I suppose that depending on how the string is being serialized, it could be using two bytes per character, which may be unnecessary given the actual string contents.  So there could be some potential for savings there.  But I would expect the serialization of an ASCII string to be nearly equivalent in size to the serialization of a byte array.
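
    For instance, a quick way to check that two-bytes-per-character intuition (a sketch; the counts depend on your actual text, and what BinaryFormatter itself writes on the wire is a separate question):

        ' Compare a mostly-ASCII string at two bytes per character
        ' (UTF-16) against the same text encoded as UTF-8 bytes.
        Sub CheckEncodingSizes()
            Dim text As String = New String("a"c, 10000)
            Console.WriteLine(System.Text.Encoding.Unicode.GetByteCount(text)) ' 20000
            Console.WriteLine(System.Text.Encoding.UTF8.GetByteCount(text))    ' 10000
        End Sub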

    As to an ultimate answer however, my approach would probably be two-fold:

    In regard to the string content itself, assuming I understand the situation correctly, "First One" and "Second One" are subsets of "Third One" so only "Third One" actually has to be serialized; the others can be represented by CharacterRange instances.  The First One can be represented by two integers (starting index and length, or start/stop index, depending on how you look at it).  The Second One can be represented by an array of integer pairs.  This should reduce the amount of data to serialize quite a bit when the strings are long.  And you could measure the string length to determine which kind of serialization to do if you really want to squeeze that file size down.
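
    A rough sketch of that layout under the substring assumption (the class and field names here are hypothetical, not from Frank's library; System.Drawing.CharacterRange would work too, but plain integers serialize just as well):

        ' Hypothetical layout: store the chapter once and describe the
        ' paragraph and the section as ranges into it.
        <Serializable()> _
        Public Class IndexedText
            Public FullText As String          ' "Third One", serialized once
            Public ParagraphStart As Integer   ' "First One" as a (start, length)
            Public ParagraphLength As Integer  ' pair into FullText
            Public SectionRanges As Integer(,) ' "Second One" as (start, length) pairs

            Public Function GetParagraph() As String
                Return FullText.Substring(ParagraphStart, ParagraphLength)
            End Function
        End Class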

    The second thing I would do would be to use a GZip stream around the file stream to see how much advantage there was to just compressing the whole binary.  If I was unimpressed by the results, I might try zipping just the string content then serializing that binary output as part of my object graph.  With a text-heavy class, one of those methods is likely to yield favorable results.

    If I still wasn't happy with my file size, then I'd start to look at an indexing routine to either gather all the text together for a single zip operation or to manually index out the text content and then only store the root word list and index info in my object graph.  Or choose a different storage mechanism than binary serialization if I was really going to need the performance.

    HTH!


    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"

    Tuesday, February 11, 2014 8:08 PM
    Moderator
  • To reduce the size, you can also introduce an intermediate stream (e.g. DeflateStream) that compresses and decompresses the serialized data. See an example: http://social.msdn.microsoft.com/Forums/vstudio/en-US/2cfcbac4-2b21-4247-a774-c54037ea6da9. This will compress not only your texts, but also data associated with serializer.


    I like that idea - but my thinking is to do the same after the fact, so either way the result will be a compressed version of the "real thing".

    Thanks!


    Please call me Frank :)

    Tuesday, February 11, 2014 8:13 PM
  • Reed,

    Your supposition of the strings being related is *sort of* true but not in terms of being substrings.

    The first is plain unformatted text, the second is RTF text (of the first one), and the third is a really large "in-line HTML tagged HTML string".

    As to the compression - that's exactly where I'm going with this, but I'm not (right now) going to do it on the class instance per se, but on the entire set (all instances).

    Or that's my thinking anyway. ;-)


    Please call me Frank :)

    Tuesday, February 11, 2014 8:15 PM
  • To reduce the size, you can also introduce an intermediate stream (e.g. DeflateStream) that compresses and decompresses the serialized data. See an example: http://social.msdn.microsoft.com/Forums/vstudio/en-US/2cfcbac4-2b21-4247-a774-c54037ea6da9. This will compress not only your texts, but also data associated with serializer.


    I like that idea - but my thinking is to do the same after the fact, so either way the result will be a compressed version of the "real thing".

    Thanks!


    Please call me Frank :)

    Sometimes it makes a difference where you do it.

    Human language tends to have a lot of repeatable patterns, so plain text generally compresses pretty well.  Sometimes the compression algorithms do better (compression-ratio-wise) if you just compress the text and not the rest of the binary graph.  It's situational, but I have seen instances where the rest of the data from the graph seems to degrade the algorithm's ability to compress the text (which is the bulk of the content bytes), and a better result is achieved by zipping the text and adding those bytes to the graph (without compressing the rest of the graph).  I'd suggest making a test file where you zip just the text itself to get a baseline for how compressible it is.  That will give you something to compare against when testing the rest of the object graph for compression gains.
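
    For that baseline test, something along these lines would zip a single string down to bytes (a sketch, untested):

        ' Deflate one string to a byte array; the array's length is the
        ' baseline for how compressible the text is, and the bytes could
        ' be stored in the graph in place of the raw string.
        Private Function CompressText(text As String) As Byte()
            Dim raw As Byte() = System.Text.Encoding.UTF8.GetBytes(text)
            Using ms As New IO.MemoryStream()
                Using ds As New IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Compress)
                    ds.Write(raw, 0, raw.Length)
                End Using ' closing the DeflateStream flushes it before ToArray
                Return ms.ToArray()
            End Using
        End Function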


    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"

    Tuesday, February 11, 2014 8:23 PM
    Moderator
  • That's what I would do. Start with the easiest and work backward until I hit a number I'm happy with.  Note that it's really quite simple to add the compression stream - just wrap the stream that your formatter is using now with a new compression stream, then pass the compression stream to the formatter.

    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"

    Tuesday, February 11, 2014 8:26 PM
    Moderator
  • Sometimes it makes a difference where you do it.

    Human language tends to have a lot of repeatable patterns, so plain text generally compresses pretty well.  Sometimes the compression algorithms do better (compression-ratio-wise) if you just compress the text and not the rest of the binary graph.  It's situational, but I have seen instances where the rest of the data from the graph seems to degrade the algorithm's ability to compress the text (which is the bulk of the content bytes), and a better result is achieved by zipping the text and adding those bytes to the graph (without compressing the rest of the graph).  I'd suggest making a test file where you zip just the text itself to get a baseline for how compressible it is.  That will give you something to compare against when testing the rest of the object graph for compression gains.

    I *think* I can see your point, but to give you an example of what I'm working with:

        <Serializable()> _
        Public Class CodeBlockManager
            <NonSerialized()> _
            Private Shared ckZipUnlockCode As String = "REDACTED"

            Private _id As Long

            Private _codeName As String
            Private _codeType As String
            Private _codeRTF As String
            Private _codePlainText As String
            Private _codeHTML As String

            Private _codeNotes As String = ""
            Private _minFramework As String = ""
            Private _description As String = ""
            Private _findIt As String = ""

    So I think it all stands to be potentially reduced by compression.

    Your thoughts?


    Please call me Frank :)

    Tuesday, February 11, 2014 8:28 PM
  • Well, the markup within the strings kinda throws the natural language patterns out the window anyway, lol.  So yeah, in this case there's probably little chance of reducing the compression ratio with excess data.

    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"

    Tuesday, February 11, 2014 8:38 PM
    Moderator
  • Well, the markup within the strings kinda throws the natural language patterns out the window anyway, lol.  So yeah, in this case there's probably little chance of reducing the compression ratio with excess data.

    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"

    Well so far it's taken (a single instance) from 16K to 3K, so I'm encouraged.

    It's all in flux yet of course, but it's an improvement! :)

    *****

    Thanks again for your help!


    Please call me Frank :)

    Tuesday, February 11, 2014 8:40 PM
  • This is what I have so far.

    Currently I'm not compressing it "on the fly", but rather writing the actual binary file out to a .tmp file, compressing it, writing that to a file (that's .bin), then deleting the .tmp file.
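
    That after-the-fact step might look roughly like this (a sketch; the paths and the choice of GZipStream are illustrative, not my exact code):

        ' Stream the raw .tmp binary through a GZipStream into the
        ' final .bin file, then (eventually) delete the temporary.
        Public Sub CompressFile(tmpPath As String, binPath As String)
            Using input As New IO.FileStream(tmpPath, IO.FileMode.Open)
                Using output As New IO.FileStream(binPath, IO.FileMode.Create)
                    Using gz As New IO.Compression.GZipStream(output, IO.Compression.CompressionMode.Compress)
                        input.CopyTo(gz)
                    End Using
                End Using
            End Using
            'IO.File.Delete(tmpPath) ' deletion commented out for now, as noted below
        End Sub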

    I'll try to optimize it at some point in the future, but for the purposes of this, I've commented out the deletion of the .tmp file, and the following shows the results with three instances being serialized:

    Not horrible I don't think!

    If you look at the contents of the .bin file, I think it's fairly well obvious (relative to the original):

    Thanks again everyone - I really appreciate the input!

    :)


    Please call me Frank :)


    Tuesday, February 11, 2014 9:22 PM