BinaryFormatter is catastrophically slow!
-
Friday, June 15, 2012 05:36
Hello,
I'm currently comparing the performance of the serialization/deserialization implementations in Java and C#. I have programmed a very simple database of books which consists of these objects:
class Library
{
    private List<Book> _listBooksByName = new List<Book>(Program.BOOKS_COUNT);
    private List<Book> _listBooksByOrder = new List<Book>(Program.BOOKS_COUNT);
    private Dictionary<String, List<Book>> _booksInGenres = new Dictionary<String, List<Book>>();

    public static Library Load(string filename)
    {
        Library lib = null;
        Stream fileIn = null;
        try
        {
            IFormatter formatter = new BinaryFormatter();
            fileIn = new FileStream(filename, FileMode.Open);
            lib = (Library)formatter.Deserialize(fileIn);
            fileIn.Close();
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
            if (fileIn != null)
            {
                fileIn.Close();
            }
            lib = null;
        }
        return lib;
    }

    public static bool Save(string filename, Library lib)
    {
        bool ok = true;
        Stream fileOut = null;
        try
        {
            fileOut = new FileStream(filename, FileMode.Create);
            IFormatter formatter = new BinaryFormatter();
            formatter.Serialize(fileOut, lib);
            fileOut.Close();
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
            if (fileOut != null)
            {
                fileOut.Close();
            }
            ok = false;
        }
        return ok;
    }

    // other members, methods...
};

class Book
{
    private String _name;
    private int _order;
    private List<Chapter> _chapters = new List<Chapter>();
    private int[] _borrowingTimes = null;
    // ...
    // other members and methods
};

class Chapter
{
    private String _name;

    public String Name
    {
        get { return _name; }
    }

    public Chapter(String name)
    {
        _name = name;
    }
}
The program creates a database of books, where each book has a random number of chapters (max. 30). It stores the data to disk (Library.Save()), then reloads the data from disk (Library.Load()), and then runs some tests (data access and so on; it's not important).
The C# program uses the BinaryFormatter to store the database (library) to disk. Serialization works fine, but deserialization is catastrophically slow in comparison to the equivalent program in Java. For example:
When I generate a library with 50,000 books, deserialization takes 8.7 seconds in C# and 4.3 seconds in Java. OK, that's not so bad. But when I generate 500,000 books, deserialization takes 1008 seconds in C#, while the Java deserialization takes only 43 seconds. That's 23x faster than C#!
So the question is: what am I doing wrong? Or is there something wrong with the .NET BinaryFormatter?
I can provide the complete program if someone is interested. Unfortunately it's not possible to attach it zipped to this post, and I don't want to paste it directly into the text as it is not that short.
The program runs under .NET Framework 4.0.
- Edited by PetrMachacek Friday, June 15, 2012 05:53
All replies
-
Friday, June 15, 2012 05:50
You remind me of somebody who, long ago, claimed that a sort in Cobol was much slower than in Fortran.
He had written a Cobol program by converting every instruction of his Fortran program to a Cobol instruction. He did not know that Cobol has a Sort instruction.
Is there no better overall solution for your problem than translating instructions one-to-one from Java to C#?
For instance, why use a binary formatter if you want to create an XML file? .NET is loaded with possibilities for that.
One of them
http://support.microsoft.com/kb/815813
Success
Cor
-
Friday, June 15, 2012 07:21
I'm sorry, but I don't think your example with Cobol and Fortran describes this situation. I'm using .NET's built-in serialization mechanism for the C# program and Java's built-in serialization mechanism for the Java program.
>>For instance why use a binary formatter if you want to create an XML file, .Net is loaded with possibilities for that.
Why do you think I want to create an XML file? Actually I don't want an XML file..
The reason I don't use some kind of XML serializer is that it does not work automatically for all types. For example, my Library class contains a Dictionary, which is not serializable by XmlSerializer (it throws an exception saying that IDictionary is not serializable). OK, I can use DataContractSerializer, but then I have to decorate each serializable class and member with DataContractAttribute or DataMemberAttribute respectively, and so on. I don't want to do this. I want fully automatic serialization, where the programmer does not need to worry about any additional manual operations on the class, because those are a source of future bugs ("Uh.. I'm sorry, dear customer, I forgot to decorate the new member one year ago, and that's the reason why your data is not complete now..").
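For readers who have not seen the decoration being discussed, here is a minimal sketch of what it could look like with DataContractSerializer. The classes are cut-down stand-ins for the ones in the question, and RoundTrip, DataContractDemo and the public accessors are hypothetical names added only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;

[DataContract]
class Chapter
{
    [DataMember] private string _name;
    public string Name { get { return _name; } }
    public Chapter(string name) { _name = name; }
}

[DataContract]
class Book
{
    [DataMember] private string _name;
    [DataMember] private List<Chapter> _chapters = new List<Chapter>();
    public string Name { get { return _name; } }
    public List<Chapter> Chapters { get { return _chapters; } }
    public Book(string name) { _name = name; }
}

[DataContract]
class Library
{
    // A Dictionary member is accepted by DataContractSerializer.
    [DataMember]
    private Dictionary<string, List<Book>> _booksInGenres =
        new Dictionary<string, List<Book>>();
    public Dictionary<string, List<Book>> BooksInGenres { get { return _booksInGenres; } }
}

static class DataContractDemo
{
    // Hypothetical helper: serialize to memory and read the graph back.
    public static Library RoundTrip(Library lib)
    {
        var serializer = new DataContractSerializer(typeof(Library));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, lib);
            stream.Position = 0;
            return (Library)serializer.ReadObject(stream);
        }
    }

    static void Main()
    {
        var lib = new Library();
        var book = new Book("Example book");
        book.Chapters.Add(new Chapter("Chapter 1"));
        lib.BooksInGenres["fiction"] = new List<Book> { book };

        Library copy = RoundTrip(lib);
        Console.WriteLine(copy.BooksInGenres["fiction"][0].Name);
    }
}
```

A private field without [DataMember] is silently skipped here, which is exactly the "forgotten member" risk described above. Also note that, by default, a Book referenced from several collections is written once per reference; [DataContract(IsReference = true)] is the documented switch for preserving shared references.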
But ok, forget the mention about Java. My question should be changed as follows:
Why is the C# deserialization time not linear in the number of serialized objects?
50,000 objects -> 8.7 seconds for deserialization
100,000 objects -> 33 seconds (I expect 19 seconds)
500,000 objects -> 1008 seconds (I expect 87 seconds)
Am I doing something wrong, or is the problem in the BinaryFormatter implementation?
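To put a number on "not linear", one rough approach is to assume the time grows like n^k and estimate k from pairs of the measurements above. The sketch below does only that arithmetic; ScalingEstimate and Exponent are hypothetical names, and three data points are of course a thin basis for any firm conclusion.

```csharp
using System;

static class ScalingEstimate
{
    // If time ~ c * n^k, then k = log(t2 / t1) / log(n2 / n1).
    public static double Exponent(double n1, double t1, double n2, double t2)
    {
        return Math.Log(t2 / t1) / Math.Log(n2 / n1);
    }

    static void Main()
    {
        // Measurements quoted in the post: (objects, seconds).
        Console.WriteLine(Exponent(50000, 8.7, 100000, 33));
        Console.WriteLine(Exponent(50000, 8.7, 500000, 1008));
    }
}
```

Both estimates come out close to 2, so the measured times look roughly quadratic in the number of objects rather than linear.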
But I found this post, where someone has similar problem and no one was able to answer him:
http://social.msdn.microsoft.com/forums/en-US/csharpgeneral/thread/ae2d5ccb-af67-44b3-ae36-67ffbb8fbb8b
Probably there is some problem with BinaryFormatter...
-
Friday, June 15, 2012 09:12 Moderator
"Why the C# deserialization time is not linear to number of serialized objects?"
Likely because it wasn't optimized for very large object graphs. One primary use for binary serialization is remoting, and in remoting you typically don't deal with 500,000 objects.
But IMO you're really asking the wrong question; you should ask "should I use binary serialization for this task?"
My answer to that is simply: "no way". What you are trying to do is better served by some sort of database, a file-based database like SQLite, SQL Compact or even an Access database. Alternatively you could devise your own binary file format, but that's probably more work than needed. As for reasons not to use serialization:
- binary serialization is not easy to read from anything but your .NET app
- the file structure is strongly coupled to the object structure; if you change the objects you may have problems with older files
- it's not possible to read only parts of the file into memory; the file format is pretty much opaque and it doesn't allow random access to its contents
- even if deserialization took 87 seconds, that's still a lot of time. If you tell me, as a customer of your app, that the application needs more than 1 minute to load, I would tell you "well, that sucks"
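As an illustration of the "devise your own binary file format" option mentioned above, here is a minimal sketch using BinaryWriter and BinaryReader. SimpleBook and BookFile are hypothetical, flattened stand-ins for the classes in the question, not the poster's actual types, and the layout (count-prefixed records) is just one possible choice.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical flattened record; Name and chapter names are assumed non-null.
class SimpleBook
{
    public string Name;
    public int Order;
    public List<string> ChapterNames = new List<string>();
}

static class BookFile
{
    // Layout: [int32 bookCount] then per book:
    // [string name] [int32 order] [int32 chapterCount] [chapterCount x string]
    public static void Write(Stream output, IList<SimpleBook> books)
    {
        var writer = new BinaryWriter(output, Encoding.UTF8);
        writer.Write(books.Count);
        foreach (SimpleBook book in books)
        {
            writer.Write(book.Name);
            writer.Write(book.Order);
            writer.Write(book.ChapterNames.Count);
            foreach (string chapter in book.ChapterNames)
            {
                writer.Write(chapter);
            }
        }
        writer.Flush(); // the caller owns and closes the stream
    }

    public static List<SimpleBook> Read(Stream input)
    {
        var reader = new BinaryReader(input, Encoding.UTF8);
        int bookCount = reader.ReadInt32();
        var books = new List<SimpleBook>(bookCount);
        for (int i = 0; i < bookCount; i++)
        {
            var book = new SimpleBook();
            book.Name = reader.ReadString();
            book.Order = reader.ReadInt32();
            int chapterCount = reader.ReadInt32();
            for (int c = 0; c < chapterCount; c++)
            {
                book.ChapterNames.Add(reader.ReadString());
            }
            books.Add(book);
        }
        return books;
    }

    static void Main()
    {
        var books = new List<SimpleBook>();
        var book = new SimpleBook { Name = "Example book", Order = 1 };
        book.ChapterNames.Add("Chapter 1");
        books.Add(book);

        using (var stream = new MemoryStream())
        {
            Write(stream, books);
            stream.Position = 0;
            Console.WriteLine(Read(stream)[0].Name);
        }
    }
}
```

In a format like this each book is stored once, so secondary structures such as a by-genre dictionary would have to be rebuilt after loading. The cost of reading is a single pass over the records, with no object-identity bookkeeping.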
-
Friday, June 15, 2012 09:43
During deserialization a lot of type checking, security validation and ISerializable method handling takes place. I suspect object lookups and memory allocation are also taking their toll. I expect the UnsafeDeserialize method to be much faster. I also expect you'll get much higher performance by batching your object graph into smaller subsets and storing the start/end positions in the header of your binary file, if you really want to take this approach.
If not, do check out the in-memory/in-process databases or a NoSQL solution.
Also check out the backwards compatibility issues you will run into with this approach. They will be much harder to solve if you haven't thought about them up front, and many developers know the XML/DataContract serializer considerations a lot better than those surrounding the ISerializable interface.
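The batching idea above (smaller subsets, with their positions stored in a file header) could be sketched roughly as follows. This is only one possible layout under stated assumptions: each chunk is an already-serialized byte array (for example, one batch of books produced by whatever serializer you use), writing starts at stream position 0, and ChunkedFile, Write and ReadChunk are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ChunkedFile
{
    // Layout: [int32 chunkCount]
    //         [chunkCount x (int64 offset, int32 length)]
    //         [chunk payloads...]
    // Offsets are absolute and assume the file starts at stream position 0.
    const int EntrySize = sizeof(long) + sizeof(int);

    public static void Write(Stream output, IList<byte[]> chunks)
    {
        var writer = new BinaryWriter(output);
        writer.Write(chunks.Count);

        // The header size is known up front, so offsets can be computed directly.
        long offset = sizeof(int) + (long)chunks.Count * EntrySize;
        foreach (byte[] chunk in chunks)
        {
            writer.Write(offset);
            writer.Write(chunk.Length);
            offset += chunk.Length;
        }
        foreach (byte[] chunk in chunks)
        {
            writer.Write(chunk);
        }
        writer.Flush();
    }

    // Reads a single chunk without touching the others (requires a seekable stream).
    public static byte[] ReadChunk(Stream input, int index)
    {
        var reader = new BinaryReader(input);
        input.Position = 0;
        int count = reader.ReadInt32();
        if (index < 0 || index >= count)
        {
            throw new ArgumentOutOfRangeException("index");
        }
        input.Position = sizeof(int) + (long)index * EntrySize;
        long offset = reader.ReadInt64();
        int length = reader.ReadInt32();
        input.Position = offset;
        return reader.ReadBytes(length);
    }

    static void Main()
    {
        var chunks = new List<byte[]> { new byte[] { 1, 2, 3 }, new byte[] { 4, 5 } };
        using (var stream = new MemoryStream())
        {
            Write(stream, chunks);
            Console.WriteLine(ReadChunk(stream, 1).Length);
        }
    }
}
```

Each chunk can then be deserialized on its own, which keeps every individual object graph small. References that cross chunk boundaries would need separate handling, for example by storing IDs instead of object references.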
My blog: blog.jessehouwing.nl
-
Friday, June 15, 2012 11:01
To: Mike Danes
Thanks for the reply. I know all your arguments and I agree with them.
You wrote:
But IMO you're really asking the wrong question, you should ask "should I use binary serialization for this task?"
This is exactly the question I asked myself, and it was the reason why I made the test (benchmark). First I have to say that the data (Book, Chapters, Library) are artificial. In a real application I would use an SQL database.
I'm testing several technologies and approaches (for example C++ persistence based on pointer swizzling in comparison to Java and .NET binary serialization, the embedded database approach, etc.; it's too long a story ;)). The poor performance of the BinaryFormatter surprised me, and the surprise was even greater when I tested the same approach in Java. OK, Java and .NET are different technologies, but why can Java's binary deserialization time be linear in the number of objects when .NET's cannot? Are the binary deserialization algorithms so different? In my opinion the .NET binary deserialization should be revised.
I don't expect a reply to the "why". It's more of a rhetorical question. I've already got the answer.
- Edited by PetrMachacek Friday, June 15, 2012 12:03
-
Friday, June 15, 2012 11:06
UnsafeDeserialize() didn't improve the deserialization performance. But thanks for the hint.
- Edited by PetrMachacek Friday, June 15, 2012 11:09
-
Friday, June 15, 2012 11:23 Moderator
"why Java's binary deserialization time can be linear to number of objects and .NET's cannot? Are the binary deserialization algorithms so different? In my opinion the .NET binary deserialization should be revised."
Likely because someone at some point decided to optimize .NET's binary deserialization for a particular scenario but not for others. Pretty much every kind of serialization that involves object graphs needs a way to preserve object identity and references. This is done by using some sort of object ID that is mapped at load time to object instances. It appears that the .NET binary formatter uses a list plus a linear scan, and this could explain the poor performance for a large number of objects. I suppose they could have used a hashtable, but that would probably use more memory; there's always some trade-off to be made somewhere.
- Marked as answer by Bob Shen (Microsoft Contingent Staff, Moderator) Tuesday, June 26, 2012 08:55
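To illustrate why the choice of ID-to-object table matters, the sketch below resolves n object IDs once with a list plus linear scan and once with a Dictionary. It is not the BinaryFormatter implementation, only a toy model of the two lookup strategies; IdLookupDemo and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class IdLookupDemo
{
    // Toy model: every lookup scans the list, so n lookups cost O(n^2) in total.
    public static int ResolveWithList(int n)
    {
        var table = new List<KeyValuePair<long, object>>(n);
        for (long id = 0; id < n; id++)
        {
            table.Add(new KeyValuePair<long, object>(id, new object()));
        }

        int found = 0;
        for (long id = 0; id < n; id++)
        {
            foreach (KeyValuePair<long, object> entry in table)
            {
                if (entry.Key == id) { found++; break; }
            }
        }
        return found;
    }

    // Hash-based lookup: n lookups cost O(n) in total on average.
    public static int ResolveWithDictionary(int n)
    {
        var table = new Dictionary<long, object>(n);
        for (long id = 0; id < n; id++)
        {
            table[id] = new object();
        }

        int found = 0;
        object value;
        for (long id = 0; id < n; id++)
        {
            if (table.TryGetValue(id, out value)) found++;
        }
        return found;
    }

    static void Main()
    {
        foreach (int n in new[] { 5000, 10000, 20000 })
        {
            Stopwatch sw = Stopwatch.StartNew();
            ResolveWithList(n);
            long listMs = sw.ElapsedMilliseconds;

            sw.Restart();
            ResolveWithDictionary(n);
            long dictMs = sw.ElapsedMilliseconds;

            Console.WriteLine("{0} ids: list scan {1} ms, dictionary {2} ms", n, listMs, dictMs);
        }
    }
}
```

Doubling n should roughly quadruple the list-scan time while the dictionary time only about doubles, which is the same kind of growth as in the timings reported earlier in the thread.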

