Performance - varbinary(max) on very large number of files

Jawab Performance - varbinary(max) on very large number of files

  • Saturday, July 28, 2012 6:34 AM
     
     

    We have a system for Document Management System where we are storing nearly 2 Terabytes of documents on file system and maintaining their link in database; the size of files on average is like 80-100 KB; but the number of files must be too much; may be 20 million (not for sure since we didn't count them yet). The documents are also keep growing as these are our customers documents which we need to save online for lifetime.

    Now, we are having a DMS Revamping project; our concern in revamping is that; can we use varbinary(max) for this much large number of files? How this can affect performance? As a matter of fact, I am a DBA and looking after any issues related to database and I need to highlight any performance related issue in the same regard;

    My questions rephrased: How the performance of other databases and other processes/queries within the same database can be affected if we use this approach? How the performance will be affected if this table will be used in joins? How the backup operations and de - fragmentation will work out for such a large table?

    Personally, I feel that this is not much scalable approach but I have no solid word from Microsoft; on contrary, we have comparisons on web sites where it is recommended that for short size of files; saving them in varbinary(max) is a better approach (with respect to performance). Any help in this regard will be much helpful.

All Replies

  • Saturday, July 28, 2012 8:51 AM
     
     

    So how is performance today? If you store that many documents in the Windows file system, you need some mechanisms to keep directory sizes in check, and it can be a good thing to turn off generation of 8.3 names.

    2TB is a bit of a challenge, and careful planning is to recommend. Not only for the performance of the application, but also for manageability. That is, backups, disaster recovery and so on. Then again, with the current solution that backup and restore is even more of a challenge, since you have a heterogenous environment.

    I don't see that with a good design that you would get bad performance. With bad design, all sorts of bad things can happen. Obviously.

    It seems that it would be a good idea to but the blobs on a separate filegroups, you can use the TEXTIMAGE clause in the CREATE TABLE statement to achieve this.

    You can backup filegroups independently, but you can of course not restore them willy-nilly. It sounds as if the documents once stored never changes. It would be appealing to at some point close a filegroup and make it readonly, but I don't think you can do this within a table - unless you use partitioning.

    I guess that users are able to search the documents? Full-text indexing would of course be of interest in that case.

    Personally, I would not consider a design with documents in the file system with that volume. But with SQL 2012 the new FileTable feature may be of interest. In this case the documents are in the file system, but still part of the database. On the other hand, I think this is mainly intended for situations where users wants to be able to access documents through Explorer and work with them as any other file.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
  • Saturday, July 28, 2012 11:47 AM
     
     

    Dear Erland,

    Many thanks for so many suggestions. Let me clarify a bit more about my situation. Our documents are read-only; we don't need any full-text search, metadata search will suffice. Our documents generation or insertion is very frequent, let say 4K to 5K documents each day. Taking backups of large database is a big headache for me in our current environment but that is not particularly my question. My question in very simple words is that "will saving documents in BLOB or varbinary(max) in SQL Server with that much volume is a scalable solution". I am afraid that whole system will get die.

    Currently, it is implemented as a File System; and links are being maintained in database; and the current DMS application logic creates a new sub-folder after every few days or after every couple of thousand of files; so the backups of these files from our infrastructure team is seamless and having no impact on our database operations.

    Regarding TEXTIMAGE clause in CREATE TABLE and FileTable approach of 2012; i will further look into these features, but still hope to get some solid answer on VARBINARY(MAX) is either scalable storage solution of LARGE no. of documents as YES or NO.

    Anyways, I am still thankful to all points you have given.

  • Saturday, July 28, 2012 6:46 PM
     
     Answered

    Do you have any information that tells you that the system will die?

    I have implemented a solution like this in SQL Server, but if I was asked to architect one, my initial approact would be to store the documents in SQL Server. This makes backup/restore easier - since the database is a single unit. It also makes transactions simple. With a file-system solution you can never be sure that you don't either have documents in the file system without paths, or paths without documents.

    Since you only do metadata search, this means that the only access to the documents is when a user decides to view it. That is a the pages occupied by the document are only read when needed. From an application perspective, the database will work like it does today. The main challenge for you will be the backups. But you are not the first to face a 2TB database.

    Overall, I just don't see any reason why the system would die.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    • Marked As Answer by Fahad M. Nasim Wednesday, August 01, 2012 8:45 AM
    •  
  • Wednesday, August 01, 2012 8:45 AM
     
     

    Dear Erland,

    No, I don't have any solid information by what do I mean as "system will die"; but what I "feel" is that; when 20-30 or may be more users will connect to database and start getting documents and their metadata and in the meanwhile; 20 users may be uploading metadata and documents; the system may not be well performing.

    Here my challenge is that; if storing small documents with very large number in SQL Server is not a really scalable solution then i must highlight it to my management; because they are insisting for this solution based on the "word from Microsoft" that saving documents in sql server (varbinary(max)) is efficient for small sized documents;

    By the way; 2 TB is the initial volume of the project; it will for sure increase to may be 3 TB or so within a year or two; the rate of transactions and reads is also somewhat high;

    Anyways, what I can perceive from your response is that; this should be no problem and the solution must be working fine even with large volume of documents. Based on this assumption; I will not further argue with my management on performance only.

    Anyways thanks for all your responses.

  • Wednesday, August 01, 2012 10:20 PM
     
     

    What you could is of course to build a proof-of-concept. That is, a simple app where you have some 100 simulated users that upload and retrieve documents without the bells and whistles of the real application.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se