locked
Storing documents in SQL database RRS feed

  • Question

  • Hi all,

    First of all, I'm not a DBA ;-)
    We have a need to store a large number of small documents in a repository. These docs (usually PDFs) will need to be indexed by a few parameters. Do you think that it is a good idea to store 500000 PDF files (150KB each) in an SQL server ? Will it be effective ? What about the performance ? I imagine that the other solution would be to store the docs on a filesystem, but I would need to create a directory hierarchy to limit the number of files per directory. SQL server would be easier.
    Can you give me feedbacks about this solution ?

    Regards,

    /Anonymous01
    • Moved by Tom PhillipsEditor Wednesday, March 10, 2010 3:39 PM Database Design Question (From:SQL Server Database Engine)
    Wednesday, March 10, 2010 10:02 AM

Answers

  • Hello,

    Storing documents and objects in your database may lead to database fragmentation, and will impact buffer cache and the size of your log files.

    Since SQL Server 2008 you have the option to use Filestream which is used to store binary objects on a NTFS file system, with the difference that this filestream storage is considered part of the database and it benefits of transactional consistency and integrated point-in-time backup restore.

    In terms of performance, since you will be storing small files, storing those documents on the database will give you better performance that using Filestream. Filestream shows better performance storing files greater than 1 MB.

    Another way to this is storing the binary objects on the file system and store the UNC path on the database, but it has some security concerns and you will need expend more resources maintaining those files like having to backup the database and those files separately.


    Hope this helps.

    Regards,

    Alberto Morillo
    SQLCoffee.com

    Wednesday, March 10, 2010 12:07 PM
    Answerer
  • So, here is the thing. There are problems storing large objects in the database, but (and here is the thing) if you need what it gives you, it is worth it (and filestream makes it even better, but still has its share of difficulties in usage.)

    My feeling on this is that if you need the transactional integrity of having the files in the database be certainly the ones you stored when you wrote the row in the "index" table, then it is worth it.  My prime example is a database of employee pictures.  If this is for the company directory, then it probably doesn't matter too much if the files are not right.  So if an goofy employee changes a picture of an employee to a chimpanzee, well it is just funny.

    But if this is for a security guard system that has people's pictures that are stored when the picture is taken and will give a person access to secure areas...then storing the data in a database gives you a warm-fuzzy that the security is much harder to crack, and since the index row is written with the indexing information AND the bits, you are far safer to go that way.

    So in your case, if the security and usability features (some searching, like full text that you can do over documents, though not sure about PDFs) that you get from storing it in the database are not there, then storing them not in the database (even filestream) is probably better.  Unless you use an image device, your pictures would need to be stored on the same server as the database, which can be limiting. If you just store a UNC path (or even some form of relative path) you could theoretically keep each file on a different server (you would be considered insane, but you could). I think that you would find that implementing some form file location where when you reach the max number of files in a server/directory you create a new directory or swap to a new server, would be pretty easy.  And you could have queries/triggers that manage all of that for you such that your middle tier objects just got back a path/filename and they new where to store the document.

    I hope this clears it up, but storing large objects gets better in each version of SQL Server, but until filestream can be stored externally on a device that you can create from commodity hardware, it seems like too difficult to work with unless you need something it gives you.
    Louis

    Thursday, March 11, 2010 4:04 AM
  • I think we should do the analysis on specific areas for finding the exact requirement and then choose the solution.

    1. How frequently your documents get modified.
    2. How secure you want them to be for read
    3. How frequently you use them for your application.
    4. Are these read-only documents
    5. Are these docs require one-time uploading



    Filestream will help for securing, maintaining heirachy and transaction integrity but if you do not need one of these, then simple solution would be to take path and store the path in database. Here you are not only avoiding performance issues but also storage in terms of backup.

    Monday, March 15, 2010 7:29 AM

All replies

  • Hi

    File stream was introduced in sql2008.You want to use filestream. It puts a sort of lock on the file to ensure database integrity. As long as the record exists in the SQL Server, the file can't be removed from the associated file system.

    In case of old versions of SQL Server
    Other ways of storing documents is to use BLOBs:
    Try the following article to achieve it.
    http://support.microsoft.com/kb/309158

    With a small amount of binary data, it's not efficient to use a file stream. This is because it needs extra overhead like file creation and handling. These operations are not needed when predefined database files are used. However, with larger files, file streams are quite efficient.

    -Sreekar
    Wednesday, March 10, 2010 11:37 AM
  • Hello,

    Storing documents and objects in your database may lead to database fragmentation, and will impact buffer cache and the size of your log files.

    Since SQL Server 2008 you have the option to use Filestream which is used to store binary objects on a NTFS file system, with the difference that this filestream storage is considered part of the database and it benefits of transactional consistency and integrated point-in-time backup restore.

    In terms of performance, since you will be storing small files, storing those documents on the database will give you better performance that using Filestream. Filestream shows better performance storing files greater than 1 MB.

    Another way to this is storing the binary objects on the file system and store the UNC path on the database, but it has some security concerns and you will need expend more resources maintaining those files like having to backup the database and those files separately.


    Hope this helps.

    Regards,

    Alberto Morillo
    SQLCoffee.com

    Wednesday, March 10, 2010 12:07 PM
    Answerer
  • Storing Large amout of Docuemnt in the database will lead to performance bottle neck.

    Better you store the docuemnt to a folder over the server and save the path of the docuemnt in the database table.

    In the applilcation code u need to take the path from the sql server database table and open the docuemnt from the file path (folder location)


    In SQL Server 2008 u can file stream to store LOB.
    Mohd Sufian www.sqlship.wordpress.com Please mark the post as Answered if it helped.
    Wednesday, March 10, 2010 12:19 PM
  • So, here is the thing. There are problems storing large objects in the database, but (and here is the thing) if you need what it gives you, it is worth it (and filestream makes it even better, but still has its share of difficulties in usage.)

    My feeling on this is that if you need the transactional integrity of having the files in the database be certainly the ones you stored when you wrote the row in the "index" table, then it is worth it.  My prime example is a database of employee pictures.  If this is for the company directory, then it probably doesn't matter too much if the files are not right.  So if an goofy employee changes a picture of an employee to a chimpanzee, well it is just funny.

    But if this is for a security guard system that has people's pictures that are stored when the picture is taken and will give a person access to secure areas...then storing the data in a database gives you a warm-fuzzy that the security is much harder to crack, and since the index row is written with the indexing information AND the bits, you are far safer to go that way.

    So in your case, if the security and usability features (some searching, like full text that you can do over documents, though not sure about PDFs) that you get from storing it in the database are not there, then storing them not in the database (even filestream) is probably better.  Unless you use an image device, your pictures would need to be stored on the same server as the database, which can be limiting. If you just store a UNC path (or even some form of relative path) you could theoretically keep each file on a different server (you would be considered insane, but you could). I think that you would find that implementing some form file location where when you reach the max number of files in a server/directory you create a new directory or swap to a new server, would be pretty easy.  And you could have queries/triggers that manage all of that for you such that your middle tier objects just got back a path/filename and they new where to store the document.

    I hope this clears it up, but storing large objects gets better in each version of SQL Server, but until filestream can be stored externally on a device that you can create from commodity hardware, it seems like too difficult to work with unless you need something it gives you.
    Louis

    Thursday, March 11, 2010 4:04 AM
  • i fyou want to create a file hierarcy you can use sql server 2008 filestream type and hierarcy type together but if you want to
    search inside files and index them indexing service could be better soluition for this.
    Friday, March 12, 2010 7:05 AM
  • I think we should do the analysis on specific areas for finding the exact requirement and then choose the solution.

    1. How frequently your documents get modified.
    2. How secure you want them to be for read
    3. How frequently you use them for your application.
    4. Are these read-only documents
    5. Are these docs require one-time uploading



    Filestream will help for securing, maintaining heirachy and transaction integrity but if you do not need one of these, then simple solution would be to take path and store the path in database. Here you are not only avoiding performance issues but also storage in terms of backup.

    Monday, March 15, 2010 7:29 AM