none
Best Way to Store Documents In SQL Server

    Question

  • Hi,
    I am wrighting a business application for a professional placement firm and they would like to be able to add a resume to the data for a person and be able to search that resume.  I am currently designing the database in SQL Server 2008 and was wondering the best way to store the resume.  Should I just put it into a VarChar(Max) column or could I store a MS Word or XPS document in SQL Server?   What are the best practrices for this and can you point me to documentation that would help with designing and implementing such a task?

    As you can imagine, they get resumes in many differnt formats and it would be nice to be able to just cut and paste them into a Rich Text Block.  I would like to databind that to business object that gets persited to SQL Server.  I would also like to be able to search the text so that when a certain skill is needed and it is not found in the normal data, the Resume data could be search to see if it exists.  Any suggestions?

     

    Friday, November 21, 2008 7:46 PM

Answers

All replies

  • Moving to the SQL Server Database Engine forum.Hopefully they should be able to give you the best solution.

     

    Friday, November 21, 2008 11:14 PM
  • If you are able to enable full text indexes feature on your server , then I would say this would be step one. This allows you to use the methods contains and freetext - -which will help a lot in your searches.

     

    As far as the design goes that depends on the requirements -- as you noted, most of the resume text will be inputted as text and not its binary object(Word, Corel, etc), so it really depends on whether one of the requirements is to store the document in its original formatted condition.

     

    If yes to storing both, I would have your schema support the text of the document, and then storing the document in its original format - -but not in the same table. Also how you store the document depends on other factors as well -- some people prefer to have sql store the reference to the document, while others input the document itself into the database.

     

    Either is fine, but there are pros and cons to both methods -- storing potentially large binary files, or tables that can get large in size due to many rows that contain binary files, can be hazardous to sql performance if you don't manage files and filegroups properly.

     

    Hope that helps.

     

    Saturday, November 22, 2008 4:02 PM
  • Since you said you are using SQL Server 2008, then please definitely check the new FILESTREAM type - it allows you to store the data outside SQL Server as files and still work with them from within SQL Server.

     

    Thanks,

    Saturday, November 22, 2008 5:33 PM
  • Todd,

    In my experiences (and according to several best practices), it is best not to store documents (there are exceptions such as text and XML documents) in the database. There are many good reasons for this. One is that the database can grow to a size which can be avoided.

    Therefore another solution (which worked well for me in 25 projects similar to what you are describing):

    Store the "file location of the document" in SQL.

    For example:

    Lets say you have to store resume data (as you describe), have the table structure look similar to:

    Resume (ID, ResumeName, FileLocation)

    The file location would be the physical path on the web server (or it can be another file store) to where the actual file is located.

    Now, I am sure there is a need to "search for keywords or phrases" in the actual text of the resume. Therefore, you can take the resume and extract text from it and store it in in the table using the "text data type.

    Another approach would be to extract the data and convert it into XML and store it as a XML data type. Then you will be able to do XML indexing on it.

    There are several approaches, as I described above.

    There is a fundamental principle in general computer science which states: Let the specific module do what it is suppose to do, and only what it suppose to do, but let it do it the best! Same concept goes to database, let the database do what it does best. Handle data Smile

    I hope the above helps and guides you.

    P.S. - I have wrote a similar placement code application a few months ago. I'd be happy to share with you the database code if you would like.


    Monday, November 24, 2008 1:37 AM
  • I believe that SQL Server does support text data such as a resume.  I think the FILESTREAM option allows you to store it outside the the normal data (on the disk) and I believe that SQL Server has a way to search the text.  It really is just data afterall.  I would think that an XPS document which is already XML based, would store nicely in SQL Server (It is also a new Microsoft standard document format).  I am wondering if the Rich Text formats supported by XAML are also capable of being stored in SQL Server.  It just seems like someone should have done this before.

     

    Tuesday, November 25, 2008 3:50 AM
  • As previously posted, since you are on SQL Server 2008, this is what the FileStream is for.

     

    Using FILESTREAM with Other SQL Server Features

     

    Tuesday, November 25, 2008 3:59 AM
    Moderator
  • Filestream is a great option for storing binary files in their original format.

     

    Todd -- SQL server cannot search the "text" of a filestream datatype -- filestream is simply a binary file. SQL server does not know whether it is a jpeg, mp3, xml, Word doc, etc, it is just a binary object. 

     

    If you want searchable text, then you will have to look at doing something similar to what I proposed earlier -- have one schema object to hold the text data of the resume and then use another object to hold either the file reference or, in the case of the filestream datatype, the object(document) itself.

    Tuesday, November 25, 2008 3:20 PM
  •  Sid Atkinson wrote:

    Filestream is a great option for storing binary files in their original format.

     

    Todd -- SQL server cannot search the "text" of a filestream datatype -- filestream is simply a binary file. SQL server does not know whether it is a jpeg, mp3, xml, Word doc, etc, it is just a binary object. 

     

    If you want searchable text, then you will have to look at doing something similar to what I proposed earlier -- have one schema object to hold the text data of the resume and then use another object to hold either the file reference or, in the case of the filestream datatype, the object(document) itself.

     

    Did you read the above BOL reference?  You can most certainly search the text of a filestream column with full text search.  You just need a column that holds the file extension in it so that the appropriate IFilter can be used to extract the formatted data from the files.

    Tuesday, November 25, 2008 3:25 PM
    Moderator
  • An interesting article about varchar(max) vs filestream choice:

    http://blogs.msdn.com/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx

    Do consider the performance implications of filestream and the size of your files!

    Another (older) MS whitepaper on BLOBs vs the filesystem:

    http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2006-45

    Hope that helps!
    Tuesday, November 25, 2008 3:41 PM
  • Talk about putting my foot in my mouth -- thanks Jonathan, this is what happens when you work too many hours to ensure you have the holidays off. Duly noted!

    Tuesday, November 25, 2008 3:55 PM
  • I am in the same situation with over 100,000 Word, PDF and HTML resumes going into SQL Server 2008.  As the average doc size is under 50k, FILESTREAM dosen't appear to be the best solution.  What did you end up doing?  Any input would be great appreciated.

    Sunday, November 30, 2008 10:09 PM
  • Arsen Pereymer said:

    Todd,

    In my experiences (and according to several best practices), it is best not to store documents (there are exceptions such as text and XML documents) in the database. There are many good reasons for this. One is that the database can grow to a size which can be avoided.

    Therefore another solution (which worked well for me in 25 projects similar to what you are describing):

    Store the "file location of the document" in SQL.

    For example:

    Lets say you have to store resume data (as you describe), have the table structure look similar to:

    Resume (ID, ResumeName, FileLocation)

    The file location would be the physical path on the web server (or it can be another file store) to where the actual file is located.

    Now, I am sure there is a need to "search for keywords or phrases" in the actual text of the resume. Therefore, you can take the resume and extract text from it and store it in in the table using the "text data type.

    Another approach would be to extract the data and convert it into XML and store it as a XML data type. Then you will be able to do XML indexing on it.

    There are several approaches, as I described above.

    There is a fundamental principle in general computer science which states: Let the specific module do what it is suppose to do, and only what it suppose to do, but let it do it the best! Same concept goes to database, let the database do what it does best. Handle data Smile

    I hope the above helps and guides you.

    P.S. - I have wrote a similar placement code application a few months ago. I'd be happy to share with you the database code if you would like.




    Could you please share the code?
    Friday, February 27, 2009 8:17 PM
  • Ho Todd. I have read your response and I am creating a similar application for a placement fir to store resumes. Can you help me out wiht the code for the database please?

    Kind regards,

    Bravo
    Monday, March 30, 2009 4:02 PM