How to store millions of files in SQL Server 2008?

    Question

  • There is some talk about being able to store files (or, more specifically, access FileStreams) within SQL Server 2008.  How is this handled internally, and in the case of multi-millions of files (which the current Windows file system has a hard time handling), how will this be managed?  Does it involve replication, specific filegroups, a SAN, CIFS shares, etc.?  Any help or information you can provide would be useful.

     

    With Oracle, you have access to RAC or to something like GFS.  Can you highlight what makes the SQL Server 2008 implementation different, able to scale better, etc.?

     

    Thanks,

    Andre' Hazelwood

    Tuesday, June 05, 2007 3:53 AM

Answers

  •  

     Hi, thank you for this question. Can you please be more specific about the scenarios you are referring to when you say "the Windows file system has a hard time handling multi-millions of files"? I need you to first clearly identify which issue(s) you have in mind (examples/perf numbers can help), since not all of them are related to the way we organize files. Any data on the specific issue you refer to in your reply would be very helpful for answering your question better. Typically, NTFS can scale up to a very large number of files per directory (greater than a million) as long as some conditions are met (file names are short, short-file-name generation is disabled, etc.).

    thanks,

    Wednesday, June 13, 2007 12:00 AM

All replies

  • What you're referring to is the new FILESTREAM data type. As I've understood it, the data is stored in files (or file streams) in the file system, but can be accessed from the database as if it were a blob. Actually, that's pretty neat, and something which MAY cause me to change my opinion regarding SharePoint, if or when it starts using this new feature. I'm really looking forward to working with this one. I'm afraid that's all I know (or think I know) so far.
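
    If it works the way I think it does, the table side of it is just a varbinary(max) column marked FILESTREAM plus a ROWGUIDCOL. A rough sketch, with every name made up, assuming the instance has FILESTREAM enabled and the database already has a filegroup declared with CONTAINS FILESTREAM:

        -- Sketch only: dbo.Documents is an invented example table.
        CREATE TABLE dbo.Documents
        (
            DocumentId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE
                       DEFAULT NEWID(),                -- FILESTREAM tables require a ROWGUIDCOL
            Title      NVARCHAR(260)  NOT NULL,
            Content    VARBINARY(MAX) FILESTREAM NULL  -- stored as a file, queried like a blob
        );

        -- Reads and writes then look like any other blob column:
        INSERT INTO dbo.Documents (Title, Content)
        VALUES (N'example.pdf', 0x48656C6C6F);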
    Tuesday, June 05, 2007 9:41 AM
  •  

    I see the original poster did not follow up on that question. I am interested in the topic, so I can answer from our point of view:

     

    a) I have seen NT have real problems with too many files and directories. I don't remember the exact numbers, but there were millions of files and directories (never more than 5 levels deep) with short names, totalling about 800 GB. Opening a folder could take 5-20 minutes. Access through FTP/HTTP was much better, of course. There were some pretty large files, but 90% of them were smaller than 1 MB (all zipped files). And lots of directories.

    b) We have scenarios where we could store up to 4-5 terabytes, but I expect that answer to become "up to 10-12" within 4-5 years at most.

    c) I started looking at the CTP but haven't looked at FILESTREAM yet (lack of time), so I may be asking questions that already have answers, but the first things I will look for are how to manage that data (partitioning, backup/restore, mirroring, etc.) with flexibility and safety (see the backup sketch after this list), how to access it (OLE DB is slow for BLOBs, and giving HTTP/FTP access directly in the db is a must), and the licensing impact (most scenarios segregate that kind of service to a distinct server). Are FILESTREAM files counted as part of the database size? If they are not, I can see lots of small DAMs starting to appear, using Express, and proliferating through the enterprise. I would use such small DAMs in the DMZ, for example, and keep the big thing inside the LAN. But if FILESTREAM data counts towards the 4 GB limit, we will stay with our current practice of storing files in the file system and keeping only a reference in the database.

     

    d) How can this work with a mirror? We are currently using DoubleTake for its ability to mirror both the db and the file system. Could we now rely just on db mirroring?

     

    e) Can we store large UDTs in there? I can see scenarios where I would like to store an Invoice as a large UDT. One of its properties could be to expose itself in PDF/XPS format, but keeping extra metadata as part of the Invoice is not all: being an object, the Invoice can have specific behaviors based on the context/user or on what is requested of it. Hmm. Is SQL2008 bridging the object-SQL gap by bringing objects in as rich Relational Domains? Sure smells like it :-) Wow, I am impressed and excited by the possibilities!
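
    For point c), I would expect backup of a FILESTREAM filegroup to look like the ordinary filegroup syntax. A sketch under that assumption, with made-up database and filegroup names:

        -- Full backup: FILESTREAM data should be picked up with the rest of the database.
        BACKUP DATABASE MyDocStore
        TO DISK = N'D:\Backups\MyDocStore_full.bak'
        WITH INIT;

        -- Filegroup-level backup, if the blob data needs its own schedule.
        BACKUP DATABASE MyDocStore
            FILEGROUP = N'DocumentFiles'
        TO DISK = N'D:\Backups\MyDocStore_docs.bak'
        WITH INIT;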

    Friday, June 22, 2007 11:57 AM
  • FILESTREAM gives SQL Server a way of linking a record to a file in the file system.

    a) so if Windows has a problem with the files themselves, FILESTREAM will not help

    b) same as a)

    c) FILESTREAM keeps a pointer in the database to the file (see the sketch after this list)

    d) you will need to mirror both. (If you do a backup, the FILESTREAM data is read and included in the backup.)

    e) this looks more like the kind of thing FILESTREAM logic is meant for.
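
    For c), the pointer is visible from T-SQL. A minimal sketch, assuming the invented dbo.Documents table from earlier in the thread:

        BEGIN TRANSACTION;

        -- PathName() returns the logical path SQL Server manages for the blob;
        -- the transaction context is what a client passes to the Win32 streaming
        -- API to read or write the file directly, outside of T-SQL.
        SELECT Content.PathName()                   AS FileStreamPath,
               GET_FILESTREAM_TRANSACTION_CONTEXT() AS TxContext
        FROM dbo.Documents
        WHERE Title = N'example.pdf';

        COMMIT TRANSACTION;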

    Friday, February 08, 2008 1:03 PM
  • Thanks AlunJ,

     

    We figured this out and switched to an OODBMS for our large enterprise stores. Now everything is in the same place: data, files and even the applications. Easier to scale too.

     

    Thanks,

     

    Ak

     

    Friday, February 08, 2008 1:13 PM
  •  

    If you back up one level, there is a post on the forum stating that the blob files are not counted towards the 4 GB total for Express.  This is pretty good news IMHO, because my current solution involves having two databases (one acting as a directory, plus multiple blob catalogs), which is obviously more difficult to manage and potentially error-prone (I had to jump through some hoops to ensure the directory gets locked when it fills up, so two users don't create two new empty catalogs at the same time; it also seems difficult to ensure referential integrity between the blobs in a catalog and their entries in the directory database, etc.).

     

    You are right about having thousands of files in a directory being slow, but is this only slow in Explorer, or is it slow in general?  It seems to me that if you do a command-prompt DIR, it appears almost instantly, but that Explorer slows things down by trying to read attributes and a whole load of other things besides.  Maybe I'm wrong here, though.  Anyway, I'm currently running a test: generating around 1 million FILESTREAM files into my DB (< 1 KB each), which I'm then going to run some queries against.  I should have some results by next week (I think the on-access virus scan we have here at work is slowing things down somewhat...).
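
    For reference, the generator is something along these lines; purely a sketch against an invented table like the one earlier in the thread, with a constant 1 KB payload instead of real file content, not my actual script:

        SET NOCOUNT ON;

        DECLARE @i INT = 0;
        DECLARE @payload VARBINARY(MAX) = CAST(REPLICATE('x', 1024) AS VARBINARY(MAX));

        -- Insert ~1 million small FILESTREAM rows for the performance test.
        WHILE @i < 1000000
        BEGIN
            INSERT INTO dbo.Documents (Title, Content)
            VALUES (N'test_' + CAST(@i AS NVARCHAR(10)) + N'.bin', @payload);

            SET @i += 1;
        END;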

     

    Friday, February 08, 2008 3:10 PM
  • Thanks

     

    Please keep us posted

     

    Friday, February 08, 2008 4:06 PM
  • RobinsonUK,

     

    I am aware of it, but we needed a solution now. We already have large installed sites, and SQL2008, which I would have preferred, is barely in the oven and far from baked. The release has been pushed, may be pushed again (as has happened often in the past), and it won't be really ready for production (at large, conservative sites) until many months after release.

     

    Note that if authenticated users could do file transactions directly through FTP(S)/HTTP(S) in the db, SQL2008 could become a killer! Maybe Exchange will use it before the next century. Integration, Integration, Integration...

     

    Ak

    Friday, February 08, 2008 4:18 PM
  • Note that if authenticated users could do file transactions directly through FTP(S)/HTTP(S) in the db, SQL2008 could become a killer! Maybe Exchange will use it before the next century. Integration, Integration, Integration...

     

     

    I'm not sure, but can't you execute an SP that does it over HTTP via SOAP?  2005 had that facility (but it wasn't enabled in Express).
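
    From memory, the 2005 syntax was roughly this (all names and paths are placeholders, and I haven't checked what carries over to 2008):

        -- Rough sketch of a SQL Server 2005-style native SOAP endpoint
        -- exposing a stored procedure over HTTP. Names are invented.
        CREATE ENDPOINT DocumentSoapEndpoint
            STATE = STARTED
        AS HTTP
        (
            PATH = '/documents',
            AUTHENTICATION = (INTEGRATED),
            PORTS = (CLEAR),
            SITE = 'localhost'
        )
        FOR SOAP
        (
            WEBMETHOD 'GetDocument' (NAME = 'MyDocStore.dbo.GetDocument'),
            WSDL = DEFAULT,
            DATABASE = 'MyDocStore',
            NAMESPACE = 'http://example.com/documents'
        );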

     

    With respect to the performance test, I'm looking more for good random-access performance, because my views are virtualized; binary items are loaded on demand, one at a time.  I'm almost never going to do a large SELECT that includes the binary field.  A view can list a directory of all the files by selecting the other fields in the table and then fetch the binaries into a cache, one at a time, as they come into view.
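
    In query terms the pattern is just two statements; a sketch against the invented table used earlier in the thread, the point being that the blob column is only touched for the row in view:

        -- 1) Populate the virtualized view: metadata only, no blob column touched.
        SELECT DocumentId, Title
        FROM dbo.Documents
        ORDER BY Title;

        -- 2) When an item scrolls into view, fetch just that one blob for the cache.
        DECLARE @id UNIQUEIDENTIFIER = '00000000-0000-0000-0000-000000000000';  -- placeholder key
        SELECT Content
        FROM dbo.Documents
        WHERE DocumentId = @id;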

     

     

    Friday, February 08, 2008 4:24 PM
  •  

    Yes, I can easily exclude SOAP. SOAP is not very good with files. Why not use a hammer when in front of a nail? What is being used today to move files everywhere? On all platforms? All OSes? That even your grandmother can use? HTTP/FTP clients.

     

    Ak

    Friday, February 08, 2008 11:10 PM
  • Well okay, my little test.  Unfortunately we had a power outage on Saturday that killed my desktop here at work over the weekend, so I only managed to add half a million or so FILESTREAM objects to my table.  Anyway, I've been playing around with them, random SELECTs and so forth, and I can report that I have no concerns about performance so far.  Obviously selecting 10,000 binary blobs is going to run slowly, but selecting a few gives performance similar to what I got when I was using varbinary(max) fields (I wouldn't ever do a large select on a blob field in any case).  I'm not even going to try browsing the blob directory on the file system with Explorer.  That would be an invitation to press the Reset button on my PC ;-)

     

    So far Katmai has answered the three main technical issues I had to deal with when I wrote our application: (1) lack of support for hierarchies, (2) file system transactions for external blob management and (3) table parameters to stored procedures. 
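
    For (3), for example, the table-valued parameter support looks roughly like this; a sketch with invented names, not code from my application:

        -- A table type that can be passed into a stored procedure (SQL Server 2008).
        CREATE TYPE dbo.DocumentKeyList AS TABLE
        (
            DocumentId UNIQUEIDENTIFIER NOT NULL PRIMARY KEY
        );
        GO

        CREATE PROCEDURE dbo.GetDocumentTitles
            @Keys dbo.DocumentKeyList READONLY   -- table-valued parameters must be READONLY
        AS
        BEGIN
            SELECT d.DocumentId, d.Title
            FROM dbo.Documents AS d
            JOIN @Keys AS k ON k.DocumentId = d.DocumentId;
        END;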

     

    I guess it's a big thumbs up to MS on this one from me.

     

     

    Monday, February 11, 2008 1:34 PM
  •  

    RobinsonUK,

     

    A more realistic simulation would emulate a production environment. I could do the same thing you did with Access or Btrieve and get decent performance. A production environment means a few hundred users, simultaneous reads AND writes, and maybe some reports running at the same time.

     

    Ak

    Monday, February 11, 2008 1:46 PM
  • Well, I can't simulate that over a weekend, and to be honest I would certainly think about some partitioning architecture if I were to do it (something like the sketch below). But if the initial complaint was about NTFS (above), especially when there are a very large number of files in a directory, I can safely say that it isn't a problem for SQL Server.
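
    By a partitioning architecture I mean something along the lines of an ordinary partition function and scheme, sketched below with made-up names and boundaries (I haven't tested how FILESTREAM data would fit into this):

        -- Sketch: range-partition by an ingestion date so each year lands in
        -- its own (pre-created) filegroup. Names and boundaries are illustrative.
        CREATE PARTITION FUNCTION pf_DocYear (DATETIME)
        AS RANGE RIGHT FOR VALUES ('20070101', '20080101');

        CREATE PARTITION SCHEME ps_DocYear
        AS PARTITION pf_DocYear
        TO (fg_Docs2006, fg_Docs2007, fg_Docs2008);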

    Monday, February 11, 2008 1:51 PM