I am currently reading up on Windows Azure, and a question popped up: could I use Windows Azure to help me index a large amount (>1M files) of data (docs, email, PDF) in a short period of time (<12 hours)?
Let me provide some more information about my needs. I want to be able to index files on a per-matter basis and have the ability to divide those documents into several objects inside the matter. For example:
Matter:12345, Object:A, Folder:A, <all files that belong to 12345:A:A>
Matter:12345, Object:A, Folder:B, <all files that belong to 12345:A:B>
Matter:12345, Object:B, Folder:A, <all files that belong to 12345:B:A>
What I have read so far is that I could create a storage account called Matter and a container called 12345. So far so good, but how can I create "Objects" and "Folders" to place my blob files in?
Also, I want to save some metadata about the files, like original path, hash, date & time, type, etc., but would it be better to save this information in SQL Azure or in a Storage Table?
My third question regards my users being able to search inside the files. I read about Lucene.Net and AzureDirectory, which could be of use to me. I was thinking to first upload the files to my Azure Storage, then have a worker role calculate the hash and update my SQL Azure database or Storage Table with the correct information about each file. Another worker role would then extract the text from those files and place the extracted text in a queue for the index worker, which indexes the text into the right index based upon the matter, object, and folder names, along with a reference to my record in SQL Azure or the Storage Table. So when a user searches for a document by keyword, the search runs against the Lucene index, which holds a reference to the record in either my SQL Azure table or Storage Table, which in turn holds a reference to the actual document in blob storage; when the user needs to view or download the document, it can be retrieved from storage.
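To make the hash step of that pipeline concrete, here is a minimal Python sketch (the actual worker role would be .NET, and the field names in the record are hypothetical): a worker reads the blob's bytes, computes a SHA-256 hash in chunks, and stores it with the file's metadata record.

```python
import hashlib

def file_hash(data: bytes, chunk_size: int = 1 << 20) -> str:
    """Hash step of the ingest pipeline: compute the document hash in
    chunks so large files do not need to be hashed in one pass."""
    h = hashlib.sha256()
    for i in range(0, len(data), chunk_size):
        h.update(data[i:i + chunk_size])
    return h.hexdigest()

# Hypothetical record the worker would persist (SQL Azure or a Storage Table)
# alongside a reference to the blob itself.
record = {
    "matter": "12345",
    "blob": "ObjectA/FolderA/contract.pdf",
    "hash": file_hash(b"example document bytes"),
}
print(record["hash"])
```

The index worker would later look this record up by key, so whichever store you pick mainly needs fast lookups by matter/object/folder.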
Hopefully someone can help me with my questions and point me in the right direction. I think it is possible, but I would like to get some confirmation before I start porting my code to be Azure-ready.
It sounds like blob storage is what you need. Although containers and blobs are similar to folders and files, containers do not have hierarchy as file folders do. However, you can take advantage of the fact that blob names may contain the slash character
- this lets you simulate hierarchy with a naming convention. For example, container 12345 could contain a blob named ObjectA/FolderA/filename. The Windows Azure Storage Client library / API has functions that understand this convention
and allow you to enumerate through your blobs collection intelligently. See the CloudBlobDirectory class for more info.
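To illustrate the convention (this is plain Python, not the Storage Client library; the blob names are made up), a flat list of slash-delimited blob names can be grouped client-side into blobs and "virtual directories", which is essentially what CloudBlobDirectory does for you:

```python
# Illustration only: Azure blob containers are flat, but slash-delimited
# names can be grouped into a simulated hierarchy.

def list_directory(blob_names, prefix, delimiter="/"):
    """Return the immediate children under `prefix`: plain blobs, plus
    virtual 'subdirectories' for names that continue past another delimiter."""
    blobs, directories = [], set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        if delimiter in remainder:
            directories.add(prefix + remainder.split(delimiter, 1)[0] + delimiter)
        else:
            blobs.append(name)
    return blobs, sorted(directories)

# Hypothetical blobs inside container "12345":
names = [
    "ObjectA/FolderA/contract.pdf",
    "ObjectA/FolderB/email.msg",
    "ObjectB/FolderA/memo.doc",
]
print(list_directory(names, ""))          # top level: only virtual directories
print(list_directory(names, "ObjectA/"))  # children of ObjectA
```

The Storage REST API exposes the same idea through the prefix and delimiter parameters of the List Blobs operation, so enumeration stays efficient even though the container itself has no real folders.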
For storing blob metadata, you can add your own metadata properties to blobs. You can use a tool like Cloud Storage Studio or Azure Storage Explorer to do that manually, or you can do it programmatically using the Storage Client library or the Storage REST API. Alternatively, you could store this metadata in table storage; a reason to consider that would be if you need to quickly look up values from an index rather than enumerating through blob containers.
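At the REST level, blob metadata travels as `x-ms-meta-*` request headers. A small Python sketch of building those headers (the helper name and sample values are made up for illustration; a real client library does this for you):

```python
def metadata_headers(metadata: dict) -> dict:
    """User-defined blob metadata is sent as x-ms-meta-<name> headers in the
    Storage REST API. Sketch only, not an HTTP client; note that metadata
    names must follow C# identifier rules."""
    return {f"x-ms-meta-{key}": str(value) for key, value in metadata.items()}

# Hypothetical metadata for one document:
headers = metadata_headers({
    "originalpath": r"\\fileserver\matters\12345\contract.pdf",
    "doctype": "pdf",
})
print(headers["x-ms-meta-doctype"])
```

Metadata like this is convenient for values read alongside the blob; values you need to query or sort on are better kept in table storage or SQL Azure, as noted above.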
I would suggest using Lucene.net for large-scale text indexing.
David Pallmann, GM Application Development, Neudesic, Windows Azure MVP
I didn't know that blob names could contain slash characters; this sounds like a good solution for simulating the hierarchy. Regarding your second answer: so I could add the hash etc. as metadata on the blob, but would it be wiser to separate out the user comments/notes about a file? You see, my users may add notes or tags to the documents. This information could change over time, so I think it is better to store this type of information in a SQL Azure table, wouldn't you agree? Do you know any sample projects that cover this type of use?