Hi All,
I am currently reading up on Windows Azure and a question came up: could I use Windows Azure to help me index a large amount of data (more than 1M files: docs, email, PDF) in a short period of time (under 12 hours)?
Let me provide some more information about my needs. I want to index files per matter and be able to divide those documents into several objects inside the matter. For example:
Matter:12345, Object:A, Folder:A, <all files that belong to 12345:A:A>
Matter:12345, Object:A, Folder:B, <all files that belong to 12345:A:B>
Matter:12345, Object:B, Folder:A, <all files that belong to 12345:B:A>
From what I have read so far, I could create a storage account called Matter and a container called 12345. So far so good, but how can I create "Objects" and "Folders" to place my blob files in?
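To make the question concrete, here is a minimal Python sketch of the layout I have in mind. It assumes (please correct me if this is wrong) that blob names may contain "/", so that Object and Folder can be encoded as a prefix of the blob name inside the matter's container. FileLocation and list_folder are hypothetical names of my own, and nothing here calls an actual Azure API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileLocation:
    """Hypothetical mapping of Matter/Object/Folder onto container + blob name."""
    matter: str
    obj: str
    folder: str
    filename: str

    @property
    def container(self) -> str:
        # Assumption: one container per matter, e.g. "12345".
        return self.matter

    @property
    def blob_name(self) -> str:
        # Assumption: Object and Folder become a "/"-separated name prefix.
        return f"{self.obj}/{self.folder}/{self.filename}"


def list_folder(blob_names: Iterable[str], obj: str, folder: str) -> List[str]:
    """Return the blob names that fall under one Object/Folder prefix."""
    prefix = f"{obj}/{folder}/"
    return [name for name in blob_names if name.startswith(prefix)]
```

With this layout, the file contract.pdf in 12345:A:B would live in container "12345" under the blob name "A/B/contract.pdf", and listing a folder becomes a prefix filter.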
I also want to save some metadata about the files, such as original path, hash, date & time, type, etc. Would it be better to save this information in SQL Azure or in Table Storage?
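As a rough illustration of the metadata I mean, here is a small storage-agnostic Python sketch. The field names, the FileRecord class and the choice of SHA-256 are my own assumptions; the same record could end up as a row in SQL Azure or as an entity in Table Storage.

```python
import hashlib
import os
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FileRecord:
    """Hypothetical metadata record for one uploaded file."""
    matter: str
    obj: str
    folder: str
    original_path: str
    sha256: str
    size_bytes: int
    file_type: str
    recorded_at: str


def build_record(matter: str, obj: str, folder: str,
                 original_path: str, data: bytes) -> FileRecord:
    """Compute the hash and collect the metadata for one file's content."""
    # The file type is taken from the extension only (an assumption;
    # real type detection would need to inspect the content).
    extension = os.path.splitext(original_path)[1].lower().lstrip(".")
    return FileRecord(
        matter=matter,
        obj=obj,
        folder=folder,
        original_path=original_path,
        sha256=hashlib.sha256(data).hexdigest(),
        size_bytes=len(data),
        file_type=extension,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )
```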
My third question is about letting my users search inside the files. I read about Lucene.Net and AzureDirectory, which could be of use to me. My idea is to first upload the files to my Azure Storage and build a worker role that calculates the hash and updates SQL Azure or Table Storage with the correct information about each file. A second worker role would then extract the text from those files and place it in a queue. An index worker would read from that queue and add the text to the right index, chosen by matter, object and folder name, together with a reference to my record in SQL Azure or Table Storage.
So when a user searches for documents by keyword, the search runs against the Lucene index, which holds a reference to the record in either SQL Azure or Table Storage, which in turn holds a reference to the actual document in blob storage. When the user needs to view or download the document, it can be retrieved from storage.
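To show the flow I am picturing, here is a minimal in-memory Python sketch. Everything in it is a stand-in of my own: a dict replaces SQL Azure or Table Storage, queue.Queue replaces the storage queues, plain UTF-8 decoding replaces real text extraction from docs, email and PDF, and a simple term-to-record map replaces the Lucene.Net index. It only illustrates the order of the steps, not how the Azure services or Lucene.Net actually work.

```python
import hashlib
import queue
from collections import defaultdict

records = {}                    # stand-in for SQL Azure / Table Storage
extract_queue = queue.Queue()   # files waiting for text extraction
index_queue = queue.Queue()     # extracted text waiting to be indexed
# One index per (matter, object, folder): term -> set of record ids.
indexes = defaultdict(lambda: defaultdict(set))


def hash_worker(blobs):
    """Step 1: hash each uploaded file and store its metadata record."""
    for (matter, obj, folder, name), data in blobs.items():
        record_id = f"{matter}/{obj}/{folder}/{name}"
        records[record_id] = {
            "index_key": (matter, obj, folder),
            "blob": record_id,  # reference back to the stored document
            "sha256": hashlib.sha256(data).hexdigest(),
        }
        extract_queue.put((record_id, data))


def extract_worker():
    """Step 2: extract text (here: just decode bytes) and queue it."""
    while not extract_queue.empty():
        record_id, data = extract_queue.get()
        text = data.decode("utf-8", errors="ignore")
        index_queue.put((record_id, text))


def index_worker():
    """Step 3: add each term to the index for the file's matter/object/folder."""
    while not index_queue.empty():
        record_id, text = index_queue.get()
        index = indexes[records[record_id]["index_key"]]
        for term in text.lower().split():
            index[term].add(record_id)


def search(matter, obj, folder, keyword):
    """Look up a keyword and return the matching metadata records."""
    hits = indexes[(matter, obj, folder)].get(keyword.lower(), set())
    return [records[record_id] for record_id in sorted(hits)]
```

A search hit returns the metadata record, and the "blob" field of that record is what I would use to fetch the original document from storage.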
Hopefully someone can help me with my questions and point me in the right direction. I think it is possible, but I would like to get some confirmation before I start porting my code to be Azure ready.
Thanks in advance!