Working with images and Azure Data Lake


  • I am working on a project that involves images. I need to extract the RGB values of each image and store them in a database along with the image name and dimensions. Eventually the database will be exported to a CSV file for further analytics. I have read recently that you should avoid storing images in a database, or, if you have to, store them as blobs.

    I have some thoughts on this process which may not work or need a bit of assistance on:

    1. Images are stored as blobs in Azure. A Python script reads in each image, converts it (from .png, say) to an array of RGB values, and then sends that array to the Data Lake.

    2. Write a web application in Python that allows images to be uploaded and converts them to RGB values in the same script before sending them to the Data Lake.

    3. Use a process in U-SQL similar to that of

    However, I haven't found much material on this and am not sure it can do what I would like.

    Thank you for the advice

    Wednesday, February 15, 2017 3:29 AM

All replies

  • I would recommend using a feature extractor on your image files, similar to the extractor here:

    Note that this way you can process the data without worrying about the 4MB limit on an image.

    You can get the file name added by extracting it using a file set virtual column. For example:

    @x = 
       EXTRACT rgb_array SqlArray<SqlArray<int>>, partition string, filename string 
       FROM "/myimages/{partition}/{filename}"
       USING new ImageExtractors.RGBExtractor();

    where partition is a virtual column that splits your images into smaller subsets. You can filter on it in subsequent predicates so that each query touches only one subset, which works around the current scale limits of file sets.
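
    To illustrate how the path pattern above maps onto the partition and filename columns, here is a minimal Python sketch. It only mimics the idea for a pattern with one placeholder per path segment; it is not U-SQL's actual file set matching logic, and the helper name parse_file_set_path is hypothetical.

```python
import re

def parse_file_set_path(path, pattern="/myimages/{partition}/{filename}"):
    """Return the placeholder values found in `path`, or None if it does not match.

    Illustration only: each {name} is assumed to match exactly one path
    segment. Real U-SQL file set semantics may differ.
    """
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2:
            # Odd positions hold placeholder names: match one path segment.
            regex += f"(?P<{part}>[^/]+)"
        else:
            # Even positions hold literal text from the pattern.
            regex += re.escape(part)
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None

def select_partition(paths, partition):
    """Keep only the paths whose partition segment equals `partition`."""
    selected = []
    for path in paths:
        columns = parse_file_set_path(path)
        if columns is not None and columns["partition"] == partition:
            selected.append(columns)
    return selected
```

    Filtering on the partition value, as select_partition does here, corresponds to adding a predicate on partition in a later U-SQL statement.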

    If you don't want to write the code in C#, you could in theory have the extractor call out to Python, similar to our Python extension library/reducer.
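
    As a rough idea of what the Python side of such a callout might compute, the sketch below turns already-decoded pixel data into one CSV-ready record per image (name, dimensions, RGB values). It assumes the image has been decoded to raw 8-bit RGB bytes in row-major order; decoding the .png itself would need an imaging library such as Pillow. The function names and the column layout are assumptions for illustration, not part of the U-SQL Python extension.

```python
import csv
import io

def rgb_rows(raw, width, height):
    """Split raw RGB bytes (3 bytes per pixel, row-major) into rows of (r, g, b) tuples."""
    if len(raw) != width * height * 3:
        raise ValueError("raw length does not match width * height * 3")
    pixels = [tuple(raw[i:i + 3]) for i in range(0, len(raw), 3)]
    return [pixels[y * width:(y + 1) * width] for y in range(height)]

def image_record(filename, raw, width, height):
    """Build one record: filename, width, height, and pixels as 'r,g,b;r,g,b;...'."""
    rows = rgb_rows(raw, width, height)
    flat = ";".join(f"{r},{g},{b}" for row in rows for (r, g, b) in row)
    return [filename, width, height, flat]

def to_csv(records):
    """Serialize records to CSV text with a header row (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["filename", "width", "height", "rgb"])
    writer.writerows(records)
    return buffer.getvalue()

# With Pillow installed (an assumption, not required by the code above),
# the inputs could be obtained roughly like this:
#   from PIL import Image
#   img = Image.open(path).convert("RGB")
#   width, height = img.size
#   raw = img.tobytes()
```

    Storing the pixels as a single delimited string keeps one row per image; for large images a separate row per pixel, or a nested array as in the EXTRACT above, may be a better fit.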

    Michael Rys

    Wednesday, February 15, 2017 1:35 PM