Blob storage as input for ASA

  • Question

  • I would like to build a proof of concept for a smart machine application which collects data from 52 sensors and stores it daily in a CSV file on Azure Blob Storage.

    For some analytics, I need to read the data from blob storage. Each daily file has 52 columns (column 1 being the timestamp according to ISO 8601) and several thousand rows.

    My question is: how can I write a query in ASA to check column 14 for the number of outliers (e.g., values in column 14 that are not within 20 +/- 3)?

    Walter



    • Edited by walterp Friday, November 14, 2014 3:50 PM
    Friday, November 14, 2014 9:04 AM

All replies

  • Let's assume that you configured your input to point to your blobs and called it "sensors", and that the columns in your CSV files are called "ts", "Sensor1", "Sensor2", ..., "Sensor52".

    Then, you can write:

    SELECT count(*) as numOutliers, System.Timestamp as endOfInterval
    FROM sensors TIMESTAMP BY ts
    WHERE Sensor14 > 23 or Sensor14 < 17
    GROUP BY TumblingWindow(hour, 2)

    This will give you the number of outliers for each 2-hour interval.
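
    If you instead wanted a single count per day (which may be convenient since each of your files covers one day), the same pattern with a daily window would look like this; just a sketch, assuming the day unit is available for TumblingWindow in your version:

    -- One outlier count per calendar day instead of per 2-hour window
    SELECT count(*) as numOutliers, System.Timestamp as endOfDay
    FROM sensors TIMESTAMP BY ts
    WHERE Sensor14 > 23 or Sensor14 < 17
    GROUP BY TumblingWindow(day, 1)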

    Thanks,

    --Lev

    Friday, November 14, 2014 4:34 PM
  • Thank you for the information. However, I cannot get the query job running, and unfortunately the operation logs do not help much in finding the error. For your information, the operation logs look as follows:

    12:29:31 PM    Start streaming job 'test'                                started
    12:29:31 PM    Microsoft.StreamAnalytics/StreamingJobs/start/action      accepted
    12:29:37 PM    Start job 'test'                                          validated
    12:29:37 PM    Start job 'test'                                          submitting
    12:29:52 PM    Start job 'test'                                          submitted
    12:30:20 PM    Job Substatus Change                                      initializing
    12:30:20 PM    Job Substatus Change                                      idle
    12:30:44 PM    Start job 'test'                                          running
    12:30:67 PM    Job Substatus Change                                      processing
    12:30:47 PM    Process Events                                            failed
    12:30:50 PM    Start job 'test'                                          failed

    I changed the timestamp format in the .csv file from “12.05.2014 00:37” to “2014-12-05T00:37:00.0000000”, but it did not work.

    I also thought that the column headers might be case sensitive, so I changed them in the query accordingly, but that failed as well.

    
    

    I also have a further question about the “System.Timestamp as endOfInterval” in the SELECT statement: what is its purpose?

    Or could the problem be that my sample file has data from May 2014, although I started the query with a CUSTOM TIME of 2014-04-01?

    Thanks, Walter





    • Edited by walterp Saturday, November 15, 2014 12:10 PM
    Saturday, November 15, 2014 12:05 PM
  • Hi Walter!

    To your first question, can you check what the error messages in the "failed" log entries are?  To see them, you need to click the "Details" button (it's at the bottom of the screen) after clicking on the log entry. Perhaps we will learn more from there.

    To your second question: the SELECT clause of the query describes the shape of the result set being returned.  The query example groups your records into 2-hour chunks, and reports the count of outliers for each 2-hour interval.  The resulting event is generated with the timestamp corresponding to the end of the chunk's interval.  I assumed you would be interested in knowing which interval had which outlier count, so I added System.Timestamp as a column in the result.  Without it, you would just get a sequence of numbers (a table with one column).
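
    For comparison, the same aggregation without the timestamp column, using the same assumed input alias "sensors" and timestamp column "ts", would be just:

    -- Returns only the counts; you cannot tell which 2-hour window each count belongs to
    SELECT count(*) as numOutliers
    FROM sensors TIMESTAMP BY ts
    WHERE Sensor14 > 23 or Sensor14 < 17
    GROUP BY TumblingWindow(hour, 2)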

    Thanks,

    --Lev

    Saturday, November 15, 2014 5:37 PM
  • Hi Lev

    Below is the error message that I got.
    (Please note that I have changed the WHERE clause to "WHERE Sensor32 > -5.9 or Sensor32 < -6.1".)

    Error:
    Comparison is not allowed for operands of type 'nvarchar(max)' and 'float' in expression 'Sensor32 > - 5.9'.

    Regards,
    Walter


    • Edited by walterp Saturday, November 15, 2014 7:19 PM
    Saturday, November 15, 2014 6:22 PM
  • Ah, good --- this tells us what the problem is!  Since your input data is formatted as CSV, we don't know that the values are numeric.  So, you need to tell us.  The simplest way is to use SQL CAST, as in: CAST(Sensor14 as float)

    This would make your query:

    SELECT 
        count(*) as numOutliers, 
        System.Timestamp as endOfInterval
    FROM sensors TIMESTAMP BY ts
    WHERE CAST(Sensor32 as float) > -5.9 
        or CAST(Sensor32 as float) < -6.1
    GROUP BY TumblingWindow(hour, 2)
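
    If you prefer to write the cast only once, one option (a sketch, assuming your version of the query language supports WITH steps) is to do the conversion in a named step and filter on the result:

    -- Cast once in a step, then count outliers per 2-hour window
    WITH typed AS (
        SELECT ts, CAST(Sensor32 as float) as sensor32Value
        FROM sensors TIMESTAMP BY ts
    )
    SELECT count(*) as numOutliers, System.Timestamp as endOfInterval
    FROM typed
    WHERE sensor32Value > -5.9 or sensor32Value < -6.1
    GROUP BY TumblingWindow(hour, 2)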

    By the way, are you using the new "Test" feature in the portal?  It is so much easier to debug your query using it than by submitting jobs!  Just use a local CSV file --- it should be self-explanatory.

    Thanks!

    --Lev

    Saturday, November 15, 2014 6:28 PM
  • Hi Lev

    Thanks a lot. The query works now and counts the outliers for each 2-hour interval perfectly. However, it does this only for the first day of the csv file. Any idea why this happens?

    For your information, I have put 30 days in one csv file.

    Regards,
    Walter

    PS: I used the new *Test* feature, but I did not see any benefits in using it.



    • Edited by walterp Saturday, November 15, 2014 11:26 PM
    Saturday, November 15, 2014 11:19 PM
  • The benefit of the "Test" feature (IMO) is that you see the results right away and can fix the query quicker. 

    But I am glad the query is working!

    As for why it only works for the first day: did I understand you correctly that each day is stored as a separate blob?  If so, what are the names of your blobs, and what did you specify as the "Path Pattern" in the input configuration? 

    Saturday, November 15, 2014 11:43 PM
  • Hi Lev

    Subject 1:

    You are right that each day is stored as a separate blob. The input configuration is as follows:

    - Input Alias:            sensors

    - Storage Account Name:   memex

    - Blob Container:         input2

    Based on your question, I changed the names of the csv files (blobs) so that they start with letters instead of digits. Afterwards, I created a blob container named “input2” in one of my storage accounts (memex). Then I uploaded the csv files (Excel comma-separated files) to the blob storage using Cerebrata Azure Management Studio, giving them the following blob names: sample12, sample13, ..., sample22 (the digits indicating the day).

    Once done, I started the queries at 10:54 AM and got the expected output. At 11:19 AM, 11:32 AM, and 2:02 PM, I repeated the exact same queries with the same input data, again successfully. But since I have experienced some trouble with digits in the blob names, could you let me know whether there are naming rules for blobs in ASA?

    Furthermore, despite the fact that I only used 2 to 3 MB of data for testing, it took several minutes until the result was available in the output blob container. Could you also comment on this?


    Subject 2:

    I tried out the “Test” feature once more. I have two remarks about it:

    1. When I used the same csv file that I use in ASA, as described above, I got the following error message: “Failed to parse JSON data, S140512.csv: Error parsing boolean value. Path '', line 0, position 0.” What does it mean, and how do I get around it?
    2. Further, it took a long time, without any progress bar or log files to indicate progress. To get real benefit from the Test feature, I suggest implementing some log files or a progress bar for the “Test” process in the next release.

    Regards,

    Walter

    Sunday, November 16, 2014 4:06 PM
  • Walter,

    When using Blobs as input to ASA, the key consideration is --- what order should these blobs be processed in?  You have two options.  The most performant one is to encode the date and/or time of the blob in the name of the blob.  To do this:

    • Name your blobs, for instance: sensor2014-11-12, sensor2014-11-13, etc.
    • Check "Configure Advanced Settings" in your input provider configuration
    • On the next page, select "Blob Boundary" for blob serialization boundary --- as I understand your setup, you do not change blobs once they are uploaded.
    • Type a Path Pattern that matches what you created, e.g. sensor{date} for the above example
    • Hit Tab, and select the format you used for the date, e.g. YYYY-MM-DD for the above example

    Now, we will be processing all of your blobs in the right order, and you will hopefully see all of your output.
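
    Concretely, with hypothetical blob names matching the example above, the pattern would resolve like this:

    Path Pattern:  sensor{date}        Date format: YYYY-MM-DD
    sensor2014-11-12  -> processed first  (data for 2014-11-12)
    sensor2014-11-13  -> processed next   (data for 2014-11-13)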

    With respect to performance, ASA is, of course, optimized for near-real-time stream analytics.  So, for instance, you construct a job like the one above and start it; it launches and perhaps produces some initial output if some blobs are already there.  Then, as new data arrives and you upload new blobs, the job will automatically pick them up and process them immediately.

    The several minutes you observed were our start-up time --- once the job is up, it should process things very quickly.  We are working on improving our start-up time, but the good news is that it only happens to you once in real life.

    Finally, on Subject 2: the hint to your difficulty is in the error message: the test feature is expecting JSON files for testing.  If you only have CSVs and cannot construct JSON samples, you cannot use the test feature for now.  We should make that clearer, and indeed add a progress bar --- thanks for the suggestion.
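
    For example, a JSON sample built from a few CSV rows might look roughly like the following (just a sketch with hypothetical values; the exact shape the Test feature expects may differ):

    [
        { "ts": "2014-05-12T00:37:00.0000000", "Sensor32": -6.0 },
        { "ts": "2014-05-12T00:38:00.0000000", "Sensor32": -5.7 }
    ]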

    Sunday, November 16, 2014 6:40 PM