locked
How To write a query for Web ClickStream to get time spent on the feature RRS feed

  • Question

  • I am trying to figure out the query to write for Stream analytics job in case of a web clickstream.

    I am getting a stream of input in Json format for each event.(one for start and one for end).

    i want to find out which feature is used by which user and for how long.

    i am getting a json file like 
    {
     "user":"email address will be given"
     "feature":"RightMenu"
     "event":"Start/End" (can be start or end)
     "time":"Start(activation time) or End(DeactivationTime)"
    }

    so i want to subtract the timestamp of Stop event to Start event then store it in Table storage.

    So basically i want to store data like 
    user - feature - time spent 

    Can anyone please let me know how to achieve this or if some thing like this is possible using ASA.

    Tuesday, January 19, 2016 6:50 PM

Answers

  • The approach proposed by Niels is a valid one. The correct query will look like:

    WITH 
    StartEvent AS
    (
    SELECT *
    FROM Input TIMESTAMP BY Time
    WHERE event = 'start'
    ),
    EndEvent AS
    (
    SELECT *
    FROM Input TIMESTAMP BY Time
    WHERE event = 'end'
    )
    
    SELECT endEvent.[user], 
           endEvent.feature, 
           DATEDIFF(second, startEvent.Time, endEvent.Time) as duration
    FROM endEvent 
    JOIN startEvent
      ON endEvent.[user] = startEvent.[user]
      AND endEvent.feature = startEvent.feature
      AND DATEDIFF(hour, startEvent, endEvent) BETWEEN 0 and 1

    Additionally, Stream Analytics now includes LAST function which makes retrieving previous event(s) simpler. Here is an alternative much shorter solution to the same problem:

    SELECT 
        [user],
         feature,
         DATEDIFF(second, LAST(Time) OVER (PARTITION BY [user], feature LIMIT DURATION(hour, 1) WHEN Event = 'start'), Time) as duration 
    FROM input TIMESTAMP BY Time
    WHERE Event = 'end'
    Please note that in both queries there is assumption that max possible distance between start and end event is 1 hour. You can adjust this constant if needed.


    Friday, January 22, 2016 6:37 AM

All replies

  • Hi!

    So I am not anywhere near ASA at the moment, and I don't know if this would work - but what about "splitting" the input into "Start" events and "End" events by seleting into from the original stream and then join the start and end:

    SELECT *
    INTO startEvent
    FROM Input
    WHERE feature = "Start"
    
    SELECT *
    INTO endEvent
    FROM Input
    WHERE feature = "End"
    
    SELECT endEvent.user, 
           endEvent.feature, 
           endEvent.time - startEvent.time AS time_spent
    INTO tableOutput
    FROM endEvent 
    JOIN startEvent
      ON endEvent.user = startEvent.User
      AND endEvent.Feature = startEvent.Feature

    The table storage would use the tableOutput above as source.

    As I said, I am not sure this will work, but you can try. Let us know how it goes.

    Niels



    http://www.nielsberglund.com | @nielsberglund

    Wednesday, January 20, 2016 4:25 AM
  • The approach proposed by Niels is a valid one. The correct query will look like:

    WITH 
    StartEvent AS
    (
    SELECT *
    FROM Input TIMESTAMP BY Time
    WHERE event = 'start'
    ),
    EndEvent AS
    (
    SELECT *
    FROM Input TIMESTAMP BY Time
    WHERE event = 'end'
    )
    
    SELECT endEvent.[user], 
           endEvent.feature, 
           DATEDIFF(second, startEvent.Time, endEvent.Time) as duration
    FROM endEvent 
    JOIN startEvent
      ON endEvent.[user] = startEvent.[user]
      AND endEvent.feature = startEvent.feature
      AND DATEDIFF(hour, startEvent, endEvent) BETWEEN 0 and 1

    Additionally, Stream Analytics now includes LAST function which makes retrieving previous event(s) simpler. Here is an alternative much shorter solution to the same problem:

    SELECT 
        [user],
         feature,
         DATEDIFF(second, LAST(Time) OVER (PARTITION BY [user], feature LIMIT DURATION(hour, 1) WHEN Event = 'start'), Time) as duration 
    FROM input TIMESTAMP BY Time
    WHERE Event = 'end'
    Please note that in both queries there is assumption that max possible distance between start and end event is 1 hour. You can adjust this constant if needed.


    Friday, January 22, 2016 6:37 AM
  • Cool solution! Thanks Konstantin!

    Niels


    http://www.nielsberglund.com | @nielsberglund

    Friday, January 22, 2016 7:14 AM
  • Hi Konstantin,

    Thanks.

    I was able to figure out the query based on your reply. That Last Function really helps in this context.

    Although i have one more doubt.

    In current scenario i also need to store data into BOLB storage.

    So i wrote another query below the main one as : -

    select * into output1 from input.

    Now since we have applied the timestamp by above we need to apply it here also but the problem is when i apply that i don't get any output record in Blob and table storage, but if i remove the statement then the execution is fine.

    I understand if you mention the Timestamp by then that datetime will be used instead of system,DATEtime which goes in by Default.

    But i am wondering why i am not able to generate any BLOB by simply mentioning TimeStamp By stmt.

    I am passing the current date for time and i also set the late arrival tolerance to 10 days and also when i check the BLOB , there's hardly 10 to 15 seconds difference between EventProcessedUtcTime and my time.

    Can you think of any possible reason of for this?

    Thanks in Advance.

    Saturday, January 23, 2016 7:31 PM
  • If you see output from your main query, you should be normally seeing output in the blob as well. Please check Operations Log for any error messages.

    If you cannot find any, please email your subscription id and job name to azstream at Microsoft.com

    Sunday, January 24, 2016 2:04 AM