locked
Cosmos DB - Group By RRS feed

  • Question

  • Hi there,

    We have a requirement, where each of our equipment are sending data and we have to get the first data point of each day and we have to compare it with some other data. So when we are calling the SDK for data I want group the all the equipment and only get the first data which I think will solve my problem, but I couldn't find any way to do it.

    If anyone has some other solution please suggest me as well. For more info, Basically we are running Change DB feature of cosmos to calculate the consumption for each day by comparing the first data point of that day.

    Thanks 

    Tuesday, May 19, 2020 1:12 PM

All replies

  • Hi Eshayat,

    Please take a look at the GROUP BY clause in Azure Cosmos DB documentation as a start. Additionally, there is specific documentation for working with the Cosmos DB change feed processor: Change feed in Azure Cosmos DB, which I assume is what you are looking for. 

    If you can provide a more concrete example, it will help to better understand the issue you are attempting to solve such as where are the two data source?

    ~Mike


    Wednesday, May 20, 2020 5:19 PM
  • Hi Eshayat,

    Did you find the information you are searching for, or do you have additional questions? Please let me know if the information provided helps you with your project. 

    Regards,

    Mike

    Tuesday, May 26, 2020 2:20 AM
  • Hi Mike, Thank you for you reply. Sorry I couldn't find anything.

    I'm using this SQL query in cosmos. In here this gives me all data for any equipment for the specified date.

    string sql = @$"
                        SELECT
                            f.equipmentId,
                            e.currentEngineSeconds,
                            f.hireContract.hireNo,
                            f.reportingDateTime
                        FROM c AS f
                        JOIN e IN f.engineInfo
                        WHERE f.equipmentId IN ##
                        AND f.reportingDateTime > ##

    What i want to do is i want to get the day first value of each equipment.

    For example,

    The query gives

    Equipemt - Time

    Equipment1 : 2020-05-26-00-05

    Equipment1 : 2020-05-26-01-03

    Equipment1 : 2020-05-26-02-00

    Equipment1 : 2020-05-26-04-05

    Equipment1 : 2020-05-26-06-10

    Equipment2 : 2020-05-26-1-05

    Equipment2: 2020-05-26-03-03

    Equipment2: 2020-05-26-07-00

    Equipment2: 2020-05-26-09-05

    Equipment2: 2020-05-26-10-10

    I want to get only two value.

    Equipment1 : 2020-05-26-00-05

    Equipment2 : 2020-05-26-1-05

    Hope i was able to make it clear. Thank you so much

    Tuesday, May 26, 2020 6:15 AM
  • I think you could leverage the DISTINCT keyword filter to return the first instance of Equipment1 and Equipment2 in your case. 

    Do you have an example JSON document you can share so that I could reproduce your container and test a query, if you are still not able to figure this out.

    Regards,

    Mike

    Wednesday, May 27, 2020 5:59 PM
  • I ran this query 

    SELECT
        c.equipmentId,
        c.engineInfo.currentEngineSeconds,
        c.providerDatetime
    FROM c
    WHERE c.equipmentId IN ('H106180', 'H106128', 'G75-54')


    to get this value 

    [
        {
            "equipmentId": "H106128",
            "currentEngineSeconds": 44868960,
            "providerDatetime": "2020-05-26T04:43:01"
        },
        {
            "equipmentId": "H106180",
            "currentEngineSeconds": 32282640,
            "providerDatetime": "2020-05-26T04:43:01"
        },
        {
            "equipmentId": "G75-54",
            "currentEngineSeconds": 37096560,
            "providerDatetime": "2020-05-26T04:45:18"
        },
        {
            "equipmentId": "H106128",
            "currentEngineSeconds": 44895240,
            "providerDatetime": "2020-05-26T11:43:01"
        },
        {
            "equipmentId": "G75-54",
            "currentEngineSeconds": 33851160,
            "providerDatetime": "2020-05-26T11:30:18"
        },
        {
            "equipmentId": "H106180",
            "currentEngineSeconds": 32307840,
            "providerDatetime": "2020-05-26T11:43:01"
        },
        {
            "equipmentId": "H106180",
            "currentEngineSeconds": 32311440,
            "providerDatetime": "2020-05-26T12:43:01"
        },
        {
            "equipmentId": "H106128",
            "currentEngineSeconds": 44898840,
            "providerDatetime": "2020-05-26T12:43:01"
        },
        {
            "equipmentId": "G75-54",
            "currentEngineSeconds": 33855480,
            "providerDatetime": "2020-05-26T12:45:18"
        },
        {
            "equipmentId": "H106180",
            "currentEngineSeconds": 32313960,
            "providerDatetime": "2020-05-26T13:28:01"
        },
        {
            "equipmentId": "H106128",
            "currentEngineSeconds": 44900640,
            "providerDatetime": "2020-05-26T13:28:00"
        },
        {
            "equipmentId": "G75-54",
            "currentEngineSeconds": 33858360,
            "providerDatetime": "2020-05-26T13:30:18"
        }
    ]


    But I want to get, Basically the first value of the day. check providerDateTime

    [
        {
            "equipmentId": "H106128",
            "currentEngineSeconds": 44868960,
            "providerDatetime": "2020-05-26T04:43:01"
        },
        {
            "equipmentId": "H106180",
            "currentEngineSeconds": 32282640,
            "providerDatetime": "2020-05-26T04:43:01"
        },
        {
            "equipmentId": "G75-54",
            "currentEngineSeconds": 37096560,
            "providerDatetime": "2020-05-26T04:45:18"
        }
    ]

    I could've take the TOP(3) value but i don't how many times each equipment will report. That's why i wanted to do groupBy and take the first value of each group.

    Thank you so much for your help.

    Thursday, May 28, 2020 1:05 AM
  • So, this has become an interesting scenario. Are you using the SQL API? I will spend some time on this tomorrow after confirming which API you are working with but, if you are using the SQL API you could do something along the lines of:

    SELECT
        c.equipmentId,
        c.engineInfo.currentEngineSeconds,
        c.providerDatetime
    FROM c
    WHERE c.equipmentId IN ('H106180', 'H106128', 'G75-54'),
    ORDER BY c.equipmentId, c.providerDatetime DESC

    But you are looking for return only the first instance of c.providerDatetime so you could try:

    SELECT TOP 1
        c.equipmentId,
        c.engineInfo.currentEngineSeconds,
        c.providerDatetime,
    FROM c
    WHERE c.equipmentId IN ('H106180', 'H106128', 'G75-54'),
    ORDER BY c.equipmentId, c.providerDatetime DESC

    The ORDER BY clause in this case will require a composite index on equipmentId and providerDateTime to return any data. What I suggested here is not tested. I think you need a subquery here to return all providerDatetime values for a given equipmentId with an ORDER BY and the outer query will take the first instance with the SELECT TOP 1. The query above without the subquery will order your result set by equipmentId and then by providerDatetime in descender order. 

    Please confirm whether you are using the SQL API and I can investigate this further.

    Regards,

    Mike


    Thursday, May 28, 2020 4:05 AM
  • Hi Mike,

    Yes, I am using SQL api for that particular query I can use Linq as well if it's easier. But i'm not sure. We are a bit new in Cosmos.

    I'm getting this because of composite key.
    The order by query does not have a corresponding composite index that it can be served from.

    Thank you so much for your support. If I can get through this it will speed up my process.

    Currently I'm getting all values and doing group by in memory using Linq

    My partition key is equipmentId

    Thanks



    • Edited by Eshayat Thursday, May 28, 2020 8:23 AM
    Thursday, May 28, 2020 5:20 AM
  • Hi Eshayat,

    I received some suggestions and will need to try these out on my example dataset but wanted to pass these along for you to try out and modify as needed.

    This would be feasible only for small number of pairs (hundreds or few thousands):

    SELECT
        c.equipmentId,
        c.currentEngineSeconds,
        c.providerDatetime
    FROM c 
     WHERE (c.providerDateTime = <minTime1> AND c.equipmentId = <equipmentId1>) OR
        (c.providerDateTime = <minTime2> AND c.equipmentId = <equipmentId2>) OR …
    
    

    SELECT
        MIN(c.providerDatetime) AS minTime
    FROM c
    GROUP BY c.equipmentId
    
    
    Then loop through all equipmentIds and run a simple query like this:
    
    SELECT
        c.equipmentId,
        c.currentEngineSeconds,
        c.providerDatetime
    FROM c 
     WHERE c.providerDateTime = <minTime> AND c.equipmentId = <equipmentId>
    

    You can also try creating a composite index on (providerDateTime, equipmentId) to reduce the 2<sup>nd</sup> query’s RU charge. The second query should be pretty cheap (~3 RUs if in-partition).

    Regards,

    Mike

    Tuesday, June 2, 2020 8:21 PM
  • Hi Mike,

    Thanks for solution. The second one is a good one. I have checked and it works.  But As you said, It can work with small data, I have over 50-70 GB+ in the beginning of going production. 

    SELECT c.equipmentId, MIN(c.providerDatetime) AS minTime

    FROM c

    WHERE c.providerDatetime > '2020-05-26T00:00:00.7878153Z'

    GROUP BY c.equipmentId

    this query takes about 370 RU on my 5808 document but potentially it can take quite large RU with production data.

    Let me know, if there any other way.

    Thanks so much.

    Wednesday, June 3, 2020 12:46 AM