Parquet file written by Azure Time Series Insights Preview is not readable

  • Question

  • We have an Azure Time Series Insights Preview instance connected to an event hub. The incoming events are written to the associated cold storage account as Parquet files. When I try to open the Parquet files with various readers (such as the parquet-[head|cat|etc] command-line tools), I get errors.

    Output of parquet-head

    org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:20200123140854700_c8876d10_01.parquet

    Here is a sample of the issue in more detail; this is the output of parquet-dump:

    $ parquet-dump 20200123140854700_c8876d10_01.parquet
    row group 0
    --------------------------------------------------------------------------------
    timestamp:                            INT64 SNAPPY DO:0 FPO:4 SZ:100/850/8.50 VC:100 ENC:PLAIN,RLE ST:[min: 2020-01-23T14:08:52.583+0000, max: 2020-01-23T14:08:52.583+0000, num_nulls: 0]
    id_string:                            BINARY SNAPPY DO:167 FPO:194 SZ:80/76/0.95 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: dabas96, max: dabas96, num_nulls: 0]
    dabasuploader_time_string:            BINARY SNAPPY DO:313 FPO:855 SZ:705/2177/3.09 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[num_nulls: 0, min/max not defined]
    dabasuploader_prod_kwh_string:        BINARY SNAPPY DO:1118 FPO:1139 SZ:62/58/0.94 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: 0, max: 0, num_nulls: 0]
    dabasuploader_pred_nxd_kwh_string:    BINARY SNAPPY DO:1252 FPO:1488 SZ:319/390/1.22 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[num_nulls: 0, min/max not defined]
    dabasuploader_pred_today_kwh_string:  BINARY SNAPPY DO:1650 FPO:1903 SZ:336/404/1.20 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[num_nulls: 0, min/max not defined]

    java.lang.IllegalArgumentException: [solpos_altitude_double] optional double solpos_altitude_double is not in the store: [[dabasuploader_time_string] optional binary dabasuploader_time_string (STRING), [dabasuploader_pred_nxd_kwh_string] optional binary dabasuploader_pred_nxd_kwh_string (STRING), [id_string] optional binary id_string (STRING), [timestamp] optional int64 timestamp (TIMESTAMP(MILLIS,true)), [dabasuploader_pred_today_kwh_string] optional binary dabasuploader_pred_today_kwh_string (STRING), [dabasuploader_prod_kwh_string] optional binary dabasuploader_prod_kwh_string (STRING)] 100

    The solpos_altitude_double column comes from the events we upload to the event hub; in our events it is called solpos_altitude, and the _double suffix is added by TSI, according to the docs.
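
    For reference, the file's footer metadata itself seems to load fine; only reading the actual data fails. Here is a minimal pyarrow sketch I use to list what the footer declares (pyarrow is an assumption on my side, nothing TSI-specific):

    import pyarrow.parquet as pq

    # Open only the footer metadata; no row groups are decoded here.
    pf = pq.ParquetFile("20200123140854700_c8876d10_01.parquet")
    print(pf.schema)                                   # column names/types declared in the footer
    print(pf.metadata.num_columns, "columns declared in the footer")
    print(pf.metadata.num_row_groups, "row group(s)")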

    According to all the MS Azure documentation I could find, reading the Parquet files should be possible without issues.

    Does anybody know what went wrong? If more info is needed, I am more than happy to provide it.

    Monday, January 27, 2020 10:20 AM


All replies

  • Hi Tamas,

    I am looking through the following tutorial, and I see that JSON is the only supported serialization format for incoming event data; I believe this is what you are experiencing. In the Add a new event source section, you will see the following:

    Event serialization format

    Currently, JSON is the only available serialization format. Event messages must be in this format or data can't be read.
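
    For illustration only, an event message is simply a JSON document. Below is a rough sketch of sending one with the azure-eventhub (v5) Python SDK; the connection string, hub name, and field names are placeholders, not taken from your setup:

    import json
    from azure.eventhub import EventHubProducerClient, EventData

    # Hypothetical event payload (TSI appends type suffixes such as _double when it stores properties).
    payload = {"id": "device-01", "timestamp": "2020-01-23T14:08:52Z", "solpos_altitude": 12.3}

    producer = EventHubProducerClient.from_connection_string(
        "<event-hub-connection-string>", eventhub_name="<hub-name>")
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(payload)))  # body must be JSON text
        producer.send_batch(batch)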

    The other option is to use Spark and Azure Databricks to read the data: Streaming Real-Time Data from Azure Event Hubs into Databricks Delta Tables

    Is there a specific document you are following as guidance?

    Regards,

    Mike

    Monday, January 27, 2020 11:00 PM
  • Hi Mike,

    The events from the event hub are serialized in JSON, as the linked tutorial suggests. TSI Preview receives the data, and I can read/query it with the explorer and with the TSI API. But! There is a blob cold storage account connected to the TSI instance, where I can see several parquet files. According to http://docs.microsoft.com/en-us/azure/time-series-insights/time-series-insights-update-storage-ingress#parquet-file-format-and-folder-structure, those files should be readable without issues. This is where I got stuck: I am seeing various exceptions (see my original post).

    Thanks,

    Tamas

    Tuesday, January 28, 2020 8:51 AM
  • Thank you for the additional detail, Tamas. If you are using a local utility to consume the data, can you use Azure Storage Explorer to download the files and read them locally?

    The documentation does not detail any options or steps to take to accomplish what you are attempting, so I am investigating this. 

    I see you have the same question asked on Stack Overflow, and wanted to link the two.

    Regards,

    Mike

    Tuesday, January 28, 2020 6:24 PM
  • Yes, that's what I tried to say in the question: I download the parquet files and try to read them with various command-line tools, like `parquet-head` and its accompanying partners :) The exceptions I am getting are in the question.

    Thank you for your investigation.

    In the meantime I tried with another tool, called `parq`. Here is the output from it:

    parq 20200123140854700_c8876d10_01.parquet 
    Traceback (most recent call last):
      File "/home/prophet/.pyenv/versions/3.7.3/bin/parq", line 8, in <module>
        sys.exit(main())
      File "/home/prophet/.pyenv/versions/3.7.3/lib/python3.7/site-packages/parq/main.py", line 41, in main
        pq_table = pq.read_table(cmd_args.file)
      File "/home/prophet/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pyarrow/parquet.py", line 1281, in read_table
        use_pandas_metadata=use_pandas_metadata)
      File "/home/prophet/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pyarrow/parquet.py", line 253, in read
        use_threads=use_threads)
      File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
      File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
    pyarrow.lib.ArrowIOError: The file only has 6 columns, requested metadata for column: 19
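
    To pin down where the mismatch is, I also put together a small pyarrow check (my own diagnostic sketch, nothing official) that compares the number of columns the footer declares with the column chunks each row group actually contains:

    import pyarrow.parquet as pq

    md = pq.ParquetFile("20200123140854700_c8876d10_01.parquet").metadata
    print("columns declared in the footer schema:", md.num_columns)
    for i in range(md.num_row_groups):
        # If the writer issue is what it looks like, this count may be lower than the
        # footer count, which would explain "requested metadata for column: 19".
        print(f"row group {i}: {md.row_group(i).num_columns} column chunk(s)")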

    Regards,

    Tamas

    Wednesday, January 29, 2020 8:55 AM
  • Hi Tamas,

    I have some feedback from the product group. The issue you are experiencing is a known issue and will be fixed towards the end of February. Below is their direct feedback regarding the issue, a workaround, and the expected resolution time frame.

    "This looks like a known issue with the Parquet files written by TSI.  He’ll have the same problem with Spark/Databricks.  We are actively working on the fix and it should be available end of Feb.  At that time, the customer will need to create a new environment and the Parquet files for this new environment will work."  

    "Until the fix is rolled out, he can access the data from the Time Series Query (TSQ) API or the TSI explorer app.  See the bottom of section 5 for options to download the data being viewed as csv."  

    Please let me know if you have any additional questions.

    Regards,

    Mike

    Wednesday, January 29, 2020 6:32 PM
  • Mike,

    Thank you for your help! At least now I know that it wasn't me misusing the system. Thank you for contacting the right group on my behalf.

    Have a nice day!

    Best Regards,

    Tamas

    Thursday, January 30, 2020 9:01 AM
  • Hi Mike,

    I have other questions. Should I start a new thread for them? 

    First of all, I can't find anything related to the Time Series Insights Preview data retention policy. Given that this is a time series database, I would like to configure the aggregation policies over time, but I can't find any documentation about this for the Preview version. Do you have any info about it?

    My second question is about the TSI Preview REST API. I am playing around with it and the aggregateSeries part is not so intuitive. 

    I am sending queries to the TSI Preview instance that is already set up. I am trying to get the minimum temperature within a date range. Here are two queries with their bodies and responses:

    Query for https://<env-id>.env.timeseries.azure.com/timeseries/query?api-version=2018-11-01-preview&storeType=coldstore with body
    { 'aggregateSeries': { 'ProjectedVariables': ['MinTemperature', 'Count'],
                           'inlineVariables': { 'Count': { 'aggregation': { 'tsx': 'count()'},
                                                           'filter': None,
                                                           'kind': 'aggregate'},
                                                'MinTemperature': { 'aggregation': { 'tsx': 'min($value)'},
                                                                    'filter': None,
                                                                    'kind': 'numeric',
                                                                    'value': { 'tsx': '$event.holfuyscraper_temperature'}}},
                           'interval': 'P6D',
                           'searchSpan': { 'from': '2020-01-22T12:00:01Z',
                                           'to': '2020-01-28T18:00:01Z'},
                           'timeSeriesId': ['svajc_hf']}}
    Result is:
    { 'progress': 100.0,
      'properties': [ { 'name': 'MinTemperature',
                        'type': 'Double',
                        'values': [-0.5, -4.8]},
                      { 'name': 'Count',
                        'type': 'Long',
                        'values': [281748, 421732]}],
      'timestamps': ['2020-01-19T00:00:00Z', '2020-01-25T00:00:00Z']}

    I would like to highlight that the result says that the minimum temperature was -0.5 on 2020-01-19.

    Then the second query:

    Query for https://<env-id>.env.timeseries.azure.com/timeseries/query?api-version=2018-11-01-preview&storeType=coldstore with body
    { 'aggregateSeries': { 'ProjectedVariables': ['MinTemperature', 'Count'],
                           'inlineVariables': { 'Count': { 'aggregation': { 'tsx': 'count()'},
                                                           'filter': None,
                                                           'kind': 'aggregate'},
                                                'MinTemperature': { 'aggregation': { 'tsx': 'min($value)'},
                                                                    'filter': None,
                                                                    'kind': 'numeric',
                                                                    'value': { 'tsx': '$event.holfuyscraper_temperature'}}},
                           'interval': 'P6D',
                           'searchSpan': { 'from': '2020-01-20T12:00:01Z',
                                           'to': '2020-01-26T18:00:01Z'},
                           'timeSeriesId': ['svajc_hf']}}
    Result is:
    { 'progress': 100.0,
      'properties': [ { 'name': 'MinTemperature',
                        'type': 'Double',
                        'values': [-6.5, -2.1]},
                      { 'name': 'Count',
                        'type': 'Long',
                        'values': [425684, 196806]}],
      'timestamps': ['2020-01-19T00:00:00Z', '2020-01-25T00:00:00Z']}

    And here, the same timestamp as in the previous query, but with a different value. 
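
    For reference, this is roughly how I am sending these queries (a sketch of my script; token acquisition is omitted, and <env-id> / <aad-token> are placeholders):

    import requests

    URL = ("https://<env-id>.env.timeseries.azure.com/timeseries/query"
           "?api-version=2018-11-01-preview&storeType=coldstore")

    body = {
        "aggregateSeries": {
            "timeSeriesId": ["svajc_hf"],
            "searchSpan": {"from": "2020-01-22T12:00:01Z", "to": "2020-01-28T18:00:01Z"},
            "interval": "P6D",
            "inlineVariables": {
                "MinTemperature": {"kind": "numeric",
                                   "value": {"tsx": "$event.holfuyscraper_temperature"},
                                   "filter": None,
                                   "aggregation": {"tsx": "min($value)"}},
                "Count": {"kind": "aggregate", "filter": None,
                          "aggregation": {"tsx": "count()"}},
            },
            "ProjectedVariables": ["MinTemperature", "Count"],
        }
    }

    response = requests.post(URL, json=body,
                             headers={"Authorization": "Bearer <aad-token>"})
    response.raise_for_status()
    print(response.json())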

    What am I doing wrong?

    BR,

    Tamas


    Thursday, January 30, 2020 2:39 PM
  • I have a follow-on question out to the product group to see whether the result set is expected or not. Thank you for bringing this behavior to our attention. Normally it is helpful to create a new thread, but given the potentially related nature of the issue and the forthcoming fix, this is a great place for the discussion.

    I will update this thread when I have additional information to share.

    Regards,

    Mike

    Thursday, January 30, 2020 8:09 PM
  • Hi Tamas,

    The query issue is being investigated, but I did want to provide feedback regarding the retention period:

    For the Preview, the customer sets the retention time for the warm store, and for the cold store the data is stored indefinitely. Here's where the doc talks about storage account (cold) retention: https://docs.microsoft.com/en-us/azure/time-series-insights/time-series-insights-update-storage-ingress#your-storage-account

    I will follow up with feedback on the query result behavior, based upon what I hear from the PG.

    Thank you,

    Mike

    Thursday, January 30, 2020 10:23 PM
  • Hi Mike, 

    Thanks.

    Does this "indefinitely" stored data mean that it can disappear at once? Or is this just more like a typo, meaning infinitely?

    BR,

    Tamas 

    Friday, January 31, 2020 8:36 AM
  • Hi Tamas,

    I was able to get an explanation of the query behavior, and this is the response the product group provided:

    The search spans of the two queries you presented are different:

    The first was 'searchSpan': { 'from': '2020-01-22T12:00:01Z', 'to': '2020-01-28T18:00:01Z' }, for which the minimum temperature was -0.5 on 2020-01-19. The second was 'searchSpan': { 'from': '2020-01-20T12:00:01Z', 'to': '2020-01-26T18:00:01Z' }, for which the minimum temperature was -6.5 for the same date. Since different sets of data were aggregated, the values returned are different.

    The timestamps in the responses are the same (even though the search spans differed) because of how we calculate intervals.
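
    To illustrate the idea with a rough sketch of my own (a simplification, not the service's exact logic): the bucket boundaries are derived from the interval size relative to a fixed reference point rather than from the start of the search span, so overlapping search spans report the same bucket timestamps. Assuming a hypothetical alignment point of 2020-01-01:

    from datetime import datetime, timedelta, timezone

    INTERVAL = timedelta(days=6)                             # 'interval': 'P6D'
    REFERENCE = datetime(2020, 1, 1, tzinfo=timezone.utc)    # hypothetical alignment point

    def bucket_start(ts):
        """Snap a timestamp down to the start of its interval bucket."""
        n = (ts - REFERENCE) // INTERVAL
        return REFERENCE + n * INTERVAL

    # Both search-span starts snap to the same boundary, 2020-01-19T00:00:00Z,
    # which is why both responses report identical 'timestamps'.
    print(bucket_start(datetime(2020, 1, 22, 12, 0, 1, tzinfo=timezone.utc)))
    print(bucket_start(datetime(2020, 1, 20, 12, 0, 1, tzinfo=timezone.utc)))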

    As for the following statement:

    "During public preview, data is stored indefinitely in your Azure Storage account."

    The statement reads correctly: Time Series Insights is currently in Public Preview, and this period serves as a production testing period in which the product group can make final decisions and tweak the service before going GA.

    We can provide product feedback, and examples such as yours, where you are asking these questions, are good feedback items. However, when a service is in public preview, there is little to no support for production issues, as the service does not have the support level of a Generally Available service. With that, the product group can decide to what extent services and features are supported. In the case of how long the data is stored, they are not committing to a specific retention duration.

    Please let me know if you have additional questions.

    Regards,

    Mike

    Tuesday, February 4, 2020 4:26 PM
  • Mike,

    Thank you for your efforts! 

    Best Regards,

    Tamas

    Friday, February 7, 2020 12:20 PM