Loading registered datasets into a dataframe in Azure ML

  • Question

  • Hi Everyone, 

    I've been struggling with something that, to me, should be very basic: loading data into a pandas dataframe from a registered dataset in Azure ML (preview). I'm hoping someone can pull me out of my downward cycle of misery.

    First off, let me explain that I started out with a time series dataset that I did some pre-processing on via the designer, i.e. I created an experiment and used the drag-and-drop interface to get my dataset to a state where I could feed it into a much fancier Python script. Since the designer isn't great for running Python scripts (that plot graphs), I saved the resulting dataset as a registered dataset in my ML workspace.

    Unfortunately, the interface would only save it as a "file" type dataset; there was no option to save it as a tabular dataset, which I would have preferred.

    I then created a Notebook (again, in AML (preview)) and started trying to import that registered dataset, but because it was of type "file", the only function the Dataset class offered was the option to download the dataset into its constituent files (a tabular dataset has options to convert to a dataframe, which is exactly what I want). What I was left with was a set of eight files, one of which was a parquet file, a format I have never worked with.

    I've been trying to find functions that can load this resulting parquet file, but so far I have not been successful, and I just keep going in circles. I just want to get this data into a dataframe and get on with my life!!

    Can anyone help?

    thanks 

    -Sheldon

    Friday, February 7, 2020 8:22 PM

Answers

  • Ok ... as with most things ... the answer was almost trivial once I knew what I was doing.

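    import pyarrow.parquet as pq

    # Read the parquet file that was downloaded with the file dataset
    table = pq.read_table('data.dataset.parquet')
    # Convert the Arrow table to a pandas dataframe and keep the result
    df = table.to_pandas()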




    • Marked as answer by slyttle Friday, February 7, 2020 8:51 PM
    • Edited by slyttle Friday, February 7, 2020 8:52 PM
    Friday, February 7, 2020 8:51 PM

All replies

  • Hi,

    Thanks for sharing updates regarding this issue.

    Regards,

    GiftA-MSFT.

    If a post helps to resolve your issue, please click “Mark as Answer” and/or “Vote as helpful”. By marking a post as Answered and/or Helpful, you help others find the answer faster.  Thanks.

    Friday, February 7, 2020 9:37 PM
    Moderator