none
resampling a dataset that has several categorical columns RRS feed

  • Question

  • Hi There, 

    I am attempting to do some time series analysis in Azure Machine Learning Studio.

    The dataset I am working with contains consumption data of pharmaceutical active ingredients over a four year period.  The key features of the dataset are  Material, Plant, Date, and Consumption.  Ultimately I want to build a model that will predict how much a given active ingredient (material) will be consumed at a particular plant in a given week in the future.

    Currently batches using any active ingredient can be produced any day of the week and even multiple times in a day.   A record appears in the dataset for each batch produced.

    My current challenge is how to resample the data and summarize it by week, since that is ultimately the time period I want my model to predict.

    I can do this relatively easily for a fixed material, and plant using the following python script

    import pandas as pd
    import datetime as dt
    
    # The entry point function can contain up to two input arguments:
    #   Param<dataframe1>: a pandas.DataFrame
    #   Param<dataframe2>: a pandas.DataFrame
    def azureml_main(dataframe1 = None, dataframe2 = None):
        print(dataframe1.head())
        df1 = dataframe1.set_index(['Posting Date'])
        dfweekly = df1.resample('W').sum()
        dfweekly['PDate'] = pd.to_datetime(dfweekly.index.date)
        
        # Return value must be of a sequence of pandas.DataFrame
    return dfweekly,
     

    Unfortunately, my dataset has about 400 materials and 10 plants in it (categorical data) and I need to perform the same resampling for each valid category pairing ... ie. Material A1 at Plant 1 .... Material B1 at Plant 1, and perhaps material B1 at plant 4 as well.

    So we are talking about several hundred different resamplings that need to be performed and appear all within a single dataset (or dataframe)

    I've spent the better part of the weekend, playing with groupby() and pivot_table() functions but I just can't wrap my head around how to do this.    I feel like there might be a way to iterate through all the categories and perform resampling on each of them individually, and then somehow re-combine them into a single dataset but that feels really inefficient.

    Can anyone suggest a better way to tackle this, preferably within Azure Machine Learning Studio?

    Thanks for any assistance the community can offer.

    Sunday, October 6, 2019 10:14 PM

All replies

  • Made some progress.   It turned out that I just had a small error in my groupby and I was able to get it to resample each material / plant pairing using a groupby() 

        dfweekly = df1.groupby(['Raw Material', 'Plant']).resample('W').sum()
    

    The was a dataset with over a million rows, but it is summarized correctly!

    The only problem for me is that the resulting dataframe has a multiindex made up of my categorical columns Raw Material and Plant.

    I haven't worked with MultiIndexes before, and I am struggling to find the syntax to copy those indexes back into regular data columns.   

    Monday, October 7, 2019 3:25 PM