What is the best way to share datasets in ML Studio?

  • Question

  • I am fairly new to ML Studio, so please forgive me if I am missing the obvious here.

    I have a dataset in ML Studio that I want to share with another individual. How can I share it with them?

    Initially I tried using MS DataShare, but because the dataset is inside the workspace created for my ML experiments, all the folders there are machine-named with GUIDs, so it isn't practical to find my dataset.

    Maybe I have to somehow download the dataset and share it on GitHub? That seems needlessly awkward, so I thought I had better ask the community before I go to that effort.

    thx

    -Sheldon

    Monday, January 13, 2020 9:09 PM

Answers

  • Hello slyttle,

    If you are using ML Studio (classic), you can publish your experiment to the gallery. The published experiment automatically includes the dataset from your training experiment, so other users get it when they use your experiment in their own subscription or workspace. See the guidelines and steps for sharing an experiment to the gallery. All datasets are registered with the workspace when using ML Studio (classic).

    If you are using the designer, you can register the dataset with the workspace and share it by having the other user reference that workspace.

    Register the dataset to the workspace:

    from azureml.core import Dataset, Workspace

    # load the workspace (assumes a config.json for it is available locally)
    workspace = Workspace.from_config()

    # create a TabularDataset from Titanic training data
    web_paths = [
                'https://dprepdata.blob.core.windows.net/demo/Titanic.csv',
                'https://dprepdata.blob.core.windows.net/demo/Titanic2.csv'
               ]
    titanic_ds = Dataset.Tabular.from_delimited_files(path=web_paths)
    
    # register titanic_ds, creating a new version if the name already exists
    titanic_ds = titanic_ds.register(workspace = workspace,
                                     name = 'titanic_ds',
                                     description = 'new titanic training data',
                                     create_new_version = True)

    Access the registered dataset from a training script that runs in the same workspace:

    %%writefile $script_folder/train.py
    
    from azureml.core import Dataset, Run
    
    run = Run.get_context()
    workspace = run.experiment.workspace
    
    dataset_name = 'titanic_ds'
    
    # Get a dataset by name
    titanic_ds = Dataset.get_by_name(workspace=workspace, name=dataset_name)
    
    # Load a TabularDataset into pandas DataFrame
    df = titanic_ds.to_pandas_dataframe()
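
    For the person you are sharing with, a minimal sketch of loading the registered dataset outside of a run might look like the following. It assumes they have already been granted access to your workspace (how that access is granted is not covered in this thread) and have the azureml-core package installed. The helper name load_shared_dataset and the placeholder values are hypothetical and only for illustration.

```python
def load_shared_dataset(subscription_id, resource_group, workspace_name,
                        dataset_name, version="latest"):
    """Return a registered dataset from a workspace the caller can access.

    Hypothetical helper for illustration only. Assumes azureml-core is
    installed and the caller already has permission to read the workspace.
    """
    # Imported lazily so this module can be loaded without the SDK present.
    from azureml.core import Dataset, Workspace

    # Connect to the existing workspace (this may prompt for an Azure login).
    workspace = Workspace.get(name=workspace_name,
                              subscription_id=subscription_id,
                              resource_group=resource_group)

    # Look up the dataset by the name it was registered under.
    return Dataset.get_by_name(workspace=workspace,
                               name=dataset_name,
                               version=version)


# Example call with placeholder values (replace with the real workspace details):
# titanic_ds = load_shared_dataset("<subscription-id>", "<resource-group>",
#                                  "<workspace-name>", "titanic_ds")
# df = titanic_ds.to_pandas_dataframe()
```

    If a specific registered version is needed rather than the newest one, pass its version number as the version argument instead of the default "latest".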

    Please check out the Azure Machine Learning documentation on datasets to learn more about their usage.


    Tuesday, January 14, 2020 6:40 AM
    Moderator