Need to analyze input CSV files and determine whether an input file is good or bad w.r.t. its data

  • Question

  • Hi Team,

    We have a scenario where we need to implement an Artificial Intelligence solution which will evaluate the input data file of my Azure Data Factory pipeline and tell us whether the file is good or bad with respect to its data.

    For example, I have 10 files with several rows each that are good input files, and 2 files with several rows each that are bad input files.

    Each file, whether good or bad, has 26 columns. The two files above are bad for the following reasons.

    1. One file has all empty values in one column, which is not expected.

    2. Another file has the value 'TRUE' for all rows in a specific column, which is also not the general scenario (good files contain some percentage of TRUE records and some percentage of FALSE records).

    Like this, there may be several scenarios where an input file should be treated as a bad file.

    We want to implement an Artificial Intelligence solution which should analyze all the input files, identify the hidden patterns in the data, detect abnormal scenarios like the above, and eventually mark the file as a bad file.

    Please suggest an approach, or which Azure components can help achieve this kind of file sanity check.


    Monday, December 9, 2019 6:28 PM

All replies

  • Hello Dileep,

    There is a new service offering from Azure Machine Learning called data drift detection on datasets, which creates dataset monitors that watch datasets for data drift and statistical changes. Currently this service allows you to do the following:

    • Analyze drift in your data to understand how it changes over time.
    • Monitor model data for differences between training and serving datasets.
    • Monitor new data for differences between any baseline and target dataset.
    • Profile features in data to track how statistical properties change over time.
    • Set up alerts on data drift for early warnings to potential issues.

    This service might help you detect changes in your datasets relative to a baseline and provide insights with statistical measurements as well. If you are planning to implement an AI solution, this might fit your scenario.
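    The core idea behind a dataset monitor, profiling a baseline of known-good files and flagging a new file whose statistics fall outside that profile, can also be sketched by hand. The statistic (TRUE ratio) and the tolerance value below are illustrative assumptions, not parameters of the Azure service:

```python
def true_ratio(values):
    """Fraction of 'TRUE' entries in a boolean column."""
    return sum(v == "TRUE" for v in values) / len(values)

def drifted(baseline_columns, candidate_column, tolerance=0.2):
    """Flag the candidate column if its TRUE ratio deviates more than
    `tolerance` from the mean TRUE ratio seen across the baseline files."""
    ratios = [true_ratio(col) for col in baseline_columns]
    expected = sum(ratios) / len(ratios)
    return abs(true_ratio(candidate_column) - expected) > tolerance
```

    A column that is 100% TRUE in a new file, when the baseline files average around 40% TRUE, would be flagged here much as a dataset monitor would surface it as drift.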

    For a basic use case, such as checking the number of rows and columns in a file, you can use Azure Machine Learning designer to import CSV files and perform basic validation with Execute Python Script or Execute R Script modules, accepting or rejecting files as the outcome of the web service.
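    In the designer, the Execute Python Script module requires an entry point named `azureml_main` that receives the imported CSV as a pandas DataFrame. A minimal sketch of the validation described above; surfacing the verdict as an `is_valid` output column is an assumption about how you might wire the result downstream:

```python
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point required by the Execute Python Script module.
    Marks the input as invalid if any column is entirely empty or a
    boolean column contains only a single value for all rows."""
    is_valid = True
    for col in dataframe1.columns:
        non_null = dataframe1[col].dropna()
        if non_null.empty:
            is_valid = False  # Rule 1: fully empty column
        elif set(non_null.astype(str)) <= {"TRUE", "FALSE"} and non_null.nunique() == 1:
            is_valid = False  # Rule 2: constant boolean column
    result = dataframe1.copy()
    result["is_valid"] = is_valid
    return result,
```

    The module's output DataFrame then carries the verdict, and a downstream step in the pipeline can accept or reject the file based on the `is_valid` flag.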


    Tuesday, December 10, 2019 7:58 AM