How to dynamically group on columns without defined schema in data flows?

  • Question

  • I have multiple sets of files in Azure Data Lake Gen2. Each set of files has the same schema, but the schema differs widely between the sets. The files within a set largely contain the same, redundant data - which I want to get rid of.

    Basically what I need to do is to load the data from all the files in a set into a new destination file, grouping on all columns except one aggregated column (let's say it's a column like LoadTime, for clarity). I'm trying to think of a way to do this without any specified schema in the mapping data flow. The idea I had was to use the Aggregate component, but this doesn't seem to allow a column pattern for the group-by part (you can have a column pattern for the aggregated columns in the Aggregate component).
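
    To illustrate, a column pattern on the aggregate side can look roughly like this in the data flow script (untested sketch; LoadTime is just my example column, and max() is one possible aggregate) - but there is no equivalent for the group by columns:

        each(match(name == 'LoadTime'), $$ = max($$))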

    Any idea on how this can be achieved?

    I guess I can get the structure (column names) of the file dynamically with the Get Metadata activity in a pipeline and pass some string with those into the data flow via a parameter - but I don't see any clear path forward from there either.

    Monday, April 27, 2020 7:28 AM

All replies

  • Hello Cedersved and thank you for your insightful query.

    I have an idea on how to work around the lack of a column pattern in the group by section.

    The group by columns are not restricted to hardcoded pre-existing columns.  There is a 'computed column' option.  (Hover the mouse over 'select column', and a blue link appears below.)

    I was thinking of using this and defining the columns using byPosition().  While this does create a new column, the value can be identical to the original.  This does rely on foreknowledge of the column order to figure out which position to skip.
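
    As a rough, untested sketch of what I mean (assuming the stream has four columns, of which LoadTime is the one to aggregate; key1-key3 and the toTimestamp() cast are placeholders to adapt), the Aggregate could be scripted along these lines:

        aggregate(groupBy(key1 = byPosition(1),
                          key2 = byPosition(2),
                          key3 = byPosition(3)),
            LoadTime = max(toTimestamp(byName('LoadTime'))))

    byPosition() and byName() are the existing expression functions for referencing columns without a defined schema.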

    An alternative approach would be to use a Derived Column transformation to create a compound key.  Since the compound key column is named in the data flow, you can select it in the Aggregate's group by.
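
    As a rough, untested sketch of that variant (compoundKey is just an illustrative name and the positions are placeholders): a Derived Column such as

        compoundKey = concatWS('|', toString(byPosition(1)),
                                    toString(byPosition(2)),
                                    toString(byPosition(3)))

    followed by an Aggregate that groups on compoundKey and aggregates LoadTime. concatWS(), toString() and byPosition() are existing data flow expression functions.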

    Tuesday, April 28, 2020 10:38 PM
  • Since we have not heard back from you, we will assume you found your own resolution.  If you found a solution, would you please share it here with the community?
    Friday, May 8, 2020 8:29 PM
  • I'm working on testing this out. A question regarding your idea:

    Is it possible to specify more columns using byPosition() than are available in the dataset? Or is there a way to count the number of columns in the incoming stream? The number of columns in my datasets varies greatly as well.

    Thanks

    Monday, May 11, 2020 5:10 AM
  • Your suggestion almost works for my use-case, Martin.

    What gets tricky is if I want to exclude a column from the group by, and aggregate on it instead.

    It would be great if it were possible to add some exclusion filter to the columnNames() function, or to get the column name based on position - something like a new ColumnNamebyPosition() function.

    If the schema is stable, this could potentially be done by using a Get Metadata activity in a pipeline to build an array of the columns, and then using some expression to remove the technical columns whose names I know in advance and will aggregate on. Unfortunately this isn't easily applicable in my case, as I have partitioned files reflecting the source system schema - meaning that if a new column is added to the source it might only exist in some of the files that are going to be merged in the data flow.
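
    For what it's worth, the stable-schema variant I had in mind would look roughly like this (untested; the activity names are placeholders and LoadTime stands for my technical columns): a Get Metadata activity with the Structure field, then a Filter activity over its output to drop the known technical columns, e.g.

        items:     @activity('Get Metadata1').output.structure
        condition: @not(equals(item().name, 'LoadTime'))

    with the remaining column names then passed into the data flow as a string parameter.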

    In your answer you write: "This does rely on foreknowledge of the column order to figure out which position to skip." In my case this column contains the file name, which gets added via the "additional columns" functionality and $$FILEPATH in the Copy Data activity. Can I control the ordering, and the position this column gets, in the output file of type ORC or Parquet? (The rest of the columns are not mapped in the copy activity, as I use the same pipeline to load hundreds of tables from multiple sources.)

    Thanks



    Wednesday, May 20, 2020 7:36 AM