locked
First and Last record in a set RRS feed

  • Question

  • Anyone have an effecient ideas on how to do the following:

     

    For example lets say the data is Temperature over time.For example lets say the data is Temperature over time.  DataSet (from a file) comes in sequentially (I've even added a "RowNumber" column).  I then drop some temperature records that I determine to be bad.. so the RowNumber sequence would typically look like 5,6,7,8,10,11 etc.. (often the first few readings are bad). 

    What I need out of the set is a bunch of aggregates, High, Low, Average, Count of Readings etc... all those are easy with the Aggregate object in SSIS... however, I struggle to get the starting temp and ending temp.  What I have done is multicast the data into the following streams

    1. Goes through the High, Low, Average, Count of readings etc Aggregate

    2. Goes to an aggregate to get the Min RowNumber

    3. Goes to an aggregate to get the Max RowNumber

    4. Goes to sort on RowNumber and multicasts (for the next step)

    Then 2 and 4 are joined on the Min RowNumber, 3 and 4 are joined on MaxRowNumber.  Essentially I get the max and min row number and then join it to the whole just to get the first temp and last temp.

    So, as you can imagine that is quite expensive just to get the first and last record (the sorting on 4 is what is really slow)

    Ideas?

    Tuesday, November 7, 2006 1:40 AM

Answers

  • With an asych component, since you're doing input buffer processing inside ProcessInput, use <Your OutputBuffer Name>.AddRow directly from inside that function when you have computed a new aggregate row.  No need in this case to implement CreateNewOutputRows().
    Thursday, November 9, 2006 6:29 PM

All replies

  • I would use a script transform, or if you use it in multiple places - create a custom transform for this purpose. It should be relatively easy and most importantly - signiticantly more efficient :)
    Tuesday, November 7, 2006 8:03 AM
  •  Chris Honcoop wrote:

    Anyone have an effecient ideas on how to do the following:

     

    For example lets say the data is Temperature over time.For example lets say the data is Temperature over time.  DataSet (from a file) comes in sequentially (I've even added a "RowNumber" column).  I then drop some temperature records that I determine to be bad.. so the RowNumber sequence would typically look like 5,6,7,8,10,11 etc.. (often the first few readings are bad). 

    What I need out of the set is a bunch of aggregates, High, Low, Average, Count of Readings etc... all those are easy with the Aggregate object in SSIS... however, I struggle to get the starting temp and ending temp.  What I have done is multicast the data into the following streams

    1. Goes through the High, Low, Average, Count of readings etc Aggregate

    2. Goes to an aggregate to get the Min RowNumber

    3. Goes to an aggregate to get the Max RowNumber

    4. Goes to sort on RowNumber and multicasts (for the next step)

    Then 2 and 4 are joined on the Min RowNumber, 3 and 4 are joined on MaxRowNumber.  Essentially I get the max and min row number and then join it to the whole just to get the first temp and last temp.

    So, as you can imagine that is quite expensive just to get the first and last record (the sorting on 4 is what is really slow)

    Ideas?

    If you already added a rownumber you just need to figure out how to capture the last rownumber generated into a ssis variable and then use that variable in muticast/Conditional split combination to capture those 2 values; first and last. For that you may want to add rownumber column after you remove the bad records; so the first is always 1 and the last is always the last generated number.

    When the source is a table or any other souce that you can run a Select against; it is easy to run a SQL Task in control flow to get the Coutn(*) from the table and store it in a variable before the DataFlow is executed.

     

    Rafael Salas

     

    Tuesday, November 7, 2006 1:42 PM
  • Great idea Rafael.. however its not that simple (it never is, is it?) .. the data has a number of items I group by etc.  Using the variable and the split is an interesting idea though and I'm trying to see if there is a way I can do something along those lines.

    Michael, to be frank I am not knowledgeable enough in script with SSIS... especially in the asyncronous data flow area.. to figure out how to write a "first" and "last" aggregation - especially since I need to have "group by".  Any recommendations on resources?  For example I found info on using a DataTable datatype and I think I could get the data I need that way (could get first, last, min, max all with dataviews on the datatable) - but the script editor errors out when I try to declare something of type DataTable.  Basically any pointers to what structure of data in the script would best allow me to do aggregates grouping by some columns would get me going...

    Tuesday, November 7, 2006 6:59 PM
  • I've tried my hand at making an asyncronous script component and have run into some issues.  I am completely new to vb.net so I might be totally wrong but here's what I have hammered out...

    Here's the script:

    ' Microsoft SQL Server Integration Services user script component

    ' This is your new script component in Microsoft Visual Basic .NET

    ' ScriptMain is the entrypoint class for script components

     

    Imports System

    Imports System.Data

    Imports System.Math

    Imports System.Xml

    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

     

    Public Class ScriptMain

        Inherits UserComponent

     

        Public Overrides Sub CreateNewOutputRows()

     

            OutputBuffer.AddRow()

     

        End Sub

     

     

        Public Overrides Sub Input_ProcessInput(ByVal Buffer As InputBuffer)

     

            Dim dtInput As DataTable = New DataTable

     

            dtInput.Columns.Add("ContractID", GetType(Integer))

            dtInput.Columns.Add("iDate", GetType(Integer))

            dtInput.Columns.Add("SessionID", GetType(Integer))

            dtInput.Columns.Add("SequenceID", GetType(Integer))

            dtInput.Columns.Add("iVolume", GetType(Integer))

            dtInput.Columns.Add("vcTime", GetType(String))

            dtInput.Columns.Add("dTemp", GetType(String))

     

            Dim drNewRow As DataRow

     

            While Buffer.NextRow()

                drNewRow = dtInput.NewRow()

                drNewRow("ContractID") = Buffer.ContractID

                drNewRow("iDate") = Buffer.iDate

                drNewRow("SessionID") = Buffer.SessionID

                drNewRow("SequenceID") = Buffer.SequenceID

                drNewRow("iVolume") = Buffer.iVolume

                drNewRow("vcTime") = Buffer.vcTime

                drNewRow("dTemp") = Buffer.dTemp

                dtInput.Rows.Add(drNewRow)

            End While

     

            ' Create a DataView on top of the DataTable to Sort

     

            Dim dv As DataView = dtInput.DefaultView

            dv.Sort = "ContractID, iDate, SessionID, SequenceID ASC"

     

            Dim ContractID As Integer = 0

            Dim iDate As Integer = 0

            Dim SessionID As Integer = 0

            Dim vcTimeStartActual As String

            Dim vcTimeEndActual As String

            Dim dHigh As Decimal = 0

            Dim dLow As Decimal = 0

            Dim dEndTemp As Decimal = 0

            Dim dBeginTemp As Decimal = 0

            Dim iVolume As Integer = 0

            Dim iReadingsCount As Integer = 0

     

     

            Dim results As String

     

            ' Loop through the ordered Data view

            For Each rv As DataRowView In dv

     

                ' Look for Each unique combination of ContractID, iDate, SessionID - if it has changed since last row reset all the variables

                If (Convert.ToInt32(rv.Item("ContractID")) <> ContractID Or Convert.ToInt32(rv.Item("iDate")) <> iDate Or Convert.ToInt32(rv.Item("SessionID")) <> SessionID) Then

     

                    ' If this is not the first row that means we just hit a new set

                     If iDate <> 0 Then

                     ' we've finished looping all the record for one combination so its time to push this to our results.

                   'results = "ContractID: " & CStr(ContractID)

                   'results += " iDate: " & CStr(iDate)

                   'results += " SessID: " & CStr(SessionID)

                   'results += " vcTimeStartActual: " & CStr(vcTimeStartActual)

                   'results += " vcTimeEndActual: " & CStr(vcTimeEndActual)

                   'results += " dHigh: " & CStr(dHigh)

                   'results += " dLow: " & CStr(dLow)

                   'results += " dEndTemp: " & CStr(dEndTemp)

                   'results += " dBeginTemp: " & CStr(dBeginTemp)

                   'results += " iVolume: " & CStr(iVolume)

                   'results += " iReadingsCount " & CStr(iReadingsCount)

                    'MsgBox(results)

     

     

                    End If

     

                    ' Initialize all the variables to the first row's values

                    ContractID = Convert.ToInt32(rv.Item("ContractID"))

                    iDate = Convert.ToInt32(rv.Item("iDate"))

                    SessionID = Convert.ToInt32(rv.Item("SessionID"))

                    vcTimeStartActual = rv.Item("vcTime").ToString

                    vcTimeEndActual = rv.Item("vcTime").ToString

                    dHigh = Convert.ToDecimal(rv.Item("dTemp"))

                    dLow = Convert.ToDecimal(rv.Item("dTemp"))

                    dEndTemp = Convert.ToDecimal(rv.Item("dTemp"))

                    dBeginTemp = Convert.ToDecimal(rv.Item("dTemp"))

                    iVolume = Convert.ToInt32(rv.Item("iVolume"))

                    iReadingsCount = 1

     

                Else ' This is not the first row for a group

     

                    ' Each progressive row will reset the end time and temp

                    vcTimeEndActual = rv.Item("vcTime").ToString

                    dEndTemp = Convert.ToDecimal(rv.Item("dTemp"))

     

                    ' Volume is summed up

                    iVolume = iVolume + Convert.ToInt32(rv.Item("iVolume"))

     

                    ' iReadingsCount is incremented

                    iReadingsCount += 1

     

                    ' Check the prices and set approriately for High and Low

                    If Convert.ToDecimal(rv.Item("dTemp")) > dHigh Then

                        dHigh = Convert.ToDecimal(rv.Item("dTemp"))

                    End If

     

                    If Convert.ToDecimal(rv.Item("dTemp")) < dLow Then

                        dLow = Convert.ToDecimal(rv.Item("dTemp"))

                    End If

     

     

                End If

     

     

            Next

     

            ' We need to get the last values or we lose the data from the last set

                   'results = "ContractID: " & CStr(ContractID)

                   'results += " iDate: " & CStr(iDate)

                   'results += " SessID: " & CStr(SessionID)

                   'results += " vcTimeStartActual: " & CStr(vcTimeStartActual)

                   'results += " vcTimeEndActual: " & CStr(vcTimeEndActual)

                   'results += " dHigh: " & CStr(dHigh)

                   'results += " dLow: " & CStr(dLow)

                   'results += " dEndTemp: " & CStr(dEndTemp)

                   'results += " dBeginTemp: " & CStr(dBeginTemp)

                   'results += " iVolume: " & CStr(iVolume)

                   'results += " iReadingsCount " & CStr(iReadingsCount)

                    'MsgBox(results)

        End Sub

     

     

        Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)

            '

            '

        End Sub

     

     

    End Class

     

     

    Here are my questions:

    1. The whole “Public Overrides Sub Input_ProcessInput(ByVal Buffer As InputBuffer) section does actually appear to process in buffer chunks.. ie if a file has 10,000 rows SSIS won’t give all 10,000 rows to the script component at once… this would lead to incorrect results wouldn’t it?  (how can it figure the high and low when it only does one piece of the data at a time?
    2. As you can see I just have msgbox to see that I am getting the results correctly in the variables – what do I do instead of the msgbox to push all those columns into a row in the output (My script only has an Input called input and a single asyncronous output called output)?
    3. The results msgbox used for testing appear correct nearly all the time but occasionally I'll get all zeros when I shouldn't - ideas?

     

     

    Thursday, November 9, 2006 12:54 AM
  • OK... I think I got my question 1 & 3 figured out.

    I moved my DataTable declaration and addition of columns into PreExecute() so it only gets run once for the component.  I then wrapped the code from the DataView onward under an If checking for buffer.EndOfRowset.  Since doing that the component waits for all the data "chunks" to come in before it starts firing off msgbox

    I am still completly stuck on how to get this data into an output though... right now my script is worthless becuase the data isn't in the output...

    Thursday, November 9, 2006 6:01 PM
  • With an asych component, since you're doing input buffer processing inside ProcessInput, use <Your OutputBuffer Name>.AddRow directly from inside that function when you have computed a new aggregate row.  No need in this case to implement CreateNewOutputRows().
    Thursday, November 9, 2006 6:29 PM
  • Awesome... trying it now.
    Thursday, November 9, 2006 6:44 PM
  • Hah hah... that did it... its always the easy stuff that you end up getting stuck on...
    Thursday, November 9, 2006 6:54 PM