workflow 4 correlation / persistence question

  • Question

  • Hello all,

    I would like to understand why the following behavior takes place.

    In most cases my workflow that runs monthly account renewals works just fine, but with certain accounts it gets stuck. By stuck I mean that a record is created in System.Activities.DurableInstancing.InstancesTable (in my persistence db), but IsInitialized remains set to 0 (as do IsSuspended and IsReadyToRun). When I step through my workflow, I see that the problematic accounts do not get past the InitializeCorrelation activity (I pass an AccountID to initialize AccountIDHandle).

    What I can't figure out is where InitializeCorrelation looks to determine whether a workflow with a given AccountID is already running. I killed all the workflow instances in the database using the DeleteInstance stored procedure, so InstancesTable has no rows. And yet the same problematic accounts trip over the InitializeCorrelation activity: a new workflow instance is created and stored in InstancesTable with IsInitialized = 0.
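A minimal sketch of the "stuck" check described above, using an in-memory SQLite table as a stand-in for the persistence database. Only the three flag columns (IsInitialized, IsSuspended, IsReadyToRun) come from the post; the Id column, the sample rows and the function name are hypothetical, and the real InstancesTable has a different, larger layout.

```python
import sqlite3

# Stand-in for the persistence database. Only the three flag columns are
# taken from the post; the table layout and sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE InstancesTable ("
    "Id TEXT PRIMARY KEY, IsInitialized INTEGER, "
    "IsSuspended INTEGER, IsReadyToRun INTEGER)"
)
conn.executemany(
    "INSERT INTO InstancesTable VALUES (?, ?, ?, ?)",
    [
        ("instance-a", 1, 0, 0),  # hypothetical row that initialized normally
        ("instance-b", 0, 0, 0),  # hypothetical row that never initialized
    ],
)


def find_stuck_instances(connection):
    """Return the Ids of rows where all three flags are still 0."""
    rows = connection.execute(
        "SELECT Id FROM InstancesTable "
        "WHERE IsInitialized = 0 AND IsSuspended = 0 AND IsReadyToRun = 0 "
        "ORDER BY Id"
    )
    return [row[0] for row in rows]


stuck = find_stuck_instances(conn)
print(stuck)
```

Against a real persistence database the same WHERE clause could be run directly in SQL; the Python wrapper is only there to make the sketch self-contained.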

    Thank you very much,

    Andrei

     

    Friday, September 10, 2010 3:35 PM

All replies

  • What format are your Account IDs in? Is it possible that there is something special about those particular IDs? What data type are they, and what is an example of one that causes the failure you are seeing? Does it fail consistently (100% of the time) or only some of the time?

    There are some troubleshooting steps listed on this topic:

    http://msdn.microsoft.com/en-us/library/ee358742.aspx

    If you disable persistence does the InitializeCorrelation work with the problem AccountIDs?

    Does the workflow fault? You can capture these sorts of things using the tracing mentioned in the topic above.

    Let me know about the AccountIDs (data type and value) and whether disabling persistence resolves it; that will give us some more clues to go on.

    Thanks,

    Steve Danielson [Microsoft]
    This posting is provided "AS IS" with no warranties, and confers no rights.
    Use of included script samples are subject to the terms specified at http://www.microsoft.com/info/cpyright.htm

     

    Friday, September 10, 2010 5:41 PM
  • Hi Steve,

    Thanks a lot for your help!

    1. AccountID is an Int32, supplied every time

    2. I pushed a workflow feeding an AccountID on a production box, and it didn't go through. Then I fed the same AccountID to a workflow on our dev box pointing to the production databases (including persistence db) and the entire flow went through just fine. Hmmm...

    3. I configured tracing like you suggested. I did my best to compare and interpret the traces for the successful flow on the dev box and the unsuccessful attempt on the production box. Both traces look almost identical up until this step: after an Activity boundary (Level Stop), the 'good' flow was followed by Get ChannelEndpointElement with FoundChannelElement = true and all other elements (i.e. Binding, RemoteEndpointUri etc.) set to correct values. However, the 'bad' flow showed a warning:

    Description: Faulted System.ServiceModel.Activities.Dispatcher.PersistenceContext

    then followed by

    Aborted 'System.ServiceModel.Activities.Dispatcher.PersistenceContext/7284720'

    4. I temporarily disabled persistence and ran another workflow on the production box, feeding it a problematic AccountID. The results were positive: the account was renewed.
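The trace comparison in step 3 can be sketched as a search for the first point where the two event sequences differ. This is only an illustration: the event lists are hypothetical and heavily shortened, only the event names quoted in the post are real, and `first_divergence` is a made-up helper rather than part of any tracing tool.

```python
from itertools import zip_longest


def first_divergence(good_trace, bad_trace):
    """Return (index, good_event, bad_event) for the first mismatch, or None."""
    pairs = zip_longest(good_trace, bad_trace)
    for index, (good_event, bad_event) in enumerate(pairs):
        if good_event != bad_event:
            return index, good_event, bad_event
    return None


# Hypothetical, shortened event lists based on the names quoted in the post.
good = [
    "Activity boundary (Stop)",
    "Get ChannelEndpointElement",
]
bad = [
    "Activity boundary (Stop)",
    "Faulted System.ServiceModel.Activities.Dispatcher.PersistenceContext",
    "Aborted System.ServiceModel.Activities.Dispatcher.PersistenceContext",
]

result = first_divergence(good, bad)
print(result)
```

If one trace is shorter than the other, the missing side is reported as None at the index where the shorter trace ends.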

     

    What is strange to me is that the production web server + production databases combo didn't work, but dev + production databases worked for the same AccountID. For a moment I thought that a workflow instance correlated on the AccountID had never left the application memory for the persistent store. I restarted the app pool, but that didn't help.

    So... I'm a bit puzzled.

    Thank you in advance.

     

    Friday, September 10, 2010 9:49 PM
  • Thanks for the additional details, I am checking with a few folks to see if I can get some more information.

    Thanks,

    Steve Danielson [Microsoft]

    Friday, September 10, 2010 10:49 PM
  • Just following up... I don't know for sure if I found a workaround to the problem or not - I'm still monitoring the workflow instances health, but so far I'm happy with how it's behaving. Here is what I discovered:

    I knew that the InitializeCorrelation activity did something unusual with some instances. Before re-running workflows for the troubled AccountIDs, I made sure that there were no instances in the persistence store, and I even restarted the application to remove any workflow instance from memory. None of that helped. Then I replaced the InitializeCorrelation activity that came from the toolbox with a custom code activity that wrapped InitializeCorrelation in code. Things worked right away and are still working.

    So, is there a bug when using InitializeCorrelation in XAML?

    Regards, Andrei

    Monday, November 29, 2010 7:33 PM
  • Interesting; I am not aware of any issues, but I will pass this info along to our product group. I did see an issue recently where a workflow that used the InitializeCorrelation activity had trouble when a duplicate correlation key was used to start a new workflow. It was resolved by removing the InitializeCorrelation activity and configuring the correlation on the Receive activity instead (since the ID in this case was passed along with the call to the workflow). If you are generating the ID on the server side, you could do it between the Receive and SendReply and initialize the correlation on the SendReply when you pass that data back to the caller. That may not map to your scenario, and if you have it working already you may not want to change it. I will pass this information along and report back if I get any additional information.

    Thanks,

    Steve Danielson [Microsoft]

    Tuesday, November 30, 2010 3:16 AM
  • "Then I replaced InitializeCorrelation activity that came from the toolbox with a custom code activity that wrapped InitializeCorrelation in code. Things worked right away and are still working."
    I am running into a similar problem. My workflow stalls on the InitializeCorrelation, then 30s later, it restarts from the last persist point. Can you tell me how you wrapped the InitializeCorrelation? Did you use a NativeActivity or Activity? A code sample would be nice.
    Tuesday, May 31, 2011 4:17 PM
  • Hi,

    Try taking a look here and see if you can use these techniques to capture any error conditions that may be occurring when it gets to that InitializeCorrelation activity.

    http://msdn.microsoft.com/en-us/library/ee358742.aspx

    I am not sure what code Andrei was using but if he sees this thread he may be able to supply that info.

    Steve Danielson [Microsoft]

    Tuesday, May 31, 2011 9:55 PM
  • In the Activity log, all I see is that the InitializeCorrelation is scheduled; then, after about 30s, a bunch of Activities are Faulted. I see no exception or error message that explains this.

    I don't see any errors in the tracking records that our workflow service outputs.

    I should explain our scenario, which causes this problem, as it is very specific.

    We have an Activity (let's call it ActivityFoo) which basically does the following:

    Sequence
        InitializeCorrelation (Correlate on: WorkflowId + ActivityFoo ActivityId)
        Send message to another service
        Parallel
            Parallel Branch 1
                Loop
                    TransactedReceiveScope(Receive[Update]) (MSMQ) status messages
            Parallel Branch 2
                TransactedReceiveScope(Receive[Finish]) (MSMQ) completion messages
        Exit Parallel
        Persist

    We have a Workflow Service which basically does the following:

    Workflow Instance created by custom WorkflowHostingEndpoint, via MSMQ
    Sequence
        ActivityFoo1
        ActivityFoo2
        ActivityFoo3
        ActivityFoo4
    ...

    Since ActivityFoo correlates on WorkflowId and ActivityFoo ActivityId, each call to ActivityFoo will result in a different correlation key. So we should have no problem.
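To illustrate the reasoning above, here is a small sketch of how combining WorkflowId and ActivityId yields a distinct key per ActivityFoo call. The key format, the sample GUID and the use of activity names as IDs are assumptions made for the example; this is not how the runtime actually computes correlation keys.

```python
import uuid


def correlation_key(workflow_id, activity_id):
    """Combine the two values the activity correlates on into one key."""
    return f"{workflow_id}:{activity_id}"


# Hypothetical workflow instance id and activity ids.
workflow_id = uuid.UUID("00000000-0000-0000-0000-000000000001")
activity_ids = ["ActivityFoo1", "ActivityFoo2", "ActivityFoo3", "ActivityFoo4"]

keys = [correlation_key(workflow_id, activity_id) for activity_id in activity_ids]

# Within one workflow instance, every ActivityFoo call gets its own key.
all_distinct = len(set(keys)) == len(keys)
print(keys[0], all_distinct)
```

The same holds across workflow instances as long as the WorkflowId differs, which is why no key collision is expected in this design.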

    In order to have failover and increased capacity, we run two deployments of this workflow service on two different servers. These two deployments read from the same MSMQ queues. The InstanceLock problem is handled through the TransactedReceiveScope and the retry mechanism available in the netMsmqBinding configuration. It is possible that ActivityFoo completes before processing all its Update messages; in this case the Update messages should go through the retry cycles and eventually be thrown away.
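The intended fate of late Update messages (retry a few times, then discard) can be sketched roughly as follows. This is a toy model, not the netMsmqBinding retry mechanism: the function names, the use of LookupError to mean "nothing is waiting on this key" and the sample keys are all hypothetical.

```python
def deliver_with_retries(message, handler, max_attempts):
    """Call handler until it succeeds or max_attempts is used up.

    Returns the number of attempts made on success, or None if the
    message should be discarded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return attempt
        except LookupError:
            # Hypothetical stand-in for "no instance is waiting on this key".
            continue
    return None


# Keys whose ActivityFoo has already completed (hypothetical sample data).
completed_keys = {"wf-1:ActivityFoo1"}
processed = []


def handle_update(message):
    """Accept an Update only if its ActivityFoo is still running."""
    if message["key"] in completed_keys:
        raise LookupError(message["key"])
    processed.append(message["key"])


late = deliver_with_retries({"key": "wf-1:ActivityFoo1"}, handle_update, 3)
live = deliver_with_retries({"key": "wf-1:ActivityFoo2"}, handle_update, 3)
print(late, live)
```

In this model a late message costs a fixed number of failed attempts before it is dropped, and it never affects the message for the ActivityFoo that is currently running.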

    In the single-server scenario, we see no problems. In the dual-server scenario, we observe the following:

    • First ActivityFoo always finishes successfully.
    • One of the later ActivityFoo calls (which one seems to be random) stalls on InitializeCorrelation for about 20-30 seconds before the workflow instance is restarted from the last persistence point. After the workflow instance is restarted, it stalls again at the exact same point.
    • After some number of stall-restart cycles, the workflow instance eventually continues on. Eventually, the workflow instance finishes, but it takes much longer due to the stalling.
    • While this is happening, querying the InstancesTable in the persistence database is extremely slow, even though there is only 1 record in that table.
    • A TryCatch around InitializeCorrelation catches no exception.
    • In the Activity log, there are no helpful errors logged. There's the log entry that the InitializeCorrelation was scheduled, then after 30s, the next log entries are that a bunch of activities have completed in the Faulted state.
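One way to pin down the stalls listed above is to scan the log for large gaps between consecutive entries. The sketch below assumes the log has already been parsed into (timestamp, text) pairs; the timestamps, the entry texts and the 20-second threshold are made up for the example.

```python
from datetime import datetime, timedelta


def find_stalls(entries, threshold=timedelta(seconds=20)):
    """Return (previous_text, next_text, gap) for every gap above threshold."""
    stalls = []
    for (prev_time, prev_text), (next_time, next_text) in zip(entries, entries[1:]):
        gap = next_time - prev_time
        if gap > threshold:
            stalls.append((prev_text, next_text, gap))
    return stalls


# Hypothetical log entries shaped like the ones described in the list above.
log = [
    (datetime(2011, 6, 1, 11, 0, 0), "InitializeCorrelation scheduled"),
    (datetime(2011, 6, 1, 11, 0, 30), "ActivityFoo2 completed in state Faulted"),
    (datetime(2011, 6, 1, 11, 0, 31), "Sequence completed in state Faulted"),
]

stalls = find_stalls(log)
for before, after, gap in stalls:
    print(f"{gap.total_seconds():.0f}s between '{before}' and '{after}'")
```

Running this over the logs from both servers would show whether the stalls on one server line up with activity on the other.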

    Could it be that the InitializeCorrelation hits the persistence database, and there's a deadlock happening in it?

    Or perhaps there's a problem running multiple InitializeCorrelation activities in the same workflow instance, even though the correlation keys are different? But then this should be a problem for the single server scenario.

    Or, could it be that when Update messages are in the queue after the ActivityFoo completes and the next ActivityFoo starts, that this could cause problems?

    Update 1

    Initially, we were testing with timeToUnload set to 0, because we wanted the workflow instances to persist immediately, so that any server could pick up the workflow instance once it went idle.

    However, setting timeToUnload to 10s seems to improve the situation.

    Update 2

    Setting timeToUnload to 10s worked well for 1 workflow instance. But with 10 concurrent instances, the same issue occurs.

    Update 3

    I increased timeToUnload to 1 minute to see if this would improve things, but with no luck. I noticed that this problem seems to occur only when the queries to the persistence database take a very long time.


    • Edited by Mas 2112 Monday, June 6, 2011 8:52 AM Update 3
    Wednesday, June 1, 2011 11:16 AM