Why don't persisted workflows with correlation actually persist after server reboot?

  • Question

  • According to how persistence and correlation are advertised for Windows Workflows, this should be a simple problem:

    I have a simple Windows Workflow Foundation (WF) 4.5 workflow hosted as a WCF service in IIS. I've set up persistence using the SQL Server instance store provided by Microsoft.

    The workflow accepts a document ID and a boolean indicating whether it needs to pause and wait on human activity (review by a human). If the workflow needs to wait, it has a Receive() activity that correlates on the document ID and pauses (which creates a bookmark behind the scenes). Otherwise it runs to completion and routes the document. It's my understanding that reaching the Receive makes the workflow go idle and, depending on how I've configured my "Time To Persist" setting (I've got an aggressive value of one second for testing), the workflow should then be persisted.
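
    For reference, a setup like the one described is usually wired up in web.config along the following lines. This is a minimal sketch, not the poster's actual file: the connection string name `WorkflowPersistence`, the database name, and the `timeToUnload` value are assumptions for illustration. Only the one-second `timeToPersist` comes from the description above.

    ```xml
    <configuration>
      <connectionStrings>
        <!-- Hypothetical name and database; point this at your own instance store. -->
        <add name="WorkflowPersistence"
             connectionString="Data Source=.;Initial Catalog=WorkflowInstanceStore;Integrated Security=True" />
      </connectionStrings>
      <system.serviceModel>
        <behaviors>
          <serviceBehaviors>
            <behavior>
              <!-- Use the SQL Server instance store for persistence. -->
              <sqlWorkflowInstanceStore connectionStringName="WorkflowPersistence" />
              <!-- Persist one second after the instance goes idle;
                   the unload delay here is an assumed value. -->
              <workflowIdle timeToPersist="00:00:01" timeToUnload="00:01:00" />
            </behavior>
          </serviceBehaviors>
        </behaviors>
      </system.serviceModel>
    </configuration>
    ```

    Note that `timeToPersist` only has an effect when it is shorter than `timeToUnload`, and neither timer starts until the instance is actually idle.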

    Everything works perfectly as long as we don't reboot the server or do anything like recycling the app pool for the service.  In cases where the workflow is waiting on human input, it pauses and waits for a message with the same Document ID, and resumes upon receipt, running to completion.

    However, a long-running workflow in the real world has to survive restarts. When we simulate a server crash by rebooting or recycling the app pool, the workflows waiting at Receive() never respond. Are we supposed to be doing something special to "rehydrate" the workflows after the server comes back up? Does correlation not work for workflows that are persisted? This is really puzzling.

    • Edited by MarcAaron Monday, January 11, 2016 8:17 PM
    Friday, January 8, 2016 8:16 PM

All replies

  • MarcAaron,

    I tried to reproduce your problem without success. Here are the steps I took:

    Hosted a simple workflow service that is a Sequence with 3 Receive activities. The first Receive is marked as "CanCreateInstance = true"; it initializes a correlation handle with the workflow's instance id and returns that instance id to the caller.

    Receive #2 correlates with the correlation handle from Receive #1 and accepts a Guid that is used as the "correlates on" value.

    Receive #3 does the same thing as Receive #2.

    I activated the service with Receive #1. I then recycled the AppPool and invoked Receive #2 and Receive #3 with the instance id that was returned from the initial Receive #1. I invoked 2 and 3 from different invocations of the client process.

    I am thinking you may have something wrong with your correlations or with your activation configuration. Are you sure the AppPool is getting restarted after your reboot?

    What sort of error is being returned on the failures for the "subsequent correlated" requests?

    Jim

    Monday, January 11, 2016 10:24 PM
  • Jim,

    Thank you so much for the response.  We don't receive any errors at all, and correlation works exactly as expected as long as we don't simulate the reboot by recycling the app pool.  Unless we simulate the reboot, the workflow sits and waits at Receive #2 until we send a second message with a matching Document ID that correlates to the Document ID in the message that started the workflow; at that point the workflow resumes. 

    If we simulate a reboot by recycling the app pool while the workflow waits at Receive #2, however, any subsequent messages with a matching document ID disappear "into the ether".  I'm sure the app pool is restarted... after we recycle, new messages sent to Receive #1, which can create an instance of the workflow, do indeed spin up a new workflow and things proceed as expected.  These new workflows correlate to new document IDs, etc.

    One thing I'm suspicious of, looking at the paused workflows in AppFabric, is that even with a very aggressive persist-on-idle value of one second, we never see the workflow appear in "Persisted Workflows".  It's my understanding that the Receive() activity creates a bookmark behind the scenes and puts the workflow into an idle state.  We are unable by any means to manually persist our workflow prior to the receive (we get error messages from the Persist activity if we try), but all the MSFT docs lead us to believe Receive #2 and an aggressive "Persist Instances When Idle" value should do this for us. 

    I'm also unsure, from reading the WF 4.5 documentation (I'm assuming you're testing in 4.5?), whether we need to do anything manually to load workflows after a server crash.

    I don't know if it's within forum rules to ask for sharing of code, but I'd be happy to provide an email address in any way specified by you if you'd rather send your sample code than responding to my inane questions.  However, perhaps the thought processes behind this can help others who might run into the same problem. 

    Monday, January 11, 2016 10:55 PM
  • MarcAaron,

    It sounds like your correlations are correct.

    Your description indicates to me that the instance is not being persisted/unloaded, even though the Receive #2 is executing and has created the bookmark.

    It's time to look at the workflow definition to see what other activities might be running "in parallel" with Receive #2 that are preventing the instance from becoming idle. As long as there is some activity that is not "blocked", the workflow instance is not considered idle and will not be persisted/unloaded.

    You can try enabling tracking to get an idea of which activities are being executed and when. You can use the EtwTrackingBehavior to get the tracking events written to the Windows event log.
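
    A tracking setup along these lines is one way to do that. It is a sketch under assumptions: the profile name `IdleDiagnostics` is made up, and the wildcard queries simply capture every instance and activity state so you can see which activity is still executing when Receive #2 is reached.

    ```xml
    <system.serviceModel>
      <tracking>
        <profiles>
          <!-- Hypothetical profile name; any name works if the behavior references it. -->
          <trackingProfile name="IdleDiagnostics">
            <workflow activityDefinitionId="*">
              <!-- Instance-level events such as Started, Idle, Persisted, Unloaded. -->
              <workflowInstanceQueries>
                <workflowInstanceQuery>
                  <states>
                    <state name="*" />
                  </states>
                </workflowInstanceQuery>
              </workflowInstanceQueries>
              <!-- State changes for every activity (Executing, Closed, and so on). -->
              <activityStateQueries>
                <activityStateQuery activityName="*">
                  <states>
                    <state name="*" />
                  </states>
                </activityStateQuery>
              </activityStateQueries>
            </workflow>
          </trackingProfile>
        </profiles>
      </tracking>
      <behaviors>
        <serviceBehaviors>
          <behavior>
            <!-- Emit tracking records for the profile above through ETW. -->
            <etwTracking profileName="IdleDiagnostics" />
          </behavior>
        </serviceBehaviors>
      </behaviors>
    </system.serviceModel>
    ```

    If the instance never reports an Idle state after Receive #2 starts executing, something else in the workflow is still holding it open.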

    Jim

    Tuesday, January 12, 2016 2:35 AM
  • I THINK I FOUND THE ANSWER.   Or at least an answer.

    When you use Visual Studio 2013 to create a new WCF Workflow Service Application, the template app Visual Studio generates contains paired Receive and Send activities.  (For reference, this first receive, which we've been calling "Receive #1" throughout this thread, is the receive activity that has its "CanCreateInstance" property set to true, allowing it to create a new workflow.)

    It seems to be the paired Send() activity that's keeping my workflow from ever going idle (and thus persisting when the second Receive() activity is hit). 

    If I delete the Send from the workflow, it does indeed go idle when the second receive is encountered.  I can simulate a server reboot, by recycling the app pool, and when I send my second (correlated) message to the workflow, that workflow picks up where it was waiting before the server reboot.

    This seems to hark back to what Jim was saying about activities not being complete.  I'm guessing from this behavior that the prewired Receive and Send inserted by the Visual Studio template for this kind of project are treated as an atomic unit that needs to complete before the workflow can idle? 

    And this could all mean I embarrassingly built my first workflow from the Visual Studio template incorrectly.  I put my activities between the initial Receive #1 and its corresponding pre-wired Send, thinking that the Send meant "I am sending you the results of the workflow" when in fact it makes more sense to think of this as "I'm sending you a response to the message that kicked the workflow off".



    Tuesday, January 12, 2016 7:46 PM
  • I am glad you figured it out!

    This morning I was working on formulating a response that talked about no persist zones.

    In the designer toolbox, there is a "ReceiveAndSendReply", which, as you point out, plops a Sequence that contains a Receive activity followed by a SendReply activity. These two are correlated together with a "request-reply correlation" and they are associated with a single message exchange. If you put other activities between the Receive and the correlated Send, these activities are executed before the reply is sent. While a message is "outstanding" (between the Receive and its correlated SendReply), these activities are in an implicit no persist zone - we don't want to persist while we are still processing a message.

    Jim

    Tuesday, January 12, 2016 8:39 PM
  • I appreciate you sticking with this thread and providing all the great suggestions Jim.  It was your thoughts on parallel activities that got me to thinking "is there anything else I might be doing that could be blocking this?"

    Mea culpa on splitting the initial ReceiveAndSendReply provided by the template project and preventing the idle/persist.  I didn't realize the two activities were wired together, so to speak, until I really got to looking at what had been placed in my project for me.  At first glance, the space between the two activities looked like "insert code here"!

    Thank you again,

    Mark

    Tuesday, January 12, 2016 8:45 PM
  • MarcAaron,

    It is perfectly fine to insert activities between the Receive and SendReply. That is where you do the work to process the message. But you can't persist in there. Including another Receive there typically doesn't make sense, since the "client" is probably waiting for a reply to the first request before moving on, although that is not a hard-and-fast rule, given asynchronous processing.

    Glad to help.

    I have marked your earlier "I figured it out" response as "propose as answer". Please do the "Mark as answer" on it to indicate that this thread is complete.

    Thanks.

    Jim

    Tuesday, January 12, 2016 8:52 PM