System.Runtime.Persistence.InstanceOwnerException causing WorkflowServiceHost Instability

  • Question

  • I have a Windows Service running about seven WorkflowServiceHost instances, each hosting a workflow service. About 90% of the time they work fine, but the services fault at random. I've had to add code to the Windows Service that handles faulted hosts by aborting and restarting them, and I've wrapped my WCF client so that it automatically retries calls in case a service is faulted. I handle the faults and restart the services in the WorkflowServiceHost.Faulted event. One problem is that the Faulted event carries no details about what caused the fault: no exception, nothing. Many times workflows fault because of bugs in the workflow design itself, and those exceptions are captured by workflow tracking. However, my logs also show occasional service faults with no additional exception data. Today I hit one in the debugger for the first time and found that there is indeed an exception inside the host object, but it's held in a private property! This is the exception I'm getting:

        [System.Runtime.Persistence.InstanceOwnerException]: {"The execution of an InstancePersistenceCommand was interrupted because the instance owner registration for owner ID 'b01ebe12-7246-4aa4-881c-dd97439b8264' has become invalid. This error indicates that the in-memory copy of all instances locked by this owner have become stale and should be discarded, along with the InstanceHandles. Typically, this error is best handled by restarting the host."}
        Data: {System.Collections.ListDictionaryInternal}
        HelpLink: null
        InnerException: null
        Message: "The execution of an InstancePersistenceCommand was interrupted because the instance owner registration for owner ID 'b01ebe12-7246-4aa4-881c-dd97439b8264' has become invalid. This error indicates that the in-memory copy of all instances locked by this owner have become stale and should be discarded, along with the InstanceHandles. Typically, this error is best handled by restarting the host."
        Source: "System.Runtime"
        StackTrace: "   at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result)\r\n   at System.ServiceModel.Activities.Dispatcher.DurableInstanceManager.GetInstanceAsyncResult.HandleEndLoad(IAsyncResult result)\r\n   at System.Runtime.AsyncResult.SyncContinue(IAsyncResult result)\r\n   at System.ServiceModel.Activities.Dispatcher.DurableInstanceManager.GetInstanceAsyncResult.GetInstance()\r\n   at System.ServiceModel.Activities.Dispatcher.DurableInstanceManager.GetInstanceAsyncResult..ctor(DurableInstanceManager instanceManager, InstanceKey instanceKey, ICollection`1 additionalKeys, WorkflowGetInstanceContext parameters, TimeSpan timeout, AsyncCallback callback, Object state)\r\n   at System.ServiceModel.Activities.Dispatcher.ControlOperationInvoker.ControlOperationAsyncResult.Process()"
        TargetSite: {TAsyncResult End[TAsyncResult](System.IAsyncResult)}


    After one service faults with this exception, every other workflow faults the first time it is accessed.  What would cause this exception?  How can I prevent this from happening?
    Tuesday, January 12, 2010 5:44 PM

All replies

  • You could see this because the SQL store was unable to renew its locks in the database in time. This is related to the lock renewal period that you can configure when you create the store.

    1. What value do you set HostLockRenewalPeriod to? The default is 30 seconds.
    2. Is there considerable load on SQL when you see this happening?

    In your scenario, since you are using NT services to host WFSH, you are not using WMS to do instance recovery, so you must be managing this manually? You could potentially set HostLockRenewalPeriod to a larger value.

    Tuesday, January 12, 2010 5:58 PM
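
    The suggestion above to raise HostLockRenewalPeriod can be sketched in code roughly as follows. This is a minimal sketch, assuming the persistence behavior is added programmatically rather than in config (adding a second SqlWorkflowInstanceStoreBehavior to a host that already has one from config will fail). The class name, method names and the 2-minute value are illustrative assumptions, not recommendations from this thread.

```csharp
using System;
using System.ServiceModel.Activities;
using System.ServiceModel.Activities.Description;

// Hypothetical helper; the names here are illustrative only.
public static class InstanceStoreSetup
{
    // Builds the SQL persistence behavior with a longer lock renewal period.
    // The 2-minute value is an arbitrary example, not a tuned setting.
    public static SqlWorkflowInstanceStoreBehavior CreateStoreBehavior(string connectionString)
    {
        return new SqlWorkflowInstanceStoreBehavior(connectionString)
        {
            HostLockRenewalPeriod = TimeSpan.FromMinutes(2),            // default is 30 seconds
            RunnableInstancesDetectionPeriod = TimeSpan.FromSeconds(5)  // default is 5 seconds
        };
    }

    // Attach the behavior before the host is opened.
    public static void AddPersistence(WorkflowServiceHost host, string connectionString)
    {
        host.Description.Behaviors.Add(CreateStoreBehavior(connectionString));
    }
}
```

    Whether a larger value actually removes the fault depends on why the renewal is late, so it is worth checking SQL load and connectivity at the time of the fault as well.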
  • I'm getting the same problem when the server (SQL Azure) is not under load. The duration is set to 30 seconds. Is that unreasonable?
    Monday, October 25, 2010 11:50 AM
  • Ryan, would it be possible for you to post the code in your Faulted event that restarts your workflows? We are also seeing sporadic workflow faults, and the documentation on handling them gracefully is sparse. Many thanks.
    Wednesday, December 1, 2010 6:50 PM
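
    The original poster's code is not shown in this thread. The following is only a minimal sketch of the abort-and-restart approach described in the question, assuming the caller supplies a factory that builds a fully configured WorkflowServiceHost. The class RestartingWorkflowHost and its members are hypothetical names. A faulted host cannot be reopened, so the sketch discards it and creates a new one.

```csharp
using System;
using System.ServiceModel;
using System.ServiceModel.Activities;

// Hypothetical helper: owns one WorkflowServiceHost and replaces it when it faults.
public sealed class RestartingWorkflowHost
{
    private readonly Func<WorkflowServiceHost> createHost; // caller-supplied factory (assumption)
    private readonly object sync = new object();
    private WorkflowServiceHost host;
    private bool stopped;

    public RestartingWorkflowHost(Func<WorkflowServiceHost> createHost)
    {
        if (createHost == null) throw new ArgumentNullException("createHost");
        this.createHost = createHost;
    }

    public void Start()
    {
        lock (sync) { stopped = false; OpenNewHost(); }
    }

    public void Stop()
    {
        lock (sync)
        {
            stopped = true;
            if (host == null) return;
            host.Faulted -= OnFaulted;
            try
            {
                // Close() throws on a faulted host, so only close an opened one.
                if (host.State == CommunicationState.Opened) host.Close();
                else host.Abort();
            }
            catch (CommunicationException) { host.Abort(); }
            catch (TimeoutException) { host.Abort(); }
            host = null;
        }
    }

    private void OpenNewHost()
    {
        WorkflowServiceHost candidate = createHost();
        try { candidate.Open(); }
        catch { candidate.Abort(); throw; }
        // Subscribe only after Open succeeds, so a failed Open cannot re-enter OnFaulted.
        candidate.Faulted += OnFaulted;
        host = candidate;
    }

    private void OnFaulted(object sender, EventArgs e)
    {
        lock (sync)
        {
            var faulted = (WorkflowServiceHost)sender;
            faulted.Faulted -= OnFaulted;
            faulted.Abort(); // a faulted host cannot be reopened; discard it
            if (stopped || !ReferenceEquals(faulted, host)) return;
            host = null;
            try { OpenNewHost(); }
            catch (Exception ex)
            {
                // Open can fail again (for example if SQL is unreachable); log and give up here.
                Console.Error.WriteLine("Restart failed: " + ex);
            }
        }
    }
}
```

    Usage would look like `new RestartingWorkflowHost(() => BuildHost()).Start();`, where BuildHost is your own method. The sketch makes a single restart attempt per fault; a production service would probably want a retry delay and logging of its own.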
  • Improper use of correlation in a workflow service can cause this problem.

    The correlation key value should be unique. The key is stored in the Keys table of the SQL persistence store. Any failure during processing should terminate the instance and remove its keys.

    If the keys are not removed and the same value is later passed for the key, this error occurs.

    Example:

    In a data contract:

    [DataMember] Name
    [DataMember] Age
    [DataMember] Id

    Here, Id is set as the correlation key, so it is stored in the Keys table for an instance.

    A failure in that instance does not remove the key (ideally it should be configured to be removed), so any subsequent request that passes the same Id value causes a key collision.

    The best approach is to make sure the correlation key is set to a unique value every time; a Guid is the better choice.

     

    Wednesday, December 29, 2010 4:56 PM
  • I'm getting similar exceptions all the time.

    The workflow instance goes into an aborted state. If I restart my workflow server, the instance resumes and starts running again.

    But I don't want to restart the workflow server every time one of my workflow instances is aborted because of this exception.

    What is the right solution for this issue? I have a bunch of long-running workflows, and this exception is really becoming a showstopper.

    I have set hostLockRenewalPeriod to 00:00:05 and runnableInstancesDetectionPeriod to 00:00:02. I also tried the default values of 30 seconds and 5 seconds and still saw the same exception being thrown.

    Friday, October 21, 2011 5:26 PM
  • In the same boat; this happens intermittently.

    I have read in some post that trying to load a workflow that has already completed might lead to this.

    I have added a check in my code for this condition (making sure the workflow is active and not complete before trying to load it), and I am currently not running into the problem.

    But I will be following this thread closely to see if I get more information on this.

    Saturday, December 3, 2011 12:47 AM
  • For people who are hitting this issue and wondering whether it's a bug or not, I think the key deciding factors are:

    - Are you using WorkflowServiceHost or a custom host?
    - If a custom host, do you make calls to the instance store to renew the lock, using ExtendLockCommand?
    - How often? [It should be more often than hostLockRenewalPeriod.]
    - What value of hostLockRenewalPeriod did you set on the instance store?
    - How long was the faulting workflow instance loaded in memory from the database and executing before the issue occurred? Does it correspond to the value of hostLockRenewalPeriod? For example, is it 30 seconds? 5 minutes? An hour?

    Tim

    Saturday, December 3, 2011 8:13 AM
  • I am working with M. SNathan on this. We are using WorkflowServiceHosts in a .NET application. What is the ExtendLockCommand? I don't see it documented anywhere.
    Monday, January 16, 2012 9:08 PM
  • Please treat the bit about ExtendLockCommand as just a wrong question. That is not actually a public API, it is part of the implementation of HostLockRenewalPeriod. I need to revise the questions at some point. Still very curious as to what data people can add on this thread.
    Tim

    Tuesday, January 17, 2012 2:22 AM
  • I have found one other way to cause this exception, which I can reproduce reliably using WorkflowApplication as the host (I have not tried it with WorkflowServiceHost): run DeleteWorkflowOwnerCommand against the SqlWorkflowInstanceStore but forget to unset DefaultInstanceOwner on it. Any subsequent command then gets the exception seen here.
    Tuesday, May 8, 2012 7:40 AM
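
    The pitfall described above can be avoided by always pairing the owner deletion with clearing DefaultInstanceOwner. This is a minimal sketch, assuming a self-hosted WorkflowApplication scenario where you register the owner yourself. The InstanceOwnerLifetime class and its method names are hypothetical; the commands and the DefaultInstanceOwner property are the ones named in the post.

```csharp
using System;
using System.Activities.DurableInstancing;
using System.Runtime.DurableInstancing;

// Hypothetical helper that keeps owner registration and cleanup together.
public static class InstanceOwnerLifetime
{
    // Registers a workflow owner and makes it the store's default owner.
    // The returned handle must be passed to ReleaseOwner later.
    public static InstanceHandle RegisterOwner(InstanceStore store, TimeSpan timeout)
    {
        if (store == null) throw new ArgumentNullException("store");
        InstanceHandle handle = store.CreateInstanceHandle();
        InstanceView view = store.Execute(handle, new CreateWorkflowOwnerCommand(), timeout);
        store.DefaultInstanceOwner = view.InstanceOwner;
        return handle;
    }

    // Deletes the owner and always clears DefaultInstanceOwner, so later
    // commands do not run against an owner registration that no longer exists.
    public static void ReleaseOwner(InstanceStore store, InstanceHandle handle, TimeSpan timeout)
    {
        if (store == null) throw new ArgumentNullException("store");
        if (handle == null) throw new ArgumentNullException("handle");
        try
        {
            store.Execute(handle, new DeleteWorkflowOwnerCommand(), timeout);
        }
        finally
        {
            store.DefaultInstanceOwner = null;
            handle.Free();
        }
    }
}
```

    Putting the reset in a finally block means the stale owner is cleared even when the delete command itself throws, which is the case most likely to be missed.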
  • Hey there, has anyone made any progress with this issue? I am seeing the same problem in a production application for a client, so this has become a major problem for me.

    Tuesday, October 16, 2012 1:28 AM