BackupRestoreService cannot start

  • Question

  • I wanted to add the BackupRestore system service to my on-premises cluster, which was freshly upgraded to 6.4.

    I followed the instructions (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-backuprestoreservice-quickstart-standalonecluster) and added this section to the cluster configuration:

            "addonFeatures": ["BackupRestoreService"],
            "fabricSettings": [
                {
                    "name": "BackupRestoreService",
                    "parameters":  [
                        {
                            "name": "SecretEncryptionCertThumbprint",
                            "value": "398C7279CAEF57948E58AF6C3E779E47F08D6B12"
                        }
                    ]
                }

    It worked for the local dev cluster but failed miserably on a 3-node test cluster. The BackupRestore service crashes instantly, with the following errors being logged:

    Failed to open store '' at LocalMachine: E_INVALIDARG
    Failed to get the Certificate's private key. Thumbprint:398C7279CAEF57948E58AF6C3E779E47F08D6B12. Error: E_INVALIDARG
    Failed to get private key file. x509FindValue: 398C7279CAEF57948E58AF6C3E779E47F08D6B12, x509StoreName: , findType: FindByThumbprint, Error E_INVALIDARG


    Then 3 warnings are logged:

    SetCertificateAcls failed. ErrorCode: E_INVALIDARG
    Can't ACL BackupRestoreService/SecretEncryptionCertThumbprint, ErrorCode E_INVALIDARG
    Error at AclConfiguredCertificates, ErrorCode E_FAIL


    It looks like I was supposed to configure the store name, but the documentation does not mention any way to do that.

    The application log is flooded with entries like this:

    Application: FabricBRS.exe
    Framework Version: v4.0.30319
    Description: The application requested process termination through System.Environment.FailFast(string message).
    Message: RunAsync failed due to an unhandled exception causing the host process to crash: System.TypeInitializationException: The type initializer for 'System.Fabric.BackupRestore.Common.Constants' threw an exception. ---> System.FormatException: Input string was not in a correct format.
       at System.Number.ParseDouble(String value, NumberStyles options, NumberFormatInfo numfmt)
       at System.Fabric.BackupRestore.Common.Constants.GetRuntimeVersion()
       at System.Fabric.BackupRestore.Common.Constants..cctor()
       --- End of inner exception stack trace ---
       at System.Fabric.BackupRestore.Common.BaseStore`2..ctor(IReliableDictionary`2 reliableDictionary, StatefulService statefulService, String traceType)
       at System.Fabric.BackupRestore.Common.WorkItemStore.<CreateOrGetWorkItemStore>d__4.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at System.Fabric.BackupRestore.Common.BaseWorkItemQueue..ctor(IReliableQueue`1 workItemReliableQueue, Int32 maxWaitTimeInMinutes, WorkItemQueueRunType workItemQueueRunType, String traceType, StatefulService statefulService)
       at System.Fabric.BackupRestore.Common.WorkItemQueue.<CreateOrGetWorkItemQueue>d__3.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at System.Fabric.BackupRestore.Common.WorkItemHandler.<StartAndScheduleWorkItemHandler>d__6.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at System.Fabric.BackupRestore.Service.BackupRestoreService.<RunAsync>d__10.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.<ExecuteRunAsync>d__23.MoveNext()
    Stack:
       at System.Environment.FailFast(System.String)
       at System.Threading.Tasks.Task.Execute()
       at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
       at System.Threading.Tasks.Task.ExecuteEntry(Boolean)
       at System.Threading.ThreadPoolWorkQueue.Dispatch()

    I want to add that I have successfully used the configured certificate to decrypt secrets in an application deployed to that cluster.

    I also checked, using PowerShell, that the certificate's private key ACL has been configured for NETWORK SERVICE:

    Access : NT AUTHORITY\SYSTEM Allow  FullControl
             NT AUTHORITY\NETWORK SERVICE Allow  FullControl
             BUILTIN\Administrators Allow  FullControl

    Any ideas?


    Saturday, December 8, 2018 6:45 PM

Answers

All replies

  • I took a look at the doc you referenced and it seems there are two view options: one for standalone clusters and one for Azure clusters.

    Did you check the "Clusters on Azure" section before deploying? This might be the cause of the issue.

    Monday, December 10, 2018 8:17 PM
  • The link I provided takes you to the doc for standalone clusters.

    If you choose to switch over to clusters on Azure, two things happen:

    1. The link changes to https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-backuprestoreservice-quickstart-azurecluster

    2. You'll notice that there is no new information: steps 2 and 3 of "Enabling backup and restore service" are exactly the same.

    Tuesday, December 11, 2018 9:41 AM
  • Got it. Sorry about that. 

    You mentioned it worked on your dev cluster. Can you tell me the differences between the dev cluster and the 3-node cluster where you are seeing the issue?

    I assume the dev cluster is on-premises, and that you then deployed a 3-node cluster to Azure and saw the issue there.

    Wednesday, December 12, 2018 10:40 PM
  • It worked on the local dev cluster - on my laptop :-) running Windows 10.

    There, the BackupRestoreService is up and running according to Service Fabric Explorer.

    It does not work on a standalone test cluster (on premises) hosted on 3 VMs running Windows Server 2016 Core.

    What bothers me is that the local cluster (dev box) seems to be more like the Azure one than like the standalone one. For example, it happily runs the Event Store Service, which is now said to run only on Azure clusters (not supported on standalone clusters).

    Nevertheless, it does not work on the standalone (on-premises) cluster.


    Thursday, December 13, 2018 4:19 PM
  • Thanks for all that. Hard to say what the issue is without digging into the environment details and logs. If you like, we can get you in touch with a Service Fabric Support Engineer to help you take a look and get it sorted out? You can email me at AzCommunity@microsoft.com and provide me with your SubscriptionID and link to this thread and I can work to enable your subscription for that request. 
    Thursday, December 13, 2018 8:47 PM
  • I reported an issue on GitHub, and it turned out there is currently a bug in BRService when it runs on a server with a culture other than en-US.

    https://github.com/Microsoft/service-fabric/issues/263
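
    For anyone wondering how a server culture can lead to the System.FormatException in the stack trace above (thrown by Number.ParseDouble inside Constants.GetRuntimeVersion), here is a rough Python sketch of culture-sensitive number parsing. It is only a toy model, not the .NET algorithm and not the actual BackupRestoreService code: the parse_double helper, the "6.4" input string, and the comma-decimal culture are all illustrative assumptions.

```python
def parse_double(text: str, decimal_sep: str) -> float:
    """Toy model of culture-sensitive parsing (hypothetical helper).

    Only `decimal_sep` is accepted as the decimal point; any other
    non-digit character makes the input invalid for that "culture".
    """
    allowed = set("0123456789+-" + decimal_sep)
    if (
        not text
        or any(ch not in allowed for ch in text)
        or text.count(decimal_sep) > 1
    ):
        raise ValueError(f"Input string was not in a correct format: {text!r}")
    # Normalize to the separator Python's float() understands.
    return float(text.replace(decimal_sep, "."))


if __name__ == "__main__":
    version_text = "6.4"  # assumed example of a dotted version-like string

    # A culture that uses "." as the decimal separator parses it fine.
    print(parse_double(version_text, "."))

    # An assumed culture with "," as the decimal separator rejects it.
    try:
        parse_double(version_text, ",")
    except ValueError as err:
        print("failed:", err)
```

    In this sketch the same string parses or fails depending only on the separator the "culture" expects. In .NET, passing CultureInfo.InvariantCulture to double.Parse is the usual way to make such parsing independent of the server's regional settings.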

    Monday, December 24, 2018 10:42 AM