Stateful Reliable Services memory leak in non-paged pool

  • Question

  • I am running a Service Fabric application on a cluster in Azure. The cluster has two scale sets:

    • 4x B2ms nodes where a stateful service type is placed with Placement Constraints (Primary Scale Set)
    • 2x F1 nodes where a stateless service type is placed.

    There are two types of services in the application

    • WebAPI - stateless service used for receiving statuses from a system via HTTP and sending them to the StatusConsumer.
    • StatusConsumer - stateful service which processes the statuses and keeps the last one. A service instance is created for each system. It communicates via Remoting V2.

    Using a tool, I simulate 1000 systems. Each system sends its status every 30 seconds via an HTTP request to the stateless service.

    In the beginning of the test the RAM used by each node in the stateful scale set is around 40%. The server response time of the stateful services, observed in Application Insights, is around 15ms. 

    Immediately after the start of the test and the creation of all stateful services, the RAM usage starts gradually increasing. After an hour or so the RAM usage reaches 99% and the server response time is 5-10 seconds. After 2-3 minutes the RAM usage drops and the server response time returns to normal. However, the RAM usage then starts growing again and the whole cycle repeats. This happens without any action on my side.

    When I stop the simulator which sends the HTTP requests to the cluster the RAM usage immediately drops to the idle 40% and stays flat.

    During the peak RAM usage I can observe very high non-paged pool values: 6.5-7GB out of the 8GB total RAM on the machine. When using poolmon the top three pool tags are KLog (~4.3GB), KBuf and KTLL, but I cannot find any information about them.

    When the RAM usage drops the non-paged pool is around 1.5GB. 

    I have removed almost all functionality from the microservice in order to find the problem. This is the method called when a status is received in the stateful service: 

    public async Task PostStatus(SystemStatusInfo status)
    {
        try
        {
            Stopwatch stopWatch = Stopwatch.StartNew();
            IReliableDictionary<string, SystemStatusInfo> statusDictionary =
                await this.StateManager.GetOrAddAsync<IReliableDictionary<string, SystemStatusInfo>>("status");
            using (ITransaction tx = this.StateManager.CreateTransaction())
            {
                await statusDictionary.AddOrUpdateAsync(tx, "lastConsumedStatus", key => status, (key, oldValue) => status);
                await tx.CommitAsync();
            }
            if (stopWatch.ElapsedMilliseconds / 1000 > 4) // seconds
                Telemetry.TrackTrace($"Process Status Duration: {stopWatch.ElapsedMilliseconds / 1000} for {status.SystemId}", SeverityLevel.Critical);
        }
        catch (Exception e) { Telemetry.TrackException(e); }
    }

    Also every time I create a cluster I set the following settings via resources.azure.com:

    • SharedLogSizeInMB = 4096
    • WriteBufferMemoryPoolMinimumInKB = 16384
    • WriteBufferMemoryPoolMaximumInKB = 16384
    • MaxDiskQuotaInMB = 1024

    Settings in the Settings.xml file of the problematic stateful reliable service:

    • CheckpointThresholdInMB = 1
    • MaxAccumulatedBackupLogSizeInMB = 1

    How can I fix or avoid this?

    Tuesday, July 23, 2019 2:53 PM

All replies

  • I think this would best be handled via a technical support request. This will allow the engineers to review the backend configuration and look at internal metrics to help point you to the reason for the issue. 

    Do you have the ability to open a technical support ticket? If not, you can email me at AzCommunity@microsoft.com and provide me with your Azure SubscriptionID and link to this thread. I can then enable your subscription for a free support request. 

    Tuesday, July 23, 2019 8:00 PM
  • Thanks for the reply. I will open a support request and update this thread when there is progress.
    Wednesday, July 24, 2019 7:14 AM
  • Any update on this? 
    Friday, August 2, 2019 4:28 PM
  • Sorry for the late response. After talking to Azure Support and multiple tests I drastically reduced the memory consumption of the services.

    The main thing I learned from the communication with support was that it is really not a good idea to have a large number of services each containing a small amount of data! Memory dumps of the application showed that each service held roughly 20KB of actual data and around 700KB of Service Fabric logs of changes in its Reliable Collections. These may not be exact numbers, but the difference was huge.

    To reduce the number of services I combined the processing and saving of multiple systems' statuses into one service by using a kind of partitioning. I also tried using Actors. Both approaches worked well.
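    To illustrate the partitioning idea, here is a minimal sketch (the class and method names are mine, not from the thread): instead of one stateful service per system, many systems are mapped onto a fixed number of partitions of a single service via a deterministic hash of the system id.

    ```csharp
    using System;

    // Hypothetical helper: maps a system id onto one of N partitions so that
    // many systems share a single partitioned stateful service.
    public static class StatusPartitioning
    {
        // Deterministic FNV-1a hash over the characters of the id.
        // string.GetHashCode() is randomized per process in .NET, so it
        // must not be used as a partition key.
        public static long PartitionKeyFor(string systemId, int partitionCount)
        {
            unchecked
            {
                uint hash = 2166136261;
                foreach (char c in systemId)
                {
                    hash ^= c;
                    hash *= 16777619;
                }
                return hash % (uint)partitionCount;
            }
        }
    }
    ```

    With an Int64-range partitioned service, the stateless WebAPI could then resolve its remoting proxy with `new ServicePartitionKey(StatusPartitioning.PartitionKeyFor(status.SystemId, partitionCount))`, so each partition holds the last status of many systems in one Reliable Dictionary.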

    There are several other settings I used to reduce the memory consumption, but the big difference was made by changing the architecture of the services themselves:
    In the Settings.xml of the service itself:

    • CheckpointThresholdInMB = 1
    • LogTruncationIntervalSeconds = 1200 (setting this to less than 120 actually didn't do anything or made things worse; try values greater than 300)
    • MaxAccumulatedBackupLogSizeInMB = 1
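    For reference, these parameters go into the service's Settings.xml config package. A sketch, assuming the default `ReplicatorConfig` section name read by the ReliableStateManager (adjust if your service overrides it):

    ```xml
    <Settings xmlns="http://schemas.microsoft.com/2011/01/fabric">
      <!-- Section name assumed to be the default "ReplicatorConfig" -->
      <Section Name="ReplicatorConfig">
        <Parameter Name="CheckpointThresholdInMB" Value="1" />
        <Parameter Name="LogTruncationIntervalSeconds" Value="1200" />
        <Parameter Name="MaxAccumulatedBackupLogSizeInMB" Value="1" />
      </Section>
    </Settings>
    ```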

    In the code of the service itself:

    • ServicePointManager.DefaultConnectionLimit = 200
    • MaxConcurrentCalls = 512 (RemotingListener and Client)
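    A minimal sketch of where those two code-level settings go, assuming Remoting V2 over FabricTransport (namespaces may differ slightly between SDK versions; this is not runnable outside a Service Fabric service):

    ```csharp
    // Applied once at service startup.
    ServicePointManager.DefaultConnectionLimit = 200;

    // Raise the concurrency limit on the remoting listener...
    var listenerSettings = new FabricTransportRemotingListenerSettings
    {
        MaxConcurrentCalls = 512
    };
    // ...pass listenerSettings when constructing the
    // FabricTransportServiceRemotingListener, and mirror the value on the
    // client side via FabricTransportRemotingSettings.MaxConcurrentCalls.
    ```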

    Cluster Settings:

    • AutomaticMemoryConfiguration = 0 (as mentioned above, if you do not set this, the other settings won't take effect)
    • WriteBufferMemoryPoolMinimumInKB = 16384 (16 MB)
    • WriteBufferMemoryPoolMaximumInKB = 32768 (32 MB)
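    Via resources.azure.com these cluster settings map to the cluster resource's fabricSettings array, in the KtlLogger section. A sketch of the fragment (values from the list above, the *InKB settings in KB):

    ```json
    {
      "name": "KtlLogger",
      "parameters": [
        { "name": "AutomaticMemoryConfiguration", "value": "0" },
        { "name": "WriteBufferMemoryPoolMinimumInKB", "value": "16384" },
        { "name": "WriteBufferMemoryPoolMaximumInKB", "value": "32768" }
      ]
    }
    ```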

    See the GitHub issue for more details: azure/service-fabric-issues/issues/1523

    Friday, November 29, 2019 5:16 PM
  • Thanks for the update! I am now following the issue opened on the SF repo as well. 
    Monday, December 2, 2019 8:51 PM