Memory Leak in DSS Subscription Across the Nodes in Microsoft.Ccr.Core.Arbiters.ReceiverTask[] (potential bug?)

  • Question

  • Hi

    It seems that the DSS subscription model has a fault/bug where received notifications cause the array Microsoft.Ccr.Core.Arbiters.ReceiverTask[] to grow without limit. The memory leak is small compared to the overall memory usage, but the performance hit is tremendous. After a few days of running, the dsshost delay increases from less than a millisecond to a few minutes and my system becomes unstable.

    I have investigated the issue and it seems that it happens only for binary subscriptions across DSS nodes using dssp.tcp; it does not happen for subscriptions over http or for binary subscriptions within the same node.

    There is a similar thread on the subject here (My code runs fine on DSS 2008 RTM, causes a pileup of Receiver<IAsyncResult> on DSS 2008 R3. Regression or my bug?), but I am using a different version of MSRDS and, moreover, there is no answer or enough detail there to say that we reported the same issue.

    I also prepared two short projects (PingService and PongService) to illustrate the issue; they are available on my OneDrive: PingPong.zip (31 KB)

    It is important to start both environments with the /t switch, like

    /p:40000 /t:40001 /m:"projects/Pong/Pong/Pong.manifest.xml"

    and

    /p:60000 /t:60001 /m:"projects/Ping/Ping/Ping.manifest.xml"
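
    For clarity, the full commands look something like this (dsshost.exe is the DSS host executable; /p sets the node's HTTP port and /t its dssp.tcp port):

    dsshost.exe /p:40000 /t:40001 /m:"projects/Pong/Pong/Pong.manifest.xml"
    dsshost.exe /p:60000 /t:60001 /m:"projects/Ping/Ping/Ping.manifest.xml"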

    How does it work? PingService sends a message to PongService, and PongService produces a notification. The notification received by PingService is handled, and a new message is posted back to PingService. This message is consumed, resulting in yet another message to PongService and another notification. With each notification cycle an element is added to Microsoft.Ccr.Core.Arbiters.ReceiverTask[], resulting in a small memory increase (a memory profiler is required to see this clearly) and in performance degradation. After 100000 cycles the memory increase is clearly visible in the profiler, and so is the increased delay in notifications.
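
    In code the cycle looks roughly like this (a sketch of the fragment inside the PingService class; apart from pongNotify and PongNotifyHandler, which also appear later in this thread, names such as DoPing, _mainPort and _pongPort are illustrative and not the exact PingPong.zip code):

    // Persistent receiver for Pong notifications, activated once at service start.
    Activate(Arbiter.Receive<pong.Pong>(true, pongNotify, PongNotifyHandler));

    // Each notification posts a new internal message back to PingService...
    void PongNotifyHandler(pong.Pong notification)
    {
        _mainPort.Post(new DoPing());
    }

    // ...and the internal handler sends the next request to PongService,
    // which raises the next notification, closing the cycle.
    void DoPingHandler(DoPing doPing)
    {
        _pongPort.Post(new pong.Ping());   // illustrative request operation on the Pong proxy port
    }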

    It doesn't happen within the same DSS node or with the http protocol. However, dssp is very important for me because of its performance advantage, and DSS node isolation is also crucial for me. I would be glad for any advice, or perhaps for a link to older versions of Microsoft.Ccr.Core.dll in which, I suppose, this bug was not present.

    Best regards
    Piotr

    Tuesday, May 13, 2014 11:33 AM

Answers

  • It really seems that there is a bug in DSS. When I change dssp.tcp to http, the memory leak seems to be solved. However, the number of handles in use then starts to increase very quickly; after a few days one can have millions of unclosed handles (of type Key). Turn off authentication and everything goes away: the handle count stays low and we can even revert to binary serialization. Another workaround... It would be nice to have someone from the DSS & CCR team look at it. But anyway, I think it is time to look for a DSS & CCR replacement.


    Wednesday, June 18, 2014 11:46 AM

All replies

  • Hi Piotr,

    Glad to see you on the forums again! :-D

    Are you using the default CLR dispatcher or a custom one? If indeed there is a bug, you might try instantiating your own Dispatcher. This allows you to configure things such as throttling rate, thread affinity, etc.
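
    For example, something along these lines (a quick sketch; the thread count, scheduling policy and queue depth are just placeholders, and MyNotification/notificationPort/NotificationHandler stand in for your own types):

    using Microsoft.Ccr.Core;

    // Dedicated dispatcher with its own named thread pool.
    Dispatcher dispatcher = new Dispatcher(4, "CustomCcrPool");

    // Queue that throttles scheduling once 1000 tasks are pending.
    DispatcherQueue taskQueue = new DispatcherQueue(
        "CustomCcrQueue",
        dispatcher,
        TaskExecutionPolicy.ConstrainQueueDepthThrottleExecution,
        1000);

    // Receivers activated on this queue are scheduled by the custom dispatcher.
    Arbiter.Activate(taskQueue,
        Arbiter.Receive<MyNotification>(true, notificationPort, NotificationHandler));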

    Thanks,


    Dennis M. Knippel

    Tuesday, May 13, 2014 3:57 PM
    Moderator
  • Hi Dennis

    I often visit the forum as a reader, and I would be glad to answer a few questions from time to time (like the ones about endpoints for WCF, cloud and Robotics) or even add some points to a discussion (in my free time I am playing with the idea of DDS - Data Distribution Service - with embedded devices and DSS on the desktop/server). But that is something for the future.

    >Are you using the default CLR dispatcher or a custom one?

    I was using the default dispatcher.

    > If indeed there is a bug, you might try instantiating your own Dispatcher. This allows you to configure things such as throttling rate, thread affinity, etc.

    I have played with custom Dispatcher and DispatcherQueue options, but it didn't help, and I can't see how it could. The memory usage still grows and responsiveness decreases. All the diagnostics show that the system is perfectly normal:

    dispatcherQueue.Count: 0 (stays 0)
    dispatcherQueue.CurrentSchedulingRate: 0 (stays 0)
    dispatcherQueue.Dispatcher.PendingTaskCount: 0 (stays 0)
    dispatcherQueue.Dispatcher.ProcessedTaskCount: 1000 (grows as the number of processed messages grows).

    Only the receiver data inspected during debugging

    receiver = Arbiter.Receive<pong.Pong>(true, pongNotify, PongNotifyHandler);

    shows some problems:

    base = {Receiver`1(Persistent) with method Ping.PingService:PongNotifyHandler nested under none}

    ExceptionPort = {Port Summary:

    Hash:3645
    Type:System.Exception
    Elements:0

    ReceiveThunks:1
    Receive Arbiter Hierarchy:
    DsspReqRspForwarder(Persistent) with no continuation nested under none}

    but I have no idea where this exception comes from and how to handle it.
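
    If it were an ordinary task exception, the usual way to observe it would be a causality with an exception port, roughly like this (a sketch; the queue, port and causality names are illustrative):

    Port<Exception> exceptionPort = new Port<Exception>();

    // Log anything that arrives on the exception port.
    Arbiter.Activate(taskQueue,
        Arbiter.Receive<Exception>(true, exceptionPort,
            e => Console.WriteLine("CCR exception: " + e)));

    // Tasks scheduled while this causality is active route their unhandled
    // exceptions to exceptionPort.
    Dispatcher.AddCausality(new Causality("PingPongDiagnostics", exceptionPort));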

    Best regards
    Piotr

    Wednesday, May 14, 2014 11:43 AM
  • Gershon,

    If you're still out there, could you chime in on this one... :-)

    Thanks,


    Dennis M. Knippel

    Wednesday, May 14, 2014 4:27 PM
    Moderator
  • It would be nice to get someone on this, as not being able to use binary serialization is a serious handicap. Maybe one day the .NET Foundation will include MSRDS :)

    Cheers
    Piotr

    Saturday, May 17, 2014 12:49 PM
  • I want to add yet another twist to the story. After switching to http-only mode (the /p parameter without /t on the dsshost command line) I noticed that unmanaged memory grows, and after one or two days I am unable to close dsshost. I see the messages:

    * Control-C pressed.

    * Initiating shutdown...

    * Shutdown complete.

    but it stays there forever. Moreover, I can't kill the process because I receive an access-denied message. Does anyone have an earlier version of DSS? It seems that the latest version has bugs that the previous version didn't have. I would really appreciate a link to an earlier version (the earlier version works), or some answer on how to solve this.

    Tuesday, June 10, 2014 2:15 PM
  • Wow, this is bizarre behavior. I'm not saying there isn't a bug in the CCR/DSS, but I thought I'd throw this question into the ring: have you tried your solution on a different PC? Perhaps there is some underlying hardware/driver/software/malware issue on your current dev box causing said memory leak and/or permissions issue?

    Dennis M. Knippel

    Tuesday, June 10, 2014 4:21 PM
    Moderator
  • Hi Dennis

    Our software solution runs on five different servers, so I would rule out a hardware/driver/software/malware issue. The long shutdown appears only in the configuration with the /p switch alone (without /t). Moreover, the increase in memory usage by unmanaged code is bothering me. I will investigate the subject, but it will take some time, as the effects usually become visible only after 4-5 days.

    If you by any chance have access to older versions of the DLLs, please let me know. Even for tests they may be helpful.

    Cheers
    Piotr

    Tuesday, June 10, 2014 5:14 PM
  • My understanding of the /p vs. /t switches is that the /p implements a UDP protocol whereas the /t implements a TCP/IP protocol.  I could be wrong on those semantics, but if correct would it shed some light on the observed behavior?

    I may have RDS v3 on my office server, but will not have access to it until this coming Wednesday (June 11th).  Will check to see and if I do have it, will provide a link for you.


    Dennis M. Knippel

    Tuesday, June 10, 2014 5:29 PM
    Moderator
  • Hi Piotr,

    Crazy as it may seem, I had an old version (R3) of RDS lying around.  I think I've successfully uploaded it to my public OneDrive account at:

    https://onedrive.live.com/redir?resid=FFDA433B8A619976!1913&authkey=!AKICw1qydjg32WM&ithint=file%2c.exe

    Let me know if you are not able to successfully download it.  Hope this helps :-)


    Dennis M. Knippel

    Wednesday, June 11, 2014 6:35 PM
    Moderator
  • Hi Dennis

    I have just downloaded the R3 version, thanks, and I have already finished testing with this version. It does not help: my test scenario still shows the problem and the slow degradation of dssp.tcp over time.

    With http transport I get:

    Pong subscribed message: 1000
    00:00:06.5699705 1000
    Pong subscribed message: 2000
    00:00:06.5204951 2000
    Pong subscribed message: 3000
    00:00:06.5425746 3000
    ...
    Pong subscribed message: 10000
    00:00:06.5952621 10000
    Pong subscribed message: 11000
    00:00:06.2360143 11000

    Slow, yes, but stable. With dssp.tcp:

    Pong subscribed message: 1000
    00:00:00.5904222 1000
    Pong subscribed message: 2000
    00:00:00.7046293 2000
    ...
    Pong subscribed message: 20000
    00:00:02.7524658 20000
    Pong subscribed message: 21000
    00:00:03.0142152 21000

    Much faster at first, but then it becomes slower and slower. Either the bug was present in the R3 release too, or I am doing something wrong in my scenario. I would have imagined that such a simple scenario (subscribe-publish, request-response) was tested before release.
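
    For reference, the timing lines above can be produced by a simple counter in the notification handler, along these lines (a sketch; the counter and Stopwatch fields are my illustration here, not the exact code in the sample):

    int _count;
    Stopwatch _watch = Stopwatch.StartNew();   // System.Diagnostics

    void PongNotifyHandler(pong.Pong notification)
    {
        _count++;
        if (_count % 1000 == 0)
        {
            // Print the elapsed time for the last 1000 notification cycles, then reset.
            Console.WriteLine("Pong subscribed message: " + _count);
            Console.WriteLine(_watch.Elapsed + " " + _count);
            _watch.Reset();
            _watch.Start();
        }
        // ... continue the ping-pong cycle as before ...
    }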

    All the best

    Piotr

    Thursday, June 12, 2014 9:11 AM
  • It really seems that there is a bug in DSS. When I change dssp.tcp to http, the memory leak seems to be solved. However, the number of handles in use then starts to increase very quickly; after a few days one can have millions of unclosed handles (of type Key). Turn off authentication and everything goes away: the handle count stays low and we can even revert to binary serialization. Another workaround... It would be nice to have someone from the DSS & CCR team look at it. But anyway, I think it is time to look for a DSS & CCR replacement.


    Wednesday, June 18, 2014 11:46 AM
  • Hi Piotr,

    I'm sorry we didn't get much response from the MS Robotics Group, but as you have probably already deduced, support for RDS is dwindling. BTW, I don't work for MS; I'm just volunteering as a Moderator on this Robotics forum.

    I have already developed a rough equivalent of the CCR component using the Task Parallel Library and have used it successfully on my project at Agilent Technologies to automate some gas chromatography/mass spectrometry equipment. I'd be happy to discuss the challenges of this "CCR to TPL" migration effort with you. I can be reached directly at dknippel@wiesenportz.com
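
    Just to give a flavor of the shape this takes (a self-contained illustrative fragment, not my actual code): a CCR Port with a persistent receiver maps fairly naturally onto a TPL Dataflow ActionBlock.

    using System;
    using System.Threading.Tasks.Dataflow;   // TPL Dataflow (NuGet package Microsoft.Tpl.Dataflow)

    class PongNotification { public int Sequence; }

    class Demo
    {
        static void Main()
        {
            // Roughly the TPL counterpart of a persistent CCR receiver on a port.
            var pongNotify = new ActionBlock<PongNotification>(
                msg => Console.WriteLine("Got notification " + msg.Sequence),
                new ExecutionDataflowBlockOptions
                {
                    MaxDegreeOfParallelism = 1,   // one handler at a time, like a single receiver
                    BoundedCapacity = 1000        // back-pressure instead of an unbounded queue
                });

            // Post() plays the role of CCR's port.Post(); it returns false when the bounded queue is full.
            for (int i = 0; i < 5; i++)
                pongNotify.Post(new PongNotification { Sequence = i });

            pongNotify.Complete();
            pongNotify.Completion.Wait();
        }
    }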

    Regards,


    Dennis M. Knippel

    Wednesday, June 18, 2014 5:27 PM
    Moderator