Problem with Caching service on webrole (co-located)

    Question

  • I wonder if any of you could help me solve a problem I'm having...

    We use small webroles and have two instances. The webrole consists of a small public website and the main application (used by members). The product is still in its infancy, and load is very low at the moment.

    For both sites, we use a co-located caching service to store our session data (through the SDK’s libraries). At the moment, we do not use the caching service for anything else.

    We’ve been having a problem that is severely affecting our uptime, and it is difficult to investigate and solve. The problem is intermittent but happens every 1 or 2 weeks.



    Here is a description:

    1. One of the instances is reported as unhealthy in the Windows Azure monitor.

    2. Monitoring graphs in the Azure portal do not load (or take ages!)

    3. When they do occasionally load, available memory on the instance (or on both instances, as the problem sometimes affects both) is 0 bytes.

    4. Restarting the instance usually solves the problem, though it takes ages (sometimes as much as 1hr+).

    5. Restarting one unhealthy instance sometimes affects the healthy one, and interrupts the service (sites do not load), even with 1 healthy instance.

    6. During an instance restart, we start receiving numerous exceptions related to the cache service. We see them because an exception handler in global.asax emails us any unhandled exceptions. They are probably caused by the restart of the instance itself.

    7. Most of the time, remote desktop on the unhealthy instance does not work. However, at times it does (perhaps if the service hasn’t been unhealthy for long).

    8. After successfully logging in through RDP on an unhealthy instance, Task Manager shows CacheService.exe using an enormous amount of memory (700 MB to 1 GB), and it keeps growing at a steady, visible rate of about 10 MB per minute, in small increments.

    9. At times, we have received an ‘out of memory’ exception from the site before the site goes down.

    10. Restarting the caching service on the ailing instance immediately solves the problem, and the Azure portal reports the instance as healthy again.

    11. We use caching only for sessions, and we store just a few strings and integers to ensure that the user is properly logged in.

    12. During normal operations, the caching service uses about 200 MB on each instance.

    13. Here are the caching settings we’re using:
    a. Cache size percentage: 30%
    b. High Availability: Disabled
    c. Notifications: Disabled
    d. Eviction: LRU
    e. Expiration: Sliding
    f. TTL: 20
    g. SDK version: 1.7
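
    As a rough sanity check on the figures in this list, the arithmetic can be sketched as below. It assumes a Small instance has about 1.75 GB of RAM, which is not stated in the post, and that the 30% setting is meant to bound the memory reserved for cached data. Under those assumptions the reported 700 MB to 1 GB is well above the configured share, and the climb from the normal 200 MB would take roughly an hour.

```python
# Back-of-the-envelope check using the figures reported in the list above.
# ASSUMPTION (not from the post): a Small instance has about 1.75 GB of RAM.
INSTANCE_RAM_MB = 1.75 * 1024   # assumed Small instance memory, in MB
CACHE_SIZE_PERCENT = 30         # setting (a)
NORMAL_USAGE_MB = 200           # item 12: usage during normal operation
GROWTH_MB_PER_MIN = 10          # item 8: observed growth rate
OBSERVED_RANGE_MB = (700, 1024) # item 8: 700 MB to 1 GB


def cache_quota_mb(ram_mb, percent):
    """Memory share implied by the cache size percentage."""
    return ram_mb * percent / 100


def minutes_to_reach(target_mb, start_mb, rate_mb_per_min):
    """Time for steady linear growth to go from start_mb to target_mb."""
    return (target_mb - start_mb) / rate_mb_per_min


if __name__ == "__main__":
    quota = cache_quota_mb(INSTANCE_RAM_MB, CACHE_SIZE_PERCENT)
    print(f"Implied cache share: {quota:.1f} MB")
    for observed in OBSERVED_RANGE_MB:
        minutes = minutes_to_reach(observed, NORMAL_USAGE_MB, GROWTH_MB_PER_MIN)
        print(f"{observed} MB is {observed - quota:.1f} MB over the share, "
              f"reached about {minutes:.0f} min after leaving {NORMAL_USAGE_MB} MB")
```

    If the instance size or the meaning of the percentage differs from what is assumed here, only the constants at the top need to change.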


    Clearly, it seems a problem related to the caching service. I’m wondering if it could be one of the following:
    1. A bug in the caching service
    2. Incorrect utilization of session state and the caching service
    3. Incorrect configuration of the caching service

    In my opinion, the caching service is properly configured (there isn’t much you can do wrong). While I cannot exclude #2, we have never had similar problems in other deployments (the same code is deployed on-premises using SQL Server state storage). Unfortunately, there seems to be no way to connect to the caching service and browse its content (perhaps list the keys and the sizes of objects?). Is there a way to do so?

    As for #1, have you guys ever encountered some type of memory bug in the co-located caching service? Are there any suggestions you may offer to solve or help investigate this issue?

    Thanks!

    Friday, November 09, 2012 4:37 PM

All replies

  • Hi Kranzorg1,

    The caching service was promoted from preview in SDK 1.7.x to a final release in SDK 1.8, so using the latest version could make a big difference. Consider upgrading your Hosted Service!

    I also have a WebApp hosted on a WebRole (2x Extra Small instances), and I use the co-located caching service for my default webapp cache, for output caching and for session state. I haven't run into any trouble so far. I use the default 30% of memory on each instance to serve my cluster.

    Consider also purging/clearing all your caches (default and named caches) on Application_Start() in your Global.asax.cs. It could be that you're dealing with some leftovers/residual data from previous cache operations.

    Hope this helps! 


    Best Regards,
    Carlos Sardo


    • Edited by Carlos Sardo Saturday, November 10, 2012 9:15 AM
    • Proposed as answer by Carlos Sardo Thursday, November 15, 2012 2:32 PM
    • Marked as answer by Dino He (Moderator) Friday, November 16, 2012 8:55 AM
    Saturday, November 10, 2012 9:14 AM
  • Upgrading to SDK v1.8 is the top item in my things to try.

    I'd rather not clear the cache items in Application_Start(), because people would get logged out of the app whenever I recycle or restart an instance. :/

    In your experience, have you ever had instances reported as unhealthy? I get it a lot.

    Thanks for the help!

    Monday, November 12, 2012 9:30 AM
  • Hi Kranzorg1,

    No, never had that issue. Give it a try with SDK 1.8, and make sure you are referencing the correct binaries after the upgrade. Something like: C:\Program Files\Microsoft SDKs\Windows Azure\.NET SDK\2012-10\ref\Caching\...

    Hope this helps!


    Best Regards,
    Carlos Sardo

    Monday, November 12, 2012 10:04 AM
  • What version of Windows Server are you using for the web roles? Windows Server 2008 SP1, 2008 R2 or 2012?

    Monday, November 12, 2012 3:03 PM
  • Windows Server 2012

    ServiceConfiguration: osFamily="3" osVersion="*"


    Best Regards,
    Carlos Sardo

    Monday, November 12, 2012 3:12 PM
  • I have changed the OS to Windows Server 2008 R2 (I cannot upgrade to 2012 without a redeployment).

    The problem still recurs. I suspected a memory issue in our code, but I couldn't find one. Our application pools stayed at roughly the same memory usage after hours of use.

    What I found was that the cache service keeps on growing. Using a memory profiler, I found that the cache service held 500MB+ of byte arrays. When I inspected the contents of these byte[]s in the profiler, every byte was 0x0. I wonder why the cache service was allocating empty byte arrays.
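
    For anyone wanting to confirm the same kind of steady growth from a handful of readings, a minimal sketch follows. It assumes memory readings are collected by hand (for example from Task Manager) as (minutes, megabytes) pairs. The function names and the 1 MB/min threshold are hypothetical choices for illustration and not part of any Azure tooling.

```python
def growth_rate_mb_per_min(samples):
    """Least-squares slope of memory over time.

    samples: list of (minutes, megabytes) pairs, e.g. read off Task Manager.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def looks_like_leak(samples, threshold_mb_per_min=1.0):
    """True if memory grows faster than the (hypothetical) threshold."""
    return growth_rate_mb_per_min(samples) > threshold_mb_per_min


if __name__ == "__main__":
    # Illustrative readings only, shaped like the growth described in this thread.
    growing = [(0, 700), (1, 710), (2, 720), (3, 730)]
    steady = [(0, 200), (1, 201), (2, 199), (3, 200)]
    print(growth_rate_mb_per_min(growing), looks_like_leak(growing))
    print(growth_rate_mb_per_min(steady), looks_like_leak(steady))
```

    A fitted slope is less sensitive to a single noisy reading than comparing only the first and last sample.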

    I have still not upgraded my Azure SDK to 1.8, and I still believe that this is the best solution at the moment. I just wanted to update people with my findings. I will do the upgrade with the next release of the software.

    Friday, November 16, 2012 12:23 PM