I wonder if any of you could help me solve a problem I'm having...
We run a web role on two Small instances. The web role consists of a small public website and the main application (used by members). The product is still in its infancy, and load is very low at the moment.
For both sites, we use a co-located caching service to store our session data (through the SDK’s libraries). At the moment, we do not use the caching service for anything else.
We’ve been having a problem that is severely affecting our uptime, and it is difficult to investigate and solve. The problem is intermittent but occurs every one to two weeks.
Here is a description:
1. One of the instances is reported as unhealthy in the Windows Azure monitor.
2. Monitoring graphs in the Azure portal do not load (or take ages!)
3. When they do occasionally load, the available memory on the instance (or instances, as it sometimes affects both) is 0 bytes.
4. Restarting the instance usually solves the problem, though it takes ages (sometimes as much as 1hr+).
5. Restarting one unhealthy instance sometimes affects the healthy one, and interrupts the service (sites do not load), even with 1 healthy instance.
6. During an instance restart, we start receiving numerous exceptions related to the cache service. We see these because we have an exception handler in global.asax that emails us any unhandled exceptions. The exceptions themselves are probably caused by the restart of the instance.
7. Most of the time, remote desktop on the unhealthy instance does not work. However, at times it does (perhaps if the instance hasn’t been unhealthy for long).
8. When we do manage to log in through RDP on an unhealthy instance, Task Manager shows CacheService.exe using an enormous amount of memory (700 MB to 1 GB), and it keeps increasing at a steady, visible rate (about 10 MB every minute) in small increments.
9. At times, we have received an ‘out of memory’ exception from the site before the site goes down.
10. Restarting the caching service on the ailing instance immediately solves the problem, and the Azure portal reports the instance as healthy again.
11. Our usage of the caching is just for sessions, and we just store some strings and integers to ensure that the user is properly logged in.
12. During normal operations, the caching service uses about 200 MB on each instance.
13. Here are the caching settings we’re using:
a. Cache size percentage: 30%
b. High Availability: Disabled
c. Notifications: Disabled
d. Eviction: LRU
e. Expiration: Sliding
f. TTL: 20
g. SDK version: 1.7
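As a rough sanity check on the figures above, here is a back-of-the-envelope calculation in Python. It is only a sketch: it assumes a Small instance has 1.75 GB of RAM (that number is not something I measured on the instance), and the variable and function names are made up for illustration.

```python
# Back-of-the-envelope arithmetic on the figures reported in the list above.
INSTANCE_RAM_MB = 1.75 * 1024   # ASSUMPTION: a Small instance has 1.75 GB of RAM
CACHE_SIZE_PERCENT = 30         # setting (a): cache size percentage
NORMAL_MB = 200                 # item 12: CacheService.exe during normal operation
OBSERVED_MB = (700, 1024)       # item 8: range seen on an unhealthy instance
GROWTH_MB_PER_MIN = 10          # item 8: observed growth rate

# Memory share implied by the 30% setting.
cache_budget_mb = INSTANCE_RAM_MB * CACHE_SIZE_PERCENT / 100


def minutes_to_reach(target_mb, start_mb=NORMAL_MB, rate=GROWTH_MB_PER_MIN):
    """Minutes needed to grow from start_mb to target_mb at a constant rate."""
    return (target_mb - start_mb) / rate


print(f"Configured cache share: {cache_budget_mb:.1f} MB")
for observed in OBSERVED_MB:
    print(
        f"{observed} MB is {observed - cache_budget_mb:.1f} MB over that share, "
        f"about {minutes_to_reach(observed):.0f} min of growth from {NORMAL_MB} MB"
    )
```

If that assumption holds, and if the 30% setting is meant to bound the cache's memory, CacheService.exe is well beyond its configured share by the time we see it misbehaving, and at the observed rate it would need roughly an hour or more of steady growth to get there from its normal size.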
This clearly seems to be a problem related to the caching service. I’m wondering if it could be one of the following:
1. A bug in the caching service
2. Incorrect utilization of session state and the caching service
3. Incorrect configuration of the caching service
In my opinion, the caching service is properly configured (there isn’t much you can do wrong). While I cannot exclude #2, we have never had similar problems in other deployments (the same code is deployed on-premises using SQL Server session state storage). Unfortunately, there seems to be no way to connect to the caching service and browse its contents (for example, to list the keys and the sizes of the objects). Is there a way to do so?
As regards #1, have any of you ever encountered a memory bug of this kind in the co-located caching service? Are there any suggestions you can offer to solve or help investigate this issue?
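In case it helps frame the investigation, this is the kind of check I have in mind for catching the growth before the instance runs out of memory. It is a minimal sketch in Python, not something we run today: it takes periodic memory readings for CacheService.exe (collected however is convenient, for example by hand from Task Manager) and estimates the growth rate with a least-squares fit. The sample readings, the 5 MB/min threshold and the function names are all hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (minutes since first reading, memory in MB)


def growth_rate_mb_per_min(samples: Sequence[Sample]) -> float:
    """Least-squares slope of memory (MB) against time (minutes)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def looks_like_leak(samples: Sequence[Sample], threshold: float = 5.0) -> bool:
    """True when memory grows at or above threshold MB/min (hypothetical cut-off)."""
    return growth_rate_mb_per_min(samples) >= threshold


# Hypothetical readings resembling what we see on an unhealthy instance.
unhealthy = [(0, 700), (1, 710), (2, 720), (3, 731), (4, 740)]
# Hypothetical readings for a healthy instance hovering around 200 MB.
healthy = [(0, 200), (1, 201), (2, 199), (3, 200), (4, 200)]

print(f"unhealthy: {growth_rate_mb_per_min(unhealthy):.1f} MB/min, "
      f"leak suspected: {looks_like_leak(unhealthy)}")
print(f"healthy: {growth_rate_mb_per_min(healthy):.1f} MB/min, "
      f"leak suspected: {looks_like_leak(healthy)}")
```

A check like this would at least let us restart the caching service (which fixes things immediately, as noted above) before the whole instance becomes unhealthy, though it does nothing to explain the root cause.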
Thanks!