Redis Cache - Sudden 100% CPU and Crash

  • Question

  • Last night I started to get "Timeout performing GET X" errors, then not too long after that it changed to:

    "StackExchange.Redis.RedisConnectionException: No connection is available to service this operation: GET X"

    The cache then went offline (along with all of my websites) for a long period until I returned to work the next day.

    I'd only moved to the Redis cache at the end of last week because the Shared Cache was going offline on 1st September so I was a bit limited on how to respond.

    I am using the Standard 1GB cache, of which about 250MB is taken up. There are 650k Gets and 350k Misses on the cache, so you can tell it is used by some popular websites.
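    For context, the hit rate implied by those numbers can be worked out (assuming "Gets" counts all read attempts and "Misses" are the subset that found nothing), sketched here in Python:

```python
def hit_rate(gets: int, misses: int) -> float:
    """Fraction of GET requests that found a value in the cache."""
    hits = gets - misses
    return hits / gets

# ~650k gets with ~350k misses, as reported above
rate = hit_rate(650_000, 350_000)
print(f"hit rate: {rate:.1%}")  # roughly 46%
```

    A sub-50% hit rate means more than half of the GETs fall through to the backing store, which matters later when the timeout settings are discussed.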

    The first thing I did was to stop the requests - I had to turn off the websites and left the cache to "cool down". After about 10 minutes, I connected from my development machine and all was fine. So I turned back on the sites and it seems ok.

    However, the graphs show that the CPU rose to ~97% when the problem started and it is still at ~97% now - there wasn't a drop when the sites went offline, but the graphs don't seem to be real time. The problem could therefore start up at any time again.

    So now I need to find out what went wrong and how I can debug it.

    It seems difficult because I'm just looking at the Azure interface of delayed graphs. Can I connect to the instances via remote connection to confirm it really was the Redis cache that is/was eating the CPU?

    Why didn't the service failover to the slave?

    Are the instances shared? Could it be someone else who has done a cache GET loop which has put the CPU so high?

    Is there a way to restart the cache in situations like this?

    What other logs can I use to try to find out what's going wrong?
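    On the question of logs: Redis itself exposes server-side counters through the INFO command (connected clients, CPU time, keyspace hits/misses), which any client can fetch. A minimal Python sketch of pulling the interesting fields out of the raw INFO text (the sample payload and its values are made up, not taken from the affected cache):

```python
def parse_info(raw: str) -> dict:
    """Parse the key:value lines of a Redis INFO response into a dict."""
    stats = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip section headers like "# Stats"
            continue
        key, _, value = line.partition(":")
        stats[key] = value
    return stats

# Illustrative INFO fragment (field names are real INFO keys; values are invented)
sample = """# Stats
connected_clients:42
used_cpu_sys:1843.22
keyspace_hits:650000
keyspace_misses:350000
"""
info = parse_info(sample)
print(info["keyspace_hits"])  # 650000
```

    Polling these counters over time from a small script gives a near-real-time view that the delayed portal graphs cannot.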

    Note: I also tried to create another Redis cache in case I couldn't stop the issue (and then modify my configs to point to the new low CPU version), but this has failed with the message:

    statusCode:Conflict

    statusMessage:{"error":{"code":"ResourceDeploymentFailure","message":"The resource operation completed with terminal provisioning state 'Failed'."}}

    Whilst writing this post I have also found that I can no longer load the cache information in the Azure Portal - it just lists "We couldn't find any items :-(", but the websites are making use of it.

    It's all quite concerning that the cache can go down and there's not a great deal we can do and then my sites are in trouble because we're using it for sessions and also some code which involves intensive database requests on objects which are normally cached. I'd have to fall back to HttpCache and then we'd lose the benefit of the shared cache of websites hosted on multiple VMs.

    I realise there's the possibility that the code we're using could have a Get loop, but there's not anything obvious how this might occur and we've never had a problem in the past. I just need to find ways to work out what's gone wrong as it's a bit of a black hole at the moment. Any help offered will be appreciated.

    Friday, September 5, 2014 9:12 AM

All replies

  • All week we have been having this issue; it is very concerning. Azure's response under a paid support plan said, "Use Forums"... great.

    Saturday, September 6, 2014 3:41 AM
  • Are folks still seeing this issue? If you would be willing to share your cache with us (I can provide an email address if that's easier), we can investigate the issue better.


    Program Manager Azure


    Tuesday, September 9, 2014 5:22 PM
  • Please give me your email; I have ten 1GB instances, all of which have failed within the last week with the high CPU issue.

    We deployed VMs running our own Redis with the same code base and have had zero issues, so it is definitely Azure Redis related, not our code.

    Tuesday, September 9, 2014 10:12 PM
  • I left mine running whilst I had a couple of days off work with some instructions for the rest of my team - and they needed them!

    The session cache (just on its own) takes up 25% CPU constantly. If I share the same cache instance between two web roles, it eventually brings the cache down.

    My temporary solution was to add a second cache and separate the sessions out. This still has constant queuing problems and has gone down a few times too.

    As far as "solving" the problem is concerned, perhaps answers to these 2 questions will help:

    1) Does a larger instance have higher CPU power? E.g., if I jump up to a 2.5GB or 6GB cache, is it going to make any difference, given that we don't get near the 1GB limit?

    2) Does the Redis cache suffer more with larger objects? I'm using a shared cache because I need to be able to clear entries on both web roles, and each object contains a fair number of properties. If object size is the problem, I could use the cache purely as a notification tool and keep the objects in HttpCache. If it's just the number of hits, then it won't matter.
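    One quick way to get a feel for whether object size matters is to measure the serialized payload before it goes over the wire, since every GET of a large value costs that many bytes of bandwidth. A Python sketch (the `product` object is hypothetical, standing in for a cached object "with a fair amount of properties"):

```python
import json

def payload_size(obj) -> int:
    """Bytes that would be sent to/from Redis if the object is stored as JSON."""
    return len(json.dumps(obj).encode("utf-8"))

# Hypothetical cached object; real cached objects would be measured the same way
product = {"id": 123, "name": "Widget", "description": "x" * 2000,
           "prices": [9.99, 8.49], "tags": ["a", "b", "c"]}
print(payload_size(product))  # a couple of KB per GET adds up under heavy load
```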

    Everything was fine with the shared caching!

    Wednesday, September 10, 2014 8:34 AM
  • If you are using a 250 MB cache, you have limited CPU (as it's built on A0 compute machines, which have shared infrastructure).

    But at 1GB and above, you get a dedicated CPU. Increasing the cache size will give you more memory and more network bandwidth, but not more CPU power, as Redis is single threaded.

    Can you share more data on exactly what cache size you are using and what size objects you are writing to it? In our internal load test, we have not found CPU to be an issue for moderate loads.


    Program Manager Azure


    Wednesday, September 10, 2014 10:13 PM
  • Also, can you send your cache name to AzureCache@microsoft.com?


    Program Manager Azure


    Wednesday, September 10, 2014 10:35 PM
  • Very small objects.

    Like I said, I created empty VMs, installed Redis, and used the same code base; the VM CPUs haven't exceeded 2%. It is an issue with how Azure is deploying Redis.

    Thursday, September 11, 2014 6:54 AM
  • Thanks for the replies. I have sent an email through to you Saurabh.

    For the benefit of others, I have been using the Standard 1GB cache from the start.

    Inspired by EV_Anonymous' post at http://social.msdn.microsoft.com/Forums/azure/en-US/85b98e22-e44b-4a53-9367-390a6382386e/cache-size-increasing-for-no-reason?forum=azurecache#85b98e22-e44b-4a53-9367-390a6382386e...

    I have changed my settings from 2500ms timeout with 3 retries to 10000ms and 3 retries.

    At one point yesterday, I had 1.1m Cache Hits and ~400k Cache Misses. With the new settings (albeit a different time of the day), I have 500k hits and 20k misses. The CPU is hovering at 1%.

    It looks as though a "short" timeout with the retries causes a big issue because once the cache gets so busy that a timeout occurs, the problem is compounded by repeated attempts - growing the queue larger and larger until it eventually caves in.

    The longer timeout looks like it means that the vast majority of hits get there the first time - even if the cache is busy. I have not noticed the website slowing down, but need to keep monitoring as my site can be quite variable.
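    The compounding effect described above is a classic retry storm: short timeouts plus immediate retries multiply load exactly when the server is busiest. One common alternative (not something the StackExchange.Redis settings above do automatically; this is a general pattern) is exponential backoff with jitter, sketched in Python:

```python
import random

def backoff_delays(retries: int, base: float = 0.5, cap: float = 10.0):
    """Yield exponentially growing waits with jitter, capped at `cap` seconds."""
    for attempt in range(retries):
        upper = min(cap, base * (2 ** attempt))
        # "full jitter": picking uniformly in [0, upper] spreads out retrying clients
        yield random.uniform(0, upper)

for i, delay in enumerate(backoff_delays(5)):
    print(f"retry {i}: wait up to {delay:.2f}s")
```

    The point is that retries back off as failures repeat, instead of hammering a cache that is already at ~97% CPU.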

    Thursday, September 11, 2014 9:19 AM
  • Thx for reaching out to me over email.

    Like we discussed, we had a service issue earlier in the week that has now been resolved (the issue affected fewer than 1% of our customers). Given that the service is back to normal, you can choose to roll back the timeout value if you so desire.

    Going forward, if folks are still seeing issues with the default timeout value, do let us know.


    Program Manager Azure


    Friday, September 12, 2014 6:33 PM
  • IT IS NOT FIXED

    Tuesday, September 16, 2014 2:43 AM
  • We have identified and fixed two issues reported in this thread.

    First, there was an issue that caused high CPU usage on the replica instance of a Standard Redis cache. This should not have impacted the performance of your cache; however, it did cause the CPU Usage reported in the portal to be high. This should now be fixed for all caches.

    Second, there was an issue in the StackExchange.Redis client library that would sometimes cause the client to create a very large number of connections to the server.  This could make your cache either unavailable or very slow.  We recommend all customers upgrade to StackExchange.Redis version 1.0.333 or later.
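    That connection-explosion failure mode is why Redis client libraries recommend one long-lived, shared connection object per process rather than one per request (in StackExchange.Redis, a single shared ConnectionMultiplexer). A language-neutral sketch of the pattern in Python, with a stand-in client class (`FakeRedisClient` is illustrative only; a real client would open a TCP connection in `__init__`):

```python
import threading

class FakeRedisClient:
    """Stand-in for a real Redis client, counting how many get constructed."""
    instances_created = 0

    def __init__(self):
        FakeRedisClient.instances_created += 1

_client = None
_lock = threading.Lock()

def get_client() -> FakeRedisClient:
    """Return one shared client per process instead of connecting per request."""
    global _client
    if _client is None:
        with _lock:  # double-checked locking keeps this thread-safe
            if _client is None:
                _client = FakeRedisClient()
    return _client

a, b = get_client(), get_client()
print(a is b)  # True: every caller reuses the same connection
```

    With this shape, upgrading the client library fixes the bug and the application can never amplify it by constructing connections in a loop.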

    Friday, September 19, 2014 11:13 PM