Answered: Windows Azure Scaling Problem

  • Thursday, March 15, 2012 12:20 PM
     
     

    I want to report a scaling problem that occurs when I change the instance count of a web role from 1 to 2.

    When I make this change, my service stops responding for about 5 minutes, starting roughly 30 seconds after the change and lasting until the second instance comes online (4-5 minutes later). After that, although I have 2 instances online (I can verify this through logging), only the 2nd one answers, as if the 1st instance had been taken off the load balancer. I've checked that the 2 instances still communicate through internal endpoints, but the 1st is not receiving external requests.

    When I then change back to 1 instance, after a while I see both instances answering for about 30 seconds (the 1st is active in the load balancer again). Then, as expected, the 2nd instance goes offline and the 1st keeps serving requests.

    If I want 2 instances activated after this change I have to reboot the 1st instance through the management portal to see them both answering. Everything is ok then.

    This is really strange behaviour: I get an offline period of about 5-10 minutes and after that only one working instance, although I am charged for 2 and the Azure portal reports 2 active instances with status 'Ready'. I first noticed this last week. I believe it is a bug in the system, because it did not happen before: I have been using this schedule (changing from 1 to 2 instances) for 2-3 months with no problems, and only recently started experiencing this issue.

    Thank you for your attention.


    • Edited by Dimitris V Thursday, March 15, 2012 12:23 PM

All Replies

  • Thursday, March 15, 2012 12:30 PM
     
     

    It has to do with the upgrade domains that define how Windows Azure instances are being upgraded. When you change your instance count from 1 to 2, you basically redeploy a new configuration file, which will result in a reboot of the instance. The configuration file can contain multiple changes, so it has to be reloaded to make sure all changes to the configuration file are picked up.

    Since you only have 1 instance, it will go down while being upgraded. That's why the SLA is based upon 2 instances. If you are using 2 instances, upgrading by upgrade domain leaves one instance running while the other is upgraded, and the Windows Azure load balancer makes sure all requests end up at the available instance during the upgrade. That way your application stays available while it's being upgraded.

    The suggestion most people will make is to have 2 running instances of your deployment, which provides you with that availability.

    See this thread, which covers the same topic:
    http://social.msdn.microsoft.com/Forums/en-US/windowsazuretroubleshooting/thread/9daad797-f993-402c-8606-ffc970c263ed

    I cannot comment on why one of the active instances would not receive any requests. I'll try this out to see whether I can reproduce the same behavior.


    Be nice to nerds ... Chances are you'll end up working for one!


  • Thursday, March 15, 2012 12:43 PM
     
     

    Hi Robbin,

    My first instance is not going down. I can confirm this because I log what happens on every instance. It just cannot be reached from outside.

    Also, what I am saying is that I cannot access this 1st instance (which is still working, as far as I can see) after the second one comes online. I can still reach it through internal endpoints. This continues until I change the instance count back to 1; then I can reach it through external endpoints again. I can confirm that the 1st instance was up and running the whole time, just not answering.

    This is very strange behaviour: it looks as if the instance is taken off the load balancer for this period, not taken down as you say.

    Thanks.

  • Thursday, March 15, 2012 12:58 PM
     

    Hi Dimitris,

    Actually, because of the configuration change your application will reboot. To be sure, create an RDP connection to your instance and then change the number of instances. You should be disconnected for a little while. The only way to prevent this is to handle the configuration change yourself:

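    // Requires: using System.Linq; using Microsoft.WindowsAzure.ServiceRuntime;
    public override bool OnStart()
    {
        // Subscribe before the role starts so that no change notification is missed.
        RoleEnvironment.Changing += RoleEnvironmentChanging;
        return base.OnStart();
    }

    private void RoleEnvironmentChanging(object sender, RoleEnvironmentChangingEventArgs e)
    {
        // Runs before a change is applied to this instance.
        if (e.Changes.Any(change => change is RoleEnvironmentConfigurationSettingChange))
        {
            // false keeps the instance running and applies the change in place;
            // true would take the instance offline and recycle it.
            e.Cancel = false;
        }
    }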
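
    Note that the handler above only filters on RoleEnvironmentConfigurationSettingChange, while an instance count change should surface as a RoleEnvironmentTopologyChange. As a minimal sketch, assuming the Microsoft.WindowsAzure.ServiceRuntime assembly is referenced (the class name TopologyLoggingRole is hypothetical and not from this thread), logging those changes might look like this:

```csharp
using System.Diagnostics;
using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

// Hypothetical role class, for illustration only.
public class TopologyLoggingRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Changed is raised after a configuration or topology change has been applied.
        RoleEnvironment.Changed += RoleEnvironmentChanged;
        return base.OnStart();
    }

    private static void RoleEnvironmentChanged(object sender, RoleEnvironmentChangedEventArgs e)
    {
        // Only look at topology changes, such as an instance count change.
        foreach (var change in e.Changes.OfType<RoleEnvironmentTopologyChange>())
        {
            // Number of instances of the changed role that this instance can currently see.
            int visible = RoleEnvironment.Roles[change.RoleName].Instances.Count;
            Trace.WriteLine(
                "Topology changed for role " + change.RoleName +
                "; instances visible: " + visible,
                "Information");
        }
    }
}
```

    This only logs what each instance sees; it does not show whether the load balancer is routing external requests to it.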

    Sandrino


    Sandrino Di Mattia | Twitter: http://twitter.com/sandrinodm | Azure Blog: http://fabriccontroller.net/blog | Blog: http://sandrinodimattia.net/blog

  • Thursday, March 15, 2012 1:01 PM
     
     

    Hi Sandrino,

    Thanks for your answer. My role already works the way you are proposing.

    But what I am saying is that the instance is neither going down nor rebooting. It is still working and still writing to the log; it just cannot be accessed.

    Dimitris

  • Thursday, March 15, 2012 1:09 PM
     
     

    Actually, I see the same behavior after trying this out with a simple web deployment.

    After going from 1 instance to 2 instances, only 1 instance seems to receive requests. The other instance is working, but it looks like the load balancer is no longer routing any requests to it: they are all handled by the other instance, whereas the load balancer should distribute them round robin.
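
    As a rough illustration of the difference between the expected round-robin distribution and what is observed, here is a toy simulation in plain C#. It is not the Azure load balancer; the class RoundRobinBalancer and the instance names IN_0 and IN_1 are made up for this sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy round-robin dispatcher; NOT the Azure load balancer, just an illustration.
public class RoundRobinBalancer
{
    private readonly List<string> _inRotation;
    private int _next;

    public RoundRobinBalancer(IEnumerable<string> instances)
    {
        _inRotation = instances.ToList();
    }

    // Simulates an instance being dropped from rotation while it keeps running.
    public void RemoveFromRotation(string instance)
    {
        _inRotation.Remove(instance);
        _next = 0;
    }

    // Returns the instance that would handle the next external request.
    // Assumes at least one instance is still in rotation.
    public string Route()
    {
        string target = _inRotation[_next % _inRotation.Count];
        _next = (_next + 1) % _inRotation.Count;
        return target;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Expected: requests alternate between the two instances.
        var expected = new RoundRobinBalancer(new[] { "IN_0", "IN_1" });
        Console.WriteLine(string.Join(",", Enumerable.Range(0, 4).Select(_ => expected.Route())));
        // prints "IN_0,IN_1,IN_0,IN_1"

        // Observed: the first instance is out of rotation, so the second gets everything.
        var observed = new RoundRobinBalancer(new[] { "IN_0", "IN_1" });
        observed.RemoveFromRotation("IN_0");
        Console.WriteLine(string.Join(",", Enumerable.Range(0, 4).Select(_ => observed.Route())));
        // prints "IN_1,IN_1,IN_1,IN_1"
    }
}
```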

    Hopefully someone else can explain why it behaves this way.



  • Thursday, March 15, 2012 1:26 PM
     

    OK, but maybe the instance has been taken out of the load balancer. Could you log the status of the machine? If only one instance receives requests, the other one might be stuck in another status:

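    // Requires: using System.Diagnostics; using Microsoft.WindowsAzure.ServiceRuntime;
    public override bool OnStart()
    {
        RoleEnvironment.StatusCheck += RoleEnvironmentStatusCheck;

        return base.OnStart();
    }

    private void RoleEnvironmentStatusCheck(object sender, RoleInstanceStatusCheckEventArgs e)
    {
        // Raised at regular intervals; e.Status is the status reported for this instance.
        Trace.WriteLine("The status of the role instance: " + e.Status, "Information");
    }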

    Sandrino



  • Thursday, March 15, 2012 1:35 PM
     
     

    Sandrino,

    I am logging what happens on both machines. They are both online, and the 1st never stopped working. I can also access that machine through internal endpoints; it just does not answer the outside world. After I return to 1 instance, I can see that the 1st one never stopped and kept logging the whole time (I have a timer that reports continuously to SQL Azure, and I also log to a local file, so I know the 1st instance was up and running throughout).

    Dimitris.

    PS. The status was Ready the whole time; I check it continually through internal endpoints.
    • Edited by Dimitris V Thursday, March 15, 2012 1:38 PM
    • Edited by Dimitris V Thursday, March 15, 2012 1:41 PM
  • Friday, March 16, 2012 12:16 PM
     
     Answered

    I was informed by Microsoft Support that the behavior I am describing is caused by a problem in the interaction between the Azure Fabric Controller and the Azure SDK, and that a fix will be included in the Azure SDK.

    Thanks Everyone.

    • Marked As Answer by Dimitris V Friday, March 16, 2012 12:16 PM