I want to report a scaling problem that occurs when I change the instance count of a web role from 1 to 2.
When I make this change, my service stops responding for about 5 minutes, starting about 30 seconds after the change and lasting until the second instance comes online (about 4-5 minutes later). After that, although
I have 2 instances online (I can check this through logging), only the 2nd one is answering, as if the 1st instance had been taken off the load balancer. I've checked that the 2 instances still communicate through internal endpoints, but the 1st is
not receiving external requests.
When I then change back to 1 instance, after a while I see both instances answering for about 30 seconds: the 1st is activated in the load balancer again, and then, as expected, the 2nd instance goes offline and only the 1st stays online.
If I want 2 instances active after this change, I have to reboot the 1st instance through the management portal before both of them answer. Everything is OK then.
This is really strange behaviour: I have an offline period of about 5-10 minutes and after that only one working instance, although I am charged for 2 and the Azure portal reports 2 active instances with status 'ready'. I first noticed this last week.
I believe it is a bug in the system, and I can say that it was not happening before. I have used this schedule (changing from 1 to 2 instances) for 2-3 months with no problems, but recently I have been experiencing this problem.
Thank you for your attention.
Edited by Dimitris V, Thursday, March 15, 2012 12:23 PM
It has to do with the upgrade domains, which define how Windows Azure instances are upgraded. When you change your instance count from 1 to 2, you are in effect deploying a new configuration file, which results in a reboot of the instance. The configuration
file can contain multiple changes, so it has to be reloaded to make sure all of them are picked up.
Since you only have 1 instance, it will go down while being upgraded. That's why the SLA is based on 2 instances. If you are using 2 instances, upgrading by upgrade domain leaves you with one running instance while the other is upgraded, and
the Windows Azure load balancer makes sure all requests end up at the available instance during the upgrade. That way your application stays available while it is being upgraded.
The suggestion most people will make is to run 2 instances of your deployment, which provides you with that availability.
My first instance is not going down. I can confirm this because I am logging what happens in every instance. It just cannot be accessed from outside.
Also, what I am saying is that I cannot access this 1st instance, which is still working as far as I can see, after the second one comes online. I can access it through internal endpoints. This continues until I change the instance count back to 1.
Then I can reach it through external endpoints again. I can confirm that the 1st instance was up and running all this time, just not answering.
This looks like a different behaviour: the instance seems to be taken off the load balancer for this period, not brought down as you say.
Actually, because of the configuration change your application will reboot. To be sure, create an RDP connection to your instance and change the number of instances. You should be disconnected for a little while. The only way to prevent this is to handle the
configuration change yourself:
    public override bool OnStart()
    {
        RoleEnvironment.Changing += RoleEnvironmentChanging;
        return base.OnStart();
    }
    private void RoleEnvironmentChanging(object sender, RoleEnvironmentChangingEventArgs e)
    {
        // Cancel = false applies the setting change in place; true would recycle the instance.
        if (e.Changes.Any(change => change is RoleEnvironmentConfigurationSettingChange))
            e.Cancel = false;
    }
Actually I have the same behavior after having tried this out with a simple web deployment.
After going from 1 instance to 2 instances, only 1 instance seems to receive requests. The other instance is working, but it looks like the load balancer is no longer routing any requests to it: they are all handled by the
other instance, although the load balancer should distribute them round robin.
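One way to see from the outside which instance serves each request is to have every instance stamp its own ID on the responses. The following is only a minimal sketch, assuming an ASP.NET web role with a Global.asax code-behind; the header name X-Role-Instance is made up for this example and is not something Azure defines.

```csharp
using System;
using System.Web;
using Microsoft.WindowsAzure.ServiceRuntime;

public class Global : HttpApplication
{
    // Hypothetical header name, chosen for this sketch only.
    private const string InstanceHeader = "X-Role-Instance";

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        // Outside the Azure fabric or emulator there is no role instance to report.
        if (!RoleEnvironment.IsAvailable)
        {
            return;
        }

        // Tag the response with the ID of the instance that handled the request.
        Response.AppendHeader(InstanceHeader, RoleEnvironment.CurrentRoleInstance.Id);
    }
}
```

Requesting the public endpoint repeatedly and looking at this header should then show whether both instance IDs appear or only one of them.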
Hopefully someone else can state why this is behaving as it is now.
Be nice to nerds ... Chances are you'll end up working for one!
OK, but maybe the instance has been taken out of the load balancer. Could you log the status of the machine? If only one instance receives requests, the other one might be stuck in another status:
    public override bool OnStart()
    {
        RoleEnvironment.StatusCheck += RoleEnvironmentStatusCheck;
        return base.OnStart();
    }
    private void RoleEnvironmentStatusCheck(object sender, RoleInstanceStatusCheckEventArgs e)
    {
        Trace.WriteLine("The status of the role instance: " + e.Status, "Information");
    }
I am logging what happens on both machines. They are both online and the 1st never stopped working. I can also access that machine through internal endpoints. It is just not answering the outside world. After I return to 1 instance, I can see
that the 1st one never stopped and kept logging the whole time (I have a timer reporting continuously to SQL Azure, and I also log to a local file, so I know the 1st instance was up and running all the time).
PS. The status was Ready the whole time. I am checking it continually through internal endpoints.
Edited by Dimitris V, Thursday, March 15, 2012 1:38 PM
Edited by Dimitris V, Thursday, March 15, 2012 1:41 PM