Cannot swap slots anymore

  • Question

  • As of today, I can no longer swap my 2 slots. Last Saturday, everything was still OK for the same web app.

    I'm getting the error: Failed to swap web app slots

    I don't know if it's related, but since the swap stopped working, I see an app setting in the portal, "WEBSITE_SLOT_NAME", with the value "Production" (slot setting not checked). I did not put that value there...

    Regards,

    Tom

    Monday, August 31, 2015 7:07 AM

Answers

  • We found what we think is the issue and we're deploying a fix, starting with a small number of scale units. @shipsheep: Can you share your site name, either directly or indirectly, so we can prioritize yours?

    @Tom: we already know yours from your earlier mention of your site. If you try in about 2 hours, the fix should be there.

    We'll then expand the update in the morning PST.

    thanks,
    David

    Thursday, September 3, 2015 4:29 AM

All replies

  • Hi Tom,

    Did you use PowerShell or the portal to do the swap?

    Switch-AzureWebsiteSlot -Name <AzureWebsiteName>

    or (if you have more than 2 slots):

    Switch-AzureWebsiteSlot -Name <AzureWebsiteName> -Slot1 <slotName> -Slot2 <slotName>

    Maybe that outputs more error details.


    Please mark an answered question as answered to let others know about it.

    Monday, August 31, 2015 9:14 AM
  • Just tried. Gives me the following error:

    Switch-AzureWebsiteSlot : ExpectationFailed : Cannot swap site slots for site 'husk-app' because the 'test' slot did not respond to http ping.

    Note:

    • I can open and use the test slot just fine, using any browser.
    • I tried stopping and starting the slot, but that doesn't help.

    Monday, August 31, 2015 11:16 AM
  • Yeah.

    This afternoon, I created a new test slot (different name, same settings copied). Everything went OK with the new slot...

    A couple of hours later... Same problem (exactly the same message) with the new test slot!

    Seems like when production is copied over to a testing slot, I cannot swap it again...


    Monday, August 31, 2015 5:09 PM
  • Hi Tom,
    Indeed, a recent release made the warm-up more robust: it now checks the return status code (expecting < 500). In your case, hitting the root of the staging slot returns status code 500, which causes the swap to fail in order to protect your production site from downtime.

    I will check the exact syntax and edit this post with instructions on how to specify the URLs to be hit during warm-up.


    Galin Iliev, http://www.galcho.com

    Tuesday, September 1, 2015 7:56 PM
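
The status check described in the post above can be illustrated with a small Python sketch. This is not the platform's implementation: the names `warmup_ok` and `probe_status` and the 30-second probe timeout are assumptions made for illustration. Only the "status below 500 passes" rule comes from the post.

```python
import urllib.error
import urllib.request


def warmup_ok(status_code: int) -> bool:
    """Pass the warm-up check for any status below 500, as the post describes."""
    return status_code < 500


def probe_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status for a GET of url (hypothetical helper).

    urllib follows redirects, so a 302 reports the status of the final page.
    The 30-second timeout is an assumed value, not one from the thread.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the status code is carried on the exception.
        return err.code


if __name__ == "__main__":
    for code in (200, 302, 404, 500, 503):
        print(code, "pass" if warmup_ok(code) else "fail")
```

Under this rule a 404 or a redirect on the root would still pass, while a 500 during a cold start would fail the check.
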
  • Please see this page for details on listing URLs to hit during swap.
    Tuesday, September 1, 2015 8:04 PM
  • Doesn't seem to work.

    Isn't the applicationInitialization section for EXTRA requests on top of the default warmup request? (the docs seem to suggest so: http://prntscr.com/8bn8ik)

    I find this in the eventlog (/LogFiles/eventlog.xml) : http://prntscr.com/8bn8bm (replaced real domain and ip)
    We're using wildcards in DNS, and it seems it fires a request to http://*.domain.com, which doesn't work of course. Or is that just the IIS binding?
    Wednesday, September 2, 2015 5:52 AM
  • I'm having this issue on sites in two different regions here now (East US and North EU).

    I'll investigate some logs here, but this worked perfectly before, and we haven't changed anything that would cause these timeouts, so I don't see how this could be an error on our side.

    Wednesday, September 2, 2015 8:56 AM
  • In our case, it seems some Blob Storage initialization at app server startup times out during the site swap, but it works fine after the swap has failed: the staging site we attempted to swap can connect to Blob Storage just fine. I do see a couple of entries in the event log about this, though, so I suspect it's the cause of the failed swap.

    Now to figure out why those timeouts even occur in the first place.

    Wednesday, September 2, 2015 10:44 AM
  • There is indeed an issue with the new logic, and we are investigating.

    Note that the site is getting polled every 5 seconds for up to 10 minutes. So it seems strange that it would not be enough time to warm up the site.

    The part that we still don't understand is why warm up requests are failing with 500 for some sites.

    @shipsheep: before you added the rewrite rule, what was it returning? I would think this would be a 404 and not a 500, which would not have caused failures. Also, note that even before this change, the warm up was only ever hitting /. So I'm surprised that this would have ever warmed up your site even before the change.

    Wednesday, September 2, 2015 5:25 PM
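
The polling behaviour mentioned in the post above (a check every 5 seconds for up to 10 minutes) can be sketched as follows. This is only a model under stated assumptions: `poll_until_warm` and its parameters are hypothetical names, and how the platform treats the final poll at the time limit is not specified in the thread.

```python
import time
from typing import Callable


def poll_until_warm(
    check: Callable[[], bool],
    interval: float = 5.0,
    limit: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() every `interval` seconds until it passes or `limit` elapses."""
    start = clock()
    while True:
        if check():
            return True
        # Assumption: give up if the next poll would start after the time limit.
        if clock() - start + interval > limit:
            return False
        sleep(interval)
```

The `sleep` and `clock` parameters are injectable so the loop can be exercised without actually waiting 10 minutes.
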
  • If you are seeing this issue, please try enabling Detailed Error Logs and Failed Request Tracing on the staging slot before the swap. Hopefully, the logs will clarify what is causing the 500 error.
    Wednesday, September 2, 2015 5:49 PM
  • @shipsheep: the fix should now be there. Can you give it a try? Thanks!
    Thursday, September 3, 2015 10:28 AM
  • The fix is working for us too! Thank you!
    Thursday, September 3, 2015 12:02 PM
  • My website has the same problem. Do you know if its scale unit has been fixed (http://sproutr.azurewebsites.net/)?



    Thursday, September 3, 2015 12:29 PM
  • @Matthieu: you should have the fix as well. Within a few more hours, it will be everywhere.
    Thursday, September 3, 2015 7:10 PM
  • @David, can you confirm that the fix has been pushed everywhere as I'm having the exact same issue still. (as of 2015-09-04 16:15 EST)
    Friday, September 4, 2015 6:16 PM
  • @dstj: the fix is indeed everywhere, so it is not expected for the issue to still exist. Can you share your site name, either directly or indirectly? This will help us investigate. Also, please share the UTC time of one such failed swap attempt.
    Friday, September 4, 2015 6:28 PM
  • @david: (Indirect transmission) I created the dummy site http://swappingissue.azurewebsites.net/ in the same App Service Plan.

    Timestamp of one swapping attempt is 2015-09-04 18:44:50 UTC.

    I attempted the swap from both portal.azure.com and a PowerShell "Switch-AzureWebsiteSlot" script. Same error.

    One thing I did notice, though: when I start the staging slot from a stopped state, I get an HTTP 500 error if I try to ping the app "too quickly". I have to wait about 15-30 seconds, then my application responds normally. I do not know if this is related.

    Friday, September 4, 2015 6:48 PM
  • @dstj: thanks, this should help us get more information about it.

    In theory, the new logic ignores the http code from the request, but conceivably it is faulty. If at all possible, it would probably be good to figure out what's causing this initial error when the site is cold.

    Friday, September 4, 2015 7:12 PM
  • Thanks, David. We have the same slot swapping problem with at least two applications. Both handle calls to / differently, though.

    In the one mentioned above, root calls issue two HTTP requests to warm up other sections of the site (we did not know about <system.webServer/applicationInitialization> at the time), and the HTTP 500 mentioned above is a timeout on the sub-ping. I'll increase the timeout to see if that helps.

    In the second application that also fails when swapping slots, the root returns an HTTP 302 redirect to /ping.

    Friday, September 4, 2015 7:57 PM

  • Hi dstj, I'm taking a look at your issue and I noticed that your slot is stopped right now. I wanted to know if it's okay for me to trigger a swap on your site from our end and to try to get a better repro for the issue you're seeing.

    Friday, September 4, 2015 10:51 PM
  • @Ahmed: Yes, you can start the slot and attempt a swap. Could you please send an email to the administrator email addresses on the account to let us know when you are "working" with the slot?
    Tuesday, September 8, 2015 1:58 PM
  • Hello David,

    I still have the same error on my website as of today, Tuesday: "Cannot swap site slots for site 'xxx' because the 'staging' slot did not respond to http ping."

    The website is http://sproutr.azurewebsites.net/

    It's working for us now (as of Wednesday).

    Tuesday, September 8, 2015 2:24 PM

  • Hi David, 

    I have noticed that AutoSwaps haven't been swapping, and when I resort to the PowerShell cmdlets to manually initiate the failed swaps, I have to re-run the command 2-4 times to get a successful result.

    I am guessing, but is the underlying http ping failure also causing AutoSwap to fail silently? I am using WebDeploy, and it had been working well earlier last week. Nothing appears in the audit logs, events, or msdeploy log about the swap failing; it looks normal every time.

    WebDeploy: 
    ...
    Info: Finished publishing with auto-swap enabled: SlotName=production, OperationId=...
    Verbose: The synchronization completed in 2 pass(es).
    Total changes: ...


    I have been using this Azure PowerShell cmdlet to work around it:
    Switch-AzureWebsiteSlot -Name <AzureWebsiteName> -Slot1 <slotName> -Slot2 <slotName>

    This eventually works, but I am seeing this ping error too often, including numerous times today (the same error mentioned in this thread):
    -----Snippet-----
    PS C:\> Switch-AzureWebsiteSlot -Name "site-abc" -Slot1 "staging" -Slot2 "production" -Force
    Switch-AzureWebsiteSlot : ExpectationFailed : Cannot swap site slots for site 'site-abc' because the
    'staging' slot did not respond to http ping.
    At line:1 char:1
    + Switch-AzureWebsiteSlot -Name "site-abc" -Slot1 "staging" -Slot2 "produ ...
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : CloseError: (:) [Switch-AzureWebsiteSlot], CloudException
        + FullyQualifiedErrorId : Microsoft.WindowsAzure.Commands.Websites.SwitchAzureWebsiteSlotCommand

    Hopefully this is useful info.

    I look forward to any news you might have!

    Thanks,

    Ray.


    Wednesday, September 9, 2015 2:50 PM
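
The manual workaround described in the post above (re-running the cmdlet until it succeeds) could be automated along these lines. This is a hedged Python sketch, not an official tool: the helper names, the 4-attempt cap, and the 30-second delay are assumptions, and it treats only the "did not respond to http ping" message as retryable. It does not try to detect any other kind of failure; the raw output is returned for inspection. Whitespace is stripped before matching because the console wraps the error message across lines, as the snippets in this thread show.

```python
import subprocess
import time
from typing import Callable, List, Tuple

# Error text quoted in this thread, with whitespace removed for matching.
PING_ERROR = "didnotrespondtohttpping"


def build_swap_command(site: str, slot1: str, slot2: str) -> List[str]:
    """Build a PowerShell invocation of the cmdlet used in this thread."""
    cmdlet = (
        f"Switch-AzureWebsiteSlot -Name '{site}' "
        f"-Slot1 '{slot1}' -Slot2 '{slot2}' -Force"
    )
    return ["powershell", "-NoProfile", "-Command", cmdlet]


def run_command(command: List[str]) -> str:
    """Run the command and return stdout and stderr combined."""
    done = subprocess.run(command, capture_output=True, text=True)
    return done.stdout + done.stderr


def swap_with_retries(
    site: str,
    slot1: str,
    slot2: str,
    max_attempts: int = 4,  # assumed cap, based on "2-4 times" in the post
    delay: float = 30.0,    # assumed pause between attempts
    run: Callable[[List[str]], str] = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry while the ping error appears; return (attempt number, output)."""
    command = build_swap_command(site, slot1, slot2)
    for attempt in range(1, max_attempts + 1):
        output = run(command)
        # Console output may wrap mid-phrase, so compare without whitespace.
        if PING_ERROR not in "".join(output.split()):
            return attempt, output
        if attempt < max_attempts:
            sleep(delay)
    # Fail loudly instead of silently, unlike the AutoSwap behaviour reported.
    raise RuntimeError(f"swap still failing after {max_attempts} attempts")
```

A call such as `swap_with_retries("site-abc", "staging", "production")` would need the Azure PowerShell module installed and an authenticated session.
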
  • Hi Ray,

    There is one more bug in the swap logic which we are fixing right now. The bug causes the swap to fail when the warmup ping request takes more than 60 seconds to complete. That's probably why you sometimes get a successful result but most of the time it fails. We are deploying the fix for this bug currently.


    http://ruslany.net/

    Wednesday, September 9, 2015 3:44 PM
  • Excellent, I look forward to the fix.

    Thanks for the reply.

    Thursday, September 10, 2015 5:03 PM
  • Ray, the fix Ruslan is referring to is now deployed. Could you give it a try now to see if the issue is gone?

    If it still happens, can you share your site name, either directly or indirectly? This will help us investigate.

    thanks,
    David

    Thursday, September 10, 2015 5:14 PM

  • I have been seeing the same behaviour multiple times this afternoon on a few different WebApps with staging slots. AutoSwap is generally silent and does nothing, and PS reveals the http ping error.

    Has the fix been deployed to the WEU region?

    Will the ping timeout be surfaced as a PS parameter or AutoSwap option? It seems like it should be a setting.

    Will AutoSwap errors be surfaced somewhere, like in the portal? 

    The clues are: 
    - reference site: msdnsupprefsite
    - actual *e-dev, *e-qa, *e-sit

    *e-dev, *e-sit - this http ping error occurred a few minutes ago using PS. 

    I will be looking more into this and mitigations for it tomorrow, hopefully I won't need the mitigations!


    Thanks,

    Ray.

    Thursday, September 10, 2015 6:31 PM
  • I'm seeing HTTP ping messages in different languages - perhaps something is being deployed at the moment

    Thursday, September 10, 2015 6:38 PM
  • Hi Ray,

    The scale unit that hosts your sites is still being upgraded. At the moment, less than half of the machines have been upgraded, so it may sometimes work and sometimes fail, depending on which machine takes the call. Please give it another couple of hours and then check again.


    http://ruslany.net/

    Thursday, September 10, 2015 7:23 PM
  • The PS cmdlet looks to be consistently working this morning, which is great news. 

    AutoSwap still fails silently so I'm not sure how to proceed with using it.

    Is there any way to dig into what is happening when it doesn't swap, or am I better off avoiding it entirely?



    • Edited by Ray Hebberd Friday, September 11, 2015 10:27 AM sp
    Friday, September 11, 2015 10:23 AM
  • @Ray: is auto-swap something that was previously working for you, and that started breaking at the same time as the rest of these recent swapping issues? Or is it something that never worked, or maybe that you had not tried before?

    Trying to determine whether it is related to the main issue in this thread, or a separate issue.

    Friday, September 11, 2015 2:30 PM
  • It had worked fairly well earlier last week and deteriorated since - I haven't seen even one AutoSwap this week. 

    I am moving processes away from using it at the moment but there are a couple of things I would like to see before I am tempted to use it again: 

    > Success/failure history of swap operations in the Azure Portal

    > Failure notification that appears in audits and can trigger an email

    I have been using the CmdLet a fair bit today and that is at 100% success rate thus far.

    Friday, September 11, 2015 3:06 PM
  • Hi Ray,

    In our logs, I do see the auto-swap succeeding. There was one at 2015-09-11 09:56:08 UTC and another at 2015-09-11 13:47:05 UTC. Both took about 2 minutes to complete. Note that it takes longer to complete now because, prior to the swap operation, we wait until the site in the staging slot responds to the ping. It takes > 60 seconds for your site to respond. Most probably the auto-swap was faster for your site before because we did not wait until the ping request completed and proceeded with the swap after a 30-second timeout.


    http://ruslany.net/

    Friday, September 11, 2015 5:38 PM

  • So the logic in the auto-swap operation has been modified recently. This could explain something.

    The app does some cache building during warm-up. What happens if the site fails to warm up in under 60 seconds?
    Monday, September 14, 2015 9:12 AM
  • The new logic is that we wait 90 seconds for the site to respond. If it does not respond, we retry the ping 4 times, waiting 90 seconds each time. Only after that do we fail the swap.

    http://ruslany.net/

    Monday, September 14, 2015 3:54 PM
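
The retry rule described in the post above can be modelled in a few lines of Python. This is a sketch under one stated assumption: the initial ping plus 4 retries is read as 5 attempts in total, which the post does not spell out. The names `warmup_with_retries` and `worst_case_seconds` are hypothetical.

```python
from typing import Callable


def warmup_with_retries(
    ping: Callable[[float], bool],
    timeout: float = 90.0,
    retries: int = 4,
) -> bool:
    """Ping once, then retry up to `retries` times, each with its own timeout.

    `ping(timeout)` should return True if the site responded within `timeout`
    seconds and False otherwise.
    """
    attempts = 1 + retries  # assumption: the first ping is not one of the retries
    for _ in range(attempts):
        if ping(timeout):
            return True
    # Only after every attempt has timed out does the swap fail.
    return False


def worst_case_seconds(timeout: float = 90.0, retries: int = 4) -> float:
    """Upper bound on the time spent waiting before the swap is failed."""
    return (1 + retries) * timeout
```

Under this reading, a site gets up to 450 seconds of warm-up time before the swap is failed; if the first ping counts as one of the 4, the bound would be 360 seconds instead.
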
  • Hi Ruslan, 

    I have had a chance to revisit AutoSwap for a particular site & deployment and this time it worked.

    What was the underlying issue?

    I still feel that an AutoSwap failure should be made highly visible. Is there any upcoming feature that will do this?

    Thanks,

    Ray

     

    Monday, September 14, 2015 4:02 PM
  • Hi Ray,

    I am not sure what the issue was when you reported that auto-swap did not work after we applied the fix. From our logs, I saw that the auto-swap had succeeded but took longer.

    I agree that auto-swap failures (and visibility into the auto-swap process in general) should be more visible than they are right now. This is something we should look into.


    http://ruslany.net/

    Monday, September 14, 2015 4:56 PM
  • If you are already logging swap events (success/fail), those should definitely be surfaced.

    Thanks for your responses, it has been helpful. 

    Tuesday, September 15, 2015 8:50 AM