App Service Issues in the Australia East Region


  • Root Cause Analysis

    Feb 2016 Australia East App Service Interruption


    Report Date: March 9, 2016

    The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

    MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

    Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

    Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

    The descriptions of other companies’ products in this document, if any, are provided only as a convenience to you. Any such references should not be considered an endorsement or support by Microsoft. Microsoft cannot guarantee their accuracy, and the products may change over time. Also, the descriptions are intended as brief highlights to aid understanding, rather than as thorough coverage. For authoritative descriptions of these products, please consult their respective manufacturers.

    © 2016 Microsoft Corporation. All rights reserved. Any use or distribution of these materials without express authorization of Microsoft Corp. is strictly prohibited.

    Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

    The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

    Region(s): Australia East

    RCA:

    Some of our customers using Azure App Service (Web, Mobile and API Apps) in the Australia East region experienced high latencies, timeouts or failures while attempting to access or connect to their applications. Due to unprecedented demand in this region, one storage scale unit had been operating under higher than expected load and capacity pressure. This caused intermittent high latencies for storage transactions on this scale unit and resulted in failures for some customers.

    In an effort to relieve the capacity pressure, the team had two work streams in motion: one was to increase the storage capacity of servers in the cluster, and the other was to improve the efficiency of the code accessing storage. During the process of increasing server capacity, the system tends to favor servers with additional capacity, spreading load less evenly than normal, which can lead to hotspots in the system. The code changes to improve storage efficiency relieve pressure across all servers evenly and act to balance load better. Starting February 15th, the program to increase server capacity was initiated, and the imbalanced load issue became noticeable around February 18th. On February 18th, the team attempted to upgrade the storage code; however, the overall pressure on the system did not allow that change to take effect. Engineers decided to allow more time for more servers to be upgraded to relieve capacity pressure. On February 25th, the team deployed a new management server that was able to handle the deployment of the storage code patch. Once the new management server was deployed, however, it caused a spike in background workload that placed additional load on the already overloaded servers and a brief spike in operational latency, resulting in many VM failures. After two hours, the code patch was applied, freeing space more evenly across the servers in the cluster and allowing the system to perform within an acceptable range.

    In parallel with the above improvements, the App Service team deployed a new scale unit and directed new and existing customers to create new resource groups so that their applications would not be affected.

    We are continuously taking steps to improve the Microsoft Azure platform and processes to help ensure such incidents do not occur in the future; in this case those steps include (but are not limited to):

    • Improving the load-balancing methods to prevent this sort of issue from recurring.
    • Making additional improvements to infrastructure agility so that, in the future, App Service customers are better shielded from underlying platform issues.


    FAQ

    - Why did I experience higher than expected latencies for my Web App in Australia East? 

    Delays in read/write operations are not expected in normal operation. We experienced higher than expected demand for resources in the region, which caused higher latencies for storage operations on one storage scale unit in the region. Because content for App Service Apps is stored on Azure Storage infrastructure, App Service Apps were adversely impacted. Steps were taken to add an additional storage scale unit in the region and perform load-balancing operations on the existing scale unit. This has improved resource utilization and restored latencies to normal levels for all storage customers and for other services (such as App Service Apps) that depend on the storage infrastructure.

    - Do you expect the latency to impact other services?  

    Higher latency may impact other services that utilize Standard Storage, for example by slowing the provisioning of new virtual machines that use storage accounts.

    - Is Premium Storage in Australia East also affected by this issue? 

    No.

    - What did Microsoft do to mitigate the issue?  

    We have now added an additional storage scale unit in the region and performed load-balancing operations on the existing scale unit. This has improved resource utilization and restored latencies to normal levels.

    - Is there anything that I can do to mitigate the impact now? 

    The issue is now resolved and the impacted storage scale unit has been restored to normal levels of resource utilization, latencies and availability. At this time, we do not require any action on your part to mitigate any issues.

    - What is Microsoft doing to avoid this issue in the future?  

    Ensuring customers have a great Azure experience is a top priority for Microsoft. We are seeing increased demand for Standard Storage in Australia East and are committed to rapidly scaling our offerings. Specifically, in this case, we have increased our investment in storage resources in Australia and implemented resource deployment process improvements to support strong customer demand. On top of that, we will implement a set of improvements to the App Service infrastructure to reduce the risk of a storage outage degrading the customer experience in the future.

    Original Post dated 2/25/2016

    Hello,

    This post describes the ongoing issues with storage latencies in the Australia East region. We will be updating this post periodically with new information.

    In the last few days we have been seeing large storage latencies in the cluster supporting App Service (Web, Mobile and API apps) in Australia East. This issue leads to apps being intermittently down (site content and code are stored on the storage cluster and are shared between instances of App Service apps). Other Azure services that depend on this cluster have been seeing issues as well. The cluster in question has very limited space left, which has made mitigations harder to implement.

    The team has been working on several ways to mitigate the situation:

    • We are in the process of physically upgrading some of the hardware on the cluster
    • We are working on migrating customer content from the affected cluster to a new cluster in Australia East as a parallel effort. If we choose to go this route, we will put sites in Read Only mode for the duration of the migration, and as soon as the site content is available in the new location we will return the sites to Read/Write mode.
    • We are in the process of building a new scale unit in Australia East that will operate against a different storage cluster, allowing new customers to deploy applications into an unaffected environment.

    At this point we don’t have a good estimate of when the storage cluster will be stable again.

    To keep applications up, we can suggest a few options for customers:

    • Implement the Local Cache option for their app – the Local Cache option (explained below) will allow applications to keep running in most scenarios. This feature is still in preview and has several caveats. We will be publishing an official blog post on this in the next few days.
    • Migrate their app to another region – this can be done using our site cloning capabilities. Once a site is cloned, customers can use Azure Traffic Manager and leverage the new instance of the app as a failover node (a simple health-check sketch follows this list).
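    For the second option, before fronting the original app and its clone with Traffic Manager, it can help to confirm that both endpoints respond. The sketch below is a minimal client-side health check, not part of Traffic Manager itself; the hostnames are hypothetical placeholders.

    # Minimal pre-check before fronting an app and its clone with Traffic Manager.
    # The hostnames below are hypothetical placeholders; substitute your own apps.
    import requests

    ENDPOINTS = {
        "original (Australia East)": "https://myapp.azurewebsites.net/",
        "clone (failover region)": "https://myapp-clone.azurewebsites.net/",
    }

    def check(name, url):
        """Report whether an endpoint answers with a healthy status code."""
        try:
            resp = requests.get(url, timeout=10)
            status = "healthy" if resp.status_code < 400 else "unhealthy"
            print("%s: HTTP %d (%s)" % (name, resp.status_code, status))
        except requests.RequestException as exc:
            print("%s: unreachable (%s)" % (name, exc))

    if __name__ == "__main__":
        for name, url in ENDPOINTS.items():
            check(name, url)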

    We recognize the severity of the situation and we deeply apologize for the issues; we are working hard to make things right. We will be updating this post as soon as we have new information.

    Thanks!

    The App Service team

    Local cache information:

    What is Local Cache? 

    Azure web applications' content is stored on Azure Storage and surfaced in a durable manner as a content share. This design is intended to work with a variety of applications and has the following attributes:

    • The content is shared across multiple VM instances of the web application. 
    • Content is durable and can be modified by running web applications. 
    • Log files and diagnostic data files are available under the same shared content folder. 
    • Publishing new content directly updates the content folder, and the same can be viewed through the SCM web site and the running web app immediately (typically, some technologies such as ASP.NET initiate a web application restart on certain file changes to pick up the latest content). 

    While many web applications use one or all of these features, some web applications just want a high-performance, read-only content store from which they can run with high availability. These applications can benefit from a VM-instance-specific copy of the content, hereafter referred to as "Local Cache". "Local Cache" provides a web role view of your content; this content is a write-but-discard cache of your storage content that is created asynchronously on site startup. When the cache is ready, the site is switched to run against the cached content. 
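
    To make the "write-but-discard" behavior concrete, here is a minimal sketch (an illustration, not App Service code) that writes a marker file under the app's content root. With Local Cache enabled, the file lands in the per-VM cached copy and disappears when the app restarts or moves to another VM; without Local Cache it persists in the shared content share. Using the HOME and WEBSITE_INSTANCE_ID environment variables assumes the script runs on App Service.

    # Illustrates Local Cache's write-but-discard behavior: the marker written
    # here is lost after a restart or VM move when Local Cache is enabled.
    import datetime
    import os

    content_root = os.environ.get("HOME", ".")     # d:\home on App Service
    marker_dir = os.path.join(content_root, "site", "wwwroot")
    os.makedirs(marker_dir, exist_ok=True)
    marker = os.path.join(marker_dir, "local-cache-marker.txt")

    with open(marker, "a") as f:
        f.write("written at %sZ by instance %s\n" % (
            datetime.datetime.utcnow().isoformat(),
            os.environ.get("WEBSITE_INSTANCE_ID", "unknown")))

    print("marker exists:", os.path.exists(marker))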

    Web applications running on Local Cache enjoy the following benefits: 

    • They are immune to latencies experienced when accessing content on Azure Storage. 
    • They are immune to planned upgrades or unplanned downtime of the servers serving the content share, and to any other disruptions with Azure Storage. 
    • Fewer app restarts due to storage share changes. 

     

    The Details 

    These are the details of the local cache. 

    • The local cache is a copy of the /site and /siteextensions folders of the web application and is created on the local VM instance on web application startup. The size of the local cache per web application is limited to 300 MB by default but can be increased up to 1 GB. 
    • The local cache is read-write; however, any modifications will be discarded when the web application moves between virtual machines or gets restarted. The local cache should not be used for applications that persist mission-critical data in the content store. 
    • Web applications can continue to write log files and diagnostic data as they do currently. Log files and data, however, are stored locally on the VM and are then copied over periodically to the shared content store. The copy to the shared content store is best-effort, and write-backs could be lost due to a sudden crash of a VM instance. 
    • There is a change in the folder structure of the LogFiles and Data folders for web apps that use Local Cache. There are now sub-folders in the storage "LogFiles" and "Data" folders following the naming pattern of "unique identifier" + timestamp. Each of the sub-folders corresponds to a VM instance on which the web application is running or has run. 
    • Publishing changes to the web application through any of the publishing mechanisms will publish to the shared content store. This is by design, as we want the published content to be durable. To refresh the local cache of the web application, it needs to be restarted. Seems like an excessive step? See below to make the lifecycle seamless. 
    • D:\Home will point to Local Cache. D:\local will continue pointing to the temporary VM-specific storage. 
    • The default content view of the SCM site will continue to be that of the shared content store. To see what your local cache folder looks like, you can navigate to https://SiteName.scm.azurewebsites.net/VFS/LocalSiteRoot/LocalCache (a scripted example follows this list). 
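
    The same local cache folder can also be listed with a script through Kudu's VFS API. The sketch below is a minimal example; the site name and deployment (publishing) credentials are placeholders, and the /api/vfs/LocalSiteRoot/LocalCache/ path mirrors the browser URL in the last bullet above.

    # List the Local Cache folder of the current VM instance via the Kudu VFS API.
    # SITE, USER and PASSWORD are placeholders: use your app's name and its
    # deployment (publishing) credentials from the publish profile.
    import requests

    SITE = "SiteName"
    USER = "$SiteName"                     # deployment user name
    PASSWORD = "<deployment password>"

    url = "https://%s.scm.azurewebsites.net/api/vfs/LocalSiteRoot/LocalCache/" % SITE
    resp = requests.get(url, auth=(USER, PASSWORD), timeout=30)
    resp.raise_for_status()

    for entry in resp.json():              # Kudu returns a JSON array of entries
        kind = "dir " if entry.get("mime") == "inode/directory" else "file"
        print(kind, entry.get("name"))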

    Best Practices when using Local Cache 

    • Enabling Local Cache: Local Cache is enabled on a per-web-application basis by using an AppSetting: WEBSITE_LOCAL_CACHE_OPTION = Always 

    Configure via the portal (portal.azure.com): [screenshot: Configuring Local Cache in the Portal]

    • Local Cache size: the default local cache size is 300 MB. This includes the /site and /siteextensions folders that are copied from the content store, as well as any locally created log and data folders. To increase this limit, use the AppSetting WEBSITE_LOCAL_CACHE_SIZEINMB. The size can be increased up to 1 GB per web application. 
    • Usage recommendation: It is recommended that Local Cache be used in conjunction with the Staging Environments feature. 
      1. Add a sticky AppSetting WEBSITE_LOCAL_CACHE_OPTION with value "Always" to your Production slot. If using WEBSITE_LOCAL_CACHE_SIZEINMB, also add it as a sticky setting to your Production slot (see the sketch after this list for one way to script this step). 
      2. Create a Staging slot and publish to your Staging slot. The staging slot typically does not use Local Cache, to enable a seamless build-deploy-test lifecycle for staging while getting the benefits of Local Cache for the production slot. 
      3. Test your site against your Staging slot. 
      4. Once you are ready, issue a swap operation between your Staging and Production slots. 
      5. Sticky settings stick to a slot by name, so when the Staging slot gets swapped into Production it picks up the Production slot's Local Cache AppSettings. The newly swapped Production slot will run against the Local Cache after a few minutes and will be warmed up as part of the slot warm-up after the swap. When the slot swap is complete, your Production slot will be running against Local Cache.
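
    Step 1 can also be scripted against the Azure Resource Manager REST API instead of the portal. The sketch below is a rough example under stated assumptions: the subscription, resource group, site name and bearer token are placeholders, and the 2015-08-01 Microsoft.Web api-version is an assumption that may need adjusting for your environment. It merges the Local Cache settings into the production slot's app settings and marks them as sticky via the slotConfigNames resource.

    # Sketch: add the Local Cache app settings to the production slot and mark
    # them sticky, using the ARM REST API. All identifiers are placeholders and
    # the api-version is an assumption; adjust for your environment.
    import requests

    SUBSCRIPTION = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    SITE = "<site-name>"
    TOKEN = "<Azure AD bearer token for the subscription>"
    API = "api-version=2015-08-01"         # assumed Microsoft.Web api-version

    base = ("https://management.azure.com/subscriptions/%s/resourceGroups/%s"
            "/providers/Microsoft.Web/sites/%s" % (SUBSCRIPTION, RESOURCE_GROUP, SITE))
    headers = {"Authorization": "Bearer %s" % TOKEN,
               "Content-Type": "application/json"}

    # 1. Read the production slot's current app settings and merge in Local Cache.
    settings = requests.post("%s/config/appsettings/list?%s" % (base, API),
                             headers=headers).json().get("properties", {})
    settings.update({"WEBSITE_LOCAL_CACHE_OPTION": "Always",
                     "WEBSITE_LOCAL_CACHE_SIZEINMB": "300"})
    requests.put("%s/config/appsettings?%s" % (base, API), headers=headers,
                 json={"properties": settings}).raise_for_status()

    # 2. Mark the settings as sticky (slot settings) so they stay with production.
    requests.put("%s/config/slotConfigNames?%s" % (base, API), headers=headers,
                 json={"properties": {"appSettingNames": [
                     "WEBSITE_LOCAL_CACHE_OPTION",
                     "WEBSITE_LOCAL_CACHE_SIZEINMB"]}}).raise_for_status()
    print("Sticky Local Cache settings applied to the production slot.")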

    Frequently asked questions 

    How can I tell if Local Cache is applicable for my web application? 

    If your web application needs a high-performance, reliable content store, does not use the content store to write critical data at runtime, and is less than 1 GB in total size, then the answer is yes! To get the total size of your "site" and "site extensions" folders, you can use the "Azure Web Apps Disk Usage" site extension. 
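
    If you would rather not install the site extension, a rough size check can be run from the app itself (for example from the Kudu console). This is a minimal sketch; the folder names under the HOME directory are assumed to follow the usual App Service layout.

    # Rough size check for the site and SiteExtensions folders, an alternative
    # to the "Azure Web Apps Disk Usage" site extension.
    import os

    home = os.environ.get("HOME", ".")     # d:\home on App Service

    def folder_size_mb(path):
        """Sum the sizes of all files under path, in megabytes."""
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass                   # file removed or inaccessible mid-walk
        return total / (1024.0 * 1024.0)

    total_mb = sum(folder_size_mb(os.path.join(home, folder))
                   for folder in ("site", "SiteExtensions"))
    verdict = "fits within" if total_mb < 1024 else "exceeds"
    print("site + SiteExtensions: %.1f MB (%s the 1 GB Local Cache limit)"
          % (total_mb, verdict))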

     

    How do I enable Local Cache? 

    See above section on Best Practices when using Local Cache. 

     

    How can I tell if my site has switched to using local cache? 

    If using the Local Cache feature with Staging Environments, the swap operation will not complete until Local Cache is warmed up. To check whether your site is running against the local cache, you can check the worker process environment variable WEBSITE_LOCALCACHE_READY. Use the instructions here to access the worker process environment variable on multiple instances. 
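
    As a quick illustration, the variable can be read from inside the application's worker process, for example in a small diagnostic page or script like the sketch below (a minimal example, not an official tool).

    # Print the Local Cache readiness flag from inside the worker process.
    import os

    print("instance:", os.environ.get("WEBSITE_INSTANCE_ID", "<unknown>"))
    print("WEBSITE_LOCALCACHE_READY:",
          os.environ.get("WEBSITE_LOCALCACHE_READY", "<not set>"))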

     

    I just published new changes, but my web application does not seem to have them. Why? 

    If your web application uses Local Cache, then you need to restart your site to get the latest changes. Don’t want to do that to a production site? See the slot options above. 

     

    Where are my logs? 

    With Local Cache, your logs and data folders do look a little different. However, the structure of your sub-folders remains the same, except that they are nested under a sub-folder of the format "unique VM identifier" + timestamp. 
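
    A short sketch for finding your way around those per-instance sub-folders, assuming the standard LogFiles location under the HOME directory exposed by App Service:

    # Walk the per-instance sub-folders under LogFiles and show the newest file
    # in each, since every "unique VM identifier" + timestamp folder holds one
    # instance's logs.
    import os

    logs_root = os.path.join(os.environ.get("HOME", "."), "LogFiles")

    for instance_dir in sorted(os.listdir(logs_root)):
        full = os.path.join(logs_root, instance_dir)
        if not os.path.isdir(full):
            continue
        files = [os.path.join(root, name)
                 for root, _dirs, names in os.walk(full) for name in names]
        newest = max(files, key=os.path.getmtime, default=None)
        print(instance_dir, "->", newest or "(no files)")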

     

    I have Local Cache enabled but my web application still gets restarted. Why is that? I thought Local Cache helped with frequent app restarts. 

    Local Cache does help prevent storage-related web application restarts. However, your web application could still undergo restarts during planned infrastructure upgrades of the VM. Overall, you should experience fewer app restarts with Local Cache enabled. 

    Thursday, February 25, 2016 9:47 PM

All replies

  • Status update as of February 28, 2016

     

    The team has been working on the issue in the last few days.

    What was done

    • Physical upgrades to the storage cluster supporting Azure App Service in Australia East were completed
    • New App Service scale-unit deployed in Australia East on a separate storage cluster
    • Changes to App Service behavior when accessing storage were deployed as well as new instrumentation

    What it means

    • Reliability in the last 48 hours, as measured by our internal instrumentation and external tools like Pingdom, has been close to 100%.
    • New App Service applications that are deployed to a new Resource Group in Australia East will end up on the new scale unit, which ensures that the apps run against a different storage cluster.

     

    We are still monitoring availability and will update when we know issues are completely resolved.



    Monday, February 29, 2016 3:42 PM