Development practices for high availability

  • Question

  • Hi,

    I'm becoming involved in planning what our dev team must do to improve system availability. I don't yet have all the facts, but the given is that we'll have two geographically isolated sites with hardware and VMs in each. Management's desire is that, should one site "go down", our systems will "resume" at the other site. Details are a little vague as I only became involved yesterday.

    Our applications consist of:

    1. Web apps (public and internal facing)
    2. Web services (REST)
    3. Numerous console apps (a "tradition" here) which implement so-called "batch" processing
    4. Scheduling of these batch apps to run, do their stuff, then end.
    5. Significant reliance on SQL Server.

    None of the code has ever given any consideration to being restartable or recoverable, certainly not in the sense of following any sort of design pattern. The web apps all use in-memory session state (which is already causing issues with plans for load balancing), and the services and console apps have no concept of logging their work in such a way that a restart or recovery can be effected.
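
    To make concrete what I mean by "restartable", here is a minimal sketch of a batch job that records a checkpoint in the same transaction as each unit of work, so a rerun picks up after the last committed item. It is only an illustration under assumptions: Python's sqlite3 stands in for SQL Server so the example is self-contained, and the table names (work_items, checkpoint), the job name, and the crash_after switch are all hypothetical, not anything from our codebase.

```python
import sqlite3


def make_db(n_items):
    # isolation_level=None = autocommit mode; transactions are started explicitly with BEGIN.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE work_items (id INTEGER PRIMARY KEY, processed INTEGER DEFAULT 0)")
    conn.execute("CREATE TABLE checkpoint (job TEXT PRIMARY KEY, last_id INTEGER)")
    conn.executemany("INSERT INTO work_items (id) VALUES (?)",
                     [(i,) for i in range(1, n_items + 1)])
    return conn


def run_batch(conn, job, crash_after=None):
    """Process pending items; crash_after (hypothetical) simulates an outage."""
    row = conn.execute("SELECT last_id FROM checkpoint WHERE job = ?", (job,)).fetchone()
    last_id = row[0] if row else 0  # resume point: 0 means nothing done yet
    pending = conn.execute(
        "SELECT id FROM work_items WHERE id > ? ORDER BY id", (last_id,)).fetchall()
    done = 0
    for (item_id,) in pending:
        if crash_after is not None and done == crash_after:
            raise RuntimeError("simulated interruption")
        conn.execute("BEGIN")
        try:
            # The work and the checkpoint commit together or not at all.
            conn.execute("UPDATE work_items SET processed = processed + 1 WHERE id = ?",
                         (item_id,))
            conn.execute("INSERT OR REPLACE INTO checkpoint (job, last_id) VALUES (?, ?)",
                         (job, item_id))
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
        done += 1
    return done


if __name__ == "__main__":
    conn = make_db(5)
    try:
        run_batch(conn, "nightly", crash_after=2)   # dies after two items
    except RuntimeError:
        pass
    print(run_batch(conn, "nightly"))               # rerun handles only the remainder
```

    The point of the sketch is that the rerun neither repeats nor skips work, because the progress marker can never disagree with the data it describes. None of our console apps currently do anything like this.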

    Furthermore, our code (which reflects many earlier years of bad practices) has DB queries and updates scattered all over the applications, with very little use of stored procedures and a great deal of raw SQL query strings embedded right in the application source code.

    In addition, there is very little use of SQL Server transactions, so arbitrary interruptions to almost any app can lead to partially performed DB updates.
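
    As an illustration of the failure mode, here is a minimal sketch of a two-table update that is interrupted midway, run once without a transaction and once with one. Again this is only a sketch under assumptions: Python's sqlite3 stands in for SQL Server purely to keep it self-contained, and the orders/stock tables and the crash_midway flag are hypothetical.

```python
import sqlite3


def make_db():
    # isolation_level=None = autocommit: each statement commits on its own
    # unless we explicitly issue BEGIN.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")
    conn.execute("CREATE TABLE stock (item TEXT PRIMARY KEY, qty INTEGER)")
    conn.execute("INSERT INTO stock VALUES ('widget', 10)")
    return conn


def place_order(conn, item, qty, crash_midway=False, use_transaction=True):
    """Insert an order and decrement stock; optionally fail between the two steps."""
    if use_transaction:
        conn.execute("BEGIN")
    try:
        conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty))
        if crash_midway:
            raise RuntimeError("simulated interruption")
        conn.execute("UPDATE stock SET qty = qty - ? WHERE item = ?", (qty, item))
        if use_transaction:
            conn.execute("COMMIT")
    except Exception:
        if use_transaction:
            conn.execute("ROLLBACK")  # undo the half-finished work
        raise


def state(conn):
    orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    stock = conn.execute("SELECT qty FROM stock WHERE item = 'widget'").fetchone()[0]
    return orders, stock


if __name__ == "__main__":
    for use_tx in (False, True):
        conn = make_db()
        try:
            place_order(conn, "widget", 3, crash_midway=True, use_transaction=use_tx)
        except RuntimeError:
            pass
        print(use_tx, state(conn))
```

    Without the transaction, the order row exists but the stock was never decremented, which is exactly the kind of inconsistency that ends up as a production ticket. With it, the interrupted attempt leaves no trace.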

    All of the above means that even now (without any concerns about this dual-site setup) we tend to have a stream of production tickets due to occasional aborted operations, where an app may have done only part of some multi-table update, or where some service experienced an exception and, although it logged it, left the overall data/system in an incorrect or inconsistent state. I suspect these kinds of issues will only increase in number and complexity if we naively fiddle with each app to "make it more available".

    I'm no expert on SQL Server, but I did get into a small disagreement with another experienced developer who told me "People are moving away from SQL Server transactions due to potentially horrible locking problems", a statement I am skeptical of!

    I have extensive development experience myself (many decades), but it is quite limited when it comes to these areas, so I'd appreciate any "first thoughts" from others who have done this.

    We (development) must plan what we "need to do" to make our applications more available, and the dual-site setup (already heavily invested in) is the environment we're expected to execute within. This strikes me as non-trivial, so we may need to do some rewriting and so on.

    Addressing the in-memory session state isn't too hard, but that only provides browser users with some degree of availability in the event of a site outage.

    Thanks in advance!





    Saturday, May 12, 2018 3:52 PM