Developing software for a "no downtime environment"

  • Question

  • Currently my organization has a 2-hour downtime window early on Sunday mornings, during which we deploy our latest code and SQL to the production environment. We received word the other day from the higher-ups that in a few months we will need to have 100% uptime, thus losing our 2-hour window.

    There is obviously a lot that goes into maintaining 24x7 uptime with both software and hardware, but I was just curious whether anyone currently works in this type of environment, and if so, how do you do it?  How do you make changes to your tables in SQL if tables have to be locked? How does your application code respond when the database is being changed right underneath it?

    If anyone has experience, or knows of books, blogs, or resources on how to write code for a 24x7 environment, I would appreciate it.

    Thanks,
    Flea
    Friday, August 14, 2009 8:42 PM

All replies

  • This isn't as uncommon as you'd think - any solution that makes money really needs zero downtime, since downtime leads directly to lost revenue.

    First, you obviously need fault tolerance and redundancy built into the hardware: proper load balancing, clustering, generators, UPS, and so on.

    You need to identify the failure points in the application, and ensure that there is a way to automatically compensate for the failure.

    So have things like a heartbeat monitor for application health, and monitor CPU %, disk space, bandwidth utilisation and so on - the usual suspects.

    You would also need to consider how your application deals with errors - will it always recover, and always notify you in some way of the failure, to allow you to react to the problem?

    What happens if the machine reboots?  Does the app restart?  How quickly will the load balancer redirect traffic away from that server, and so on.

    Assuming that your software really is fault tolerant and capable of 100% uptime, you now have the opportunity to take machines offline one at a time, upgrade them, bring them back into the load-balanced pool, and take the others down, thus maintaining uptime.  That said, it's not usually that easy, and you often do have to plan for a maintenance window to use, should something go wrong.

    The key to the whole process is planning, analysis, planning, and more planning.

    Do you have the software hosted on hardware with a guaranteed 100% uptime, or is this just a case of 'attempt to keep the software up and running 100% of the time'?  Realistically, most software will have some downtime; the slight exceptions to the rule are financial institutions, where an upgrade is a major undertaking.  You need to consider what impact downtime actually has, and the effort involved in maintaining uptime.  100% uptime costs a decent amount of money and effort to do properly.

    Cheers,

    Martin.
    MCSD, MCTS, MCPD. Please mark my post as helpful if you find the information good!
    Thursday, August 20, 2009 3:01 AM
  • Performance Best Practices at a Glance

    http://msdn.microsoft.com/en-us/library/ms998512.aspx


    Regards,
    Jai
    Friday, October 2, 2009 12:04 PM
  • Great question.  We've been looking at this problem for a while - our (many) customers have db's with "billions" of rows in each of them.

    Our first step along this path has been to alter the schema first and apply business-layer updates afterwards. We haven't made it to zero downtime yet - but we have closed the downtime window significantly.   It also means that an upgrade takes many days - not a single event.

    We employ multiple schemes depending upon the kind of change.

    We look at a table kind of like COM versioning works - if the binary interface can stay the same (or backwards compatible), the change is made. This means, of course, adding NULLable columns.  Non-breaking changes are then scripted and installed the week before the scheduled downtime - we try to get as many changes as possible into the system while it is live.
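
    For instance, adding a NULLable column while the system is live is about as small as a non-breaking change gets. A minimal sketch (the table and column names here are made up for illustration):

        -- Hypothetical table and column; adding a NULLable column with no
        -- default or constraint doesn't break existing INSERTs or SELECTs,
        -- so it can be scripted and run against the live system.
        ALTER TABLE dbo.Customer
            ADD LoyaltyScore INT NULL;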

    In most cases adding the column isn't enough - many will require computed values.  A SQL Agent job is then installed that moves through each of the rows and updates them in batches.  This job is usually scheduled to run overnight with a scheduled stop time of "5 am" - hence it may need to run over multiple nights.  How many jobs are created depends upon the anticipated performance impact and scope of the changes.
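
    Roughly, a step in such a job might look like the sketch below (continuing the hypothetical table and column; the function and batch size are also made up):

        -- Illustrative batched backfill: each batch commits on its own, so
        -- locks stay short and the job can be stopped at "5 am" and simply
        -- resumed the next night.
        DECLARE @rows INT;
        SET @rows = 1;
        WHILE @rows > 0
        BEGIN
            UPDATE TOP (5000) dbo.Customer
            SET LoyaltyScore = dbo.ComputeLoyalty(CustomerId)  -- hypothetical
            WHERE LoyaltyScore IS NULL;

            SET @rows = @@ROWCOUNT;           -- 0 once every row is done
            WAITFOR DELAY '00:00:01';         -- breathe between batches
        END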

    Breaking changes are saved for the maint window.  At this time the Agent jobs are removed, and any remaining rows that the jobs didn't get to are fixed up.  During the maint window, constraints are put back in (NOT NULL, for instance) - anything believed necessary to keep data integrity.
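
    Continuing the hypothetical example above, the maintenance-window script might be no more than this:

        -- Maintenance window: finish the rows the nightly jobs didn't reach,
        -- then reinstate the constraint that completes the change.
        UPDATE dbo.Customer
        SET LoyaltyScore = dbo.ComputeLoyalty(CustomerId)
        WHERE LoyaltyScore IS NULL;

        ALTER TABLE dbo.Customer
            ALTER COLUMN LoyaltyScore INT NOT NULL;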

    Once we had a schema change that would break many things (the value of a single column went away and was replaced with a whole complex relationship), so this couldn't be installed prior to the maint window.  However, for existing rows the relationship was known... There wasn't one.   So the business layer code had some "default" logic in it to handle this situation.  An Agent job was installed that defined these complex relationships AFTER the upgrade.  New values entered (or updated) filled in the missing values on the fly as expected.  The Agent job was there to compute and update "legacy" rows.
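
    The "default" logic on the read side can be as simple as treating a missing relationship row as the known legacy answer - roughly like this (the tables and the 'None' default are made up for illustration):

        -- Legacy rows have no relationship rows yet; fall back to a default
        -- until the Agent job has computed them.
        SELECT p.PersonId,
               COALESCE(r.RelationshipKind, 'None') AS RelationshipKind
        FROM dbo.Person AS p
        LEFT JOIN dbo.PersonRelationship AS r
            ON r.PersonId = p.PersonId;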

    Next stop (and it's a big one) - a disconnected, decoupled layer between the business logic and the data layer.   I've been looking into using Service Broker for some of it.
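
    For the curious, a minimal same-database Service Broker sketch looks something like this (every name below is made up, and it assumes the database already has Service Broker enabled): the business layer SENDs a request and moves on, and the data layer drains the queue on its own schedule.

        -- All message type / contract / service names are hypothetical.
        CREATE MESSAGE TYPE [//Example/UpdateRequest] VALIDATION = WELL_FORMED_XML;
        CREATE CONTRACT [//Example/UpdateContract]
            ([//Example/UpdateRequest] SENT BY INITIATOR);
        CREATE QUEUE dbo.BusinessQueue;
        CREATE QUEUE dbo.DataQueue;
        CREATE SERVICE [//Example/BusinessService] ON QUEUE dbo.BusinessQueue;
        CREATE SERVICE [//Example/DataService] ON QUEUE dbo.DataQueue
            ([//Example/UpdateContract]);

        -- The business layer sends a request without calling the data layer
        -- directly; the data layer RECEIVEs from dbo.DataQueue when it can.
        DECLARE @handle UNIQUEIDENTIFIER;
        BEGIN DIALOG CONVERSATION @handle
            FROM SERVICE [//Example/BusinessService]
            TO SERVICE '//Example/DataService'
            ON CONTRACT [//Example/UpdateContract]
            WITH ENCRYPTION = OFF;
        SEND ON CONVERSATION @handle
            MESSAGE TYPE [//Example/UpdateRequest]
            (N'<update entity="Person" id="42" />');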

    And don't forget Requirements.   We were able to narrow down and change the definition of "zero downtime" to carry a very specific limitation:  people need to get work done, but not all parts of the system need to be available.   Next, the features were grouped into "must be available", "limited functionality", or "not available".    This helps with the design so that you don't spend too much time focused on the wrong areas.

    For instance - do users really need to update definition or configuration data?   Or do they need to update/create instance data on a Person or an Address (while not being able to change the State dictionary, for instance)?

    I too am searching for this kind of documentation and these kinds of ideas.

    PS - SQL Server 2008 R2 is supposed to add the ability for services on another tier to participate in and handle Service Broker events.

    • Edited by mrj Tuesday, February 2, 2010 7:39 PM added PS - SQL Server 2008 R2
    Tuesday, February 2, 2010 6:01 PM