Microsoft Azure, SSL and Operational Maturity

This past week, Microsoft made the stunning admission that its Azure cloud service experienced a significant outage due to the failure to renew an expiring SSL certificate. On the heels of previous Azure outages, and additional outages at rival Amazon Web Services, the incident stoked the fears of CIO’s. While many IT executives want to leverage the cloud for flexible and agile computing resources, these recent service outages are turning many “decision lights” from yellow to red.

Sadly, this episode need not to have happened. While it is difficult for large scale service providers to eliminate all bugs from highly complex software environments, this incident was a simple case of human error. And it was  a type of error that is well known in the IT community, with easy preventative techniques.

Let’s start with some background. SSL, or Secure Sockets Layer, is a protocol commonly used to ensure secure communication between web browsers and web sites. Anytime you are engaged in a confidential exchange with a web site (e.g. entering credit card information), SSL is almost certainly an underlying component. Part of the architecture of SSL involves the use of certificates; electronic documents that assure a web browser the authenticity and safety of a website. Owners of websites purchase these certificates from certificate authorities (Symantec and GoDaddy control over 50% of the US market), and typically need to renew them every 1-3 years.

In this recent Azure outage, Microsoft failed to renew the certificates, resulting in widespread outages for clients attempting to connect to their services. While SSL certificate expirations are a disruptive and embarrassing mistake for cloud providers, they are hardly the only error of this type. For many years, IT organizations have seen the perils of a host of “renewal failures”. Here are some of the more common examples:

  • Similar to SSL certificates, domain names need to be renewed on a regular basis. Failure to do so can result in outages and even the loss of ownership of that domain name.
  • Software licenses frequently require regular renewals. Failure to renew a license can result in “user lockouts” or a complete inoperability of the product. Additionally, even after the client understands the problem and offers to renew the licenses, there may be additional financial consequences. Some vendors, recognizing their position of strength, will not offer significant renewal discounts.
  • Software maintenance contracts are also done with a renewal period. Failure to renew them in a timely manner can result in a cutoff of support. Unfortunately, this cutoff is typically discovered during a crisis, as the client tries to call a support line to solve a problem. Like the software licensing example, these pressure driven remediations rarely result in best in class pricing.

Historically, these renewal failures have occurred because organizations failed to establish the renewal function as an institutional process. Instead of the function having a set of robust, “enterprise class” controls, it was left to a single individual. Peter in Purchasing or Sally in Software Licensing had a tickler on a personal calendar or Excel spreadsheet. If they left the firm, went on extended jury duty, or simply forgot to follow up on the renewal, a failure occurred.

Progressive organizations quickly understood that individual responsibility for this type of function was insane. There were too many obvious error points and the consequences of failure were significant and costly. Having someone on the team independently track whose turn it is to bring in Friday morning bagels is fine. Having that same person autonomously handle the renewal of SSL certificates is a disaster waiting to happen.

Here are some of the best practices to institutionalize renewal processes, making them robust, “can’t fail” functions:

  •  Establish Organizational Calendars – Critical functions such as certificate renewals, need to be tracked on a centralized calendar, visible to a team of interested parties. This could be something as simple as a shared spreadsheet or Exchange calendar. Alternatively, it could be a purpose driven piece of software, designed for tracking product renewals. In any case, there must be sufficient lead time prior to the expiration to complete an orderly renewal process.
  • Establish Monitoring and Review Process – Regardless of the tool used for storing renewals dates, there must be a regular team process (e.g. weekly, monthly)  to review the calendar, ensure compliance, and escalate at-risk items.
  • Embed Renewal Item Establishment within Procurement Process – Virtually all renewal items are ultimately purchases from a third party. The procurement process should recognize purchases as “requiring renewal” and ensure that they are tracked on the appropriate organizational calendar.
  • Set up Automated Monitoring of Renewal Software – Many products can be monitored and will send alerts in advance of a product expiration. This is absolutely the case with SSL certificates. Typical monitoring tools can be configured to send an alert to computer operations staff at a console or via email to other team members.
  • Set up Automated Renewal Notification from your Vendor – Many vendors will notify a customer when products are set to expire. Make sure that the notification is set to go to a functional team mailbox (e.g. renewals@yourcompany.com) that is monitored by multiple parties as part of a compliance process.

The concept of process maturity must be embedded within the culture of the organization. All team members need to understand how to evaluate, create and monitor a process so that it is consistent with an organization’s maturity goals. They need to be able to recognize how to differentiate between high value processes, like certificate renewals, and low value processes, like our “who’s bringing the bagels” example. They need to be aware of existing practices, such as team calendars, and utilize these techniques as appropriate.

Running a cloud based service, or any mission critical technology enterprise, is a demanding business. The complexity of modern hardware, software and networks inevitability leads to some level of unanticipated failures. But something as straightforward as a product renewal should not be so challenging. Putting in place the proper culture and process maturity practices can ensure that an organization can handle these renewals with very high levels of accuracy.

This entry was posted in General Management, Process Improvment, Risk. Bookmark the permalink.