When a website or web service is not available online or doesn’t function well enough for end users to complete a task, the site is considered to be experiencing downtime.

Although most websites and web services strive for zero downtime, some downtime is inevitable. Even giants like Google and Facebook go down occasionally. Technology has improved and providers have systems in place to reduce downtime, but unforeseen circumstances still cause outages.

What constitutes downtime?

Downtime is a subjective term, much like its opposite, uptime. In the early days of the Internet, downtime typically meant that a site was inaccessible to its end users. Today, what constitutes downtime is more complex. Most consider a site or service down if end users can’t complete their task. For example, an e-commerce site is in effect down if end users can’t put an item in their shopping cart. In fact, visitors are often more forgiving of a site that experiences a complete outage than of a site with broken functionality. Poor performance can also count as downtime if it affects the end users’ ability to achieve their goal.

What causes downtime?

Many things can cause downtime for a provider. Some causes, such as scheduled maintenance, are within the provider’s control, but others are not. Each situation is unique, but most causes fall into the following categories.

Human error

As with most things, when something goes wrong the root cause can often be traced to a mistake made by an individual or a team. A seemingly benign code change breaks something that regression testing doesn’t catch, a system is taken offline when it shouldn’t have been, or a DNS entry is updated incorrectly: these are just a few examples of how humans contribute to a site’s downtime. The large AWS outage in early 2017 is a real-life example of how something as simple as a typo can cause downtime; it not only affected Amazon Web Services but also brought down many large websites.

Equipment failure

Equipment wears out and breaks down, and even new equipment can fail without warning. Proper maintenance and hardware redundancy are the best ways to minimize downtime due to hardware. In another Amazon example, the e-commerce giant suffered an outage that affected most of Europe back in 2010. Although it was first suspected that hackers had brought the site down, Amazon later revealed that the downtime was due to a hardware failure at its data center.

Malicious attack

Hackers discover clever new ways to infiltrate and disrupt businesses all the time. One common method is the Distributed Denial-of-Service (DDoS) attack, which attempts to overwhelm servers with requests. The requests come simultaneously and repeatedly from multiple locations, overloading the target’s web servers. The deluge of requests, in effect, blocks legitimate requests and brings the site down. Other attacks include DNS cache poisoning, where the hackers infiltrate a Domain Name System (DNS) resolver’s cache and replace the site’s IP address with one that enables them to exploit the site’s users. For those users, the targeted website is effectively down. Still other attacks involve SSL certificates and malware.

How do websites avoid downtime?

When it comes to hardware, companies use redundancy to make sure that backup systems remain ready in the event of an outage, while load balancers and multiple data centers help to keep performance up. Synthetic monitoring services watch websites, servers, APIs, and web applications for outages, performance, and function, and alert the support teams when things aren’t working properly.

Uptime monitoring

Also called availability monitoring and website monitoring, uptime monitoring is a type of synthetic monitoring that uses a network of computers (checkpoints) to send requests and pings and to open connections to websites and servers. These basic monitors check the response codes and response times and report the results back to the monitoring service. If an error occurs or the response takes longer than a designated limit, the monitoring service may issue an alert right away, or it may first validate the error from another checkpoint before sounding the alarm.
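To make the check-then-validate flow concrete, here is a minimal Python sketch, not any particular monitoring service’s implementation. The names `CheckResult`, `probe`, `is_failure`, and `should_alert` are hypothetical, as are the 5-second limit and the idea of modelling each checkpoint as a function; a real service would run `probe` on machines in different locations.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class CheckResult:
    status: Optional[int]  # HTTP status code, or None if no response arrived
    seconds: float         # time taken to get the response (or to fail)


def probe(url: str, timeout: float = 10.0) -> CheckResult:
    """Send one GET request and record the status code and response time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error code
    except OSError:
        status = None      # DNS failure, refused connection, timeout, ...
    return CheckResult(status, time.monotonic() - start)


def is_failure(result: CheckResult, max_seconds: float) -> bool:
    """A check fails on no response, a 4xx/5xx code, or a slow response."""
    if result.status is None or result.status >= 400:
        return True
    return result.seconds > max_seconds


def should_alert(
    url: str,
    checkpoints: Sequence[Callable[[str], CheckResult]],
    max_seconds: float = 5.0,
) -> bool:
    """Alert only if the first checkpoint fails and a second one confirms it."""
    first, *others = checkpoints
    if not is_failure(first(url), max_seconds):
        return False
    if not others:
        return True  # nothing to validate against, so alert immediately
    return is_failure(others[0](url), max_seconds)
```

Calling `should_alert("https://example.com", [probe, probe])` would run the same probe twice from one machine; validating from a second checkpoint only adds value when the second function runs somewhere else on the network.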

Advanced Availability Monitoring

Advanced Availability Monitoring uses specialized monitoring types to verify availability based on specific servers or functions. Companies use Advanced Availability Monitoring to:

  • Verify TLS/SSL certificates for expiration and content,
  • Check on DNS health by verifying key fields on a DNS entry,
  • Communicate with POP3, SMTP, and IMAP email servers,
  • Query and check MySQL and SQL Server databases, and
  • Check availability and downloads for FTP and SFTP.
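As one illustration of the first item in the list, a certificate expiration check can be sketched with Python’s standard `ssl` module. The function names and the 14-day warning window are assumptions made for this example, and the sketch checks expiration only, not certificate content.

```python
import socket
import ssl
import time


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter date."""
    seconds_left = ssl.cert_time_to_seconds(not_after) - now
    return seconds_left / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection and return the server certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def certificate_needs_attention(host: str, warn_days: float = 14.0) -> bool:
    """True if the certificate is invalid or expires within `warn_days`."""
    try:
        not_after = fetch_not_after(host)
    except ssl.SSLCertVerificationError:
        # The default context rejects expired or untrusted certificates
        # during the handshake, so a failure here already needs attention.
        return True
    return days_until_expiry(not_after, time.time()) < warn_days
```

Keeping the date arithmetic in `days_until_expiry` separate from the network call makes the threshold logic easy to test without contacting a server.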

Performance and function monitoring

Both uptime and advanced availability monitoring work well for detecting system outages, but they offer only limited checks of performance and function. Web Performance, Web Application, and API Monitoring take availability monitoring to another level.

Web Performance monitoring

Performance monitors do more than send a request and receive a response: they use real browsers like Chrome and Internet Explorer to do it. Rather than only checking the response for error messages, the checkpoint loads the response into a browser. Loading the content allows subsequent requests to fire and the page’s scripts and content to load. The monitor measures the performance of each page element, and the monitoring service generates a visual report in the form of a waterfall chart for easier scrutiny. Waterfall reports make root-cause analysis easier by identifying poorly performing content (third party or native) and reporting on front- and back-end performance for each element.
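The kind of analysis a waterfall report supports can be sketched in a few lines of Python. This is only an illustration over already-collected timings, not a browser-based monitor: the `Element` record, the 500 ms threshold, and the rule that any host outside the site’s own domain counts as third party are all assumptions of the example.

```python
from dataclasses import dataclass
from typing import List, Tuple
from urllib.parse import urlparse


@dataclass
class Element:
    url: str
    start_ms: float     # when the request started, relative to navigation start
    duration_ms: float  # how long the request took


def is_third_party(element_url: str, site_host: str) -> bool:
    """Treat anything not served from the site's host or its subdomains as third party."""
    host = urlparse(element_url).hostname or ""
    return not (host == site_host or host.endswith("." + site_host))


def slow_elements(
    elements: List[Element], site_host: str, threshold_ms: float = 500.0
) -> List[Tuple[str, float, str]]:
    """Return (url, duration, origin) for slow elements, slowest first."""
    slow = [e for e in elements if e.duration_ms > threshold_ms]
    slow.sort(key=lambda e: e.duration_ms, reverse=True)
    return [
        (
            e.url,
            e.duration_ms,
            "third party" if is_third_party(e.url, site_host) else "native",
        )
        for e in slow
    ]
```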

Web Application Monitoring

A site may be up but not functioning properly, in which case it is, in effect, experiencing a form of downtime. Web Application Monitoring, or transaction monitoring, helps providers keep their websites functional. The checkpoints use scripts that act as regular users to test login forms, shopping carts, web forms, and payment processes. The monitors also track server responsiveness and check for expected page content along the way.
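A scripted transaction can be thought of as an ordered list of steps, each with a content check, that stops at the first failure. The Python sketch below shows that shape only; `Step`, `StepResult`, and `run_transaction` are hypothetical names, and in practice each `action` would drive a real browser or HTTP session rather than return a fixed string.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], str]  # performs the step and returns the page text
    expect: str                # text that must appear in the resulting page


@dataclass
class StepResult:
    name: str
    passed: bool
    seconds: float
    error: Optional[str] = None


def run_transaction(steps: List[Step]) -> List[StepResult]:
    """Run steps in order, timing each one and stopping at the first failure."""
    results: List[StepResult] = []
    for step in steps:
        start = time.monotonic()
        try:
            page = step.action()
            passed = step.expect in page
            error = None if passed else f"missing text: {step.expect!r}"
        except Exception as exc:  # a step that raises counts as a failed step
            passed, error = False, str(exc)
        results.append(StepResult(step.name, passed, time.monotonic() - start, error))
        if not passed:
            break  # later steps depend on this one, so do not continue
    return results
```

Stopping at the first failed step mirrors how a real user would be blocked: if the login fails, there is no point in testing the shopping cart.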

API Monitoring

SaaS businesses and websites communicate with each other and with end users all the time through their public-facing APIs. When an API fails, more than just the API goes down with it: mobile apps quit working, dependent web content and functions fail, and back-end processes fail. Testing API functions with API Monitoring can reduce downtime by catching failures and trends quickly. Finding issues early can keep them from affecting the API’s users.
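An API check usually goes beyond the status code and validates the payload itself. The following Python sketch assumes a JSON API and a hypothetical `check_api_response` helper that receives the status code and body from whatever HTTP client the monitor uses; the field names in the usage note are invented for illustration.

```python
import json
from typing import Dict, List


def check_api_response(status: int, body: str, required: Dict[str, type]) -> List[str]:
    """Return a list of problems found in an API response (empty means healthy)."""
    problems: List[str] = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return problems + ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return problems + ["body is not a JSON object"]
    # Verify that each required field is present and has the expected type.
    for field, expected_type in required.items():
        if field not in payload:
            problems.append(f"missing field {field!r}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"field {field!r} has wrong type")
    return problems
```

For example, `check_api_response(200, '{"id": 7}', {"id": int, "name": str})` reports the missing `name` field even though the request itself succeeded, which is the kind of failure a status-code check alone would miss.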

Conclusion

Downtime is difficult to avoid, but the right support systems and monitoring solutions can reduce it to near zero. Providers strive for high availability (for example, 99.99% uptime), and many achieve and maintain their goal. Another solution providers use to monitor their web presence is Real User Monitoring (RUM). RUM, a form of Digital Experience Monitoring (DEM), allows a provider to watch its users’ actual experience. Although RUM isn’t a good solution for uptime monitoring, it can provide performance details based on user location, browser type and version, operating system and version, device type, and page viewed.
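To put the 99.99% figure in perspective, an uptime percentage translates directly into a downtime budget: over a 365-day year, 99.99% uptime leaves about 52.6 minutes of downtime. The short Python sketch below shows the arithmetic; the function names are hypothetical, and measuring uptime as the share of successful monitoring checks is a simplifying assumption.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 365.0) -> float:
    """Minutes of downtime that a given uptime percentage permits over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - uptime_percent) / 100.0


def measured_uptime_percent(total_checks: int, failed_checks: int) -> float:
    """Uptime expressed as the share of monitoring checks that succeeded."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100.0 * (total_checks - failed_checks) / total_checks
```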