Some suggested changes by a reviewer #107

Open · wants to merge 1 commit into base: master

12 changes: 6 additions & 6 deletions hongyu/SuccessStory.md
@@ -8,26 +8,26 @@ Email: {qlin, jlou, honzhang, dongmeiz}@microsoft.com


##Background
Online service systems such as online banking systems and e-commerce systems have been increasingly popular and important in our society. During operation of an online service, there can be a live-site service incident: an unplanned interruption/outage to the service or degradation in the quality of the service. Such incident can lead to huge economic loss or other serious consequences. For example, the estimated average cost of one hour’s service downtime for Amazon.com is $180,000 [1].
Online service systems, such as online banking systems and e-commerce systems, have become increasingly popular and important in our society. During the operation of an online service, a live-site service incident can occur: an unplanned interruption, outage, or degradation in the quality of the service. Such incidents can lead to huge economic losses or other serious consequences. For example, the estimated average cost of one hour’s service downtime for Amazon.com is $180,000 [1].

Once a service incident occurs, the service provider should take action immediately to diagnose the incident and restore the service as soon as possible. A typical incident-management procedure in practice (e.g., at Microsoft and other service-provider companies) goes as follows. When the service monitoring system detects a service violation, it automatically sends out an alert and makes a phone call to a group of On-Call Engineers (OCEs) to trigger an incident investigation. Given an incident, OCEs need to understand what the problem is and how to resolve it. In ideal cases, OCEs can identify the root cause of the incident and fix it quickly. In most cases, however, OCEs are unable to do so within a short time, as it usually takes time to identify and fix the root cause, conduct regression testing, and re-deploy the new version to the data centers. Thus, in order to recover the service as soon as possible, a common practice is to identify a temporary workaround solution (such as restarting a server component) that restores the service. The underlying root cause can then be identified and fixed via offline postmortem analysis after service restoration.

Incident management has become a critical task for online services. The goal is to minimize the service downtime and to ensure high quality of the provided services. In practice, incident management of an online service heavily depends on data collected at runtime of the service, such as service-level logs, performance counters, and machine/process/service-level events. Such monitoring data typically contains information that reflects the runtime state and behavior of the service. Based on the data collected, service incidents can be timely detected and mitigated.
Incident management has become a critical task for online services. The goal is to minimize service downtime and to ensure high quality of the provided services. In practice, incident management of an online service heavily depends on data collected at service runtime, such as service-level logs, performance counters, and machine/process/service-level events. Such monitoring data typically contains information that reflects the runtime state and behavior of the service. Based on the collected data, service incidents can be detected and mitigated in a timely manner.
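
To make the detection idea concrete, here is a minimal, illustrative sketch in Python that flags a possible service violation from a single performance counter. The counter name, window size, and threshold are assumptions made for illustration only; this is not how Microsoft's monitoring systems actually work.

```python
from collections import deque
from statistics import mean, pstdev

class CounterMonitor:
    """Flag a sample as a possible service violation when it deviates
    strongly from the recent history of one performance counter."""

    def __init__(self, window_size=60, threshold_sigma=3.0):
        self.window = deque(maxlen=window_size)   # recent samples, e.g. request latency in ms
        self.threshold_sigma = threshold_sigma

    def observe(self, sample):
        alert = False
        if len(self.window) == self.window.maxlen:
            mu, sigma = mean(self.window), pstdev(self.window)
            if sigma > 0 and (sample - mu) / sigma > self.threshold_sigma:
                alert = True                      # in a real system this would page the OCEs
        self.window.append(sample)
        return alert

# Usage with a short window so the example triggers: the latency spike is flagged.
monitor = CounterMonitor(window_size=5)
for latency_ms in [120, 118, 125, 119, 121, 950]:
    if monitor.observe(latency_ms):
        print(f"possible incident: latency spiked to {latency_ms} ms")
```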


##Service Analysis Studio
We formulated the problem of incident management for online services as a software analytics problem [2], which can be tackled with phases of task definition, data preparation, analytic-technology development, and deployment and feedback gathering. We carried out a two-year research project, where we designed a set of incident management techniques based on the analysis of a huge amount of data collected at service runtime [3]. As a result of this project, we developed a tool called Service Analysis Studio (SAS), which targets at real incident management scenarios of large-scale online services provided by Microsoft.
We formulated the problem of incident management for online services as a software analytics problem [2], which can be tackled with phases of task definition, data preparation, analytic-technology development, and deployment and feedback gathering. We carried out a two-year research project, where we designed a set of incident management techniques based on the analysis of a huge amount of data collected at service runtime [3]. As a result of this project, we developed a tool called Service Analysis Studio (SAS), which targets real incident management scenarios of large-scale online services provided by Microsoft.

SAS includes a set of data-driven techniques for diagnosing service incidents. Each of these techniques targets a specific scenario and a certain type of data. Here we briefly introduce some of the major techniques SAS offers:

* _Identification of Incident Beacons from System Metrics_: When engineers diagnose incidents of online services, they usually start from hunting for a small subset of system metrics that are symptoms of the incidents. We call such kind of metrics service-incident beacons. A service-incident beacon could provide useful information helping engineers locate the cause of an incident. For example, when a resource intensive SQL query blocks the execution of other queries accessing the same table, symptoms can be observed on monitoring data: the waiting time on the SQL-inducing lock becomes longer, and the event “SQL query time out failure” is triggered. Such metrics can be considered as incident beacons. We developed data mining based techniques that helped OCEs effectively and efficiently identify service-incident beacons from such huge number of system metrics. The technical details can be found elsewhere [4, 5].
* _Identification of Incident Beacons from System Metrics_: When engineers diagnose incidents of online services, they usually start by hunting for a small subset of system metrics that are symptoms of the incidents. We call such metrics "service-incident beacons." A service-incident beacon can provide useful information that helps engineers locate the cause of an incident. For example, when a resource-intensive SQL query blocks the execution of other queries accessing the same table, symptoms can be observed in the monitoring data: the waiting time on the lock induced by the SQL query becomes longer, and the event “SQL query time out failure” is triggered. Such metrics can be considered service-incident beacons. We developed data-mining-based techniques that help OCEs effectively and efficiently identify service-incident beacons from a huge number of system metrics (a minimal sketch of this idea is included after this list). The technical details can be found elsewhere [4, 5].

* _Mining Suspicious Execution Patterns_: Transactional logs provide rich information for diagnosing service incidents. When scanning through the logs, OCEs usually look for a set of log events that appear in the log sequences of failed requests but not in those of succeeded requests. Such a set of log events is called a suspicious execution pattern. A suspicious execution pattern can be an error message indicating a specific fault, or a combination of log events from several operations. For example, a normal execution path looks like {task start, user login, cookie validation success, access resource R, do the job, logout}. In contrast, a failed execution path may look like {task start, user login, cookie not found, security token rebuild, access resource R error}. The code branch reflected by {cookie not found, security token rebuild, access resource R error} indicates a suspicious execution pattern. We proposed a mining-based technique to automatically identify suspicious execution patterns (a toy contrast example follows this list). The details of our technique can be found elsewhere [6].

* _Leveraging Previous Effort for Recurrent Incidents_: OCEs of an online service system may receive many similar incident reports. Leveraging the knowledge gained from past incidents can therefore improve the effectiveness and efficiency of incident management. The key is to design a technique that automatically retrieves past incidents similar to the new one and then proposes a potential restoration action based on the past solutions (a simple retrieval sketch is given after this list). More details can be found elsewhere [6].
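
To illustrate the beacon-identification bullet above: the actual mining techniques are described in [4, 5]; the sketch below only conveys the intuition, ranking metrics by how differently they behave inside an incident window versus during normal operation. The metric names and numbers are invented.

```python
from statistics import mean, pstdev

def rank_beacon_candidates(metrics, incident_positions):
    """Rank system metrics by how strongly their incident-window values
    deviate from their normal-operation values (a higher score means a
    better service-incident-beacon candidate)."""
    scores = {}
    for name, samples in metrics.items():
        normal = [v for i, v in enumerate(samples) if i not in incident_positions]
        during = [v for i, v in enumerate(samples) if i in incident_positions]
        spread = pstdev(normal) or 1e-9           # guard against perfectly flat metrics
        scores[name] = abs(mean(during) - mean(normal)) / spread
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical counters sampled once a minute; positions 8 and 9 fall inside the incident.
metrics = {
    "sql_lock_wait_ms": [5, 6, 5, 7, 6, 5, 6, 5, 90, 120],
    "cpu_percent":      [40, 42, 39, 41, 40, 43, 41, 40, 44, 42],
}
print(rank_beacon_candidates(metrics, incident_positions={8, 9}))
# sql_lock_wait_ms scores far higher, so it surfaces first as a beacon candidate.
```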
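
To illustrate the suspicious-execution-pattern bullet above (the real mining technique is in [6]): a toy contrast of failed versus succeeded log sequences, reusing the cookie example, could look like this.

```python
def suspicious_events(failed_traces, succeeded_traces):
    """Return log events that occur in failed request traces but never in
    succeeded ones (a crude stand-in for suspicious-execution-pattern mining)."""
    failed_events = set().union(*failed_traces)
    ok_events = set().union(*succeeded_traces)
    return failed_events - ok_events

succeeded = [["task start", "user login", "cookie validation success",
              "access resource R", "do the job", "logout"]]
failed = [["task start", "user login", "cookie not found",
           "security token rebuild", "access resource R error"]]

print(suspicious_events(failed, succeeded))
# prints the three events unique to the failed path: cookie not found,
# security token rebuild, access resource R error
```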
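
To illustrate the recurrent-incident bullet above: one simple baseline (again, not the technique described in [6]) is bag-of-words cosine similarity between the new incident report and past ones, proposing the best match's past solution as a candidate restoration action. The report texts below are invented.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_similar_incident(new_report, past_incidents):
    """Return the (report, solution) pair whose report text is closest to the new report."""
    new_vec = Counter(new_report.lower().split())
    return max(past_incidents,
               key=lambda item: cosine(new_vec, Counter(item[0].lower().split())))

past_incidents = [
    ("sql query timeout on orders table with long lock waits", "kill blocking query and restart worker"),
    ("certificate expired on front-end load balancer", "rotate certificate"),
]
report, solution = most_similar_incident("orders table queries timing out with long lock waits",
                                         past_incidents)
print(f"similar past incident: {report!r}; suggested workaround: {solution!r}")
```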

##Successful Story
We have successfully applied SAS to Microsoft Service X (a geographically distributed, web-based service serving hundreds of millions of users). Similar to other online services, Service X is expected to provide high-quality service on 24x7 basis. During a certain period of time, the Service X team was facing great challenges in improving the effectiveness and efficiency of their incident management in order to provide high-quality service. SAS was first deployed to the datacenters of Service X in June 2011. The OCEs of Service X have been using SAS for incident management since then. The actual usage experience shows that SAS helps the OCEs improve the effectiveness and efficiency of incident management. According to the usage data from a 6-month empirical study, about 91% of OCEs used SAS to accomplish their incident management tasks and SAS was used to diagnose about 86% of service incidents. Now SAS has been successfully deployed to many Microsoft product datacenters and widely used by on-call engineers for incident management.
##Success Story
We have successfully applied SAS to Microsoft Service X (a geographically distributed, web-based service serving hundreds of millions of users). Like other online services, Service X is expected to provide high-quality service at all times. During a certain period, the Service X team faced great challenges in improving the effectiveness and efficiency of its incident management in order to provide high-quality service. SAS was first deployed to the datacenters of Service X in June 2011, and the OCEs of Service X have been using it for incident management since then. Actual usage experience shows that SAS helps the OCEs improve the effectiveness and efficiency of incident management. According to usage data from a 6-month empirical study, about 91% of OCEs used SAS to accomplish their incident management tasks, and SAS was used to diagnose about 86% of service incidents. SAS has now been successfully deployed to many Microsoft product datacenters and is widely used by on-call engineers for incident management.

##References
[1] D. A. Patterson. “A simple way to estimate the cost of downtime.” In Proc. of LISA ’02, pp. 185-188, 2002.