ApacheCon NA 2013

Portland, Oregon

February 26th – 28th, 2013


Tuesday 1:45 p.m.–2:45 p.m.

Advanced CloudStack Troubleshooting using Log Analysis

Kirk Kosinski

Track:
Cloud Crowd
Audience level:
Intermediate

Description

Apache CloudStack produces a large volume of log entries, and they are distributed throughout the environment. For a CloudStack administrator, investigating errors in the logs is an inevitable task. This may initially seem difficult, but it is manageable with the right approach. Using examples, this talk will detail techniques that enable effective analysis of the logs generated by CloudStack.

Abstract

CloudStack and the components of the environment it manages generate many log entries in many places. In order to troubleshoot virtually any problem, the administrator must first know which logs to check. The locations include:

CloudStack management server - CloudStack runs on Tomcat and logs to a plethora of files stored on the management server. Some logs are more useful than others, and which ones to check depends on the issue.

Hypervisor hosts - CloudStack can manage hypervisors from several vendors, and each hypervisor has its own logging mechanism. Many issues will require investigating the logs on the hosts.

Storage and network devices - Certain devices may have log entries of use to a CloudStack administrator when troubleshooting a problem. For example, a firewall might be blocking a certain port, or a NAS or SAN might block access to a host.

Once an administrator has determined the log to check, or at least to start with, he or she must know what to look for, such as:

Warnings and errors and exceptions, oh my! - In certain logs, the entries to be concerned about usually include errors, warnings, and exceptions. However, a simple “grep” for those keywords will generate countless false positives and will also strip crucial information, such as the surrounding lines of a multi-line entry. Knowing specifically what to look for can drastically reduce the time required to resolve the issue.

Jobs, sequences, and more - Many tasks do not fail immediately; they can instead fail over seconds, minutes, or even hours. Following a failed task from beginning to end requires knowledge of how tasks progress, for example through jobs and sequences.
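
The two points above can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the talk description: that each log entry starts with a timestamp of the form "YYYY-MM-DD HH:MM:SS" followed by a severity level, and that a job is tagged with a token such as "job-17". The sample lines, component names, and function names are invented for illustration and are not real CloudStack output. The first function keeps a warning or error together with its continuation lines (such as a stack trace) instead of stripping them as a line-by-line grep would; the second collects every line that mentions one job so its progress can be read from start to finish.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, Optional

# ASSUMPTION: every entry starts with "YYYY-MM-DD HH:MM:SS[,mmm] LEVEL ...".
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:,\d{3})?\s+(\w+)")
PROBLEM_LEVELS = {"WARN", "ERROR", "FATAL"}


def problem_entries(lines: Iterable[str]) -> List[List[str]]:
    """Return WARN/ERROR/FATAL entries, each with its continuation lines."""
    entries: List[List[str]] = []
    current: Optional[List[str]] = None  # entry being collected, if any
    for raw in lines:
        line = raw.rstrip("\n")
        start = ENTRY_START.match(line)
        if start:
            # A new entry begins; keep it only if its level field is a problem level.
            if start.group(1) in PROBLEM_LEVELS:
                current = [line]
                entries.append(current)
            else:
                current = None
        elif current is not None:
            # No timestamp: a continuation (e.g. stack trace) of the kept entry.
            current.append(line)
    return entries


def trace_job(lines: Iterable[str], job_id: int) -> List[str]:
    """Return, in order, the lines that carry the hypothetical 'job-<id>' token."""
    # \b on both sides keeps job-17 from also matching job-170.
    pattern = re.compile(r"\bjob-%d\b" % job_id)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def elapsed(trace: List[str]) -> Optional[timedelta]:
    """Time between the first and last traced line, ignoring milliseconds."""
    if not trace:
        return None
    fmt = "%Y-%m-%d %H:%M:%S"
    first = datetime.strptime(trace[0][:19], fmt)
    last = datetime.strptime(trace[-1][:19], fmt)
    return last - first


# Invented sample lines in the assumed layout (not real CloudStack output).
SAMPLE = [
    "2013-02-26 13:45:00,001 DEBUG [example.JobManager] (worker-1:job-17) submitted",
    "2013-02-26 13:45:00,500 DEBUG [example.JobManager] (worker-2:job-170) submitted",
    "2013-02-26 13:47:30,002 WARN [example.Allocator] (worker-1:job-17) retrying",
    "2013-02-26 13:50:00,003 ERROR [example.JobManager] (worker-1:job-17) failed",
    "java.lang.RuntimeException: example failure",
    "\tat example.JobManager.run(JobManager.java:10)",
    "2013-02-26 13:50:01,004 INFO [example.JobManager] (worker-2:job-170) completed",
]

if __name__ == "__main__":
    for entry in problem_entries(SAMPLE):
        print("\n".join(entry))
        print("---")
    job_lines = trace_job(SAMPLE, 17)
    print(len(job_lines), "lines for job-17 over", elapsed(job_lines))
```

Matching the severity only in the level field, rather than anywhere in the line, is one way to reduce false positives from messages that merely contain the word "error". Note that trace_job returns only lines carrying the token itself; continuation lines would need to be gathered with the grouping approach of problem_entries.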

When an administrator finds what he or she is looking for, the finding must be translated into actions that will resolve the issue. These actions depend on the error determined to be the root cause, and examples include:

Capacity - The cloud managed by CloudStack contains a variety of finite resources, so failures due to lack of capacity can occur.

Network - Many errors will be network-related, and it is important to understand exactly what each one means. For example, understanding the difference between “no route to host” and “connection refused” can be quite useful.

Just wait longer - Sometimes a failure is not actually a failure and simply requires more patience. If waiting is not an option, the issue can be classified as performance-related and then investigated and resolved accordingly.
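
To make the distinction between “no route to host” and “connection refused” from the Network item concrete, here is a small Python sketch. It relies on the standard errno values ECONNREFUSED and EHOSTUNREACH; the function names and the wording of the hints are illustrative assumptions, and the hints describe typical causes rather than guaranteed ones.

```python
import errno
import socket


def classify_connect_error(exc: OSError) -> str:
    """Map a failed TCP connect to a short, hedged troubleshooting hint."""
    if isinstance(exc, socket.timeout):
        # No answer at all within the timeout; often a silent packet drop.
        return "timed out: no reply; packets may be dropped along the path"
    if exc.errno == errno.ECONNREFUSED:
        # The destination answered, but nothing accepted the connection.
        return "connection refused: host reachable; check the service and port"
    if exc.errno == errno.EHOSTUNREACH:
        # The destination could not be reached at the network level.
        return "no route to host: check routing, addressing, and firewalls"
    return "other: " + str(exc)


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except OSError as exc:
        return classify_connect_error(exc)


if __name__ == "__main__":
    # Hypothetical target; replace with a host and port relevant to the issue.
    print(probe("127.0.0.1", 8080))
```

In short, “connection refused” usually points at the service on a reachable host, while “no route to host” usually points at the network path to that host, so the two errors lead to different places to look.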