Any data center with a virtualized environment has a real need for effective capacity management. This white paper discusses the reasons why capacity management is critical to achieving the benefits of server virtualization and outlines the three key requirements to consider when evaluating capacity management systems.
Why Capacity Management in Virtualized Environments
A major advantage of virtualized environments is their ability to improve resource utilization by running multiple virtual machines (VMs) on the physical servers in a shared infrastructure. With such an architecture, utilization can increase from as low as 10% for dedicated servers to 60% or more for virtualized servers. The enhanced resource efficiencies make it possible to more fully utilize ever-increasing server power and provide significant savings in capital expenditures, power consumption, rack space and cooling.
This concept of greater efficiencies through resource-sharing is not new. Mainframe systems have long employed time-slicing to enable multiple applications to run concurrently. With mainframe systems, the dedicated and quite sophisticated “capacity planning” is performed by the operating system, which ensures that no application can cause any others to suffer from resource contention issues. The high cost of mainframes created a strong incentive for IT departments to maximize mainframe resource utilization by running as many concurrent applications as physically possible.
Today’s server virtualization solutions operate in a similar manner. Hypervisors enable multiple virtual machines to run on the same physical x86 server, all sharing the common CPU, memory, storage and networking resources. Through the magic of the hypervisor, each application operates as if running alone on a dedicated server. As with the mainframe, however, each virtual machine is actually sharing resources with other virtual machines. And as with the mainframe, the multiple applications sometimes contend for shared resources, causing performance to degrade, especially during peak periods.
The goal with both mainframes and virtualized servers is the same: optimize resource utilization without degrading performance to maximize cost-saving efficiencies. Organizations undertaking server consolidation projects invariably experience such savings, at least initially. Where their data centers had been filled with row after row of underutilized servers, each running a single application, the post-consolidation data center may seem almost deserted given the reduction in the number of racks required. A very successful consolidation effort, for example, might be able to run as many as 10 or 20 different applications on each server, thereby requiring only one-tenth to one-twentieth the number of servers. Fewer servers consuming less space and power and requiring less cooling lead to significant savings.
This dream of dramatic savings through consolidation and virtualization has the potential to become a real performance nightmare, however, without good capacity planning and management. The key to successful capacity management, therefore, is to ensure satisfactory application performance (prevent the nightmare) while maximizing efficiencies and savings (preserve the dream).
At a high level, managing virtualized server capacity is not that much different from managing mainframes, which also have shared CPU, memory and storage resources. But looking deeper at the details reveals some dramatic differences that might make the mainframe’s systems engineer (read: capacity manager) feel completely unqualified to deal with the complexity inherent in open systems capacity management in virtualized environments. The table below summarizes some of these differences.
| | Mainframe | Open Systems |
| --- | --- | --- |
| Number of physical systems | A few, very large systems | Tens, hundreds or thousands of distributed servers |
| Variability of systems | High degree of similarity in all systems | Unless a complete hardware refresh was involved, a wide variety of CPU, memory and storage configurations and capacities |
| Number of applications | A few core, mission-critical applications | Hundreds to thousands from mission-critical to mundane file and print systems |
| Growth | Relatively slow | Cheap compute power combined with virtualization is unleashing an explosion of virtual machines |
| Deployment process | Planned and controlled | VMs spin up, shut down and are added and deleted constantly |
| Cloud | The original private cloud, but highly restricted and controlled | Private cloud enables end users to self-service |
| Purchasing model | Major capital expenditure | Relentless number of smaller server, storage and networking system purchases |
| Control and management | Highly centralized | Highly distributed |
| Monitoring | Highly monitored with an integrated console | Multiple monitoring systems for each of the multiple layers |
| Application mobility | Limited | Highly mobile applications that regularly move among servers, data stores and data centers, as well as among development, production and disaster recovery (DR) environments |
Clearly, for an organization to benefit the most from its virtualized infrastructure, robust capacity management must be an integral component of that infrastructure. Leading analyst firms Gartner, Forrester and others all concur on this need. In Jean-Pierre Garbani’s report titled I&O’s New Capacity Planning Organization, for example, the Forrester analyst states emphatically: “Capacity management and planning are the keys to virtualization.”
Distributed Resource Schedulers Are Not Capacity Managers
An obvious question to ask here is: “Don’t applications like distributed resource schedulers solve the capacity management problem?” And the answer is an emphatic no. Distributed resource scheduler (DRS) applications are intended to balance the load virtual machines place on a hardware cluster. So just as a hypervisor provisions and balances the resources a virtual machine is able to consume on a single host, distributed resource schedulers perform the same provisioning and balancing across a cluster of hosts.
Balancing load across resources in an environment is important. But a balanced environment can still lack sufficient capacity or have too much capacity. In addition, virtual machines in a balanced environment can still suffer performance problems caused by noisy neighbors or by limited resource availability on the hosts they are running on. So while operating distributed resource schedulers is good practice, system administrators need more management capabilities at the host, cluster and data center levels to fully and effectively plan and manage capacity.
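The difference between balance and capacity can be illustrated with a minimal Python sketch. It assumes equally sized hosts and per-host utilization expressed as a fraction between 0 and 1. The function name and all three thresholds are hypothetical values chosen for illustration, not figures from any scheduler product.

```python
from statistics import pstdev

def cluster_summary(host_util, balance_tolerance=0.05,
                    min_headroom=0.25, waste_threshold=0.20):
    """Summarize balance and headroom for a cluster of equally sized hosts.

    host_util holds one utilization fraction (0..1) per host.
    All thresholds are illustrative assumptions.
    """
    mean_util = sum(host_util) / len(host_util)
    spread = pstdev(host_util)  # a small spread means the load is evenly balanced
    return {
        "mean_util": mean_util,
        "balanced": spread <= balance_tolerance,
        "short_on_capacity": (1.0 - mean_util) < min_headroom,
        "over_provisioned": mean_util < waste_threshold,
    }

# Both clusters are perfectly "balanced", yet neither has the right capacity.
hot = cluster_summary([0.90, 0.91, 0.89, 0.90])
idle = cluster_summary([0.10, 0.11, 0.09, 0.10])
print(hot)
print(idle)
```

A scheduler would consider both clusters healthy because the load is evenly spread. Only the capacity view shows that the first needs more hardware and the second is wasting it.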
Capacity Management Challenges Will Only Increase
While daunting already for some today, the capacity management challenges faced by most IT organizations are certain to increase. The following trends are driving the need for more sophisticated capacity management solutions:
· Environment scale – Relentless growth in applications will cause virtualized environments to become increasingly larger and denser, making capacity management more complex.
· Mission-critical applications – The growing number of critical applications will all require enhanced performance monitoring.
· Multi-hypervisor deployments – The use of multiple hypervisors will require an agnostic approach to capacity and performance management.
· Cost optimization – With the “low-hanging fruit” savings from initial server consolidation projects now in the past for most organizations, future savings will need to come from cost optimization initiatives. And while chargeback is not as effective as initially thought at curbing waste, CFOs will continue to demand annual improvements.
· VM mobility – Mobility among private, public and hybrid clouds, and even among development, production and DR environments, will add complexity to the decision-making process for determining the optimal allocation of VM capacity and workloads.
All of these trends combine to make maintaining the control over and the predictability of virtualized capacity increasingly challenging.
The Cost of Delay
Many organizations do not yet have a purpose-built capacity management system for their virtual environments, relying instead on other tools to somehow perform this essential function. Without dedicated and sophisticated capacity management, however, one of two scenarios inevitably unfolds: either the environment is so over-provisioned that there are no performance issues (and no one has yet caught on to the tremendous waste!); or administrators are using spreadsheets and other manual procedures in a daily struggle to maintain service levels by constantly reallocating an increasingly complex array of resources (often by trial-and-error!).
Delaying the inevitable need to implement fully-effective capacity management has real costs to an organization that often manifest as:
· Application performance problems as VMs contend with each other for resources
· Hours spent firefighting either perceived or actual problems throughout the virtual environment
· Loss of confidence in virtual infrastructure performance
· Wasteful resource allocations that undermine the cost-saving advantages of virtualization
· Over-purchasing of server hardware, memory or storage on a routine basis
· Hours of staff time spent maintaining spreadsheets for management reporting (time spent not being able to work on more productive projects)
· Incorrect sizing of new servers during a hardware refresh by paying a premium for:
o Expensive scale-up systems when scale-out systems are more efficient
o Excessive support for scale-out systems where scale-up systems are more appropriate
o Purchasing the latest CPUs for maximum clock speed when slower, earlier generation (and far less expensive) CPUs will suffice
o Purchasing the latest, highest-density memory when far more economical lower density memory is sufficient for the actual VM load
But perhaps the greatest cost of delay is the failure to begin aligning IT services with costs. Public clouds now give internal IT consumers an alternative place to shop for services. The prices they find there create the perception that public cloud services are “cheaper”, and this belief is difficult to counter when IT has yet to develop a workable cost model for its own services.
Even ignoring the public cloud “competition”, most IT executives are currently focused on maximizing resource utilization to drive down capital and operational expenditures. Virtualization provides the ability to begin aligning IT costs to the services provided. But it is critical to begin this journey with a full understanding of the linkages among capacity, performance and cost. And this is perhaps the biggest reason not to delay implementing a genuine and capable capacity management system. Senior IT management is focusing on the problem throughout IT. Solving the problem sooner rather than later in virtualized infrastructure just makes sense.
Requirements for Capacity Management Solutions
What are the requirements for capacity management in a virtualized environment? At a high level, a capable capacity management solution must:
· Offer enterprise-wide visibility into performance, capacity, cost and resource efficiency of the entire virtualized infrastructure
· Provide actionable intelligence from this information
· Be simple to deploy, operate and maintain
Performance, capacity, cost and resource efficiency are all intertwined in a virtual environment. Without sufficient capacity, performance suffers. With too much capacity, infrastructure costs soar. Even with the right amount of capacity, efficiency can still suffer if virtual machines are consuming more expensive resources than are required.
Therefore, performance, capacity, cost and resource efficiency must be viewed across the enterprise in a holistic fashion to provide visibility for the administrator, as well as to provide information that is both sufficient and accurate enough to facilitate fully-informed decision making. Such visibility requires roll-ups of information across:
· Data centers
· Different types of hypervisors
· Different resource pools, such as CPU, memory, storage and networking
Simply being able to roll up information is not enough, however. As environments scale, functionality must be added to view all of this information in a meaningful way, as outlined in the table below.
| Capability | Why it matters |
| --- | --- |
| Automatic grouping, viewing and analysis of data based on end user, business unit, application, data center location or any number of other pertinent attributes | Data must be presented in manageable chunks for virtualization administrators to be able to process. Chunking this information automatically is important for keeping data groups in a logical sequence. |
| Variable analysis periods to compensate for different time zones as data centers scale around the globe | Without analysis periods that can be customized down to the virtual machine level for different locations, the roll-up of data across an enterprise can be rendered meaningless. |
| Resource exclusions to compensate for some environments where abnormally high or low resource utilization could be normal or expected | Resource exclusions allow QA, test and development environments to be treated differently in capacity analysis calculations. It may not matter, for example, that a QA environment has high CPU utilization. |
| Variable sizing of virtual machines used in slot capacity calculations | As an environment scales, so too does the diversity of different virtual machine types and sizes. For sizing calculations, the ability to use a variety of VM sizes for modeling is needed for an accurate assessment of available slots. |
| Advanced visualization technologies enabling hundreds or thousands of virtual machines to be displayed on a single pane of glass | Larger environments may have thousands of virtual machines. The ability to display information on all machines on a single pane of glass is important to being able to easily monitor these environments. |
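Two of these capabilities, automatic grouping and resource exclusions, can be sketched in a few lines of Python. The record fields (`business_unit`, `env`, `cpu_util`) and the default exclusion list are hypothetical. A real system would draw them from its own inventory and metrics store.

```python
from collections import defaultdict

def roll_up(vms, group_by, exclude_envs=("qa", "test", "dev")):
    """Average CPU utilization per group, skipping excluded environments.

    vms is a list of dicts. The field names are illustrative assumptions.
    """
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of utilization, VM count]
    for vm in vms:
        if vm["env"] in exclude_envs:
            continue  # resource exclusion: QA/test/dev do not skew the roll-up
        bucket = totals[vm[group_by]]
        bucket[0] += vm["cpu_util"]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

inventory = [
    {"name": "fin-app1", "business_unit": "finance", "env": "prod", "cpu_util": 0.25},
    {"name": "fin-app2", "business_unit": "finance", "env": "prod", "cpu_util": 0.75},
    {"name": "fin-qa1", "business_unit": "finance", "env": "qa", "cpu_util": 0.875},
    {"name": "sales-web", "business_unit": "sales", "env": "prod", "cpu_util": 0.5},
]
print(roll_up(inventory, "business_unit"))
```

Passing a different `group_by` key (application, data center and so on) produces a different view of the same data without manual regrouping.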
Assuming data can be analyzed and rolled up correctly, what types of analyses should be performed on the four different areas of performance, capacity, cost and resource efficiency? Here is an assessment of specific considerations for all four.
Monitoring and analyzing performance effectively involves:
| Requirement | Why it matters |
| --- | --- |
| Prioritizing real-time alarms and performance problems detected within both VMs and the hypervisors that run them and identifying the underlying root cause(s) and potential remediation(s) | Virtual environments generate a significant number of alerts and alarms. Distinguishing the ones that matter from those that are echoes of other issues is essential to resolving performance issues quickly. |
| Identifying slowly-building performance bottlenecks that may not trigger active alarms but are nonetheless representative of underlying issues | Not all performance problems cause immediate alarms. Because slowly-building problems can go undetected, bottleneck analysis is required to uncover these problems. |
| Identifying and alerting based on abnormal trends in capacity utilization or metrics that signal resource contention | Trend analysis uncovers abnormal issues that could signal resource contention and must therefore be performed on an on-going basis. |
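One simple way to surface a slowly-building bottleneck is to fit a trend line to recent utilization samples and flag a steady climb before any alarm threshold is crossed. The Python sketch below uses an ordinary least-squares slope. The function names, the 90% alarm threshold and the minimum slope are illustrative assumptions, not values prescribed by any product.

```python
def utilization_slope(samples):
    """Least-squares slope of utilization per sample interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

def is_slow_building(samples, alarm_threshold=0.90, min_slope=0.005):
    """True when no sample has reached the alarm threshold yet the trend is steadily upward.

    Both thresholds are hypothetical and would need tuning per metric.
    """
    return max(samples) < alarm_threshold and utilization_slope(samples) >= min_slope

climbing = [0.50 + 0.01 * i for i in range(20)]  # rises one point per sample, no alarm yet
steady = [0.60] * 20
print(is_slow_building(climbing), is_slow_building(steady))
```

A production system would likely use longer windows and account for daily or weekly seasonality, but the principle is the same: the trend is the signal, not the latest value.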
Monitoring and analyzing capacity effectively requires:
| Requirement | Why it matters |
| --- | --- |
| Determining available virtual machine slot spaces remaining based on realistic models of VM requirements and their impact on available capacity | Slot calculations help virtualization administrators determine future capacity needs, as well as the best place to provision a virtual machine. |
| Accurately modeling growth for an environment and factoring in capacity needs for ongoing or future projects | Capacity needs increase even with steady state operations for most virtual machines. As virtual machines operate, they may consume more and more storage over time. In addition, if application loads increase, CPU and memory consumption also increase. Capacity management needs to model and predict these consumption trends. |
| Predicting future capacity bottlenecks that will lead to performance issues | Bottlenecks occur when insufficient capacity is available to support a VM’s load. Because the total resources available in an environment are known and the consumption of those resources by the virtual machines can be predicted, it is possible to model when a future resource contention issue is likely to occur. Modeling this information allows virtualization administrators either to take action to avoid the bottleneck or to procure additional hardware if necessary. |
| Accounting for VM deployments that are planned, but not yet implemented | Virtualization administrators plan months in advance for VM deployments. Planning for required capacity, then removing this capacity from inventory reports, makes resource planning more effective. |
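A slot calculation of this kind can be sketched as follows. The sketch assumes free capacity is pooled across the cluster (it ignores fragmentation across individual hosts), and the resource names, VM sizes and function name are hypothetical. Planned deployments are subtracted first so that capacity already promised is not counted twice.

```python
def available_slots(free_capacity, vm_size, planned=()):
    """Number of VMs of vm_size that fit once planned deployments are reserved.

    free_capacity and vm_size map resource name -> amount.
    Every amount in vm_size must be positive.
    Simplifying assumption: capacity is treated as one pool.
    """
    remaining = dict(free_capacity)
    for reservation in planned:
        for resource, amount in reservation.items():
            remaining[resource] -= amount
    # The scarcest resource determines how many VMs fit.
    slots = min(remaining[resource] // need for resource, need in vm_size.items())
    return max(0, int(slots))

free = {"cpu_ghz": 40, "mem_gb": 256, "disk_gb": 2000}
small = {"cpu_ghz": 2, "mem_gb": 8, "disk_gb": 50}
large = {"cpu_ghz": 8, "mem_gb": 64, "disk_gb": 200}

print(available_slots(free, small))                   # CPU is the limiting resource
print(available_slots(free, large))                   # memory is the limiting resource
print(available_slots(free, small, planned=[large]))  # after reserving one planned large VM
```

Running the calculation with several VM sizes shows why a single "average VM" model misleads: the limiting resource changes with the size being modeled.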
Understanding the virtual environment cost structure is critical as enterprises move toward the cloud. Because the cloud enables self-service portals, end users can quickly drive up operating costs in the absence of a thorough understanding of the underlying costs. Indeed, the sheer ease with which virtual environments enable the deployment of virtual machines has led to virtual machine sprawl. Understanding the cost component of a virtual environment is, therefore, essential to good capacity planning. Support for cost visibility requires:
· Chargeback (or at least “showback”) capabilities by customer for either allocated or utilized resources
· Robust reporting and potential integration with financial management systems
Chargeback/showback may encounter some significant organizational and computational limitations, however. These include:
· Financial systems that lack the ability to integrate chargeback information
· Generally accepted accounting principles that make chargeback difficult
· Budgeting cycles that are based on assumptions of fixed costs, not variable consumption models. IT customers faced with a consumption-based chargeback model must then confront the challenge of estimating uncertain computational demand.
· IT charging back for services may not be “politically palatable” for an organization
· Determining the chargeback amounts is also a non-trivial exercise if the intention is to get an accurate model of pricing
· Finally, chargeback is a measure of the price IT is charging for services and is not necessarily a measure of the actual cost to deliver those services. At a high level, the actual cost to deliver a service is the total cost to own and operate all IT infrastructure divided by the number of virtual machines on that infrastructure. IT needs to focus on its actual costs, not the costs charged. This makes chargeback, without cost awareness, less beneficial as a management tool.
The barriers to chargeback are many. Nevertheless, this should not prevent IT from being on the path to understand and manage its cost structure. A key element to this is to implement a cost index that reflects the cost to IT to deploy a VM. Cost indices are a fairly new and advanced tool for IT. Using a cost index, the systems administrators can identify their most expensive virtual machines based on resource consumption, cost of the underlying hardware and density of deployment relative to other virtual machines. By identifying the most expensive virtual machines, actions can be taken to reduce costs or at least understand the impact on overall efficiency. Combining cost indices with cost visibility provides a solid foundation to lower IT costs over time and to understand the main cost drivers throughout the IT infrastructure.
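The source does not define a formula for the cost index, so the following Python sketch is only one plausible construction under stated assumptions. Each host's monthly cost is split among its VMs in proportion to the resources they consume, and each VM's cost is then divided by the average cost per VM (total infrastructure cost divided by the number of VMs). The field names, cost figures and function name are all hypothetical.

```python
def cost_index(vms, host_monthly_cost):
    """Relative cost of each VM, where 1.0 equals the average cost per VM.

    vms: list of dicts with 'name', 'host' and 'share' (fraction of host resources consumed).
    host_monthly_cost: dict of host name -> cost to own and operate that host.
    This allocation rule is an illustrative assumption, not a standard.
    """
    share_by_host = {}
    for vm in vms:
        share_by_host[vm["host"]] = share_by_host.get(vm["host"], 0.0) + vm["share"]
    # Split each host's full cost among its VMs, so sparsely populated hosts
    # (low deployment density) yield more expensive VMs.
    cost = {
        vm["name"]: host_monthly_cost[vm["host"]] * vm["share"] / share_by_host[vm["host"]]
        for vm in vms
    }
    mean_cost = sum(cost.values()) / len(cost)  # total cost / number of VMs
    return {name: c / mean_cost for name, c in cost.items()}

fleet = [
    {"name": "web1", "host": "h1", "share": 0.25},
    {"name": "web2", "host": "h1", "share": 0.25},
    {"name": "web3", "host": "h1", "share": 0.25},
    {"name": "web4", "host": "h1", "share": 0.25},
    {"name": "legacy", "host": "h2", "share": 0.5},
]
index = cost_index(fleet, {"h1": 1000.0, "h2": 1000.0})
print(sorted(index.items(), key=lambda item: item[1], reverse=True))
```

In this example the lone VM on the second host ranks as the most expensive even though its own consumption is modest. The cost is driven by low density on its host rather than by the VM itself.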
Understanding efficiency in virtualized environments is critical because wasted or under-utilized resources are what drive up capital and operational expenditures. More importantly, since one of the original goals for virtualization was server consolidation and efficiency improvements, poor efficiency of the virtualized environment undermines this fundamental and worthy goal. Of course, IT can perform chargeback or showback, yet still have tremendous inefficiencies throughout the environment. Chargeback can, however, be a tool to help reveal such inefficiencies.
How is virtualized resource efficiency monitored and analyzed?
· The cost index, introduced above, is a way for IT teams to understand the relative costs of operating a virtual machine. While one VM could be expensive to operate relative to others, it could be operating efficiently with the underlying system being the actual culprit driving up the costs. A capacity management system must, therefore, be able to rank the indexed virtual machine costs accordingly.
· Over-allocating VM resources is a major source of inefficiency. Over-allocation occurs when applications have more CPU, memory or storage than needed to perform adequately. The capacity management system must be able to identify over-allocations continuously, preferably by monitoring for peak and average values of resource utilization across CPU, memory and storage. As some hypervisor vendors shift to consumption-based models for licensing, removing over-allocated memory will become an increasingly important aspect of cost control.
· Wasted resources accumulate in virtualized environments as a by-product of normal operations. These wasted resources include zombie VMs, abandoned VMs, unused templates and unused snapshots. Capacity management systems must effectively distinguish these wasted resources from similar resources that are actually in production use.
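The over-allocation check described above, which monitors both peak and average utilization, can be sketched in Python as follows. The record layout, the function name and the two thresholds (peak below 50% and average below 25% of the allocation) are illustrative assumptions. Appropriate values depend on the workload and the organization's risk tolerance.

```python
def find_over_allocated(vms, peak_limit=0.50, avg_limit=0.25):
    """Return (vm name, resource) pairs whose peak AND average use sit well below allocation.

    Each VM dict holds 'allocated' (resource -> amount) and 'used'
    (resource -> list of samples in the same unit). Thresholds are hypothetical.
    """
    flagged = []
    for vm in vms:
        for resource, allocated in vm["allocated"].items():
            samples = vm["used"][resource]
            peak = max(samples) / allocated
            average = sum(samples) / len(samples) / allocated
            # Requiring a low peak protects VMs that are idle on average but busy at month end.
            if peak < peak_limit and average < avg_limit:
                flagged.append((vm["name"], resource))
    return flagged

observed = [
    {
        "name": "db01",
        "allocated": {"vcpu": 4, "mem_gb": 64},
        "used": {"vcpu": [3.0, 3.5, 2.0, 3.9], "mem_gb": [8, 10, 12, 16]},
    },
]
print(find_over_allocated(observed))
```

Here the VM uses nearly all of its CPU allocation but never more than a quarter of its memory, so only the memory allocation is flagged as a candidate for downsizing.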
While enterprise-wide visibility is the first major requirement for a capacity management solution, generating actionable intelligence from the data collected is just as important. Capacity management is essentially an analysis problem. Correctly performing capacity and performance management requires the analysis of about 20 different metrics per virtual machine at the VM, host, cluster and data center levels, sampled at least every five minutes. For a simple 100 virtual machine environment, for example, this requires the analysis of 100 VMs x 20 metrics x 12 samples/hour x 24 hours x 4 levels of analysis, yielding about 2.3 million data points per day. Given the sheer volume of data, it is not difficult to see why manual processes simply fail to scale. The better capacity management solutions are able to perform this multi-variable analysis on a massive scale.
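The arithmetic above is easy to reproduce and to rerun for other environment sizes. This short Python sketch uses the same figures as the text (20 metrics, five-minute samples, four levels of analysis) as defaults. The function name is hypothetical.

```python
def daily_data_points(vms, metrics_per_vm=20, interval_minutes=5, levels=4):
    """Data points to analyze per day, using the figures quoted in the text as defaults."""
    samples_per_day = (60 // interval_minutes) * 24  # 12 samples/hour x 24 hours
    return vms * metrics_per_vm * samples_per_day * levels

print(daily_data_points(100))    # 2304000
print(daily_data_points(5000))   # 115200000
```

The volume grows linearly with the number of virtual machines, so an environment of several thousand VMs produces more than a hundred million data points per day.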
The visibility requirement of capacity management solutions demands a significant amount of computational horsepower simply to make sense of the wealth of data. The need for creating actionable intelligence requires even more computations to enable system administrators to move beyond basic visibility into various problems and efficiency issues to being empowered to take action to address the underlying cause(s), either in a manual or automated fashion.
For performance issues, actionable intelligence involves:
· Root cause analysis of the problem, with specific recommendations for how to clear the problem that do not require any additional analysis
· Impact analysis to point out any related virtual objects that might be affected by an issue
· Automated actions to clear performance issues, such as moving a virtual machine to a different cluster, or working with native distributed resource schedulers to accomplish the task
· Automated resizing of a virtual machine within the limitations imposed by the operating system(s) or corporate policies, with or without a restart
For cost accounting and efficiency, actionable intelligence involves:
· Specific recommendations for ways to improve efficiency and lower costs
· Automated zombie destruction, template cleansing and abandoned VM clean up, especially for QA environments that potentially contain thousands of virtual machines
· Automated downsizing of virtual machines within the limitations imposed by the operating system(s)
· Automated reporting of cost and efficiency numbers for key stakeholders in a variety of formats
Visibility without actionable intelligence, while a step in the right direction, leaves the administrators, especially in larger environments, with a significant labor burden to maintain efficiency and performance of their virtualized infrastructures.
Starts to Work Out of the Box
Why does a capacity management system have to be hard to deploy? It doesn’t. Here are some reasons ease of deployment and use are important requirements:
· Lack of dedicated capacity planners – Most virtualization teams, even very large ones, do not have dedicated capacity planners. Any application to manage capacity and performance must, therefore, be useable by all team members without significant training.
· Lack of capacity management skills in the existing team – Capacity management is an analytics problem. It is certainly possible to develop this skill set on a virtualization team without a dedicated capacity planner. But there currently exists no formal certification authority similar to vExpert for managing capacity. Even with sufficient training, the scale of the analytics problem would still require a sizable investment in software (whether bought or built) to mine the raw metrics data generated by the hypervisors.
· Return on investment – Long deployment times and high training costs both diminish the return on the investment in a capacity management system. The better systems are able to recover their purchase price in just a few months.
· Pace of expansion – Virtualized environments are in a constant state of change, which normally involves both reconfiguring and expanding resources. If the capacity management solution cannot keep pace with the expansion, or requires constant configuration changes to do so, it will eventually need to be replaced.
· “Expert in the Box” – Similar to having a vExpert on staff to address technical issues, the capacity management solution needs to function like an expert itself from day one and it should not require an expert operator to get great results—ever.
The specific requirements for a capacity management system to work “out of the box” are:
· Little to no configuration or maintenance required for operation
· A minimal learning curve for basic operation with intuitive interfaces to facilitate usage by part-time capacity planners
· Automation to support repetitive tasks
· Automatic creation of user views to eliminate the need for manual customization
· Pre-configured, easily customizable reporting for different audiences
Capacity management for virtualized environments is an absolute necessity in any infrastructure of reasonable scale. Virtualization has reintroduced the mainframe model of computing, but with significantly more complexity for sharing CPU, memory, storage and networking resources. While implementing a capacity management solution could be postponed, most environments will incur some very real and potentially substantial costs by doing so. Robust capacity management systems (those that meet the requirements outlined here for enterprise-wide visibility, actionable intelligence and “out of the box” productivity) pay for themselves almost immediately, with the cost savings continuing to accumulate year after year. Such a system is perhaps the best investment an organization will ever make to get the most from its virtualized infrastructure.