This is part of the Cost Modeling Your Cloud series.
** Disclaimer: Math ahead. Actual costs and data points obfuscated for confidentiality. **
Industry media champions density and utilization as harbingers of efficiency in the datacenter. But what do they actually mean? How do you use them in your day-to-day operation? How much can you safely cram in a rack?
There are really only three elements, but the subtle nuances of each can either dramatically increase efficiency or expose your business to a high degree of risk.
- Design Envelope: How much usable power per rack
- Target Device Utilization: How much power will each server draw
- Density: How much cooling per square foot
Understanding and tuning these is the key to balancing efficiency against risk.
Rack Design Envelope
How much power can a rack provide? This depends on the power circuits available in your datacenter, and those circuits vary by region. A common US standard is the L6-30 drop, which typically provides 30A at 208V. The US specs also derate to 80% of the circuit rating, bringing 6,240 watts (208 × 30) down to 4,992. This is the ubiquitous "5kW" rack.
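As a rough sketch of that arithmetic (the function name and the 80% default are illustrative assumptions, not part of any particular tool), the usable power of a single circuit can be worked out like this:

```python
def usable_circuit_watts(volts: float, amps: float, derate: float = 0.80) -> float:
    """Usable power of one circuit after derating.

    `derate` is the fraction of the circuit rating you plan to use;
    0.80 mirrors the 80% figure described above.
    """
    if not 0 < derate <= 1:
        raise ValueError("derate must be in (0, 1]")
    return volts * amps * derate


if __name__ == "__main__":
    # L6-30 drop: 30A at 208V
    print(round(usable_circuit_watts(208, 30)))  # 4992
```

Swap in your own region's voltage and amperage to get the envelope for your circuits.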
ProTip: You don't always have to pay for your power up front. Some data centers offer "consumption" billing where you pay for the kWh you actually use. Those that don't sometimes offer lower committed rates (say, 3.5kW/rack) as you ramp your business, with an option to increase your billed rate when usage exceeds the committed rate. This can save you millions.
One common mistake is to assume if you have 2x 5kW power feeds and 2 PDUs to a rack, you have 10kW of power per rack. In a redundant configuration, each server is connected to both PDUs. However, the overall load should not exceed the max draw of a single feed. This means there is only 5kW of usable power. If one feed fails while there is 7-9kW of draw, the remaining PDU will end up failing because it cannot handle the load.
ProTip: Certain stateless services can tolerate entire rack failures without issue, and in those cases all 10kW could be used. But most stateless services tend to drop in a single high-amperage power feed at 10kW, 12kW, or 14kW rather than multiple lower-power feeds.
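To make the redundant-feed rule concrete, here is a minimal sketch assuming a simple two-feed (A/B) rack where every device is corded to both PDUs. The function names are hypothetical, and real PDUs and breakers have trip curves this ignores.

```python
def usable_rack_watts(feed_a_w: float, feed_b_w: float, redundant: bool = True) -> float:
    """Power you can plan for in a two-feed rack.

    Redundant: the load must fit on whichever single feed survives,
    so the smaller feed is the limit.
    Non-redundant: both feeds can be used, but losing either one
    takes the rack down (only acceptable if the service tolerates that).
    """
    if redundant:
        return min(feed_a_w, feed_b_w)
    return feed_a_w + feed_b_w


def survives_feed_loss(load_w: float, feed_a_w: float, feed_b_w: float) -> bool:
    """True if the remaining feed can carry the whole load by itself."""
    return load_w <= min(feed_a_w, feed_b_w)


if __name__ == "__main__":
    print(usable_rack_watts(5000, 5000))                   # 5000
    print(usable_rack_watts(5000, 5000, redundant=False))  # 10000
    print(survives_feed_loss(8000, 5000, 5000))            # False
```

The last line is the 7-9kW case from above: the rack runs fine until one feed drops, and then the survivor is overloaded.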
Target Device Utilization
Server resources are highly customizable, and how applications use those resources varies widely, which means accurate power estimates can be hard to come by.
First, you need to use a manufacturer power calculator to determine the "nameplate" rating of a particular combination of components. If you have been following best practices and standardizing your devices, you only have a few different combinations to consider. If every device is different, you are in a world of hurt and optimization will be somewhere between difficult and impossible.
Second, you must determine average utilization. This should be a combination of science, art, and risk management, and based on *your* applications. Nobody else can tell you how your applications run or what resources they need.
This can be done by looking across your active environment and documenting the 95th percentile resource utilization that devices experience and how much power they draw at that point. A server may have a nameplate rating of 475 watts at 100% utilization. But it may realistically run at 75%, drawing 393 watts of power. Do your best to look across multiple servers to get an accurate representation. A great way to cross-check your assumptions is to get current and max power draws on each power circuit from your datacenter provider. If a current rack has 10 servers that are each drawing around 393 watts, the facility should be showing roughly 4kW/rack. If these numbers don't match, you missed something.
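One way to sketch both steps, assuming you already have per-server power samples in watts and a per-circuit reading from your provider. The nearest-rank percentile method and the 10% tolerance are my assumptions for illustration; pick whatever your monitoring stack and risk appetite call for.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample that is >= pct% of the data."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def matches_facility_reading(server_count: int, watts_per_server: float,
                             facility_watts: float, tolerance: float = 0.10) -> bool:
    """Compare the bottom-up estimate against what the facility reports.

    `tolerance` is the allowed relative gap (an assumed 10% here).
    Switch draw is left out, as in the 10-server example above.
    """
    expected = server_count * watts_per_server
    return abs(expected - facility_watts) <= tolerance * facility_watts


if __name__ == "__main__":
    # 10 servers at ~393W each vs. a facility reading of ~4kW
    print(matches_facility_reading(10, 393, 4000))  # True
```

If the check comes back False, that is the "you missed something" signal: look for unaccounted devices or a bad utilization assumption.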
Most racks also have 1-3 networking switches in them, which will also have average draws. Refer to network device manufacturer documentation or calculators to determine these numbers.
ProTip: Customer maturity is a huge predictor of server load at ServiceNow. New customers don't stretch the system, while the largest, most experienced ones have considerably more impact. This dynamic means we have to look at racks with more mature customers to identify our target utilization. Looking at fresh racks with new customers would result in bad utilization estimates.
Once you know how much power you have to work with per rack and how much each device draws, simply divide. Assuming a 5kW rack, 2x 300W switches, and a server draw of 393 watts, the switches take 600W, leaving roughly 4,400W, and 4,400 / 393 rounds down to 11. So you can put 11 servers into this standard rack. Easy as pie!
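The division can be sketched as a small helper. The function and parameter names are illustrative; the inputs are the figures from the example above.

```python
def servers_per_rack(rack_watts: float, switch_count: int,
                     switch_watts: float, server_watts: float) -> int:
    """How many servers fit in the rack's power envelope after switches."""
    if server_watts <= 0:
        raise ValueError("server_watts must be positive")
    available = rack_watts - switch_count * switch_watts
    if available <= 0:
        return 0
    # Always round down: a partial server is not a server.
    return int(available // server_watts)


if __name__ == "__main__":
    print(servers_per_rack(5000, 2, 300, 393))  # 11
    print(servers_per_rack(4992, 2, 300, 393))  # 11 (using the exact derated figure)
```

Running it with the nameplate 475W instead of 393W shows what the utilization work buys you: the same rack drops to 9 servers.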
Density
Newer data centers can cool 14kW-20kW racks side by side in certain configurations, but older facilities might only be able to cool between 1kW and 2kW per rack position. Talk to your provider to ensure your dense racks can be adequately cooled. Occasionally high-density racks can be accommodated in facilities with low-density cooling so long as the racks are physically spaced away from each other. This doesn't generally affect your overall performance per watt optimization, but could increase your operational datacenter costs as additional square footage would be required. We avoid this like the plague and shut down all low-density sites.
One More Thing: Cascading Failure… A Worst Case Scenario
Someone designs a rack standard with redundant 5kW feeds and 9.5kW of equipment. Five racks are deployed, and the service runs well. The datacenter executes planned maintenance with the appropriate notification. One feed in a single rack drops, spiking draw on the second PDU and tripping it. The entire rack is now offline, but the worst is yet to come.
The capacity plan assumed servers would be running at 50% utilization. With 1 rack down out of 5, that rack's load is redistributed across the remaining 4 racks. That 9.5kW is divided by 4 (2.375kW each), and each remaining rack goes from 9.5kW to roughly 11.9kW. Even assuming the increased load is non-linear, there is a high likelihood that some or all of the remaining racks would fall offline after exceeding the draw cap on the circuits and tripping breakers.
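A back-of-the-envelope sketch of this scenario, assuming load redistributes evenly and linearly across surviving racks (a simplification, since real load may not scale linearly). The function names and the 10kW combined-feed cap are assumptions for illustration.

```python
def load_per_surviving_rack(rack_count: int, load_per_rack_kw: float,
                            failed_racks: int = 1) -> float:
    """Per-rack load once the failed racks' work is spread over the survivors."""
    survivors = rack_count - failed_racks
    if survivors <= 0:
        raise ValueError("no surviving racks to carry the load")
    return load_per_rack_kw * rack_count / survivors


def max_safe_load_kw(rack_count: int, cap_kw: float, tolerated_failures: int = 1) -> float:
    """Highest steady-state per-rack load that still fits under the cap
    after `tolerated_failures` racks go offline."""
    survivors = rack_count - tolerated_failures
    if survivors <= 0:
        raise ValueError("cannot tolerate losing every rack")
    return cap_kw * survivors / rack_count


if __name__ == "__main__":
    after = load_per_surviving_rack(5, 9.5)
    print(after)                    # 11.875
    print(after > 10.0)             # True: over an assumed 10kW cap (both 5kW feeds combined)
    print(max_safe_load_kw(5, 10))  # 8.0
```

Under these assumptions, a five-rack pod that must survive one rack failure on 10kW of combined feed should plan for no more than 8kW per rack, and less if each rack also needs to survive a single feed loss.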
While this scenario is unlikely, it *does* happen. Make sure you know what happens when a portion of your application falls offline and your remaining infrastructure has to pick up the slack.
Common Mistakes
People frequently…
- Assume 2x 5kW drops can support 10kW of equipment redundantly.
- Assume no other racks will increase utilization if one goes down.
- Try to apply power assumptions from one service or role to another. DNS may behave quite differently than a giant database.
- Assume your application runs at 100% when it really runs at 20%.
But if you walk through these basic steps and validate your assumptions at each stage, it is simple and straightforward to optimize what you have and identify targets for the next generation of optimization.