We have a curse jar in the office. Every time someone says "When I was at X, we Y", referencing prior employer X, 25 cents goes into the jar.
ServiceNow is neither search engine nor online retailer. The goals and constraints we have are unique in their combination, and this uniqueness requires us to make decisions with fresh eyes instead of simply replicating past experience. The curse jar requires us to pause and consider to what degree prior experience applies.
These are some of the differences my team accounts for when evaluating or designing the physical environment and hardware platform that runs ServiceNow, and each of them could be the subject of numerous posts.
Business vs Consumer
Businesses pay for a higher level of service than consumers generally receive. A Netflix outage simply forces the consumer to watch a movie via Vudu or Amazon. There aren't analogous "backup" services that our customers can effectively fall back on if we go down. Their professional lives are in our hands, and so we architect heavily for resilience and keep extensive backups. We have also made a number of changes for our 2014 server platform to reduce restore times.
Business Critical vs Productivity Tool
If Workday goes down, you can still use your insurance to see a doctor. If SalesForce is down, you can still sign a contract or close a deal with a customer. If ServiceNow is down, many of our customers' businesses stop functioning: parts stop flowing from suppliers and assembly lines halt, home loans aren't issued, and doctors can't request prescriptions or radiological scans. Check out how CERN relies on ServiceNow. As a result, many of our customers want or need 100% uptime. Serviceability and potential impact on SLAs become key evaluation criteria at each level of the technology stack.
Binary vs Qualitative
Complicating the desire for dial-tone service levels or 100% uptime is that ServiceNow usually appears either On or Off to the customer. Our availability doesn't step down from Super HD to Regular HD the way streaming video does. The application has relatively low bandwidth requirements and isn't very latency sensitive, so outages appear "complete" to a customer even if the customer beside them is operating perfectly. We overbuild certain aspects of our server platform (like IO) so that we degrade gracefully under stress (like MySQL working set swaps).
Customization vs Guard Rails
The ServiceNow product and platform are significantly more customizable and extensible than SalesForce, for example. We don't have the same guard rails, which means we have to engineer for broader use cases. Nothing prevents a customer from creating an inefficient query or report that monopolizes the system at the expense of other users, so we invest heavily in customer sizing and isolation to ensure this doesn't happen. (Allan and Tim discuss this in more detail here.) We intentionally invest more to gain better performance from a server, processor, DIMM, or storage device instead of trying to save a few pennies. This better prepares us to handle inadvertent customer mistakes, unpredicted load, and an intentional lack of guard rails.
High Revenue/Server vs Low Revenue/Server
These numbers can be difficult to quantify, as core data isn't often shared; some of the figures below are best-guess estimates based on public data.
eBay recognized roughly $38k in quarterly revenue per server (roughly $2B over 52.5k servers) in Q4 2013 (http://tech.ebay.com/dashboard). With an install base of 4.5k servers, ServiceNow was closer to $30k. This is roughly 2x Google's number and 3x Facebook's. Microsoft and Amazon are even more difficult to quantify given the use and profitability of subsidized business lines, but generally speaking we do a much better job of extracting value from servers than most online service providers. A colleague of mine mentioned that we are 53x Azure.
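For the curious, the back-of-envelope math works out like this. The sketch below uses only the rough figures quoted above (estimates, not audited financials), and the "implied" revenue line is simply those estimates multiplied together:

```python
# Back-of-envelope check of the revenue-per-server comparison above.
# Inputs are the approximate public figures quoted in this post, not exact financials.

ebay_quarterly_revenue = 2_000_000_000   # ~$2B, Q4 2013
ebay_servers = 52_500                    # ~52.5k servers (tech.ebay.com/dashboard)

servicenow_servers = 4_500               # ~4.5k-server install base
servicenow_rev_per_server = 30_000       # ~$30k per server per quarter, as stated above

ebay_rev_per_server = ebay_quarterly_revenue / ebay_servers
print(f"eBay:       ~${ebay_rev_per_server:,.0f} per server per quarter")   # ~$38,095
print(f"ServiceNow: ~${servicenow_rev_per_server:,.0f} per server per quarter")

# Rough quarterly revenue implied by those two ServiceNow estimates
implied_sn_quarterly_revenue = servicenow_servers * servicenow_rev_per_server
print(f"Implied ServiceNow quarterly revenue: ~${implied_sn_quarterly_revenue / 1e6:,.0f}M")
```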
But more interesting than raw revenue is what it means for our footprint. Operating with this degree of efficiency eliminates the need for us to build our own data centers. A smaller footprint also means that the minor efficiency gains possible at 300,000+ server quantities would not be realized until well after the end of our hardware lifecycles. The engineering effort to build or ODM our own servers has negative ROI in the short term.
We optimize for scale as much as possible, with the knowledge that we simply don't need the volume. We focus on cost-effective improvements that result in immediate value for our customers.
Blast Radius
Monolithic solutions like SANs or blade servers commonly make financial sense at the project level, or for IT consolidation projects, but failures can take out large swaths of customers. We tend to select technology that scales alongside customer growth (e.g. 10 customers = 1 server) at the smallest inflection point that makes financial sense. For example, we would not consider doubling customer density on a server just to save 2%. We would rather spend the 2% and have only 50% of that customer set affected in the case of an outage.
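To make that trade-off concrete, here is a toy calculation. The customer count, unit cost, and exact 2% figure are illustrative assumptions rather than our real numbers; the point is that a marginal fleet-cost saving doubles the number of customers touched by any single server failure:

```python
# Toy blast-radius model: Option A keeps density low, Option B doubles density
# on bigger servers that are only ~2% cheaper for the fleet as a whole.

customers = 1_000                           # hypothetical customer count
density_a, unit_cost_a = 10, 10_000         # 10 customers/server, $10k/server (assumed)

servers_a = customers // density_a          # 100 servers
fleet_cost_a = servers_a * unit_cost_a      # $1,000,000

density_b = density_a * 2                   # 20 customers per server
servers_b = customers // density_b          # 50 servers
fleet_cost_b = fleet_cost_a * 0.98          # only ~2% cheaper overall

for name, density, servers, fleet_cost in [
    ("A (10 customers/server)", density_a, servers_a, fleet_cost_a),
    ("B (20 customers/server)", density_b, servers_b, fleet_cost_b),
]:
    print(f"Option {name}: {servers} servers, fleet cost ${fleet_cost:,.0f}, "
          f"blast radius = {density} customers per failed server")
```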
Data Volume
We move a relatively small amount of data across our network and across the Internet to our customers, but the value of that traffic is exceptionally high. A Netflix user may pay $7.99/month and consume 250GB of data, whereas a ServiceNow customer may pay hundreds of thousands of dollars for the same quantity of bits. The integrity of those bits is an exceedingly high priority, so we have multi-site replication with backups at both sites.
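As a rough illustration of that gap, the arithmetic looks something like this. The Netflix figures are the ones quoted above; the ServiceNow contract value and transfer volume are hypothetical placeholders for "hundreds of thousands of dollars for the same quantity of bits":

```python
# Rough value-per-gigabyte comparison behind the paragraph above.

netflix_monthly_fee = 7.99                # quoted above
netflix_monthly_gb = 250                  # quoted above
netflix_value_per_gb = netflix_monthly_fee / netflix_monthly_gb

servicenow_contract_value = 300_000       # hypothetical contract value, USD
servicenow_gb_transferred = 250           # same quantity of bits, per the comparison
sn_value_per_gb = servicenow_contract_value / servicenow_gb_transferred

print(f"Netflix:    ~${netflix_value_per_gb:.3f} per GB")     # ~$0.03 per GB
print(f"ServiceNow: ~${sn_value_per_gb:,.0f} per GB")         # ~$1,200 per GB
print(f"Ratio:      ~{sn_value_per_gb / netflix_value_per_gb:,.0f}x")
```

Under those assumed numbers, a gigabyte of ServiceNow traffic carries tens of thousands of times the contract value of a gigabyte of streaming video, which is why integrity and replication dominate our design choices rather than raw throughput.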
Bleeding Edge vs Tried and True
While we evaluate new technologies every day, the hard truth is that we tend to remain conservative in order to introduce the smallest amount of risk into the production environment. Even staid technologies are fully tested, evaluated, and onboarded as if they were bleeding edge to ensure there is no disruption when they are introduced en masse.
Bringing all of these into balance seems daunting and occasionally impossible (e.g. traditional HA solutions tend to be monolithic and complex rather than distributed and simple), but we have a rigorous evaluation and design process to ensure each server platform generation is the best possible combination. More on that later.