With the rise of “Cloud Computing” over the last several years, more and more companies are looking to leverage virtual machines or cloud services in their IT infrastructure. Aside from the hype, cloud services and VMs can provide some real-world benefits that are hard to ignore. Lately, companies have started asking if VMs or “The Cloud” can be leveraged as part of their ATG/Oracle Commerce infrastructure.
“The Cloud” or Cloud Computing is a broad term that covers a wide range of technologies and models. Wikipedia defines Cloud Computing as:
Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet).
That’s pretty vague, and covers everything from normal websites to specialized models of service delivery. When it comes to hosting an ATG/Oracle Commerce application we’re really talking about Infrastructure as a Service (IaaS), that is: providing Virtual Machines, storage, load balancers, and networking.
Throughout this document I will use the terms Cloud, VMs, The Cloud, Cloud Infrastructure, etc. to refer to IaaS and to VMs delivered as part of an IaaS service.
Types of The Cloud
Most cloud providers use a shared cloud. This means that multiple clients’ VMs and resources are provided from shared physical cloud infrastructure(s).
A private cloud is a cloud infrastructure dedicated to a single client or organization. All the hardware and VMs are built and managed for the single client. Some hosting companies will provide private cloud options, but many private clouds are built and managed internally within the company using them.
Benefits of The Cloud
The benefits of a cloud based infrastructure usually fall into two basic categories: Stability and Scalability.
Large cloud service providers, such as Amazon, RackSpace, SoftLayer, Linode, and SliceHost use a large physical infrastructure and an intelligent hypervisor layer which protects reasonably well against hardware failure on a real world server. RAM, CPUs, Power Supplies, NICs, and Motherboards will all fail eventually, given enough time and enough servers. A well-built cloud infrastructure can provide protection against those types of hardware failure. However, you can also introduce other risks (shared service overload, hypervisor failure/bugs), so it’s not a silver bullet.
The key with cloud infrastructures is Elastic Scalability, which is the ability to quickly add additional resources to your cluster by making new VMs (or depending on your situation, larger VMs) available within 15 minutes or less. On-demand elastic scalability can help you weather a huge surge in traffic, or just holiday season visitor increases, without needing to deploy massive infrastructure you won’t need 365 days a year.
Drawbacks of The Cloud
Cloud computing and VM based infrastructures have several drawbacks which are important to consider. While the relative importance of each of these is dependent on your application and requirements, it’s critical to keep the following points in mind.
Because VMs rely on a hypervisor and VM management software, you will see a negative performance impact when compared to running on the same hardware as dedicated servers. The exact performance impact depends on the VM software, configuration, and other factors; however, here are some basic guidelines:
CPU Impact: Expect to lose at least 5-15% of your CPU performance. Some cases can be much worse, with slowdowns of 41% or more; JVM server performance seems to be in the 10-15% range in most cases.
Network: Up to 95% slower than native (VMware is in the 0-5% penalty range, while Xen may be as bad as 95% or more)
Memory: 0-10% slower than native
Disk: 30-60% slower than native
In my personal experience, Disk and Network are the two worst areas. CPU impact is also measurable and shouldn’t be ignored, especially for ATG/Oracle Commerce applications (more on that later), but Disk and Network slowness can easily bring a “lightweight” application to its knees. The more “shared” the Cloud infrastructure is, the worse these areas are likely to be.
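To make those percentages concrete, here is a minimal sketch of what an overhead fraction does to usable capacity, and how much extra hardware you would need to win it back. The function names and the 1,000 requests/second baseline are hypothetical, chosen only for illustration; the overhead figures are the rough guideline ranges listed above.

```python
def effective_capacity(native_capacity: float, overhead_fraction: float) -> float:
    """Capacity left after losing overhead_fraction of native performance to the VM layer."""
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return native_capacity * (1.0 - overhead_fraction)


def extra_hardware_factor(overhead_fraction: float) -> float:
    """How many times more hardware is needed to match native capacity."""
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return 1.0 / (1.0 - overhead_fraction)


if __name__ == "__main__":
    native_rps = 1000.0  # hypothetical baseline, not a measured figure
    for label, overhead in [("CPU, 15%", 0.15), ("Memory, 10%", 0.10), ("Disk, 60%", 0.60)]:
        print(label, effective_capacity(native_rps, overhead), extra_hardware_factor(overhead))
```

Note that the hardware factor grows faster than the overhead itself: a 60% disk penalty means 2.5 times the hardware to get back to native throughput, not 1.6 times.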
Security and PCI Compliance
With Shared Cloud infrastructures there can be security and PCI compliance concerns. There have been many hypervisor bugs which allow one client of the Cloud to access another client’s VMs, compromising security. Many Shared Cloud offerings are not PCI Level 1 certified. A few of the larger players have PCI compliant offerings now, but they are typically separate from their main Cloud service, have higher costs, and more limitations. This is less of an issue with Private Clouds, although they have their own drawbacks.
With Cloud infrastructure, especially Shared Clouds, it’s possible to have the infrastructure over-provisioned. While good Cloud providers will probably avoid this, it can be hard to know for sure. You know how airlines overbook flights, assuming that some people won’t show? Some Cloud providers will oversell their hardware, selling more VM resources than the hardware can even provide at full capacity. This is part of the magic of the Cloud, both good and bad.
By selling 20 VMs on a 16 core box, you can make the price affordable, and most of the time, most of the VMs won’t be using anywhere close to 100% of their CPU/Network/RAM/Disk, etc. However, you don’t want to need all your CPU resources, or all of your network resources, and not have them available to you. Again, this is an area where network and disk are often the most oversold and under-delivered.
Research your Cloud provider to ensure they are under-provisioning, not over-provisioning.
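A quick way to reason about overselling is the ratio of virtual CPUs sold to physical cores. The sketch below uses the 20-VMs-on-16-cores example from above; the number of vCPUs per VM is my assumption (the example doesn't say), and the function names are hypothetical.

```python
def oversubscription_ratio(vm_count: int, vcpus_per_vm: int, physical_cores: int) -> float:
    """vCPUs sold divided by physical cores; above 1.0 means the box is oversold."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return (vm_count * vcpus_per_vm) / physical_cores


def worst_case_core_share(vm_count: int, vcpus_per_vm: int, physical_cores: int) -> float:
    """Fraction of a physical core each vCPU gets if every VM is busy at once."""
    ratio = oversubscription_ratio(vm_count, vcpus_per_vm, physical_cores)
    if ratio == 0:
        return 1.0  # nothing sold, nothing to contend with
    return min(1.0, 1.0 / ratio)


if __name__ == "__main__":
    # 20 VMs on a 16 core box, assuming 1 vCPU per VM (an assumption)
    print(oversubscription_ratio(20, 1, 16), worst_case_core_share(20, 1, 16))
    # The same box if each VM were sold with 4 vCPUs (also an assumption)
    print(oversubscription_ratio(20, 4, 16), worst_case_core_share(20, 4, 16))
```

A ratio at or below 1.0 is what under-provisioning looks like for CPU. The same arithmetic applies to network and disk, where the real numbers are usually harder to get out of a provider.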
In order to provide scaling, most Cloud infrastructures should be under-provisioned. This, combined with the performance overhead inherent in VMs, means that you will always be paying more for a given amount of performance than if you were buying dedicated hardware. The only way you save money is if you’re using less than a server’s worth of resources.
Some products, including ATG/Oracle Commerce and many other Oracle products, are licensed based on cores/processors or sockets. While Oracle has introduced a new requests-based licensing model, the majority of ATG/Oracle Commerce customers are still on the old cores-based one. As such, there are typically significant restrictions in the licensing terms around virtualization, VMs, and Cloud infrastructures.
For instance, for ATG/Oracle Commerce, unless you’re using Oracle VM, you have to be licensed for (or buy licensing for) all of the physical cores/processors that provide your VMs. If you use Oracle VM you only need to be licensed for your provisioned cores/processors. However, Oracle VM is rarely used by Cloud providers, and usually isn’t the first choice of IT departments building private clouds.
Where Cloud Hosting Shines
So after all the detail about Cons, following a very lightweight look at the Pros, you might think I’m against Cloud hosting. I’m not. The benefits are well known and easy to understand, so I’ve spent more time detailing some of the negative aspects, especially the ones that are relevant for ATG/Oracle Commerce hosting.
Cloud Hosting is absolutely fantastic for two high level use cases:
1. When you need fewer resources than a full dedicated server.
In this case, you don’t need a full quad core server with 16 GB of RAM and 2 Gbit NICs for your application, much less multiple boxes for redundancy. So using a smaller, cheaper VM makes a lot of sense. It saves money and offers better redundancy than one dedicated server, and saves a lot more money than paying for two dedicated servers.
At Spark::red we utilize VMs in this way for things like IMAP, LDAP, DNS, and other services that aren’t resource intensive, but where uptime and the ability to spin up multiple instances around the world are very good things to have.
2. When you need to, and are architecturally able to, scale your infrastructure resources up and down dramatically.
In the second situation, you have an application where the architecture, and licensing (or lack thereof), facilitates scaling your number of VM nodes up and down around demand (hourly, or seasonally, or whatever). Stateless and/or simple applications can do this very well, as can applications written to support this deployment topology concept. Having an application where you can add in caching front end servers, or stateless application servers, or database read-only nodes, or other similar pieces, can allow you to scale your capacity up and down based on need, minimizing costs when your traffic is lower.
Pinterest does this in a way that saves them significant hosting costs. Netflix is another great example of a very elastic and self-managing scalable application. Many other web apps support similar scaling.
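As a rough illustration of why this saves money for applications that can do it, here is a minimal sketch that sizes a stateless tier hour by hour and compares the node-hours against running at peak size the whole time. All of the numbers, the function names, and the two-node redundancy floor are hypothetical.

```python
import math
from typing import Iterable, List


def nodes_needed(demand_rps: float, per_node_rps: float, min_nodes: int = 2) -> int:
    """Nodes required for the given demand, never below a redundancy floor."""
    if per_node_rps <= 0:
        raise ValueError("per_node_rps must be positive")
    return max(min_nodes, math.ceil(demand_rps / per_node_rps))


def elastic_plan(hourly_demand: Iterable[float], per_node_rps: float) -> List[int]:
    """Node count for each hour when scaling to match demand."""
    return [nodes_needed(d, per_node_rps) for d in hourly_demand]


if __name__ == "__main__":
    demand = [100, 300, 900, 400]  # hypothetical requests/second per hour
    plan = elastic_plan(demand, per_node_rps=200)
    elastic_hours = sum(plan)
    fixed_hours = max(plan) * len(plan)  # always running the peak node count
    print(plan, elastic_hours, fixed_hours)
```

With these made-up numbers the elastic plan uses 11 node-hours against 20 for a cluster fixed at peak size. The saving only exists if nodes can actually be added and removed that freely, which is the architectural requirement described above.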
Why Cloud Isn’t a Good Fit for ATG/Oracle Commerce Hosting
As you can imagine, ATG/Oracle Commerce doesn’t fit into either of these two high level use cases.
ATG/Oracle Commerce environments almost always need a full server’s resources or multiple servers’ resources. Production obviously needs multiple servers’ worth of CPU, RAM, and I/O. And in production, where page response times have a direct impact on conversions and revenue, the performance overhead of a VM system has a significantly negative impact on your revenue. In Development or Staging environments, you will typically need 16-32 GB of RAM per environment, plus the database. For Development you want quick restart times (which are CPU heavy), and in Stage you will probably want to be able to run load testing, so you need performance that maps well to production and isn’t artificially constrained by VM system disk I/O or sub-par virtual CPUs.
While the ATG/Oracle Commerce platform has some impressive architecture, it is not set up for easy cluster topology changes. The BCC/CA deployment configuration points to hostnames/IPs and ports for each agent, and it has to know what agents are live so it can ensure accurate deployments to all of them. Also if you add in new agents (app server instances), assuming you don’t want to do a full deploy (you don’t), you’ll need to pre-load the latest deployment data and VFS files, as well as the current snapshot ID, make sure the switching data sources are all lined up to match the current live config of the existing instances, and then add the new agent to the CA deployment topology. You will also need to add all new instances to your load balancer or Apache proxy configuration, and reload. There are also complications around instance names and ports in various JMS messages, cache invalidation events, lock managers, etc., that can add serious speed bumps to making frequent on the fly cluster topology changes.
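To show how much bookkeeping a single new instance involves, here is a small sketch that models the steps above as an ordered checklist and refuses to skip ahead. This is not ATG code and calls no ATG API: the class, the step strings, the host name, and the strict ordering are all my own illustrative assumptions, paraphrased from the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List

# Steps paraphrased from the text; treating them as strictly ordered is an assumption.
STEPS = (
    "pre-load latest deployment data and VFS files",
    "set the current snapshot ID",
    "align switching data sources with the live instances",
    "add the agent to the CA deployment topology",
    "add the instance to the load balancer or Apache proxy config and reload",
)


@dataclass
class NewAgent:
    """Hypothetical tracker for bringing one new app server instance into a cluster."""
    host: str
    port: int
    done: List[str] = field(default_factory=list)

    def complete(self, step: str) -> None:
        """Mark the next step done; raise if it is attempted out of order."""
        if len(self.done) >= len(STEPS):
            raise RuntimeError("all steps are already complete")
        expected = STEPS[len(self.done)]
        if step != expected:
            raise RuntimeError(f"expected {expected!r} before {step!r}")
        self.done.append(step)

    @property
    def ready_for_traffic(self) -> bool:
        return len(self.done) == len(STEPS)


if __name__ == "__main__":
    agent = NewAgent(host="app5.example.internal", port=8080)  # hypothetical host
    for step in STEPS:
        agent.complete(step)
    print(agent.ready_for_traffic)
```

Every one of these steps has to happen per instance, per topology change, and the checklist doesn't even cover the JMS, cache invalidation, and lock manager complications.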
In short, an ATG/Oracle Commerce cluster does not lend itself to automated or frequent scaling up or down.
Another important factor is the ATG/Oracle Commerce licensing. The licensing is done based on CPU cores. Licenses are sold by “processor”, which has a basic multiplier relationship with CPU cores based on your processor architecture. For Intel chips, it’s a 2x multiplier: each “processor” license covers two cores. That means if you buy 4 “processors” of ATG/Oracle Commerce, you are actually buying 8 Intel cores’ worth. You can run on one octo-core proc, or two quad-core procs. This has a few important implications regarding virtualization.
First, Oracle actually has specific licensing rules around virtualization. One set of rules applies if you’re using the Oracle VM product. In this case you have to have licenses for all virtual cores you are using in the VM(s). This is pretty standard. However, if you’re using any other VM product, you have to have licenses for all of the physical cores on the infrastructure that the VM system is running on, regardless of how many are provisioned in the VM(s) you are running ATG/Oracle Commerce on. That means you have to pay for a great many expensive licenses that you aren’t actually using to serve customer traffic. Since the ATG/Oracle Commerce licenses and support contracts are likely to be one of the largest costs for running your website, it makes no sense to waste those licenses, and that money, on anything other than serving up your website pages as quickly as possible.
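The processor-to-core arithmetic is simple enough to sketch. The helper names below are hypothetical, the 2x figure is the Intel multiplier described above, and rounding an odd core count up to the next whole license is my assumption; check your own contract for the exact rules.

```python
import math

INTEL_CORES_PER_PROCESSOR_LICENSE = 2  # the 2x multiplier for Intel chips described above


def cores_covered(processor_licenses: int,
                  cores_per_license: int = INTEL_CORES_PER_PROCESSOR_LICENSE) -> int:
    """Number of CPU cores a given count of processor licenses lets you run."""
    return processor_licenses * cores_per_license


def licenses_required(cores: int,
                      cores_per_license: int = INTEL_CORES_PER_PROCESSOR_LICENSE) -> int:
    """Processor licenses needed for a core count (rounding up is an assumption)."""
    return math.ceil(cores / cores_per_license)


if __name__ == "__main__":
    print(cores_covered(4))       # 8
    print(licenses_required(32))  # 16
```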
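Here is a minimal sketch of that rule as described in this article: provisioned cores count under Oracle VM, while the host’s full physical core count applies under any other VM product. The function name and the 32-core host with an 8-core VM are hypothetical, and this is my reading of the rule rather than a statement of Oracle’s contract terms.

```python
def cores_to_license(vm_cores_provisioned: int,
                     host_physical_cores: int,
                     uses_oracle_vm: bool) -> int:
    """Cores that must be licensed to run the VM(s), per the rule described above."""
    if vm_cores_provisioned > host_physical_cores:
        raise ValueError("cannot provision more cores than the host has")
    return vm_cores_provisioned if uses_oracle_vm else host_physical_cores


if __name__ == "__main__":
    host_cores, vm_cores = 32, 8  # hypothetical infrastructure
    with_ovm = cores_to_license(vm_cores, host_cores, uses_oracle_vm=True)
    without_ovm = cores_to_license(vm_cores, host_cores, uses_oracle_vm=False)
    print(with_ovm)                # 8
    print(without_ovm)             # 32
    print(without_ovm - vm_cores)  # 24 licensed cores not serving traffic
```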
What if we ignore the two primary benefits of Cloud Hosting I’ve described above? Maybe you have other goals, or you have corporate standards and you don’t really care about the additional complexity required to manage ATG clusters in changing topologies. Well, even then it comes down to issues of price and performance.
Since ATG/Oracle Commerce instances typically require large amounts of RAM and CPU resources, in addition to I/O, the servers needed to build out a useful VM infrastructure will have to be very large, with huge amounts of RAM, massive CPU density, multiple 10 Gbit NICs, very fast disk arrays, and more. It is almost always significantly more expensive to purchase a smaller number of very large servers than a larger number of smaller servers. So for that same amount of computing power, or web site capacity, you will end up paying significantly more for your hardware if you go with a large server VM based solution, rather than using a larger number of smaller commodity servers.
You can only ever run as many CPU cores under your ATG/Oracle Commerce application as you have purchased licenses for. That’s true for dedicated hardware, it’s true for Oracle VM based VM solutions, and you’re actually worse off if you’re using any other VM solution. Assuming that you’re running on a private cloud (for security, PCI, and performance reasons), that means you’re paying for and running at least as many CPU cores as your maximum ATG/Oracle Commerce licenses.
Let’s use an example to illustrate. Assume you have 32 cores (16 “processors”) worth of ATG/Oracle Commerce licenses for your production environment. Let’s pretend for the moment that you don’t need more CPUs for the VM system and management (although you really do, so the costs are even higher), so you have four 8-core servers (or two 16-core servers, or whatever). At peak, you can be running ATG/Oracle Commerce on all of these servers’ resources, using all of your licenses. In quiet times, you could scale back to running half that, say 16 cores, leaving 16 cores available to scale up into.
However, you still pay for those licenses, and in a private cloud you’re still paying for the hardware 365 days a year (plus the VM costs and management overhead). In this case there is no reason not to run at full capacity all year, as it will provide better performance to your end users, increasing revenue, with no additional operational costs at all. So you would never want to scale down below your maximum licensed core count, which completely defeats the purpose. On a shared cloud things can be a bit different, but the PCI and security issues can often make this a non-starter.
As I mentioned above, you will never get the same performance from a VM system as you would from using the same hardware directly. You will likely see a performance penalty of 10% or more. This reduces capacity and end user performance, which in turn directly impacts revenue. If you’re paying for the hardware and the ATG/Oracle Commerce licenses, you should always maximize the performance you are getting for your dollar. VMs do just the opposite.
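The 32-core example can be put into numbers. In the sketch below, the annual cost figure is entirely made up and the function name is hypothetical; the point is only that in a private cloud the licenses and hardware are a fixed cost, so running fewer cores changes nothing except how much you pay per core actually doing work.

```python
LICENSED_CORES = 32  # from the example above (16 "processors")


def cost_per_active_core(fixed_annual_cost: float, active_cores: int) -> float:
    """Fixed annual cost spread over the cores actually serving traffic."""
    if not 0 < active_cores <= LICENSED_CORES:
        raise ValueError("active_cores must be between 1 and the licensed core count")
    return fixed_annual_cost / active_cores


if __name__ == "__main__":
    fixed_cost = 320_000.0  # hypothetical licenses + hardware + VM management, per year
    print(cost_per_active_core(fixed_cost, 32))  # 10000.0 at full capacity
    print(cost_per_active_core(fixed_cost, 16))  # 20000.0 when scaled back to half
```

The total bill is the same in both cases. Scaling down to 16 cores simply doubles the cost of each core that is still serving customers.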
While Cloud and VM based solutions have positive aspects that can benefit many applications and infrastructure components in your corporate architecture, ATG/Oracle Commerce is not a platform that is well suited to The Cloud. At this point in time, I highly recommend you use high performance, late model, dedicated servers for your ATG/Oracle Commerce environment.
What if you have large seasonal traffic spikes (holiday or season or event driven), and while you own all the licenses you need to support these peak load times, you don’t want to be paying for infrastructure you don’t need the other 10 or 11 months out of the year? Easy! Spark::red allows you to add servers on a monthly basis (as long as you have the ATG/Oracle Commerce licenses for them) as needed. We can scale up your cluster in a couple of days or less, and you’ll only pay for the additional hardware for the time you need it, month by month. We have several clients who do just that. It’s higher performance and more cost effective than VM scaling, and best of all, you don’t need to manage it or worry about it. Leave it to us!