VFX companies have software that turns off render farm machines in periods of low load. I helped write software that did this back in 2006.
Basically, we found it was possible to shut down machines in periods of low load and then use "Wake On Lan" to start them up once load picked up again.
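The Wake-on-LAN part is surprisingly simple: the sleeping NIC just listens for a "magic packet" broadcast on the LAN. A minimal sketch (the MAC address is made up):

```python
import socket

def build_magic_packet(mac: str) -> bytes:
    """A WoL magic packet is 6 bytes of 0xFF followed by the
    target NIC's MAC address repeated 16 times (102 bytes total)."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 6-byte MAC address")
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet on the LAN; the target machine's
    NIC stays powered while the rest of the box is off and wakes
    the host when it sees its own MAC in the payload."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(build_magic_packet(mac), (broadcast, port))

# wake("aa:bb:cc:dd:ee:ff")  # requires WoL enabled in the BIOS/NIC
```

The catch is that WoL only works on the broadcast domain, so the scheduler has to sit on (or relay into) the same LAN segment as the render nodes.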
I am unsure if the on-off power cycling reduces machine longevity.
> I am unsure if the on-off power cycling reduces machine longevity.
Possibly the disks are the most vulnerable components. I wonder if it would be possible in software to shut down the CPUs and have only disks + network running...
A bit late to the party on this one. Check out CPU "C-states" and things like Intel's "SpeedStep." Modern CPUs can reduce or shut off power to individual packages and cores, cutting power draw from on the order of a hundred watts (TDP) to tens of watts.
The downside is the latency associated with changing state. Depending on the transition, it can take hundreds or thousands of microseconds to move through these states. On a server workload this can introduce huge latency outliers as a request blocks on a core waking up.
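On Linux you can see exactly which idle states the kernel knows about, and their documented exit latencies, under sysfs. A small sketch that parses that tree (the sysfs layout is standard; the path argument exists only so it can be pointed at a test fixture):

```python
from pathlib import Path

def cpuidle_states(cpu: int = 0, sysfs: str = "/sys/devices/system/cpu"):
    """Return (name, wakeup latency in microseconds) for each idle
    ("C") state the kernel exposes for one CPU. Deeper states save
    more power but cost more latency to exit, which is exactly the
    tail-latency tradeoff described above."""
    states = []
    base = Path(sysfs) / f"cpu{cpu}" / "cpuidle"
    for state in sorted(base.glob("state*")):
        name = (state / "name").read_text().strip()
        latency_us = int((state / "latency").read_text())
        states.append((name, latency_us))
    return states
```

On a typical Intel box this ranges from POLL (0 µs) through deep package C-states with triple-digit microsecond exit latencies.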
I was more after "shut down everything except the disks": have the absolute minimum running, with the disks kept spinning to reduce wear on the spindle motors.
If you're shutting down, you have massive latency anyway, but if it's possible to at least save all the power not required for keeping the disks up, that'd be great.
In a diurnal cycle like Facebook's, you'd have one start-stop per day, which should be well within the rated specs of hard disks. A few years back I looked at the idea of treating disk lifetime as a resource and explicitly managing it: http://dx.doi.org/10.1109/MSST.2011.5937221
That's assuming that the servers have disks at all, which they probably shouldn't.
This is one of these things where virtualisation can help even more. For example, VMware can dynamically put servers in standby mode when demand is low and power them up again when needed: http://www.vmware.com/products/vsphere/features/drs-dpm
You know that would be trivial to do with bare metal and out of band management cards like Dell Dracs, IBM RSA cards, HP ILOs, or generic IPMI BMCs, right?
Virtualization doesn't really add much of anything for that specific problem other than increased context switching and slightly lower performance.
Disclaimer: building this type of thing (on bare metal) is a chunk of my day job. I see it as unbelievably trivial. In fact, the same ideas are behind Rackspace's "OnMetal" initiative:
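For anyone unfamiliar with the out-of-band approach mentioned above: the BMCs are usually scripted with `ipmitool`. A minimal sketch of a wrapper that builds the command line (host, user, and password here are made up; actually running it requires `ipmitool` and a reachable BMC):

```python
import subprocess

VALID_ACTIONS = {"on", "off", "soft", "cycle", "status"}

def ipmi_power(host: str, user: str, password: str, action: str) -> list[str]:
    """Build an ipmitool chassis-power command for a BMC reachable
    over the network ('lanplus' = IPMI v2.0 over RMCP+)."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, "chassis", "power", action]

# To actually power a node on:
# subprocess.run(ipmi_power("10.0.0.42", "admin", "secret", "on"), check=True)
```

The flip side, as the replies below note, is that BMC firmware quality varies wildly, and the BMC itself draws a few watts around the clock.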
Seriously. A person who would advocate using IPMI at scale has either never owned an IPMI card, never worked at scale, or both. They just don't work, and they erase whatever power savings you're trying to achieve.
Although... if you have the engineering resources of Facebook, you can write your own IPMI software and probably get it working pretty well. They're all just embedded ARM systems, after all.
Real servers can't be bought without IPMI and AFAIK the BMC cannot be turned off, so it's probably not worth worrying about BMC power if there's nothing you can do about it.
Sure, but as you're aware facebook, google, et al don't buy "real servers", they buy servers that actually meet their requirements. That's why "real vendors" like HP have missed the boat on selling millions of servers into the cloud.
Speaking of Facebook specifically, the evolution is interesting. They replaced BMCs with the reboot-on-LAN hack but then their next motherboard version had BMCs again. It would be interesting to hear the story behind that.
> Disclaimer: building this type of thing (on bare metal) is a chunk of my day job. I see it as unbelievably trivial.
You do, I'm sure, because you've invested a lot of time learning it and building tooling around it. For anyone who isn't a full-time sysadmin, debugging all of the many and various quirks in management hardware is a major time sink versus scripting a VM server's API.
If they use virtualization, the assumption that zero requests means low power no longer holds. With Linux containers, for example, other containers on the same host may still be accepting requests, so load would have to be dispatched across whole servers rather than individual containers.
The Achilles heel of virtualization is networking. All of the hypervisors out there (VMware, Xen, KVM) have user-space software switch implementations that dramatically reduce the throughput of TCP session creation. As a consequence you lose a significant amount of hardware potential to serve HTTP connections.
That's not virtualization; it's namespace isolation. There's a small performance impact if you're using NAT, but otherwise the kernel networking stack is used, so there's no performance penalty.
Yes, indeed. What I mean is that the energy-saving optimization doesn't fit well with namespace isolation, since you can't control the other containers' requests. If we want to do it, we need to dispatch requests at the server level, not the container level.
"Virtualization doesn't really add much of anything for that specific problem other than increased context switching and slightly lower performance."
This is BS. What if you have 3 physical servers at 30% utilization? DRS can _seamlessly_ consolidate _arbitrary_ application VMs onto one server and shut down the rest. With bare metal, only certain specifically designed workloads (stateless web farms, some distributed systems, etc.) can be moved easily.
It seems that dismissing virtualization out of sheer ignorance is a fad these days. Virtualization provides important hardware abstraction to a much wider variety of workloads.
You could probably get a similar effect by using HAProxy's "balance first" algorithm, which chooses the first available server with an available connection slot (as defined by maxconn). If you did this, you'd want to set maxconn pretty conservatively.
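A minimal sketch of what that might look like in an HAProxy backend (server names and addresses are made up; `balance first` fills servers in declaration order up to their `maxconn`, so under low load the later servers receive no traffic and become candidates for powering down):

```
backend app
    balance first
    server web1 10.0.0.1:80 maxconn 50 check
    server web2 10.0.0.2:80 maxconn 50 check
    server web3 10.0.0.3:80 maxconn 50 check
```

The conservative `maxconn` matters because once a server is saturated, overflow traffic spills to the next one, which is exactly the signal you'd use to wake it.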
> Basically, we found it was possible to shut down machines in periods of low load and then use "Wake On Lan" to start them up once load picked up again.
> I am unsure if the on-off power cycling reduces machine longevity.
Might be something worth exploring at Facebook.