Wow, that seems almost unbelievable. POWER8 beats the Intel part by roughly 3x (about 0.3x the runtime) at the same level of concurrency? Can someone offer an explanation of how/why this happens, and why we're not all running POWER architecture on our servers?
One reason would be that POWER uses 8-way SMT while Intel (with Hyper-Threading) uses 2-way SMT. If your code has tons of cache misses you will gain greatly from 8-way SMT, since another thread can run on the physical core while the other threads are waiting on memory. I suspect PPC would do very poorly with high-performance code that has few cache misses. However, for something like a Java program, where cache misses are abundant, I would expect to see a big boost in performance. The benchmarks above are therefore completely dependent on the cache size.
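The latency-hiding effect can be sketched with a toy model (my own simplification, not from the thread: assume each thread does C cycles of work, then stalls M cycles on a cache miss, and the core round-robins between ready threads):

```python
# Toy SMT latency-hiding model: with K threads per core, each busy for
# C out of every C+M cycles, core utilization ~= min(1, K * C / (C + M)).
def core_utilization(compute_cycles, stall_cycles, smt_threads):
    busy_fraction = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, smt_threads * busy_fraction)

# Cache-miss-heavy code: ~10 cycles of work per ~100-cycle memory stall
# (made-up numbers, just to show the shape of the curve).
for k in (1, 2, 4, 8):
    print(f"SMT{k}: utilization ~{core_utilization(10, 100, k):.2f}")
```

Under those (invented) numbers, SMT8 keeps the core roughly 4x busier than SMT2, which is the kind of gap the benchmark shows.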
I worked with the people at IBM who did optimization work for Java on Power. The ridiculous amount of work they did was amazing. There's dozens upon dozens of compiler papers and presentations that are more or less power specific. It was an exciting, and stressful, thing to be a part of.
If you look closely at the specs, the Xeon has about 2.5MB of L3 per core, while the P8 has about 8MB of L3 per core. There are bigger differences than this, but the examples above have both processors at SMT2 (~30 seconds vs ~10 seconds); the differences are greater at SMT4 and SMT8. Of course, if your code fits nicely into L3 (e.g. if your hot data consumes 3MB), then POWER8 will be a bigger winner, since there may be more thrashing on the x86 chip. Unless, of course, the access pattern is straightforward and the cachelines can be prefetched (think sequential access vs random access).
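As a rough back-of-the-envelope on those per-core figures (both approximate, quoted from the comment above), here's the L3 available per hardware thread at each SMT level:

```python
# Approximate L3 cache per hardware thread, given per-core L3 sizes
# quoted above (~2.5 MB/core Xeon, ~8 MB/core POWER8 - both rough).
def l3_per_thread(l3_per_core_mb, smt):
    return l3_per_core_mb / smt

print(f"Xeon   SMT2: {l3_per_thread(2.5, 2):.2f} MB/thread")  # HT tops out at 2-way
for smt in (2, 4, 8):
    print(f"POWER8 SMT{smt}: {l3_per_thread(8.0, smt):.2f} MB/thread")
```

Even at SMT8 the P8 threads each see about as much L3 as a hyper-threaded Xeon thread, which fits the thrashing argument.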
From what I saw elsewhere, highly parallel code runs faster on P8, whereas single-thread performance is better on x86. So if your HPC app is basically single-threaded on each compute node, x86 would be faster. But if your HPC code is heavily multithreaded per core, then P8 may surprise you.
They run hot. Yes, they've been improving on heat output and power consumption, but they're certainly not anywhere near most Intel parts — that said, they're chasing a market where perf matters above all else.
Video encoding probably wouldn't be so great comparatively, in large part because I don't think any encoders use the POWER8 vector stuff.
Is this "brand new" network substantively different from the one you launched last year? It looks to me like this is still based on OVH's (often not fully online) internal 1Tbps network plus a lot of rented fiber or virtual-PoP over rented transit.
Is the bandwidth still part of the lackluster Volume network you use in your French DCs?
The creators of htop just need to get inventive about how to present the data. One idea is to change it into a histogram of percentages (this many threads at this level of use).
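A minimal sketch of that idea, with made-up utilization data (`usage_histogram` is a hypothetical helper, not anything from htop):

```python
# Bucket per-thread CPU usage percentages into ranges and count them,
# instead of drawing one meter per hardware thread.
def usage_histogram(usages, bucket_size=25):
    buckets = {}
    for u in usages:
        # Clamp 100% into the top bucket so it doesn't get its own row.
        lo = (min(int(u), 99) // bucket_size) * bucket_size
        buckets[lo] = buckets.get(lo, 0) + 1
    return buckets

def render(buckets, bucket_size=25):
    for lo in sorted(buckets):
        bar = "#" * buckets[lo]
        print(f"{lo:3d}-{lo + bucket_size - 1:3d}%: {bar} ({buckets[lo]} threads)")

# Made-up sample standing in for 128 hardware threads.
sample = [5, 12, 99, 100, 47, 88, 3, 76] * 16
render(usage_histogram(sample))
```

Four rows instead of 128 meters, which would scale fine to a POWER8's thread count.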
# Intel
$ sysbench --test=cpu --cpu-max-prime=100000 --num-threads=8 run
    execution time (avg/stddev): 29.8777/0.01

# Power8
$ sysbench --test=cpu --cpu-max-prime=100000 --num-threads=8 run
    execution time (avg/stddev): 9.5050/0.00
$ sysbench --test=cpu --cpu-max-prime=100000 --num-threads=16 run
    execution time (avg/stddev): 4.7733/0.01
$ sysbench --test=cpu --cpu-max-prime=100000 --num-threads=126 run
    execution time (avg/stddev): 2.6502/0.14
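The speedups implied by those averages work out as below. Caveat: the Intel figure is only for 8 threads, so the 16- and 126-thread rows compare different levels of concurrency and aren't apples-to-apples.

```python
# Average execution times (seconds) copied from the sysbench runs above.
intel_8t = 29.8777
power8 = {8: 9.5050, 16: 4.7733, 126: 2.6502}

# Roughly 3x, 6x, and 11x versus the 8-thread Xeon run.
for threads, t in power8.items():
    print(f"P8 @ {threads:3d} threads: {intel_8t / t:.1f}x the 8-thread Xeon run")
```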
[0] http://www.ovh.com/us/dedicated-servers/infra/2014-EG-32.xml