The problem is not the theoretical peak teraflops. The problem is actually achieving those teraflops with useful work. Because of the architecture, that is easier on a CPU than on a GPU, so you can't directly compare peak teraflops and conclude that GPUs are superior. Getting something to run fast on a GPU is very difficult.
And actually the thing that does 4.5 teraflops in single precision does only 95 gigaflops in double precision per GPU. A good x86 CPU does ~100 gigaflops in double precision as well, and you're much more likely to actually achieve that number on an x86. That said, another one on the page you linked to theoretically does 665 gigaflops in double precision.
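To make the peak-versus-achieved point concrete, here is a rough back-of-the-envelope sketch in Python. The peak figures are the ones quoted above; the efficiency fractions are purely hypothetical placeholders that I picked for illustration, not measurements, so plug in whatever your own benchmarks show.

```python
# Peak figures quoted above, in gigaflops.
GPU_SINGLE_PEAK = 4500.0  # 4.5 teraflops, single precision, per GPU
GPU_DOUBLE_PEAK = 95.0    # same part, double precision, per GPU
CPU_DOUBLE_PEAK = 100.0   # "good x86 CPU", double precision


def achieved_gflops(peak_gflops, efficiency):
    """Throughput actually delivered when only a fraction of peak is reached."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return peak_gflops * efficiency


# How much the headline single-precision number overstates double precision.
single_to_double_ratio = GPU_SINGLE_PEAK / GPU_DOUBLE_PEAK

# HYPOTHETICAL efficiencies, chosen only to illustrate the argument.
# Measure your own code to get real values.
GPU_EFFICIENCY = 0.3
CPU_EFFICIENCY = 0.8

gpu_achieved = achieved_gflops(GPU_DOUBLE_PEAK, GPU_EFFICIENCY)
cpu_achieved = achieved_gflops(CPU_DOUBLE_PEAK, CPU_EFFICIENCY)

print(f"single/double peak ratio on the GPU: {single_to_double_ratio:.1f}x")
print(f"GPU double precision, achieved: {gpu_achieved:.1f} gigaflops")
print(f"CPU double precision, achieved: {cpu_achieved:.1f} gigaflops")
```

The single-to-double ratio comes out at roughly 47x, which is why the headline teraflops number says little about double-precision work. Which side wins on achieved throughput depends entirely on the efficiencies you actually reach.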
Single precision is probably fine for a neural network. Neural networks are somewhat insensitive to noise and failure, and single precision adds very little noise.
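If you want a feel for how little noise single precision adds, this small sketch rounds a few double-precision values to single precision and reports the worst relative error. The weight values are made up for illustration, and `to_float32` is just a helper name I chose; it uses the standard `struct` module to do the rounding.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def relative_rounding_error(x):
    """Relative error introduced by storing x in single precision."""
    return abs(to_float32(x) - x) / abs(x)


# Made-up example weights, purely illustrative.
weights = [0.1, -0.37, 1.2345678, 3.14159265]

worst = max(relative_rounding_error(w) for w in weights)

# Round-to-nearest keeps the relative error of normal-range values
# at or below 2**-24, which is about 6e-8.
print(f"worst relative rounding error: {worst:.3e}")
print(f"single-precision bound:        {2.0 ** -24:.3e}")
```

A relative error below about 6e-8 per stored value is tiny next to the kind of noise a neural network typically tolerates, though errors can accumulate over long sums, so this only shows the per-value effect.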