In my previous post I described my work to add support for SIMD
instructions to GHC and to exploit this support in the high-level vector
library. Despite some (still unresolved) issues with the code generated by
LLVM, SIMD instructions did lead to a significant performance increase on
some small floating-point intensive microbenchmarks.
This performance gain comes at a syntactic cost: code must be rewritten to
use the new SIMD operations. For consumers of the vector library this cost
is small, because high-level SIMD operators are available, but it would be
nice not to have to pay any syntactic cost at all. There is an additional
"deficiency" in the vector approach to computing with arrays—a lack of
parallelism. It's nice to be able to leverage SIMD operations from Haskell
code, but it would be even better if we could leverage them in parallel
code! There are two primary frameworks for computing with parallel arrays
in Haskell: Repa and Data Parallel Haskell (DPH). DPH includes a vectorizer
that converts scalar code to vector code, so it seems like there's a good
chance we could support SIMD operations in DPH without requiring users to
rewrite code. The rest of this post describes (positive) preliminary
results in this direction.
My running example will be computing the dot product of two vectors, this time
using DPH instead of the vector
library. Here is the code:
    import Data.Array.Parallel
    import qualified Data.Array.Parallel.Prelude.Double as D

    dotp :: PArray Double -> PArray Double -> Double
    {-# NOINLINE dotp #-}
    dotp v w = dotp' (fromPArrayP v) (fromPArrayP w)

    dotp' :: [:Double:] -> [:Double:] -> Double
    dotp' v w = D.sumP (zipWithP (D.*) v w)
DPH introduces a new parallel array data type (PArray), some new syntax for
parallel arrays ([:...:]), and some new combinators for operating on
parallel arrays (zipWithP). DPH doesn't quite yet support type classes, so
we have to import a special version of the Prelude specialized to the
Double type. You can read more about the details on the DPH wiki page.
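To see how the pieces fit together, here is a minimal sketch of an
unvectorized driver module that builds the parallel arrays and calls dotp.
The module name DotP and the fromList conversion from
Data.Array.Parallel.PArray are assumptions on my part; check the dph
documentation for the exact interface:

    -- Driver.hs: a plain (unvectorized) module. Only the module defining
    -- dotp is compiled with -fvectorise; this one just marshals data.
    import qualified Data.Array.Parallel.PArray as P
    import DotP (dotp)  -- the vectorized module above (hypothetical name)

    main :: IO ()
    main = do
      let v = P.fromList [1 .. 1000]  -- assumed: fromList :: [a] -> PArray a
          w = P.fromList [1 .. 1000]  -- element type inferred from dotp
      print (dotp v w)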
Using the simd branches of ghc, the vector library, and the dph library, we
can change a single import statement and our code will use SIMD
instructions to compute the dot product:
    1: import Data.Array.Parallel
    2: import qualified Data.Array.Parallel.Prelude.MultiDouble as D
    3:
    4: dotp :: PArray Double -> PArray Double -> Double
    5: {-# NOINLINE dotp #-}
    6: dotp v w = dotp' (fromPArrayP v) (fromPArrayP w)
    7:
    8: dotp' :: [:Double:] -> [:Double:] -> Double
    9: dotp' v w = D.sumP (zipWithP (D.*) v w)
Note that only the import statement on line 2 changed. If I had simply
rolled SIMD support directly into Data.Array.Parallel.Prelude.Double,
exploitation of SIMD instructions would be automatic, but for now I've put
my changes into a separate module (this also makes comparisons easier).
Here are benchmarks taken on a 2.70GHz Intel® Core™ i7-2620M CPU with
frequency scaling disabled. I'm using a perf-llvm build of the simd branch
of GHC (compiled with GHC 7.4.1—earlier versions have trouble using LLVM to
compile GHC). The benchmark program was compiled with the options -rtsopts
-threaded -Odph -O2 -fllvm -optlo-O3 -optc-O3 -optc-march=corei7. I did not
use -fcpr-off or -fno-liberate-case as suggested by comments in the README
in the dph library, because these options prevented fusion of SIMD
operations. It looks like we really need Constructed product result
analysis for efficient code, which is what I would expect.
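For concreteness, the build and run steps looked something like the
following; DotP.hs is a stand-in for the actual benchmark source file, and
the -N value is whatever core count is being measured:

    # Compile with the flags above; DotP.hs is a hypothetical file name.
    ghc --make -rtsopts -threaded -Odph -O2 -fllvm -optlo-O3 -optc-O3 \
        -optc-march=corei7 DotP.hs
    # Run on, e.g., 4 cores.
    ./DotP +RTS -N4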
Error bars mark one standard deviation, and times are averaged over 100
runs.
The C and vector versions both take advantage of SIMD instructions—in the
former case, via intrinsics. I've included three DPH versions: the two
versions I already gave code for, and a version that uses DPH's parallel
array comprehension syntax. The code for this version is:
    dotp :: PArray Double -> PArray Double -> Double
    {-# NOINLINE dotp #-}
    dotp v w = dotp' (fromPArrayP v) (fromPArrayP w)

    dotp' :: [:Double:] -> [:Double:] -> Double
    dotp' xs ys = D.sumP [:x D.* y | x <- xs | y <- ys:]
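For reference, I would expect a comprehension with two parallel qualifiers
to desugar to essentially the zipWithP form we already wrote (an assumption
about the desugaring, not something I've verified against GHC's output):

    -- Expected desugaring of the comprehension above (assumed, not verified):
    dotp'' :: [:Double:] -> [:Double:] -> Double
    dotp'' xs ys = D.sumP (zipWithP (D.*) xs ys)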
Strangely, using the syntactic sugar results in a significant performance
hit. This is a pre-release version of DPH, though, and I am told that this
performance gap should be eliminated in time for the next release.
Using SIMD instructions in DPH gives us a big gain over the baseline DPH code, but it looks like we're limited by memory bandwidth, so we're not taking advantage of our extra cores. Fortunately, I have access to a 32-core 2.0GHz AMD Opteron™ 6128. Here is the same benchmark run there:
Now we're cooking—much better scaling behavior! With DPH we can take
advantage of multiple cores and we don't have to rewrite our code to make
use of SIMD instructions. Before GHC 7.6.1, I'd like to integrate all of
this work back into the mainline branches of GHC, vector, and dph. Once
that is done, all you will have to do to take advantage of SIMD versions of
floating-point operations in your DPH code is move to the new release.