From C# to SIMD: Numerics.Vector and Hybridizer

System.Numerics.Vector is a library provided by .NET (shipped as a NuGet package) that exposes value types such as Vector&lt;T&gt;, which RyuJIT recognizes as intrinsics. The supported intrinsics are listed in the coreclr repository. This enables SIMD acceleration of C# code, provided you rewrite your code to use these intrinsic types instead of scalar floating-point elements.
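To make the rewrite concrete, here is a minimal sketch of what Vector&lt;T&gt;-based code looks like (array names and sizes are ours, not taken from the benchmark):

```csharp
using System;
using System.Numerics;

class VectorDemo
{
    static void Main()
    {
        double[] a = { 1, 2, 3, 4, 5, 6, 7, 8 };
        double[] b = { 8, 7, 6, 5, 4, 3, 2, 1 };
        double[] sum = new double[a.Length];

        // Vector<double>.Count is the hardware SIMD width (4 doubles with AVX).
        int w = Vector<double>.Count;
        int i = 0;
        for (; i <= a.Length - w; i += w)
        {
            // Loads, adds, and stores a whole SIMD register worth of lanes at once.
            (new Vector<double>(a, i) + new Vector<double>(b, i)).CopyTo(sum, i);
        }
        for (; i < a.Length; ++i)   // scalar remainder loop
            sum[i] = a[i] + b[i];

        Console.WriteLine(string.Join(",", sum));
    }
}
```

Note that the loop structure itself changes (SIMD-width strides plus a scalar remainder), which is exactly the intrusiveness discussed below.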

On the other hand, Hybridizer aims to provide these benefits without being intrusive in the code (only metadata is required).

We naturally wanted to test if System.Numerics.Vector delivers good performance, compared to Hybridizer.

Summary

We measured that Numerics.Vector provides a good speedup over scalar C# code as long as no transcendental function (such as Math.Exp) is involved, but it still lags behind Hybridizer. Because some operators and mathematical functions are missing, Numerics.Vector can also generate very slow code when execution falls back to scalar and the AVX pipeline is broken. In addition, the required code modification is invasive and cannot easily be rolled back.

We wrote and ran two benchmarks, and for each of them we have four versions:

  • Simple C# scalar code
  • Numerics.Vector
  • Simple C# scalar code, hybridized
  • Numerics.Vector, hybridized

The processor is a Core i7-4770S @ 3.1 GHz (maximum measured turbo in AVX mode: 3.5 GHz). Peak performance is 224 GFlop/s (4 cores × 3.5 GHz × 2 FMA ports × 4 doubles per AVX register × 2 flops per FMA), or 112 GCFlop/s if we count an FMA as a single operation (FMA is supported on this processor).

Compute bound benchmark

This benchmark is compute-intensive. For each element of a large double-precision array (8 million elements: 67 MB), we apply twelve iterations of a Taylor expansion of the exponential (expm1). This is largely enough to enter the compute-bound regime, hiding memory latencies behind a large number of floating-point operations.

Scalar code is simply:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static double expm1(double x)
{
    return ((((((((((((((15.0 + x)
        * x + 210.0)
        * x + 2730.0)
        * x + 32760.0)
        * x + 360360.0)
        * x + 3603600.0)
        * x + 32432400.0)
        * x + 259459200.0)
        * x + 1816214400.0)
        * x + 10897286400.0)
        * x + 54486432000.0)
        * x + 217945728000.0)
        * x + 653837184000.0)
        * x + 1307674368000.0)
        * x * 7.6471637318198164759011319857881e-13;
}
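For reference, the "magic" constants are the Horner coefficients of the degree-15 Taylor polynomial of e^x − 1, factored by 1/15! ≈ 7.6471637318198e-13 (note that 15! = 1,307,674,368,000, the last constant in the chain):

```latex
e^x - 1 \approx \sum_{k=1}^{15} \frac{x^k}{k!}
      = \frac{x}{15!}\left(x^{14} + 15\,x^{13} + 210\,x^{12} + \dots + \frac{15!}{2!}\,x + 15!\right)
```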

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static double twelve(double x)
{
    return expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(x))))))))))));
}

We added the AggressiveInlining attribute to help RyuJIT merge operations at JIT time.
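The benchmark harness itself is not shown in this post; a hypothetical scalar driver loop would simply apply twelve to every element of the array:

```csharp
// Hypothetical scalar driver loop (illustrative only, not the actual harness):
// dst and src are the 8M-element double arrays.
public static void Apply(double[] dst, double[] src)
{
    for (int i = 0; i < src.Length; ++i)
        dst[i] = twelve(src[i]);
}
```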

The Numerics.Vector version of the code is nearly identical:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector<double> expm1(Vector<double> x)
{
    return ((((((((((((((new Vector<double>(15.0) + x)
        * x + new Vector<double>(210.0))
        * x + new Vector<double>(2730.0))
        * x + new Vector<double>(32760.0))
        * x + new Vector<double>(360360.0))
        * x + new Vector<double>(3603600.0))
        * x + new Vector<double>(32432400.0))
        * x + new Vector<double>(259459200.0))
        * x + new Vector<double>(1816214400.0))
        * x + new Vector<double>(10897286400.0))
        * x + new Vector<double>(54486432000.0))
        * x + new Vector<double>(217945728000.0))
        * x + new Vector<double>(653837184000.0))
        * x + new Vector<double>(1307674368000.0))
        * x * new Vector<double>(7.6471637318198164759011319857881e-13);
}
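The driver loop, however, must change shape. Assuming a Vector&lt;double&gt; overload of twelve built the same way as expm1 above, a hypothetical vectorized driver would look like this (this stride rewrite is what makes the Numerics.Vector approach intrusive):

```csharp
// Hypothetical driver loop for the Vector<double> flavor: the array is now
// walked in SIMD-width strides instead of one element at a time.
public static void Apply(double[] dst, double[] src)
{
    int w = Vector<double>.Count;          // 4 doubles with AVX
    int i = 0;
    for (; i <= src.Length - w; i += w)
        twelve(new Vector<double>(src, i)).CopyTo(dst, i);
    for (; i < src.Length; ++i)            // scalar remainder
        dst[i] = twelve(src[i]);
}
```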

The four versions of this code give the following performance results:

Flavor      Scalar C#   Vector C#   Vector Hyb   Scalar Hyb
GCFlop/s    4.31        19.95       41.29        59.65

Compute-bound speedup

As stated, Numerics.Vector delivers a roughly 4.6× speedup over scalar code (19.95 vs 4.31 GCFlop/s), consistent with the 4-wide AVX vector width. However, performance is far from what we reach with Hybridizer. Looking at the generated assembly makes the reason clear:

vbroadcastsd ymm0,mmword ptr [7FF7C2255B48h]
vbroadcastsd ymm1,mmword ptr [7FF7C2255B50h]
vbroadcastsd ymm2,mmword ptr [7FF7C2255B58h]
vbroadcastsd ymm3,mmword ptr [7FF7C2255B60h]
vbroadcastsd ymm4,mmword ptr [7FF7C2255B68h]
vbroadcastsd ymm5,mmword ptr [7FF7C2255B70h]
vbroadcastsd ymm7,mmword ptr [7FF7C2255B78h]
vbroadcastsd ymm8,mmword ptr [7FF7C2255B80h]
vaddpd      ymm0,ymm0,ymm6
vmulpd      ymm0,ymm0,ymm6
vaddpd      ymm0,ymm0,ymm1
vmulpd      ymm0,ymm0,ymm6
vaddpd      ymm0,ymm0,ymm2
vmulpd      ymm0,ymm0,ymm6
vaddpd      ymm0,ymm0,ymm3
vmulpd      ymm0,ymm0,ymm6
vaddpd      ymm0,ymm0,ymm4
vmulpd      ymm0,ymm0,ymm6
vaddpd      ymm0,ymm0,ymm5
vmulpd      ymm0,ymm0,ymm6
vaddpd      ymm0,ymm0,ymm7
vmulpd      ymm0,ymm0,ymm6
vaddpd      ymm0,ymm0,ymm8
vmulpd      ymm0,ymm0,ymm6
; repeated

Fused multiply-adds are not reconstructed, and constant operands are reloaded from the constant pool at each expm1 invocation. Keeping all the constants live in registers leads to high register pressure, where memory operands could save registers.

Here is what Hybridizer generates from scalar code:

vaddpd      ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vfmadd213pd ymm1,ymm0,ymmword ptr []
vmulpd      ymm0,ymm0,ymm1
vmulpd      ymm0,ymm0,ymmword ptr []
vmovapd     ymmword ptr [rsp+0A20h],ymm0
; repeated

This reconstructs fused multiply-adds and leverages memory operands to save registers.

Why are we not at peak performance (112 GCFlop/s)? Haswell has two FMA pipelines and an FMA latency of 5 cycles. The Horner evaluation is a single dependency chain, so each FMA must wait for the previous one; saturating both pipelines would require keeping several independent FMA instructions in flight on each cycle. This could be done by interleaving independent chains, but the backend compiler does not perform such reordering. To get closer to peak performance, we would have to write assembly by hand.
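At the source level, one way to expose independent chains is to process several array elements per loop iteration, as in this sketch (whether the JIT and backend actually interleave the resulting FMA chains is not guaranteed):

```csharp
// Sketch: processing two elements per iteration creates two independent
// FMA dependency chains, which can help fill both of Haswell's FMA pipelines.
public static void ApplyUnrolled(double[] dst, double[] src)
{
    int i = 0;
    for (; i + 1 < src.Length; i += 2)
    {
        double x0 = src[i];       // chain 0
        double x1 = src[i + 1];   // chain 1, independent of chain 0
        dst[i]     = twelve(x0);
        dst[i + 1] = twelve(x1);
    }
    for (; i < src.Length; ++i)   // remainder
        dst[i] = twelve(src[i]);
}
```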

Invoke transcendentals

In this second benchmark, we compute the exponential of every component of a vector by invoking Math.Exp.

Scalar code is:

[EntryPoint]
public static void Apply_scal(double[] d, double[] a, double[] b, double[] c, int start, int stop)
{
    int sstart = start + threadIdx.x + blockDim.x * blockIdx.x;
    int step = blockDim.x * gridDim.x;
    for (int i = sstart; i < stop; i += step)
    {
        d[i] = a[i] * Math.Exp(b[i]) * Math.Exp(c[i]);
    }
}

In this context, Numerics.Vector cannot directly call vectorized transcendentals, so code often falls back to scalar loops or repeated calls that break the AVX pipeline. Hybridizer, however, can generate efficient vectorized code for the same logic.
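To illustrate the fallback, here is a sketch of the kind of per-lane wrapper Numerics.Vector code would need for Math.Exp (a hypothetical helper, not part of the library): the vector is spilled to memory, each lane goes through the scalar exponential, and the lanes are reassembled, breaking the SIMD pipeline at every call.

```csharp
// Hypothetical per-lane fallback for a transcendental on a Vector<double>.
public static Vector<double> Exp(Vector<double> v)
{
    double[] lanes = new double[Vector<double>.Count];
    v.CopyTo(lanes);                      // spill SIMD register to memory
    for (int i = 0; i < lanes.Length; ++i)
        lanes[i] = Math.Exp(lanes[i]);    // scalar call per lane
    return new Vector<double>(lanes);     // reassemble the vector
}
```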

Bandwidth speedup