GTC 2017: Hybrid Vector Library — From Memory Bound to Compute Bound with NVVM

When small tasks are submitted to the GPU, grid scheduling and synchronization costs can far exceed the time the computation itself would take, even on a CPU. In that case, the benefit of GPU computing is lost. Using runtime compilation, we illustrate an approach that generates source code to replace a list of library API calls with a single kernel call.
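To make the idea concrete, here is a minimal sketch of the code-generation step, assuming the library calls are simple elementwise operations with a scalar operand. It is not the poster's implementation: the poster's title refers to NVVM, whereas this sketch emits CUDA C text as a simpler stand-in. The names `generate_fused_kernel`, `TEMPLATES`, and the operation names are hypothetical.

```python
from typing import List, Tuple

# One recorded "library call": (operation name, scalar operand).
# The operation names are illustrative, not part of any real library.
Call = Tuple[str, float]

# C expression templates acting on the running per-element value `v`.
TEMPLATES = {
    "add": "v + {c}f",
    "sub": "v - {c}f",
    "mul": "v * {c}f",
}


def generate_fused_kernel(calls: List[Call], name: str = "fused") -> str:
    """Return CUDA C source for one kernel that applies all calls in order."""
    body = "\n".join(
        f"    v = {TEMPLATES[op].format(c=repr(float(c)))};" for op, c in calls
    )
    return (
        f'extern "C" __global__ void {name}(const float* x, float* y, int n)\n'
        "{\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i >= n) return;\n"
        "    float v = x[i];  // one global load per element\n"
        f"{body}\n"
        "    y[i] = v;        // one global store per element\n"
        "}\n"
    )


# Two recorded calls become a single kernel instead of two launches.
calls = [("mul", 2.0), ("add", 1.0)]
source = generate_fused_kernel(calls)
print(source)
```

The resulting string could then be handed to a runtime compiler (NVRTC, for example) and launched once, instead of launching one kernel per recorded call.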

The benefits are twofold:

Scheduling costs are reduced to a minimum, as several calls are merged into a single one.
Executing the merged (aggregate) kernel on a vector of values turns a memory-bound sequence of calls into a compute-bound implementation.
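A rough way to see the second benefit is to count global-memory traffic. The sketch below is a back-of-the-envelope model, not a measurement from the poster. It assumes each operation is elementwise on 32-bit floats, costs one floating-point operation per element, and that an unfused call reads and writes the whole vector. The function names are hypothetical.

```python
def bytes_moved(n: int, num_ops: int, fused: bool, elem_size: int = 4) -> int:
    """Global-memory traffic for num_ops elementwise ops over n elements."""
    # Unfused: every op makes its own pass over the vector.
    # Fused: a single pass, intermediates stay in registers.
    passes = 1 if fused else num_ops
    return passes * 2 * n * elem_size  # one read + one write per element per pass


def arithmetic_intensity(n: int, num_ops: int, fused: bool, elem_size: int = 4) -> float:
    """Floating-point operations per byte of global-memory traffic."""
    flops = n * num_ops  # assumption: one flop per element per op
    return flops / bytes_moved(n, num_ops, fused, elem_size)


n, k = 1_000_000, 8
print(arithmetic_intensity(n, k, fused=False))  # 0.125 flop/byte
print(arithmetic_intensity(n, k, fused=True))   # 1.0 flop/byte
```

Under these assumptions, the flop count is unchanged while the traffic drops by a factor equal to the number of merged calls. Whether the merged kernel actually becomes compute bound depends on the ratio of arithmetic throughput to memory bandwidth of the target GPU.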

Resources:

Poster PDF: gtc-2017-poster