gemmlowp's public interface is defined in public/gemmlowp.h.

The primary public entry point is `GemmWithOutputPipeline`.

A usage example is given in doc/quantization_example.cc.
The high-level overview of how this specifies a low-precision matrix
multiplication is explained in low-precision.md. The rationale for a specific
quantization paradigm is given in quantization.md. That specific quantization
paradigm is implemented at two different stages of the computation: as
pre-processing on the operands and as post-processing on the result:

*   Pre-processing on the LHS, RHS operands, in the form of adding constant
    `lhs_offset`, `rhs_offset` values to them, is explained in low-precision.md.

*   Post-processing on the result, in the form of a flexible "output pipeline",
    is explained in output.md.

More details on this below as we discuss specific function parameters.
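
As a minimal sketch of where those operand offsets typically come from
(quantization.md derives this in detail), assuming the usual representation of
a real value as `scale * (quantized_value - zero_point)`:

```
// Sketch only: quantization.md represents a real value as
//   real_value = scale * (quantized_value - zero_point).
// Under that convention, the constants added to the operands are simply the
// negated zero points of the LHS and RHS quantization parameters.
const int lhs_zero_point = 128;  // placeholder value
const int rhs_zero_point = 100;  // placeholder value
const int lhs_offset = -lhs_zero_point;
const int rhs_offset = -rhs_zero_point;
```
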
The prototype is:

```
template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
          typename OutputPipelineType, typename GemmContextType>
void GemmWithOutputPipeline(GemmContextType* context,
                            const MatrixMap<const InputScalar, LhsOrder>& lhs,
                            const MatrixMap<const InputScalar, RhsOrder>& rhs,
                            MatrixMap<OutputScalar, ResultOrder>* result,
                            int lhs_offset, int rhs_offset,
                            const OutputPipelineType& output_pipeline);
```

A typical call looks like (from the usage example):

```
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
    &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
```
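
To make the surrounding setup concrete, here is a minimal, self-contained
sketch in the spirit of doc/quantization_example.cc. The matrix sizes, offsets
and output-stage parameters below are placeholder values, not recommendations:

```
#include <cstdint>
#include <tuple>
#include <vector>

#include "public/gemmlowp.h"  // include path relative to the gemmlowp root

void ExampleGemm() {
  const int rows = 2, depth = 3, cols = 4;
  std::vector<std::uint8_t> lhs_data(rows * depth, 0);
  std::vector<std::uint8_t> rhs_data(depth * cols, 0);
  std::vector<std::uint8_t> result_data(rows * cols, 0);

  // MatrixMap objects wrap existing buffers; they neither own nor allocate memory.
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>
      uint8_lhs_matrix(lhs_data.data(), rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>
      uint8_rhs_matrix(rhs_data.data(), depth, cols);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>
      uint8_result_matrix(result_data.data(), rows, cols);

  // Constants added to each LHS/RHS entry (typically the negated zero points).
  const int lhs_offset = -128;
  const int rhs_offset = -128;

  // A simple output pipeline: scale the int32 accumulators down, then
  // saturating-cast to uint8. The stage parameters here are placeholders.
  // (In older gemmlowp versions this stage is named
  // OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint.)
  gemmlowp::OutputStageQuantizeDownInt32ByFixedPoint quantize_down_stage;
  quantize_down_stage.result_fixedpoint_multiplier = 1 << 30;
  quantize_down_stage.result_shift = 8;
  quantize_down_stage.result_offset_after_shift = 128;
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
  const auto output_pipeline =
      std::make_tuple(quantize_down_stage, saturating_cast_stage);

  gemmlowp::GemmContext gemm_context;
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix, &uint8_result_matrix,
      lhs_offset, rhs_offset, output_pipeline);
}
```
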
Typically only the first 3 template parameters need to be specified, the rest
being automatically deduced from function parameters:

*   `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
    this must be `std::uint8_t`.
*   `OutputScalar`: The scalar type of the result. At the moment, this must be
    `std::uint8_t`.
*   `BitDepthParams`: Defines the bit format of the input and output matrices.
    At the moment, use `gemmlowp::DefaultL8R8BitDepthParams`.

The other template parameters, which typically do not need to be specified, are:

*   `LhsOrder`, `RhsOrder`, `ResultOrder`: the storage orders (row-major or
    column-major) of the LHS, RHS and result matrices. See the performance
    note on storage orders below.
*   `OutputPipelineType`: the actual `std::tuple` type of the output pipeline.
    See the explanation of the `output_pipeline` parameter below, and output.md.
*   `GemmContextType`: the type of the `context` parameter. At the moment, this
    must be `gemmlowp::GemmContext`.

The function parameters taken by `GemmWithOutputPipeline` are:

*   `context`: The `gemmlowp::GemmContext` object holding state and resources
    to be used for this gemmlowp call.
*   `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
    `MatrixMap` objects, mapping external buffers as matrices, not owning data.
*   `result`: pointer to the destination `MatrixMap` object, which must be
    already constructed, wrapping the external destination buffer with the
    wanted shape and storage layout; gemmlowp does not allocate the
    destination buffer.
*   `lhs_offset`, `rhs_offset`: constants added to each entry of the LHS and
    RHS matrices respectively, as explained in low-precision.md. All other
    quantization arithmetic is performed on the result by the `output_pipeline`.
*   `output_pipeline`: a `std::tuple` of output stages (see output.md)
    specifying the post-processing applied to the result.

gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
result matrices. However, not all combinations are equally optimized.

Because gemmlowp is primarily aimed at neural network inference workloads,
optimization focus is on this particular combination of storage orders:

*   `LhsOrder=RowMajor`
*   `RhsOrder=ColMajor`
*   `ResultOrder=ColMajor`

The rationale is that the LHS is typically the constant weights of a neural
network layer (e.g. the weights of a Convolutional layer implemented as a matrix
multiplication), while the RHS and result are neural network activations,
respectively the input and output activations of the layer.
Because the RHS and result are activations, we want them to share the same
storage order -- so that one layer's output activations can be readily used as
the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.

We also know from general considerations on matrix multiplication that it is
slightly more efficient to have the direction of accumulation (the "depth"
dimension) be the direction of contiguous storage in memory. That means that it
is always going to be slightly easier and more efficient to have
`LhsOrder=RowMajor` and `RhsOrder=ColMajor`.

Putting this together, we arrive at gemmlowp's focus on the above-described
combination of storage orders.
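
In code, this favored combination corresponds to declaring the operands with
the following `MatrixMap` types (the alias names below are illustrative, not
part of gemmlowp's API):

```
// Illustrative aliases only: the recommended storage orders for the
// neural-network-style workloads described above.
using LhsMap = gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>;
using RhsMap = gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>;
using ResultMap = gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>;
```
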
Using other storage orders will typically mean taking less efficient paths in
the packing and unpacking stages, see packing.md. The compute
kernel stage (kernel.md) is unaffected.

`GemmWithOutputPipelinePC` is a variant where `lhs_offset` and `rhs_offset` may
be vectors instead of scalars. They are then broadcast against the LHS and RHS
respectively.
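
Continuing the sketch given earlier, a call might look like the following,
assuming per-row LHS offsets and per-column RHS offsets passed as `VectorMap`
objects; the shapes and values here are assumptions for illustration, and
public/map.h and public/gemmlowp.h are the authoritative reference for the
accepted types:

```
// Sketch only: per-channel offsets passed as vectors. Here lhs_offsets is
// assumed to have one entry per LHS row and rhs_offsets one entry per RHS
// column; they are broadcast against the LHS and RHS respectively.
// Reuses uint8_*_matrix, gemm_context and output_pipeline from the sketch above.
std::vector<std::int32_t> lhs_offsets(uint8_lhs_matrix.rows(), -128);  // placeholders
std::vector<std::int32_t> rhs_offsets(uint8_rhs_matrix.cols(), -128);  // placeholders

gemmlowp::VectorMap<const std::int32_t, gemmlowp::VectorShape::Col>
    lhs_offset_vector(lhs_offsets.data(), uint8_lhs_matrix.rows());
gemmlowp::VectorMap<const std::int32_t, gemmlowp::VectorShape::Row>
    rhs_offset_vector(rhs_offsets.data(), uint8_rhs_matrix.cols());

gemmlowp::GemmWithOutputPipelinePC<std::uint8_t, std::uint8_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix, &uint8_result_matrix,
    lhs_offset_vector, rhs_offset_vector, output_pipeline);
```
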
This is useful for some flavors of neural network inference with "per-channel
quantization", whence the PC suffix. This has been useful in some settings where
a neural network trained in float arithmetic was subsequently quantized. On the
other hand, retraining neural networks for quantized inference tends to remove
the need for per-channel quantization. For that reason, the long-term usefulness
of this entry point is in question.

`Gemm` is gemmlowp's original, now legacy and deprecated, entry point. See the
section of low-precision.md on the legacy quantization paradigm. Avoid it in
new code.

As explained in the top-level README.md, the eight_bit_int_gemm directory is
entirely deprecated.