Standard convolution has dominated visual recognition for nearly a decade. But its two core design principles may not be the right inductive biases for vision: it is spatial-agnostic (the same kernel is applied at every position) and channel-specific (a distinct kernel per channel).
Our CVPR 2021 paper *Involution: Inverting the Inherence of Convolution for Visual Recognition* proposes a novel atomic operation that inverts these principles.
The Core Idea
Involution is spatial-specific and channel-agnostic:
- Spatial-specific: a different kernel is generated for each spatial location, capturing position-dependent context
- Channel-agnostic: the same kernel is shared across all channels, reducing redundancy
This simple inversion leads to a surprisingly capable and efficient operator. Notably, self-attention — the driving force behind Vision Transformers — can be seen as a special case of involution.
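For intuition, here is a minimal PyTorch sketch of the operator. This is our simplified rendition, not the repository code: the class name `Involution2d` and the hyperparameter defaults are illustrative, and the official implementation additionally inserts normalization in the kernel-generation branch and supports striding.

```python
import torch
import torch.nn as nn


class Involution2d(nn.Module):
    """Minimal involution layer (stride 1).

    Kernels are generated per spatial position (spatial-specific) and
    shared across the channels within each group (channel-agnostic).
    """

    def __init__(self, channels, kernel_size=7, groups=16, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        # Kernel-generation function: a two-layer 1x1-conv bottleneck
        # (the official code also adds BN + ReLU between the two convs)
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, kernel_size**2 * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # 1. Generate one K*K kernel per position and per group: (B, G, K*K, H, W)
        kernel = self.span(self.reduce(x)).view(b, self.g, self.k**2, h, w)
        # 2. Gather the K*K neighborhood of every position: (B, G, C//G, K*K, H, W)
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k**2, h, w)
        # 3. Multiply-accumulate over the window; the kernel is broadcast
        #    across the C//G channels of its group (channel-agnostic)
        out = (kernel.unsqueeze(2) * patches).sum(dim=3)
        return out.view(b, c, h, w)


# Example: output shape matches the input at stride 1
x = torch.randn(2, 64, 32, 32)
print(Involution2d(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```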
Results
Replacing convolution with involution in standard ResNet-50 backbones yields consistent gains across tasks:
| Task | Dataset | Improvement (absolute) |
|---|---|---|
| Classification | ImageNet | +1.6% top-1 accuracy |
| Detection | COCO | +2.5% box AP |
| Instance Seg. | COCO | +2.4% mask AP |
| Semantic Seg. | Cityscapes | +4.7% mIoU |
All of this comes at lower cost: the involution models need only 66%, 65%, 72%, and 57% of the convolutional baseline's FLOPs on these four benchmarks, respectively.
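A back-of-the-envelope count shows where the savings come from. This is our simplification of the paper's complexity analysis, assuming stride 1 and equal input/output channel count C, with kernel size K, G groups, and reduction ratio r in the kernel-generation bottleneck; per output position:

```latex
% Per-position multiply-adds, up to constant factors:
\underbrace{K^2 C^2}_{\text{standard } K \times K \text{ convolution}}
\quad \text{vs.} \quad
\underbrace{K^2 C}_{\text{involution MAC}}
\; + \;
\underbrace{\tfrac{C^2}{r} + \tfrac{K^2 G C}{r}}_{\text{kernel generation}}
```

Convolution's cost grows quadratically in C, while involution's dominant term is linear in C, so the gap widens in the wide, deep stages of the network.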
Why It Matters
Involution provides a unified framework connecting convolution, self-attention, and dynamic filtering, which clarifies the design space of neural network operations for vision. It shows that choosing the right inductive bias can deliver better accuracy and better efficiency at the same time.
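To make the self-attention connection concrete, here is a schematic correspondence in our own notation, simplified to a single head with the kernel window spanning all positions: attention computes a position-specific, channel-shared weight map, which is exactly an involution kernel that happens to be generated from query-key affinities.

```latex
Y_i = \sum_{j}
\underbrace{\mathrm{softmax}_j\!\left((X_i W^Q)(X_j W^K)^{\top}\right)}_{%
  \mathcal{H}_{i,j}:\ \text{one scalar per position pair, shared across channels}}
\, (X_j W^V)
```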
Code and pretrained models: github.com/d-li14/involution
arXiv: arxiv.org/abs/2103.06255