Standard convolution has dominated visual recognition for nearly a decade. But its two core design principles may not be the right inductive biases for vision: it is spatial-agnostic (the same kernel is applied at every position) and channel-specific (a distinct kernel per channel).
Our CVPR 2021 paper *Involution: Inverting the Inherence of Convolution for Visual Recognition* proposes a novel atomic operation that inverts these principles.
The Core Idea
Involution is spatial-specific and channel-agnostic:
- Spatial-specific: a different kernel is generated for each spatial location, capturing position-dependent context
- Channel-agnostic: the same kernel is shared across all channels, reducing redundancy
This simple inversion leads to a surprisingly capable and efficient operator. Notably, self-attention — the driving force behind Vision Transformers — can be seen as a special case of involution.
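For intuition, here is a minimal PyTorch sketch of the operator. This is our simplified rendition, not the repository code: the class name `Involution2d` and the hyperparameter defaults are illustrative, and the official implementation additionally inserts normalization in the kernel-generation branch and supports striding.

```python
import torch
import torch.nn as nn


class Involution2d(nn.Module):
    """Minimal involution layer (stride 1).

    Kernels are generated per spatial position (spatial-specific) and
    shared across the channels within each group (channel-agnostic).
    """

    def __init__(self, channels, kernel_size=7, groups=16, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        # Kernel-generation function: a two-layer 1x1-conv bottleneck
        # (the official code also adds BN + ReLU between the two convs)
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, kernel_size**2 * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # 1. Generate one K*K kernel per position and per group: (B, G, K*K, H, W)
        kernel = self.span(self.reduce(x)).view(b, self.g, self.k**2, h, w)
        # 2. Gather the K*K neighborhood of every position: (B, G, C//G, K*K, H, W)
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k**2, h, w)
        # 3. Multiply-accumulate over the window; the kernel is broadcast
        #    across the C//G channels of its group (channel-agnostic)
        out = (kernel.unsqueeze(2) * patches).sum(dim=3)
        return out.view(b, c, h, w)


# Example: output shape matches the input at stride 1
x = torch.randn(2, 64, 32, 32)
print(Involution2d(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```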
Results
Replacing convolution with involution in standard ResNet-50 backbones yields consistent gains across tasks:
| Task | Dataset | Improvement (absolute) |
|---|---|---|
| Classification | ImageNet | +1.6% top-1 accuracy |
| Detection | COCO | +2.5% box AP |
| Instance Seg. | COCO | +2.4% mask AP |
| Semantic Seg. | Cityscapes | +4.7% mIoU |
All of this comes at lower cost: the involution models need only 66%, 65%, 72%, and 57% of the convolutional baseline's FLOPs on these four benchmarks, respectively.
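A back-of-the-envelope count shows where the savings come from. This is our simplification of the paper's complexity analysis, assuming stride 1 and equal input/output channel count C, with kernel size K, G groups, and reduction ratio r in the kernel-generation bottleneck; per output position:

```latex
% Per-position multiply-adds, up to constant factors:
\underbrace{K^2 C^2}_{\text{standard } K \times K \text{ convolution}}
\quad \text{vs.} \quad
\underbrace{K^2 C}_{\text{involution MAC}}
\; + \;
\underbrace{\tfrac{C^2}{r} + \tfrac{K^2 G C}{r}}_{\text{kernel generation}}
```

Convolution's cost grows quadratically in C, while involution's dominant term is linear in C, so the gap widens in the wide, deep stages of the network.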
Why It Matters
Involution provides a unified framework connecting convolution, self-attention, and dynamic filtering, which clarifies the design space of neural network operations for vision. It shows that choosing the right inductive bias can deliver better accuracy and better efficiency at the same time.
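To make the self-attention connection concrete, here is a schematic correspondence in our own notation, simplified to a single head with the kernel window spanning all positions: attention computes a position-specific, channel-shared weight map, which is exactly an involution kernel that happens to be generated from query-key affinities.

```latex
Y_i = \sum_{j}
\underbrace{\mathrm{softmax}_j\!\left((X_i W^Q)(X_j W^K)^{\top}\right)}_{%
  \mathcal{H}_{i,j}:\ \text{one scalar per position pair, shared across channels}}
\, (X_j W^V)
```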
Code and pretrained models: github.com/d-li14/involution
arXiv: arxiv.org/abs/2103.06255