<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://qi-she.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://qi-she.net/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-04-26T13:57:42+00:00</updated><id>https://qi-she.net/feed.xml</id><title type="html">Qi She (佘琪) | Research Scientist at ByteDance</title><subtitle>Qi She (佘琪) is a Research Scientist at ByteDance leading the Applied Algorithm &amp; Foundation team. Expert in Multimodal Large Language Models, Agentic AI, Computer Vision, and Machine Learning. PhD from City University of Hong Kong, former Intel Labs. Winner of 2025 CityU Young Alumni Award and 2020 Intel Gordy Award.
</subtitle><author><name>Qi She</name><email>sheqi1991@gmail.com</email></author><entry><title type="html">Scaling MLLM-Powered AI Agents for Real-World Industrial Applications</title><link href="https://qi-she.net/research/ai%20agents/2024/06/01/mllm-agent-at-scale.html" rel="alternate" type="text/html" title="Scaling MLLM-Powered AI Agents for Real-World Industrial Applications" /><published>2024-06-01T01:00:00+00:00</published><updated>2024-06-01T01:00:00+00:00</updated><id>https://qi-she.net/research/ai%20agents/2024/06/01/mllm-agent-at-scale</id><content type="html" xml:base="https://qi-she.net/research/ai%20agents/2024/06/01/mllm-agent-at-scale.html"><![CDATA[<p>Over the past two years, my team at ByteDance has been driving the large-scale industrialization of <strong>Multimodal Large Language Model (MLLM) agents</strong> for Business Integrity — the systems that ensure platform safety and content quality at hundreds of millions of requests per day.</p>

<p>This post shares some of the key architectural principles and lessons we’ve learned.</p>

<h2 id="from-models-to-agents">From Models to Agents</h2>

<p>A standalone MLLM — however capable — is not enough for real-world deployment. The gap between a model that performs well on benchmarks and a system that reliably handles production traffic at scale is enormous.</p>

<p>The shift toward <strong>agentic architectures</strong> addresses this: instead of a single model call, an agent orchestrates multiple steps — perception, reasoning, tool use, verification — to solve complex tasks robustly.</p>

<h2 id="core-challenges-at-scale">Core Challenges at Scale</h2>

<p><strong>1. Reliability under distribution shift</strong><br />
Real-world content is adversarial and ever-evolving. An agent that relies on a single monolithic model will degrade unpredictably. We design agents with explicit verification steps: intermediate outputs are checked before influencing downstream decisions.</p>
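<p>In rough pseudocode, a verification gate looks something like the sketch below (the function and type names are illustrative placeholders, not our production interfaces):</p>

<pre><code class="language-python">from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    output: dict        # structured output of one agent step
    confidence: float   # verifier-estimated confidence in that output

Step = Callable[[dict], StepResult]

def run_verified_step(step: Step, verify: Callable[[StepResult], bool],
                      fallback: Step, state: dict) -> StepResult:
    """Run one agent step and check its intermediate output before it can
    influence downstream decisions; on failure, escalate to a slower but
    more conservative fallback path."""
    result = step(state)
    if verify(result):
        return result
    return fallback(state)
</code></pre>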

<p><strong>2. Latency vs. reasoning depth</strong><br />
Deeper reasoning chains produce more accurate decisions but increase latency. We use adaptive depth: simple cases are resolved in one step, complex or ambiguous cases trigger multi-step reasoning. This keeps P99 latency acceptable while maintaining accuracy on hard cases.</p>
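<p>A minimal sketch of the routing logic follows; the interfaces, threshold, and step limit are illustrative rather than our production values:</p>

<pre><code class="language-python">def decide_with_adaptive_depth(request, fast_model, deep_agent,
                               confidence_threshold=0.9):
    """Resolve easy cases with a single model call and escalate ambiguous
    ones to multi-step reasoning. Threshold and interfaces are illustrative."""
    quick = fast_model(request)              # one cheap forward pass
    if quick.confidence >= confidence_threshold:
        return quick.label                   # easy case: answer immediately
    # hard or ambiguous case: accept extra latency for deeper reasoning
    return deep_agent.reason(request, max_steps=5).label
</code></pre>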

<p><strong>3. Fine-grained multimodal understanding</strong><br />
Text-only LLMs cannot handle the rich visual content that characterizes most real-world policy violations. Our MLLMs are specifically fine-tuned on domain-relevant visual-language pairs, with specialized heads for tasks like OCR, object grounding, and visual relation understanding.</p>

<p><strong>4. Alignment with evolving policies</strong><br />
Platform policies change frequently. We build agents with explicit policy-grounding: reasoning is tied to retrievable policy documents, so updating a policy doesn’t require full model retraining.</p>
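<p>Conceptually, the decision step retrieves the relevant clauses and conditions the model on them. The sketch below uses hypothetical retriever and model interfaces to illustrate the shape of this:</p>

<pre><code class="language-python">def policy_grounded_decision(case, retriever, mllm, top_k=3):
    """Ground the agent's reasoning in retrieved policy text, so a policy
    update only means re-indexing documents rather than retraining.
    `retriever` and `mllm` are placeholders for the retrieval index and
    model endpoint actually in use."""
    clauses = retriever.search(case.description, top_k=top_k)
    prompt = (
        "Decide whether the content violates any of the policy clauses "
        "below, and cite the clause your decision relies on.\n\n"
        + "\n".join(clause.text for clause in clauses)
    )
    return mllm.generate(prompt, images=case.images)
</code></pre>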

<h2 id="what-weve-published">What We’ve Published</h2>

<p>Several papers from our team formalize parts of this work:</p>

<ul>
  <li><a href="https://arxiv.org/abs/2406.18193"><strong>MammothModa</strong></a>: A unified MLLM framework for multi-modal understanding and generation</li>
  <li><a href="https://arxiv.org/abs/2504.01407"><strong>TimeSearch</strong></a>: Hierarchical video search with spotlight and reflection for long video understanding</li>
  <li><a href="https://arxiv.org/abs/2511.17106"><strong>ChainV</strong></a>: Atomic visual hints that make multimodal reasoning shorter and better</li>
</ul>

<h2 id="looking-ahead">Looking Ahead</h2>

<p>The convergence of stronger foundation models, better alignment techniques (RLHF, GRPO), and scalable agentic architectures is opening up entirely new categories of industrial AI applications. We’re just at the beginning of understanding how to build <strong>trustworthy, adaptive, and efficient</strong> agentic systems at scale.</p>

<p>If you’re working on similar challenges, feel free to <a href="mailto:sheqi.roger@bytedance.com">reach out</a> — we’re actively <a href="/#hiring">hiring researchers and engineers</a> in this space.</p>]]></content><author><name>Qi She</name><email>sheqi1991@gmail.com</email></author><category term="Research" /><category term="AI Agents" /><summary type="html"><![CDATA[Key lessons from deploying Multimodal Large Language Model agents at scale in ByteDance's Business Integrity ecosystem — covering architecture choices, reliability challenges, and what it takes to build production-grade agentic systems.]]></summary></entry><entry><title type="html">Involution: Rethinking the Core of Convolutional Neural Networks</title><link href="https://qi-she.net/research/computer%20vision/2021/06/20/involution.html" rel="alternate" type="text/html" title="Involution: Rethinking the Core of Convolutional Neural Networks" /><published>2021-06-20T01:00:00+00:00</published><updated>2021-06-20T01:00:00+00:00</updated><id>https://qi-she.net/research/computer%20vision/2021/06/20/involution</id><content type="html" xml:base="https://qi-she.net/research/computer%20vision/2021/06/20/involution.html"><![CDATA[<p>Standard convolution has dominated visual recognition for nearly a decade. But its two core design principles — <strong>spatial-agnostic</strong> (the same kernel across all positions) and <strong>channel-specific</strong> (different kernels per channel) — may not be the right inductive biases for vision.</p>

<p>Our CVPR 2021 paper <a href="https://openaccess.thecvf.com/content/CVPR2021/html/Li_Involution_Inverting_the_Inherence_of_Convolution_for_Visual_Recognition_CVPR_2021_paper.html"><strong>Involution: Inverting the Inherence of Convolution for Visual Recognition</strong></a> proposes a novel atomic operation that <strong>inverts</strong> these principles.</p>

<h2 id="the-core-idea">The Core Idea</h2>

<p>Involution is <strong>spatial-specific and channel-agnostic</strong>:</p>

<ul>
  <li><strong>Spatial-specific</strong>: a different kernel is generated for each spatial location, capturing position-dependent context</li>
  <li><strong>Channel-agnostic</strong>: the same kernel is shared across all channels, reducing redundancy</li>
</ul>

<p>This simple inversion leads to a surprisingly capable and efficient operator. Notably, self-attention — the driving force behind Vision Transformers — can be seen as a special case of involution.</p>
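<p>For readers who want to see the operator concretely, here is a didactic PyTorch sketch of a stride-1 involution layer. It is simplified relative to the official implementation, and the hyperparameters are illustrative:</p>

<pre><code class="language-python">import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Minimal stride-1 involution: a kernel is generated for every spatial
    position and shared across channel groups. Didactic sketch following
    the CVPR 2021 paper, not the official implementation."""
    def __init__(self, channels, kernel_size=7, groups=16, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        # bottleneck that generates one K*K kernel per group at every pixel
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction,
                              kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # (B, G, K*K, H, W): a distinct kernel per location, shared within each group
        kernels = self.span(self.reduce(x)).view(b, self.g, self.k * self.k, h, w)
        # unfold local neighbourhoods: (B, G, C//G, K*K, H, W)
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k * self.k, h, w)
        # multiply-accumulate over the K*K kernel positions
        out = (kernels.unsqueeze(2) * patches).sum(dim=3)
        return out.view(b, c, h, w)
</code></pre>

<p>In the paper, an involution of this form (with a 7×7 kernel) replaces the 3×3 convolution in ResNet bottleneck blocks, yielding the RedNet backbones behind the results below.</p>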

<h2 id="results">Results</h2>

<p>Replacing convolution with involution in standard ResNet-50 backbones:</p>

<table>
  <thead>
    <tr>
      <th>Task</th>
      <th>Dataset</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Classification</td>
      <td>ImageNet</td>
      <td>+1.6% top-1 accuracy</td>
    </tr>
    <tr>
      <td>Detection</td>
      <td>COCO</td>
      <td>+2.5% box AP</td>
    </tr>
    <tr>
      <td>Segmentation</td>
      <td>COCO</td>
      <td>+2.4% mask AP</td>
    </tr>
    <tr>
      <td>Semantic Seg.</td>
      <td>Cityscapes</td>
      <td>+4.7% mIoU</td>
    </tr>
  </tbody>
</table>

<p>These gains come with <strong>lower</strong> compute: the involution models use 66%, 65%, 72%, and 57% of the convolutional baseline’s FLOPs, respectively.</p>

<h2 id="why-it-matters">Why It Matters</h2>

<p>Involution provides a unified framework connecting convolution, self-attention, and dynamic filtering — clarifying the design space of neural network operations for vision. It shows that the right inductive bias can yield both better accuracy and better efficiency simultaneously.</p>

<p>Code and pretrained models: <a href="https://github.com/d-li14/involution">github.com/d-li14/involution</a><br />
arXiv: <a href="https://arxiv.org/abs/2103.06255">arxiv.org/abs/2103.06255</a></p>]]></content><author><name>Qi She</name><email>sheqi1991@gmail.com</email></author><category term="Research" /><category term="Computer Vision" /><summary type="html"><![CDATA[Involution introduces a new atomic operation for deep neural networks that inverts the design principles of standard convolution — spatial-specific and channel-agnostic — achieving better performance with lower computational cost.]]></summary></entry><entry><title type="html">News</title><link href="https://qi-she.net/2020/12/14/news.html" rel="alternate" type="text/html" title="News" /><published>2020-12-14T15:33:36+00:00</published><updated>2020-12-14T15:33:36+00:00</updated><id>https://qi-she.net/2020/12/14/news</id><content type="html" xml:base="https://qi-she.net/2020/12/14/news.html"><![CDATA[<p>📄 [2020-03] We have 3 papers accepted to CVPR 2021; congratulations to our collaborators. Code will be released soon.</p>

<p>📢 [2020-12] Our workshop proposal “Continual Learning in Computer Vision” has been accepted by <strong>CVPR 2021</strong>, selected from 109 proposals.</p>

<p>📢 [2020-12] Reviewer service: <strong>ICML 2021</strong>; <strong>NeurIPS 2020</strong> (Top 10% high-scoring reviewer).</p>

<p>📄 [2020-11] Our paper <a href="https://arxiv.org/abs/1906.01529">“Generative Adversarial Networks in Computer Vision: A Survey and Taxonomy”</a> has been accepted by <a href="https://www.letpub.com.cn/index.php?page=journalapp&amp;view=detail&amp;journalid=19">ACM Computing Surveys</a>. The journal is ranked 3/221 in Computer Science Theory &amp; Methods (Impact Factor: 7.99).</p>

<p>📢 [2020-08] I joined the <strong>ByteDance AI Lab</strong> as a research scientist this month.</p>

<p>📢 [2020-06] We have successfully organized the 1st <a href="https://sites.google.com/view/clvision2020/overview?authuser=0">“Continual Learning in Computer Vision”</a> workshop at <strong>CVPR 2020</strong>.</p>

<p>📄 [2020-06] Our paper “IROS 2019 Lifelong Robotic Vision: Object Recognition Challenge [Competitions]” has been published in <strong>IEEE Robotics &amp; Automation Magazine</strong>.</p>

<p>📄 [2020-05] Our paper “Synthetic-Neuroscore: Using a neuro-AI interface for evaluating generative adversarial networks” has been accepted by <strong>Neurocomputing</strong>.</p>

<p>📄 [2020-01] 2 papers have been accepted by <strong>ICRA 2020</strong>: “OpenLORIS-Object: A Robotic Vision Dataset and Benchmark for Lifelong Deep Learning”, and “Are we ready for service robots? The OpenLORIS-scene datasets for lifelong SLAM”.</p>]]></content><author><name>Qi She</name><email>sheqi1991@gmail.com</email></author><summary type="html"><![CDATA[📄 [2020-03] We have 3 papers accepted to CVPR 2021; congratulations to our collaborators. Code will be released soon.]]></summary></entry></feed>