<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://qi-she.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://qi-she.net/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-04-26T13:57:42+00:00</updated><id>https://qi-she.net/feed.xml</id><title type="html">Qi She (佘琪) | Research Scientist at ByteDance</title><subtitle>Qi She (佘琪) is a Research Scientist at ByteDance leading the Applied Algorithm &amp; Foundation team. Expert in Multimodal Large Language Models, Agentic AI, Computer Vision, and Machine Learning. PhD from City University of Hong Kong, former Intel Labs. Winner of 2025 CityU Young Alumni Award and 2020 Intel Gordy Award.
</subtitle><author><name>Qi She</name><email>sheqi1991@gmail.com</email></author><entry><title type="html">Scaling MLLM-Powered AI Agents for Real-World Industrial Applications</title><link href="https://qi-she.net/research/ai%20agents/2024/06/01/mllm-agent-at-scale.html" rel="alternate" type="text/html" title="Scaling MLLM-Powered AI Agents for Real-World Industrial Applications" /><published>2024-06-01T01:00:00+00:00</published><updated>2024-06-01T01:00:00+00:00</updated><id>https://qi-she.net/research/ai%20agents/2024/06/01/mllm-agent-at-scale</id><content type="html" xml:base="https://qi-she.net/research/ai%20agents/2024/06/01/mllm-agent-at-scale.html"><![CDATA[<p>Over the past two years, my team at ByteDance has been driving the large-scale industrialization of <strong>Multimodal Large Language Model (MLLM) agents</strong> for Business Integrity — the systems that ensure platform safety and content quality at hundreds of millions of requests per day.</p>

<p>This post shares some of the key architectural principles and lessons we’ve learned.</p>

<h2 id="from-models-to-agents">From Models to Agents</h2>

<p>A standalone MLLM — however capable — is not enough for real-world deployment. The gap between a model that performs well on benchmarks and a system that reliably handles production traffic at scale is enormous.</p>

<p>The shift toward <strong>agentic architectures</strong> addresses this: instead of a single model call, an agent orchestrates multiple steps — perception, reasoning, tool use, verification — to solve complex tasks robustly.</p>

<h2 id="core-challenges-at-scale">Core Challenges at Scale</h2>

<p><strong>1. Reliability under distribution shift</strong><br />
Real-world content is adversarial and ever-evolving. An agent that relies on a single monolithic model will degrade unpredictably. We design agents with explicit verification steps: intermediate outputs are checked before influencing downstream decisions.</p>
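<p>In rough pseudocode, a verification gate looks something like the sketch below (the function and type names are illustrative placeholders, not our production interfaces):</p>

<pre><code class="language-python">from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    output: dict        # structured output of one agent step
    confidence: float   # verifier-estimated confidence in that output

Step = Callable[[dict], StepResult]

def run_verified_step(step: Step, verify: Callable[[StepResult], bool],
                      fallback: Step, state: dict) -> StepResult:
    """Run one agent step and check its intermediate output before it can
    influence downstream decisions; on failure, escalate to a slower but
    more conservative fallback path."""
    result = step(state)
    if verify(result):
        return result
    return fallback(state)
</code></pre>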

<p><strong>2. Latency vs. reasoning depth</strong><br />
Deeper reasoning chains produce more accurate decisions but increase latency. We use adaptive depth: simple cases are resolved in one step, complex or ambiguous cases trigger multi-step reasoning. This keeps P99 latency acceptable while maintaining accuracy on hard cases.</p>
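<p>A minimal sketch of the routing logic follows; the interfaces, threshold, and step limit are illustrative rather than our production values:</p>

<pre><code class="language-python">def decide_with_adaptive_depth(request, fast_model, deep_agent,
                               confidence_threshold=0.9):
    """Resolve easy cases with a single model call and escalate ambiguous
    ones to multi-step reasoning. Threshold and interfaces are illustrative."""
    quick = fast_model(request)              # one cheap forward pass
    if quick.confidence >= confidence_threshold:
        return quick.label                   # easy case: answer immediately
    # hard or ambiguous case: accept extra latency for deeper reasoning
    return deep_agent.reason(request, max_steps=5).label
</code></pre>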

<p><strong>3. Fine-grained multimodal understanding</strong><br />
Text-only LLMs cannot handle the rich visual content that characterizes most real-world policy violations. Our MLLMs are specifically fine-tuned on domain-relevant visual-language pairs, with specialized heads for tasks like OCR, object grounding, and visual relation understanding.</p>

<p><strong>4. Alignment with evolving policies</strong><br />
Platform policies change frequently. We build agents with explicit policy-grounding: reasoning is tied to retrievable policy documents, so updating a policy doesn’t require full model retraining.</p>
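<p>Conceptually, the decision step retrieves the relevant clauses and conditions the model on them. The sketch below uses hypothetical retriever and model interfaces to illustrate the shape of this:</p>

<pre><code class="language-python">def policy_grounded_decision(case, retriever, mllm, top_k=3):
    """Ground the agent's reasoning in retrieved policy text, so a policy
    update only means re-indexing documents rather than retraining.
    `retriever` and `mllm` are placeholders for the retrieval index and
    model endpoint actually in use."""
    clauses = retriever.search(case.description, top_k=top_k)
    prompt = (
        "Decide whether the content violates any of the policy clauses "
        "below, and cite the clause your decision relies on.\n\n"
        + "\n".join(clause.text for clause in clauses)
    )
    return mllm.generate(prompt, images=case.images)
</code></pre>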

<h2 id="what-weve-published">What We’ve Published</h2>

<p>Several papers from our team formalize parts of this work:</p>

<ul>
  <li><a href="https://arxiv.org/abs/2406.18193"><strong>MammothModa</strong></a>: A unified MLLM framework for multi-modal understanding and generation</li>
  <li><a href="https://arxiv.org/abs/2504.01407"><strong>TimeSearch</strong></a>: Hierarchical video search with spotlight and reflection for long video understanding</li>
  <li><a href="https://arxiv.org/abs/2511.17106"><strong>ChainV</strong></a>: Atomic visual hints that make multimodal reasoning shorter and better</li>
</ul>

<h2 id="looking-ahead">Looking Ahead</h2>

<p>The convergence of stronger foundation models, better alignment techniques (RLHF, GRPO), and scalable agentic architectures is opening up entirely new categories of industrial AI applications. We’re just at the beginning of understanding how to build <strong>trustworthy, adaptive, and efficient</strong> agentic systems at scale.</p>

<p>If you’re working on similar challenges, feel free to <a href="mailto:sheqi.roger@bytedance.com">reach out</a> — we’re actively <a href="/#hiring">hiring researchers and engineers</a> in this space.</p>]]></content><author><name>Qi She</name><email>sheqi1991@gmail.com</email></author><category term="Research" /><category term="AI Agents" /><summary type="html"><![CDATA[Key lessons from deploying Multimodal Large Language Model agents at scale in ByteDance's Business Integrity ecosystem — covering architecture choices, reliability challenges, and what it takes to build production-grade agentic systems.]]></summary></entry><entry><title type="html">Involution: Rethinking the Core of Convolutional Neural Networks</title><link href="https://qi-she.net/research/computer%20vision/2021/06/20/involution.html" rel="alternate" type="text/html" title="Involution: Rethinking the Core of Convolutional Neural Networks" /><published>2021-06-20T01:00:00+00:00</published><updated>2021-06-20T01:00:00+00:00</updated><id>https://qi-she.net/research/computer%20vision/2021/06/20/involution</id><content type="html" xml:base="https://qi-she.net/research/computer%20vision/2021/06/20/involution.html"><![CDATA[<p>Standard convolution has dominated visual recognition for nearly a decade. But its two core design principles — <strong>spatial-agnostic</strong> (the same kernel across all positions) and <strong>channel-specific</strong> (different kernels per channel) — may not be the right inductive biases for vision.</p>

<p>Our CVPR 2021 paper <a href="https://openaccess.thecvf.com/content/CVPR2021/html/Li_Involution_Inverting_the_Inherence_of_Convolution_for_Visual_Recognition_CVPR_2021_paper.html"><strong>Involution: Inverting the Inherence of Convolution for Visual Recognition</strong></a> proposes a novel atomic operation that <strong>inverts</strong> these principles.</p>

<h2 id="the-core-idea">The Core Idea</h2>

<p>Involution is <strong>spatial-specific and channel-agnostic</strong>:</p>

<ul>
  <li><strong>Spatial-specific</strong>: a different kernel is generated for each spatial location, capturing position-dependent context</li>
  <li><strong>Channel-agnostic</strong>: the same kernel is shared across all channels, reducing redundancy</li>
</ul>

<p>This simple inversion leads to a surprisingly capable and efficient operator. Notably, self-attention — the driving force behind Vision Transformers — can be seen as a special case of involution.</p>
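<p>For readers who want to see the operator concretely, here is a didactic PyTorch sketch of a stride-1 involution layer. It is simplified relative to the official implementation, and the hyperparameters are illustrative:</p>

<pre><code class="language-python">import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Minimal stride-1 involution: a kernel is generated for every spatial
    position and shared across channel groups. Didactic sketch following
    the CVPR 2021 paper, not the official implementation."""
    def __init__(self, channels, kernel_size=7, groups=16, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        # bottleneck that generates one K*K kernel per group at every pixel
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction,
                              kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # (B, G, K*K, H, W): a distinct kernel per location, shared within each group
        kernels = self.span(self.reduce(x)).view(b, self.g, self.k * self.k, h, w)
        # unfold local neighbourhoods: (B, G, C//G, K*K, H, W)
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k * self.k, h, w)
        # multiply-accumulate over the K*K kernel positions
        out = (kernels.unsqueeze(2) * patches).sum(dim=3)
        return out.view(b, c, h, w)
</code></pre>

<p>In the paper, an involution of this form (with a 7×7 kernel) replaces the 3×3 convolution in ResNet bottleneck blocks, yielding the RedNet backbones behind the results below.</p>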

<h2 id="results">Results</h2>

<p>Replacing convolution with involution in standard ResNet-50 backbones:</p>

<table>
  <thead>
    <tr>
      <th>Task</th>
      <th>Dataset</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Classification</td>
      <td>ImageNet</td>
      <td>+1.6% top-1 accuracy</td>
    </tr>
    <tr>
      <td>Detection</td>
      <td>COCO</td>
      <td>+2.5% box AP</td>
    </tr>
    <tr>
      <td>Segmentation</td>
      <td>COCO</td>
      <td>+2.4% mask AP</td>
    </tr>
    <tr>
      <td>Semantic Seg.</td>
      <td>Cityscapes</td>
      <td>+4.7% mIoU</td>
    </tr>
  </tbody>
</table>

<p>These gains come with <strong>lower</strong> compute: the involution models use 66%, 65%, 72%, and 57% of the convolutional baseline’s FLOPs, respectively.</p>

<h2 id="why-it-matters">Why It Matters</h2>

<p>Involution provides a unified framework connecting convolution, self-attention, and dynamic filtering — clarifying the design space of neural network operations for vision. It shows that the right inductive bias can yield both better accuracy and better efficiency simultaneously.</p>

<p>Code and pretrained models: <a href="https://github.com/d-li14/involution">github.com/d-li14/involution</a><br />
arXiv: <a href="https://arxiv.org/abs/2103.06255">arxiv.org/abs/2103.06255</a></p>]]></content><author><name>Qi She</name><email>sheqi1991@gmail.com</email></author><category term="Research" /><category term="Computer Vision" /><summary type="html"><![CDATA[Involution introduces a new atomic operation for deep neural networks that inverts the design principles of standard convolution — spatial-specific and channel-agnostic — achieving better performance with lower computational cost.]]></summary></entry><entry><title type="html">News</title><link href="https://qi-she.net/2020/12/14/news.html" rel="alternate" type="text/html" title="News" /><published>2020-12-14T15:33:36+00:00</published><updated>2020-12-14T15:33:36+00:00</updated><id>https://qi-she.net/2020/12/14/news</id><content type="html" xml:base="https://qi-she.net/2020/12/14/news.html"><![CDATA[<p>📄 [2020-03] We have 3 papers accepted to CVPR 2021; congratulations to our collaborators. Code will be released soon.</p>

<p>📢 [2020-12] Our workshop proposal “Continual Learning in Computer Vision” has been accepted by <strong>CVPR 2021</strong>, selected from 109 proposals.</p>

<p>📢 [2020-12] Reviewer service: <strong>ICML 2021</strong>; <strong>NeurIPS 2020</strong> (Top 10% high-scoring reviewer).</p>

<p>📄 [2020-11] Our paper <a href="https://arxiv.org/abs/1906.01529">“Generative Adversarial Networks in Computer Vision: A Survey and Taxonomy”</a> has been accepted by <a href="https://www.letpub.com.cn/index.php?page=journalapp&amp;view=detail&amp;journalid=19">ACM Computing Surveys</a>. The journal is ranked 3/221 in Computer Science Theory &amp; Methods (Impact Factor: 7.99).</p>

<p>📢 [2020-08] I joined the <strong>ByteDance AI Lab</strong> as a research scientist this month.</p>

<p>📢 [2020-06] We have successfully organized the 1st <a href="https://sites.google.com/view/clvision2020/overview?authuser=0">“Continual Learning in Computer Vision”</a> workshop at <strong>CVPR 2020</strong>.</p>

<p>📄 [2020-06] Our paper “IROS 2019 Lifelong Robotic Vision: Object Recognition Challenge [Competitions]” has been published in <strong>IEEE Robotics &amp; Automation Magazine</strong>.</p>

<p>📄 [2020-05] Our paper “Synthetic-Neuroscore: Using a neuro-AI interface for evaluating generative adversarial networks” has been accepted by <strong>Neurocomputing</strong>.</p>

<p>📄 [2020-01] 2 papers have been accepted by <strong>ICRA 2020</strong>: “OpenLORIS-Object: A Robotic Vision Dataset and Benchmark for Lifelong Deep Learning”, and “Are we ready for service robots? The OpenLORIS-scene datasets for lifelong SLAM”.</p>]]></content><author><name>Qi She</name><email>sheqi1991@gmail.com</email></author><summary type="html"><![CDATA[📄 [2020-03] We have 3 papers accepted to CVPR 2021; congratulations to our collaborators. Code will be released soon.]]></summary></entry></feed>