Publications | Soolab Sibei Yang

2025

MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow

Hanzhuo Huang*, Yuan Liu*, Ge Zheng, Jiepeng Wang, Zhiyang Dou, Sibei Yang†

Accepted by ICLR, 2025

arXiv HTML Code

2024

TPAMI2024

A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-oriented Perspective

Chaoqi Chen*, Yushuang Wu*, Qiyuan Dai*, Hong-Yu Zhou*, Mutian Xu, Sibei Yang†, Xiaoguang Han†, Yizhou Yu†

Accepted by TPAMI, 2024

arXiv
Part2Object: Hierarchical Unsupervised 3D Instance Segmentation

Cheng Shi*, Yulin Zhang*, Bin Yang, Jiajin Tang, Yuexin Ma, Sibei Yang†

Accepted by ECCV, 2024

arXiv Code
Plain-Det: A Plain Multi-Dataset Object Detector

Cheng Shi*, Yuchen Zhu*, and Sibei Yang†

Accepted by ECCV, 2024

arXiv Code
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Zhenxiang Lin, Xidong Peng, Peishan Cong, Ge Zheng, Yujing Sun, Yuenan Hou, Xinge Zhu, Sibei Yang, Yuexin Ma

Accepted by ECCV, 2024

arXiv Code
CVPR2024

Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation

Qiyuan Dai and Sibei Yang†

Accepted by CVPR, 2024

arXiv
The Devil is in the Object Boundary: Towards Annotation-free Instance Segmentation Using Foundation Models

Cheng Shi and Sibei Yang†

Accepted by ICLR, 2024

arXiv Code
OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers

Han Liang, Jiacheng Bao, Ruichi Zhang, Sihan Ren, Yuecheng Xu, Sibei Yang, Xin Chen, Jingyi Yu, Lan Xu

Accepted by CVPR, 2024

arXiv Code Video
RealDex: Towards Human-like Grasping for Robotic Dexterous Hand

Yumeng Liu*, Yaxun Yang*, Youzhuo Wang*, Xiaofei Wu, Jiamin Wang, Yichen Yao, Sören Schwertfeger, Sibei Yang, Wenping Wang, Jingyi Yu, Xuming He, Yuexin Ma

Accepted by IJCAI, 2024

arXiv

2023

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Ge Zheng*, Bin Yang*, Jiajin Tang*, Hong-Yu Zhou, Sibei Yang†

Accepted by NeurIPS, 2023

arXiv Code
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator

Hanzhuo Huang*, Yufan Feng*, Cheng Shi, Lan Xu, Jingyi Yu, Sibei Yang†

Accepted by NeurIPS, 2023

arXiv Code
ICCV2023

LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models

Cheng Shi and Sibei Yang†

Accepted by ICCV, 2023

arXiv HTML PDF
ICCV2023

EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment

Cheng Shi and Sibei Yang†

Accepted by ICCV, 2023

arXiv HTML PDF
ICCV2023

CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection

Jiajin Tang*, Ge Zheng*, Jingyi Yu, Sibei Yang†

Accepted by ICCV, 2023

arXiv HTML PDF
ICCV2023

Temporal Collection and Distribution for Referring Video Object Segmentation

Jiajin Tang, Ge Zheng, and Sibei Yang†

Accepted by ICCV, 2023

HTML PDF
ICCV2023

Grounded Image Text Matching with Mismatched Relation Reasoning

Yu Wu*, Yana Wei*, Haozhe Wang, Yongfei Liu, Sibei Yang, Xuming He†

Accepted by ICCV, 2023

arXiv
CVPR2023

Contrastive Grouping with Transformer for Referring Image Segmentation

Jiajin Tang, Ge Zheng, Cheng Shi, Sibei Yang†

Accepted by CVPR, 2023

PDF Code
AAAI2023

CCQ: Cross-Class Query Network for Partially Labeled Organ Segmentation

Xuyang Liu, Bingbing Wen, and Sibei Yang†

Accepted by AAAI, 2023

HTML Code
SIGGRAPH2023

DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance

Longwen Zhang*, Qiwei Qiu*, Hongyang Lin*, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang†, Lan Xu†, Jingyi Yu†

Accepted by SIGGRAPH, 2023

arXiv HTML PDF Video
TPAMI2023

A Unified Visual Information Preservation Framework for Self-supervised Pre-training in Medical Image Analysis

Hong-Yu Zhou*, Chixiang Lu*, Chaoqi Chen, Sibei Yang, Yizhou Yu†

Accepted by TPAMI, 2023

arXiv
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Zhenxiang Lin, Xidong Peng, Peishan Cong, Yuenan Hou, Xinge Zhu, Sibei Yang, Yuexin Ma

arXiv preprint arXiv:2304.05645, 2023

arXiv

2022

ECCV2022

Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embodied Reference Understanding

Cheng Shi and Sibei Yang†

Accepted by ECCV, 2022

PDF Code

2021

TMM2021

Structured attention network for referring image segmentation

Liang Lin, Pengxiang Yan, Xiaoqian Xu, Sibei Yang, Kun Zeng, Guanbin Li

Accepted by IEEE Transactions on Multimedia, 2021

HTML
CVPR2021

Bottom-up shift and reasoning for referring image segmentation

Sibei Yang, Meng Xia, Guanbin Li, Hong-Yu Zhou, Yizhou Yu

CVPR, 2021

PDF
ICCV2021

Convnets vs. transformers: Whose visual representations are more transferable?

Hong-Yu Zhou, Chixiang Lu, Sibei Yang, Yizhou Yu

Accepted by ICCV, 2021

arXiv
ICCV2021

Preservational learning improves self-supervised medical image models by reconstructing diverse contexts

Hong-Yu Zhou, Chixiang Lu, Sibei Yang, Xiaoguang Han, Yizhou Yu

Accepted by ICCV, 2021

arXiv Code

2020

TPAMI2020

Relationship-embedded representation learning for grounding referring expressions

Sibei Yang, Guanbin Li, and Yizhou Yu

TPAMI, 2020

arXiv
CVPR2020

Graph-structured referring expression reasoning in the wild

Sibei Yang, Guanbin Li, and Yizhou Yu

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Abs arXiv Code

Grounding referring expressions aims to locate in an image an object referred to by a natural language expression. The linguistic structure of a referring expression provides a layout of reasoning over the visual contents, and it is often crucial to align and jointly understand the image and the referring expression. In this paper, we propose a scene graph guided modular network (SGMN), which performs reasoning over a semantic graph and a scene graph with neural modules under the guidance of the linguistic structure of the expression. In particular, we model the image as a structured semantic graph, and parse the expression into a language scene graph. The language scene graph not only decodes the linguistic structure of the expression, but also has a consistent representation with the image semantic graph. In addition to exploring structured solutions to grounding referring expressions, we also propose Ref-Reasoning, a large-scale real-world dataset for structured referring expression reasoning. We automatically generate referring expressions over the scene graphs of images using diverse expression templates and functional programs. This dataset is equipped with real-world visual contents as well as semantically rich expressions with different reasoning layouts. Experimental results show that our SGMN not only significantly outperforms existing state-of-the-art algorithms on the new Ref-Reasoning dataset, but also surpasses state-of-the-art structured methods on commonly used benchmark datasets. It can also provide interpretable visual evidence of reasoning.
ECCV2020

Propagating over phrase relations for one-stage visual grounding

Sibei Yang, Guanbin Li, and Yizhou Yu

Accepted by ECCV, 2020

HTML

2019

AAAI2019

Non-Local Context Encoder: Robust Biomedical Image Segmentation against Adversarial Attacks

Xiang He, Sibei Yang, Guanbin Li, Haofeng Li, HuiYou Chang, Yizhou Yu

Accepted by AAAI, 2019

arXiv HTML
Dynamic Graph Attention for Referring Expression Comprehension

Sibei Yang, Guanbin Li, and Yizhou Yu

Accepted by ICCV, Oct 2019

HTML Video
CVPR2019

Cross-Modal Relationship Inference for Grounding Referring Expressions

Sibei Yang, Guanbin Li, and Yizhou Yu

Accepted by CVPR, Jun 2019

HTML PDF Code

2018

CVPR2018

Multi-Evidence Filtering and Fusion for Multi-Label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning

Weifeng Ge, Sibei Yang, and Yizhou Yu

Accepted by CVPR, Jun 2018

arXiv HTML