Yu-Gang Jiang

Fudan University

H-index: 78

About Yu-Gang Jiang

Yu-Gang Jiang is a distinguished researcher at Fudan University with an exceptional h-index of 78 overall and 64 since 2020. He specializes in Video Analytics, Multimedia, Computer Vision, Trustworthy AI, and AGI.

His recent articles reflect a diverse array of research interests and contributions to the field:

PoseAnimate: Zero-shot high fidelity pose controllable character animation

Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection

Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

Instance-aware multi-camera 3d object detection with structural priors mining and self-boosting learning

Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language Models

Multi-prompt alignment for multi-source unsupervised domain adaptation

Identity-Driven Multimedia Forgery Detection via Reference Assistance

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Yu-Gang Jiang Information

University

Fudan University

Position

Professor of Computer Science

Citations (all): 24,753

Citations (since 2020): 16,918

Cited by: 13,130

h-index (all): 78

h-index (since 2020): 64

i10-index (all): 223

i10-index (since 2020): 181

University Profile Page: Fudan University

Yu-Gang Jiang Skills & Research Interests

Video Analytics

Multimedia

Computer Vision

Trustworthy AI

AGI

Top articles of Yu-Gang Jiang

PoseAnimate: Zero-shot high fidelity pose controllable character animation

Authors

Bingwen Zhu,Fanyi Wang,Tianyi Lu,Peng Liu,Jingwen Su,Jinxiu Liu,Yanhao Zhang,Zuxuan Wu,Yu-Gang Jiang,Guo-Jun Qi

Journal

arXiv preprint arXiv:2404.13680

Published Date

2024/4/21

Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity with the source image. However, existing approaches suffer from character appearance inconsistency and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally demanding. To address these limitations, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: 1) a Pose-Aware Control Module (PACM) incorporates diverse pose signals into conditional embeddings to preserve character-independent content and maintain precise alignment of actions; 2) a Dual Consistency Attention Module (DCAM) enhances temporal consistency and retains character identity and intricate background details; 3) a Mask-Guided Decoupling Module (MGDM) refines distinct feature perception, improving animation fidelity by decoupling the character and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transitions. Extensive experimental results demonstrate that our approach outperforms state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations.

Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection

Authors

Lingchen Meng,Xiyang Dai,Jianwei Yang,Dongdong Chen,Yinpeng Chen,Mengchen Liu,Yi-Ling Chen,Zuxuan Wu,Lu Yuan,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2310.12152

Published Date

2023/10/18

Long-tailed object detection (LTOD) aims to handle the extreme data imbalance in real-world datasets, where many tail classes have scarce instances. One popular strategy is to explore extra data with image-level labels, yet it produces limited results due to (1) semantic ambiguity: an image-level label only captures a salient part of the image, ignoring the remaining rich semantics within the image; and (2) location sensitivity: the label highly depends on the locations and crops of the original image, which may change after data transformations like random cropping. To remedy this, we propose RichSem, a simple but effective method that robustly learns rich semantics from coarse locations without the need for accurate bounding boxes. RichSem leverages rich semantics from images, which then serve as additional "soft supervision" for training detectors. Specifically, we add a semantic branch to our detector to learn these soft semantics and enhance feature representations for long-tailed object detection. The semantic branch is only used for training and is removed during inference. RichSem achieves consistent improvements on both the overall and rare categories of LVIS under different backbones and detectors. Our method achieves state-of-the-art performance without requiring complex training and testing procedures. Moreover, we show the effectiveness of our method on other long-tailed datasets with additional experiments.

Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

Authors

Zuxuan Wu,Zejia Weng,Wujian Peng,Xitong Yang,Ang Li,Larry S Davis,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2310.05010

Published Date

2023/10/8

Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made exploring its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong zero-shot video classifier, capable of identifying novel actions and events during testing. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, thereby creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models to produce fine-grained …
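
The Interpolated Weight Optimization mentioned above ultimately amounts to averaging two sets of compatible parameters. As a rough illustration only (the function and model names below are hypothetical stand-ins, not the Open-VCLIP++ code), a linear interpolation between the original CLIP weights and the fine-tuned video model might look like this in PyTorch:

    import torch

    def interpolate_weights(clip_state, finetuned_state, alpha=0.5):
        """Linearly interpolate two compatible state dicts.

        alpha = 0 keeps the original CLIP weights, alpha = 1 keeps the
        fine-tuned video model; intermediate values trade zero-shot
        generality against video-specific accuracy.
        """
        return {
            name: (1.0 - alpha) * clip_state[name] + alpha * finetuned_state[name]
            for name in clip_state
        }

    # Toy usage with two tiny modules standing in for CLIP and its video-adapted copy.
    clip_model = torch.nn.Linear(4, 2)
    video_model = torch.nn.Linear(4, 2)
    merged = interpolate_weights(clip_model.state_dict(), video_model.state_dict(), alpha=0.3)
    video_model.load_state_dict(merged)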

Instance-aware multi-camera 3d object detection with structural priors mining and self-boosting learning

Authors

Yang Jiao,Zequn Jie,Shaoxiang Chen,Lechao Cheng,Jingjing Chen,Lin Ma,Yu-Gang Jiang

Journal

Proceedings of the AAAI Conference on Artificial Intelligence

Published Date

2024/3/24

The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field. Under such a paradigm, accurate BEV representation construction relies on reliable depth estimation for multi-camera images. However, existing approaches exhaustively predict depths for every pixel without prioritizing objects, which are precisely the entities requiring detection in 3D space. To this end, we propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector. First, a category-specific structural priors mining approach is proposed to enhance the efficacy of monocular depth generation. Besides, a self-boosting learning strategy is further proposed to encourage the model to place more emphasis on challenging objects in computationally expensive temporal stereo matching. Together they provide advanced depth estimation results for high-quality BEV feature construction, benefiting the ultimate 3D detection. The proposed method achieves state-of-the-art performance on the challenging nuScenes benchmark, and extensive experimental results demonstrate the effectiveness of our designs.

Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language Models

Authors

Yian Li,Wentao Tian,Yang Jiao,Jingjing Chen,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2404.12966

Published Date

2024/4/19

Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. Nevertheless, how will existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel CounterFactual MultiModal reasoning benchmark, abbreviated as CFMM, to systematically assess the counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six challenging tasks, each including hundreds of carefully human-labeled counterfactual questions, to evaluate MLLMs' counterfactual reasoning capabilities across diverse aspects. Through experiments, we find that, interestingly, existing MLLMs prefer to believe what they see and ignore the counterfactual presuppositions presented in the question, thereby leading to inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs on our proposed CFMM. The significant gap between their performance on our CFMM and that on several VQA benchmarks indicates that there is still considerable room for improvement in existing MLLMs toward approaching human-level intelligence. On the other hand, by boosting MLLM performance on our CFMM in the future, potential avenues toward developing MLLMs with advanced intelligence can be explored.

Multi-prompt alignment for multi-source unsupervised domain adaptation

Authors

Haoran Chen,Zuxuan Wu,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2209.15210

Published Date

2022/9/30

Most existing methods for unsupervised domain adaptation (UDA) rely on a shared network to extract domain-invariant features. However, when facing multiple source domains, optimizing such a network involves updating the parameters of the entire network, making it both computationally expensive and challenging, particularly when coupled with min-max objectives. Inspired by recent advances in prompt learning that adapts high-capacity models for downstream tasks in a computationally economic way, we introduce Multi-Prompt Alignment (MPA), a simple yet efficient framework for multi-source UDA. Given a source and target domain pair, MPA first trains an individual prompt to minimize the domain gap through a contrastive loss. Then, MPA denoises the learned prompts through an auto-encoding process and aligns them by maximizing the agreement of all the reconstructed prompts. Moreover, we show that the resulting subspace acquired from the auto-encoding process can easily generalize to a streamlined set of target domains, making our method more efficient for practical usage. Extensive experiments show that MPA achieves state-of-the-art results on three popular datasets with an impressive average accuracy of 54.1% on DomainNet.
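
As a hedged illustration of the alignment step described above (a minimal sketch only; the dimensions, the linear auto-encoder, and the cosine-agreement term are assumptions, not the paper's exact design), the prompts learned for each source-target pair could be denoised and aligned roughly as follows:

    import torch
    import torch.nn.functional as F

    num_pairs, prompt_len, dim = 3, 16, 512   # one learned prompt per source-target pair

    # In MPA these would already be trained per pair with a contrastive loss.
    prompts = torch.nn.Parameter(torch.randn(num_pairs, prompt_len, dim))

    # A simple auto-encoder over flattened prompts; the low-dimensional code plays
    # the role of the shared subspace mentioned in the abstract.
    encoder = torch.nn.Linear(prompt_len * dim, 128)
    decoder = torch.nn.Linear(128, prompt_len * dim)

    def alignment_loss(prompts):
        flat = prompts.flatten(1)                      # (num_pairs, prompt_len * dim)
        recon = decoder(encoder(flat))
        recon_loss = F.mse_loss(recon, flat)           # denoising / reconstruction term
        # Agreement: push reconstructed prompts toward each other (pairwise cosine similarity).
        normed = F.normalize(recon, dim=-1)
        sim = normed @ normed.t()
        agree = sim[~torch.eye(num_pairs, dtype=torch.bool)].mean()
        return recon_loss - agree                      # maximize agreement, minimize reconstruction error

    loss = alignment_loss(prompts)
    loss.backward()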

Identity-Driven Multimedia Forgery Detection via Reference Assistance

Authors

Junhao Xu,Jingjing Chen,Xue Song,Feng Han,Haijun Shan,Yugang Jiang

Journal

arXiv preprint arXiv:2401.11764

Published Date

2024/1/22

Recent advancements in technologies such as the 'deepfake' technique have paved the way for the generation of various media forgeries. In response to the potential hazards of these media forgeries, many researchers engage in exploring detection methods, increasing the demand for high-quality media forgery datasets. Despite this, existing datasets have certain limitations. Firstly, most datasets focus on manipulation of the visual modality and usually lack diversity, as only a few forgery approaches are considered. Secondly, the quality of the media is often inadequate in clarity and naturalness, and the size of the datasets is also limited. Thirdly, while many real-world forgeries are driven by identity, the identity information of the subject in the media is frequently neglected. For detection, identity information could be an essential clue to boost accuracy. Moreover, official media concerning certain identities on the Internet can serve as prior knowledge, aiding both the audience and forgery detectors in determining the true identity. Therefore, we propose an identity-driven multimedia forgery dataset, IDForge, which contains 249,138 video shots. All video shots are sourced from 324 in-the-wild videos of 54 celebrities collected from the Internet. The fake video shots involve 9 types of manipulation across the visual, audio and textual modalities. Additionally, IDForge provides an extra 214,438 real video shots as a reference set for the 54 celebrities. Correspondingly, we design an effective multimedia detection network, the Reference-assisted Multimodal Forgery Detection Network (R-MFDN). Through extensive experiments on the proposed dataset, we demonstrate …

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Authors

Tianwen Qian,Jingjing Chen,Linhai Zhuo,Yang Jiao,Yu-Gang Jiang

Journal

Proceedings of the AAAI Conference on Artificial Intelligence

Published Date

2024/3/24

We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by cameras and LiDAR, respectively. Secondly, the data are multi-frame due to continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics show that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Codes and dataset are available at https://github.com/qiantianwen/NuScenes-QA.
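
The template-based generation pipeline above can be pictured with a toy sketch (the field names and templates here are illustrative, not the benchmark's actual schema): annotations are turned into slot-filled questions whose answers are derived programmatically.

    import random

    # A toy scene-graph annotation in the spirit of 3D detection labels
    # (object category, attribute, and a coarse count); field names are illustrative.
    scene_objects = [
        {"category": "car", "status": "moving", "count": 3},
        {"category": "pedestrian", "status": "standing", "count": 1},
    ]

    # Question templates with slots filled from the annotations; the real benchmark
    # uses a much richer template set and programmatic answer derivation.
    templates = [
        ("How many {category}s are there in the scene?", lambda o: str(o["count"])),
        ("Is there a {status} {category}?", lambda o: "yes"),
    ]

    def generate_qa(objects, templates, n=4):
        pairs = []
        for _ in range(n):
            obj = random.choice(objects)
            question_tpl, answer_fn = random.choice(templates)
            pairs.append({"question": question_tpl.format(**obj), "answer": answer_fn(obj)})
        return pairs

    for qa in generate_qa(scene_objects, templates):
        print(qa["question"], "->", qa["answer"])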

The Dog Walking Theory: Rethinking Convergence in Federated Learning

Authors

Kun Zhai,Yifeng Gao,Xingjun Ma,Difan Zou,Guangnan Ye,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2404.11888

Published Date

2024/4/18

Federated learning (FL) is a collaborative learning paradigm that allows different clients to train one powerful global model without sharing their private data. Although FL has demonstrated promising results in various applications, it is known to suffer from convergence issues caused by the data distribution shift across different clients, especially on non-independent and identically distributed (non-IID) data. In this paper, we study the convergence of FL on non-IID data and propose a novel Dog Walking Theory to formulate and identify the missing element in existing research. The Dog Walking Theory describes the process of a dog walker leash walking multiple dogs from one side of the park to the other. The goal of the dog walker is to arrive at the right destination while giving the dogs enough exercise (i.e., space exploration). In FL, the server is analogous to the dog walker while the clients are analogous to the dogs. This analogy allows us to identify one crucial yet missing element in existing FL algorithms: the leash that guides the exploration of the clients. To address this gap, we propose a novel FL algorithm, FedWalk, that leverages an external easy-to-converge task at the server side as a leash task to guide the local training of the clients. We theoretically analyze the convergence of FedWalk with respect to data heterogeneity (between server and clients) and task discrepancy (between the leash and the original tasks). Experiments on multiple benchmark datasets demonstrate the superiority of FedWalk over state-of-the-art FL methods under both IID and non-IID settings.
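
The abstract describes FedWalk only at a high level, so the sketch below is one plausible reading: a standard FedAvg round in which each client's loss adds a hypothetical proximal-style "leash" pull toward a server-side model trained on the easy auxiliary task. The coupling term and all names are assumptions for illustration, not the paper's actual algorithm.

    import copy
    import torch

    def leash_penalty(local_model, leash_model, strength=0.1):
        """Hypothetical leash term: an L2 pull toward the server-side leash model."""
        return strength * sum(
            ((p_l - p_s.detach()) ** 2).sum()
            for p_l, p_s in zip(local_model.parameters(), leash_model.parameters())
        )

    def fedavg_round(global_model, client_loaders, leash_model, local_steps=1, lr=0.01):
        client_states = []
        for loader in client_loaders:
            local = copy.deepcopy(global_model)
            opt = torch.optim.SGD(local.parameters(), lr=lr)
            for _ in range(local_steps):
                for x, y in loader:
                    opt.zero_grad()
                    loss = torch.nn.functional.cross_entropy(local(x), y)
                    loss = loss + leash_penalty(local, leash_model)   # assumed coupling
                    loss.backward()
                    opt.step()
            client_states.append(local.state_dict())
        # Standard FedAvg aggregation (unweighted for brevity).
        avg = {k: torch.stack([s[k] for s in client_states]).mean(0) for k in client_states[0]}
        global_model.load_state_dict(avg)
        return global_model

    # Toy usage: two clients with random data on a linear classifier.
    model = torch.nn.Linear(8, 3)
    leash = torch.nn.Linear(8, 3)     # stands in for the server model trained on the easy task
    loaders = [[(torch.randn(16, 8), torch.randint(0, 3, (16,)))] for _ in range(2)]
    fedavg_round(model, loaders, leash)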

Cdistnet: Perceiving multi-domain character distance for robust text recognition

Authors

Tianlun Zheng,Zhineng Chen,Shancheng Fang,Hongtao Xie,Yu-Gang Jiang

Journal

International Journal of Computer Vision

Published Date

2024/2

The transformer-based encoder-decoder framework is becoming popular in scene text recognition, largely because it naturally integrates recognition clues from both the visual and semantic domains. However, recent studies show that the two kinds of clues are not always well registered and, therefore, features and characters might be misaligned in difficult text (e.g., with a rare shape). As a result, constraints such as character position are introduced to alleviate this problem. Despite certain success, the visual and semantic clues are still modeled separately and are merely loosely associated. In this paper, we propose a novel module called multi-domain character distance perception (MDCDP) to establish a visually and semantically related position embedding. MDCDP uses the position embedding to query both visual and semantic features following the cross-attention mechanism. The two kinds of clues are fused into the …
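
The querying mechanism described above can be sketched with standard cross-attention; the sizes, module layout, and fusion layer below are illustrative assumptions rather than CDistNet's actual implementation:

    import torch

    dim, num_pos, vis_len, sem_len = 256, 25, 64, 25  # illustrative sizes

    pos_embedding = torch.nn.Parameter(torch.randn(1, num_pos, dim))   # character-position queries
    vis_attn = torch.nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
    sem_attn = torch.nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
    fuse = torch.nn.Linear(2 * dim, dim)

    def mdcdp_like(visual_feats, semantic_feats):
        """Query visual and semantic features with the position embedding and fuse the results."""
        q = pos_embedding.expand(visual_feats.size(0), -1, -1)
        vis_out, _ = vis_attn(q, visual_feats, visual_feats)
        sem_out, _ = sem_attn(q, semantic_feats, semantic_feats)
        return fuse(torch.cat([vis_out, sem_out], dim=-1))

    out = mdcdp_like(torch.randn(2, vis_len, dim), torch.randn(2, sem_len, dim))
    print(out.shape)   # (2, 25, 256): one fused feature per character position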

GaussianBody: Clothed Human Reconstruction via 3d Gaussian Splatting

Authors

Mengtian Li,Shengxiang Yao,Zhifeng Xie,Keyu Chen

Journal

arXiv preprint arXiv:2401.09720

Published Date

2024/1/18

In this work, we propose a novel clothed human reconstruction method called GaussianBody, based on 3D Gaussian Splatting. Compared with costly neural-radiance-based models, 3D Gaussian Splatting has recently demonstrated great performance in terms of training time and rendering quality. However, applying the static 3D Gaussian Splatting model to the dynamic human reconstruction problem is non-trivial due to complicated non-rigid deformations and rich cloth details. To address these challenges, our method considers explicit pose-guided deformation to associate dynamic Gaussians across the canonical space and the observation space, and introduces a physically based prior with regularized transformations to help mitigate the ambiguity between the two spaces. During the training process, we further propose a pose refinement strategy to update the pose regression to compensate for the inaccurate initial estimation, and a split-with-scale mechanism to enhance the density of regressed point clouds. The experiments validate that our method can achieve state-of-the-art photorealistic novel-view rendering results with high-quality details for dynamic clothed human bodies, along with explicit geometry reconstruction.

Fdgaussian: Fast gaussian splatting from single image via geometric-aware diffusion model

Authors

Qijun Feng,Zhen Xing,Zuxuan Wu,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2403.10242

Published Date

2024/3/15

Reconstructing detailed 3D objects from single-view images remains a challenging task due to the limited information available. In this paper, we introduce FDGaussian, a novel two-stage framework for single-image 3D reconstruction. Recent methods typically utilize pre-trained 2D diffusion models to generate plausible novel views from the input image, yet they encounter issues with either multi-view inconsistency or lack of geometric fidelity. To overcome these challenges, we propose an orthogonal plane decomposition mechanism to extract 3D geometric features from the 2D input, enabling the generation of consistent multi-view images. Moreover, we further accelerate the state-of-the-art Gaussian Splatting by incorporating epipolar attention to fuse images from different viewpoints. We demonstrate that FDGaussian generates images with high consistency across different views and reconstructs high-quality 3D objects, both qualitatively and quantitatively. More examples can be found at our website https://qjfeng.net/FDGaussian/.

Learning to Rank Patches for Unbiased Image Redundancy Reduction

Authors

Yang Luo,Zhineng Chen,Peng Zhou,Zuxuan Wu,Xieping Gao,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2404.00680

Published Date

2024/3/31

Images suffer from heavy spatial redundancy because pixels in neighboring regions are spatially correlated. Existing approaches strive to overcome this limitation by reducing less meaningful image regions. However, current leading methods rely on supervisory signals. They may compel models to preserve content that aligns with labeled categories and discard content belonging to unlabeled categories. This categorical inductive bias makes these methods less effective in real-world scenarios. To address this issue, we propose a self-supervised framework for image redundancy reduction called Learning to Rank Patches (LTRP). We observe that image reconstruction of masked image modeling models is sensitive to the removal of visible patches when the masking ratio is high (e.g., 90%). Building upon this observation, we implement LTRP via two steps: inferring the semantic density score of each patch by quantifying the variation between reconstructions with and without this patch, and learning to rank the patches with the pseudo scores. The entire process is self-supervised, thus avoiding the dilemma of categorical inductive bias. We design extensive experiments on different datasets and tasks. The results demonstrate that LTRP outperforms both supervised and other self-supervised methods due to its fair assessment of image content.
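
The pseudo-score step above (quantifying how much the reconstruction changes when one visible patch is removed) can be sketched as follows; mae_reconstruct is a hypothetical stand-in for a frozen masked-image-modeling model, not LTRP's actual interface:

    import torch

    def semantic_density_scores(mae_reconstruct, visible_idx):
        """Score each visible patch by how much the reconstruction changes when it is removed.

        `mae_reconstruct(idx)` stands in for a frozen masked-image-modeling model that
        returns a reconstruction given the indices of visible patches.
        """
        base = mae_reconstruct(visible_idx)
        scores = []
        for i in range(len(visible_idx)):
            reduced = visible_idx[:i] + visible_idx[i + 1:]          # drop one visible patch
            delta = (mae_reconstruct(reduced) - base).abs().mean()   # reconstruction variation
            scores.append(delta.item())
        return scores  # pseudo scores used to train the patch-ranking network

    # Toy stand-in: the "reconstruction" is just the mean of the visible patch embeddings.
    patches = torch.randn(10, 32)
    scores = semantic_density_scores(lambda idx: patches[idx].mean(0), list(range(4)))
    print(scores)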

Instruction-Guided Scene Text Recognition

Authors

Yongkun Du,Zhineng Chen,Yuchen Su,Caiyan Jia,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2401.17851

Published Date

2024/1/31

Multi-modal models have shown appealing performance in visual tasks recently, as instruction-guided training has evoked the ability to understand fine-grained visual content. However, current methods cannot be trivially applied to scene text recognition (STR) due to the gap between natural and text images. In this paper, we introduce a novel paradigm that formulates STR as an instruction learning problem, and propose instruction-guided scene text recognition (IGTR) to achieve effective cross-modal learning. IGTR first generates rich and diverse instruction triplets of <condition, question, answer>, serving as guidance for nuanced text image understanding. Then, we devise an architecture with a dedicated cross-modal feature fusion module and a multi-task answer head to effectively fuse the required instruction and image features for answering questions. Built upon these designs, IGTR facilitates accurate text recognition by comprehending character attributes. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins. Furthermore, by adjusting the instructions, IGTR enables various recognition schemes. These include zero-shot prediction, where the model is trained on instructions not explicitly targeting character recognition, and the recognition of rarely appearing and morphologically similar characters, which were previously challenging for existing models.

Secrets of rlhf in large language models part ii: Reward modeling

Authors

Binghai Wang,Rui Zheng,Lu Chen,Yan Liu,Shihan Dou,Caishuang Huang,Wei Shen,Senjie Jin,Enyu Zhou,Chenyu Shi,Songyang Gao,Nuo Xu,Yuhao Zhou,Xiaoran Fan,Zhiheng Xi,Jun Zhao,Xiao Wang,Tao Ji,Hang Yan,Lixing Shen,Zhan Chen,Tao Gui,Qi Zhang,Xipeng Qiu,Xuanjing Huang,Zuxuan Wu,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2401.06080

Published Date

2024/1/11

Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences to drive reinforcement learning optimization. While reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) Incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) Reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative RLHF training. In this report, we attempt to address these two issues. (1) From a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. Experimental results confirm that data with varying preference strengths have different impacts on reward model performance. We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data. (2) From an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. Furthermore, we employ meta-learning to enable the reward model to maintain the ability to differentiate subtle differences in out-of-distribution samples, and this …
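
The preference-strength idea sketched in the abstract (a vote across several reward models per chosen/rejected pair) might look roughly like this; the callables and the agreement-fraction definition are assumptions for illustration, not the paper's exact measure:

    import torch

    def preference_strength(reward_models, chosen, rejected):
        """Estimate preference strength by letting several reward models vote on each pair.

        `reward_models` is a list of callables mapping a batch of responses to scalar rewards;
        the returned strength is the fraction of models agreeing that `chosen` beats `rejected`
        (a margin-based average would be an equally plausible choice).
        """
        votes = torch.stack([(rm(chosen) > rm(rejected)).float() for rm in reward_models])
        return votes.mean(dim=0)   # one strength value in [0, 1] per preference pair

    # Toy stand-ins: three "reward models" scoring 16-d response embeddings.
    models = [torch.nn.Sequential(torch.nn.Linear(16, 1), torch.nn.Flatten(0)) for _ in range(3)]
    chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)
    print(preference_strength(models, chosen, rejected))   # per pair: 0/3, 1/3, 2/3 or 3/3 of models agree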

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

Authors

Yang Jiao,Shaoxiang Chen,Zequn Jie,Jingjing Chen,Lin Ma,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2403.07304

Published Date

2024/3/12

The Large Multimodal Model (LMM) is a hot research topic in the computer vision area and has also demonstrated remarkable potential across multiple disciplinary fields. A recent trend is to further extend and enhance the perception capabilities of LMMs. Current methods follow the paradigm of adapting visual task outputs to the format of the language model, which is the main component of an LMM. This adaptation leads to convenient development of such LMMs with minimal modifications; however, it overlooks the intrinsic characteristics of diverse visual tasks and hinders the learning of perception capabilities. To address this issue, we propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement. We decouple the LMM's learning of perception capabilities into task-agnostic and task-specific stages. Lumen first promotes fine-grained vision-language concept alignment, which is the fundamental capability for various visual tasks. Thus the output of the task-agnostic stage is a shared representation for all the tasks we address in this paper. Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders with negligible training efforts. Benefiting from such a decoupled design, our Lumen surpasses existing LMM-based approaches on the COCO detection benchmark with a clear margin and exhibits seamless scalability to additional visual tasks. Furthermore, we also conduct comprehensive ablation studies and generalization evaluations for deeper insights. The code will be released at https://github.com/SxJyJay/Lumen.

OmniViD: A Generative Framework for Universal Video Understanding

Authors

Junke Wang,Dongdong Chen,Chong Luo,Bo He,Lu Yuan,Zuxuan Wu,Yu-Gang Jiang

Published Date

2024/3/26

The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In contrast, natural language processing benefits from a unified output space, i.e., text sequences, which simplifies the training of powerful foundational language models, such as GPT-3, with extensive training corpora. Inspired by this, we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. In this way, a variety of video tasks could be formulated as video-grounded token generation. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization tasks (such as visual object tracking) within a fully shared encoder-decoder architecture, following a generative framework. Through comprehensive experiments, we demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results on seven video benchmarks, providing a novel perspective for more universal video understanding. Code is available at https://github.com/wangjk666/OmniVid.
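
The "time and box tokens" idea can be illustrated with a simple quantization scheme in the spirit of discrete coordinate vocabularies; the bin counts and vocabulary offsets below are hypothetical, not OmniViD's actual tokenizer:

    def quantize_box(box, num_bins=1000, vocab_offset=32000):
        """Map normalized box coordinates in [0, 1] to discrete token IDs.

        `vocab_offset` is a hypothetical index where coordinate tokens start, placed
        after the text vocabulary so boxes and words share one output space.
        """
        return [vocab_offset + min(int(c * num_bins), num_bins - 1) for c in box]

    def quantize_time(t, duration, num_bins=100, vocab_offset=33000):
        """Map a timestamp (in seconds) to a single time token ID."""
        return vocab_offset + min(int(t / duration * num_bins), num_bins - 1)

    # A tracking-style target sequence: "<time> person <box>" rendered as token IDs.
    box_tokens = quantize_box([0.12, 0.30, 0.55, 0.80])
    time_token = quantize_time(3.2, duration=10.0)
    print(time_token, box_tokens)   # one time token plus four coordinate tokens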

Doubly Abductive Counterfactual Inference for Text-based Image Editing

Authors

Xue Song,Jiequan Cui,Hanwang Zhang,Jingjing Chen,Richang Hong,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2403.02981

Published Date

2024/3/5

We study text-based image editing (TBIE) of a single image by counterfactual inference because it is an elegant formulation to precisely address the requirement: the edited image should retain the fidelity of the original one. Through the lens of the formulation, we find that the crux of TBIE is that existing techniques hardly achieve a good trade-off between editability and fidelity, mainly due to the overfitting of the single-image fine-tuning. To this end, we propose a Doubly Abductive Counterfactual inference framework (DAC). We first parameterize an exogenous variable as a UNet LoRA, whose abduction can encode all the image details. Second, we abduct another exogenous variable parameterized by a text encoder LoRA, which recovers the lost editability caused by the overfitted first abduction. Thanks to the second abduction, which exclusively encodes the visual transition from post-edit to pre-edit, its inversion -- subtracting the LoRA -- effectively reverts pre-edit back to post-edit, thereby accomplishing the edit. Through extensive experiments, our DAC achieves a good trade-off between editability and fidelity. Thus, we can support a wide spectrum of user editing intents, including addition, removal, manipulation, replacement, style transfer, and facial change, which are extensively validated in both qualitative and quantitative evaluations. Codes are in https://github.com/xuesong39/DAC.

MouSi: Poly-Visual-Expert Vision-Language Models

Authors

Xiaoran Fan,Tao Ji,Changhao Jiang,Shuo Li,Senjie Jin,Sirui Song,Junke Wang,Boyang Hong,Lu Chen,Guodong Zheng,Ming Zhang,Caishuang Huang,Rui Zheng,Zhiheng Xi,Yuhao Zhou,Shihan Dou,Junjie Ye,Hang Yan,Tao Gui,Qi Zhang,Xipeng Qiu,Xuanjing Huang,Zuxuan Wu,Yu-Gang Jiang

Journal

arXiv preprint arXiv:2401.17221

Published Date

2024/1/30

Current large vision-language models (VLMs) often encounter challenges such as insufficient capabilities of a single visual component and excessively long visual tokens. These issues can limit the model's effectiveness in accurately interpreting complex visual information and over-lengthy contextual information. Addressing these challenges is crucial for enhancing the performance and applicability of VLMs. This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, image segmentation, etc. This technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to alleviate the waste of positional encoding caused by lengthy image feature sequences, effectively addressing the issues of position overflow and length limitations. For instance, in our implementation, this technique significantly reduces the positional occupancy in models like SAM from a substantial 4096 to a more efficient and manageable 64, or even down to 1. Experimental results demonstrate that VLMs with multiple experts exhibit consistently superior performance over isolated visual encoders and mark a significant performance boost as more experts are integrated. We have open-sourced the training code used in this report. All of these resources can be found on our project website.
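
One simple way to picture the reduction of positional occupancy mentioned above (e.g., 4096 SAM-style tokens down to 64 or even 1) is to pool each expert's token sequence before it enters the language model; the adaptive average pooling below is an illustrative stand-in, not the positional-encoding schemes the paper actually studies:

    import torch

    def compress_tokens(expert_tokens, target_len=64):
        """Pool a long visual token sequence down to `target_len` tokens.

        Pooling is just one illustrative way to shrink positional occupancy; the paper
        explores several positional-encoding schemes rather than this exact operator.
        """
        # (batch, seq, dim) -> (batch, dim, seq) for 1-D pooling over the sequence axis.
        x = expert_tokens.transpose(1, 2)
        pooled = torch.nn.functional.adaptive_avg_pool1d(x, target_len)
        return pooled.transpose(1, 2)

    sam_like_tokens = torch.randn(2, 4096, 256)       # e.g. a SAM-style expert output
    print(compress_tokens(sam_like_tokens).shape)     # torch.Size([2, 64, 256])
    print(compress_tokens(sam_like_tokens, 1).shape)  # torch.Size([2, 1, 256])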

LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network

Authors

Yuchen Su,Zhineng Chen,Zhiwen Shao,Yuning Du,Zhilong Ji,Jinfeng Bai,Yong Zhou,Yu-Gang Jiang

Journal

Proceedings of the AAAI Conference on Artificial Intelligence

Published Date

2024/3/24

Recently, regression-based methods, which predict parameterized text shapes for text localization, have gained popularity in scene text detection. However, the existing parameterized text shape methods still have limitations in modeling arbitrary-shaped texts due to ignoring the utilization of text-specific shape information. Moreover, the time consumption of the entire pipeline has been largely overlooked, leading to a suboptimal overall inference speed. To address these issues, we first propose a novel parameterized text shape method based on low-rank approximation. Unlike other shape representation methods that employ data-irrelevant parameterization, our approach utilizes singular value decomposition and reconstructs the text shape using a few eigenvectors learned from labeled text contours. By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation. Next, we propose a dual assignment scheme for speed acceleration. It adopts a sparse assignment branch to accelerate the inference speed, and meanwhile, provides ample supervised signals for training through a dense assignment branch. Building upon these designs, we implement an accurate and efficient arbitrary-shaped text detector named LRANet. Extensive experiments are conducted on several challenging benchmarks, demonstrating the superior accuracy and efficiency of LRANet compared to state-of-the-art methods.
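
The low-rank shape parameterization described above can be sketched with plain NumPy: run SVD on a matrix of flattened, labeled text contours and keep a few eigenvectors as the shape basis. The contour sizes and rank below are illustrative, and the real method learns the coefficients with a detection network rather than projecting ground truth:

    import numpy as np

    # Toy training set: N labeled text contours, each with K boundary points (x, y),
    # flattened to a 2K-dimensional vector. Real contours come from dataset annotations.
    N, K, rank = 200, 16, 4
    contours = np.random.rand(N, 2 * K)

    # Learn a low-rank shape basis from the labeled contours via SVD.
    mean_shape = contours.mean(axis=0)
    _, _, vt = np.linalg.svd(contours - mean_shape, full_matrices=False)
    basis = vt[:rank]                                  # top-`rank` eigenvectors of the shape space

    def encode(contour):
        """Represent an arbitrary-shaped contour by `rank` coefficients."""
        return (contour - mean_shape) @ basis.T

    def decode(coeffs):
        """Reconstruct the contour from its low-rank coefficients."""
        return mean_shape + coeffs @ basis

    coeffs = encode(contours[0])
    print(coeffs.shape, np.abs(decode(coeffs) - contours[0]).mean())   # (4,) plus reconstruction error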

Yu-Gang Jiang FAQs

What is Yu-Gang Jiang's h-index at Fudan University?

Yu-Gang Jiang's h-index is 78 overall and 64 since 2020.

What are Yu-Gang Jiang's top articles?

The top articles of Yu-Gang Jiang at Fudan University include:

PoseAnimate: Zero-shot high fidelity pose controllable character animation

Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection

Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

Instance-aware multi-camera 3d object detection with structural priors mining and self-boosting learning

Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language Models

Multi-prompt alignment for multi-source unsupervised domain adaptation

Identity-Driven Multimedia Forgery Detection via Reference Assistance

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

...

What are Yu-Gang Jiang's research interests?

The research interests of Yu-Gang Jiang are Video Analytics, Multimedia, Computer Vision, Trustworthy AI, and AGI.

What is Yu-Gang Jiang's total number of citations?

Yu-Gang Jiang has 24,753 citations in total.

What are the co-authors of Yu-Gang Jiang?

Yu-Gang Jiang's co-authors include Larry Davis, Shih-Fu Chang, Mubarak Shah, Tao Xiang, and Chong-Wah Ngo.

Co-Authors

Larry Davis, University of Maryland (H-index: 139)

Shih-Fu Chang, Columbia University in the City of New York (H-index: 134)

Mubarak Shah, University of Central Florida (H-index: 132)

Tao Xiang, University of Surrey (H-index: 97)

Chong-Wah Ngo, Singapore Management University (H-index: 63)
