Rongrong Ji 纪荣嵘

Xiamen University

H-index: 78

Asia-China

About Rongrong Ji 纪荣嵘

Rongrong Ji 纪荣嵘 is a distinguished researcher at Xiamen University, with an h-index of 78 overall and 66 since 2020. He specializes in Model Compression, Neural Architecture Search, and Image Retrieval.

His recent articles reflect a diverse array of research interests and contributions to the field:

Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization

ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion

DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation

Uncovering the Over-Smoothing Challenge in Image Super-Resolution: Entropy-Based Quantification and Contrastive Optimization

Defense Against Adversarial Attacks Using Topology Aligning Adversarial Training

Identity-Aware Variational Autoencoder for Face Swapping

CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes

Toward Open-Set Human Object Interaction Detection

Rongrong Ji 纪荣嵘 Information

University: Xiamen University
Position: Professor
Citations (all): 26,114
Citations (since 2020): 19,445
Cited By: 11,776
h-index (all): 78
h-index (since 2020): 66
i10-index (all): 332
i10-index (since 2020): 264
Email:
University Profile Page: Xiamen University

Rongrong Ji 纪荣嵘 Skills & Research Interests

Model Compression

Neural Architecture Search

Image Retrieval

Top articles of Rongrong Ji 纪荣嵘

Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization

Authors

Yongdong Luo, Haojia Lin, Xiawu Zheng, Yigeng Jiang, Fei Chao, Jie Hu, Guannan Jiang, Songan Zhang, Rongrong Ji

Journal

arXiv preprint arXiv:2404.11064

Published Date

2024/4/17

3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, which require both shared and complementary information in localization and visual-language relationships. Therefore, existing approaches adopt the two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the performance of the detector, resulting in suboptimal performance. Inspired by DETR, we propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks in an end-to-end fashion. The key idea is to reconsider the prompt-based localization ability of the 3DVG model. In this way, the 3DVG model with a well-designed prompt as input can assist the 3DDC task by extracting localization information from the prompt. In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection, effectively harnessing the existing 3DVG model's inherent localization capacity, thereby boosting 3DDC capability. This integration facilitates simultaneous multi-task training on both tasks, mutually enhancing their performance. Extensive experimental results demonstrate the effectiveness of this approach. Specifically, on the ScanRefer dataset, 3DGCTR surpasses the state-of-the-art 3DDC method by 4.3% in CIDEr@0.5IoU in MLE training and improves upon the SOTA 3DVG method by 3.16% in Acc@0.25IoU.

ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion

Authors

Ziyue Zhang, Mingbao Lin, Rongrong Ji

Journal

arXiv preprint arXiv:2404.17230

Published Date

2024/4/26

We introduce ObjectAdd, a training-free diffusion modification method that adds user-expected objects into a user-specified area. The motivation for ObjectAdd is twofold: first, describing everything in a single prompt can be difficult, and second, users often need to add objects to the generated image. To accommodate real-world use, ObjectAdd maintains accurate image consistency after adding objects through technical innovations in: (1) embedding-level concatenation to ensure that text embeddings coalesce correctly; (2) object-driven layout control with latent and attention injection to ensure objects are placed in the user-specified area; (3) prompted image inpainting in an attention-refocusing and object-expansion fashion to ensure the rest of the image stays the same. Given a text-prompted image, ObjectAdd allows users to specify a box and an object, and achieves: (1) adding the object inside the box area; (2) keeping the content outside the box exact; (3) flawless fusion between the two areas.

DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation

Authors

Xiaobin Hu, Xu Peng, Donghao Luo, Xiaozhong Ji, Jinlong Peng, Zhengkai Jiang, Jiangning Zhang, Taisong Jin, Chengjie Wang, Rongrong Ji

Journal

arXiv preprint arXiv:2403.06168

Published Date

2024/3/10

Because obtaining highly accurate matting annotations is difficult and labor-intensive, only a limited number of such labels are publicly available. To tackle this challenge, we propose DiffuMatting, which inherits the strong "everything" generation ability of diffusion models and adds the power of "matting anything". DiffuMatting can (1) act as an anything-matting factory with highly accurate annotations, and (2) work well with community LoRAs and various conditional control approaches to achieve community-friendly art design and controllable generation. Specifically, inspired by green-screen matting, we aim to teach the diffusion model to paint on a fixed green-screen canvas. To this end, a large-scale green-screen dataset (Green100K) is collected as the training data for DiffuMatting. Secondly, a green background control loss is proposed to keep the drawing board a pure green color, distinguishing foreground from background. To ensure the synthesized object has more edge detail, a detail-enhancement transition-boundary loss is proposed as a guideline for generating objects with more complicated edge structures. To simultaneously generate the object and its matting annotation, we build a matting head that performs green-color removal in the latent space of the VAE decoder. DiffuMatting shows several potential applications (e.g., matting-data generation, community-friendly art design, and controllable generation). As a matting-data generator, DiffuMatting synthesizes general-object and portrait matting sets, effectively reducing the relative MSE error by 15.4% in General Object Matting …
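
The abstract does not give the exact form of the green background control loss; the minimal PyTorch sketch below illustrates one plausible reading (the function name, the [0, 1] color convention, and the use of the predicted matte as a background weight are all assumptions, not the paper's definition).

```python
import torch

def green_background_loss(rgb, alpha, green=(0.0, 1.0, 0.0)):
    """Hypothetical sketch of a green-background control term.

    rgb:   (B, 3, H, W) generated image in [0, 1]
    alpha: (B, 1, H, W) predicted matte, 1 = foreground
    Penalizes background pixels (low alpha) that drift away from pure green.
    """
    target = torch.tensor(green, dtype=rgb.dtype, device=rgb.device).view(1, 3, 1, 1)
    bg_weight = 1.0 - alpha                       # emphasize background regions
    return (bg_weight * (rgb - target) ** 2).mean()
```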

Uncovering the Over-Smoothing Challenge in Image Super-Resolution: Entropy-Based Quantification and Contrastive Optimization

Authors

Tianshuo Xu, Lijiang Li, Peng Mi, Xiawu Zheng, Fei Chao, Rongrong Ji, Yonghong Tian, Qiang Shen

Journal

IEEE Transactions on Pattern Analysis and Machine Intelligence

Published Date

2024/3/19

PSNR-oriented models are a critical class of super-resolution models with applications across various fields. However, these models tend to generate over-smoothed images, a problem that has been analyzed previously from the perspectives of models or loss functions, but without taking into account the impact of data properties. In this paper, we present a novel phenomenon that we term the center-oriented optimization (COO) problem, where a model's output converges towards the center point of similar high-resolution images, rather than towards the ground truth. We demonstrate that the strength of this problem is related to the uncertainty of data, which we quantify using entropy. We prove that as the entropy of high-resolution images increases, their center point will move further away from the clean image distribution, and the model will generate over-smoothed images. Implicitly optimizing the COO problem …
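
The paper's formal treatment is only summarized above; as a hedged reconstruction, the standard identity below (with p(x' | y) denoting the distribution of high-resolution images consistent with a low-resolution input y) shows why an MSE-style objective pulls the prediction toward the "center point", i.e., the conditional mean, and how entropy enters as a measure of that distribution's spread.

```latex
\hat{x}(y) \;=\; \arg\min_{x}\; \mathbb{E}_{x' \sim p(x' \mid y)} \big\| x - x' \big\|_2^2
          \;=\; \mathbb{E}\big[\, x' \mid y \,\big],
\qquad
H\big(p(\cdot \mid y)\big) \;=\; -\int p(x' \mid y)\,\log p(x' \mid y)\,\mathrm{d}x' .
```

Larger entropy means the plausible high-resolution images are more spread out, so their conditional mean lies further from any individual sharp image, which matches the over-smoothing behavior the paper describes.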

Defense Against Adversarial Attacks Using Topology Aligning Adversarial Training

Authors

Huafeng Kuang, Hong Liu, Xianming Lin, Rongrong Ji

Journal

IEEE Transactions on Information Forensics and Security

Published Date

2024/1/29

Recent works have indicated that deep neural networks (DNNs) are vulnerable to adversarial attacks, wherein an attacker perturbs an input example with human-imperceptible noise that can easily fool the DNNs, resulting in incorrect predictions. This severely limits the application of deep learning in security-critical scenarios, such as face authentication. Adversarial training (AT) is one of the most practical approaches to strengthening the robustness of DNNs. However, existing AT-based methods treat each training sample independently, thereby ignoring the underlying topological structure in the training data. To this end, in this paper, we take full advantage of the topology information and introduce a Topology Aligning Adversarial Training (TAAT) algorithm. TAAT aims to encourage the trained model to maintain consistency in the topological structure within the feature space of both natural and adversarial …
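
The abstract does not specify how topological structure is measured; the sketch below shows one plausible batch-level consistency term over pairwise cosine similarities (function and tensor names are assumptions, not the TAAT formulation).

```python
import torch
import torch.nn.functional as F

def topology_consistency_loss(feat_nat, feat_adv):
    """Hypothetical sketch of one way to align batch-level topology.

    feat_nat, feat_adv: (B, D) features of natural and adversarial examples.
    Builds a pairwise cosine-similarity matrix for each batch and penalizes
    their discrepancy, encouraging the same relational structure.
    """
    nat = F.normalize(feat_nat, dim=1)
    adv = F.normalize(feat_adv, dim=1)
    sim_nat = nat @ nat.t()                  # (B, B) relational "topology"
    sim_adv = adv @ adv.t()
    return F.mse_loss(sim_adv, sim_nat)
```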

Identity-Aware Variational Autoencoder for Face Swapping

Authors

Zonglin Li, Zhaoxin Zhang, Shengfeng He, Quanling Meng, Shengping Zhang, Bineng Zhong, Rongrong Ji

Journal

IEEE Transactions on Circuits and Systems for Video Technology

Published Date

2024/1/4

Face swapping aims to transfer the identity of a source face to a target face image while preserving the target attributes (e.g., facial expression, head pose, illumination, and background). Most existing methods use a face recognition model to extract global features from the source face and directly fuse them with the target to generate a swapping result. However, identity-irrelevant attributes (e.g., hairstyle and facial appearances) contribute a lot to the recognition task, and thus swapping this task-specific feature inevitably interfuses source attributes with target ones. In this paper, we propose an identity-aware variational autoencoder (ID-VAE) based face swapping framework, dubbed VAFSwap, which learns disentangled identity and attribute representations for high-fidelity face swapping. In particular, we overcome the unpaired training barrier of VAE and impose a proxy identity on the latent space by exploiting the …

CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes

Authors

Yulei Qin, Xingyu Chen, Yunhang Shen, Chaoyou Fu, Yun Gu, Ke Li, Xing Sun, Rongrong Ji

Journal

Advances in Neural Information Processing Systems

Published Date

2024/2/13

Webly supervised learning has attracted increasing attention for its effectiveness in exploring publicly accessible data at scale without manual annotation. However, most existing methods of learning with web datasets are faced with challenges from label noise, and they have limited assumptions on clean samples under various noise. For instance, web images retrieved with queries of "tiger cat" (a cat species) and "drumstick" (a musical instrument) are almost dominated by images of tigers and chickens, which exacerbates the challenge of fine-grained visual concept learning. In this case, exploiting both web images and their associated texts is a requisite solution to combat real-world noise. In this paper, we propose Cross-modality Aligned Prototypes (CAPro), a unified prototypical contrastive learning framework to learn visual representations with correct semantics. For one thing, we leverage textual prototypes, which stem from the distinct concept definition of classes, to select clean images by text matching and thus disambiguate the formation of visual prototypes. For another, to handle missing and mismatched noisy texts, we resort to the visual feature space to complete and enhance individual texts and thereafter improve text matching. Such semantically aligned visual prototypes are further polished up with high-quality samples, and engaged in both cluster regularization and noise removal. Besides, we propose collective bootstrapping to encourage smoother and wiser label reference from appearance-similar instances in a manner of dictionary look-up. Extensive experiments on WebVision1k and NUS-WIDE (Web) demonstrate that CAPro …
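
As a rough illustration of the "select clean images by text matching" step described above, the snippet below scores each web sample's paired text against the textual prototype of its assigned class and keeps the best-matching fraction (all names, the cosine-similarity choice, and the keep ratio are assumptions, not the CAPro implementation).

```python
import torch
import torch.nn.functional as F

def select_clean_by_text_matching(text_emb, class_prototypes, labels, keep_ratio=0.5):
    """Hypothetical sketch of prototype-based clean-sample selection.

    text_emb:         (N, D) embeddings of the texts paired with web images
    class_prototypes: (C, D) textual prototypes, one per class
    labels:           (N,) noisy web labels (LongTensor)
    Keeps the samples whose paired text best matches the prototype of their class.
    """
    sims = F.normalize(text_emb, dim=1) @ F.normalize(class_prototypes, dim=1).t()  # (N, C)
    match = sims[torch.arange(len(labels)), labels]   # similarity to own class prototype
    k = max(1, int(keep_ratio * len(labels)))
    return match.topk(k).indices                      # indices of the "cleanest" samples
```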

Toward Open-Set Human Object Interaction Detection

Authors

Mingrui Wu, Yuqi Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Journal

Proceedings of the AAAI Conference on Artificial Intelligence

Published Date

2024/3/24

This work is oriented toward the task of open-set Human Object Interaction (HOI) detection. The challenge lies in identifying completely new, out-of-domain relationships, as opposed to in-domain ones which have seen improvements in zero-shot HOI detection. To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model for detecting novel relationships by integrating an open-set object detector with a Visual Language Model (VLM). We utilize a disentangled image-text contrastive learning metric for training and connect the bottom-up visual features to text embeddings through lightweight unary and pair-wise adapters. Our model can benefit from the open-set object detector and the VLM to detect novel action categories and combine actions with novel object categories. We further present the VG-HOI dataset, a comprehensive benchmark with over 17k HOI relationships for open-set scenarios. Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes.

Functionally Similar Multi-Label Knowledge Distillation

Authors

Binghan Chen, Jianlong Hu, Xiawu Zheng, Wei Lin, Fei Chao, Rongrong Ji

Published Date

2024/4/14

Existing multi-label knowledge distillation methods simply use regression or single-label classification methods without fully exploiting the essence of multi-label classification, resulting in student models’ inadequate performance and poor functional similarity to teacher models. In this paper, we reinterpret multi-label classification as multiple intra-class ranking tasks, with each class corresponding to a ranking task. Furthermore, we define the knowledge of multi-label classification models as the ranking of intra-class samples. On the one hand, we propose to evaluate the functional similarity between multi-label classification models with Kendall’s tau and rank-biased overlap, which are common metrics for evaluating ranking similarity. On the other hand, we propose a new functionally similar multi-label knowledge distillation method called FSD, which enables student models to learn the ranking of intra-class samples …
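
The abstract explicitly names Kendall's tau and rank-biased overlap as the functional-similarity metrics; the toy snippet below computes both for a single class's teacher/student score rankings (the RBO helper follows the common truncated definition and, like the toy scores, is illustrative rather than the authors' code).

```python
from scipy.stats import kendalltau

def rank_biased_overlap(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap between two ranked lists (hypothetical helper)."""
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b, rbo = set(), set(), 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        rbo += (p ** (d - 1)) * (len(seen_a & seen_b) / d)
    return (1 - p) * rbo

# Intra-class ranking similarity between teacher and student scores for one class.
teacher_scores = [0.9, 0.2, 0.7, 0.4]          # toy confidences
student_scores = [0.8, 0.3, 0.6, 0.5]
tau, _ = kendalltau(teacher_scores, student_scores)
rank_t = sorted(range(4), key=lambda i: -teacher_scores[i])
rank_s = sorted(range(4), key=lambda i: -student_scores[i])
print(tau, rank_biased_overlap(rank_t, rank_s))
```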

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

Authors

Timin Gao, Peixian Chen, Mengdan Zhang, Chaoyou Fu, Yunhang Shen, Yan Zhang, Shengchuan Zhang, Xiawu Zheng, Xing Sun, Liujuan Cao, Rongrong Ji

Journal

arXiv preprint arXiv:2404.16033

Published Date

2024/4/24

With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, visual reasoning problems are usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of potential "determining hallucinations" in decision-making, due to insufficient visual information and the limitation of low-level perception tools that fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into the realm of multimodal CoT to solve intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to perform as multifaceted experts for deriving higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without necessitating fine-tuning or ground-truth rationales. Project page: https://ggg0919.github.io/cantor/.

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Authors

Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji

Journal

arXiv preprint arXiv:2403.03003

Published Date

2024/3/5

Despite remarkable progress, existing multimodal large language models (MLLMs) are still inferior in granular visual recognition. Contrary to previous works, we study this problem from the perspective of image resolution, and reveal that a combination of low- and high-resolution visual features can effectively mitigate this shortcoming. Based on this observation, we propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA). In particular, MRA adopts two visual pathways for images at different resolutions, where high-resolution visual information is embedded into the low-resolution pathway via novel mixture-of-resolution adapters (MR-Adapters). This design also greatly reduces the input sequence length of MLLMs. To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR. We conduct extensive experiments on 11 vision-language (VL) tasks, which show that LLaVA-HR outperforms existing MLLMs on 8 VL tasks, e.g., +9.4% on TextVQA. More importantly, both training and inference of LLaVA-HR remain efficient with MRA, e.g., 20 training hours and 3× the inference speed of LLaVA-1.5. Source code is released at: https://github.com/luogen1996/LLaVA-HR.
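
The abstract only outlines the MR-Adapter idea; the sketch below is one speculative reading in which projected high-resolution features are pooled to the low-resolution grid and gated into the low-resolution pathway (module and parameter names are assumptions, not the released LLaVA-HR code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRAdapterSketch(nn.Module):
    """Hypothetical mixture-of-resolution adapter: inject high-resolution
    features into a low-resolution pathway. Illustrative only."""

    def __init__(self, hi_dim, lo_dim):
        super().__init__()
        self.proj = nn.Conv2d(hi_dim, lo_dim, kernel_size=1)
        self.gate = nn.Parameter(torch.zeros(1))   # start close to identity

    def forward(self, lo_feat, hi_feat):
        # lo_feat: (B, lo_dim, h, w); hi_feat: (B, hi_dim, H, W) with H >= h
        hi = self.proj(hi_feat)
        hi = F.adaptive_avg_pool2d(hi, lo_feat.shape[-2:])
        return lo_feat + torch.tanh(self.gate) * hi
```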

Training-Free Transformer Architecture Search With Zero-Cost Proxy Guided Evolution

Authors

Qinqin Zhou, Kekai Sheng, Xiawu Zheng, Ke Li, Yonghong Tian, Jie Chen, Rongrong Ji

Journal

IEEE Transactions on Pattern Analysis and Machine Intelligence

Published Date

2024/3/19

Transformers have shown remarkable performance; however, their architecture design is a time-consuming process that demands expertise and trial-and-error. Thus, it is worthwhile to investigate efficient methods for automatically searching for high-performance Transformers via Transformer Architecture Search (TAS). To improve search efficiency, training-free proxy-based methods have been widely adopted in Neural Architecture Search (NAS). However, these proxies have been found to generalize poorly to Transformer search spaces, as confirmed by several studies and our own experiments. This paper presents an effective scheme for TAS, termed Transformer Architecture search with Zero-cost pRoxy guided evolution (T-Razor), that achieves exceptional efficiency. Firstly, through theoretical analysis, we discover that the synaptic diversity of multi-head self-attention (MSA) and …

Unified-Width Adaptive Dynamic Network for All-In-One Image Restoration

Authors

Yimin Xu, Nanxi Gao, Zhongyun Shan, Fei Chao, Rongrong Ji

Journal

arXiv preprint arXiv:2401.13221

Published Date

2024/1/24

In contrast to traditional image restoration methods, all-in-one image restoration techniques are gaining increased attention for their ability to restore images affected by diverse and unknown corruption types and levels. However, contemporary all-in-one image restoration methods omit task-wise difficulties and employ the same networks to reconstruct images afflicted by diverse degradations. This practice leads to an underestimation of the task correlations and suboptimal allocation of computational resources. To elucidate task-wise complexities, we introduce a novel concept positing that intricate image degradation can be represented in terms of elementary degradation. Building upon this foundation, we propose an innovative approach, termed the Unified-Width Adaptive Dynamic Network (U-WADN), consisting of two pivotal components: a Width Adaptive Backbone (WAB) and a Width Selector (WS). The WAB incorporates several nested sub-networks with varying widths, which facilitates the selection of the most apt computations tailored to each task, thereby striking a balance between accuracy and computational efficiency during runtime. For different inputs, the WS automatically selects the most appropriate sub-network width, taking into account both task-specific and sample-specific complexities. Extensive experiments across a variety of image restoration tasks demonstrate that the proposed U-WADN achieves better performance while simultaneously reducing up to 32.3% of FLOPs and providing approximately 15.7% real-time acceleration. The code has been made available at https://github.com/xuyimin0926/U-WADN.

Towards language-guided visual recognition via dynamic convolutions

Authors

Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yongjian Wu, Yue Gao, Rongrong Ji

Journal

International Journal of Computer Vision

Published Date

2024/1

In this paper, we are committed to establishing a unified and end-to-end multi-modal network via exploring language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called Language-guided Dynamic Convolution (LaConv). Its convolution kernels are dynamically generated based on natural language information, which can help extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build a fully language-driven convolution network, termed as LaConvNet, which can unify the visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on seven benchmark datasets of three vision-and-language tasks, i.e., visual question answering, referring expression comprehension and segmentation. The experimental results not …
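
As a hedged sketch of the language-guided dynamic convolution idea, the module below predicts per-sample depthwise kernels from a sentence embedding and applies them via grouped convolution (names and the depthwise design are assumptions for illustration, not the released LaConv module).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedDepthwiseConv(nn.Module):
    """Hypothetical language-guided dynamic convolution: per-sample depthwise
    kernels are generated from a text embedding. Illustrative only."""

    def __init__(self, channels, text_dim, kernel_size=3):
        super().__init__()
        self.channels, self.k = channels, kernel_size
        self.kernel_gen = nn.Linear(text_dim, channels * kernel_size * kernel_size)

    def forward(self, x, text_emb):
        # x: (B, C, H, W); text_emb: (B, text_dim)
        B, C, H, W = x.shape
        w = self.kernel_gen(text_emb).view(B * C, 1, self.k, self.k)
        x = x.reshape(1, B * C, H, W)              # fold batch into channels
        y = F.conv2d(x, w, padding=self.k // 2, groups=B * C)
        return y.reshape(B, C, H, W)
```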

Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models

Authors

Qiong Wu, Wei Yu, Yiyi Zhou, Shubin Huang, Xiaoshuai Sun, Rongrong Ji

Published Date

2023/9/4

With ever-increasing parameters and computation, vision-language pre-trained (VLP) models exhibit prohibitive expenditure in downstream task adaptation. Recent endeavors mainly focus on parameter-efficient transfer learning (PETL) for VLP models by only updating a small number of parameters. However, excessive computational overhead still plagues the application of VLPs. In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for VLP models. In particular, PCETL not only needs to limit the number of trainable parameters in VLP models, but also to reduce the computational redundancy during inference, thus enabling a more efficient transfer. To approach this target, we propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL. Instead of directly optimizing the intrinsic architectures of VLP models, DAS first observes the significance of their modules to downstream tasks via a reinforcement learning (RL) based process, and then skips the redundant ones with lightweight networks, i.e., adapters, according to the obtained rewards. In this case, the VLP model can well maintain the scale of trainable parameters while speeding up its inference on downstream tasks. To validate DAS, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a bunch of VL tasks. The experimental results not only show the great advantages of DAS in reducing computational complexity, e.g., -11.97% FLOPs of METER on VQA2.0, but also confirm its competitiveness against existing PETL methods in terms of parameter scale and performance. Our …

Learning Image Demoiréing from Unpaired Real Data

Authors

Yunshan Zhong, Yuyao Zhou, Yuxin Zhang, Fei Chao, Rongrong Ji

Journal

Proceedings of the AAAI Conference on Artificial Intelligence

Published Date

2024/3/24

This paper focuses on addressing the issue of image demoiréing. Unlike the large volume of existing studies that rely on learning from paired real data, we attempt to learn a demoiréing model from unpaired real data, i.e., moiré images associated with irrelevant clean images. The proposed method, referred to as Unpaired Demoiréing (UnDeM), synthesizes pseudo moiré images from unpaired datasets, generating pairs with clean images for training demoiréing models. To achieve this, we divide real moiré images into patches and group them in compliance with their moiré complexity. We introduce a novel moiré generation framework to synthesize moiré images with diverse moiré features, resembling real moiré patches, and details akin to real moiré-free images. Additionally, we introduce an adaptive denoise method to eliminate the low-quality pseudo moiré images that adversely impact the learning of demoiréing models. We conduct extensive experiments on the commonly-used FHDMi and UHDM datasets. Results manifest that our UnDeM performs better than existing methods when using existing demoiréing models such as MBCNN and ESDNet-L.

CycleTrans: Learning Neutral Yet Discriminative Features via Cycle Construction for Visible-Infrared Person Re-Identification

Authors

Qiong Wu, Jiaer Xia, Pingyang Dai, Yiyi Zhou, Yongjian Wu, Rongrong Ji

Journal

arXiv preprint arXiv:2208.09844

Published Date

2022/8/21

Visible-infrared person re-identification (VI-ReID) is a task of matching the same individuals across the visible and infrared modalities. Its main challenge lies in the modality gap caused by cameras operating on different spectra. Existing VI-ReID methods mainly focus on learning general features across modalities, often at the expense of feature discriminability. To address this issue, we present a novel cycle-construction-based network for neutral yet discriminative feature learning, termed CycleTrans. Specifically, CycleTrans uses a lightweight Knowledge Capturing Module (KCM) to capture rich semantics from the modality-relevant feature maps according to pseudo queries. Afterwards, a Discrepancy Modeling Module (DMM) is deployed to transform these features into neutral ones according to the modality-irrelevant prototypes. To ensure feature discriminability, another two KCMs are further deployed for feature cycle constructions. With cycle construction, our method can learn effective neutral features for visible and infrared images while preserving their salient semantics. Extensive experiments on SYSU-MM01 and RegDB datasets validate the merits of CycleTrans against a flurry of state-of-the-art methods, +4.57% on rank-1 in SYSU-MM01 and +2.2% on rank-1 in RegDB.

Semi-supervised Counting via Pixel-by-pixel Density Distribution Modelling

Authors

Hui Lin, Zhiheng Ma, Rongrong Ji, Yaowei Wang, Zhou Su, Xiaopeng Hong, Deyu Meng

Journal

arXiv preprint arXiv:2402.15297

Published Date

2024/2/23

This paper focuses on semi-supervised crowd counting, where only a small portion of the training data are labeled. We formulate the pixel-wise density value to regress as a probability distribution, instead of a single deterministic value. On this basis, we propose a semi-supervised crowd-counting model. Firstly, we design a pixel-wise distribution matching loss to measure the differences in the pixel-wise density distributions between the prediction and the ground truth; Secondly, we enhance the transformer decoder by using density tokens to specialize the forwards of decoders w.r.t. different density intervals; Thirdly, we design the interleaving consistency self-supervised learning mechanism to learn from unlabeled data efficiently. Extensive experiments on four datasets are performed to show that our method clearly outperforms the competitors by a large margin under various labeled ratio settings. Code will be released at https://github.com/LoraLinH/Semi-supervised-Counting-via-Pixel-by-pixel-Density-Distribution-Modelling.

CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method

Authors

Mingbao Lin, Zhihang Lin, Wengyi Zhan, Liujuan Cao, Rongrong Ji

Journal

arXiv preprint arXiv:2404.15141

Published Date

2024/4/23

Transforming large pre-trained low-resolution diffusion models to cater to higher-resolution demands, i.e., diffusion extrapolation, significantly improves diffusion adaptability. We propose tuning-free CutDiffusion, aimed at simplifying and accelerating the diffusion extrapolation process, making it more affordable and improving performance. CutDiffusion abides by the existing patch-wise extrapolation but cuts a standard patch diffusion process into an initial phase focused on comprehensive structure denoising and a subsequent phase dedicated to specific detail refinement. Comprehensive experiments highlight the numerous almighty advantages of CutDiffusion: (1) simple method construction that enables a concise higher-resolution diffusion process without third-party engagement; (2) fast inference speed achieved through a single-step higher-resolution diffusion process, and fewer inference patches required; (3) cheap GPU cost resulting from patch-wise inference and fewer patches during the comprehensive structure denoising; (4) strong generation performance, stemming from the emphasis on specific detail refinement.

AffineQuant: Affine Transformation Quantization for Large Language Models

Authors

Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, Rongrong Ji

Journal

arXiv preprint arXiv:2403.12544

Published Date

2024/3/19

The significant resource requirements associated with Large-scale Language Models (LLMs) have generated considerable interest in the development of techniques aimed at compressing and accelerating neural networks. Among these techniques, Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its noteworthy compression efficiency and cost-effectiveness in the context of training. Existing PTQ methods for LLMs limit the optimization scope to scaling transformations between pre- and post-quantization weights. In this paper, we advocate for direct optimization using equivalent affine transformations in PTQ (AffineQuant). This approach extends the optimization scope and thus significantly reduces quantization errors. Additionally, by employing the corresponding inverse matrix, we can ensure equivalence between the pre- and post-quantization outputs of PTQ, thereby maintaining its efficiency and generalization capabilities. To ensure the invertibility of the transformation during optimization, we further introduce a gradual mask optimization method. This method initially focuses on optimizing the diagonal elements and gradually extends to the other elements. Such an approach aligns with the Levy–Desplanques theorem, theoretically ensuring invertibility of the transformation. As a result, significant performance improvements are evident across different LLMs on diverse datasets. To illustrate, we attain a C4 perplexity of 15.76 (2.26 lower vs. 18.02 in OmniQuant) on the LLaMA2-7B model under W4A4 quantization without overhead. On zero-shot tasks, AffineQuant achieves an average of 58.61 accuracy …
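
The equivalence the abstract alludes to can be written compactly as follows (a reconstruction from the description above, not the paper's notation), where A is an invertible matrix optimized so that the quantized factors q(·) reproduce the original output as closely as possible; scaling-only PTQ corresponds to restricting A to be diagonal.

```latex
Y \;=\; X W \;=\; \big(X A^{-1}\big)\,\big(A W\big)
\quad\Longrightarrow\quad
\hat{Y} \;\approx\; q\!\big(X A^{-1}\big)\; q\!\big(A W\big).
```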

Rongrong Ji 纪荣嵘 FAQs

What is Rongrong Ji 纪荣嵘's h-index at Xiamen University?

The h-index of Rongrong Ji 纪荣嵘 has been 66 since 2020 and 78 in total.

What are Rongrong Ji 纪荣嵘's top articles?

The top articles of Rongrong Ji 纪荣嵘 at Xiamen University include:

Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization

ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion

DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation

Uncovering the Over-Smoothing Challenge in Image Super-Resolution: Entropy-Based Quantification and Contrastive Optimization

Defense Against Adversarial Attacks Using Topology Aligning Adversarial Training

Identity-Aware Variational Autoencoder for Face Swapping

CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes

Toward Open-Set Human Object Interaction Detection

...

What are Rongrong Ji 纪荣嵘's research interests?

The research interests of Rongrong Ji 纪荣嵘 are: Model Compression, Neural Architecture Search, and Image Retrieval.

What is Rongrong Ji 纪荣嵘's total number of citations?

Rongrong Ji 纪荣嵘 has 26,114 citations in total.
