Eric Xing

Carnegie Mellon University

H-index: 114

North America-United States

Eric Xing Information

University

Carnegie Mellon University

Position

President, Mohamed bin Zayed University of Artificial Intelligence; Professor of Computer Science, Carnegie Mellon University

Citations(all)

57613

Citations(since 2020)

33465

Cited By

37934

hIndex(all)

114

hIndex(since 2020)

87

i10Index(all)

435

i10Index(since 2020)

340

Eric Xing Skills & Research Interests

Machine Learning

ML Systems

Optimization

Statistics

Network Analysis

Top articles of Eric Xing

Learning to Prompt Segment Anything Models

Authors

Jiaxing Huang,Kai Jiang,Jingyi Zhang,Han Qiu,Lewei Lu,Shijian Lu,Eric Xing

Journal

arXiv preprint arXiv:2401.04651

Published Date

2024/1/9

Segment Anything Models (SAMs) like SEEM and SAM have demonstrated great potential in learning to segment anything. The core design of SAMs lies in Promptable Segmentation, which takes a handcrafted prompt as input and returns the expected segmentation mask. SAMs work with two types of prompts, including spatial prompts (e.g., points) and semantic prompts (e.g., texts), which work together to prompt SAMs to segment anything on downstream datasets. Despite the important role of prompts, how to acquire suitable prompts for SAMs is largely under-explored. In this work, we examine the architecture of SAMs and identify two challenges for learning effective prompts for SAMs. To this end, we propose spatial-semantic prompt learning (SSPrompt) that learns effective semantic and spatial prompts for better SAMs. Specifically, SSPrompt introduces spatial prompt learning and semantic prompt learning, which optimize spatial prompts and semantic prompts directly over the embedding space and selectively leverage the knowledge encoded in pre-trained prompt encoders. Extensive experiments show that SSPrompt achieves superior image segmentation performance consistently across multiple widely adopted datasets.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Authors

Lianmin Zheng,Wei-Lin Chiang,Ying Sheng,Siyuan Zhuang,Zhanghao Wu,Yonghao Zhuang,Zi Lin,Zhuohan Li,Dacheng Li,Eric Xing,Hao Zhang,Joseph E Gonzalez,Ion Stoica

Journal

Advances in Neural Information Processing Systems

Published Date

2024/2/13

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement as between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.
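
The pairwise judging protocol described above is easy to prototype. Below is a minimal sketch, assuming a generic query_llm chat-completion callable and an illustrative prompt (not the exact MT-bench judge prompt); it swaps answer positions and accepts a winner only when both orderings agree, mitigating the position bias the paper discusses.

```python
# Minimal sketch of pairwise LLM-as-a-judge with position-bias mitigation.
# `query_llm` is a placeholder for any chat-completion client; the prompt
# wording is illustrative, not the exact MT-bench judge prompt.

JUDGE_TEMPLATE = """[Question]
{question}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}

Which answer is better? Reply with exactly "A", "B", or "tie"."""


def judge_pair(query_llm, question, answer_a, answer_b):
    """Judge a pair twice with swapped positions to counter position bias."""
    verdict_1 = query_llm(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b)).strip()
    verdict_2 = query_llm(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_b, answer_b=answer_a)).strip()
    # Map the second verdict back into the original (unswapped) frame.
    swapped = {"A": "B", "B": "A", "tie": "tie"}.get(verdict_2, "tie")
    # Only accept a winner if both orderings agree; otherwise call it a tie.
    return verdict_1 if verdict_1 == swapped else "tie"
```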

MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

Authors

Omkar Thawakar,Ashmal Vayani,Salman Khan,Hisham Cholakal,Rao M Anwer,Michael Felsberg,Tim Baldwin,Eric P Xing,Fahad Shahbaz Khan

Journal

arXiv preprint arXiv:2402.16840

Published Date

2024/2/26

"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. However, LLMs do not suit well for scenarios that require on-device processing, energy efficiency, low memory footprint, and response efficiency. These requisites are crucial for privacy, security, and sustainable deployment. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices. Our primary contribution is the introduction of an accurate and fully transparent open-source 0.5 billion (0.5B) parameter SLM, named MobiLlama, catering to the specific needs of resource-constrained computing with an emphasis on enhanced performance with reduced resource demands. MobiLlama is a SLM design that initiates from a larger model and applies a careful parameter sharing scheme to reduce both the pre-training and the deployment cost. Our work strives to not only bridge the gap in open-source SLMs but also ensures full transparency, where complete training data pipeline, training code, model weights, and over 300 checkpoints along with evaluation codes is available at : https://github.com/mbzuai-oryx/MobiLlama.

Temporally Disentangled Representation Learning under Unknown Nonstationarity

Authors

Xiangchen Song,Weiran Yao,Yewen Fan,Xinshuai Dong,Guangyi Chen,Juan Carlos Niebles,Eric Xing,Kun Zhang

Journal

NeurIPS 2023

Published Date

2023/10/28

In unsupervised causal representation learning for sequential data with time-delayed latent causal influences, strong identifiability results for the disentanglement of causally related latent variables have been established in stationary settings by leveraging temporal structure. However, in nonstationary settings, existing work has only partially addressed the problem, either utilizing observed auxiliary variables (e.g., class labels and/or domain indexes) as side information or assuming simplified latent causal dynamics. Both constrain the method to a limited range of scenarios. In this study, we further explore the Markov assumption under time-delayed causally related processes in nonstationary settings and show that, under mild conditions, the independent latent components can be recovered from their nonlinear mixture up to a permutation and a component-wise transformation, without observation of auxiliary variables. We then introduce NCTRL, a principled estimation framework, to reconstruct time-delayed latent causal variables and identify their relations from measured sequential data only. Empirical evaluations demonstrate the reliable identification of time-delayed latent causal influences, with our methodology substantially outperforming existing baselines that fail to exploit the nonstationarity adequately and consequently cannot distinguish distribution shifts.

Cappy: Outperforming and boosting large multi-task LMs with a small scorer

Authors

Bowen Tan,Yun Zhu,Lijuan Liu,Eric Xing,Zhiting Hu,Jindong Chen

Journal

Advances in Neural Information Processing Systems

Published Date

2024/2/13

Large language models (LLMs) such as T0, FLAN, and OPT-IML excel in multi-tasking under a unified instruction-following paradigm, where they also exhibit remarkable generalization abilities to unseen tasks. Despite their impressive performance, these LLMs, with sizes ranging from several billion to hundreds of billions of parameters, demand substantial computational resources, making their training and inference expensive and inefficient. Furthermore, adapting these models to downstream applications, particularly complex tasks, is often unfeasible due to the extensive hardware requirements for finetuning, even when utilizing parameter-efficient approaches such as prompt tuning. Additionally, the most powerful multi-task LLMs, such as OPT-IML-175B and FLAN-PaLM-540B, are not publicly accessible, severely limiting their customization potential. To address these challenges, we introduce a pretrained small scorer, Cappy, designed to enhance the performance and efficiency of multi-task LLMs. With merely 360 million parameters, Cappy functions either independently on classification tasks or serves as an auxiliary component for LLMs, boosting their performance. Moreover, Cappy enables efficient integration of downstream supervision without requiring LLM finetuning or access to their parameters. Our experiments demonstrate that, when working independently on 11 language understanding tasks from PromptSource, Cappy outperforms LLMs that are several orders of magnitude larger. Besides, on 45 complex tasks from BIG-Bench, Cappy boosts the performance of the advanced multi-task LLM, FLAN-T5, by a large margin …
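
As a rough illustration of using a small scorer alongside a frozen LLM, the sketch below assumes a score(instruction, response) callable standing in for a pretrained scorer such as Cappy; candidate generation and scorer training are out of scope here.

```python
def rerank_with_scorer(score, instruction, candidates):
    """Return the candidate response the small scorer rates highest.

    `score(instruction, response) -> float in [0, 1]` is a stand-in for a
    pretrained scorer like Cappy; this illustrates boosting a frozen LLM by
    reranking its sampled outputs, without finetuning or parameter access.
    """
    return max(candidates, key=lambda response: score(instruction, response))
```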

AttentionPert: Accurately Modeling Multiplexed Genetic Perturbations with Multi-scale Effects

Authors

Ding Bai,Caleb Ellington,Shentong Mo,Le Song,Eric Xing

Journal

bioRxiv

Published Date

2024/2/7

Genetic perturbations (i.e., knockouts, variants) have laid the foundation for our understanding of many diseases, implicating pathogenic mechanisms and indicating therapeutic targets. However, experimental assays are fundamentally limited in the number of perturbation conditions they can measure. Computational methods can fill this gap by predicting perturbation effects under unseen conditions, but accurately predicting the transcriptional responses of cells to unseen perturbations remains a significant challenge. We address this by developing a novel attention-based neural network, AttentionPert, which accurately predicts gene expression under multiplexed perturbations and generalizes to unseen conditions. AttentionPert integrates global and local effects in a multi-scale model, representing both the non-uniform system-wide impact of the genetic perturbation and the localized disturbance in a network of gene-gene similarities, enhancing its ability to predict nuanced transcriptional responses to both single and multi-gene perturbations. In comprehensive experiments, AttentionPert demonstrates superior performance across multiple datasets, outperforming the state-of-the-art method in predicting differential gene expressions and revealing novel gene regulations. AttentionPert marks a significant improvement over current methods, particularly in handling the diversity of gene perturbations and in predicting out-of-distribution scenarios.

Squeeze, recover and relabel: Dataset condensation at ImageNet scale from a new perspective

Authors

Zeyuan Yin*,Eric Xing,Zhiqiang Shen*

Journal

Advances in Neural Information Processing Systems, Spotlight

Published Date

2024/2/13

We present a new dataset condensation framework termed Squeeze, Recover and Relabel (SRe²L) that decouples the bilevel optimization of model and synthetic data during training, to handle varying scales of datasets, model architectures and image resolutions for efficient dataset condensation. The proposed method demonstrates flexibility across diverse dataset scales and exhibits multiple advantages in terms of arbitrary resolutions of synthesized images, low training cost and memory consumption with high-resolution synthesis, and the ability to scale up to arbitrary evaluation network architectures. Extensive experiments are conducted on Tiny-ImageNet and full ImageNet-1K datasets. Under 50 IPC, our approach achieves the highest 42.5% and 60.8% validation accuracy on Tiny-ImageNet and ImageNet-1K, outperforming all previous state-of-the-art methods by margins of 14.5% and 32.9%, respectively. Our approach also surpasses MTT in speed by approximately 52x (ConvNet-4) and 16x (ResNet-18), with 11.6x and 6.4x less memory consumption during data synthesis. Our code and condensed datasets of 50, 200 IPC with 4K recovery budget are available at https://github.com/VILA-Lab/SRe2L.

Generating, Reconstructing, and Representing Discrete and Continuous Data: Generalized Diffusion with Learnable Encoding-Decoding

Authors

Guangyi Liu,Yu Wang,Zeyu Feng,Qiyu Wu,Liping Tang,Yuan Gao,Zhen Li,Shuguang Cui,Julian McAuley,Eric P Xing,Zichao Yang,Zhiting Hu

Journal

arXiv preprint arXiv:2402.19009

Published Date

2024/2/29

The vast applications of deep generative models are anchored in three core capabilities -- generating new instances, reconstructing inputs, and learning compact representations -- across various data types, such as discrete text/protein sequences and continuous images. Existing model families, like Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), autoregressive models, and diffusion models, generally excel in specific capabilities and data types but fall short in others. We introduce generalized diffusion with learnable encoder-decoder (DiLED), which seamlessly integrates the core capabilities for broad applicability and enhanced performance. DiLED generalizes the Gaussian noising-denoising in standard diffusion by introducing parameterized encoding-decoding. Crucially, DiLED is compatible with the well-established diffusion model objective and training recipes, allowing effective learning of the encoder-decoder parameters jointly with diffusion. By choosing an appropriate encoder/decoder (e.g., large language models), DiLED naturally applies to different data types. Extensive experiments on text, proteins, and images demonstrate DiLED's flexibility to handle diverse data and tasks and its strong improvement over various existing models.

Toward Inference-optimal Mixture-of-Expert Large Language Models

Authors

Longfei Yun,Yonghao Zhuang,Yao Fu,Eric P Xing,Hao Zhang

Journal

arXiv preprint arXiv:2404.02852

Published Date

2024/4/3

Mixture-of-Expert (MoE) based large language models (LLMs), such as the recent Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering from the quadratic growth of training cost of dense transformers. Like dense models, training MoEs requires answering the same question: given a training budget, what is the optimal allocation between model size and number of tokens? We study the scaling law of MoE-based LLMs regarding the relations between model performance, model size, dataset size, and the expert degree. Echoing previous research on MoE in different contexts, we observe diminishing returns from increasing the number of experts; yet because the training cost stays roughly constant as experts are added, this would seem to suggest scaling the number of experts until saturation, which is problematic at inference time. We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss. We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance level, but cost 2.5-3.5x more to train. On the other hand, training a (16/32)-expert MoE that is much smaller (70-85%) than the loss-optimal solution, but on a larger training dataset, is a promising setup under a fixed training budget.

Semantic-aligned matching for enhanced DETR convergence and multi-scale feature fusion

Authors

Gongjie Zhang,Zhipeng Luo,Jiaxing Huang,Shijian Lu,Eric P Xing

Journal

International Journal of Computer Vision (IJCV)

Published Date

2024

The recently proposed DEtection TRansformer (DETR) has established a fully end-to-end paradigm for object detection. However, DETR suffers from slow training convergence, which hinders its applicability to various detection tasks. We observe that DETR’s slow convergence is largely attributed to the difficulty in matching object queries to relevant regions due to the unaligned semantics between object queries and encoded image features. With this observation, we design Semantic-Aligned-Matching DETR++ (SAM-DETR++) to accelerate DETR’s convergence and improve detection performance. The core of SAM-DETR++ is a plug-and-play module that projects object queries and encoded image features into the same feature embedding space, where each object query can be easily matched to relevant regions with similar semantics. Besides, SAM-DETR++ searches for multiple representative keypoints and …

Making scalable meta learning practical

Authors

Sang Choe,Sanket Vaibhav Mehta,Hwijeen Ahn,Willie Neiswanger,Pengtao Xie,Emma Strubell,Eric Xing

Journal

Advances in neural information processing systems

Published Date

2024/2/13

Despite its flexibility to learn diverse inductive biases in machine learning programs, meta learning (i.e., learning to learn) has long been recognized to suffer from poor scalability due to its tremendous compute/memory costs, training instability, and a lack of efficient distributed training support. In this work, we focus on making scalable meta learning practical by introducing SAMA, which combines advances in both implicit differentiation algorithms and systems. Specifically, SAMA is designed to flexibly support a broad range of adaptive optimizers in the base level of meta learning programs, while reducing computational burden by avoiding explicit computation of second-order gradient information, and exploiting efficient distributed training techniques implemented for first-order gradients. Evaluated on multiple large-scale meta learning benchmarks, SAMA showcases up to a 1.7x/4.8x increase in throughput and a 2.0x/3.8x decrease in memory consumption on single-/multi-GPU setups compared to other baseline meta learning algorithms. Furthermore, we show that SAMA-based data optimization leads to consistent improvements in text classification accuracy with BERT and RoBERTa large language models, and achieves state-of-the-art results in both small- and large-scale data pruning on image classification tasks, demonstrating the practical applicability of scalable meta learning across language and vision domains.

Efficient Test-Time Adaptation of Vision-Language Models

Authors

Adilbek Karmanov,Dayan Guan,Shijian Lu,Abdulmotaleb El Saddik,Eric Xing

Journal

arXiv preprint arXiv:2403.18293

Published Date

2024/3/27

Test-time adaptation with pre-trained vision-language models has attracted increasing attention for tackling distribution shifts during the test time. Though prior studies have achieved very promising performance, they involve intensive computation which is severely unaligned with test-time adaptation. We design TDA, a training-free dynamic adapter that enables effective and efficient test-time adaptation with vision-language models. TDA works with a lightweight key-value cache that maintains a dynamic queue with few-shot pseudo labels as values and the corresponding test-sample features as keys. Leveraging the key-value cache, TDA allows adapting to test data gradually via progressive pseudo label refinement which is super-efficient without incurring any backpropagation. In addition, we introduce negative pseudo labeling that alleviates the adverse impact of pseudo label noises by assigning pseudo labels to certain negative classes when the model is uncertain about its pseudo label predictions. Extensive experiments over two benchmarks demonstrate TDA's superior effectiveness and efficiency as compared with the state-of-the-art. The code has been released at https://kdiaaa.github.io/tda/.
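
A simplified sketch of such a training-free key-value cache is given below, assuming precomputed image features and zero-shot logits; the hyperparameter names are illustrative, and details such as TDA's entropy-based queue maintenance and negative pseudo labeling are omitted.

```python
import numpy as np
from collections import deque


class KeyValueCache:
    """Training-free cache in the spirit of TDA (a simplified sketch).

    Keys are test-sample features, values are one-hot pseudo labels taken
    from the model's own predictions; classification blends the zero-shot
    logits with a similarity-weighted vote over cached entries.
    """

    def __init__(self, num_classes, capacity=64, beta=5.0, alpha=1.0):
        self.entries = deque(maxlen=capacity)  # (feature, one_hot) pairs
        self.num_classes = num_classes
        self.beta, self.alpha = beta, alpha

    def update(self, feature, pseudo_label):
        one_hot = np.eye(self.num_classes)[pseudo_label]
        self.entries.append((feature / np.linalg.norm(feature), one_hot))

    def adapt_logits(self, feature, zero_shot_logits):
        if not self.entries:
            return zero_shot_logits
        feature = feature / np.linalg.norm(feature)
        keys = np.stack([k for k, _ in self.entries])      # (n, d)
        values = np.stack([v for _, v in self.entries])    # (n, C)
        # Affinity decays with cosine distance between query and cached keys.
        affinity = np.exp(-self.beta * (1.0 - keys @ feature))  # (n,)
        return zero_shot_logits + self.alpha * affinity @ values
```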

TrustLLM: Trustworthiness in large language models

Authors

Lichao Sun,Yue Huang,Haoran Wang,Siyuan Wu,Qihui Zhang,Chujie Gao,Yixin Huang,Wenhan Lyu,Yixuan Zhang,Xiner Li,Zhengliang Liu,Yixin Liu,Yijue Wang,Zhikun Zhang,Bhavya Kailkhura,Caiming Xiong,Chao Zhang,Chaowei Xiao,Chunyuan Li,Eric Xing,Furong Huang,Hao Liu,Heng Ji,Hongyi Wang,Huan Zhang,Huaxiu Yao,Manolis Kellis,Marinka Zitnik,Meng Jiang,Mohit Bansal,James Zou,Jian Pei,Jian Liu,Jianfeng Gao,Jiawei Han,Jieyu Zhao,Jiliang Tang,Jindong Wang,John Mitchell,Kai Shu,Kaidi Xu,Kai-Wei Chang,Lifang He,Lifu Huang,Michael Backes,Neil Zhenqiang Gong,Philip S Yu,Pin-Yu Chen,Quanquan Gu,Ran Xu,Rex Ying,Shuiwang Ji,Suman Jana,Tianlong Chen,Tianming Liu,Tianyi Zhou,Willian Wang,Xiang Li,Xiangliang Zhang,Xiao Wang,Xing Xie,Xun Chen,Xuyu Wang,Yan Liu,Yanfang Ye,Yinzhi Cao,Yue Zhao

Journal

arXiv preprint arXiv:2401.05561

Published Date

2024/1/10

Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, an evaluation and analysis of trustworthiness for mainstream LLMs, and a discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, covering over 30 datasets. Our findings first show that, in general, trustworthiness and utility (i.e., functional effectiveness) are positively related. Second, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Third, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize …

Identification of nonlinear latent hierarchical models

Authors

Lingjing Kong,Biwei Huang,Feng Xie,Eric Xing,Yuejie Chi,Kun Zhang

Published Date

2023/12

Identifying latent variables and causal structures from observational data is essential to many real-world applications involving biological data, medical data, and unstructured data such as images and languages. However, this task can be highly challenging, especially when observed variables are generated by causally related latent variables and the relationships are nonlinear. In this work, we investigate the identification problem for nonlinear latent hierarchical causal models in which observed variables are generated by a set of causally related latent variables, and some latent variables may not have observed children. We show that the identifiability of causal structures and latent variables (up to invertible transformations) can be achieved under mild assumptions: on causal structures, we allow for multiple paths between any pair of variables in the graph, which relaxes latent tree assumptions in prior work; on structural functions, we permit general nonlinearity and multi-dimensional continuous variables, alleviating existing work's parametric assumptions. Specifically, we first develop an identification criterion in the form of novel identifiability guarantees for an elementary latent variable model. Leveraging this criterion, we show that both causal structures and latent variables of the hierarchical model can be identified asymptotically by explicitly constructing an estimation procedure. To the best of our knowledge, our work is the first to establish identifiability guarantees for both causal structures and latent variables in nonlinear latent hierarchical models.

Counterfactual generation with identifiability guarantees

Authors

Hanqi Yan,Lingjing Kong,Lin Gui,Yuejie Chi,Eric Xing,Yulan He,Kun Zhang

Journal

Advances in Neural Information Processing Systems

Published Date

2023/12

Counterfactual generation lies at the core of various machine learning tasks, including image translation and controllable text generation. This generation process usually requires the identification of the disentangled latent representations, such as content and style, that underlie the observed data. However, it becomes more challenging when faced with a scarcity of paired data and labelling information. Existing disentangled methods crucially rely on oversimplified assumptions, such as assuming independent content and style variables, to identify the latent variables, even though such assumptions may not hold for complex data distributions. For instance, food reviews tend to involve words like “tasty”, whereas movie reviews commonly contain words such as “thrilling” for the same positive sentiment. This problem is exacerbated when data are sampled from multiple domains since the dependence between content and style may vary significantly over domains. In this work, we tackle the domain-varying dependence between the content and the style variables inherent in the counterfactual generation task. We provide identification guarantees for such latent-variable models by leveraging the relative sparsity of the influences from different latent variables. Our theoretical insights enable the development of a doMain AdapTive counTerfactual gEneration model, called MATTE. Our theoretically grounded framework achieves state-of-the-art performance in unsupervised style transfer tasks, where neither paired data nor style labels are utilized, across four large-scale datasets.

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

Authors

Zhenting Qi,Hanlin Zhang,Eric Xing,Sham Kakade,Himabindu Lakkaraju

Journal

arXiv preprint arXiv:2402.17840

Published Date

2024/2/27

Retrieval-Augmented Generation (RAG) improves pre-trained models by incorporating external knowledge at test time to enable customized adaptation. We study the risk of datastore leakage in Retrieval-In-Context RAG Language Models (LMs). We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore of RAG systems built with instruction-tuned LMs via prompt injection. The vulnerability exists for a wide range of modern LMs spanning Llama2, Mistral/Mixtral, Vicuna, SOLAR, WizardLM, Qwen1.5, and Platypus2, and exploitability worsens as the model size scales up. Extending our study to GPTs, a production RAG system, we design an attack that can cause datastore leakage with a 100% success rate on 25 randomly selected customized GPTs with at most 2 queries, and we extract text data verbatim at a rate of 41% from a book of 77,000 words and 3% from a corpus of 1,569,000 words by prompting the GPTs with only 100 queries generated by themselves.

Defending Against Poisoning Attacks in Federated Learning with Blockchain

Authors

Nanqing Dong,Zhipeng Wang,Jiahao Sun,Michael Kampffmeyer,William Knottenbelt,Eric Xing

Journal

IEEE Transactions on Artificial Intelligence

Published Date

2024/3/18

In the era of deep learning, federated learning (FL) presents a promising approach that allows multi-institutional data owners, or clients, to collaboratively train machine learning models without compromising data privacy. However, most existing FL approaches rely on a centralized server for global model aggregation, leading to a single point of failure. This makes the system vulnerable to malicious attacks when dealing with dishonest clients. In this work, we address this problem by proposing a secure and reliable FL system based on blockchain and distributed ledger technology. Our system incorporates a peer-to-peer voting mechanism and a reward-and-slash mechanism, which are powered by on-chain smart contracts, to detect and deter malicious behaviors. Both theoretical and empirical analyses are presented to demonstrate the effectiveness of the proposed approach, showing that our framework is robust …

Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models

Authors

Loka Li,Guangyi Chen,Yusheng Su,Zhenhao Chen,Yixuan Zhang,Eric Xing,Kun Zhang

Journal

arXiv preprint arXiv:2402.12563

Published Date

2024/2/19

The recent success of Large Language Models (LLMs) has catalyzed an increasing interest in their self-correction capabilities. This paper presents a comprehensive investigation into the intrinsic self-correction of LLMs, attempting to address the ongoing debate about its feasibility. Our research has identified an important latent factor, the "confidence" of LLMs, during the self-correction process. Overlooking this factor may cause the models to over-criticize themselves, resulting in unreliable conclusions regarding the efficacy of self-correction. We have experimentally observed that LLMs possess the capability to understand the "confidence" in their own responses. This motivates us to develop an "If-or-Else" (IoE) prompting framework, designed to guide LLMs in assessing their own "confidence" and facilitating intrinsic self-corrections. We conduct extensive experiments and demonstrate that our IoE-based prompt achieves a consistent improvement in the accuracy of self-corrected responses over the initial answers. Our study not only sheds light on the underlying factors affecting self-correction in LLMs, but also introduces a practical framework that utilizes the IoE prompting principle to efficiently improve self-correction capabilities with "confidence". The code is available at https://github.com/MBZUAI-CLeaR/IoE-Prompting.git.
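
The two-step flow can be sketched as below; query_llm is an assumed chat callable, and the prompt text paraphrases the idea rather than reproducing the paper's exact IoE prompt.

```python
# Illustrative two-step "If-or-Else" self-correction flow; the prompt text
# is a paraphrase of the idea, not the exact IoE prompt from the paper.

IOE_PROMPT = (
    "Review your previous answer. If you are confident it is correct, "
    "repeat it unchanged. Else, revise it and give your best answer."
)


def ioe_self_correct(query_llm, question):
    """Ask once, then let the model keep or revise based on its confidence."""
    initial = query_llm(question)
    revised = query_llm(
        f"Question: {question}\nYour answer: {initial}\n{IOE_PROMPT}"
    )
    return revised
```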

FedNAR: Federated Optimization with Normalized Annealing Regularization

Authors

Junbo Li,Ang Li,Chong Tian,Qirong Ho,Eric Xing,Hongyi Wang

Journal

Advances in Neural Information Processing Systems

Published Date

2024/2/13

Weight decay is a standard technique to improve generalization performance in modern deep neural network optimization, and is also widely adopted in federated learning (FL) to prevent overfitting in local clients. In this paper, we first explore the choices of weight decay and identify that the weight decay value appreciably influences the convergence of existing FL algorithms. While preventing overfitting is crucial, weight decay can introduce a different optimization goal towards the global objective, which is further amplified in FL due to multiple local updates and heterogeneous data distributions. To address this challenge, we develop Federated optimization with Normalized Annealing Regularization (FedNAR), a simple yet effective and versatile algorithmic plug-in that can be seamlessly integrated into any existing FL algorithms. Essentially, we regulate the magnitude of each update by performing co-clipping of the gradient and weight decay. We provide a comprehensive theoretical analysis of FedNAR's convergence rate and conduct extensive experiments on both vision and language datasets with different backbone federated optimization algorithms. Our experimental results consistently demonstrate that incorporating FedNAR into existing FL algorithms leads to accelerated convergence and heightened model accuracy. Moreover, FedNAR exhibits resilience in the face of various hyperparameter configurations. Specifically, FedNAR has the ability to self-adjust the weight decay when the initial specification is not optimal, while the accuracy of traditional FL algorithms would markedly decline. Our codes are released at https://anonymous …
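
A minimal sketch of the co-clipping idea, under assumed hyperparameter names: the gradient and the weight-decay term are combined first and clipped jointly, so neither can dominate the update magnitude.

```python
import numpy as np


def fednar_update(weights, grad, lr=0.1, weight_decay=1e-4, max_norm=1.0):
    """One local step with co-clipped gradient and weight decay.

    A sketch of the idea described above: the gradient and the weight-decay
    term are clipped jointly so their combined magnitude is bounded, rather
    than clipping the gradient alone. Hyperparameter values are illustrative.
    """
    update = grad + weight_decay * weights          # combined direction
    norm = np.linalg.norm(update)
    if norm > max_norm:                             # co-clipping step
        update = update * (max_norm / norm)
    return weights - lr * update
```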

FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization

Authors

Jiahui Zhang,Fangneng Zhan,Muyu Xu,Shijian Lu,Eric Xing

Journal

arXiv preprint arXiv:2403.06908

Published Date

2024/3/11

3D Gaussian splatting has achieved very impressive performance in real-time novel view synthesis. However, it often suffers from over-reconstruction during Gaussian densification where high-variance image regions are covered by a few large Gaussians only, leading to blur and artifacts in the rendered images. We design a progressive frequency regularization (FreGS) technique to tackle the over-reconstruction issue within the frequency space. Specifically, FreGS performs coarse-to-fine Gaussian densification by exploiting low-to-high frequency components that can be easily extracted with low-pass and high-pass filters in the Fourier space. By minimizing the discrepancy between the frequency spectrum of the rendered image and the corresponding ground truth, it achieves high-quality Gaussian densification and alleviates the over-reconstruction of Gaussian splatting effectively. Experiments over multiple widely adopted benchmarks (e.g., Mip-NeRF360, Tanks-and-Temples and Deep Blending) show that FreGS achieves superior novel view synthesis and outperforms the state-of-the-art consistently.
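
The frequency-space discrepancy at the heart of this regularization can be sketched with a plain FFT, as below; the fixed radial cutoff stands in for FreGS's progressive low-to-high schedule, and single-channel images are assumed.

```python
import numpy as np


def frequency_loss(rendered, target, cutoff=0.25, low_pass=True):
    """Discrepancy between low- (or high-) frequency bands of two images.

    A sketch of the frequency-space regularization idea: compare FFT
    amplitudes inside (low-pass) or outside (high-pass) a radial cutoff.
    FreGS's progressive cutoff schedule is simplified to a fixed fraction.
    Inputs are 2D float arrays of the same shape (grayscale assumed).
    """
    f_r = np.fft.fftshift(np.fft.fft2(rendered))
    f_t = np.fft.fftshift(np.fft.fft2(target))
    h, w = rendered.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)  # normalized frequency
    mask = radius <= cutoff if low_pass else radius > cutoff
    return np.mean(np.abs(np.abs(f_r) - np.abs(f_t))[mask])
```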

GET: a foundation model of transcription across human cell types

Authors

Xi Fu,Shentong Mo,Anqi Shao,Anouchka Laurent,Alejandro Buendia,Adolfo A Ferrando,Alberto Ciccia,Yanyan Lan,Teresa Palomero,David M Owens,Eric P Xing,Raul Rabadan

Journal

bioRxiv

Published Date

2023

Transcriptional regulation, involving the complex interplay between regulatory sequences and proteins, directs all biological processes. Computational models of transcription lack the generalizability to accurately extrapolate to unseen cell types and conditions. Here, we introduce GET, an interpretable foundation model designed to uncover regulatory grammars across 213 human fetal and adult cell types. Relying exclusively on chromatin accessibility data and sequence information, GET achieves experimental-level accuracy in predicting gene expression even in previously unseen cell types. GET showcases remarkable adaptability across new sequencing platforms and assays, enabling regulatory inference across a broad range of cell types and conditions, and uncovering universal and cell-type-specific transcription factor interaction networks. We evaluated its performance on prediction of regulatory activity, inference of regulatory elements and regulators, and identification of physical interactions between transcription factors. Specifically, we show GET outperforms current models in predicting lentivirus-based massively parallel reporter assay readout with reduced input data. In fetal erythroblasts, we identify distal (>1Mbp) regulatory regions that were missed by previous models. In B cells, we identify a lymphocyte-specific transcription factor-transcription factor interaction that explains the functional significance of a lymphoma-risk-predisposing germline mutation. In sum, we provide a generalizable and accurate model for transcription together with catalogs of gene regulation and transcription factor interactions, all with cell type specificity. A …

On optimizing the communication of model parallelism

Authors

Yonghao Zhuang,Lianmin Zheng,Zhuohan Li,Eric Xing,Qirong Ho,Joseph Gonzalez,Ion Stoica,Hao Zhang,Hexu Zhao

Journal

Proceedings of Machine Learning and Systems

Published Date

2023/3/18

We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism, intra-operator and inter-operator parallelism, are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.

SlimPajama-DC: Understanding data combinations for LLM training

Authors

Zhiqiang Shen,Tianhua Tao,Liqun Ma,Willie Neiswanger,Joel Hestness,Natalia Vassilieva,Daria Soboleva,Eric Xing

Journal

arXiv preprint arXiv:2309.10818

Published Date

2023/9/19

This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the training of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different sources of datasets) and local (within a single source of dataset) deduplication affect the performance of trained models. (2) Proportions of high-quality/highly-deduplicated multi-source datasets in the combination. To study this, we construct six configurations of the SlimPajama dataset and train individual models using the 1.3B Cerebras-GPT architecture with Alibi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16x CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our discoveries (such as that increasing data diversity is crucial after global deduplication) to a 7B model with large-batch-size training. Our models and the separate SlimPajama-DC datasets are available at: https://huggingface.co/MBZUAI-LLM and https …

SegMix: A Simple Structure-Aware Data Augmentation Method

Authors

Yuxin Pei,Pushkar Bhuse,Zhengzhong Liu,Eric Xing

Journal

arXiv preprint arXiv:2311.09505

Published Date

2023/11/16

Interpolation-based Data Augmentation (DA) methods (e.g., Mixup) linearly interpolate the inputs and labels of two or more training examples. Mixup has more recently been adapted to the field of Natural Language Processing (NLP), mainly for sequence labeling tasks. However, such a simple adoption yields mixed or unstable improvements over the baseline models. We argue that the direct-adoption methods do not account for structures in NLP tasks. To this end, we propose SegMix, a collection of interpolation-based DA algorithms that can adapt to task-specific structures. SegMix poses fewer constraints on data structures, is robust to various hyperparameter settings, applies to more task settings, and adds little computational overhead. At the algorithm's core, we apply interpolation methods on task-specific meaningful segments, in contrast to applying them on sequences as in prior work. We find SegMix to be a flexible framework that combines rule-based DA methods with interpolation-based methods, creating interesting mixtures of DA techniques. We show that SegMix consistently improves performance over strong baseline models in Named Entity Recognition (NER) and Relation Extraction (RE) tasks, especially under data-scarce settings. Furthermore, this method is easy to implement and adds negligible training overhead.
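
The underlying interpolation mechanism is standard Mixup applied to segment representations; a generic sketch follows, with the caveat that SegMix's actual contribution is choosing task-specific segments (e.g., entity spans) rather than the interpolation itself.

```python
import numpy as np


def mixup_segments(emb_a, emb_b, label_a, label_b, alpha=0.2, rng=None):
    """Interpolate two segment embeddings and their label distributions.

    A generic Mixup-style interpolation at the segment level, illustrating
    the mechanism SegMix builds on; labels are soft (probability) vectors.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)                 # mixing coefficient
    mixed_emb = lam * emb_a + (1.0 - lam) * emb_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_emb, mixed_label
```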

Memoization-Aware Bayesian Optimization for AI Pipelines with Unknown Costs

Authors

Abdelmajid Essofi,Ridwan Salahuddeen,Munachiso S Nwadike,Navish Kumar,Kun Zhang,Eric Xing,Willie Neiswanger,Qirong Ho

Published Date

2023/10/13

Bayesian optimization (BO) is an effective approach for optimizing expensive black-box functions via potentially noisy function evaluations. However, few BO techniques address the cost-aware setting, in which different samples impose different costs on the optimizer, particularly when costs are initially unknown. This cost-aware BO setting is of special interest in tuning multi-stage AI pipelines, in which we could apply caching techniques to store and reuse early-stage outputs in favor of optimizing later stages, without incurring the costs of re-running the full pipeline. In this paper, we propose the Expected-Expected Improvement Per Unit Cost (EEIPU), a novel extension to the Expected Improvement (EI) acquisition function that adapts to unknown costs in multi-stage pipelines. EEIPU fits individual Gaussian Process (GP) models for each stage's cost data and manages the different cost regions of the search space, while balancing exploration-exploitation trade-offs. Additionally, EEIPU incorporates early-stage memoization, reducing redundant computations and costs by reusing the results of earlier stages, allowing for more iterations than existing approaches within the specified budget. In the cost-aware setting, EEIPU significantly outperforms comparable methods when tested on both synthetic and real pipelines, returning higher objective function values at lower total execution costs. This offers a significant advancement in cost-aware BO for optimizing multi-stage machine learning pipelines.
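
To make the acquisition concrete, here is a simplified cost-aware variant in the spirit of EEIPU: standard Expected Improvement divided by the expected cost of the stages that still need to run, with memoized stages contributing zero cost. The function and argument names are assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(mu, sigma, best_f):
    """Standard EI for minimization, given GP posterior mean and std."""
    sigma = np.maximum(sigma, 1e-9)
    z = (best_f - mu) / sigma
    return (best_f - mu) * norm.cdf(z) + sigma * norm.pdf(z)


def ei_per_unit_cost(mu, sigma, best_f, stage_cost_means, memoized_mask):
    """Cost-aware acquisition in the spirit of EEIPU (simplified sketch).

    Divides EI by the total expected cost of the un-memoized pipeline
    stages; `stage_cost_means` are posterior mean costs from per-stage
    cost models (numpy array), and `memoized_mask` (boolean array) flags
    stages whose cached outputs can be reused at zero cost.
    """
    effective_cost = np.sum(stage_cost_means[~memoized_mask])
    return expected_improvement(mu, sigma, best_f) / max(effective_cost, 1e-9)
```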

A Study on the Calibration of In-context Learning

Authors

Hanlin Zhang,Yi-Fan Zhang,Yaodong Yu,Dhruv Madeka,Dean Foster,Eric Xing,Hima Lakkaraju,Sham Kakade

Journal

arXiv preprint arXiv:2312.04021

Published Date

2023/12/7

Modern auto-regressive language models are trained to minimize log loss on broad data by predicting the next token, so they are expected to produce calibrated answers when a problem is framed as a next-token prediction task. We study this for in-context learning (ICL), a widely used way to adapt frozen large language models (LLMs) via crafting prompts, and investigate the trade-offs between performance and calibration on a wide range of natural language understanding and reasoning tasks. We conduct extensive experiments to show that such trade-offs may get worse as we increase model size, incorporate more ICL examples, and fine-tune models using instruction, dialog, or reinforcement learning from human feedback (RLHF) on carefully curated datasets. Furthermore, we find that common recalibration techniques that are widely effective, such as temperature scaling, provide limited gains in calibration errors, suggesting that new methods may be required for settings where models are expected to be reliable.
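
For reference, the temperature-scaling baseline mentioned above is simple to implement; the sketch below fits a single temperature on held-out logits by grid search over the negative log-likelihood (a grid is used here for brevity instead of the usual gradient-based fit).

```python
import numpy as np


def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Fit a single temperature by grid search on validation NLL.

    The standard recalibration baseline: divide logits by T before the
    softmax and pick the T that minimizes negative log-likelihood.
    `logits` is (n, C) float array, `labels` is (n,) int array.
    """
    def nll(T):
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)          # stable log-softmax
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    return min(grid, key=nll)
```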

3D semantic segmentation in the wild: Learning generalized models for adverse-condition point clouds

Authors

Aoran Xiao,Jiaxing Huang,Weihao Xuan,Ruijie Ren,Kangcheng Liu,Dayan Guan,Abdulmotaleb El Saddik,Shijian Lu,Eric P Xing

Published Date

2023

Robust point cloud parsing under all-weather conditions is crucial to level-5 autonomy in autonomous driving. However, how to learn a universal 3D semantic segmentation (3DSS) model is largely neglected, as most existing benchmarks are dominated by point clouds captured under normal weather. We introduce SemanticSTF, an adverse-weather point cloud dataset that provides dense point-level annotations and allows the study of 3DSS under various adverse weather conditions. We investigate universal 3DSS modeling with two tasks: 1) domain adaptive 3DSS that adapts from normal-weather data to adverse-weather data; 2) domain generalized 3DSS that learns a generalizable model from normal-weather data. Our studies reveal the challenges existing 3DSS methods encounter with adverse-weather data, showing the great value of SemanticSTF in steering future endeavors along this meaningful research direction. In addition, we design a domain randomization technique that alternately randomizes the geometry styles of point clouds and aggregates their encoded embeddings, ultimately leading to a generalizable model that effectively improves 3DSS under various adverse weather. The SemanticSTF dataset and related codes are available at https://github.com/xiaoaoran/SemanticSTF.

One-for-All: Generalized LoRA for parameter-efficient fine-tuning

Authors

Arnav Chavan,Zhuang Liu,Deepak Gupta,Eric Xing,Zhiqiang Shen

Journal

arXiv preprint arXiv:2306.07967

Published Date

2023/6/13

We present Generalized LoRA (GLoRA), an advanced approach for universal parameter-efficient fine-tuning tasks. Enhancing Low-Rank Adaptation (LoRA), GLoRA employs a generalized prompt module to optimize pre-trained model weights and adjust intermediate activations, providing more flexibility and capability across diverse tasks and datasets. Moreover, GLoRA facilitates efficient parameter adaptation by employing a scalable, modular, layer-wise structure search that learns an individual adapter for each layer. Originating from a unified mathematical formulation, GLoRA exhibits strong transfer learning, few-shot learning, and domain generalization abilities, as it adapts to new tasks through not only weights but also additional dimensions like activations. Comprehensive experiments demonstrate that GLoRA outperforms all previous methods on natural, specialized, and structured vision benchmarks, achieving superior accuracy with fewer parameters and computations. The proposed method also shows considerable enhancements over the original LoRA in the language domain when applied to LLaMA-1 and LLaMA-2. Furthermore, our structural re-parameterization design ensures that GLoRA incurs no extra inference cost, rendering it a practical solution for resource-limited applications. Code and models are available at: https://github.com/Arnav0400/ViT-Slim/tree/master/GLoRA.
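
As background for what GLoRA generalizes, a minimal sketch of the plain LoRA forward pass is shown below; the shapes and the alpha scaling convention follow common LoRA usage, while GLoRA's additional per-layer scale/shift terms and structure search are not modeled here.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha=16.0):
    """Forward pass with a frozen weight plus a trainable low-rank update.

    W is the frozen pre-trained weight (d_out, d_in); only the rank-r
    factors A (r, d_in) and B (d_out, r) are trained. This is plain LoRA,
    the building block that GLoRA extends with learned scaling/shift terms.
    """
    r = A.shape[0]
    delta_W = (alpha / r) * (B @ A)      # low-rank weight update
    return x @ (W + delta_W).T


# Toy usage: rank-4 adaptation of a 64 -> 32 linear layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
W = rng.normal(size=(32, 64))
A = rng.normal(size=(4, 64)) * 0.01
B = np.zeros((32, 4))                    # B starts at zero, as in LoRA
y = lora_forward(x, W, A, B)             # (8, 32)
```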

Memory-adaptive depth-wise heterogeneous federated learning

Authors

Kai Zhang,Yutong Dai,Hongyi Wang,Eric Xing,Xun Chen,Lichao Sun

Journal

arXiv preprint arXiv:2303.04887

Published Date

2023/3/8

Federated learning is a promising paradigm that allows multiple clients to collaboratively train a model without sharing the local data. However, the presence of heterogeneous devices in federated learning, such as mobile phones and IoT devices with varying memory capabilities, limits the scale, and hence the performance, of the models that can be trained. The mainstream approaches to address memory limitations focus on width-slimming techniques, where different clients train subnetworks with reduced widths locally and then the server aggregates the subnetworks. The global model produced from these methods suffers from performance degradation due to the negative impact of the actions taken to handle the varying subnetwork widths in the aggregation phase. In this paper, we introduce a memory-adaptive depth-wise learning solution in FL called FeDepth, which adaptively decomposes the full model into blocks according to the memory budget of each client and trains blocks sequentially to obtain a full inference model. Our method outperforms state-of-the-art approaches, achieving 5% and more than 10% improvements in top-1 accuracy on CIFAR-10 and CIFAR-100, respectively. We also demonstrate the effectiveness of depth-wise fine-tuning on ViT. Our findings highlight the importance of memory-aware techniques for federated learning with heterogeneous devices and the success of the depth-wise training strategy in improving the global model's performance.
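
A toy sketch of the depth-wise assignment idea: each client greedily takes as many consecutive blocks as its memory budget allows. This shows only the partitioning step, under assumed cost units; FeDepth's sequential training and aggregation logic are not shown.

```python
def assign_blocks(block_costs, memory_budget):
    """Greedily select a prefix of model blocks that fits a client's memory.

    `block_costs` lists the per-block memory cost in the same units as
    `memory_budget`; the client trains the returned consecutive blocks.
    A simplified sketch of depth-wise decomposition, not FeDepth itself.
    """
    selected, used = [], 0.0
    for i, cost in enumerate(block_costs):
        if used + cost > memory_budget:
            break
        selected.append(i)
        used += cost
    return selected


# Example: a 6-block model on a client with a budget of 3.5 units.
print(assign_blocks([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 3.5))  # [0, 1, 2]
```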

Jais and Jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models

Authors

Neha Sengupta,Sunil Kumar Sahu,Bokang Jia,Satheesh Katipomu,Haonan Li,Fajri Koto,Osama Mohammed Afzal,Samta Kamboj,Onkar Pandit,Rahul Pal,Lalit Pradhan,Zain Muhammad Mujahid,Massa Baali,Alham Fikri Aji,Zhengzhong Liu,Andy Hock,Andrew Feldman,Jonathan Lee,Andrew Jackson,Preslav Nakov,Timothy Baldwin,Eric Xing

Journal

arXiv preprint arXiv:2308.16149

Published Date

2023/8/30

We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models. We release two open versions of the model -- the foundation Jais model, and an instruction-tuned Jais-chat variant -- with the aim of promoting research on Arabic LLMs. Available at https://huggingface.co/inception-mbzuai/jais-13b-chat

PromptAgent: Strategic planning with language models enables expert-level prompt optimization

Authors

Xinyuan Wang,Chenxi Li,Zhen Wang,Fan Bai,Haotian Luo,Jiayou Zhang,Nebojsa Jojic,Eric P Xing,Zhiting Hu

Journal

arXiv preprint arXiv:2310.16427

Published Date

2023/10/25

Highly effective, task-specific prompts are often heavily engineered by experts to integrate detailed instructions and domain insights based on a deep understanding of both the instincts of large language models (LLMs) and the intricacies of the target task. However, automating the generation of such expert-level prompts remains elusive. Existing prompt optimization methods tend to overlook the depth of domain knowledge and struggle to efficiently explore the vast space of expert-level prompts. Addressing this, we present PromptAgent, an optimization method that autonomously crafts prompts equivalent in quality to those handcrafted by experts. At its core, PromptAgent views prompt optimization as a strategic planning problem and employs a principled planning algorithm, rooted in Monte Carlo tree search, to strategically navigate the expert-level prompt space. Inspired by human-like trial-and-error exploration, PromptAgent induces precise expert-level insights and in-depth instructions by reflecting on model errors and generating constructive error feedback. Such a novel framework allows the agent to iteratively examine intermediate prompts (states), refine them based on error feedback (actions), simulate future rewards, and search for high-reward paths leading to expert prompts. We apply PromptAgent to 12 tasks spanning three practical domains: BIG-Bench Hard (BBH), as well as domain-specific and general NLP tasks, showing it significantly outperforms strong Chain-of-Thought and recent prompt optimization baselines. Extensive analyses emphasize its capability to craft expert-level, detailed, and domain-insightful prompts with great …

Liteformer: Lightweight Evoformer for Protein Structure Prediction

Authors

Ning Sun,Xingyi Cheng,Shentong Mo,Chiming Liu,Hui Li,Eric Xing,Le Song

Published Date

2023/10/13

AlphaFold2 has achieved seminal success in predicting structures from amino acid sequences with remarkable atomic accuracy. However, its Evoformer module faces a critical challenge in terms of high memory consumption, stemming from the computational complexity associated with the sequence length and the number of Multiple Sequence Alignments (MSA). This challenge arises from the attention mechanism involving third-order MSA and pair-wise tensors. This memory bottleneck poses difficulties when working with lengthy protein sequences. To tackle this problem, we introduce a novel and lightweight variant of Evoformer named Liteformer. Liteformer employs an innovative attention linearization mechanism that reduces complexity through a bias-aware flow attention mechanism, which seamlessly integrates MSA sequences and pair-wise information. Our extensive experiments, conducted on both monomeric and multimeric benchmark datasets, showcase the efficiency gains of our framework. Specifically, compared with Evoformer, Liteformer achieves up to a 44% reduction in memory usage and a 23% acceleration in training speed, all while maintaining competitive accuracy in protein structure prediction.

Weakly supervised 3D open-vocabulary segmentation

Authors

Kunhao Liu,Fangneng Zhan,Jiahui Zhang,Muyu Xu,Yingchen Yu,Abdulmotaleb El Saddik,Christian Theobalt,Eric Xing,Shijian Lu

Journal

Advances in Neural Information Processing Systems

Published Date

2023/12/15

Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps but it compromises the open-vocabulary feature as the 2D models are mostly finetuned with close-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner. Specifically, given only the open-vocabulary text descriptions of the objects in a scene, we distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF), which effectively lifts 2D features into view-consistent 3D segmentation. A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs. Code is available at https://github.com/Kunhao-Liu/3D-OVS.

KD-DLGAN: Data-limited image generation via knowledge distillation

Authors

Kaiwen Cui,Yingchen Yu,Fangneng Zhan,Shengcai Liao,Shijian Lu,Eric Xing

Published Date

2023/3/30

Generative Adversarial Networks (GANs) rely heavily on large-scale training data for training high-quality image generation models. With limited training data, the GAN discriminator often suffers from severe overfitting, which directly leads to degraded generation, especially in generation diversity. Inspired by recent advances in knowledge distillation (KD), we propose KD-DLGAN, a knowledge-distillation based generation framework that introduces pre-trained vision-language models for training effective data-limited image generation models. KD-DLGAN consists of two innovative designs. The first is aggregated generative KD that mitigates discriminator overfitting by challenging the discriminator with harder learning tasks and distilling more generalizable knowledge from the pre-trained models. The second is correlated generative KD that improves generation diversity by distilling and preserving the diverse image-text correlation within the pre-trained models. Extensive experiments over multiple benchmarks show that KD-DLGAN achieves superior image generation with limited training data. In addition, KD-DLGAN complements the state-of-the-art with consistent and substantial performance gains. The code will be released.

Federated learning as variational inference: A scalable expectation propagation approach

Authors

Han Guo,Philip Greengard,Hongyi Wang,Andrew Gelman,Yoon Kim,Eric P Xing

Journal

arXiv preprint arXiv:2302.04228

Published Date

2023/2/8

The canonical formulation of federated learning treats it as a distributed optimization problem where the model parameters are optimized against a global loss function that decomposes across client loss functions. A recent alternative formulation instead treats federated learning as a distributed inference problem, where the goal is to infer a global posterior from partitioned client data (Al-Shedivat et al., 2021). This paper extends the inference view and describes a variational inference formulation of federated learning where the goal is to find a global variational posterior that well-approximates the true posterior. This naturally motivates an expectation propagation approach to federated learning (FedEP), where approximations to the global posterior are iteratively refined through probabilistic message-passing between the central server and the clients. We conduct an extensive empirical study across various algorithmic considerations and describe practical strategies for scaling up expectation propagation to the modern federated setting. We apply FedEP on standard federated learning benchmarks and find that it outperforms strong baselines in terms of both convergence speed and accuracy.

3d open-vocabulary segmentation with foundation models

Authors

Kunhao Liu,Fangneng Zhan,Jiahui Zhang,Muyu Xu,Yingchen Yu,Abdulmotaleb El Saddik,Christian Theobalt,Eric Xing,Shijian Lu

Journal

arXiv preprint arXiv:2305.14093

Published Date

2023/5/23

Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps but it compromises the open-vocabulary feature significantly as the 2D models are mostly finetuned with close-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting the open-vocabulary multimodal knowledge and object reasoning capability of pre-trained foundation models CLIP and DINO, without necessitating any fine-tuning. Specifically, we distill open-vocabulary visual and textual knowledge from CLIP into a neural radiance field (NeRF) which effectively lifts 2D features into view-consistent 3D segmentation. Furthermore, we introduce the Relevancy-Distribution Alignment loss and Feature-Distribution Alignment loss to respectively mitigate the ambiguities of CLIP features and distill precise object boundaries from DINO features, eliminating the need for segmentation annotations during training. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs.

Iterative graph self-distillation

Authors

Hanlin Zhang,Shuai Lin,Weiyang Liu,Pan Zhou,Jian Tang,Xiaodan Liang,Eric P Xing

Journal

IEEE Transactions on Knowledge and Data Engineering (arXiv preprint arXiv:2010.12609)

Published Date

2023

Recently, there has been increasing interest in the challenge of how to discriminatively vectorize graphs. To address this, we propose a method called Iterative Graph Self-Distillation (IGSD) which learns graph-level representation in an unsupervised manner through instance discrimination using a self-supervised contrastive learning approach. IGSD involves a teacher-student distillation process that uses graph diffusion augmentations and constructs the teacher model using an exponential moving average of the student model. The intuition behind IGSD is to predict the teacher network representation of the graph pairs under different augmented views. As a natural extension, we also apply IGSD to semi-supervised scenarios by jointly regularizing the network with both supervised and self-supervised contrastive loss. Finally, we show that fine-tuning the IGSD-trained models with self-training can further improve …
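
A hedged sketch of the teacher-student core described above: the teacher is an exponential moving average (EMA) of the student, and the student is trained to predict the teacher's representation of a second augmented view. The encoder, augmentations, and loss below are simplified placeholders, not IGSD's graph encoder or diffusion augmentations.

```python
# Minimal EMA teacher-student distillation step, assuming generic encoders.
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 16))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

opt, ema = torch.optim.Adam(student.parameters(), lr=1e-3), 0.99

def step(view_a, view_b):
    z_s = F.normalize(student(view_a), dim=-1)
    with torch.no_grad():
        z_t = F.normalize(teacher(view_b), dim=-1)
    loss = 2 - 2 * (z_s * z_t).sum(dim=-1).mean()   # cosine prediction loss
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                           # EMA teacher update
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1 - ema)
    return loss.item()

x = torch.randn(8, 32)                              # stand-in graph features
print(step(x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x)))
```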

Contextualized Policy Recovery: Modeling and Interpreting Medical Decisions with Adaptive Imitation Learning

Authors

Caleb Ellington,Jannik Deuschel,Ben Lengerich,Yingtao Luo,Pascal Friederich,Eric Xing

Published Date

2023/10/13

Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models fall short by forcing a tradeoff between accuracy and interpretability. This tradeoff limits data-driven interpretations of human decision-making processes; for example, to audit medical decisions for biases and suboptimal practices, we require models of decision processes which provide concise descriptions of complex behaviors. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, when in fact human decisions are dynamic and can change drastically with contextual information. Thus, we propose Contextualized Policy Recovery (CPR), which re-frames the problem of modeling complex decision processes as a multi-task learning problem in which complex decision policies are composed of context-specific policies. CPR models each context-specific policy as a linear observation-to-action mapping, and generates new decision models on-demand as contexts are updated with new observations. CPR is compatible with fully offline and partially observable decision environments, and can be tailored to incorporate any recurrent black-box model or interpretable decision model. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on the canonical tasks of predicting antibiotic prescription in intensive care units (% AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients (% AUROC vs. previous SOTA). With this improvement in predictive performance, CPR …

Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs

Authors

Bowen Tan,Yun Zhu,Lijuan Liu,Hongyi Wang,Yonghao Zhuang,Jindong Chen,Eric Xing,Zhiting Hu

Journal

arXiv preprint arXiv:2310.16355

Published Date

2023/10/25

The recent progress of AI can be largely attributed to large language models (LLMs). However, their escalating memory requirements introduce challenges for machine learning (ML) researchers and engineers. Addressing this requires developers to partition a large model to distribute it across multiple GPUs or TPUs. This necessitates considerable coding and intricate configuration efforts with existing model parallel tools, such as Megatron-LM, DeepSpeed, and Alpa. These tools require users' expertise in machine learning systems (MLSys), creating a bottleneck in LLM development, particularly for developers without an MLSys background. In this work, we present Redco, a lightweight and user-friendly tool crafted to automate distributed training and inference for LLMs, as well as to simplify ML pipeline development. The design of Redco emphasizes two key aspects. Firstly, to automate model parallelism, our study identifies two straightforward rules to generate tensor parallel strategies for any given LLM. Integrating these rules into Redco facilitates effortless distributed LLM training and inference, eliminating the need for additional coding or complex configurations. We demonstrate the effectiveness by applying Redco to a set of LLM architectures, such as GPT-J, LLaMA, T5, and OPT, up to the size of 66B. Secondly, we propose a mechanism that allows for the customization of diverse ML pipelines through the definition of merely three functions, eliminating redundant and formulaic code like multi-host-related processing. This mechanism proves adaptable across a spectrum of ML algorithms, from foundational language modeling to complex …

Linker-Tuning: Optimizing Continuous Prompts for Heterodimeric Protein Prediction

Authors

Shuxian Zou,Hui Li,Shentong Mo,Xingyi Cheng,Eric Xing,Le Song

Journal

arXiv preprint arXiv:2312.01186

Published Date

2023/12/2

Predicting the structure of interacting chains is crucial for understanding biological systems and developing new drugs. Large-scale pre-trained Protein Language Models (PLMs), such as ESM2, have shown impressive abilities in extracting biologically meaningful representations for protein structure prediction. In this paper, we show that ESMFold, which has been successful in computing accurate atomic structures for single-chain proteins, can be adapted to predict the heterodimer structures in a lightweight manner. We propose Linker-tuning, which learns a continuous prompt to connect the two chains in a dimer before running it as a single sequence in ESMFold. Experiment results show that our method successfully predicts 56.98% of interfaces on the i.i.d. heterodimer test set, with an absolute improvement of +12.79% over the ESMFold-Linker baseline. Furthermore, our model can generalize well to the out-of-distribution (OOD) test set HeteroTest2 and two antibody test sets Fab and Fv while being faster than AF-Multimer.
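
The mechanism lends itself to a short illustration: a block of learnable "linker" embeddings joins the two chains so a frozen single-chain model can process the dimer as one sequence. The sketch below is an assumption-laden stand-in; `frozen_trunk` is a hypothetical placeholder for a frozen model such as ESMFold, and only the linker parameters would be trained.

```python
# Illustrative linker-tuning sketch (not the paper's code).
import torch

emb_dim, linker_len = 320, 25
linker = torch.nn.Parameter(torch.randn(linker_len, emb_dim) * 0.02)

frozen_trunk = torch.nn.Linear(emb_dim, emb_dim)   # stand-in for ESMFold
for p in frozen_trunk.parameters():
    p.requires_grad_(False)

def fold_dimer(chain_a_emb, chain_b_emb):
    # Concatenate [chain A | learned linker | chain B] into one "sequence".
    joined = torch.cat([chain_a_emb, linker, chain_b_emb], dim=0)
    return frozen_trunk(joined)

a, b = torch.randn(100, emb_dim), torch.randn(80, emb_dim)
out = fold_dimer(a, b)            # only `linker` receives gradients
print(out.shape)                  # torch.Size([205, 320])
```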

Understanding masked autoencoders via hierarchical latent variable models

Authors

Lingjing Kong,Martin Q Ma,Guangyi Chen,Eric P Xing,Yuejie Chi,Louis-Philippe Morency,Kun Zhang

Published Date

2023

Masked autoencoder (MAE), a simple and effective self-supervised learning framework based on the reconstruction of masked image regions, has recently achieved prominent success in a variety of vision tasks. Despite the emergence of intriguing empirical observations on MAE, a theoretically principled understanding is still lacking. In this work, we formally characterize and justify existing empirical insights and provide theoretical guarantees of MAE. We formulate the underlying data-generating process as a hierarchical latent variable model, and show that under reasonable assumptions, MAE provably identifies a set of latent variables in the hierarchical model, explaining why MAE can extract high-level information from pixels. Further, we show how key hyperparameters in MAE (the masking ratio and the patch size) determine which true latent variables are recovered, therefore influencing the level of semantic information in the representation. Specifically, extremely large or small masking ratios inevitably lead to low-level representations. Our theory offers coherent explanations of existing empirical observations and provides insights for potential empirical improvements and fundamental limitations of the masked-reconstruction paradigm. We conduct extensive experiments to validate our theoretical insights.

Does compressing activations help model parallel training?

Authors

Song Bian,Dacheng Li,Hongyi Wang,Eric P Xing,Shivaram Venkataraman

Journal

arXiv preprint arXiv:2301.02654

Published Date

2023/1/6

Large-scale Transformer models are known for their exceptional performance in a range of tasks, but training them can be difficult due to the requirement for communication-intensive model parallelism. One way to improve training speed is to compress the message size in communication. Previous approaches have primarily focused on compressing gradients in a data parallelism setting, but compression in a model-parallel setting is an understudied area. We have discovered that model parallelism has fundamentally different characteristics from data parallelism. In this work, we present the first empirical study on the effectiveness of compression methods for model parallelism. We implement and evaluate three common classes of compression algorithms - pruning-based, learning-based, and quantization-based - using a popular Transformer training framework. We evaluate these methods across more than 160 settings and 8 popular datasets, taking into account different hyperparameters, hardware, and both fine-tuning and pre-training stages. We also provide analysis of how compression behaves as the model is scaled up. Finally, we provide insights for future development of model parallelism compression algorithms.
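
As a concrete instance of one of the three classes evaluated, the sketch below applies simple uniform int8 quantization to activations at a model-parallel boundary before "communication" and dequantizes them afterward. This is a generic example of quantization-based compression, not the paper's specific implementation.

```python
# Minimal quantization-based activation compression sketch.
import torch

def quantize_int8(x):
    scale = x.abs().amax() / 127.0 + 1e-12
    return (x / scale).round().clamp(-127, 127).to(torch.int8), scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

acts = torch.randn(16, 1024)                 # activations at a parallel cut
q, s = quantize_int8(acts)                   # 4x smaller message than fp32
recovered = dequantize(q, s)
print("max abs error:", (acts - recovered).abs().max().item())
```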

Improved logical reasoning of language models via differentiable symbolic programming

Authors

Hanlin Zhang,Jiani Huang,Ziyang Li,Mayur Naik,Eric Xing

Journal

arXiv preprint arXiv:2305.03742

Published Date

2023/5/5

Pre-trained large language models (LMs) struggle to perform logical reasoning reliably despite advances in scale and compositionality. In this work, we tackle this challenge through the lens of symbolic programming. We propose DSR-LM, a Differentiable Symbolic Reasoning framework where pre-trained LMs govern the perception of factual knowledge, and a symbolic module performs deductive reasoning. In contrast to works that rely on hand-crafted logic rules, our differentiable symbolic reasoning framework efficiently learns weighted rules and applies semantic loss to further improve LMs. DSR-LM is scalable, interpretable, and allows easy integration of prior knowledge, thereby supporting extensive symbolic programming to robustly derive a logical conclusion. The results of our experiments suggest that DSR-LM improves the logical reasoning abilities of pre-trained language models, resulting in a significant increase in accuracy of over 20% on deductive reasoning benchmarks. Furthermore, DSR-LM outperforms a variety of competitive baselines when faced with systematic changes in sequence length.

Autonomous industrial process control system and method that provides autonomous retraining of forecast model

Published Date

2023/8/8

The current disclosure is directed towards a system and method for controlling an industrial process. In one example, a method comprises deploying a forecast model for controlling an industrial process with training configurations that can be used as a single point of truth for guiding training and retraining of versions of the forecast model using a model training algorithm without human input. Retraining and redeployment of the forecast model may be triggered when the performance of the forecast model degrades.

Lightseq: Sequence level parallelism for distributed training of long context transformers

Authors

Dacheng Li,Rulin Shao,Anze Xie,Eric P Xing,Joseph E Gonzalez,Ion Stoica,Xuezhe Ma,Hao Zhang

Journal

arXiv preprint arXiv:2310.03294

Published Date

2023/10/5

Increasing the context length of large language models (LLMs) unlocks fundamentally new capabilities, but also significantly increases the memory footprints of training. Previous model-parallel systems such as Megatron-LM partition and compute different attention heads in parallel, resulting in large communication volumes, so they cannot scale beyond the number of attention heads, thereby hindering their adoption. In this paper, we introduce a new approach, LightSeq, for long-context LLMs training. LightSeq has many notable advantages. First, LightSeq partitions over the sequence dimension, hence is agnostic to model architectures and readily applicable for models with varying numbers of attention heads, such as Multi-Head, Multi-Query and Grouped-Query attention. Second, LightSeq not only requires up to 4.7x less communication than Megatron-LM on popular LLMs but also overlaps the communication with computation. To further reduce the training time, LightSeq features a novel gradient checkpointing scheme to bypass a forward computation for memory-efficient attention. We evaluate LightSeq on Llama-7B and its variants with sequence lengths from 32K to 512K. Through comprehensive experiments on single and cross-node training, we show that LightSeq achieves up to 1.24-2.01x end-to-end speedup, and a 2-8x longer sequence length on models with fewer heads, compared to Megatron-LM. Codes will be available at https://github.com/RulinShao/LightSeq.
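
The sequence-dimension partitioning can be illustrated in a single process: each simulated worker owns a contiguous chunk of queries and attends over gathered keys/values, which is why the scheme is agnostic to the number of attention heads. The loop below is a sketch under that assumption; a real system would move k and v via collective or point-to-point communication overlapped with compute.

```python
# Single-process illustration of sequence-level partitioning of attention.
import torch, torch.nn.functional as F

seq_len, d, workers = 64, 32, 4
q, k, v = (torch.randn(seq_len, d) for _ in range(3))

chunks = torch.chunk(torch.arange(seq_len), workers)   # sequence partition
outputs = []
for idx in chunks:                     # one loop iteration = one "worker"
    local_q = q[idx]                   # worker-local queries
    # In a distributed run, k and v would arrive via communication here.
    attn = F.softmax(local_q @ k.T / d ** 0.5, dim=-1)
    outputs.append(attn @ v)

full = torch.cat(outputs)              # matches unpartitioned attention
ref = F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v
print(torch.allclose(full, ref, atol=1e-5))
```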

Contextualized machine learning

Authors

Benjamin Lengerich,Caleb N Ellington,Andrea Rubbi,Manolis Kellis,Eric P Xing

Journal

arXiv preprint arXiv:2310.11340

Published Date

2023/10/17

We examine Contextualized Machine Learning (ML), a paradigm for learning heterogeneous and context-dependent effects. Contextualized ML estimates heterogeneous functions by applying deep learning to the meta-relationship between contextual information and context-specific parametric models. This is a form of varying-coefficient modeling that unifies existing frameworks including cluster analysis and cohort modeling by introducing two reusable concepts: a context encoder, which translates sample context into model parameters, and a sample-specific model, which operates on sample predictors. We review the process of developing contextualized models, nonparametric inference from contextualized models, and identifiability conditions of contextualized models. Finally, we present the open-source PyTorch package ContextualizedML.
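
The varying-coefficient idea admits a compact sketch: a context encoder maps each sample's context to the parameters of a sample-specific linear model applied to that sample's predictors. This is illustrative only; the ContextualizedML package's actual API may differ.

```python
# Minimal contextualized (varying-coefficient) model sketch.
import torch

ctx_dim, x_dim = 5, 3
encoder = torch.nn.Sequential(                 # context -> model parameters
    torch.nn.Linear(ctx_dim, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, x_dim + 1))            # per-sample weights + bias

def predict(context, x):
    params = encoder(context)                  # (batch, x_dim + 1)
    w, b = params[:, :x_dim], params[:, x_dim]
    return (w * x).sum(dim=-1) + b             # sample-specific linear model

c, x, y = torch.randn(64, ctx_dim), torch.randn(64, x_dim), torch.randn(64)
loss = torch.nn.functional.mse_loss(predict(c, x), y)
loss.backward()                                # trains the shared encoder
print(loss.item())
```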

Llm360: Towards fully transparent open-source llms

Authors

Zhengzhong Liu,Aurick Qiao,Willie Neiswanger,Hongyi Wang,Bowen Tan,Tianhua Tao,Junbo Li,Yuqi Wang,Suqi Sun,Omkar Pangarkar,Richard Fan,Yi Gu,Victor Miller,Yonghao Zhuang,Guowei He,Haonan Li,Fajri Koto,Liping Tang,Nikhil Ranjan,Zhiqiang Shen,Xuguang Ren,Roberto Iriondo,Cun Mu,Zhiting Hu,Mark Schulze,Preslav Nakov,Tim Baldwin,Eric P Xing

Journal

arXiv preprint arXiv:2312.06550

Published Date

2023/12/11

The recent surge in open-source Large Language Models (LLMs), such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible by everyone. As a first step of LLM360, we release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at https://www.llm360.ai). We are committed to continually pushing the boundaries of LLMs through this open-source effort. More large-scale and stronger models are underway and will be released in the future.

Stylerf: Zero-shot 3d style transfer of neural radiance fields

Authors

Kunhao Liu,Fangneng Zhan,Yiwen Chen,Jiahui Zhang,Yingchen Yu,Shijian Lu

Published Date

2023/3

3D style transfer aims to render stylized novel views of a 3D scene with multi-view consistency. However, most existing work suffers from a three-way dilemma over accurate geometry reconstruction, high-quality stylization, and being generalizable to arbitrary new styles. We propose StyleRF (Style Radiance Fields), an innovative 3D style transfer technique that resolves the three-way dilemma by performing style transformation within the feature space of a radiance field. StyleRF employs an explicit grid of high-level features to represent 3D scenes, with which high-fidelity geometry can be reliably restored via volume rendering. In addition, it transforms the grid features according to the reference style which directly leads to high-quality zero-shot style transfer. StyleRF consists of two innovative designs. The first is sampling-invariant content transformation that makes the transformation invariant to the holistic statistics of the sampled 3D points and accordingly ensures multi-view consistency. The second is deferred style transformation of 2D feature maps which is equivalent to the transformation of 3D points but greatly reduces memory footprint without degrading multi-view consistency. Extensive experiments show that StyleRF achieves superior 3D stylization quality with precise geometry reconstruction and it can generalize to various new styles in a zero-shot manner. Project website: https://kunhao-liu.github.io/StyleRF/

Contextualized Networks Reveal Heterogeneous Transcriptomic Regulation in Tumors at Sample-Specific Resolution

Authors

Caleb N Ellington,Benjamin J Lengerich,Thomas BK Watkins,Jiekun Yang,Hanxi Xiao,Manolis Kellis,Eric P Xing

Journal

bioRxiv

Published Date

2023

Cancers are shaped by somatic mutations, microenvironment, and patient background, each altering gene expression and regulation in complex ways, resulting in heterogeneous cellular states and dynamics. Inferring gene regulatory network (GRN) models from expression data can help characterize this regulation-driven heterogeneity, but network inference requires many statistical samples, traditionally limiting GRNs to cluster-level analyses that ignore intra-cluster heterogeneity. We propose to move beyond cluster-based analyses by using contextualized learning, a multi-task learning paradigm which allows us to infer sample-specific models using phenotypic, molecular, and environmental information pertinent to the model, encoded as the model's "context" to be conditioned on. We unify three network model classes (Correlation, Markov, Neighborhood) and estimate context-specific GRNs for 7997 tumors across 25 tumor types, with each network contextualized by copy number and driver mutation profiles, tumor microenvironment, and patient demographics. Contextualized GRNs provide a structured view of expression dynamics at sample-specific resolution, which reveal co-expression modules in correlation networks (CNs), as well as cliques and independent regulatory elements in Markov Networks (MNs) and Neighborhood Regression Networks (NNs). Our generative modeling approach allows us to predict GRNs for unseen tumor types based on a pan-cancer model of how somatic mutations affect gene regulation. Finally, contextualized networks enable GRN-based precision oncology, explaining known biomarkers in terms of …

Cuttlefish: Low-rank model training without all the tuning

Authors

Hongyi Wang,Saurabh Agarwal,Yoshiki Tanaka,Eric Xing,Dimitris Papailiopoulos

Journal

Proceedings of Machine Learning and Systems

Published Date

2023/3/18

Recent research has shown that training low-rank neural networks can effectively reduce the total number of trainable parameters without sacrificing predictive accuracy, resulting in end-to-end speedups. However, low-rank model training necessitates adjusting several additional factorization hyperparameters, such as the rank of the factorization at each layer. In this paper, we tackle this challenge by introducing Cuttlefish, an automated low-rank training approach that eliminates the need for tuning factorization hyperparameters. Cuttlefish leverages the observation that after a few epochs of full-rank training, the stable rank (i.e., an approximation of the true rank) of each layer stabilizes at a constant value. Cuttlefish switches from full-rank to low-rank training once the stable ranks of all layers have converged, setting the dimension of each factorization to its corresponding stable rank. Our results show that Cuttlefish generates models up to 5.6 times smaller than full-rank models, and attains up to a 1.2 times faster end-to-end training process while preserving comparable accuracy. Moreover, Cuttlefish outperforms state-of-the-art low-rank model training methods and other prominent baselines. The source code for our implementation can be found at: https://github.com/hwang595/Cuttlefish.
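
The quantity Cuttlefish monitors has a simple closed form: the stable rank ||W||_F^2 / ||W||_2^2 of each layer's weight matrix. Below is a minimal sketch of computing it and replacing a layer with a factorization of that rank; the convergence criterion and factorized-training details are simplified assumptions, not the paper's implementation.

```python
# Stable rank and a rank-r SVD replacement for one weight matrix.
import torch

def stable_rank(w):
    fro2 = (w ** 2).sum()
    spec2 = torch.linalg.matrix_norm(w, ord=2) ** 2   # spectral norm squared
    return (fro2 / spec2).item()

w = torch.randn(512, 512)
r = max(1, round(stable_rank(w)))
u, s, vh = torch.linalg.svd(w, full_matrices=False)
low_rank = (u[:, :r] * s[:r]) @ vh[:r]                # rank-r factors
print("stable rank:", r,
      "relative error:", (torch.norm(w - low_rank) / torch.norm(w)).item())
```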

Neural-symbolic interaction and co-evolving

Authors

Bowen Tan,Shibo Hao,Eric Xing,Zhiting Hu

Journal

Compendium of Neurosymbolic Artificial Intelligence

Published Date

2023/8/4

Deep neural networks provide a powerful mechanism for learning patterns from massive data, achieving new levels of performance on image classification [24], speech recognition [25], machine translation [26], playing strategic board games [27], and so forth. Despite the impressive advances, the widely-used DNN methods still have limitations. The high predictive accuracy has heavily relied on large amounts of labeled data, and the purely data-driven learning can lead to uninterpretable and sometimes counterintuitive results [28, 29]. It is also difficult to encode human intention to guide the models to capture desired patterns, without expensive direct supervision or ad-hoc initialization. On the other hand, the cognitive processes of human beings indicate that people learn not only from concrete examples (as DNNs do) but also from different forms of general knowledge and rich experiences [30, 31]. Logic rules provide a flexible declarative language for communicating high-level cognition and expressing structured knowledge. It is therefore desirable to integrate logic rules into DNNs, to transfer human intention and domain knowledge to neural models, and regulate the learning process. In this section, we present a framework capable of enhancing general types of neural networks, such as convolutional networks (CNNs) and recurrent networks (RNNs), on various tasks, with logic rule knowledge. Combining symbolic representations with neural methods has been considered in different contexts. Neural-symbolic systems [32] construct a network from a given rule set to execute reasoning. To exploit a priori knowledge in general neural …

Fusing Models with Complementary Expertise

Authors

Hongyi Wang,Felipe Maia Polo,Yuekai Sun,Souvik Kundu,Eric Xing,Mikhail Yurochkin

Journal

arXiv preprint arXiv:2310.01542

Published Date

2023/10/2

Training AI models that generalize across tasks and domains has long been among the open problems driving AI research. The emergence of Foundation Models made it easier to obtain expert models for a given task, but the heterogeneity of data that may be encountered at test time often means that any single expert is insufficient. We consider the Fusion of Experts (FoE) problem of fusing outputs of expert models with complementary knowledge of the data distribution and formulate it as an instance of supervised learning. Our method is applicable to both discriminative and generative tasks and leads to significant performance improvements in image and text classification, text summarization, multiple-choice QA, and automatic evaluation of generated text. We also extend our method to the "frugal" setting where it is desired to reduce the number of expert model evaluations at test time.

Lq-lora: Low-rank plus quantized matrix decomposition for efficient language model finetuning

Authors

Han Guo,Philip Greengard,Eric P Xing,Yoon Kim

Journal

arXiv preprint arXiv:2311.12023

Published Date

2023/11/20

We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on adapting RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and moreover enables more aggressive quantization. For example, on the OpenAssistant benchmark LQ-LoRA is able to learn a 2.5-bit LLaMA-2 model that is competitive with a model finetuned with 4-bit QLoRA. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression; in this setting our 2.75-bit LLaMA-2-70B model (which has 2.85 bits on average when including the low-rank components and requires 27GB of GPU memory) is competitive with the original model in full precision.
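
The iterative decomposition can be sketched numerically: alternate between quantizing the residual W - L and refitting a rank-r component L to W - Q via truncated SVD. The uniform quantizer below is a hedged stand-in for the paper's quantization scheme, and the names are illustrative.

```python
# Numerical sketch of a low-rank-plus-quantized decomposition loop.
import torch

def quantize(x, bits=2):
    levels = 2 ** bits - 1
    scale = (x.max() - x.min()) / levels + 1e-12
    return ((x - x.min()) / scale).round() * scale + x.min()

def lq_decompose(w, rank=16, iters=10, bits=2):
    L = torch.zeros_like(w)
    for _ in range(iters):
        Q = quantize(w - L, bits)                      # memory-efficient part
        u, s, vh = torch.linalg.svd(w - Q, full_matrices=False)
        L = (u[:, :rank] * s[:rank]) @ vh[:rank]       # trainable low-rank part
    return Q, L

w = torch.randn(256, 256)
Q, L = lq_decompose(w)
print("reconstruction error:", (torch.norm(w - Q - L) / torch.norm(w)).item())
```

During finetuning, only L (stored as its two factors) would be updated while Q stays fixed, which is what makes the scheme memory-efficient.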

RealChat-1M: A Large-Scale Real-World LLM Conversation Dataset

Authors

Lianmin Zheng,Wei-Lin Chiang,Ying Sheng,Tianle Li,Siyuan Zhuang,Zhanghao Wu,Yonghao Zhuang,Zhuohan Li,Zi Lin,Eric Xing,Joseph E Gonzalez,Ion Stoica,Hao Zhang

Published Date

2023/10/13

Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce RealChat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our chat demo website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset will be publicly available.

Sliced recursive transformer

Authors

Zhiqiang Shen,Zechun Liu,Eric Xing

Journal

arXiv preprint arXiv:2111.05297 (ECCV 2022)

Published Date

2021/11/9

We present a neat yet effective recursive operation on vision transformers that can improve parameter utilization without involving additional parameters. This is achieved by sharing weights across the depth of transformer networks. The proposed method can obtain a substantial gain (2%) simply using a naïve recursive operation, requires no special or sophisticated knowledge for designing principles of networks, and introduces minimal computational overhead to the training procedure. To reduce the additional computation caused by the recursive operation while maintaining superior accuracy, we propose an approximating method through multiple sliced group self-attentions across recursive layers, which can reduce the computational cost by 10–30% without sacrificing performance. We call our model Sliced Recursive Transformer (SReT), a novel and parameter-efficient vision transformer design that is compatible …
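
The recursive operation itself is a few lines: apply the same transformer block repeatedly across depth, so the parameter count stays fixed while the effective depth grows. The sketch below is illustrative and omits SReT's sliced group self-attention approximation.

```python
# Weight sharing across depth via a recursive block (illustrative sketch).
import torch

block = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

def recursive_forward(x, recursions=3):
    for _ in range(recursions):      # same weights reused at each "depth"
        x = block(x)
    return x

x = torch.randn(2, 10, 64)           # (batch, tokens, dim)
print(recursive_forward(x).shape)    # torch.Size([2, 10, 64])
```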

Amp: Automatically finding model parallel strategies with heterogeneity awareness

Authors

Dacheng Li,Hongyi Wang,Eric Xing,Hao Zhang

Journal

NeurIPS 2022

Published Date

2022/10/13

Scaling up model sizes can lead to fundamentally new capabilities in many machine learning (ML) tasks. However, training big models requires strong distributed system expertise to carefully design model-parallel execution strategies that suit the model architectures and cluster setups. In this paper, we develop AMP, a framework that automatically derives such strategies. AMP identifies a valid space of model parallelism strategies and efficiently searches the space for high-performing strategies, by leveraging a cost model designed to capture the heterogeneity of the model and cluster specifications. Unlike existing methods, AMP is specifically tailored to support complex models composed of uneven layers and cluster setups with more heterogeneous accelerators and bandwidth. We evaluate AMP on popular models and cluster setups from public clouds and show that AMP returns parallel strategies that match the expert-tuned strategies on typical cluster setups. On heterogeneous clusters or models with heterogeneous architectures, AMP finds strategies with 1.54x and 1.77x higher throughput than state-of-the-art model-parallel systems, respectively.

Beware the Black-Box of Medical Image Generation: an Uncertainty Analysis by the Learned Feature Space

Authors

Yunni Qu,David Yan,Eric Xing,Fengbo Zheng,Jie Zhang,Liangliang Liu,Gongbo Liang

Published Date

2022/7/11

Deep neural networks (DNNs) are the primary driving force for the current development of medical imaging analysis tools and often provide exciting performance on various tasks. However, such results are usually reported as the overall performance of DNNs, such as the peak signal-to-noise ratio (PSNR) or mean square error (MSE) for image generation tasks. As black boxes, DNNs usually produce a relatively stable performance on the same task across multiple training trials, while the learned feature spaces could be significantly different. We believe additional insightful analysis, such as uncertainty analysis of the learned feature space, is equally important, if not more so. Through this work, we evaluate the learned feature space of multiple U-Net architectures for image generation tasks using computational analysis and clustering analysis methods. We demonstrate that the learned feature spaces are easily …

Federated partially supervised learning with limited decentralized medical images

Authors

Nanqing Dong,Michael Kampffmeyer,Irina Voiculescu,Eric Xing

Journal

IEEE Transactions on Medical Imaging

Published Date

2022/12

Data governance has played an instrumental role in securing the privacy-critical infrastructure in the medical domain and has led to an increased need for federated learning (FL). While decentralization can limit the effectiveness of standard supervised learning, the impact of decentralization on partially supervised learning remains unclear. Besides, due to data scarcity, each client may have access to only limited partially labeled data. As a remedy, this work formulates and discusses a new learning problem, federated partially supervised learning (FPSL), for limited decentralized medical images with partial labels. We study the impact of decentralized partially labeled data on deep learning-based models via an exemplar of FPSL, namely federated partially supervised multi-label classification. By dissecting FedAVG, a seminal FL framework, we formulate and analyze two major challenges of FPSL and …

Un-mix: Rethinking image mixtures for unsupervised visual representation learning

Authors

Zhiqiang Shen,Zechun Liu,Zhuang Liu,Marios Savvides,Trevor Darrell,Eric Xing

Journal

Proceedings of the AAAI Conference on Artificial Intelligence

Published Date

2022/6/28

The recently advanced unsupervised learning approaches use a siamese-like framework to compare two "views" from the same image for learning representations. Making the two views distinctive is core to guaranteeing that unsupervised methods can learn meaningful information. However, such frameworks are sometimes fragile to overfitting if the augmentations used for generating the two views are not strong enough, causing over-confidence on the training data. This drawback hinders the model from learning subtle variance and fine-grained information. To address this, in this work we aim to involve the soft distance concept on label space in the contrastive-based unsupervised learning task and let the model be aware of the soft degree of similarity between positive or negative pairs through mixing the input data space, to further work collaboratively for the input and loss spaces. Despite its conceptual simplicity, we show empirically that with the solution--Unsupervised image mixtures (Un-Mix), we can learn subtler, more robust and generalized representations from the transformed input and corresponding new label space. Extensive experiments are conducted on CIFAR-10, CIFAR-100, STL-10, Tiny ImageNet and standard ImageNet-1K with popular unsupervised methods SimCLR, BYOL, MoCo V1&V2, SwAV, etc. Our proposed image mixture and label assignment strategy can obtain consistent improvement of 1-3% following exactly the same hyperparameters and training procedures of the base methods. Code is publicly available at https://github.com/szq0214/Un-Mix.
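
A minimal sketch of the image-mixture step: mix a batch with a permuted copy of itself and keep the mixing coefficient as a soft similarity target for the contrastive objective. Loss-space details are omitted here, and the names are illustrative rather than the paper's code.

```python
# Input-space mixture with a soft similarity label (illustrative sketch).
import torch

def unmix_batch(batch, alpha=1.0):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(batch.size(0))
    mixed = lam * batch + (1 - lam) * batch[perm]
    return mixed, perm, lam          # lam defines the soft similarity target

images = torch.randn(8, 3, 32, 32)
mixed, perm, lam = unmix_batch(images)
print(mixed.shape, lam)
```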

Structure correcting adversarial network for chest x-rays organ segmentation

Published Date

2020/6/30

Organ segmentation in chest X-rays using convolutional neural networks is disclosed. One embodiment provides a method to train a convolutional segmentation network with chest X-ray images to generate pixel-level predictions of target classes. Another embodiment also trains a critic network with an input mask, wherein the input mask is one of a segmentation network mask and a ground-truth annotation, and outputs a probability that the input mask is the ground-truth annotation rather than the prediction by the segmentation network; this probability is provided back to the segmentation network to guide it toward generating masks more consistent with learned higher-order structures.

Oracle-oriented Robustness: Robust Image Model Evaluation with Pretrained Models as Surrogate Oracle

Authors

Peiyan Zhang,Sunghun Kim,Eric Xing,Haohan Wang

Published Date

2022/9/29

Machine learning has demonstrated remarkable performance over finite datasets, yet whether the scores over fixed benchmarks can sufficiently indicate a model's performance in the real world is still under discussion. In reality, an ideal robust model will probably behave similarly to the oracle (e.g., the human users), thus a good evaluation protocol is probably to evaluate the models' behaviors in comparison to the oracle. In this paper, we introduce a new robustness measurement that directly measures the image classification model's performance compared with a surrogate oracle. Besides, we design a simple method that can accomplish the evaluation beyond the scope of the benchmarks. Our method extends the image datasets with new samples that are sufficiently perturbed to be distinct from the ones in the original sets, but are still bounded within the same causal structure the original test image represents, constrained by a surrogate oracle model pretrained with a large number of samples. As a result, our new method offers a new way to evaluate the models' robustness performances, free of the limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. In addition to the evaluation results, we also leverage our generated data to understand the behaviors of the model and our new evaluation strategies.

Toward learning robust and invariant representations with alignment regularization and data augmentation

Authors

Haohan Wang,Zeyi Huang,Xindi Wu,Eric Xing

Published Date

2022/8/14

Data augmentation has been proven to be an effective technique for developing machine learning models that are robust to known classes of distributional shifts (e.g., rotations of images), and alignment regularization is a technique often used together with data augmentation to further help the model learn representations invariant to the shifts used to augment the data. In this paper, motivated by a proliferation of options for alignment regularization, we seek to evaluate the performance of several popular design choices along the dimensions of robustness and invariance, for which we introduce a new test procedure. Our synthetic experiment results speak to the benefits of squared ℓ2 norm regularization. Further, we also formally analyze the behavior of alignment regularization to complement our empirical study under assumptions we consider realistic. Finally, we test this simple technique we identify (worst-case …
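
The squared ℓ2 regularizer the study favors is straightforward to write down: alongside the task loss, penalize the distance between the model's representations of an example and its augmented copy. The model and augmentation below are hypothetical placeholders, not the paper's setup.

```python
# Task loss plus squared l2 alignment regularization (illustrative sketch).
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))

def loss_fn(x, y, lam=0.1):
    x_aug = x + 0.05 * torch.randn_like(x)         # stand-in augmentation
    logits, logits_aug = model(x), model(x_aug)
    task = F.cross_entropy(logits, y)
    align = (logits - logits_aug).pow(2).sum(dim=-1).mean()  # squared l2
    return task + lam * align

x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
print(loss_fn(x, y).item())
```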

Stochastic neural networks with infinite width are deterministic

Authors

Liu Ziyin,Hanlin Zhang,Xiangming Meng,Yuting Lu,Eric Xing,Masahito Ueda

Journal

arXiv preprint arXiv:2201.12724

Published Date

2022/1/30

This work theoretically studies stochastic neural networks, a main type of neural network in use. We prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Our theory justifies the common intuition that adding stochasticity to the model can help regularize the model by introducing an averaging effect. Two common examples that our theory can be relevant to are neural networks with dropout and Bayesian latent variable models in a special limit. Our result thus helps better understand how stochasticity affects the learning of neural networks and potentially design better architectures for practical problems.

MixMask: Revisiting Masking Strategy for Siamese ConvNets

Authors

Kirill Vishniakov,Eric Xing,Zhiqiang Shen

Journal

arXiv preprint arXiv:2210.11456

Published Date

2022/10/20

Recent advances in self-supervised learning have integrated Masked Image Modeling (MIM) and Siamese Networks into a unified framework that leverages the benefits of both techniques. However, several issues remain unaddressed when applying conventional erase-based masking with Siamese ConvNets. These include (I) the inability to drop uninformative masked regions in ConvNets as they process data continuously, resulting in low training efficiency compared to ViT models; and (II) the mismatch between erase-based masking and the contrastive-based objective in Siamese ConvNets, which differs from the MIM approach. In this paper, we propose a filling-based masking strategy called MixMask to prevent information incompleteness caused by the randomly erased regions in an image in the vanilla masking method. Furthermore, we introduce a flexible loss function design that considers the semantic distance change between two different mixed views to adapt the integrated architecture and prevent mismatches between the transformed input and objective in Masked Siamese ConvNets (MSCN). We conducted extensive experiments on various datasets, including CIFAR-100, Tiny-ImageNet, and ImageNet-1K. The results demonstrate that our proposed framework achieves superior accuracy on linear probing, semi-supervised, and supervised finetuning, outperforming the state-of-the-art MSCN by a significant margin. Additionally, we demonstrate the superiority of our approach in object detection and segmentation tasks. Our source code is available at https://github.com/LightnessOfBeing/MixMask.

Kernel Mixed Model for Transcriptome Association Study

Authors

Haohan Wang,Oscar Lopez,Eric P Xing,Wei Wu

Journal

Journal of Computational Biology

Published Date

2022/12/1

We introduce the Python software package Kernel Mixed Model (KMM), which allows users to incorporate network structure into transcriptome-wide association studies (TWAS). Our software is based on the KMM association algorithm, which enables the incorporation of network structure as the kernels of a linear mixed model for TWAS. The implementation of the algorithm aims to offer users simple access through a one-line command. Furthermore, to improve computing efficiency when the interaction network is sparse, we also provide the flexibility of computing with the sparse counterparts of the matrices offered in Python, which reduces both the computation operations and the memory required.

Betty: An automatic differentiation library for multilevel optimization

Authors

Sang Keun Choe,Willie Neiswanger,Pengtao Xie,Eric Xing

Journal

arXiv preprint arXiv:2207.02849

Published Date

2022/7/5

Gradient-based multilevel optimization (MLO) has gained attention as a framework for studying numerous problems, ranging from hyperparameter optimization and meta-learning to neural architecture search and reinforcement learning. However, gradients in MLO, which are obtained by composing best-response Jacobians via the chain rule, are notoriously difficult to implement and memory/compute intensive. We take an initial step towards closing this gap by introducing Betty, a software library for large-scale MLO. At its core, we devise a novel dataflow graph for MLO, which allows us to (1) develop efficient automatic differentiation for MLO that reduces the computational complexity from O(d^3) to O(d^2), (2) incorporate systems support such as mixed-precision and data-parallel training for scalability, and (3) facilitate implementation of MLO programs of arbitrary complexity while allowing a modular interface for diverse algorithmic and systems design choices. We empirically demonstrate that Betty can be used to implement an array of MLO programs, while also observing up to 11% increase in test accuracy, 14% decrease in GPU memory usage, and 20% decrease in training wall time over existing implementations on multiple benchmarks. We also showcase that Betty enables scaling MLO to models with hundreds of millions of parameters. We open-source the code at https://github.com/leopard-ai/betty.

The impact of symbolic representations on in-context learning for few-shot reasoning

Authors

Hanlin Zhang,Yi-Fan Zhang,Li Erran Li,Eric Xing

Journal

arXiv preprint arXiv:2212.08686

Published Date

2022/12/16

Pre-trained language models (LMs) have shown remarkable reasoning performance using explanations (or ``chain-of-thought'' (CoT)) for in-context learning. On the other hand, these reasoning tasks are usually presumed to be more approachable for symbolic programming. To make progress towards understanding in-context learning, we curate synthetic datasets containing equivalent (natural, symbolic) data pairs, where symbolic examples contain first-order logic rules and predicates from knowledge bases (KBs). Then we revisit neuro-symbolic approaches and use Language Models as Logic Programmer (LMLP) that learns from demonstrations containing logic rules and corresponding examples to iteratively reason over KBs, recovering Prolog's backward chaining algorithm. Comprehensive experiments are included to systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP enjoys more than 25% higher accuracy than CoT on length generalization benchmarks even with fewer parameters.

Alpa: Automating Inter-and Intra-Operator Parallelism for Distributed Deep Learning

Authors

Lianmin Zheng,Zhuohan Li,Hao Zhang,Yonghao Zhuang,Zhifeng Chen,Yanping Huang,Yida Wang,Yuanzhong Xu,Danyang Zhuo,Eric P Xing,Joseph E Gonzalez,Ion Stoica

Journal

arXiv preprint arXiv:2201.12023

Published Date

2022/1/28

Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa.

Negational symmetry of quantum neural networks for binary pattern classification

Authors

Nanqing Dong,Michael Kampffmeyer,Irina Voiculescu,Eric Xing

Journal

Pattern Recognition

Published Date

2022/9

Although quantum neural networks (QNNs) have shown promising results in solving simple machine learning tasks recently, the behavior of QNNs in binary pattern classification is still underexplored. In this work, we find that QNNs have an Achilles’ heel in binary pattern classification. To illustrate this point, we provide a theoretical insight into the properties of QNNs by presenting and analyzing a new form of symmetry embedded in a family of QNNs with full entanglement, which we term negational symmetry. Due to negational symmetry, QNNs can not differentiate between a quantum binary signal and its negational counterpart. We empirically evaluate the negational symmetry of QNNs in binary pattern classification tasks using Google’s quantum computing framework. Both theoretical and experimental results suggest that negational symmetry is a fundamental property of QNNs, which is not shared by classical …

Meta-DETR: Image-level few-shot detection with inter-class correlation exploitation

Authors

Gongjie Zhang,Zhipeng Luo,Kaiwen Cui,Shijian Lu,Eric P Xing

Journal

IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI).

Published Date

2022/8/2

Few-shot object detection has been extensively investigated by incorporating meta-learning into region-based detection frameworks. Despite its success, the said paradigm is still constrained by several factors, such as (i) low-quality region proposals for novel classes and (ii) negligence of the inter-class correlation among different classes. Such limitations hinder the generalization of base-class knowledge for the detection of novel-class objects. In this work, we design Meta-DETR, which (i) is the first image-level few-shot detector, and (ii) introduces a novel inter-class correlational meta-learning strategy to capture and leverage the correlation among different classes for robust and accurate few-shot object detection. Meta-DETR works entirely at image level without any region proposals, which circumvents the constraint of inaccurate proposals in prevalent few-shot detection frameworks. In addition, the introduced …

A Toolkit for Assessments in Introductory Programming Courses

Authors

Eric Xing,Guangming Xing

Published Date

2022/3/1

Traditional paper-based exams and LMS-provided online exams for introductory programming courses are not aligned with learning objectives that emphasize problem-solving and coding skills. In this poster, we present a cloud-based assessment solution for introductory programming courses. First, we discuss the requirements and challenges of conducting frequent assessments. We then outline the functions in our online exam toolkit that allow instructors to administer versatile assessments. Instead of relying on a traditional lockdown browser, the plagiarism and cheating detection in our toolkit allows instructors to administer exams in any modern browser for face-to-face classes.

Rlprompt: Optimizing discrete text prompts with reinforcement learning

Authors

Mingkai Deng,Jianyu Wang,Cheng-Ping Hsieh,Yihan Wang,Han Guo,Tianmin Shu,Meng Song,Eric P Xing,Zhiting Hu

Journal

arXiv preprint arXiv:2205.12548

Published Date

2022/5/25

Prompting has shown impressive success in enabling large pretrained language models (LMs) to perform diverse NLP tasks, especially when only few downstream data are available. Automatically finding the optimal prompt for each task, however, is challenging. Most existing work resorts to tuning soft prompt (e.g., embeddings) which falls short of interpretability, reusability across LMs, and applicability when gradients are not accessible. Discrete prompt, on the other hand, is difficult to optimize, and is often created by "enumeration (e.g., paraphrasing)-then-selection" heuristics that do not explore the prompt space systematically. This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL). RLPrompt formulates a parameter-efficient policy network that generates the desired discrete prompt after training with reward. To overcome the complexity and stochasticity of reward signals by the large LM environment, we incorporate effective reward stabilization that substantially enhances the training efficiency. RLPrompt is flexibly applicable to different types of LMs, such as masked (e.g., BERT) and left-to-right models (e.g., GPTs), for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods. Interestingly, the resulting optimized prompts are often ungrammatical gibberish text; and surprisingly, those gibberish prompts are transferrable between different LMs to retain significant performance, indicating LM prompting may not follow human language patterns.

MPCFormer: fast, performant and private Transformer inference with MPC

Authors

Dacheng Li,Rulin Shao,Hongyi Wang,Han Guo,Eric P Xing,Hao Zhang

Journal

arXiv preprint arXiv:2211.01452

Published Date

2022/11/2

Enabling private inference is crucial for many cloud inference services that are based on Transformer models. However, existing private inference solutions can increase the inference latency by more than 60x or significantly compromise the inference quality. In this paper, we design the framework MPCFORMER as a practical solution, using Secure Multi-Party Computation (MPC) and Knowledge Distillation (KD). Through extensive evaluations, we show that MPCFORMER significantly speeds up Transformer inference in MPC settings while achieving similar ML performance to the input model. On the IMDb dataset, it achieves similar performance to BERTBASE, while being 5.3x faster. On the GLUE benchmark, it achieves 97% performance of BERTBASE with a 2.2x speedup. MPCFORMER remains effective with different trained Transformer weights such as ROBERTABASE and larger models including BERTLarge. Code is available at https://github.com/MccRee177/MPCFormer.

Expeditious Saliency-guided Mix-up through Random Gradient Thresholding

Authors

Minh-Long Luu,Zeyi Huang,Eric P Xing,Yong Jae Lee,Haohan Wang

Journal

arXiv preprint arXiv:2212.04875

Published Date

2022/12/9

Mix-up training approaches have proven to be effective in improving the generalization ability of Deep Neural Networks. Over the years, the research community expands mix-up methods into two directions, with extensive efforts to improve saliency-guided procedures but minimal focus on the arbitrary path, leaving the randomization domain unexplored. In this paper, inspired by the superior qualities of each direction over one another, we introduce a novel method that lies at the junction of the two routes. By combining the best elements of randomness and saliency utilization, our method balances speed, simplicity, and accuracy. We name our method R-Mix following the concept of "Random Mix-up". We demonstrate its effectiveness in generalization, weakly supervised object localization, calibration, and robustness to adversarial attacks. Finally, in order to address the question of whether there exists a better decision protocol, we train a Reinforcement Learning agent that decides the mix-up policies based on the classifier's performance, reducing dependency on human-designed objectives and hyperparameter tuning. Extensive experiments further show that the agent is capable of performing at the cutting-edge level, laying the foundation for a fully automatic mix-up. Our code is released at [https://github.com/minhlong94/Random-Mixup].

Technology readiness levels for machine learning systems

Authors

Alexander Lavin,Ciarán M Gilligan-Lee,Alessya Visnjic,Siddha Ganju,Dava Newman,Sujoy Ganguly,Danny Lange,Atílím Güneş Baydin,Amit Sharma,Adam Gibson,Stephan Zheng,Eric P Xing,Chris Mattmann,James Parr,Yarin Gal

Journal

Nature Communications

Published Date

2022/10/20

The development and deployment of machine learning systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end. Lack of diligence can lead to technical debt, scope creep and misaligned objectives, model misuse and failures, and expensive consequences. Engineering systems, on the other hand, follow well-defined processes and testing standards to streamline development for high-quality, reliable results. The extreme is spacecraft systems, with mission critical measures and robustness throughout the process. Drawing on experience in both spacecraft engineering and machine learning (research through product across domain areas), we’ve developed a proven systems engineering approach for machine learning and artificial intelligence: the Machine Learning Technology Readiness Levels framework defines a principled process to ensure robust, reliable …

BertNet: Harvesting knowledge graphs with arbitrary relations from pretrained language models

Authors

Shibo Hao,Bowen Tan,Kaiwen Tang,Bin Ni,Xiyan Shao,Hengzhe Zhang,Eric P Xing,Zhiting Hu

Journal

arXiv preprint arXiv:2206.14268

Published Date

2022/6/28

It is crucial to automatically construct knowledge graphs (KGs) of diverse new relations to support knowledge discovery and broad applications. Previous KG construction methods, based on either crowdsourcing or text mining, are often limited to a small predefined set of relations due to manual cost or restrictions in text corpus. Recent research proposed to use pretrained language models (LMs) as implicit knowledge bases that accept knowledge queries with prompts. Yet, the implicit knowledge lacks many desirable properties of a full-scale symbolic KG, such as easy access, navigation, editing, and quality assurance. In this paper, we propose a new approach of harvesting massive KGs of arbitrary relations from pretrained LMs. With minimal input of a relation definition (a prompt and a few shot of example entity pairs), the approach efficiently searches in the vast entity pair space to extract diverse accurate knowledge of the desired relation. We develop an effective search-and-rescore mechanism for improved efficiency and accuracy. We deploy the approach to harvest KGs of over 400 new relations from different LMs. Extensive human and automatic evaluations show our approach manages to extract diverse accurate knowledge, including tuples of complex relations (e.g., "A is capable of but not good at B"). The resulting KGs as a symbolic interpretation of the source LMs also reveal new insights into the LMs' knowledge capacities.

Prototypical graph contrastive learning

Authors

Shuai Lin,Chen Liu,Pan Zhou,Zi-Yuan Hu,Shuojia Wang,Ruihui Zhao,Yefeng Zheng,Liang Lin,Eric Xing,Xiaodan Liang

Journal

IEEE transactions on neural networks and learning systems

Published Date

2022/7/27

Graph-level representations are critical in various real-world applications, such as predicting the properties of molecules. However, in practice, precise graph annotations are generally very expensive and time-consuming. To address this issue, graph contrastive learning constructs an instance discrimination task, which pulls together positive pairs (augmentation pairs of the same graph) and pushes away negative pairs (augmentation pairs of different graphs) for unsupervised representation learning. However, since the negatives for a query are uniformly sampled from all graphs, existing methods suffer from a critical sampling bias issue, i.e., the negatives are likely to have the same semantic structure as the query, leading to performance degradation. To mitigate this sampling bias issue, in this article, we propose a prototypical graph contrastive learning (PGCL) approach. Specifically, PGCL models the underlying …
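As a reference point for the instance-discrimination objective the paper starts from, here is a minimal InfoNCE-style contrastive loss in PyTorch; this is the generic baseline, not PGCL's prototype-based variant:

    import torch
    import torch.nn.functional as F

    def infonce_loss(z1, z2, temperature=0.5):
        """z1, z2: [batch, dim] embeddings of two augmentations of the same
        graphs; row i of z1 and z2 form a positive pair, all other rows act
        as negatives (the uniform sampling that causes the bias discussed)."""
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature      # pairwise similarities
        labels = torch.arange(z1.size(0))       # positives on the diagonal
        return F.cross_entropy(logits, labels)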

Trade-offs of linear mixed models in genome-wide Association studies

Authors

Haohan Wang,Bryon Aragam,Eric P Xing

Journal

Journal of Computational Biology

Published Date

2022/3/1

Motivated by empirical arguments that are well known from the genome-wide association studies (GWAS) literature, we study the statistical properties of linear mixed models (LMMs) applied to GWAS. First, we study the sensitivity of LMMs to the inclusion of a candidate single nucleotide polymorphism (SNP) in the kinship matrix, which is often done in practice to speed up computations. Our results shed light on the size of the error incurred by including a candidate SNP, providing a justification for this technique as a trade-off of velocity against veracity. Second, we investigate how mixed models can correct for confounders in GWAS, which is widely accepted as an advantage of LMMs over traditional methods. We consider two sources of confounding factors, population stratification and environmental confounding, and study how different methods that are commonly used in practice trade off these two confounding …
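For readers outside the GWAS literature, the LMM under study has the standard form below, where K is the kinship matrix estimated from the SNPs (the inclusion of the candidate SNP in K is the practice the paper analyzes); the notation is the textbook convention, not quoted from the paper:

    \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} + \boldsymbol{\epsilon},
    \qquad \mathbf{u} \sim \mathcal{N}(\mathbf{0}, \sigma_g^2 \mathbf{K}),
    \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma_e^2 \mathbf{I})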

Dropout as a regularizer of interaction effects

Authors

Benjamin J Lengerich,Eric Xing,Rich Caruana

Published Date

2022/5/3

We examine Dropout through the perspective of interactions. This view provides a symmetry that explains Dropout: given N variables, there are N choose k possible sets of k variables that can form an interaction (i.e., O(N^k)); conversely, the probability that an interaction of k variables survives Dropout at rate p is (1-p)^k (decaying with k). These rates effectively cancel, and so Dropout regularizes against higher-order interactions. We demonstrate this perspective analytically and empirically. This perspective of Dropout as a regularizer against interaction effects has several practical implications: (1) higher Dropout rates should be used when we need stronger regularization against spurious high-order interactions, (2) caution should be exercised when interpreting Dropout-based explanations and uncertainty measures, and (3) networks trained with Input Dropout are biased estimators. We also compare Dropout to other regularizers and find that it is difficult to obtain the same selective pressure against high-order interactions with these methods.
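A quick numerical illustration of the cancellation argument (a toy check of the counting, not an experiment from the paper): the number of k-way interactions grows combinatorially while their survival probability under Dropout decays geometrically:

    from math import comb

    N, p = 100, 0.5  # number of variables, Dropout rate
    for k in (1, 2, 3, 4):
        n_interactions = comb(N, k)   # O(N^k) candidate k-way interactions
        survival = (1 - p) ** k       # chance all k inputs survive Dropout
        print(k, n_interactions, survival, n_interactions * survival)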

Neural network decision-making criteria consistency analysis via inputs sensitivity

Authors

Eric Xing,Xin Xing,Liangliang Liu,Nathan Jacobs,Yunni Qu,Gongbo Liang

Published Date

2022/8/24

Neural networks (NNs) have demonstrated exciting results on various tasks within the last decade. For example, performance on image classification tasks has improved dramatically. However, performance evaluations are often based on black-box metrics, such as accuracy, while insightful analysis of the black box, such as the prediction-formation mechanism, is often missing. Empirically, an NN usually produces a stable overall performance on the same task across multiple training trials when treated as a black box. However, when the black box is unveiled, the performance is usually volatile. The decision-making criteria learned in different training trials are often significantly different, which is problematic in many ways. We believe achieving consistent criteria between different training trials is equally important to achieving high performance, if not more so. This work, firstly, evaluates the decision …
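One simple way to compare decision criteria across training trials is to correlate input-gradient sensitivity maps from independently trained models; a hedged sketch of such a comparison (generic saliency correlation, not necessarily the paper's exact protocol):

    import torch

    def input_saliency(model, x, target):
        """Magnitude of d(logit_target)/d(input) as a sensitivity map."""
        x = x.clone().requires_grad_(True)
        model(x)[0, target].backward()
        return x.grad.abs().flatten()

    def criteria_consistency(model_a, model_b, x, target):
        """Correlation of two models' sensitivity maps on the same input."""
        s = torch.stack([input_saliency(model_a, x, target),
                         input_saliency(model_b, x, target)])
        return torch.corrcoef(s)[0, 1].item()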

Data-free neural architecture search via recursive label calibration

Authors

Zechun Liu,Zhiqiang Shen,Yun Long,Eric Xing,Kwang-Ting Cheng,Chas Leichner

Journal

arXiv preprint arXiv:2112.02086 (ECCV 2022)

Published Date

2021/12/3

This paper aims to explore the feasibility of neural architecture search (NAS) given only a pre-trained model, without using any original training data. This is an important circumstance for privacy protection, bias avoidance, etc., in real-world scenarios. To achieve this, we start by synthesizing usable data through recovering the knowledge from a pre-trained deep neural network. Then we use the synthesized data and their predicted soft labels to guide NAS. We identify that the quality of the synthesized data substantially affects the NAS results. In particular, we find NAS requires the synthesized images to possess enough semantics and diversity, and a minimal domain gap from natural images. To meet these requirements, we propose recursive label calibration to encode more relative semantics into the images, as well as a regional update strategy to enhance the diversity. Further, we use input and feature-level …
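The data-synthesis step can be understood as optimizing noise images so the frozen pre-trained network assigns them confident labels; a minimal, generic sketch of that recovery step (the paper's recursive label calibration and regional updates are omitted):

    import torch
    import torch.nn.functional as F

    def synthesize(model, target_labels, steps=200, lr=0.1, shape=(3, 224, 224)):
        """Recover images from a frozen pre-trained classifier by gradient
        descent on the inputs toward the desired labels."""
        model.eval()
        for p in model.parameters():
            p.requires_grad_(False)
        x = torch.randn(len(target_labels), *shape, requires_grad=True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = F.cross_entropy(model(x), target_labels)
            loss.backward()
            opt.step()
        return x.detach()  # synthesized images usable as NAS proxy data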

Rare gems: Finding lottery tickets at initialization

Authors

Kartik Sreenivasan,Jy-yong Sohn,Liu Yang,Matthew Grinde,Alliot Nagle,Hongyi Wang,Kangwook Lee,Dimitris Papailiopoulos

Journal

Advances in neural information processing systems

Published Date

2022/11

Large neural networks can be pruned to a small fraction of their original size, with little loss in accuracy, by following a time-consuming "train, prune, re-train" approach. Frankle & Carbin conjecture that we can avoid this by training lottery tickets, i.e., special sparse subnetworks found at initialization, that can be trained to high accuracy. However, a subsequent line of work presents concrete evidence that current algorithms for finding trainable networks at initialization fail simple baseline comparisons, e.g., against training random sparse subnetworks. Finding lottery tickets that train to better accuracy compared to simple baselines remains an open problem. In this work, we resolve this open problem by proposing Gem-Miner, which finds lottery tickets at initialization that beat current baselines. Gem-Miner finds lottery tickets trainable to accuracy competitive with or better than Iterative Magnitude Pruning (IMP), and does so faster.
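A common mechanism in this line of work, and a useful mental model here, is to freeze the random initialization and learn a score per weight, keeping the top-scoring fraction as the ticket; a hedged sketch of that generic pattern (edge-popup-style scoring with a straight-through estimator; the details are my assumptions, not Gem-Miner's exact algorithm):

    import torch

    def top_k_mask(scores, sparsity):
        """Binary mask keeping the (1 - sparsity) fraction of highest scores."""
        k = int(scores.numel() * (1 - sparsity))
        threshold = scores.flatten().topk(k).values.min()
        return (scores >= threshold).float()

    class MaskedLinear(torch.nn.Module):
        def __init__(self, in_f, out_f, sparsity=0.9):
            super().__init__()
            w = torch.empty(out_f, in_f)
            torch.nn.init.kaiming_normal_(w)
            self.weight = torch.nn.Parameter(w, requires_grad=False)  # frozen init
            self.scores = torch.nn.Parameter(torch.rand(out_f, in_f)) # learned
            self.sparsity = sparsity

        def forward(self, x):
            mask = top_k_mask(self.scores, self.sparsity)
            # Straight-through: forward uses the hard mask, gradients reach scores.
            mask = mask + self.scores - self.scores.detach()
            return torch.nn.functional.linear(x, self.weight * mask)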

ASDOT: Any-shot data-to-text generation with pretrained language models

Authors

Jiannan Xiang,Zhengzhong Liu,Yucheng Zhou,Eric P Xing,Zhiting Hu

Journal

arXiv preprint arXiv:2210.04325

Published Date

2022/10/9

Data-to-text generation is challenging due to the great variety of the input data in terms of domains (e.g., finance vs. sports) or schemata (e.g., diverse predicates). Recent end-to-end neural methods thus require substantial training examples to learn to disambiguate and describe the data. Yet, real-world data-to-text problems often suffer from various data-scarcity issues: one may have access to only a handful of, or no, training examples, and/or may have to rely on examples in a different domain or schema. To fill this gap, we propose Any-Shot Data-to-Text (ASDOT), a new approach flexibly applicable to diverse settings by making efficient use of any given (or no) examples. ASDOT consists of two steps, data disambiguation and sentence fusion, both of which can be solved with off-the-shelf pretrained language models (LMs), with optional finetuning. In the data disambiguation stage, we employ the prompted GPT-3 model to understand possibly ambiguous triples from the input data and convert each into a short sentence with reduced ambiguity. The sentence fusion stage then uses an LM like T5 to fuse all the resulting sentences into a coherent paragraph as the final description. We evaluate extensively on various datasets in different scenarios, including the zero-/few-/full-shot settings, and generalization to unseen predicates and out-of-domain data. Experimental results show that ASDOT consistently achieves significant improvements over baselines, e.g., a 30.81 BLEU gain on the DART dataset under the zero-shot setting.
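The two-stage pipeline is easy to picture in code; generate below stands in for any prompted LM call (e.g., a GPT-3 or T5 endpoint) and is a hypothetical placeholder, as are the prompt wordings:

    def asdot_style_describe(triples, generate):
        """Two stages: disambiguate each triple into a short sentence,
        then fuse the sentences into one paragraph."""
        # Stage 1: data disambiguation, one prompted call per triple.
        sentences = [
            generate(f"Express the triple ({h}; {r}; {t}) as a short, "
                     f"unambiguous sentence.")
            for h, r, t in triples
        ]
        # Stage 2: sentence fusion into a coherent description.
        return generate("Fuse these sentences into one coherent paragraph:\n"
                        + "\n".join(sentences))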

Sdq: Stochastic differentiable quantization with mixed precision

Authors

Xijie Huang,Zhiqiang Shen,Shichao Li,Zechun Liu,Xianghong Hu,Jeffry Wicaksana,Eric Xing,Kwang-Ting Cheng

Published Date

2022/6/9

In order to deploy deep models in a computationally efficient manner, model quantization approaches have been frequently used. In addition, as new hardware supports various-bit arithmetic operations, recent research on mixed-precision quantization (MPQ) has begun to fully leverage the capacity of representation by searching various bitwidths for different layers and modules in a network. However, previous studies mainly search the MPQ strategy in a costly scheme using reinforcement learning, neural architecture search, etc., or simply utilize partial prior knowledge for the bitwidth distribution, which may be biased and sub-optimal. In this work, we present a novel Stochastic Differentiable Quantization (SDQ) method that can automatically learn the MPQ strategy in a more flexible and globally-optimized space with a smoother gradient approximation. In particular, Differentiable Bitwidth Parameters (DBPs) are employed as the probability factors in stochastic quantization between adjacent bitwidths. After the optimal MPQ strategy is acquired, we further train our network with entropy-aware bin regularization and knowledge distillation. We extensively evaluate our method on different networks, hardware platforms (GPUs and FPGAs), and datasets. SDQ outperforms all other state-of-the-art mixed or single precision quantization methods with fewer bits, and is even better than the original full-precision counterparts across various ResNet and MobileNet families, demonstrating the effectiveness and superiority of our method. Code will be publicly available.
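The core trick, learnable probabilities over candidate bitwidths combined with a differentiable relaxation, can be sketched as follows; this is a generic soft-mixture relaxation with illustrative bitwidths, not the paper's exact stochastic formulation:

    import torch

    def quantize(w, bits):
        """Uniform symmetric fake-quantization with a straight-through gradient."""
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().detach() / qmax
        q = torch.round(w / scale).clamp(-qmax, qmax) * scale
        return w + (q - w).detach()   # forward: quantized, backward: identity

    class MixedPrecisionWeight(torch.nn.Module):
        def __init__(self, weight, bit_choices=(2, 4, 8)):
            super().__init__()
            self.weight = torch.nn.Parameter(weight)
            self.bit_choices = bit_choices
            # Differentiable bitwidth parameters: one logit per candidate.
            self.logits = torch.nn.Parameter(torch.zeros(len(bit_choices)))

        def forward(self):
            probs = torch.softmax(self.logits, dim=0)
            # Soft mixture over quantized copies; gradients reach the logits,
            # so the bitwidth distribution is learned jointly with the weights.
            return sum(p * quantize(self.weight, b)
                       for p, b in zip(probs, self.bit_choices))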

Robustar: Interactive Toolbox Supporting Precise Data Annotation for Robust Vision Learning

Authors

Chonghan Chen,Haohan Wang,Leyang Hu,Yuhao Zhang,Shuguang Lyu,Jingcheng Wu,Xinnuo Li,Linjing Sun,Eric P Xing

Journal

arXiv preprint arXiv:2207.08944

Published Date

2022/7/18

We introduce the initial release of our software Robustar, which aims to improve the robustness of vision classification machine learning models through a data-driven perspective. Building upon the recent understanding that a machine learning model's lack of robustness stems from its tendency to learn spurious features, we aim to solve this problem at its root, from the data perspective, by removing the spurious features from the data before training. In particular, we introduce software that helps users better prepare the data for training image classification models by allowing them to annotate the spurious features at the pixel level of images. To facilitate this process, our software also leverages recent advances to help identify potential images and pixels worthy of attention and to continue the training with newly annotated data. Our software is hosted at the GitHub repository https://github.com/HaohanWang/Robustar.

Efficient peer-to-peer architecture for distributed machine learning

Published Date

2022/2/15

A computer in a distributed peer-to-peer system is disclosed. The distributed system includes a plurality of computers configured to run a distributed machine learning (ML) program represented as an expression of a target loss function with a model parameter matrix. The computer includes: a parser module configured to convert a loss function in the distributed program into an expression graph and then one or more multiplication trees; a parameter replica module in communication with the parser module, the parameter replica module configured to maintain the model parameter matrix of the ML program; a compressor module in communication with the parameter replica module, the compressor module configured to extract sufficient factors from the expression graph for updating the model matrix; and a communication module in communication with the compressor module, the communication module configured …
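The compressor module's "sufficient factors" idea can be illustrated simply: for many losses, the gradient of a weight matrix on one example is a rank-1 outer product, so peers can exchange the two factor vectors instead of the full matrix. A minimal numpy sketch of that principle (illustrative only, not the patented implementation):

    import numpy as np

    def sufficient_factors(x, grad_logits):
        """For a layer computing logits = W @ x, the per-example gradient is
        dL/dW = u v^T with u = dL/dlogits and v = x; send (u, v) instead."""
        return grad_logits, x

    def apply_update(W, u, v, lr=0.01):
        # Reconstruct the rank-1 gradient on the receiving peer.
        W -= lr * np.outer(u, v)
        return W

    # Communication drops from O(rows * cols) to O(rows + cols) per example.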

Gene set prioritization guided by regulatory networks with p-values through kernel mixed model

Authors

Haohan Wang,Oscar L Lopez,Wei Wu,Eric P Xing

Published Date

2022/4/29

Transcriptome association studies have helped prioritize many causal genes for detailed study and have thus furthered the development of therapeutic strategies for multiple diseases. However, prioritizing the causal gene alone does not always offer sufficient guidance for the downstream analysis. Thus, in this paper, we propose to perform the association study from another perspective: we aim to prioritize genes with a tradeoff between the pursuit of causality evidence and the interest of the genes in the pathway. We introduce a new method for transcriptome association studies by incorporating the information of gene regulatory networks. In addition to building the regularization directly into variable selection methods, we also expect the method to report p-values for the associated genes, since such p-values have been empirically proved trustworthy by geneticists. Thus, we introduce a …

System and Methods for Distributed Machine Learning with Multiple Data Sources, Multiple Programming Languages or Frameworks, and Multiple Devices or Infrastructures

Published Date

2022/8/18

Methods and systems are presented for consuming different data sources, and deploying artificial intelligence and machine learning programs on different target devices or infrastructures. Many data types can be transformed into machine learning data shards (MLDS) while many machine learning programs written in various programming languages or frameworks are transformed to common operator representations. Operator representations are transformed into execution graphs (EG) for a chosen target device or infrastructure. The MLDS and EG are input to the targeted devices and infrastructures, which then execute the machine learning programs (now transformed to EGs) on the MLDS to produce trained models or predictions with trained models.

A fast knowledge distillation framework for visual recognition

Authors

Zhiqiang Shen,Eric Xing

Journal

arXiv preprint arXiv:2112.01528 (ECCV 2022)

Published Date

2021/12/2

While Knowledge Distillation (KD) has been recognized as a useful tool in many visual tasks, such as supervised classification and self-supervised representation learning, the main drawback of a vanilla KD framework is its mechanism that consumes the majority of the computational overhead on forwarding through the giant teacher networks, making the entire learning procedure inefficient and costly. The recently proposed solution ReLabel suggests creating a label map for the entire image. During training, it receives the cropped region-level label by RoI aligning on a pre-generated entire label map, which allows for efficient supervision generation without having to pass through the teachers repeatedly. However, as the pre-trained teacher employed in ReLabel is from the conventional multi-crop scheme, there are various mismatches between the global label-map and region-level labels in this technique …

Masked generative adversarial networks are data-efficient generation learners

Authors

Jiaxing Huang,Kaiwen Cui,Dayan Guan,Aoran Xiao,Fangneng Zhan,Shijian Lu,Shengcai Liao,Eric Xing

Published Date

2022/11

This paper shows that masked generative adversarial networks (MaskedGAN) are robust image generation learners with limited training data. The idea of MaskedGAN is simple: it randomly masks out certain image information for effective GAN training with limited data. We develop two masking strategies that work along orthogonal dimensions of training images: a shifted spatial masking that masks the images in the spatial dimensions with random shifts, and a balanced spectral masking that masks certain image spectral bands with self-adaptive probabilities. The two masking strategies complement each other and together encourage more challenging holistic learning from limited training data, ultimately suppressing trivial solutions and failures in GAN training. Albeit simple, extensive experiments show that MaskedGAN achieves superior performance consistently across different network architectures (e.g., CNNs including BigGAN and StyleGAN-v2, and Transformers including TransGAN and GANformer) and datasets (e.g., CIFAR-10, CIFAR-100, ImageNet, 100-shot, AFHQ, FFHQ and Cityscapes).
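The masking strategies are cheap tensor operations; a hedged sketch of the spatial one (a random grid mask with a random shift; the grid size and mask ratio are illustrative choices, not the paper's settings):

    import torch

    def shifted_spatial_mask(images, grid=8, ratio=0.5):
        """Mask out random grid cells of a batch, with a random spatial
        shift so masked regions vary across training iterations."""
        n, c, h, w = images.shape
        keep = (torch.rand(n, 1, grid, grid) > ratio).float()
        # Upsample the coarse grid mask to image resolution.
        mask = torch.nn.functional.interpolate(keep, size=(h, w), mode="nearest")
        # Random shift along both spatial dimensions.
        dx = torch.randint(0, h, (1,)).item()
        dy = torch.randint(0, w, (1,)).item()
        mask = torch.roll(mask, shifts=(dx, dy), dims=(2, 3))
        return images * mask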

Does Dataset Lottery Ticket Hypothesis Exist?

Authors

Zhiqiang Shen,Eric Xing

Published Date

2022/9/29

Tuning hyperparameters and exploring suitable training schemes for self-supervised models is usually expensive and resource-consuming, especially on large-scale datasets like ImageNet-1K. Critically, this means only a few establishments (e.g., Google, Meta, etc.) can afford the heavy experiments required, which seriously hinders broader engagement and development of the area. An ideal situation would be a subset of the full large-scale dataset that correctly reflects the performance distinctions among different training frameworks, hyperparameters, etc. This new training manner would substantially decrease resource requirements and improve the computational efficiency of ablations without compromising accuracy on the full dataset. We formulate this problem as the dataset lottery ticket hypothesis and the target subsets as the winning tickets. In this work, we analyze the problem by finding partial empirical data along the class dimension that has a consistent Empirical Risk Trend with the full observed dataset. We also examine multiple solutions, including (i) a uniform selection scheme that has been widely used in the literature; and (ii) subsets built with prior knowledge, for instance, using the sorted per-class performance of a strong supervised model, or the WordNet tree of hierarchical semantic classes, to generate the target winning tickets. We verify this hypothesis on the self-supervised learning task across a variety of recent mainstream methods, such as MAE, DINO, MoCo-V1/V2, etc., with different backbones like ResNet …
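A winning ticket in this sense can be checked empirically by verifying that methods rank the same on the subset as on the full dataset; a small sketch of such a consistency check (Spearman rank correlation is my illustrative criterion, and the scores are made-up numbers):

    from scipy.stats import spearmanr

    def is_winning_ticket(subset_scores, full_scores, threshold=0.9):
        """subset_scores / full_scores: accuracy of each candidate method
        (e.g., MAE, DINO, MoCo) on the subset vs. the full dataset. The
        subset 'wins' if it preserves the methods' relative ranking."""
        rho, _ = spearmanr(subset_scores, full_scores)
        return rho >= threshold

    # Example: three methods evaluated on a class subset and the full dataset.
    print(is_winning_ticket([71.2, 74.5, 69.8], [76.0, 79.1, 74.3]))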

Learning from mistakes–a framework for neural architecture search

Authors

Bhanu Garg,Li Zhang,Pradyumna Sridhara,Ramtin Hosseini,Eric Xing,Pengtao Xie

Journal

Proceedings of the AAAI Conference on Artificial Intelligence

Published Date

2022/6/28

Learning from one's mistakes is an effective human learning technique, in which learners focus more on the topics where mistakes were made so as to deepen their understanding. In this paper, we investigate whether this human learning strategy can be applied in machine learning. We propose a novel machine learning method called Learning From Mistakes (LFM), wherein the learner improves its ability to learn by focusing more on the mistakes during revision. We formulate LFM as a three-stage optimization problem: 1) the learner learns; 2) the learner re-learns, focusing on the mistakes; and 3) the learner validates its learning. We develop an efficient algorithm to solve the LFM problem. We apply the LFM framework to neural architecture search on CIFAR-10, CIFAR-100, and ImageNet. Experimental results strongly demonstrate the effectiveness of our model.
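The three-stage structure can be pictured as nested training passes; a schematic sketch only (the paper formulates this as a multi-stage optimization problem and solves it with an efficient algorithm, which this plain loop does not reproduce):

    def learn_from_mistakes(model, train_set, val_set, train_fn, eval_fn):
        """Schematic LFM loop: learn, re-learn on mistakes, validate."""
        # Stage 1: the learner learns on the full training set.
        train_fn(model, train_set)
        # Stage 2: the learner re-learns, focusing on examples it got wrong.
        mistakes = [ex for ex in train_set if not eval_fn(model, ex)]
        train_fn(model, mistakes)
        # Stage 3: the learner validates its learning on held-out data.
        return sum(eval_fn(model, ex) for ex in val_set) / len(val_set)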

MRCLens: an MRC Dataset Bias Detection Toolkit

Authors

Yifan Zhong,Haohan Wang,Eric P Xing

Journal

arXiv preprint arXiv:2207.08943

Published Date

2022/7/18

Many recent neural models have shown remarkable empirical results in Machine Reading Comprehension, but evidence suggests that the models sometimes take advantage of dataset biases to predict, and fail to generalize on out-of-sample data. While many other approaches have been proposed to address this issue from the computational perspective, such as new architectures or training procedures, we believe a method that allows researchers to discover biases and adjust the data or the models at an earlier stage will be beneficial. Thus, we introduce MRCLens, a toolkit that detects whether biases exist before users train the full model. For the convenience of introducing the toolkit, we also provide a categorization of common biases in MRC.

System for automated data engineering for large scale machine learning

Published Date

2022/4/12

Accordingly, a data engineering system for machine learning at scale is disclosed. In one embodiment, the data engineering system includes an ingest processing module having a schema update submodule and a feature statistics update submodule, wherein the schema update submodule is configured to discover new features and add them to a schema, and wherein the feature statistics update submodule collects statistics for each feature to be used in an online transformation, a record store to store data from a data source, and a transformation module, to receive a low dimensional data instance from the record store and to receive the schema and feature statistics from the ingest processing module, and to transform the low dimensional data instance into a high dimensional representation. One embodiment provides a method for data engineering for machine learning at scale, the method including calling a …

Toward learning human-aligned cross-domain robust models by countering misaligned features

Authors

Haohan Wang,Zeyi Huang,Hanlin Zhang,Yong Jae Lee,Eric P Xing

Published Date

2022/8/17

Machine learning has demonstrated remarkable prediction accuracy over i.i.d. data, but the accuracy often drops when tested with data from another distribution. In this paper, we offer another view of this problem, attributing the accuracy drop to the models' reliance on features that are not aligned with how a data annotator considers similarity across the two datasets. We refer to these features as misaligned features. We extend the conventional generalization error bound to a new one for this setup, using knowledge of how the misaligned features are associated with the label. Our analysis yields a set of techniques for this problem, and these techniques are naturally linked to many previous methods in the robust machine learning literature. We also compare the empirical strength of these methods and demonstrate the performance when these techniques are combined; an implementation is available.

Exploring transformer backbones for heterogeneous treatment effect estimation

Authors

Yi-Fan Zhang,Hanlin Zhang,Zachary C Lipton,Li Erran Li,Eric P Xing

Journal

arXiv preprint arXiv:2202.01336

Published Date

2022/2/2

Previous works on Treatment Effect Estimation (TEE) are not in widespread use because they are predominantly theoretical, making strong parametric assumptions that are intractable in practical applications. Recent work uses multilayer perceptrons (MLPs) for modeling causal relationships; however, MLPs lag far behind recent advances in ML methodology, which limits their applicability and generalizability. To extend beyond the single-domain formulation and towards more realistic learning scenarios, we explore model design spaces beyond MLPs, i.e., transformer backbones, which provide flexibility where attention layers govern interactions among treatments and covariates to exploit the structural similarities of potential outcomes for confounding control. Through careful model design, Transformers as Treatment Effect Estimators (TransTEE) is proposed. We show empirically that TransTEE can: (1) serve as a general-purpose treatment effect estimator that significantly outperforms competitive baselines in a variety of challenging TEE problems (e.g., discrete, continuous, structured, or dosage-associated treatments) and is applicable both when covariates are tabular and when they consist of structured data (e.g., texts, graphs); and (2) yield multiple advantages: compatibility with propensity score modeling, parameter efficiency, robustness to continuous treatment value distribution shifts, explainability in covariate adjustment, and real-world utility in auditing pre-trained language models.
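The design idea, treatment and covariates as tokens whose interactions are governed by attention, can be sketched in a few lines; this is a schematic module with illustrative dimensions, not the TransTEE architecture:

    import torch

    class AttentionTEE(torch.nn.Module):
        def __init__(self, dim=64, heads=4):
            super().__init__()
            self.embed_x = torch.nn.Linear(1, dim)   # one token per covariate
            self.embed_t = torch.nn.Linear(1, dim)   # treatment as its own token
            layer = torch.nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
            self.head = torch.nn.Linear(dim, 1)

        def forward(self, covariates, treatment):
            # covariates: [batch, n_covariates]; treatment: [batch, 1]
            x_tok = self.embed_x(covariates.unsqueeze(-1))   # [B, n, dim]
            t_tok = self.embed_t(treatment).unsqueeze(1)     # [B, 1, dim]
            h = self.encoder(torch.cat([t_tok, x_tok], dim=1))
            # Predict the potential outcome from the treatment token,
            # which has attended to every covariate token.
            return self.head(h[:, 0])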

Eric Xing FAQs

What is Eric Xing's h-index at Carnegie Mellon University?

Eric Xing's h-index is 114 overall and 87 counting only citations since 2020.

What are Eric Xing's top articles?

Eric Xing's top articles at Carnegie Mellon University include:

Learning to Prompt Segment Anything Models

Judging llm-as-a-judge with mt-bench and chatbot arena

MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

Temporally Disentangled Representation Learning under Unknown Nonstationarity

Cappy: Outperforming and boosting large multi-task lms with a small scorer

AttentionPert: Accurately Modeling Multiplexed Genetic Perturbations with Multi-scale Effects

Squeeze, recover and relabel: Dataset condensation at imagenet scale from a new perspective

Generating, Reconstructing, and Representing Discrete and Continuous Data: Generalized Diffusion with Learnable Encoding-Decoding

...


What are Eric Xing's research interests?

The research interests of Eric Xing are: Machine Learning, ML Systems, Optimization, Statistics, Network Analysis

What is Eric Xing's total number of citations?

Eric Xing has 57,613 citations in total.

What are the co-authors of Eric Xing?

The co-authors of Eric Xing are Michael I. Jordan, Li Fei-Fei, David Blei, Noah A. Smith, Jun Zhu, Edoardo M Airoldi.

    Co-Authors

    Michael I. Jordan, University of California, Berkeley (H-index: 203)
    Li Fei-Fei, Stanford University (H-index: 144)
    David Blei, Columbia University in the City of New York (H-index: 106)
    Noah A. Smith, University of Washington (H-index: 104)
    Jun Zhu, Tsinghua University (H-index: 75)
    Edoardo M Airoldi, Harvard University (H-index: 51)