Our Mission

The Generative Artificial Intelligence Research (GAIR) lab creates cutting-edge generative AI technologies that empower humans to solve complex problems and improve the quality of life for people around the world. Specifically:
  • Fundamental Research: we are committed to conducting rigorous and ethical research that promotes transparency and accountability of generative AI technologies.
  • Aligned Systems: by leveraging cutting-edge machine learning and natural language processing, we aim to create AI systems that generate novel and useful outputs while respecting the diverse perspectives and values of their users.
  • Social Impact: we will collaborate closely with academic, industry, community, and government partners, as well as general users, to ensure that our work has a positive impact on society.

Selected Projects

LIMR: Less is More for RL Scaling (2025)
An approach that challenges prevailing assumptions about data scaling in reinforcement learning for LLMs.

LIMO: Less Is More for Reasoning (2025)
LIMO challenges the conventional wisdom in mathematical reasoning by demonstrating that models can achieve superior performance with significantly less, but higher-quality, training data.

DIVE: Diversified Iterative self-ImproVEment (2025)
DIVE increases reasoning diversity in self-improving language models while maintaining performance quality.

O1 Replication Journey (2024)
Introduce a pioneering approach to artificial intelligence research, embodied in our O1 Replication Journey.

Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale (2024)
An LM-based data refinement framework to improve the quality of pre-training data for large language models.

Benchmarking Benchmark Leakage (2024)
Propose a detection pipeline for estimating potential benchmark leakage.

Dissecting Human and LLM Preferences (2024)
Conduct a thorough analysis of human and LLM preferences based on various real-world scenarios.

Reformatted Alignment (2024)
Reformatting the responses in instruction-tuning data can significantly enhance the performance of LLMs.

ScaleEval (2024)
An agent-debate-assisted meta-evaluation framework that leverages the capabilities of multiple communicative LLM agents.

InFoBench (2024)
A benchmark for evaluating instruction following, comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories.

Entropy-ABF (2024)
Supports efficient context window extension of RoPE-based LLMs with only 100 samples.

The Critique of Critique (2024)
A new judge that can effectively evaluate human-written or LLM-generated critiques by generating critiques of them.

SimulateBench (2023)
Evaluate the believability of agents built on API-based or open-source LLMs.

MathPile (2023)
A diverse and high-quality math-centric corpus comprising about 9.5 billion tokens.

Align on the fly (2023)
A real-time alignment framework that constrains LLMs' behavior without necessitating retraining, allowing for convenient updates and customization of human values.

Alignment for Honesty (2023)
Aim to ensure that LLMs proactively refuse to answer questions when they lack knowledge, while still not being overly conservative.

Auto-J (2023)
A new open-source generative judge that can effectively evaluate how well different LLMs align with human preferences.

FELM: Benchmarking Factuality Evaluation (2023)
A meta-benchmark for evaluating factuality evaluators of LLMs across various scenarios.

Generative AI for Math: Abel (2023)
Establish new state-of-the-art performance among open-source LLMs on GSM8K and MATH using only supervised fine-tuning (SFT).

Factuality in Generative AI (2023)
An innovative, tool-augmented framework designed to detect factual errors in texts generated by LLMs across various scenarios.

LIMA: Less is More for Alignment (2023)
Build a remarkably strong chat model with only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling.

reStructured Pre-training (2022)
A paradigm in which the role of data is re-emphasized, and model pre-training and fine-tuning on downstream tasks are viewed as a process of data storing and accessing.

Gaokao Benchmark for AI (2022)

Prompt Engineering (2021)
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods

DataLab (2022)
A Platform for Data Analysis and Intervention

ExplainaBoard (2021)
An Explainable Leaderboard for NLP

BARTScore (2021)
Formulate the evaluation of generated text as a text generation task using pre-trained language models.
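The scoring idea can be sketched in a few lines. This is a minimal, hypothetical illustration assuming the per-token log-probabilities of a candidate text given its source have already been obtained from a pre-trained seq2seq model such as BART; the function name and numbers are illustrative, not the paper's implementation:

```python
def bartscore(token_logprobs):
    # BARTScore-style score: the mean log-probability the model assigns to
    # the target tokens given the source text; higher (closer to 0) is better.
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token log-probs for two candidate summaries of one source:
fluent = [-0.2, -0.1, -0.3]
clunky = [-1.5, -2.0, -0.9]
print(bartscore(fluent) > bartscore(clunky))  # the fluent candidate scores higher
```

Because the score is reference-free in the faithfulness direction (source → candidate), no gold reference is required to compare candidates.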

Review Advisor (2021)
Can we automate scientific reviewing? Proposes to use NLP models to generate first-pass peer reviews for scientific papers.

SpanNER (2021)
Span prediction can simultaneously serve as a system combiner to re-recognize named entities from different systems' outputs.

SimCLS (2021)
Formulate text generation as a reference-free evaluation problem (i.e., quality estimation) assisted by contrastive learning.
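The contrastive component can be sketched as a pairwise margin ranking objective over candidates sorted by quality. This is a simplified, hypothetical illustration of that idea; the function name, margin value, and scores are illustrative rather than the paper's exact implementation:

```python
def contrastive_ranking_loss(scores, margin=0.01):
    # Candidates are assumed sorted best-first; each better candidate's score
    # should exceed each worse one's by a margin that grows with the rank gap.
    loss = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            loss += max(0.0, scores[j] - scores[i] + margin * (j - i))
    return loss

# A correctly ordered scoring incurs no loss; a violated ordering is penalized:
print(contrastive_ranking_loss([0.9, 0.5, 0.1]))  # 0.0
print(contrastive_ranking_loss([0.1, 0.9, 0.5]))  # positive loss
```

Training with such a loss teaches the scorer to rank candidates by quality without comparing them to a reference at inference time.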

Interpretable Evaluation (2020)
Interpretable Evaluation for (Almost) All NLP Tasks.