A curated collection of research papers I find interesting, have read, or plan to read.
Attention is All You Need
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Proximal Policy Optimization Algorithms
Deep reinforcement learning from human preferences
Deep contextualized word representations
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Improving Language Understanding by Generative Pre-Training
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Universal Language Model Fine-tuning for Text Classification
Language Models are Unsupervised Multitask Learners
RoBERTa: A Robustly Optimized BERT Pretraining Approach
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Generating Long Sequences with Sparse Transformers
Reformer: The Efficient Transformer
Longformer: The Long-Document Transformer
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Big Bird: Transformers for Longer Sequences
Language Models are Few-Shot Learners
Rethinking Attention with Performers
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Measuring Massive Multitask Language Understanding
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Scaling Laws for Neural Language Models
RoFormer: Enhanced Transformer with Rotary Position Embedding
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Transcending Scaling Laws with 0.1% Extra Compute
Improving language models by retrieving from trillions of tokens
LoRA: Low-Rank Adaptation of Large Language Models
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Efficiently Scaling Transformer Inference
Fast Inference from Transformers via Speculative Decoding
Training Compute-Optimal Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Training language models to follow instructions with human feedback
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Emergent Abilities of Large Language Models
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
Red Teaming Language Models with Language Models
Holistic Evaluation of Language Models
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Voyager: An Open-Ended Embodied Agent with Large Language Models
Universal and Transferable Adversarial Attacks on Aligned Language Models
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
WizardLM: Empowering Large Language Models to Follow Complex Instructions
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
GPT-4 Technical Report
Mistral 7B
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
PaLM 2 Technical Report
LIMA: Less Is More for Alignment
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Visual Instruction Tuning
Textbooks Are All You Need II: phi-1.5 technical report
Qwen Technical Report
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
LLaMA: Open and Efficient Foundation Language Models
Llama 2: Open Foundation and Fine-Tuned Chat Models
Efficient Memory Management for Large Language Model Serving with PagedAttention
QLoRA: Efficient Finetuning of Quantized LLMs
Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Generative Agents: Interactive Simulacra of Human Behavior