A curated collection of research papers that I find interesting, have read, or plan to read in the future.
Attention is All You Need
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Proximal Policy Optimization Algorithms
Deep reinforcement learning from human preferences
Deep contextualized word representations
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Improving Language Understanding by Generative Pre-Training
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Universal Language Model Fine-tuning for Text Classification
Language Models are Unsupervised Multitask Learners
RoBERTa: A Robustly Optimized BERT Pretraining Approach
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Generating Long Sequences with Sparse Transformers
Reformer: The Efficient Transformer
Longformer: The Long-Document Transformer
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Big Bird: Transformers for Longer Sequences
Language Models are Few-Shot Learners
Rethinking Attention with Performers
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Measuring Massive Multitask Language Understanding
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Scaling Laws for Neural Language Models
RoFormer: Enhanced Transformer with Rotary Position Embedding
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Transcending Scaling Laws with 0.1% Extra Compute
Improving language models by retrieving from trillions of tokens
LoRA: Low-Rank Adaptation of Large Language Models
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Efficiently Scaling Transformer Inference
Fast Inference from Transformers via Speculative Decoding
Training Compute-Optimal Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Training language models to follow instructions with human feedback
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Emergent Abilities of Large Language Models
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
Red Teaming Language Models with Language Models
Holistic Evaluation of Language Models
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Voyager: An Open-Ended Embodied Agent with Large Language Models
Universal and Transferable Adversarial Attacks on Aligned Language Models
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
WizardLM: Empowering Large Language Models to Follow Complex Instructions
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
GPT-4 Technical Report
Mistral 7B
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
PaLM 2 Technical Report
LIMA: Less Is More for Alignment
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Visual Instruction Tuning
Textbooks Are All You Need II: phi-1.5 technical report
Qwen Technical Report
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
LLaMA: Open and Efficient Foundation Language Models
Llama 2: Open Foundation and Fine-Tuned Chat Models
Efficient Memory Management for Large Language Model Serving with PagedAttention
QLoRA: Efficient Finetuning of Quantized LLMs
Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Generative Agents: Interactive Simulacra of Human Behavior