Tianjian Li

Center for Language and Speech Processing, Johns Hopkins University


Hi👋, I’m Tianjian! I’m a PhD student in Computer Science at Johns Hopkins University, proudly advised by Prof. Daniel Khashabi. I am also a research intern at Meta FAIR, where I have the privilege of being advised by Tianlu Wang.

Previously, I completed my Master’s degree in Computer Science at JHU, where I worked with my wonderful advisors Kenton Murray and Philipp Koehn. Before that, I was an undergraduate at New York University.

My research lies at the intersection of machine learning and natural language processing.

I prefer solutions that are simple, generalizable, and theoretically sound.

If you have anything to share with me, please feel free to reach out by email: tli104 at jhu.edu

news

Mar 20, 2026 Our new work, Reasoning over mathematical objects: on-policy reward modeling and test time aggregation, is out! In this work we 1) build and release training data for deriving mathematical objects; 2) show that on-policy RL with a strong verifier boosts performance; and 3) show that on-policy training on parallel generation + verification boosts performance further.
Mar 1, 2026 I will be returning to Meta AI Research (FAIR) at NYC as a research intern in summer 2026!
Sep 4, 2025 Our new work: Jointly Reinforcing Diversity and Quality in Language Model Generations is out! In this work, we studied how to make language models generate diverse outputs without sacrificing quality using online reinforcement learning.
May 1, 2025 SimpleMix is accepted to ICML 2025! In this work, we studied the interplay between on- and off-policy data in preference optimization.
Jan 23, 2025 Three papers are accepted to NAACL 🎉, including my work on training on heavily imbalanced datasets, Jack’s work on making language models produce verbatim quotes from training data, and Yining’s work on evaluating the creativity of language models on code generation. I am super grateful to my wonderful co-authors!

selected publications

  1. Reasoning over mathematical objects: on-policy reward modeling and test time aggregation
    Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, and 18 more authors
    2026
  2. arXiv
    Jointly Reinforcing Diversity and Quality in Language Model Generations
    Tianjian Li, Yiming Zhang, Ping Yu, and 5 more authors
    2025
  3. ICML
    SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
    Tianjian Li, and Daniel Khashabi
    In ICML 2025, 2025
  4. NAACL
    Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets
    Tianjian Li, Haoran Xu, Weiting Tan, and 2 more authors
    In NAACL 2025, 2025
  5. NAACL
    Benchmarking Language Model Creativity: A Case Study on Code Generation
    Yining Lu, Dixuan Wang, Tianjian Li, and 2 more authors
    In NAACL 2025, 2025
  6. NAACL
    Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data
    Jingyu Zhang, Marc Marone, Tianjian Li, and 2 more authors
    In NAACL 2025, 2025
  7. ICLR
    Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models
    Tianjian Li, Haoran Xu, Philipp Koehn, and 2 more authors
    In ICLR 2024 (Spotlight, Top 5%), 2024
  8. ACL
    Why Does Zero-shot Cross-lingual Generation Fail? An Explanation and A Solution
    Tianjian Li, and Kenton Murray
    In ACL 2023 (Findings), 2023