Tianjian Li

Center for Language and Speech Processing, Johns Hopkins University


Hi👋, I’m Tianjian! I’m an incoming PhD student at Johns Hopkins University, where I will be working with Prof. Daniel Khashabi. I also did my Master’s degree in Computer Science at JHU. I am grateful to have the opportunity to work closely with Kenton Murray and Philipp Koehn on multilingual language models and machine translation during my Master’s.

My research interests lie at the intersection of machine learning and natural language processing, with a particular focus on addressing the question: how can we better leverage our vast amount of data beyond simply feeding it into our models during training? To this end, I am currently working on measuring various properties of data (e.g., quality and utility), optimizing data mixtures, and curating data for training and aligning our language models.

I prefer solutions that are simple, generalizable, and theoretically sound.

If you have anything to share with me, please feel free to contact me through my email: tli104 at jhu.edu


Apr 7, 2024 I will be staying at Johns Hopkins University for my PhD, working with Prof. Daniel Khashabi!
Jan 15, 2024 Error Norm Truncation has been accepted to ICLR 2024 (spotlight)!!
Nov 8, 2023 New blog post on the latest advances in balanced training for Multilingual Machine Translation!
Oct 2, 2023 New preprint on truncating noisy data for training text generation models!!
Jul 31, 2023 New blog post on estimating data utility.

selected publications

  1. preprint
    Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data
    Jingyu Zhang, Marc Marone, Tianjian Li, and 2 more authors
  2. ICLR
    Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models
    Tianjian Li, Haoran Xu, Philipp Koehn, and 2 more authors
    In The Twelfth International Conference on Learning Representations, 2024 (Spotlight - Top 5%)
  3. ACL
    Why Does Zero-shot Cross-lingual Generation Fail? An Explanation and A Solution
    Tianjian Li and Kenton Murray
    In Proceedings of the 2023 Annual Meeting of the Association for Computational Linguistics (Findings), Jul 2023