About me

Hi! I’m Xueguang Ma (马雪光). I am a fourth-year PhD student in the David R. Cheriton School of Computer Science at the University of Waterloo, Canada. I am advised by Prof. Jimmy Lin. My research interests include Information Retrieval (IR) and Natural Language Processing (NLP).

I received my Bachelor of Mathematics from the University of Waterloo in 2021.

I like snowboarding and badminton.

Research Interests

My overarching research goal is to make it easy for people and intelligent systems to access, understand, and interact with world information.

To support this vision, my research explores:

Publications

* indicates equal contribution.

  • General-Reasoner: Advancing LLM Reasoning Across All Domains
    Xueguang Ma*, Qian Liu*, Dongfu Jiang, Ge Zhang, Zejun Ma, Wenhu Chen.
    ArXiv 2025. Paper | Code
  • VISA: Retrieval Augmented Generation with Visual Source Attribution
    Xueguang Ma*, Shengyao Zhuang*, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin.
    ACL 2025. Paper | Code
  • DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers
    Xueguang Ma*, Xi Victoria Lin, Barlas Oguz, Jimmy Lin, Wen-tau Yih, Xilun Chen*.
    ACL 2025. Paper | Code
  • Document Screenshot Retrievers are Vulnerable to Pixel Poisoning Attacks
    Shengyao Zhuang, Ekaterina Khramtsova, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon.
    SIGIR 2025. Paper | Code
  • Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality
    Xueguang Ma*, Luyu Gao*, Shengyao Zhuang*, Jiaqi Samantha Zhan, Jamie Callan, Jimmy Lin.
    SIGIR 2025. Paper | Code
  • PixelWorld: Towards Perceiving Everything as Pixels
    Zhiheng Lyu, Xueguang Ma, Wenhu Chen.
    ArXiv 2025. Paper | Code
  • Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning
    Shengyao Zhuang*, Xueguang Ma*, Bevan Koopman, Jimmy Lin, Guido Zuccon.
    ArXiv 2025. Paper | Code
  • ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
    Yubo Wang, Xueguang Ma, Ping Nie, Huaye Zeng, Zhiheng Lyu, Yuxuan Zhang, Benjamin Schneider, Yi Lu, Xiang Yue, Wenhu Chen.
    ArXiv 2025. Paper | Code
  • MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed
    Jiaqi Samantha Zhan*, Crystina Zhang*, Shengyao Zhuang*, Xueguang Ma*, Jimmy Lin.
    ArXiv 2025. Paper | Code
  • Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
    Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin.
    ArXiv 2025. Paper | Code
  • Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard
    Ehsan Kamalloo, Nandan Thakur, Carlos Lassance, Xueguang Ma, Jheng-Hong Yang, Jimmy Lin.
    ArXiv 2024. Paper
  • Unifying Multimodal Retrieval via Document Screenshot Embedding
    Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin.
    EMNLP 2024. Paper | Code
  • LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
    Ziyan Jiang, Xueguang Ma, Wenhu Chen.
    ArXiv 2024. Paper | Code
  • MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen.
    NeurIPS 2024. Paper | Code
  • PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval
    Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon.
    EMNLP 2024. Paper | Code
  • Fine-Tuning LLaMA for Multi-Stage Text Retrieval
    Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, Jimmy Lin.
    SIGIR 2024. Paper | Code
  • Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models
    Raphael Tang, Xinyu Zhang, Xueguang Ma, Jimmy Lin, Ferhan Ture.
    NAACL 2024. Paper | Code
  • Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering
    Yubo Wang, Xueguang Ma, Wenhu Chen.
    EMNLP 2024 Findings. Paper | Code
  • Enhancing Sparse Retrieval via Unsupervised Learning
    Xueguang Ma, Hengxin Fun, Xusen Yin, Antonio Mallia, Jimmy Lin.
    SIGIR-AP 2023. Paper
  • SLIM: Sparsified late interaction for multi-vector retrieval with inverted indexes
    Minghan Li, Sheng-Chieh Lin, Xueguang Ma, Jimmy Lin.
    SIGIR 2023. Paper | Code
  • TheoremQA: A theorem-driven question answering dataset
    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, Tony Xia.
    EMNLP 2023. Paper | Code
  • Toward best practices for training multilingual dense retrieval models
    Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, Jimmy Lin.
    TOIS 2023. Paper
  • Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes
    Xueguang Ma, Tommaso Teofili, Jimmy Lin.
    CIKM 2023. Paper | Code
  • Zero-Shot Listwise Document Reranking with a Large Language Model
    Xueguang Ma, Xinyu Zhang, Ronak Pradeep, Jimmy Lin.
    ArXiv 2023. Paper
  • Few-shot In-context Learning for Knowledge Base Question Answering
    Tianle Li, Xueguang Ma, Alex Zhuang, Yu Gu, Yu Su, Wenhu Chen.
    ACL 2023. Paper | Code
  • Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
    Wenhu Chen*, Xueguang Ma*, Xinyi Wang, William W. Cohen.
    TMLR 2023. Paper | Code
  • Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval
    Luyu Gao*, Xueguang Ma*, Jimmy Lin, Jamie Callan.
    SIGIR 2023. Paper | Code
  • Precise Zero-Shot Dense Retrieval without Relevance Labels
    Luyu Gao*, Xueguang Ma*, Jimmy Lin, Jamie Callan.
    ACL 2023. Paper | Code
  • Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback: A Reproducibility Study
    Hang Li, Shengyao Zhuang, Ahmed Mourad, Xueguang Ma, Jimmy Lin, Guido Zuccon.
    ECIR 2022. Paper
  • To interpolate or not to interpolate: PRF, dense and sparse retrievers
    Hang Li, Shuai Wang, Shengyao Zhuang, Ahmed Mourad, Xueguang Ma, Jimmy Lin, Guido Zuccon.
    SIGIR 2022. Paper | Code
  • An Encoder Attribution Analysis for Dense Passage Retriever in Open-Domain Question Answering
    Minghan Li, Xueguang Ma, Jimmy Lin.
    TrustNLP Workshop 2022. Paper
  • A Replication Study of Dense Passage Retriever
    Xueguang Ma, Kai Sun, Ronak Pradeep, Minghan Li, Jimmy Lin.
    ECIR 2022. Paper | Code
  • Document Expansions and Learned Sparse Lexical Representations for MS MARCO V1 and V2
    Xueguang Ma, Ronak Pradeep, Rodrigo Nogueira, Jimmy Lin.
    SIGIR 2022. Paper | Code
  • Vera: Prediction techniques for reducing harmful misinformation in consumer health search
    Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, Jimmy Lin.
    SIGIR 2021. Paper
  • Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations
    Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, Rodrigo Nogueira.
    SIGIR 2021. Paper | Code
  • On the Separation of Logical and Physical Ranking Models for Text Retrieval Applications
    Jimmy Lin, Xueguang Ma, Joel Mackenzie, Antonio Mallia.
    DESIRES 2021. Paper
  • Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
    Xinyu Zhang, Xueguang Ma, Peng Shi, Jimmy Lin.
    MRL Workshop 2021. Paper | Code
  • Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval
    Xueguang Ma*, Minghan Li*, Kai Sun, Ji Xin, Jimmy Lin.
    EMNLP 2021. Paper | Code
  • Sparsifying Sparse Representations for Passage Retrieval by Top-k Masking
    Jheng-Hong Yang, Xueguang Ma, Jimmy Lin.
    ArXiv 2021. Paper
  • Scientific Claim Verification with VERT5ERINI
    Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, Jimmy Lin.
    LOUHI Workshop 2021. Paper

Service

I actively serve as a reviewer for ACL, NAACL, EMNLP, SIGIR, and NeurIPS.

I have served as a Teaching Assistant for the following courses at the University of Waterloo:

CS 116 Introduction to Computer Science (Fall 2022)
CS 246 Object-Oriented Software Development (Spring 2021)
CS 346 Application Development (Fall 2022, Winter 2023)
CS 350 Operating Systems (Winter 2024, Spring 2024)
CS 370 Numerical Computation (Fall 2021)
CS 371 Introduction to Computational Mathematics (Winter 2022)