About me
Hi! I’m Xueguang Ma (马雪光). I am a fourth-year PhD student in the David R. Cheriton School of Computer Science at the University of Waterloo, Canada. I am advised by Prof. Jimmy Lin. My research interests include Information Retrieval (IR) and Natural Language Processing (NLP).
I received my Bachelor of Mathematics from the University of Waterloo in 2021.
I like snowboarding and badminton.
Research Interests
My overarching research goal is to make it easy for people and intelligent systems to access, understand, and interact with world information.
To support this vision, my research explores:
- Neural Retrieval and LLM Retrieval: Developing precise and generalizable retrieval systems. UniCOIL, RankLlama, HyDE, LRL, PromptReps, DRAMA, Rank-R1
- Multimodal Retrieval Augmented Generation: Bridging text and visual modalities to access information beyond text. DSE, VISA, OmniEmbed
- LLM Reasoning: Assessing and optimizing reasoning capabilities of large language models across domains and tasks. MMLU-Pro, General-Reasoner
- Open-source Toolkits: Building open-source toolkits for research on information retrieval. Tevatron, Pyserini, Anserini
Publications
* indicates equal contribution.
- General-Reasoner: Advancing LLM Reasoning Across All Domains
- VISA: Retrieval Augmented Generation with Visual Source Attribution
- DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers
- Document Screenshot Retrievers are Vulnerable to Pixel Poisoning Attacks
- Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality
- PixelWorld: Towards Perceiving Everything as Pixels
- Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning
- ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
- MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed
- Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
- Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard. ArXiv 2024.
- Unifying Multimodal Retrieval via Document Screenshot Embedding
- LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
- PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval
- Fine-Tuning LLaMA for Multi-Stage Text Retrieval
- Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models
- Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering
- Enhancing Sparse Retrieval via Unsupervised Learning. SIGIR-AP 2023.
- SLIM: Sparsified late interaction for multi-vector retrieval with inverted indexes
- TheoremQA: A theorem-driven question answering dataset
- Toward best practices for training multilingual dense retrieval models. TOIS 2023.
- Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes
- Zero-Shot Listwise Document Reranking with a Large Language Model. ArXiv 2023.
- Few-shot In-context Learning for Knowledge Base Question Answering
- Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
- Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval
- Precise Zero-Shot Dense Retrieval without Relevance Labels
- Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback: A Reproducibility Study. ECIR 2022.
- To interpolate or not to interpolate: PRF, dense and sparse retrievers
- An Encoder Attribution Analysis for Dense Passage Retriever in Open-Domain Question Answering. TrustNLP Workshop 2022.
- A Replication Study of Dense Passage Retriever
- Document Expansions and Learned Sparse Lexical Representations for MS MARCO V1 and V2
- Vera: Prediction techniques for reducing harmful misinformation in consumer health search. SIGIR 2021.
- Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations
- On the Separation of Logical and Physical Ranking Models for Text Retrieval Applications. DESIRES 2021.
- Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
- Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval
- Sparsifying Sparse Representations for Passage Retrieval by Top-k Masking. ArXiv 2021.
- Scientific Claim Verification with VERT5ERINI. LOUHI Workshop 2021.
Service
I actively serve as a reviewer for ACL, NAACL, EMNLP, SIGIR, and NeurIPS.
I have served as a teaching assistant for the following courses at the University of Waterloo:
- CS 116 Introduction to Computer Science (Fall 2022)
- CS 246 Object-Oriented Software Development (Spring 2021)
- CS 346 Application Development (Fall 2022, Winter 2023)
- CS 350 Operating Systems (Winter 2024, Spring 2024)
- CS 370 Numerical Computation (Fall 2021)
- CS 371 Introduction to Computational Mathematics (Winter 2022)