
LILaC:
Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval

POSTECH (CSE · GSAI), DirectorLabs
EMNLP 2025 (Main)

Abstract

Multimodal document retrieval must surface the right paragraphs, tables, and images across documents while respecting both local detail and cross-document links. Fixed, single-granularity retrieval units often mix relevant and distracting content, and most systems struggle to reason over multihop connections across modalities.

LILaC introduces a layered component graph with coarse- and fine-grained nodes and navigational edges, paired with late-interaction-based subgraph retrieval that decomposes queries and scores edges on the fly. The approach attains state-of-the-art retrieval on five benchmarks (MP-DocVQA, SlideVQA, InfoVQA, MultimodalQA, MMCoQA) using only pretrained models, with no additional fine-tuning, and is released at github.com/joohyung00/lilac.

Highlights

Layered component graph

Documents are encoded as coarse (paragraph/table/image) nodes linked to fine-grained subcomponents (sentences/rows/objects), preserving hierarchical structure and navigational links for multihop reasoning.
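
A minimal sketch of how such a two-layer graph could be represented in Python follows; the class and field names (CoarseNode, FineNode, ComponentGraph) are illustrative stand-ins, not the released implementation.

from dataclasses import dataclass, field

@dataclass
class FineNode:
    """Fine-grained subcomponent: a sentence, table row, or image object."""
    node_id: str
    content: str                      # raw text or an object/region description
    embedding: list[float] = field(default_factory=list)

@dataclass
class CoarseNode:
    """Coarse component: a paragraph, table, or image."""
    node_id: str
    modality: str                     # "paragraph" | "table" | "image"
    children: list[FineNode] = field(default_factory=list)  # hierarchical edges

@dataclass
class ComponentGraph:
    nodes: dict[str, CoarseNode] = field(default_factory=dict)
    # Navigational edges between coarse components (e.g., hyperlinks or
    # caption-to-figure references) that enable multihop traversal.
    nav_edges: set[tuple[str, str]] = field(default_factory=set)

    def add_edge(self, src: str, dst: str) -> None:
        self.nav_edges.add((src, dst))

    def neighbors(self, node_id: str) -> list[str]:
        return [dst for (src, dst) in self.nav_edges if src == node_id]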

Late-interaction traversal

An LLM decomposes the query into modality-specific subqueries; beam search then scores edges on the fly via late interaction between the subqueries and fine-grained evidence, avoiding full edge embedding.
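
As a rough illustration, the sketch below runs such a beam search over the ComponentGraph sketched above; llm_decompose, score_edge, and the beam/hop limits are hypothetical placeholders rather than the actual interfaces of the LILaC codebase.

import heapq

def traverse(graph, query, seed_nodes, llm_decompose, score_edge,
             beam_width=4, max_hops=3):
    """Expand a query-relevant subgraph hop by hop, keeping the best beams."""
    subqueries = llm_decompose(query)            # modality-specific subqueries
    beams = [(0.0, [seed]) for seed in seed_nodes]
    for _ in range(max_hops):
        candidates = []
        for score, path in beams:
            frontier = path[-1]
            for nbr in graph.neighbors(frontier):
                # Edge scored on the fly via late interaction between the
                # subqueries and the fine-grained evidence on this edge.
                edge_score = score_edge(subqueries, frontier, nbr)
                candidates.append((score + edge_score, path + [nbr]))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams                                 # highest-scoring paths found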

State-of-the-art results

With pretrained embedders (MM-Embed, UniME, mmE5), LILaC achieves new state-of-the-art retrieval and end-to-end QA performance across five benchmarks, surpassing VisRAG/ColPali without task-specific fine-tuning.

Motivation

Challenges of TextRAG and VisRAG in multimodal retrieval

Challenges of existing TextRAG and VisRAG pipelines: (a) text-only summaries can drop crucial visual cues; (b) coarse screenshot granularity dilutes relevant content; (c) missing structural links limit multihop reasoning.

Method Overview

Overview of the LILaC layered component graph and retrieval pipeline

LILaC builds a two-layer component graph (coarse components and fine-grained subcomponents), encodes navigational and hierarchical edges, then performs late-interaction-guided traversal to retrieve a query-relevant subgraph.

Late Interaction

Edge-level late interaction scoring in LILaC

Edge-level scoring matches each decomposed subquery to the most relevant fine-grained subcomponent incident to an edge, summing evidence to guide traversal without precomputing all edge embeddings.
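
A back-of-the-envelope version of such a score is sketched below, assuming L2-normalized subquery and subcomponent embeddings from an off-the-shelf embedder; the function name and shapes are illustrative, not the paper's exact formulation.

import numpy as np

def edge_score(subquery_embs: np.ndarray, fine_embs: np.ndarray) -> float:
    """subquery_embs: (Q, d) decomposed-subquery vectors.
    fine_embs: (F, d) fine-grained subcomponents incident to the edge."""
    sims = subquery_embs @ fine_embs.T     # (Q, F) cosine similarities
    return float(sims.max(axis=1).sum())   # best match per subquery, summed

Plugged into the traversal sketch above as score_edge, this lets the beam search rank candidate edges without precomputing or storing any edge embeddings.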

Results

  • Benchmarks: MP-DocVQA, SlideVQA, InfoVQA (VisRAG-style document images) plus reconstructed MultimodalQA and MMCoQA webpages; evaluated with R@3 and MRR@10 for retrieval, EM/F1 for end-to-end QA.
  • Retrieval gains: LILaC (MM-Embed) sets the best R@3/MRR@10 across all five datasets—e.g., 69.07 / 75.28 on MultimodalQA (+10.34 R@3 over VisRAG-Ret) and 55.80 / 50.77 on MMCoQA (+28.17 R@3).
  • QA accuracy: Paired with Qwen2.5-VL 7B, LILaC delivers the highest F1 on the multihop benchmarks (51.97 on MultimodalQA, 43.22 on MMCoQA) while staying competitive on the VisRAG-style datasets.
  • No fine-tuning required: Uses pretrained multimodal embedders (MM-Embed/UniME/mmE5) and late interaction scoring to avoid storing edge embeddings while improving both precision and efficiency.

BibTeX

@inproceedings{yun2025lilac,
  title={LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval},
  author={Yun, Joohyung and Lee, Doyup and Han, Wook-Shin},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={20551--20570},
  year={2025}
}