Multimodal document retrieval must surface the right paragraphs, tables, and images across documents while respecting both local detail and cross-document links. Fixed, single-granularity retrieval units often mix relevant and distracting content, and most systems struggle to reason over multi-hop connections across modalities.
LILaC introduces a layered component graph with coarse- and fine-grained nodes connected by navigational edges, paired with late-interaction-based subgraph retrieval that decomposes queries and scores edges on the fly. The approach attains state-of-the-art retrieval on five benchmarks (MP-DocVQA, SlideVQA, InfoVQA, MultimodalQA, MMCoQA) using only pretrained models, with no additional fine-tuning, and is released at github.com/joohyung00/lilac.
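To make the mechanism concrete, the sketch below illustrates what late-interaction scoring over a layered component graph could look like. This is a minimal Python sketch under assumptions, not the released implementation: the `Component` structure, the top-3 expansion budget, the two-hop limit, and the use of the standard ColBERT-style MaxSim operator are all assumptions layered on the abstract's description.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Component:
    """A node in the layered graph: a paragraph, table, or image region.
    (Hypothetical structure; the paper's actual schema may differ.)"""
    node_id: str
    granularity: str               # "coarse" (e.g. a page) or "fine" (e.g. a cell)
    token_embs: np.ndarray         # (num_tokens, dim) token-level embeddings
    neighbors: list = field(default_factory=list)  # navigational edges

def maxsim(query_embs: np.ndarray, comp_embs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take its best
    cosine match among the component's tokens, then sum over query tokens."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = comp_embs / np.linalg.norm(comp_embs, axis=1, keepdims=True)
    return float((q @ c.T).max(axis=1).sum())

def retrieve_subgraph(sub_queries, seeds, hops=2):
    """Greedily grow a subgraph: score each frontier node against every
    decomposed sub-query, keep the best nodes, and follow their
    navigational edges to form the next frontier."""
    selected, frontier = [], list(seeds)
    for _ in range(hops):
        scored = [(max(maxsim(q, n.token_embs) for q in sub_queries), n)
                  for n in frontier]
        scored.sort(key=lambda t: t[0], reverse=True)
        best = [n for _, n in scored[:3]]   # per-hop budget of 3 is assumed
        selected.extend(best)
        # Edges are scored on the fly: the next frontier is simply the
        # neighbors of the nodes just retained, so no edge scores are
        # precomputed or stored.
        frontier = [nb for n in best for nb in n.neighbors]
        if not frontier:
            break
    return selected
```

Under this reading, multi-hop reasoning falls out of the expansion loop: a fine-grained node reached in hop two can answer a sub-query that no single coarse node covers, which is the failure mode the abstract attributes to fixed, single-granularity units.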