Cite the tutorial

If you use material from this tutorial, please cite:


@misc{alam_chowdhury_2026_mm_llms_wild,
  title        = {Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages},
  author       = {Alam, Firoj and Chowdhury, Shammur Absar},
  year         = {2026},
  howpublished = {\url{https://mm-llms-in-the-wild.github.io}},
  note         = {Tutorial materials}
}

Suggested reading

  • Foundations and multimodal models: BLIP-2 (Li et al., 2023); LLaVA (Liu et al., 2023); KOSMOS-1 (Huang et al., 2023); PaLM-E (Driess et al., 2023); PALO and Maya (multilingual vision–language models); SeamlessM4T and AudioPaLM (speech–text LLMs).
  • Adapters, PEFT, and efficient training: LoRA/QLoRA (Hu et al., 2022; Dettmers et al., 2023); adapter stacks for VLMs; mixture-of-experts for modality/language specialization (Shen et al., 2024); see the LoRA sketch after this list.
  • Data creation and curation: translation/back-translation pipelines; OCR/ASR bootstraps for low-resource multimodal data; safety and culture-aware filtering (Pfeiffer et al., 2022).
  • Evaluation and robustness: culture-aware benchmarks such as xGQA, MaRVL, HaVQA; dialectal stress tests; hallucination and grounding diagnostics for VLMs and speech→text→LLM cascades.
  • Applications and toolkits: open-source pipelines for multilingual VLM fine-tuning (e.g., LLaVA-Med/Next-GPT variants), speech front ends (Whisper, Seamless), and benchmarking toolkits for multilingual/multimodal tasks; a speech→text→LLM cascade sketch follows below.
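
To ground the PEFT pointers above, here is a minimal LoRA fine-tuning sketch, assuming the Hugging Face transformers and peft libraries; the base model checkpoint and target module names are illustrative placeholders, not recommendations from the tutorial.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder multilingual base model; swap in the checkpoint you actually use.
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# LoRA injects trainable low-rank matrices into selected projections while the
# base weights stay frozen; target_modules depends on the architecture.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank of the low-rank update
    lora_alpha=32,      # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM-style attention projection
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters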
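
And a minimal speech→text→LLM cascade sketch, assuming OpenAI's open-source whisper package; the audio file name and the downstream prompt are hypothetical.

import whisper

# Multilingual ASR front end; "small" trades accuracy for speed.
asr_model = whisper.load_model("small")
result = asr_model.transcribe("clip.wav")   # hypothetical audio file
transcript = result["text"]
language = result["language"]               # language code detected by Whisper

# The transcript (plus detected language) is then handed to a downstream LLM,
# e.g. in a prompt for translation, summarization, or question answering.
prompt = f"[{language}] {transcript}\n\nSummarize the above in English."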