Outline (durations are approximate)
- Introduction [15 min]: why tri-modality matters for low-resource and Global South contexts; evolution from vision–language to speech–text–vision systems.
- Model landscape [25 min]: multilingual VLMs and speech–text LLMs (e.g., BLIP-2, LLaVA, PaLM-E, SeamlessM4T, AudioPaLM, PALO, Maya) with low-resource takeaways.
- Data creation & multilingual resources [30 min]: low-cost pipelines (translation/back-translation, OCR/ASR bootstraps), safety and culture considerations, and multilingual evaluation sets (xGQA, MaRVL, HaVQA); a back-translation sketch follows the outline.
- Architectures & efficient training [35 min]: adapter stacks, PEFT/LoRA/QLoRA, quantization tips, and MoE routing for modality/language specialization (see the LoRA sketch below).
- Speech multimodality in practice [20 min]: wiring speech→text→LLM pipelines; cascaded vs. unified speech–text systems and their deployment trade-offs (see the cascaded-pipeline sketch below).
- Evaluation & error analysis [30 min]: culture-aware benchmarks, dialect stress tests, hallucination/grounding checks, and robustness to noise/occlusion.
- Resources, demos & wrap-up [25 min]: a quick LoRA fine-tune of a compact multilingual VLM and speech front-end integration demos; pointers to code and datasets.
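As a preview of the data-creation segment, here is a minimal back-translation sketch, assuming Hugging Face transformers and the facebook/nllb-200-distilled-600M checkpoint; the language codes, example question, and naive overlap filter are illustrative placeholders, not the tutorial's lab code.

```python
# Back-translation bootstrap (sketch): translate English VQA-style questions
# into a target language, then round-trip them back for a rough quality filter.
# Model name, language codes, and the overlap threshold are placeholders.
from transformers import pipeline

mt_model = "facebook/nllb-200-distilled-600M"
en_to_sw = pipeline("translation", model=mt_model,
                    src_lang="eng_Latn", tgt_lang="swh_Latn")
sw_to_en = pipeline("translation", model=mt_model,
                    src_lang="swh_Latn", tgt_lang="eng_Latn")

questions = ["What is the person in the image holding?"]
for q in questions:
    target = en_to_sw(q)[0]["translation_text"]            # forward translation
    round_trip = sw_to_en(target)[0]["translation_text"]   # back-translation
    # Naive filter: keep the pair only if the round trip overlaps the source.
    src_tokens = set(q.lower().split())
    rt_tokens = set(round_trip.lower().split())
    keep = len(src_tokens & rt_tokens) / max(len(src_tokens), 1) >= 0.5
    print(f"{q} -> {target} | keep={keep}")
```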
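For the efficient-training segment, a minimal LoRA sketch with Hugging Face PEFT follows; the base checkpoint, rank, and target modules are assumptions for illustration rather than the lab's exact configuration.

```python
# LoRA adapter setup (sketch): wrap a small causal LM so only low-rank adapter
# weights train. Base model and hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "Qwen/Qwen2-0.5B"  # placeholder: any compact multilingual LM

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
# QLoRA variant: load the base model in 4-bit instead, e.g. by passing a
# transformers.BitsAndBytesConfig(load_in_4bit=True) as quantization_config.

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # adapter rank
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # adapters are a small fraction of total params
```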
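And for the speech segment, a minimal cascaded speech→text→LLM sketch built from off-the-shelf pipelines; the Whisper and instruction-tuned LM checkpoints and the audio path are placeholder choices.

```python
# Cascaded speech -> text -> LLM (sketch): an ASR front end feeds its transcript
# to a text-only LLM. Checkpoints and the audio file path are placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
llm = pipeline("text-generation", model="Qwen/Qwen2-0.5B-Instruct")

transcript = asr("clip.wav")["text"]                 # speech -> text
prompt = f"Answer the spoken question briefly: {transcript}"
reply = llm(prompt, max_new_tokens=64)[0]["generated_text"]
print(reply)
```

The cascade is simple to deploy but discards prosody and speaker cues at the transcript boundary, one of the trade-offs contrasted with unified speech–text models in the session.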
Slides and lab notebooks will be posted as they are finalized.