We have been digging into how multimodal LLMs process visual information. Check out our workshop paper at the CVPR 2025 Mechanistic Interpretability for Vision workshop and our paper at CVPR 2025 XAI4CV.