Abstract
This paper evaluates the capabilities of large multimodal models in image analysis, focusing on the performance of Unum, LLaVA-7b, and LLaVA-13b. The analysis uses two datasets: 26,142 geolocated protest images and 1,000,000 random geolocated images from Twitter. The study assesses the zero-shot classification abilities of these models through both image captioning and direct image classification. Findings reveal that increased model size improves narrative quality in captions and the identification of true negatives, but does not significantly improve the detection of true positives. Despite its smaller scale, Unum competes closely with the larger models in identifying positive instances. The research highlights that while these multimodal models excel at generating detailed descriptions, their effectiveness in direct zero-shot image classification is limited. Classification based on the generated captions proves more accurate, suggesting a caption-based pipeline as a promising approach to improved image analysis.