Abstract
This paper evaluates the capabilities of large multimodal models in image analysis, focusing on the performance of Unum, LLaVA-7b, and LLaVA-13b. The analysis uses two datasets: 26,142 geolocated protest images and 1,000,000 random geolocated images from Twitter. The study assesses the zero-shot classification abilities of these models through both image captioning and direct image classification. Findings reveal that increased model size improves narrative quality in captions and the identification of true negatives, but does not significantly improve the detection of true positives. Despite its smaller scale, Unum competes closely with the larger models in identifying positive instances. The research highlights that while these multimodal models excel at generating detailed descriptions, their effectiveness in direct zero-shot image classification is limited. Classification based on the generated captions proves more accurate, suggesting a caption-based pipeline as a promising approach to improved image analysis.