Description
Recent advances in machine learning, particularly in multimodal models, have created new opportunities for analyzing complex data in high-energy physics, where accurate identification of particle interactions is critical for scientific discovery. However, existing approaches rely heavily on convolutional neural networks, which lack interpretability and do not fully leverage multimodal reasoning capabilities. Here we show that a fine-tuned Vision Language Model (VLM) based on LLaMA 3.2 can effectively identify neutrino interactions in pixelated detector data, outperforming both a state-of-the-art convolutional neural network and a Vision Transformer baseline in classification accuracy and robustness. In addition, the VLM provides improved explainability through reasoning-based, interpretable predictions and supports integration of auxiliary semantic information. These results demonstrate the potential of multimodal transformer architectures as general-purpose tools for physics event classification, paving the way for more transparent, flexible, and scalable analysis methods in future high-energy physics experiments.