Details
| Original language | English |
|---|---|
| Pages (from - to) | 13193-13201 |
| Number of pages | 9 |
| Journal | IEEE ACCESS |
| Volume | 14 |
| Early online date | 31 Dec 2025 |
| Publication status | Published - 27 Jan 2026 |
Abstract
Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. However, their opaque nature and lack of explainability in predictions make them less trustworthy in critical domains. Providing VLMs with reasonable rationales for object recognition typically comes at the expense of classification accuracy. To tackle this issue, we present ECOR: Explainable CLIP for Object Recognition. First, we propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluation on different datasets, ECOR demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. We make our code publicly available on GitHub.
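The joint distribution over categories and rationales mentioned in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the prompt template, the encoders $f$ and $g$, and the temperature $\tau$ are assumptions standing in for whatever ECOR actually uses.

```latex
% Hedged sketch: scoring a category c and rationale r jointly with a
% CLIP-style model. prompt(c, r) (e.g. "a photo of a {c} because {r}"),
% the encoders f (image) and g (text), the cosine similarity sim, and
% the temperature \tau are illustrative assumptions.
\[
  p(c, r \mid x) =
  \frac{\exp\!\big(\operatorname{sim}(f(x),\, g(\operatorname{prompt}(c, r))) / \tau\big)}
       {\sum_{c'} \sum_{r'} \exp\!\big(\operatorname{sim}(f(x),\, g(\operatorname{prompt}(c', r'))) / \tau\big)},
  \qquad
  (\hat{c}, \hat{r}) = \argmax_{c,\, r}\; p(c, r \mid x)
\]
```

Predicting the argmax of the joint rather than of the category marginal alone is what forces the model to commit to a rationale alongside each classification.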
ASJC Scopus subject areas
- Computer Science (all)
- General Computer Science
- Materials Science (all)
- General Materials Science
- Engineering (all)
- General Engineering
Cite
- Standard
- Harvard
- Apa
- Vancouver
- BibTex
- RIS
In: IEEE ACCESS, Vol. 14, 27.01.2026, p. 13193-13201.
Publication: Contribution to journal › Article › Research › Peer review
TY - JOUR
T1 - Bridging Explainability and Accuracy in CLIP for Object Recognition via Joint Probability Over Rationales and Categories
AU - Rasekh, Ali
AU - Kazemi Ranjbar, Sepehr
AU - Heidari Sorkhehdizaj, Milad
AU - Gottschalk, Simon
AU - Nejdl, Wolfgang
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2026/1/27
Y1 - 2026/1/27
N2 - Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. However, their opaque nature and lack of explainability in predictions make them less trustworthy in critical domains. Providing VLMs with reasonable rationales for object recognition typically comes at the expense of classification accuracy. To tackle this issue, we present ECOR: Explainable CLIP for Object Recognition. First, we propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluation on different datasets, ECOR demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. We make our code publicly available on GitHub.
AB - Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. However, their opaque nature and lack of explainability in predictions make them less trustworthy in critical domains. Providing VLMs with reasonable rationales for object recognition typically comes at the expense of classification accuracy. To tackle this issue, we present ECOR: Explainable CLIP for Object Recognition. First, we propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluation on different datasets, ECOR demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. We make our code publicly available on GitHub.
KW - Explainability
KW - multi-modal learning
KW - object recognition
KW - text-image alignment
KW - vision-language models
UR - http://www.scopus.com/inward/record.url?scp=105029232483&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2025.3649809
DO - 10.1109/ACCESS.2025.3649809
M3 - Article
AN - SCOPUS:105029232483
VL - 14
SP - 13193
EP - 13201
JO - IEEE ACCESS
JF - IEEE ACCESS
SN - 2169-3536
ER -