Bridging Explainability and Accuracy in CLIP for Object Recognition via Joint Probability Over Rationales and Categories

Publication: Contribution to journal › Article › Research › Peer-reviewed


Details

Original language: English
Pages (from–to): 13193-13201
Number of pages: 9
Journal: IEEE ACCESS
Volume: 14
Early online date: 31 Dec 2025
Publication status: Published - 27 Jan 2026

Abstract

Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. However, their opaque nature and lack of explainability in predictions make them less trustworthy in critical domains. Providing VLMs with reasonable rationales for object recognition typically comes at the expense of classification accuracy. To tackle this issue, we present ECOR: Explainable CLIP for Object Recognition. First, we propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluation on different datasets, ECOR demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. We make our code available online on GitHub.
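The joint-probability idea sketched in the abstract can be illustrated with a small toy example. This is an illustrative reading of the formulation, not the paper's exact method; the function name, prompt template, and random stand-in embeddings below are all hypothetical. The sketch scores one CLIP-style text prompt per (category, rationale) pair against an image embedding, applies a softmax over all pairs to obtain a joint distribution p(category, rationale), and marginalizes over rationales to recover category probabilities.

```python
import numpy as np

def joint_category_rationale_probs(image_emb, pair_embs, temperature=0.07):
    """CLIP-style scoring sketch: cosine similarity between an image
    embedding and one text embedding per (category, rationale) prompt,
    softmax over ALL pairs -> joint distribution over pairs."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = pair_embs / np.linalg.norm(pair_embs, axis=-1, keepdims=True)
    logits = txt @ img / temperature           # one logit per (c, r) pair
    joint = np.exp(logits - logits.max())      # stable softmax numerator
    return joint / joint.sum()                 # shape: (n_categories, n_rationales)

rng = np.random.default_rng(0)
categories = ["cat", "dog"]
rationales = ["has whiskers", "has a wagging tail"]
# Toy stand-ins for CLIP embeddings of the image and of prompts such as
# "a photo of a {category} because it {rationale}".
image_emb = rng.normal(size=512)
pair_embs = rng.normal(size=(len(categories), len(rationales), 512))

joint = joint_category_rationale_probs(image_emb, pair_embs)
p_category = joint.sum(axis=1)   # marginalize over rationales -> p(category)
```

Under this reading, the model is rewarded only when it ranks the correct category together with a plausible rationale, which is one way a joint objective can couple accuracy and explainability.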

ASJC Scopus subject areas

Cite this

Bridging Explainability and Accuracy in CLIP for Object Recognition via Joint Probability Over Rationales and Categories. / Rasekh, Ali; Kazemi Ranjbar, Sepehr; Heidari Sorkhehdizaj, Milad et al.
In: IEEE ACCESS, Vol. 14, 27.01.2026, p. 13193-13201.


Rasekh A, Kazemi Ranjbar S, Heidari Sorkhehdizaj M, Gottschalk S, Nejdl W. Bridging Explainability and Accuracy in CLIP for Object Recognition via Joint Probability Over Rationales and Categories. IEEE ACCESS. 2026 Jan 27;14:13193-13201. Epub 2025 Dec 31. doi: 10.1109/ACCESS.2025.3649809
Rasekh, Ali ; Kazemi Ranjbar, Sepehr ; Heidari Sorkhehdizaj, Milad et al. / Bridging Explainability and Accuracy in CLIP for Object Recognition via Joint Probability Over Rationales and Categories. In: IEEE ACCESS. 2026 ; Vol. 14, pp. 13193-13201.
@article{bab789dc577b4bf286b18dc87aa9af8d,
title = "Bridging Explainability and Accuracy in CLIP for Object Recognition via Joint Probability Over Rationales and Categories",
abstract = "Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. However, their opaque nature and lack of explainability in predictions make them less trustworthy in critical domains. Providing VLMs with reasonable rationales for object recognition typically comes at the expense of classification accuracy. To tackle this issue, we present ECOR: Explainable CLIP for Object Recognition. First, we propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluation on different datasets, ECOR demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. We make the code available online (Our code is available on github).",
keywords = "Explainability, multi-modal learning, object recognition, text-image alignment, vision-language models",
author = "Ali Rasekh and {Kazemi Ranjbar}, Sepehr and {Heidari Sorkhehdizaj}, Milad and Simon Gottschalk and Wolfgang Nejdl",
note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",
year = "2026",
month = jan,
day = "27",
doi = "10.1109/ACCESS.2025.3649809",
language = "English",
volume = "14",
pages = "13193--13201",
journal = "IEEE ACCESS",
issn = "2169-3536",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}


TY - JOUR

T1 - Bridging Explainability and Accuracy in CLIP for Object Recognition via Joint Probability Over Rationales and Categories

AU - Rasekh, Ali

AU - Kazemi Ranjbar, Sepehr

AU - Heidari Sorkhehdizaj, Milad

AU - Gottschalk, Simon

AU - Nejdl, Wolfgang

N1 - Publisher Copyright: © 2013 IEEE.

PY - 2026/1/27

Y1 - 2026/1/27

N2 - Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. However, their opaque nature and lack of explainability in predictions make them less trustworthy in critical domains. Providing VLMs with reasonable rationales for object recognition typically comes at the expense of classification accuracy. To tackle this issue, we present ECOR: Explainable CLIP for Object Recognition. First, we propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluation on different datasets, ECOR demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. We make the code available online (Our code is available on github).

AB - Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. However, their opaque nature and lack of explainability in predictions make them less trustworthy in critical domains. Providing VLMs with reasonable rationales for object recognition typically comes at the expense of classification accuracy. To tackle this issue, we present ECOR: Explainable CLIP for Object Recognition. First, we propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluation on different datasets, ECOR demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. We make the code available online (Our code is available on github).

KW - Explainability

KW - multi-modal learning

KW - object recognition

KW - text-image alignment

KW - vision-language models

UR - http://www.scopus.com/inward/record.url?scp=105029232483&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2025.3649809

DO - 10.1109/ACCESS.2025.3649809

M3 - Article

AN - SCOPUS:105029232483

VL - 14

SP - 13193

EP - 13201

JO - IEEE ACCESS

JF - IEEE ACCESS

SN - 2169-3536

ER -
