Learning Action Embeddings for Off-Policy Evaluation

Publication: Contribution to book/report/anthology/conference proceedings › Conference paper › Research › Peer-reviewed

Authors

  • Matej Cief
  • Jacek Golebiowski
  • Philipp Schmidt
  • Ziawasch Abedjan
  • Artur Bekasov

External organisations

  • Brno University of Technology (VUT)
  • Kempelen Institute of Intelligent Technologies (KINIT)
  • Amazon.com, Inc.

Details

Original language: English
Title of host publication: Advances in Information Retrieval
Subtitle: 46th European Conference on Information Retrieval, ECIR 2024
Editors: Nazli Goharian, Nicola Tonellotto, Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, Iadh Ounis
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 108-122
Number of pages: 15
ISBN (electronic): 978-3-031-56027-9
ISBN (print): 978-3-031-56026-2
Publication status: Published - 20 March 2024
Event: 46th European Conference on Information Retrieval, ECIR 2024 - Glasgow, United Kingdom
Duration: 24 March 2024 - 28 March 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14608 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Abstract

Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.
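
For readers less familiar with the estimators named above, here is a minimal sketch in standard off-policy evaluation notation (the symbols are assumed for illustration and are not quoted from the paper). IPS reweights each logged reward by the ratio of target to logging propensities of the chosen action, while MIPS instead uses the marginal propensities of an action embedding:

\[
\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i,
\qquad
\hat{V}_{\mathrm{MIPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{p(e_i \mid x_i, \pi)}{p(e_i \mid x_i, \pi_0)}\, r_i,
\]

where \(x_i\) is the context, \(a_i\) the action chosen by the logging policy \(\pi_0\), \(r_i\) the observed reward, \(\pi\) the target policy, and \(e_i\) an embedding of \(a_i\). When many actions map to similar embeddings, the embedding propensity ratios are better behaved than the per-action ratios, which is the source of the variance reduction; the contribution described in the abstract is to obtain \(e\) from the intermediate outputs of a learned reward model rather than assuming it is given.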

ASJC Scopus subject areas

Cite

Learning Action Embeddings for Off-Policy Evaluation. / Cief, Matej; Golebiowski, Jacek; Schmidt, Philipp et al.
Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024. ed. / Nazli Goharian; Nicola Tonellotto; Yulan He; Aldo Lipani; Graham McDonald; Craig Macdonald; Iadh Ounis. Springer Science and Business Media Deutschland GmbH, 2024. pp. 108-122 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14608 LNCS).


Cief, M, Golebiowski, J, Schmidt, P, Abedjan, Z & Bekasov, A 2024, Learning Action Embeddings for Off-Policy Evaluation. in N Goharian, N Tonellotto, Y He, A Lipani, G McDonald, C Macdonald & I Ounis (eds), Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14608 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 108-122, 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, United Kingdom, 24 March 2024. https://doi.org/10.48550/arXiv.2305.03954, https://doi.org/10.1007/978-3-031-56027-9_7
Cief, M., Golebiowski, J., Schmidt, P., Abedjan, Z., & Bekasov, A. (2024). Learning Action Embeddings for Off-Policy Evaluation. In N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, & I. Ounis (Eds.), Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024 (pp. 108-122). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14608 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.48550/arXiv.2305.03954, https://doi.org/10.1007/978-3-031-56027-9_7
Cief M, Golebiowski J, Schmidt P, Abedjan Z, Bekasov A. Learning Action Embeddings for Off-Policy Evaluation. In Goharian N, Tonellotto N, He Y, Lipani A, McDonald G, Macdonald C, Ounis I, editors, Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024. Springer Science and Business Media Deutschland GmbH. 2024. p. 108-122. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.48550/arXiv.2305.03954, 10.1007/978-3-031-56027-9_7
Cief, Matej ; Golebiowski, Jacek ; Schmidt, Philipp et al. / Learning Action Embeddings for Off-Policy Evaluation. Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024. ed. / Nazli Goharian ; Nicola Tonellotto ; Yulan He ; Aldo Lipani ; Graham McDonald ; Craig Macdonald ; Iadh Ounis. Springer Science and Business Media Deutschland GmbH, 2024. pp. 108-122 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{a295fa61047d4a44b67283f51a78e4ce,
title = "Learning Action Embeddings for Off-Policy Evaluation",
abstract = "Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.",
keywords = "large action space, multi-armed bandits, off-policy evaluation, recommender systems, representation learning",
author = "Matej Cief and Jacek Golebiowski and Philipp Schmidt and Ziawasch Abedjan and Artur Bekasov",
note = "Funding Information: The research conducted by Matej Cief (also with slovak.AI) was partially supported by TAILOR, a project funded by EU Horizon 2020 under GA No. 952215, https://doi.org/10.3030/952215.; 46th European Conference on Information Retrieval, ECIR 2024 ; Conference date: 24-03-2024 Through 28-03-2024",
year = "2024",
month = mar,
day = "20",
doi = "10.48550/arXiv.2305.03954",
language = "English",
isbn = "9783031560262",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Science and Business Media Deutschland GmbH",
pages = "108--122",
editor = "Nazli Goharian and Nicola Tonellotto and Yulan He and Aldo Lipani and Graham McDonald and Craig Macdonald and Iadh Ounis",
booktitle = "Advances in Information Retrieval",
address = "Germany",

}


TY - GEN

T1 - Learning Action Embeddings for Off-Policy Evaluation

AU - Cief, Matej

AU - Golebiowski, Jacek

AU - Schmidt, Philipp

AU - Abedjan, Ziawasch

AU - Bekasov, Artur

N1 - Funding Information: The research conducted by Matej Cief (also with slovak.AI) was partially supported by TAILOR, a project funded by EU Horizon 2020 under GA No. 952215, https://doi.org/10.3030/952215.

PY - 2024/3/20

Y1 - 2024/3/20

N2 - Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.

AB - Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.

KW - large action space

KW - multi-armed bandits

KW - off-policy evaluation

KW - recommender systems

KW - representation learning

UR - http://www.scopus.com/inward/record.url?scp=85189744882&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2305.03954

DO - 10.48550/arXiv.2305.03954

M3 - Conference contribution

AN - SCOPUS:85189744882

SN - 9783031560262

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 108

EP - 122

BT - Advances in Information Retrieval

A2 - Goharian, Nazli

A2 - Tonellotto, Nicola

A2 - He, Yulan

A2 - Lipani, Aldo

A2 - McDonald, Graham

A2 - Macdonald, Craig

A2 - Ounis, Iadh

PB - Springer Science and Business Media Deutschland GmbH

T2 - 46th European Conference on Information Retrieval, ECIR 2024

Y2 - 24 March 2024 through 28 March 2024

ER -