Learning Action Embeddings for Off-Policy Evaluation

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer-reviewed

Authors

  • Matej Cief
  • Jacek Golebiowski
  • Philipp Schmidt
  • Ziawasch Abedjan
  • Artur Bekasov

External Research Organisations

  • Brno University of Technology
  • Kempelen Institute of Intelligent Technologies (KINIT)
  • Amazon Search
  • Amazon, London

Details

Original language: English
Title of host publication: Advances in Information Retrieval
Subtitle of host publication: 46th European Conference on Information Retrieval, ECIR 2024
Editors: Nazli Goharian, Nicola Tonellotto, Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, Iadh Ounis
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 108-122
Number of pages: 15
ISBN (electronic): 978-3-031-56027-9
ISBN (print): 978-3-031-56026-2
Publication status: Published - 20 Mar 2024
Event: 46th European Conference on Information Retrieval, ECIR 2024 - Glasgow, United Kingdom
Duration: 24 Mar 2024 - 28 Mar 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14608 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Abstract

Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.
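To make the contrast concrete, the following is a minimal sketch (an assumed setup, not the authors' released code) of vanilla IPS next to a MIPS-style estimator whose importance weights are marginalized over learned action embeddings. Here action_emb stands in for intermediate outputs of a trained reward model (e.g. its action-embedding layer), and the k-means discretization into embedding groups is an illustrative simplification.

# Hedged sketch: vanilla IPS vs. a MIPS-style estimate over learned embeddings.
# `action_emb` is assumed to come from a trained reward model; the k-means
# grouping is an illustrative choice, not the exact recipe from the paper.
import numpy as np
from sklearn.cluster import KMeans

def ips(rewards, actions, pi_e, pi_0):
    """Vanilla IPS: reweight each logged reward by pi_e(a|x) / pi_0(a|x)."""
    idx = np.arange(len(actions))
    w = pi_e[idx, actions] / pi_0[idx, actions]
    return float(np.mean(w * rewards))

def mips_learned(rewards, actions, pi_e, pi_0, action_emb, n_groups=10):
    """MIPS-style estimate: actions whose learned embeddings fall in the same
    cluster share one marginal importance weight p_e(group|x) / p_0(group|x)."""
    group = KMeans(n_clusters=n_groups, n_init=10).fit_predict(action_emb)  # shape (K,)
    idx = np.arange(len(actions))
    # Marginal probability of each embedding group under both policies:
    # p(group|x) = sum over actions a in the group of pi(a|x).
    p_e = np.stack([pi_e[:, group == g].sum(axis=1) for g in range(n_groups)], axis=1)
    p_0 = np.stack([pi_0[:, group == g].sum(axis=1) for g in range(n_groups)], axis=1)
    w = p_e[idx, group[actions]] / p_0[idx, group[actions]]
    return float(np.mean(w * rewards))

Because many actions share a single marginal weight, extreme pi_e/pi_0 ratios for actions that the logging policy rarely explores are smoothed out, which is where the variance reduction over vanilla IPS comes from.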

Keywords

    large action space, multi-armed bandits, off-policy evaluation, recommender systems, representation learning


Cite this

Learning Action Embeddings for Off-Policy Evaluation. / Cief, Matej; Golebiowski, Jacek; Schmidt, Philipp et al.
Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024. ed. / Nazli Goharian; Nicola Tonellotto; Yulan He; Aldo Lipani; Graham McDonald; Craig Macdonald; Iadh Ounis. Springer Science and Business Media Deutschland GmbH, 2024. p. 108-122 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14608 LNCS).


Cief, M, Golebiowski, J, Schmidt, P, Abedjan, Z & Bekasov, A 2024, Learning Action Embeddings for Off-Policy Evaluation. in N Goharian, N Tonellotto, Y He, A Lipani, G McDonald, C Macdonald & I Ounis (eds), Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14608 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 108-122, 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, United Kingdom (UK), 24 Mar 2024. https://doi.org/10.48550/arXiv.2305.03954, https://doi.org/10.1007/978-3-031-56027-9_7
Cief, M., Golebiowski, J., Schmidt, P., Abedjan, Z., & Bekasov, A. (2024). Learning Action Embeddings for Off-Policy Evaluation. In N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, & I. Ounis (Eds.), Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024 (pp. 108-122). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14608 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.48550/arXiv.2305.03954, https://doi.org/10.1007/978-3-031-56027-9_7
Cief M, Golebiowski J, Schmidt P, Abedjan Z, Bekasov A. Learning Action Embeddings for Off-Policy Evaluation. In Goharian N, Tonellotto N, He Y, Lipani A, McDonald G, Macdonald C, Ounis I, editors, Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024. Springer Science and Business Media Deutschland GmbH. 2024. p. 108-122. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.48550/arXiv.2305.03954, 10.1007/978-3-031-56027-9_7
Cief, Matej ; Golebiowski, Jacek ; Schmidt, Philipp et al. / Learning Action Embeddings for Off-Policy Evaluation. Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024. editor / Nazli Goharian ; Nicola Tonellotto ; Yulan He ; Aldo Lipani ; Graham McDonald ; Craig Macdonald ; Iadh Ounis. Springer Science and Business Media Deutschland GmbH, 2024. pp. 108-122 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
BibTeX
@inproceedings{a295fa61047d4a44b67283f51a78e4ce,
title = "Learning Action Embeddings for Off-Policy Evaluation",
abstract = "Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.",
keywords = "large action space, multi-armed bandits, off-policy evaluation, recommender systems, representation learning",
author = "Matej Cief and Jacek Golebiowski and Philipp Schmidt and Ziawasch Abedjan and Artur Bekasov",
note = "Funding Information: The research conducted by Matej Cief (also with slovak.AI) was partially supported by TAILOR, a project funded by EU Horizon 2020 under GA No. 952215, https://doi.org/10.3030/952215.; 46th European Conference on Information Retrieval, ECIR 2024 ; Conference date: 24-03-2024 Through 28-03-2024",
year = "2024",
month = mar,
day = "20",
doi = "10.48550/arXiv.2305.03954",
language = "English",
isbn = "9783031560262",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Science and Business Media Deutschland GmbH",
pages = "108--122",
editor = "Nazli Goharian and Nicola Tonellotto and Yulan He and Aldo Lipani and Graham McDonald and Craig Macdonald and Iadh Ounis",
booktitle = "Advances in Information Retrieval",
address = "Germany",
}

RIS

TY - GEN

T1 - Learning Action Embeddings for Off-Policy Evaluation

AU - Cief, Matej

AU - Golebiowski, Jacek

AU - Schmidt, Philipp

AU - Abedjan, Ziawasch

AU - Bekasov, Artur

N1 - Funding Information: The research conducted by Matej Cief (also with slovak.AI) was partially supported by TAILOR, a project funded by EU Horizon 2020 under GA No. 952215, https://doi.org/10.3030/952215.

PY - 2024/3/20

Y1 - 2024/3/20

N2 - Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.

AB - Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS) that uses action embeddings instead, which reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, both on synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to DR for combining the low variance of DM with the low bias of IPS.

KW - large action space

KW - multi-armed bandits

KW - off-policy evaluation

KW - recommender systems

KW - representation learning

UR - http://www.scopus.com/inward/record.url?scp=85189744882&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2305.03954

DO - 10.48550/arXiv.2305.03954

M3 - Conference contribution

AN - SCOPUS:85189744882

SN - 9783031560262

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 108

EP - 122

BT - Advances in Information Retrieval

A2 - Goharian, Nazli

A2 - Tonellotto, Nicola

A2 - He, Yulan

A2 - Lipani, Aldo

A2 - McDonald, Graham

A2 - Macdonald, Craig

A2 - Ounis, Iadh

PB - Springer Science and Business Media Deutschland GmbH

T2 - 46th European Conference on Information Retrieval, ECIR 2024

Y2 - 24 March 2024 through 28 March 2024

ER -