
Small-Loss Bounds for Online Learning with Partial Information

Published: 01 August 2022

Abstract

We consider the problem of adversarial (nonstochastic) online learning with partial-information feedback, in which, at each round, a decision maker selects an action from a finite set of alternatives. We develop a black-box approach for such problems in which the learner observes as feedback only losses of a subset of the actions that includes the selected action. When losses of actions are nonnegative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so-called “small-loss” o(αL⋆) regret bounds with high probability, where α is the independence number of the graph and L⋆ is the loss of the best action. Prior to our work, there was no data-dependent guarantee for general feedback graphs even for pseudo-regret (without dependence on the number of actions, i.e., utilizing the increased information feedback). Taking advantage of the black-box nature of our technique, we extend our results to many other applications, such as combinatorial semi-bandits (including routing in networks), contextual bandits (even with an infinite comparator class), and learning with slowly changing (shifting) comparators. In the special case of multi-armed bandit and combinatorial semi-bandit problems, we provide optimal small-loss, high-probability regret guarantees of Õ(√(dL⋆)), where d is the number of actions, answering open questions of Neu. Previous bounds for multi-armed bandits and semi-bandits were known only for pseudo-regret and only in expectation. We also offer an optimal Õ(√(κL⋆)) regret guarantee for fixed feedback graphs with clique-partition number at most κ.
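
To make the quantities in the abstract concrete, the display below spells out what a “small-loss” (first-order) bound means. The notation is an illustrative sketch of ours, assuming per-round losses bounded in [0, 1]; it is not quoted from the paper.

```latex
% Illustrative definitions (our notation, not quoted from the paper).
% The learner picks an action a_t each round; \ell_t(a) is the loss of action a,
% assumed here to lie in [0,1].
\[
  L^{\star} \;=\; \min_{a} \sum_{t=1}^{T} \ell_t(a),
  \qquad
  \mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;-\; L^{\star}.
\]
% A worst-case bound for feedback graphs with independence number \alpha scales
% with the horizon, \widetilde{O}(\sqrt{\alpha T}); a small-loss bound replaces
% the horizon T by the best action's cumulative loss L^{\star}:
\[
  \mathrm{Regret}_T \;=\; \widetilde{O}\!\bigl(\sqrt{\alpha L^{\star}}\bigr)
  \;=\; o(\alpha L^{\star}),
\]
% which is never larger (since L^{\star} \le T when losses lie in [0,1]) and is
% much smaller whenever the best action accumulates little loss.
```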

References

[1] Agarwal A, Krishnamurthy A, Langford J, Luo H, Schapire RE (2017) Open problem: First-order regret bounds for contextual bandits. Proc. 2017 Conf. Learn. Theory, vol. 65, 4–7.
[2] Allenberg C, Auer P, Györfi L, Ottucsák G (2006) Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. Proc. 17th Internat. Conf. Algorithmic Learn. Theory, 229–243.
[3] Allen-Zhu Z, Bubeck S, Li Y (2018) Make the minority great again: First-order regret bound for contextual bandits. Proc. 35th Internat. Conf. Machine Learn., 186–194.
[4] Alon N, Cesa-Bianchi N, Dekel O, Koren T (2015) Online learning with feedback graphs: Beyond bandits. Proc. 28th Conf. Learn. Theory, 23–35.
[5] Alon N, Cesa-Bianchi N, Gentile C, Mansour Y (2013) From bandits to experts: A tale of domination and independence. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, 1610–1618.
[6] Alon N, Cesa-Bianchi N, Gentile C, Mannor S, Mansour Y, Shamir O (2017) Nonstochastic multi-armed bandits with graph-structured feedback. SIAM J. Comput. 46(6):1785–1826.
[7] Audibert J, Bubeck S (2010) Regret bounds and minimax policies under partial monitoring. J. Machine Learn. Res. 11(94):2785–2836.
[8] Audibert JY, Bubeck S, Lugosi G (2014) Regret in online combinatorial optimization. Math. Oper. Res. 39(1):31–45.
[9] Auer P, Cesa-Bianchi N, Gentile C (2002) Adaptive and self-confident on-line learning algorithms. J. Comput. System Sci. 64(1):48–75.
[10] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2003) The nonstochastic multi-armed bandit problem. SIAM J. Comput. 32(1):48–77.
[11] Awerbuch B, Kleinberg RD (2004) Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. Proc. 36th Annual ACM Sympos. Theory Comput., 45–53.
[12] Beygelzimer A, Langford J, Li L, Reyzin L, Schapire R (2011) Contextual bandit algorithms with supervised learning guarantees. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (PMLR), 19–26.
[13] Blum A, Hartline JD (2005) Near-optimal online auctions. Proc. 16th Annual ACM-SIAM Sympos. Discrete Algorithms, 1156–1163.
[14] Blum A, Even-Dar E, Ligett K (2010) Routing without regret: On convergence to Nash equilibria of regret-minimizing algorithms in routing games. Theory Comput. 6(1):179–199.
[15] Blum A, Hajiaghayi M, Ligett K, Roth A (2008) Regret minimization and the price of total anarchy. Proc. 40th Annual ACM Sympos. Theory Comput., 373–382.
[16] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learning 5(1):1–122. https://www.nowpublishers.com/article/Details/MAL-024.
[17] Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press).
[18] Cesa-Bianchi N, Gentile C, Mansour Y (2013) Regret minimization for reserve prices in second-price auctions. Proc. 24th Annual ACM-SIAM Sympos. Discrete Algorithms, 1190–1204.
[19] Cesa-Bianchi N, Lugosi G, Stoltz G (2005) Minimizing regret with label efficient prediction. IEEE Trans. Inform. Theory 51(6):2152–2162.
[20] Cohen A, Hazan T, Koren T (2016) Online learning with feedback graphs without the graphs. Proc. 33rd Internat. Conf. Machine Learn., 811–819.
[21] Cover TM (1991) Universal portfolios. Math. Finance 1(1):1–29.
[22] Daniely A, Gonen A, Shalev-Shwartz S (2015) Strongly adaptive online learning. Proc. 32nd Internat. Conf. Machine Learn., 1405–1411.
[23] Foster DJ, Li Z, Lykouris T, Sridharan K, Tardos É (2016) Learning in games: Robustness of fast convergence. Annual Conf. Neural Inform. Processing Systems 2016, 4727–4735.
[24] Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55(1):119–139.
[25] Hannan J (1957) Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, vol. 3 (Princeton University Press), 97–139.
[26] Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Machine Learn. 69(2–3):169–192.
[27] Herbster M, Warmuth MK (1998) Tracking the best expert. Machine Learn. 32(2):151–178.
[28] Kalai A, Vempala S (2005) Efficient algorithms for online decision problems. J. Comput. System Sci. 71(3):291–307.
[29] Kocák T, Neu G, Valko M (2016) Online learning with noisy side observations. Proc. 19th Internat. Conf. Artificial Intelligence Statist., 1186–1194.
[30] Kocák T, Neu G, Valko M, Munos R (2014) Efficient learning by implicit exploration in bandit problems with side observations. Adv. Neural Inform. Processing Systems, 613–621.
[31] Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.
[32] Langford J, Zhang T (2007) The epoch-greedy algorithm for contextual multi-armed bandits. Proc. 20th Internat. Conf. Neural Inform. Processing Systems, 817–824.
[33] Littlestone N, Warmuth MK (1994) The weighted majority algorithm. Inform. Comput. 108(2):212–261.
[34] Liu YP, Sellke M (2018) Personal communication via email.
[35] Luo H, Schapire RE (2015) Achieving all with no parameters: AdaNormalHedge. Proc. 28th Conf. Learn. Theory, 1286–1304.
[36] Lykouris T, Syrgkanis V, Tardos E (2016) Learning and efficiency in games with dynamic population. Proc. 27th Annual ACM-SIAM Sympos. Discrete Algorithms, 120–129.
[37] Mannor S, Shamir O (2011) From bandits to experts: On the value of side-observations. Proc. 24th Internat. Conf. Neural Inform. Processing Systems, 684–692.
[38] Neu G (2015) Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Annual Conf. Neural Inform. Processing Systems, 3168–3176.
[39] Neu G (2015) First-order regret bounds for combinatorial semi-bandits. Proc. 28th Conf. Learn. Theory, 1360–1375.
[40] Neu G, Bartók G (2016) Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. J. Machine Learn. Res. 17(1):5355–5375.
[41] Rakhlin A, Sridharan K (2013) Online learning with predictable sequences. Proc. 26th Annual Conf. Learn. Theory, 993–1019.
[42] Rakhlin A, Sridharan K (2014) Online non-parametric regression. Proc. 27th Conf. Learn. Theory, 1232–1264.
[43] Rakhlin A, Sridharan K (2016) BISTRO: An efficient relaxation-based method for contextual bandits. Proc. 33rd Internat. Conf. Machine Learn., vol. 48, 1977–1985.
[44] Rakhlin A, Sridharan K (2017) On equivalence of martingale tail bounds and deterministic regret inequalities. Proc. 30th Conf. Learn. Theory, 1704–1722.
[45] Rakhlin A, Sridharan K, Tewari A (2010) Online learning: Random averages, combinatorial parameters, and learnability. Preprint, submitted June 6, https://arxiv.org/abs/1006.1138.
[46] Roughgarden T (2015) Intrinsic robustness of the price of anarchy. J. ACM 62(5):1–42.
[47] Roughgarden T, Wang JR (2016) Minimizing regret with multiple reserves. Proc. 2016 ACM Conf. Econom. Comput., 601–616.
[48] Syrgkanis V, Krishnamurthy A, Schapire RE (2016a) Efficient algorithms for adversarial contextual learning. Proc. 33rd Internat. Conf. Machine Learn., 2159–2168.
[49] Syrgkanis V, Luo H, Krishnamurthy A, Schapire RE (2016b) Improved regret bounds for oracle-based adversarial contextual bandits. Proc. 30th Internat. Conf. Neural Inform. Processing Systems, 3143–3151.
[50] Tossou A, Dimitrakakis C, Dubhashi D (2017) Thompson sampling for stochastic bandits with graph feedback. Proc. Conf. AAAI Artificial Intelligence, 31(1).
