
Small-Loss Bounds for Online Learning with Partial Information

Published: 01 August 2022

Abstract

We consider the problem of adversarial (nonstochastic) online learning with partial-information feedback, in which, at each round, a decision maker selects an action from a finite set of alternatives. We develop a black-box approach for such problems in which the learner observes as feedback only losses of a subset of the actions that includes the selected action. When losses of actions are nonnegative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so-called “small-loss” o(αL⋆) regret bounds with high probability, where α is the independence number of the graph and L⋆ is the loss of the best action. Prior to our work, there was no data-dependent guarantee for general feedback graphs even for pseudo-regret (without dependence on the number of actions, i.e., utilizing the increased information feedback). Taking advantage of the black-box nature of our technique, we extend our results to many other applications, such as combinatorial semi-bandits (including routing in networks), contextual bandits (even with an infinite comparator class), and learning with slowly changing (shifting) comparators. In the special case of multi-armed bandit and combinatorial semi-bandit problems, we provide optimal small-loss, high-probability regret guarantees of Õ(√(dL⋆)), where d is the number of actions, answering open questions of Neu. Previous bounds for multi-armed bandits and semi-bandits were known only for pseudo-regret and only in expectation. We also offer an optimal Õ(√(κL⋆)) regret guarantee for fixed feedback graphs with clique-partition number at most κ.
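
To make the quantities in the abstract concrete, the display below spells out what a “small-loss” (first-order) bound means. The notation is an illustrative sketch of ours, assuming per-round losses bounded in [0, 1]; it is not quoted from the paper.

```latex
% Illustrative definitions (our notation, not quoted from the paper).
% The learner picks an action a_t each round; \ell_t(a) is the loss of action a,
% assumed here to lie in [0,1].
\[
  L^{\star} \;=\; \min_{a} \sum_{t=1}^{T} \ell_t(a),
  \qquad
  \mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;-\; L^{\star}.
\]
% A worst-case bound for feedback graphs with independence number \alpha scales
% with the horizon, \widetilde{O}(\sqrt{\alpha T}); a small-loss bound replaces
% the horizon T by the best action's cumulative loss L^{\star}:
\[
  \mathrm{Regret}_T \;=\; \widetilde{O}\!\bigl(\sqrt{\alpha L^{\star}}\bigr)
  \;=\; o(\alpha L^{\star}),
\]
% which is never larger (since L^{\star} \le T when losses lie in [0,1]) and is
% much smaller whenever the best action accumulates little loss.
```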

References

[1] Agarwal A, Krishnamurthy A, Langford J, Luo H, Schapire RE (2017) Open problem: First-order regret bounds for contextual bandits. Proc. 2017 Conf. Learn. Theory, vol. 65, 4–7.
[2] Allenberg C, Auer P, Györfi L, Ottucsák G (2006) Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. Proc. 17th Internat. Conf. Algorithmic Learn. Theory, 229–243.
[3] Allen-Zhu Z, Bubeck S, Li Y (2018) Make the minority great again: First-order regret bound for contextual bandits. Proc. 35th Internat. Conf. Machine Learn., 186–194.
[4] Alon N, Cesa-Bianchi N, Dekel O, Koren T (2015) Online learning with feedback graphs: Beyond bandits. Proc. 28th Conf. Learn. Theory, 23–35.
[5] Alon N, Cesa-Bianchi N, Gentile C, Mansour Y (2013) From bandits to experts: A tale of domination and independence. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, 1610–1618.
[6] Alon N, Cesa-Bianchi N, Gentile C, Mannor S, Mansour Y, Shamir O (2017) Nonstochastic multi-armed bandits with graph-structured feedback. SIAM J. Comput. 46(6):1785–1826.
[7] Audibert J, Bubeck S (2010) Regret bounds and minimax policies under partial monitoring. J. Machine Learn. Res. 11(94):2785–2836.
[8] Audibert JY, Bubeck S, Lugosi G (2014) Regret in online combinatorial optimization. Math. Oper. Res. 39(1):31–45.
[9] Auer P, Cesa-Bianchi N, Gentile C (2002) Adaptive and self-confident on-line learning algorithms. J. Comput. System Sci. 64(1):48–75.
[10] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2003) The nonstochastic multi-armed bandit problem. SIAM J. Comput. 32(1):48–77.
[11] Awerbuch B, Kleinberg RD (2004) Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. Proc. 36th Annual ACM Sympos. Theory Comput., 45–53.
[12] Beygelzimer A, Langford J, Li L, Reyzin L, Schapire R (2011) Contextual bandit algorithms with supervised learning guarantees. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (PMLR), 19–26.
[13] Blum A, Hartline JD (2005) Near-optimal online auctions. Proc. 16th Annual ACM-SIAM Sympos. Discrete Algorithms, 1156–1163.
[14] Blum A, Even-Dar E, Ligett K (2010) Routing without regret: On convergence to Nash equilibria of regret-minimizing algorithms in routing games. Theory Comput. 6(1):179–199.
[15] Blum A, Hajiaghayi M, Ligett K, Roth A (2008) Regret minimization and the price of total anarchy. Proc. 40th Annual ACM Sympos. Theory Comput., 373–382.
[16] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learning 5(1):1–122. https://www.nowpublishers.com/article/Details/MAL-024.
[17] Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press).
[18] Cesa-Bianchi N, Gentile C, Mansour Y (2013) Regret minimization for reserve prices in second-price auctions. Proc. 24th Annual ACM-SIAM Sympos. Discrete Algorithms, 1190–1204.
[19] Cesa-Bianchi N, Lugosi G, Stoltz G (2005) Minimizing regret with label efficient prediction. IEEE Trans. Inform. Theory 51(6):2152–2162.
[20] Cohen A, Hazan T, Koren T (2016) Online learning with feedback graphs without the graphs. Proc. 33rd Internat. Conf. Machine Learn., 811–819.
[21] Cover TM (1991) Universal portfolios. Math. Finance 1(1):1–29.
[22] Daniely A, Gonen A, Shalev-Shwartz S (2015) Strongly adaptive online learning. Proc. 32nd Internat. Conf. Machine Learn., 1405–1411.
[23] Foster DJ, Li Z, Lykouris T, Sridharan K, Tardos É (2016) Learning in games: Robustness of fast convergence. Annual Conf. Neural Inform. Processing Systems 2016, 4727–4735.
[24] Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55(1):119–139.
[25] Hannan J (1957) Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, vol. 3 (Princeton University Press), 97–139.
[26] Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Machine Learn. 69(2–3):169–192.
[27] Herbster M, Warmuth MK (1998) Tracking the best expert. Machine Learn. 32(2):151–178.
[28] Kalai A, Vempala S (2005) Efficient algorithms for online decision problems. J. Comput. System Sci. 71(3):291–307.
[29] Kocák T, Neu G, Valko M (2016) Online learning with noisy side observations. Proc. 19th Internat. Conf. Artificial Intelligence Statist., 1186–1194.
[30] Kocák T, Neu G, Valko M, Munos R (2014) Efficient learning by implicit exploration in bandit problems with side observations. Adv. Neural Inform. Processing Systems, 613–621.
[31] Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.
[32] Langford J, Zhang T (2007) The epoch-greedy algorithm for contextual multi-armed bandits. Proc. 20th Internat. Conf. Neural Inform. Processing Systems, 817–824.
[33] Littlestone N, Warmuth MK (1994) The weighted majority algorithm. Inform. Comput. 108(2):212–261.
[34] Liu YP, Sellke M (2018) Personal communication via email.
[35] Luo H, Schapire RE (2015) Achieving all with no parameters: AdaNormalHedge. Proc. 28th Conf. Learn. Theory, 1286–1304.
[36] Lykouris T, Syrgkanis V, Tardos E (2016) Learning and efficiency in games with dynamic population. Proc. 27th Annual ACM-SIAM Sympos. Discrete Algorithms, 120–129.
[37] Mannor S, Shamir O (2011) From bandits to experts: On the value of side-observations. Proc. 24th Internat. Conf. Neural Inform. Processing Systems, 684–692.
[38] Neu G (2015) Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Annual Conf. Neural Inform. Processing Systems, 3168–3176.
[39] Neu G (2015) First-order regret bounds for combinatorial semi-bandits. Proc. 28th Conf. Learn. Theory, 1360–1375.
[40] Neu G, Bartók G (2016) Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. J. Machine Learn. Res. 17(1):5355–5375.
[41] Rakhlin A, Sridharan K (2013) Online learning with predictable sequences. Proc. 26th Annual Conf. Learn. Theory, 993–1019.
[42] Rakhlin A, Sridharan K (2014) Online non-parametric regression. Proc. 27th Conf. Learn. Theory, 1232–1264.
[43] Rakhlin A, Sridharan K (2016) BISTRO: An efficient relaxation-based method for contextual bandits. Proc. 33rd Internat. Conf. Machine Learn., vol. 48, 1977–1985.
[44] Rakhlin A, Sridharan K (2017) On equivalence of martingale tail bounds and deterministic regret inequalities. Proc. 30th Conf. Learn. Theory, 1704–1722.
[45] Rakhlin A, Sridharan K, Tewari A (2010) Online learning: Random averages, combinatorial parameters, and learnability. Preprint, submitted June 6, https://arxiv.org/abs/1006.1138.
[46] Roughgarden T (2015) Intrinsic robustness of the price of anarchy. J. ACM 62(5):1–42.
[47] Roughgarden T, Wang JR (2016) Minimizing regret with multiple reserves. Proc. 2016 ACM Conf. Econom. Comput., 601–616.
[48] Syrgkanis V, Krishnamurthy A, Schapire RE (2016a) Efficient algorithms for adversarial contextual learning. Proc. 33rd Internat. Conf. Machine Learn., 2159–2168.
[49] Syrgkanis V, Luo H, Krishnamurthy A, Schapire RE (2016b) Improved regret bounds for oracle-based adversarial contextual bandits. Proc. 30th Internat. Conf. Neural Inform. Processing Systems, 3143–3151.
[50] Tossou A, Dimitrakakis C, Dubhashi D (2017) Thompson sampling for stochastic bandits with graph feedback. Proc. Conf. AAAI Artificial Intelligence, 31(1).
