Tabular and Deep Learning for the Whittle Index
Authors: Francisco Robledo Relaño (LMAP, UPPA, UPV/EHU), Vivek Borkar (EE-IIT), Urtzi Ayesta (IRIT-RMESS, UPV/EHU, CNRS), Konstantin Avrachenkov (Inria)
Year: 2024
Source:
https://arxiv.org/abs/2406.02057
TLDR:
The paper presents QWI and QWINN, two reinforcement learning algorithms designed to learn the Whittle index for the total discounted criterion in Restless Multi-Armed Bandit Problems (RMABPs). Both rely on a two-time-scale strategy: state-action Q-values are updated on a faster time-scale, and the Whittle indices on a slower one. QWI is a tabular implementation that provably converges to the real Whittle indices; QWINN adapts it by using neural networks to compute the Q-values on the faster time-scale, which lets it extrapolate information from one state to another and scale naturally to large state-space environments. Numerical computations show that both algorithms outperform competing methods in convergence rate and discounted-reward optimization, with QWINN deriving accurate Whittle indices from limited data samples. The paper closes with open challenges and future research directions, including restless bandits with partial observations.
Abstract
The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as Restless Multi-Armed Bandit Problems (RMABPs). In this paper we present QWI and QWINN, two reinforcement learning algorithms, respectively tabular and deep, to learn the Whittle index for the total discounted criterion. The key feature is the use of two time-scales, a faster one to update the state-action Q-values, and a relatively slower one to update the Whittle indices. In our main theoretical result we show that QWI, which is a tabular implementation, converges to the real Whittle indices. We then present QWINN, an adaptation of the QWI algorithm using neural networks to compute the Q-values on the faster time-scale, which is able to extrapolate information from one state to another and scales naturally to large state-space environments. For QWINN, we show that all local minima of the Bellman error are locally stable equilibria, which is the first result of its kind for DQN-based schemes. Numerical computations show that QWI and QWINN converge faster than the standard Q-learning algorithm, neural-network based approximate Q-learning and other state-of-the-art algorithms.
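To make the two-time-scale structure concrete, the coupled updates for a single arm can be written schematically as follows. The notation here is ours and simplified: in the paper a separate set of Q-values is maintained for each reference state $\hat{s}$ whose index is being learned, a dependence we suppress for readability.

    \begin{align*}
    Q_{n+1}(s_n, a_n) &= Q_n(s_n, a_n) + \alpha_n \Big[ r(s_n, a_n) + \lambda_n(\hat{s})\, \mathbf{1}\{a_n = 0\} + \gamma \max_b Q_n(s_{n+1}, b) - Q_n(s_n, a_n) \Big], \\
    \lambda_{n+1}(\hat{s}) &= \lambda_n(\hat{s}) + \beta_n \big[ Q_n(\hat{s}, 1) - Q_n(\hat{s}, 0) \big],
    \end{align*}

where $\gamma$ is the discount factor and the step sizes satisfy $\beta_n / \alpha_n \to 0$. On the fast time-scale the Q-values equilibrate for an effectively frozen subsidy $\lambda_n(\hat{s})$; on the slow time-scale the subsidy drifts toward the value that makes the active and passive actions indifferent at $\hat{s}$, which is precisely the Whittle index.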
Method
The authors employed a two-time-scale strategy: a faster loop updates the state-action Q-values, while a slower one refines the Whittle indices. They introduced two reinforcement learning algorithms, QWI and QWINN: QWI is a tabular implementation, while QWINN integrates neural networks, drawing inspiration from DQN. Both learn the Whittle index for the total discounted criterion in Restless Multi-Armed Bandit Problems (RMABPs). The paper supports the approach with theoretical results, including the convergence proof of Theorem 3.1, and with numerical comparisons.
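A minimal tabular sketch of this two-time-scale loop for a single arm is given below. The step(state, action) simulator interface, the variable names, and the step-size schedules are illustrative assumptions of ours, not taken from the paper.

    import numpy as np

    def qwi_single_arm(step, n_states, gamma=0.9, n_iters=200_000, eps=0.1, seed=0):
        """Schematic two-time-scale QWI loop for one restless arm.

        `step(state, action)` is a hypothetical simulator returning
        (reward, next_state); it is not part of the paper's code.
        """
        rng = np.random.default_rng(seed)
        # One Q-table per reference state s_hat: Q[s_hat, s, a] estimates the
        # value of (s, a) when passivity earns the subsidy lam[s_hat].
        Q = np.zeros((n_states, n_states, 2))
        lam = np.zeros(n_states)              # Whittle index estimates
        visits = np.zeros((n_states, 2))
        s = int(rng.integers(n_states))

        for n in range(1, n_iters + 1):
            # Epsilon-greedy action at the arm's current state
            if rng.random() < eps:
                a = int(rng.integers(2))
            else:
                a = int(Q[s, s, 1] >= Q[s, s, 0])
            r, s_next = step(s, a)
            visits[s, a] += 1
            alpha = visits[s, a] ** -0.6      # faster step size (illustrative)
            beta = 1.0 / (1.0 + n / 100.0)    # slower step size (illustrative)
            for s_hat in range(n_states):
                # Faster time-scale: Q-learning with the passivity subsidy
                subsidy = lam[s_hat] if a == 0 else 0.0
                target = r + subsidy + gamma * Q[s_hat, s_next].max()
                Q[s_hat, s, a] += alpha * (target - Q[s_hat, s, a])
                # Slower time-scale: step the index toward indifference at s_hat
                lam[s_hat] += beta * (Q[s_hat, s_hat, 1] - Q[s_hat, s_hat, 0])
            s = s_next

        return lam

In a full multi-armed deployment, each arm would run such a loop on its own transitions while the controller activates, at every decision epoch, the arms with the largest current index estimates.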
Main Finding
The authors found that QWINN, which integrates neural networks, excels in large state spaces thanks to its ability to extrapolate across states, while the tabular QWI accurately estimates Whittle indices for small to moderate-sized problems. Their numerical comparisons showed that both algorithms converge faster and optimize the discounted reward better than standard Q-learning and other state-of-the-art baselines, with QWINN deriving accurate Whittle indices from comparatively few data samples. The authors also highlighted neural-network-based approaches as a promising route to restless bandits with partial observations, and identified regret analysis in this framework as a direction for future research.
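As a rough illustration of the neural variant (our own simplification, not the paper's implementation), the Q-table can be replaced by a small network that shares parameters across states; the Bellman error below is minimized by gradient descent on the faster time-scale, while the index table is still stepped slowly as in QWI.

    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        """Maps (one-hot state, one-hot reference state) to (Q_passive, Q_active).

        A hypothetical QWINN-style network; the architecture is ours."""
        def __init__(self, n_states, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * n_states, hidden), nn.ReLU(),
                nn.Linear(hidden, 2),
            )

        def forward(self, s, s_hat):
            return self.net(torch.cat([s, s_hat], dim=-1))

    def bellman_loss(qnet, lam, s, a, r, s_next, s_hat, s_hat_idx, gamma=0.9):
        """Faster time-scale TD loss with the subsidy lam[s_hat] paid to the
        passive action, mirroring the tabular update (notation ours)."""
        q = qnet(s, s_hat).gather(1, a)                     # Q(s, a; s_hat)
        with torch.no_grad():
            q_next = qnet(s_next, s_hat).max(1, keepdim=True).values
            subsidy = lam[s_hat_idx].unsqueeze(1) * (a == 0)
            target = r + subsidy + gamma * q_next
        return nn.functional.mse_loss(q, target)

Because parameters are shared, a gradient step on one observed transition also moves the Q-values of unvisited states, which is the extrapolation behaviour credited above for QWINN's sample efficiency; the slow index update itself remains tabular, lambda(s_hat) <- lambda(s_hat) + beta [Q(s_hat, 1; s_hat) - Q(s_hat, 0; s_hat)].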
Conclusion
The paper introduces two algorithms, QWI and QWINN, for learning the Whittle index in Restless Multi-Armed Bandit Problems (RMABPs). QWINN excels in larger state spaces, while QWI accurately estimates Whittle indices for small to moderate-sized problems; both outperform other relevant algorithms in convergence rate and discounted-reward optimization. The paper also identifies future research directions, such as restless bandits with partial observations and regret analysis in this framework, and acknowledges the support of several funding sources and research projects.
Keywords
Machine learning, Reinforcement Learning, Whittle Index, Markov Decision Problem, Multi-armed Restless Bandit