Hey guys,
I am a final-year undergrad, and as part of my undergrad thesis, I have been working on Reinforcement Learning for the past year. Recently, my professor and I submitted a part of our work to a conference, and also published the extended version of it on arXiv, which you can check out here.

To all those out there who are intrigued by Reinforcement Learning, please do check out our work and let us know your thoughts, so that we can improve the remaining work and make it even better. Thanks a lot in advance for your valuable time and effort.

Would you mind sharing with us what a restless multi-armed bandit problem is? For example, suppose I am running a shop that only allows me to display 10 products at a time, but I have 100 products with different margins and different customer bases (e.g., product A is highly profitable but only popular among a small portion of the people who walk by my shop). Can this be posed as an RMAB? If so, what exactly is your work trying to do to help a shop owner like me make my decisions?

Would you mind describing the actual decision-making process with my example, so that, as a shop owner without any technical knowledge, I know which 10 products to pick each time and ultimately make a good profit?

Sure, I would love to share what I understand about "Restless Multi-Armed Bandits (RMABs)". I am attaching my presentation slides here for your reference. I will add references to some of the slides, which you can read to learn more about the different concepts.

The "arm" in an RMAB is nothing but a Markov Decision Process (MDP). I can't recall whether Prof Andrew discussed MDPs explicitly in MLS C3 W3, but for a quick intro, you can refer to slide 8. Now, let's say that we take an RMAB with 10 arms. This means that 10 MDPs constitute the environment of the RMAB. We create the environment via something known as coupling of MDPs (please refer to slides 48 and 49). When we create an environment via coupling, a constraint is imposed on the agent: it can only interact with n MDPs at any time-step (or epoch), where n \in [1, 10] for our example.

Here, please note that we have made some simplifying assumptions for our work, which are as follows:

The agent only has 2 actions: the passive action (a = 0, i.e., the agent doesn't interact with the given MDP) and the active action (a = 1, i.e., the agent interacts with the given MDP).

We have fixed n = 1, i.e., the agent can only interact with 1 MDP at any given time-step.

These assumptions provide us with 2 major advantages:

The results are more intuitive and easier to understand.

For simulations based on rollout algorithms (Monte Carlo Tree Search), computation time is reduced manifold.
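To make the setup concrete, here is a minimal sketch of a coupled environment under these assumptions (2-state arms, binary actions, n = 1). The class name and all transition probabilities and rewards are invented for illustration; they are not taken from our work.

```python
import numpy as np

class RestlessBandit:
    """Toy coupled environment: each arm is a 2-state MDP.

    One arm is played per time-step (n = 1), but every arm transitions
    every step -- that is what makes the bandit "restless". All numbers
    are randomly generated for illustration.
    """

    def __init__(self, num_arms=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.num_arms = num_arms
        # P[arm, action, state] is a distribution over next states;
        # action 0 = passive, action 1 = active.
        self.P = self.rng.dirichlet(np.ones(2), size=(num_arms, 2, 2))
        # Reward collected per (arm, state) when that arm is played.
        self.R = self.rng.uniform(0.0, 1.0, size=(num_arms, 2))
        self.states = self.rng.integers(0, 2, size=num_arms)

    def step(self, arm):
        """Play one arm; every arm (played or not) then transitions."""
        reward = self.R[arm, self.states[arm]]
        for i in range(self.num_arms):
            a = 1 if i == arm else 0
            self.states[i] = self.rng.choice(2, p=self.P[i, a, self.states[i]])
        return reward
```

A single interaction is then just `env = RestlessBandit(); reward = env.step(arm=3)`.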

I might have missed something on the theoretical side of RMABs, so could you please review the explanation so far and let me know if there are any gaps? Once the theoretical definition is clear, I will try to link it to the example you mentioned.

Hey @rmwkwok,
Thanks for the update. Let us start discussing your example.

I am quoting your question here for our reference. Let us define the terminology first. You can refer to slides 9 and 10 to know more about the terminology. The following is the mapping:

Shopkeeper is the agent

Each product constitutes an individual MDP (or an arm)

Each day (or an hour) is taken as a single time-step (or epoch)

Active action for a particular MDP means that the agent chooses to display that product at time-step t

Passive action for a particular MDP means that the associated product is not displayed at time-step t

The different states of an MDP could represent the number of units available and the number of purchases in that time-step

The rewards for a MDP could represent the profit (positive reward) or loss (negative reward) made by the agent for a given product in a given time-step. The loss could be due to storage costs.

For your example, as we can observe, each MDP will have different transition-probability and reward matrices. In other words, each of the MDPs is distinct, and hence, for this example, we have coupled non-identical MDPs.
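As a sketch of what "non-identical MDPs" could look like in code for your shop (state 0 = low stock, state 1 = in stock), here is a hypothetical helper; the function name and every number in it are my own invention for illustration, not from our work.

```python
import numpy as np

def make_product_arm(popularity, margin):
    """Build one product's 2-state MDP: state 0 = low stock, 1 = in stock.

    `popularity` sets how fast stock sells down when displayed; `margin`
    sets the reward. All numbers are illustrative.
    """
    P = np.empty((2, 2, 2))            # P[action, state] -> next-state distribution
    P[0] = [[0.9, 0.1], [0.05, 0.95]]  # passive: stock level mostly stays put
    P[1] = [[0.9, 0.1], [popularity, 1.0 - popularity]]  # active: popular items sell out
    R = np.array([-0.1, margin])       # low stock: storage cost only; in stock: the margin
    return P, R

rng = np.random.default_rng(0)
arms = [make_product_arm(rng.uniform(0.1, 0.9), rng.uniform(0.5, 5.0))
        for _ in range(100)]           # 100 non-identical product MDPs
```

Because popularity and margin differ per product, every arm ends up with its own transition and reward matrices, which is exactly the "coupled non-identical MDPs" situation.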

So, this is the formulation of your problem as an RMAB with 100 non-identical MDPs, in which the agent's constraint is that it can interact with only 10 MDPs at any given time-step. Now, based on its previous interactions, the agent will learn a policy; based on that policy, it will interact in the present, and it will store each interaction to refine the policy for future decisions.
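The interact-store-refine cycle just described could be skeletonised as follows; `env_step` and `choose` are placeholder names standing in for the shop environment and the current policy, not APIs from our work.

```python
import numpy as np

def interaction_loop(env_step, choose, num_arms=100, budget=10, horizon=10):
    """Skeleton of the learn-from-interaction cycle.

    env_step(chosen) -> (rewards, next_states) stands in for the shop
    environment; choose(states, log) is the current policy. Every
    transition is stored in `log` so the policy can be refined later.
    """
    log = []
    states = np.zeros(num_arms, dtype=int)   # e.g. every product starts "in stock"
    for t in range(horizon):
        chosen = choose(states, log)         # which 10 products to display now
        assert len(chosen) == budget         # the RMAB constraint
        rewards, next_states = env_step(chosen)
        log.append((states.copy(), chosen, rewards))   # stored interaction
        states = next_states
    return log
```

Anything from the myopic rule to a rollout algorithm can be plugged in as `choose`; only the policy changes, not the loop.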

Some of the ways the policy can help the agent:

For instance, the agent could come up with a policy so that all the products that require cold-storage facilities are displayed towards the closing hours, so that refrigeration costs could be saved when the shop is closed, by switching off the cold-storage facility.

But at the same time, if the products requiring cold-storage are available in large quantities, then there is no such incentive, since it is highly unlikely that large quantities will be sold in a short duration.

Similarly, the agent can come up with a policy to display cooked food items as soon as they are available, so that they won't go stale.

Products with caffeine could be displayed in the early morning hours, so that office workers are satisfied.

Products with sugar content could be displayed in the afternoon, so that children who have just gotten out of school are entertained.

… and so on. Now, I haven't applied this framework to a real-world application yet, and this is the first time I am reading about this framework myself, but my supervisor said that it is extensively used to model wireless communication systems. I guess he himself has produced some research work using this framework for wireless communication systems. But your example is indeed much more intuitive, and I believe this framework should be much easier to implement for your example than for wireless communication systems, since those are far more sophisticated to understand.

Let me know if there is any gap in my explanation, and I will try to fill it in.

Thanks for extending your work to my example. I think the one critical thing I have been looking for is what your work's contribution looks like in my example.

Let's say I am the shopkeeper and my shop opens for 10 hours a day. With 10 products per hour, I have 100 slots. Since it is the first day of business, no policy has been trained yet. As a compromise, I have scheduled which product will be displayed at which hour according to my own preference. Normally, to train my policy, what I need is to collect the profit/loss of all 100 products at each hour as my reward, given the states as you described (quantities, …) and my pre-scheduled actions.

Hey @rmwkwok,
Apologies for the delayed response. I thought the question was about how to fit the RMAB framework to your problem. Looks like I got it wrong.

Anyway, collecting the profit/loss (i.e., the reward) and basing our policy on the immediate reward only is referred to as the myopic policy, and this is what most research work on RMABs has done so far. In our work, we have tried other policies for RMABs, such as the index policy and the rollout policy, which we have found to perform better than the myopic policy.
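For your shop, a myopic policy could be as simple as the following sketch (the function name and setup are my own, purely for illustration): rank the products by their estimated immediate profit and display the top 10, ignoring any future consequences.

```python
import numpy as np

def myopic_policy(expected_immediate_reward, budget=10):
    """Myopic policy: display the `budget` products with the highest
    expected *immediate* profit, ignoring future consequences.

    expected_immediate_reward: one entry per product, e.g. estimated
    from the profit/loss collected so far.
    """
    return np.argsort(expected_immediate_reward)[-budget:]
```

Its weakness is exactly what the examples above hint at: it would never hold back a cold-storage product until closing hours, because it only ever looks one step ahead.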

Additionally, some previous work has tried the index policy as well, but for that, we need to find the indices of all the arms, given that each arm is indexable. The methodology used to verify whether an MDP is indexable, and then to find its indices, is mathematically rigorous and non-trivial to understand. Our approach to verifying indexability and finding the indices is much simpler to understand, and in certain cases, less computationally intensive as well.
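As a generic illustration (not our method): once the index of each arm in each state is known, acting on an index policy is straightforward, namely play the arms whose current-state index is highest. Computing those indices, and verifying indexability in the first place, is the hard part that the work discussed above addresses; here they are simply assumed to be given.

```python
import numpy as np

def index_policy(indices_per_arm, states, budget=10):
    """Generic index policy: each arm carries a precomputed index per
    state (e.g. a Whittle index); at each step, play the `budget` arms
    whose current-state index is highest.

    indices_per_arm: shape (num_arms, num_states) array of indices,
    assumed to have been computed in advance.
    """
    current = indices_per_arm[np.arange(len(states)), states]
    return np.argsort(current)[-budget:]
```

The decision rule itself is a one-liner; everything that makes index policies hard lives in the precomputation of `indices_per_arm`.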