Figure 1. Reinforcement Learning: based on the pic by Andrea Piacquadio from Pexels

Introduction

With Deep Learning (DL) methods approaching saturation, there is considerable expectation that Reinforcement Learning (RL) will be the next big thing in AI.

Given that RL-based approaches can be applied to virtually any optimization problem, enterprise adoption of RL is picking up fast.

RL is a branch of Artificial Intelligence (AI) that achieves complex goals by maximizing a reward function in real time.

The reward function works much like incentivizing a child with candy and spankings: the algorithm is penalized when it makes a wrong decision and rewarded when it makes a right one. This is reinforcement. The reinforcement aspect also allows it to adapt faster to real-time changes in user sentiment. For a detailed introduction to RL frameworks, the interested reader is referred to [1].

Figure 2. Reinforcement Learning (RL) formulation (Image by Author)
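
To make the reward-driven loop above concrete, here is a minimal sketch of an agent that learns purely from +1 / -1 feedback. It is a toy two-action bandit, not the formulation of any system discussed in this article; the function name `run_bandit`, the reward probabilities, and all parameter values are assumptions chosen for illustration.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn action values from +1 / -1 feedback with an epsilon-greedy agent."""
    rng = random.Random(seed)
    values = [0.0] * len(reward_probs)  # running estimate of each action's reward
    counts = [0] * len(reward_probs)    # how often each action was taken
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: take the action with the best estimate so far.
            action = max(range(len(values)), key=values.__getitem__)
        # The environment rewards (+1) or penalizes (-1) the decision.
        reward = 1.0 if rng.random() < reward_probs[action] else -1.0
        counts[action] += 1
        # Incremental mean: move the estimate towards the observed reward.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

# Hypothetical environment: action 1 is "right" far more often than action 0.
values, counts = run_bandit([0.2, 0.8])
```

After enough steps, the agent should prefer the action that is rewarded more often, which is the essence of reinforcement.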

RL Observations

Some interesting observations about RL, without going into too many technical details:

Rewards and Policies are not the same: The roles and responsibilities of the Reward function vs. RL Agent policies are not very well defined at this stage, and can vary between architectures. A naïve understanding would be that, given a reward / cost associated with every state-action pair, the policy would always try to minimize the overall cost. In practice, however, keeping the ecosystem in a stable state can sometimes be more important than minimizing the cost (e.g. in a climate control use-case). As such, the RL Agent policy goal need not always be aligned with the Reward function, and that is why two separate functions are needed.
Similar to supervised approaches in Machine Learning/ Deep Learning, the RL approach most suitable for enterprise adoption is ‘Model based RL’. In Model based RL, it is possible to develop a model of the problem scenario, and bootstrap initial RL training based on the model simulation values. For instance, for energy optimization use-cases, a blueprint of the building HVAC systems serves as a model, whose simulation values can be used to train the RL model. For complex scenarios (e.g. games, robotic tasks), where it is not possible to build a model of the problem scenario, it is possible to bootstrap an RL model based on historical values.

This is referred to as ‘offline training’, and is considered a good starting point in the absence of a model. This is also why RL is often considered a hybrid between supervised and unsupervised learning, rather than a purely unsupervised learning paradigm.
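
As a rough illustration of bootstrapping from historical values, the sketch below fits a tabular Q-function by replaying a fixed log of past transitions, without ever interacting with the live system. The log entries, the state and action names, and the hyperparameters are all hypothetical; real offline RL also needs care with actions that never appear in the log, which this toy version simply leaves at a value of zero.

```python
from collections import defaultdict

def offline_q_learning(transitions, actions, alpha=0.5, gamma=0.9, epochs=50):
    """Fit a tabular Q-function by repeatedly replaying a fixed log of transitions."""
    q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
    for _ in range(epochs):
        for state, action, reward, next_state, done in transitions:
            # Standard Q-learning target; no bootstrapping from terminal states.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
    return q

# Hypothetical historical log: (state, action, reward, next_state, done)
log = [
    ("cold", "heat", 1.0, "ok", True),
    ("cold", "idle", -1.0, "cold", False),
    ("hot", "cool", 1.0, "ok", True),
    ("hot", "idle", -1.0, "hot", False),
]
actions = ["heat", "cool", "idle"]

q = offline_q_learning(log, actions)
# Greedy policy derived from the learned values.
policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in ("cold", "hot")}
```

The resulting policy can then serve as the starting point for further (online) training.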

Online and model-free RL remain the most challenging, where the RL agent is trying to learn and react in real-time without any supervision. Research in this field seems to lack a theoretical foundation at this stage, with researchers trying out different approaches by simply throwing more data and computing power at the problems. As such, this remains the most “interesting” (and also the farthest from enterprise adoption) part of RL, with current research primarily focusing on efficient heuristics and distributed computation to cover the search space in an accelerated fashion. Applying DL (neural networks) to the different RL aspects, e.g. policies and rewards, also remains a hot topic, referred to as Deep Reinforcement Learning [1].
Given the fundamental nature of RL, there seem to be many interesting concepts that can be borrowed from existing research in Decision Sciences and Human Psychology. For example, an interesting quote from Tom Griffiths, in his presentation “Rational use of cognitive resources in humans and machines” [3]:

while mimicking the human brain seems to be the holy grail of AI/RL research; humans have long been considered as essentially flawed characters in psychological studies. So what we really want to do is of course to mimic the “rational behavior” of the human brain.

The summary is of course that we need to bring the two fields together if we ever want machines to reach the level of true human intelligence.

RL — Enterprise Use-cases

Recommenders

D. Biswas. Reinforcement Learning based Recommender Systems. (Medium link), also presented in the “Advances in Artificial Intelligence for Healthcare” track at the 24th European Conference on Artificial Intelligence (ECAI), Sep 2020. (paper pdf) (ppt)

Abstract. We present a Reinforcement Learning (RL) based approach to implement Recommender Systems.

The results are based on a real-life Wellness app that is able to provide personalized health / activity related content to users in an interactive fashion. Unfortunately, current recommender systems are unable to adapt to continuously evolving features, e.g. user sentiment, and to scenarios where the RL reward needs to be computed based on multiple and unreliable feedback channels (e.g., sensors, wearables).

To overcome this, we propose three constructs: (i) weighted feedback channels, (ii) delayed rewards, and (iii) reward boosting, which we believe are essential for RL to be used in Recommender Systems.

Figure 3. Delayed Reward — Policy based RL formulation (Image by Author)
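
The paper's exact formulas are not reproduced in this article, so the following is only one plausible reading of the first two constructs: a reliability-weighted average over feedback channels, and a reward that is withheld until enough feedback has accumulated. The function names, channel names, weights, and threshold are assumptions for illustration. Reward boosting is deliberately left out, since its definition is not given here.

```python
def weighted_reward(feedback, weights):
    """Combine per-channel feedback in [-1, 1] into one reward, weighted by channel reliability."""
    total_weight = sum(weights[channel] for channel in feedback)
    if total_weight == 0:
        return 0.0  # no reliable channel reported anything
    weighted_sum = sum(weights[channel] * value for channel, value in feedback.items())
    return weighted_sum / total_weight

def delayed_reward(pending, threshold=3):
    """Return the average of the pending rewards once enough have accumulated, else None."""
    if len(pending) < threshold:
        return None  # keep waiting; do not update the policy yet
    return sum(pending) / len(pending)

# Hypothetical channels: explicit text feedback is trusted more than a noisy wearable.
weights = {"text": 0.75, "wearable": 0.25}
reward = weighted_reward({"text": 1.0, "wearable": -1.0}, weights)
```

Here the positive text feedback outweighs the negative wearable signal, and `delayed_reward` would only release a reward to the policy after three such values have been collected.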

Related Work. Previous works have explored RL in the context of Recommender Systems [R1, R2, R3], and enterprise adoption also seems to be gaining momentum with the recent availability of Cloud APIs (e.g. Azure Personalizer [R4]) and Google’s RecSim. Given a user profile and categorized recommendations, the system makes a recommendation based on popularity, interests, demographics, frequency and other features.

The main novelty of these systems is that they are able to identify the features (or combination of features) of recommendations getting higher rewards for a specific user, which can then be customized for that user to provide better recommendations [R5].

Chatbots

E. Ricciardelli, D. Biswas. Self-improving Chatbots based on Deep Reinforcement Learning. (Medium link), also published in the 4th Conference on Reinforcement Learning and Decision Making (RLDM), Montreal, 2019 (Paper) (Code)

Abstract. We present a Reinforcement Learning (RL) model for self-improving chatbots, specifically targeting FAQ-type chatbots. The model is not aimed at building a dialog system from scratch, but at leveraging data from user conversations to improve chatbot performance.

At the core of our approach is a score model, which is trained to score chatbot utterance-response tuples based on user feedback. The scores predicted by this model are used as rewards for the RL agent. Policy learning takes place offline, thanks to a user simulator which is fed with utterances from the FAQ-database.

Policy learning is implemented using a Deep Q-Network (DQN) agent with epsilon-greedy exploration, which is tailored to effectively include fallback answers for out-of-scope questions.
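
The epsilon-greedy part of this setup can be sketched in a few lines. The snippet below shows only the action-selection rule, not the DQN itself; in a real agent the `q_values` would come from the network. Treating the fallback answer as just one more action (the last index) is an assumption made here for illustration, and may differ from how the paper tailors it.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

# Hypothetical Q-values for one utterance: three FAQ answers plus a fallback (last index).
q_values = [0.1, 0.7, 0.2, 0.4]
rng = random.Random(0)

# With epsilon = 0 the agent is purely greedy and picks the best-scored answer.
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
```

During training, epsilon is typically started high and decayed, so the agent explores alternative answers early and exploits what it has learned later.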

The potential of our approach is shown on a small case extracted from an enterprise chatbot. It shows an increase in performance from an initial 50% success rate to 75% in 20–30 training epochs.

Figure 4. RL Chatbot model architecture (Image by Author)

Related Work. Several research papers [C1, C2, C3, C4] have shown the effectiveness of an RL approach in developing dialog systems. Critical to this approach is the choice of a good reward model.

A typical reward model is the implementation of a penalty term for each dialog turn. However, such rewards only apply to task completion chatbots, where the purpose of the agent is to satisfy the user’s request in the shortest time. They are not suitable for FAQ-type chatbots, where the chatbot is expected to provide a good answer in one turn.
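
A small sketch shows why a per-turn penalty carries no signal in the FAQ setting. The penalty and bonus values below are hypothetical, as is the function name.

```python
def task_dialog_return(num_turns, success, turn_penalty=-1.0, success_bonus=20.0):
    """Total reward of a dialog: a fixed penalty per turn plus a bonus on success."""
    return num_turns * turn_penalty + (success_bonus if success else 0.0)

# Task completion: shorter successful dialogs score higher, so the penalty is informative.
short_dialog = task_dialog_return(num_turns=3, success=True)
long_dialog = task_dialog_return(num_turns=8, success=True)

# FAQ: every dialog has exactly one turn, so the penalty is the same constant everywhere
# and only the quality of the single answer distinguishes good from bad outcomes.
faq_good = task_dialog_return(num_turns=1, success=True)
faq_bad = task_dialog_return(num_turns=1, success=False)
```

In the one-turn case, the difference between a good and a bad outcome comes entirely from the success term, which is why a separate notion of answer quality (such as a score derived from user feedback) is needed.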

The user’s feedback can also be used as a reward model in online reinforcement learning. However, applying RL on live conversations can be challenging, and it may incur a significant cost in case of RL failure.

A better approach for deployed systems is to perform the RL training offline and then update the NLU policy once satisfactory levels of performance have been reached.

Energy Optimization

D. Biswas. Reinforcement Learning based Energy Optimization in Factories. (Medium link), also published in the proceedings of the 11th ACM e-Energy Conference, Jun 2020 (ppt)

Abstract. Heating, Ventilation and Air Conditioning (HVAC) units are responsible for maintaining the temperature and humidity settings in a building. Studies have shown that HVAC accounts for almost 50% of the energy consumption in a building and 10% of global electricity usage.

HVAC optimization thus has the potential to contribute significantly towards our sustainability goals, reducing energy consumption and CO2 emissions. In this work, we explore ways to optimize the HVAC controls in factories.

Unfortunately, this is a complex problem as it requires computing an optimal state considering multiple variable factors, e.g. the occupancy, manufacturing schedule, temperature requirements of operating machines, air flow dynamics within the building, external weather conditions, energy savings, etc.

We present a Reinforcement Learning (RL) based energy optimization model that has been applied in our factories. We show that RL is a good fit as it is able to learn and adapt to multi-parameterized system dynamics in real-time. It provides around 25% energy savings on top of the previously used Proportional–Integral–Derivative (PID) controllers.

Figure 5. HVAC Reinforcement Learning formulation (Image by Author)
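
For readers unfamiliar with the PID baseline mentioned above, here is a minimal discrete-time PID controller. It is a textbook sketch, not the controller used in the factories; the gains, the setpoint, and the interpretation of the output as a heating command are assumptions for illustration.

```python
class PIDController:
    """Textbook discrete PID: output = kp*error + ki*integral(error) + kd*d(error)/dt."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None  # no derivative term on the very first update

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical gains and a 22 °C temperature setpoint.
pid = PIDController(kp=2.0, ki=0.5, kd=1.0, setpoint=22.0)
command = pid.update(measurement=20.0, dt=1.0)  # room is 2 °C too cold
```

A PID controller reacts only to the error in a single measured variable, whereas the RL formulation can take multiple factors (occupancy, schedules, weather) into account at once.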

Related Work. RL based approaches [E1, E2] have recently been proposed to address such problems given their ability to learn and optimize multi-parameterized systems in real-time.

An initial (offline) training phase is required for RL based approaches, as training an RL algorithm in live settings (online) can take time to converge, leading to potentially hazardous violations as the RL agent explores its state space. [E1, E2] outline solutions to perform this offline training based on EnergyPlus-based simulation models of the HVAC unit.

EnergyPlus™ is an open source HVAC simulator from the US Department of Energy that can be used to model both energy consumption — for heating, cooling, ventilation, lighting and plug and process loads — and water use in buildings.

Unfortunately, developing an accurate EnergyPlus-based simulation model of an HVAC unit is a non-trivial, time-consuming and expensive process, and as such can be a blocker for using such models in offline training.