
Causal Reinforcement Learning Breakthrough: Optimizing mHealth Interventions with Bagged Decision Times
Researchers associated with the mDOT Center have developed a novel reinforcement learning (RL) framework designed to tackle the complex decision-making challenges inherent in mobile health (mHealth) interventions. The approach, described in the paper “Harnessing Causality in Reinforcement Learning With Bagged Decision Times,” provides an effective strategy for optimizing personalized interventions in which multiple actions contribute to a single, delayed outcome.
This work is particularly critical for programs like the HeartSteps intervention, which aims to encourage individuals to increase physical activity (PA).
The Challenge: Learning When Rewards Are Delayed
Traditional RL algorithms typically assume that state transitions and rewards are Markovian (dependent only on the immediate previous state and action) and stationary across all decision times. However, many real-world applications, especially in mHealth, violate these assumptions.
The key difficulty lies in bagged decision times. In HeartSteps, for instance, a day is considered a “bag” containing a finite sequence of five consecutive decision times where activity suggestions may be delivered. All actions taken throughout the day collectively impact a single reward—the user’s daily “commitment to being active”—observed only at the end of the day.
Within this bag, the environment dynamics are often non-Markovian and non-stationary. This structure makes it challenging for standard RL methods to determine the long-term impact of individual actions.
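To make this structure concrete, here is a minimal Python sketch (with hypothetical names, not taken from the paper) of how one bag might be represented: a day holds K = 5 decision points, each with its own context and action, but only a single reward observed once the bag closes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

K = 5  # decision times per bag (one day in HeartSteps)

@dataclass
class DecisionPoint:
    context: List[float]              # features observed at this decision time
    action: Optional[int] = None      # 1 = send an activity suggestion, 0 = do nothing
    mediator: Optional[float] = None  # e.g., 30-minute step count after the suggestion

@dataclass
class Bag:
    """One day of the intervention: K decision times, a single end-of-day reward."""
    points: List[DecisionPoint] = field(default_factory=list)
    reward: Optional[float] = None    # e.g., the daily 'commitment to being active'

    def is_complete(self) -> bool:
        return len(self.points) == K and self.reward is not None
```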
The Solution: Combining Causality and Periodic MDPs
To overcome these complexities, the research team developed an online RL algorithm built upon two core innovations:
1. Periodic Markov Decision Processes (Periodic MDPs): The problem is formally framed as a periodic MDP, which allows for state transition and reward functions that are non-stationary within a bag (or period) while remaining stationary on the bag level. This formulation is essential for modeling time-varying preferences and dynamics within a daily intervention cycle.
2. Causality-Based State Construction: To manage the non-Markovian transitions within the bag, the framework utilizes an expert-provided causal directed acyclic graph (DAG). This DAG encodes domain knowledge about cause-effect relationships among variables. Using the DAG, the researchers constructed states based on a dynamical Bayesian sufficient statistic (D-BaSS) of the observed history.
This D-BaSS construction is crucial because it ensures that the resulting state transitions are Markovian both within and across bags. Furthermore, the D-BaSS state construction yields the maximal optimal value function among all possible state constructions. This suggests that identifying and leveraging mediators—such as the 30-minute step count following a suggestion ($M_{d,k}$) or app engagement ($E_d$)—is highly beneficial for improving the value of the optimal policy.
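The sketch below (Python, hypothetical variable names, not the authors' code) illustrates both ideas under simplified assumptions: dynamics and rewards are indexed by the within-bag position k, so they can vary across the day while repeating from one day to the next, and the agent's state at each decision time is assembled from DAG-implied quantities such as the current context, app engagement, and the mediators observed so far in the bag.

```python
import numpy as np

K = 5  # decision times per bag

def build_state(k, context, engagement, mediators_so_far):
    """Construct the agent state at within-bag index k from a DAG-implied
    summary of the observed history (a simplified stand-in for the paper's
    D-BaSS): the current context, today's app engagement, and a summary of
    the mediators (e.g., post-suggestion step counts) seen earlier in the bag."""
    mediator_summary = float(np.sum(mediators_so_far))
    return np.array([k, *context, engagement, mediator_summary], dtype=float)

class PeriodicMDP:
    """Transitions and rewards depend on the within-bag index k (non-stationary
    within a bag) but repeat from one bag to the next (stationary across bags)."""

    def __init__(self, rng=None):
        self.rng = rng or np.random.default_rng(0)

    def step(self, k, action):
        # Hypothetical dynamics: the effect of a suggestion varies with k.
        mediator = max(0.0, self.rng.normal(50.0 + 20.0 * action * (K - k), 10.0))
        next_context = self.rng.normal(size=2)
        return mediator, next_context

    def bag_reward(self, mediators, engagement):
        # Single end-of-day reward driven by the day's mediators and engagement.
        return 0.01 * sum(mediators) + engagement
```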
Introducing Bagged RLSVI (BRLSVI)
The resulting algorithm, Bagged RLSVI (BRLSVI), generalizes existing Bayesian RL methods to handle these periodic MDPs and bagged rewards.
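As a rough illustration only, the following Python sketch adapts a randomized least-squares value-iteration update to the bagged setting: one linear Q-function per within-bag index, backed up from the end of the bag where the single reward is observed, with parameters sampled from a Gaussian posterior in the spirit of RLSVI. For simplicity it treats each bag as a self-contained episode, whereas the actual BRLSVI algorithm also accounts for value carried across bags; all names, features, and dimensions here are hypothetical.

```python
import numpy as np

K = 5          # decision times per bag
DIM = 6        # state-feature dimension (hypothetical)
ACTIONS = [0, 1]

def features(state, action):
    """Simple state-action features: the state copied into the slot for the action."""
    phi = np.zeros(2 * DIM)
    phi[action * DIM:(action + 1) * DIM] = state
    return phi

def brlsvi_update(history, sigma=1.0, lam=1.0, rng=np.random.default_rng(0)):
    """One policy update in the spirit of RLSVI, adapted to bagged rewards.

    history: list of bags, each a list of K tuples (state, action, reward),
             where reward is 0 except at the last decision time of the bag.
    Returns K sampled weight vectors, one Q-function per within-bag index."""
    d = 2 * DIM
    thetas = [np.zeros(d) for _ in range(K)]
    # Backward recursion over within-bag indices.
    for k in reversed(range(K)):
        X, y = [], []
        for bag in history:
            state, action, reward = bag[k]
            target = reward
            if k + 1 < K:  # bootstrap from the next decision time in the same bag
                next_state = bag[k + 1][0]
                target += max(features(next_state, a) @ thetas[k + 1] for a in ACTIONS)
            X.append(features(state, action))
            y.append(target)
        X, y = np.array(X), np.array(y)
        # Gaussian-perturbed regularized least squares (randomized value iteration).
        cov = np.linalg.inv(X.T @ X / sigma**2 + lam * np.eye(d))
        mean = cov @ (X.T @ y) / sigma**2
        thetas[k] = rng.multivariate_normal(mean, cov)
    return thetas
```

In deployment, the sampled Q-functions would then drive greedy action selection at each of the day's decision times before the next update.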
BRLSVI provides a practical and effective method for intervention optimization, tested on real-world scenarios derived from the HeartSteps V2 clinical trial, which involved 42 users.
Key Findings from Simulation:
• Superior Performance: BRLSVI was compared against several baseline algorithms, including Stationary RLSVI (SRLSVI), Finite-horizon RLSVI, and Thompson Sampling (TS). The proposed BRLSVI method outperformed the baselines in nearly all testbed variants.
• Effective Trade-off: BRLSVI successfully considers the long-term effects of actions while effectively leveraging real-time information observed at each decision time within the day.
• Robustness: Crucially for real-world deployment, BRLSVI proved robust to misspecification of the causal DAG. Because it uses a model-free policy update, BRLSVI performed well even in testbed variants where the underlying causal assumptions were violated, for example when contexts were not truly exogenous or when unobserved mediators were present.
Next Steps
This new framework provides a robust foundation for building the next generation of online, personalized mHealth interventions. While the original patient data used to motivate this research cannot be released due to confidentiality agreements, the researchers have published the testbed and accompanying code, allowing other scientists to reproduce these findings and explore new algorithms in this critical area of sequential decision-making.