High Confidence Off-Policy Improvement
This was our final course project for COMPSCI 687: Reinforcement Learning.
This project simulates the application of RL to a real problem, like a medical application or business application, where deploying a policy worse than the current one would be dangerous or costly. To simulate this, we were provided data, but not the underlying MDP that created the data. The goal was to provide a set of policies that are better than the current policy that generated the data without accessing the MDP.
I implemented the consistent weighted per-decision importance sampling (CW-PDIS)algorithm for off-policy evaluation. For candidate policy generation, I used CMA-ES as a black-box algortithm to maximize the expected reward (the CWPDIS estimate). I also augmented the objective with safety bounds, and performed an additional safety with the student t-test concentration inequality to reject solutions.