Overcoming the limits of Deep Reinforcement Learning with a model-learning approach.

Reinforcement Learning (RL) is a subfield of Machine Learning in which an agent learns to solve a task by trial and error, interacting with an environment of which it has no prior knowledge. The agent receives a reward signal as feedback for every action it takes, and it learns to prefer actions followed by a positive reward over those followed by a negative reward. This simple formulation allows the agent to choose actions directly from sensory observations, including high-dimensional inputs such as camera frames, and to solve many complex tasks, such as playing video games or controlling robots. The standard formulation described above is called Model-Free Reinforcement Learning because it does not require the agent to explicitly predict the environment dynamics; it can therefore be viewed as a black-box approach. However, it requires a very large amount of experience, and this lack of sample efficiency limits the usefulness of these algorithms in practice. One possible solution is to combine the Reinforcement Learning framework with planning algorithms, an approach called Model-Based RL. Instead of mapping observations directly to actions, Model-Based RL allows the agent to explicitly plan the sequence of actions to take, based on the dynamics predicted by an environment model. In recent years RL has been combined with Deep Learning algorithms to obtain outstanding results and reach superhuman performance in complex tasks. This combination of Reinforcement Learning and Deep Learning is called Deep Reinforcement Learning (DRL). In this thesis, PlaNet, one of the state-of-the-art Model-Based DRL algorithms, is investigated in depth and compared with a model-free DRL algorithm, Deep Deterministic Policy Gradient (DDPG). All the experiments are based on the DeepMind Control Suite, a set of continuous control tasks built for benchmarking reinforcement learning agents.
Both algorithms were tested on a subset of four environments. The main strengths and weaknesses of the two approaches are highlighted in order to show whether, and to what extent, Model-Based RL can overcome the limits of Model-Free RL.

Reinforcement Learning (RL) is a Machine Learning technique in which an agent learns to solve a task by interacting with the environment it is in, of which it has no knowledge, through a trial-and-error approach. The agent receives a feedback signal called a reward for every action it takes, and it learns to favor actions accompanied by a positive reward over those accompanied by a negative reward. This simple formulation allows the agent to predict the best actions from its input sensors, such as camera frames, and therefore to solve many complex tasks, such as playing video games or controlling robots. The standard formulation described so far is called Model-Free Reinforcement Learning because it does not require the agent to build a model of the environment and explicitly predict its dynamics. Unfortunately, this direct approach requires an extremely large amount of experience, and this poor sample efficiency limits the usefulness of these algorithms in practice. One possible solution to this problem is to combine Reinforcement Learning with planning algorithms. This approach is called Model-Based RL. Instead of creating a direct mapping between observations and actions, Model-Based RL allows the agent to explicitly plan the sequence of actions to take, based on predictions of how the environment will evolve. In recent years RL has been combined with Deep Learning (DL) algorithms, obtaining exceptional results and surpassing human experts even in complex tasks. This combination of RL and DL is called Deep Reinforcement Learning (DRL). In this thesis, PlaNet, one of the most recent Model-Based DRL algorithms, is explored and compared with a Model-Free DRL algorithm, Deep Deterministic Policy Gradient.
All the experiments are based on the DeepMind Control Suite, a set of tasks created for benchmarking agents trained with DRL. Both algorithms under examination were tested on four environments. The main strengths and weaknesses of the two approaches are highlighted to show whether, and to what extent, the Model-Based approach can overcome the limits of Model-Free DRL.
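
To make the idea of planning with an environment model concrete, the following minimal Python sketch may help. It is an illustration only, not the PlaNet or DDPG implementation studied in the thesis. Everything in it is an assumption introduced for the example: the one-dimensional toy task, the names `TARGET`, `STEP`, `model`, `plan` and `run_episode`, and the random-shooting planner, which is a simple stand-in for a real planning algorithm. The model here is hand-written and exact, whereas a Model-Based DRL agent would have to learn it from experience.

```python
import random

TARGET = 1.0  # hypothetical goal position of the toy task
STEP = 0.1    # hypothetical displacement produced by a unit action


def model(state, action):
    """Toy environment model: predict the next state and the reward."""
    next_state = state + STEP * action
    reward = -abs(TARGET - next_state)  # closer to the target is better
    return next_state, reward


def plan(state, horizon=5, candidates=200, rng=None):
    """Random-shooting planner (illustrative stand-in).

    Samples random action sequences, rolls each one out through the
    model, and returns the first action of the best-scoring sequence.
    """
    if rng is None:
        rng = random.Random(0)
    best_return, best_first = float("-inf"), 0.0
    for _ in range(candidates):
        sequence = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in sequence:
            s, r = model(s, a)  # imagined step, no real interaction
            total += r
        if total > best_return:
            best_return, best_first = total, sequence[0]
    return best_first


def run_episode(steps=30, seed=0):
    """Replan at every step and execute only the first planned action."""
    rng = random.Random(seed)
    state = 0.0
    for _ in range(steps):
        action = plan(state, rng=rng)
        # In this toy example the model also plays the role of the
        # real environment; in practice the two are distinct.
        state, _ = model(state, action)
    return state
```

A model-free agent would replace `plan` with a learned function that maps the observation directly to an action. In this sketch the action is instead chosen by imagining candidate futures with `model`, which is where the potential gain in sample efficiency comes from: the imagined rollouts cost no real interaction with the environment.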