
Robot coordination with neural networks and the Monte Carlo Tree Search algorithm.

ANGELI, EMILIO
2021/2022

Abstract

This work addresses a logistic problem: coordinating a set of robots to inspect a configuration of solar panels. The robots are of two types: robots of the first type carry out the actual inspection of the panels, while robots of the second type recharge those of the first type, which have only a limited charge. The objective is to inspect each panel as often as possible. When this problem is approached with Reinforcement Learning, two main obstacles arise. First, as the complexity of the modeled environment increases, the size of the state space explodes rapidly, making it impossible to store the state space even for the simplest environment configurations. Second, the quality of a sequence of actions can only be judged through quantities that depend on the overall behaviour of the robots over a certain time interval. A single move therefore cannot be evaluated in isolation, only sequences of moves can, so the usual update rules of tabular Q-learning cannot be applied. To tackle the problem, the following approach is proposed. First, a state representation is introduced that reduces the multi-agent problem to a single-agent one. Then, a modified Monte Carlo Tree Search algorithm is used to find optimal sequences of actions over a given time interval and to collect a set of state-action pairs. These pairs are used to train a Neural Network that generalizes the ability to choose a proper action in a given state. This Neural Network defines a policy that, when exploited in the simulation phase of the Monte Carlo Tree Search algorithm, enhances its performance. With the improved Monte Carlo Tree Search, a new dataset containing more accurate examples of proper behaviour is gathered, and a second Neural Network, inducing a second policy, is trained on it. Since it is trained on more accurate examples than the first, the second Neural Network proves to be better. When the Monte Carlo Tree Search algorithm exploits this second policy in its simulation phases, it becomes even more effective at finding good examples. This procedure, training a Neural Network on a dataset and using the induced policy in the Monte Carlo Tree Search algorithm to collect a new dataset, is repeated three times in total. The three resulting policies are compared, and various tests confirm that their performance gradually improves. Although constructed and tested on the proposed problem, the approach does not exploit any problem-specific feature: it can be applied to any scenario that shares the features of the one introduced here.
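
The abstract describes an iterative scheme in which Monte Carlo Tree Search, with its simulation phase guided by the current policy, produces state-action pairs that are then used to train the next policy network. A minimal Python sketch of that loop follows; the helpers `run_mcts_episode` and `train_network`, as well as the `network.predict` interface, are hypothetical placeholders introduced for illustration and are not code from the thesis.

```python
# Hedged sketch of the MCTS / neural-network policy iteration loop.
# `run_mcts_episode` (MCTS whose rollouts follow the given policy, returning
# the visited state-action pairs) and `train_network` (supervised training of
# a policy network on those pairs) are assumed helpers, not from the thesis.
import random

def random_policy(state, legal_actions):
    # Baseline rollout policy used to seed the first MCTS iteration.
    return random.choice(legal_actions)

def policy_from_network(network):
    # Wrap a trained network as the rollout policy for the next iteration.
    def policy(state, legal_actions):
        scores = network.predict(state)  # hypothetical: per-action scores
        return max(legal_actions, key=lambda a: scores[a])
    return policy

def iterate_policies(env, n_iterations=3, episodes_per_iter=100):
    rollout_policy = random_policy
    trained_networks = []
    for _ in range(n_iterations):
        # 1. Collect state-action pairs with MCTS guided by the current policy.
        dataset = []
        for _ in range(episodes_per_iter):
            dataset.extend(run_mcts_episode(env, rollout_policy))
        # 2. Train a new policy network on the collected examples.
        network = train_network(dataset)
        trained_networks.append(network)
        # 3. Use the induced policy in the next round of MCTS simulations.
        rollout_policy = policy_from_network(network)
    return trained_networks
```

Each pass through the loop corresponds to one of the three policies compared in the thesis: the dataset collected with the current policy trains the next network, whose induced policy then guides the following round of tree searches.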
Files in this record:
954106_thesis_emilio_angeli.pdf — Adobe PDF, 2.2 MB (not available)
Type: Other attached material
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/86038