
Characterization of redundant representations in Neural Networks (Caratterizzazione delle rappresentazioni ridondanti nelle Reti Neurali)

CARETTI, FEDERICO
2021/2022

Abstract

The field of Machine Learning is growing rapidly, as is the typical size of the models deployed for real-world applications. Artificial Neural Networks, in their many forms and architectures, have shattered previous performance records on Machine Learning benchmarks in many fields, including Computer Vision, Graph Analysis, and Reinforcement Learning. Despite this success, many experimental observations still lack a theoretical explanation, as is the case for one of the main phenomena behind this remarkable performance: benign overfitting. It has been shown in multiple instances that Neural Networks with enough parameters to fit the training set perfectly, trained without explicit regularization, can do so while still achieving an extremely low generalization error. Recently, some studies [1, 2] analyzed the representations of the hidden neurons of a network and observed that Stochastic Gradient Descent (SGD), the algorithm commonly used to train Neural Networks, is able to find solutions in which some of the information is redundant: the hypothesis behind this line of research is that copying the relevant information reduces the possibility of fitting the noise, and can therefore lead to the benign overfitting phenomenon. This thesis is mainly concerned with realistic experimental settings, such as the one introduced in [3], in which a phenomenon called the mitosis transition is observed: when the width of the network's last layer reaches a critical threshold, the aforementioned duplication of information appears. I further investigate the characteristics of this transition, analyzing how its location depends on the parameters of the model, such as the depth of the network, the width of its layers, and the number of input dimensions.
Moreover, I compare the different representations created by SGD and Adam, the two most common Neural Network optimizers, and discuss whether the observations for 2-layer NNs in [1] and [2] can explain the emergence of the duplication in more realistic settings.

References

[1] Lénaïc Chizat and Francis Bach. "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport". In: Advances in Neural Information Processing Systems. Ed. by S. Bengio et al. Vol. 31. Curran Associates, Inc., 2018. URL: https://proceedings.neurips.cc/paper/2018/file/a1afc58c6ca9540d057299ec3016d726-Paper.pdf
[2] Sebastian Goldt et al. "Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup". In: Advances in Neural Information Processing Systems. Ed. by H. Wallach et al. Vol. 32. Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper/2019/file/cab070d53bd0d200746fb852a922064a-Paper.pdf
[3] Diego Doimo et al. Redundant representations help generalization in wide neural networks. 2021. URL: https://arxiv.org/abs/2106.03485
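The duplication of information described above can be probed, for instance, by checking whether pairs of hidden neurons produce nearly identical activation patterns over a batch of inputs. The following is a minimal illustrative sketch of such a redundancy measure, not the thesis's actual methodology: it counts neuron pairs whose activation vectors have cosine similarity above a chosen threshold (the function name, threshold value, and toy data are all hypothetical).

```python
import numpy as np

def duplicate_pairs(activations, threshold=0.95):
    """Count pairs of hidden neurons whose activation patterns are
    nearly identical (cosine similarity > `threshold`) over a batch.
    `activations` has shape (n_samples, n_neurons)."""
    # Normalize each neuron's activation vector (a column) to unit length.
    norms = np.linalg.norm(activations, axis=0, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for dead neurons
    unit = activations / norms
    # Cosine similarity between every pair of neurons.
    sim = unit.T @ unit
    # Count each unordered pair once (strict upper triangle).
    iu = np.triu_indices_from(sim, k=1)
    return int(np.sum(sim[iu] > threshold))

# Toy example: 4 neurons, where neuron 2 exactly duplicates neuron 0.
rng = np.random.default_rng(0)
acts = rng.standard_normal((100, 4))
acts[:, 2] = acts[:, 0]
print(duplicate_pairs(acts))  # -> 1 (neurons 0 and 2 coincide)
```

In a mitosis-style experiment one would evaluate such a statistic on the last-layer activations while varying the layer width, looking for the threshold width at which duplicated pairs start to appear.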
File in this record: 863914_caretti_tesi.pdf (Adobe PDF, 2.6 MB) — not available

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/85995