
Characterization of redundant representations in Neural Networks (Caratterizzazione delle rappresentazioni ridondanti nelle Reti Neurali)

CARETTI, FEDERICO
2021/2022

Abstract

The field of Machine Learning is growing rapidly, as is the typical size of the models deployed for real-world applications. Artificial Neural Networks, in their many forms and architectures, have shattered previous performance records on Machine Learning benchmarks in many fields, including Computer Vision, Graph Analysis, and Reinforcement Learning. Despite this success, many experimental observations still lack a theoretical explanation, as is the case for one of the main phenomena behind this remarkable performance: benign overfitting. It has been shown in multiple instances that Neural Networks with enough parameters to fit the training set perfectly, trained without explicit regularization, can do so while still achieving an extremely low generalization error. Recently, some studies [1, 2] analyzed the representations of the hidden neurons of a network and observed that Stochastic Gradient Descent (SGD), the algorithm commonly used to train Neural Networks, is able to find solutions in which some of the information is redundant: the hypothesis behind this line of research is that copying the relevant information reduces the possibility of fitting the noise, and can therefore lead to the benign overfitting phenomenon. This thesis is mainly concerned with realistic experimental settings, such as the one introduced in [3], in which a phenomenon called the mitosis transition is observed: when the width of the network's last layer reaches a critical threshold, the aforementioned duplication of information appears. I further investigate the characteristics of this transition, analyzing how its location depends on the parameters of the model, such as the depth of the network, the width of its layers, and the number of input dimensions.
Moreover, I compare the different representations created by SGD and Adam, the two most common Neural Network optimizers, and discuss whether the observations for 2-layer NNs in [1] and [2] can explain the emergence of the duplication in more realistic settings.

References

[1] Lénaïc Chizat and Francis Bach. "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport". In: Advances in Neural Information Processing Systems. Ed. by S. Bengio et al. Vol. 31. Curran Associates, Inc., 2018. URL: https://proceedings.neurips.cc/paper/2018/file/a1afc58c6ca9540d057299ec3016d726-Paper.pdf
[2] Sebastian Goldt et al. "Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup". In: Advances in Neural Information Processing Systems. Ed. by H. Wallach et al. Vol. 32. Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper/2019/file/cab070d53bd0d200746fb852a922064a-Paper.pdf
[3] Diego Doimo et al. Redundant representations help generalization in wide neural networks. 2021. URL: https://arxiv.org/abs/2106.03485
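The duplication of information described above can be probed, for instance, by checking whether pairs of hidden neurons produce nearly identical activation patterns over a batch of inputs. The following is a minimal illustrative sketch of such a redundancy measure, not the thesis's actual methodology: it counts neuron pairs whose activation vectors have cosine similarity above a chosen threshold (the function name, threshold value, and toy data are all hypothetical).

```python
import numpy as np

def duplicate_pairs(activations, threshold=0.95):
    """Count pairs of hidden neurons whose activation patterns are
    nearly identical (cosine similarity > `threshold`) over a batch.
    `activations` has shape (n_samples, n_neurons)."""
    # Normalize each neuron's activation vector (a column) to unit length.
    norms = np.linalg.norm(activations, axis=0, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for dead neurons
    unit = activations / norms
    # Cosine similarity between every pair of neurons.
    sim = unit.T @ unit
    # Count each unordered pair once (strict upper triangle).
    iu = np.triu_indices_from(sim, k=1)
    return int(np.sum(sim[iu] > threshold))

# Toy example: 4 neurons, where neuron 2 exactly duplicates neuron 0.
rng = np.random.default_rng(0)
acts = rng.standard_normal((100, 4))
acts[:, 2] = acts[:, 0]
print(duplicate_pairs(acts))  # -> 1 (neurons 0 and 2 coincide)
```

In a mitosis-style experiment one would evaluate such a statistic on the last-layer activations while varying the layer width, looking for the threshold width at which duplicated pairs start to appear.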
File in this record: 863914_caretti_tesi.pdf (Adobe PDF, 2.6 MB) — not available

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/85995