Community Detection In Stochastic Blockmodel con Numero di Comunità non Noto a Priori: un Approccio Bayesiano Nonparametrico.

Nowadays, networks are increasingly common compelling the development of methods to study their complexity. A relevant research field is community detection which is useful to extract the underlying structure from complex networks and to understand their patterns. In literature we can find two macro areas where methods used to tackle such a problem are divided: the algorithmic and the model-based one. A common problem with many methods of both classes is that they require in advance the number of communities K to split the units into, making them unattractive for practical purposes. Among the algorithmic methods, i.e. not based on statistical models, the most widely used to estimate both the number of communities and the communities structure is the Louvain algorithm. This is a fast algorithm but, like most of them, has limitations on the type of communities it can find and on its theoretical foundations. These issues have motivated a growing interest in model-based solutions which rely on generative statistical models. Among the generative models for learning communities in network data, the stochastic block model (SBM) is arguably the most widely implemented and well-established formulation, owing also to its unique balance between simplicity and flexibility. However, classical SBM formulation is based on a fixed and pre-specified number of communities which has led to the flourishing of various generalizations thereof. Through our research we investigated the mixture of finite mixtures version of the stochastic blockmodel (MFM-SBM model) introduced by Geng et al. (2019). This model is based on a mixture of finite mixtures model which admits a Pólya urn scheme similar to CRP. We implemented in R the collapsed Gibbs sampler formulated by Geng et al. (2019) and we studied its behaviour. MFM-SBM model turned out to perform better than the Louvain algorithm both in estimating the community structure and the number of communities itself. We carried out also a simulation study where we compared two versions of the MFM-SBM model, respectively one with a zero-truncated-Poisson prior on the parameter K and the other one with a Gnedin prior on K. The overall performance of the two algorithms was comparable with the only difference that the MFM-SBM model with zero-truncated-Poisson prior was more sensible in terms of execution time to different parametrization. As expected, we noticed an increment of the execution time of the MFM-SBM model (with both priors) as the size of the network increased, while we got confirmation that with a medium size of the network, the algorithm fails to recover the true number of communities from the posterior distribution of |z|. For an insufficiently large size of the network compared to the number of communities, it should be preferred the approach of Miller and Harrison (2018) in estimating K. Finally, we evaluated the MFM-SBM model on the dolphin social network data, typically used in literature as a benchmark network where is recognized to have two main communities. Compared to the Louvain algorithm, the MFM-SBM method proved to perform better either with a zero-truncated-Poisson prior or the Gnedin prior. Regarding the two different priors for K used in the MFM-SBM on the dolphin social network we noted that using a Gnedin distribution with a non informative parametrization allowed to get an improvement that the zero-truncated-Poisson prior.

Community Detection In Stochastic Blockmodel con Numero di Comunità non Noto a Priori: un Approccio Bayesiano Nonparametrico.

CHINI, MICHELE

2019/2020

Abstract

Nowadays, networks are increasingly common compelling the development of methods to study their complexity. A relevant research field is community detection which is useful to extract the underlying structure from complex networks and to understand their patterns. In literature we can find two macro areas where methods used to tackle such a problem are divided: the algorithmic and the model-based one. A common problem with many methods of both classes is that they require in advance the number of communities K to split the units into, making them unattractive for practical purposes. Among the algorithmic methods, i.e. not based on statistical models, the most widely used to estimate both the number of communities and the communities structure is the Louvain algorithm. This is a fast algorithm but, like most of them, has limitations on the type of communities it can find and on its theoretical foundations. These issues have motivated a growing interest in model-based solutions which rely on generative statistical models. Among the generative models for learning communities in network data, the stochastic block model (SBM) is arguably the most widely implemented and well-established formulation, owing also to its unique balance between simplicity and flexibility. However, classical SBM formulation is based on a fixed and pre-specified number of communities which has led to the flourishing of various generalizations thereof. Through our research we investigated the mixture of finite mixtures version of the stochastic blockmodel (MFM-SBM model) introduced by Geng et al. (2019). This model is based on a mixture of finite mixtures model which admits a Pólya urn scheme similar to CRP. We implemented in R the collapsed Gibbs sampler formulated by Geng et al. (2019) and we studied its behaviour. MFM-SBM model turned out to perform better than the Louvain algorithm both in estimating the community structure and the number of communities itself. We carried out also a simulation study where we compared two versions of the MFM-SBM model, respectively one with a zero-truncated-Poisson prior on the parameter K and the other one with a Gnedin prior on K. The overall performance of the two algorithms was comparable with the only difference that the MFM-SBM model with zero-truncated-Poisson prior was more sensible in terms of execution time to different parametrization. As expected, we noticed an increment of the execution time of the MFM-SBM model (with both priors) as the size of the network increased, while we got confirmation that with a medium size of the network, the algorithm fails to recover the true number of communities from the posterior distribution of |z|. For an insufficiently large size of the network compared to the number of communities, it should be preferred the approach of Miller and Harrison (2018) in estimating K. Finally, we evaluated the MFM-SBM model on the dolphin social network data, typically used in literature as a benchmark network where is recognized to have two main communities. Compared to the Louvain algorithm, the MFM-SBM method proved to perform better either with a zero-truncated-Poisson prior or the Gnedin prior. Regarding the two different priors for K used in the MFM-SBM on the dolphin social network we noted that using a Gnedin distribution with a non informative parametrization allowed to get an improvement that the zero-truncated-Poisson prior.

Scheda breve

	Facoltà/Dipartimento
	
				MATEMATICA "GIUSEPPE PEANO"
			
	Corso di studio
	
				STOCHASTICS AND DATA SCIENCE
			
	Lingua
	
				ENG
			
	Relatrice / Relatore
	
				DE BLASI, Pierpaolo
			
	Modalità consultazione tesi
	
				IMPORT DA TESIONLINE
			
	Appare nelle tipologie:
	
				Corso di Laurea Magistrale

File in questo prodotto:

File	Dimensione	Formato
899694_thesis.pdf non disponibili Tipologia: Altro materiale allegato Dimensione 2.91 MB Formato Adobe PDF	2.91 MB	Adobe PDF

Se sei interessato/a a consultare l'elaborato, vai nella sezione Home in alto a destra, dove troverai le informazioni su come richiederlo. I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Usare il seguente URL per citare questo documento: https://hdl.handle.net/20.500.14240/33579