Networks feature in many fields: from social networks to biological systems such as gene and neuron interactions, as well as technological structures like the Internet and transportation systems. A prominent aspect of interest of networks is their observed tendency to exhibit a community structure: nodes in the same group tend to share features and are more strongly connected with other nodes within the same group. Community Detection refers to the task of identifying clusters of nodes based on patterns in the edges of networks in order to shed light on large-scale patterns of connection. Community Detection algorithms have been widely used in online and offline social networks, biological interaction networks, and even academic citation networks. A recent development in the field introduced the Batched Gibbs Sampler, a scalable modification of the Gibbs sampler, which leverages the Stochastic Block Model (SBM), a well-known probabilistic generative model for networks, to perform community detection. Using proof techniques from both Optimization and Statistics, existing literature provides theoretical guarantees on its mini-max rate, albeit within the context of a simplified Bayesian hierarchical SBM. The aim of this research is to further investigate this class of algorithms by proposing a variation that incorporates an additional layer in the hierarchical model and providing theoretical guarantees for its convergence. This research can be seen as a preliminary step toward theoretically understanding how modifications aimed at improving scalability affect the inherently sequential Gibbs samplers, particularly in terms of their statistical guarantees on convergence to the correct stationary distribution, and toward the development of scalable Gibbs samplers capable of inferring the number of communities, a task that has so far been addressed a posteriori using model evaluation metrics. Combining literature on modifications of MCMC algorithms for scalability, Bayesian statistics on collapsed Gibbs samplers, and the Batched Gibbs Sampler, we modified the pre-existing algorithm to adapt it to a more complex model involving an additional latent variable. Moreover, by making a slightly more restrictive assumption regarding the balance of community sizes, we prove that the same mini-max optimality rate of the pre-existing algorithm holds for our modification. Through simulations on synthetic SBM data, we verified that the accuracy of our algorithm is comparable to that of the pre-existing sequential Gibbs sampler under suitable initializations, with enhanced scalability. We also identified an unusual behavior of the algorithm in cases where it fails to converge: it appears that the sampler still reaches a stationary distribution, which seems to coincide with the prior placed on the SBM. In conclusion, the algorithm developed in this thesis is a mini-max optimal and scalable alternative to currently used community detection algorithms, particularly its sequential counterpart. The theoretical guarantees discussed in this thesis suggest that future developments of Batched Gibbs Samplers built on it, particularly in more complex models involving community number inference, may maintain mini-max optimal performance. Moreover, the findings on the “convergence to the prior” phenomenon can have interesting implications for future developments: they could be used to develop actionable strategies for assessing convergence to the correct distribution and guide further investigation into the stationary distribution of the Markov chain of the Batched Gibbs Sampler.

Networks feature in many fields: from social networks to biological systems such as gene and neuron interactions, as well as technological structures like the Internet and transportation systems. A prominent aspect of interest of networks is their observed tendency to exhibit a community structure: nodes in the same group tend to share features and are more strongly connected with other nodes within the same group. Community Detection refers to the task of identifying clusters of nodes based on patterns in the edges of networks in order to shed light on large-scale patterns of connection. Community Detection algorithms have been widely used in online and offline social networks, biological interaction networks, and even academic citation networks. A recent development in the field introduced the Batched Gibbs Sampler, a scalable modification of the Gibbs sampler, which leverages the Stochastic Block Model (SBM), a well-known probabilistic generative model for networks, to perform community detection. Using proof techniques from both Optimization and Statistics, existing literature provides theoretical guarantees on its mini-max rate, albeit within the context of a simplified Bayesian hierarchical SBM. The aim of this research is to further investigate this class of algorithms by proposing a variation that incorporates an additional layer in the hierarchical model and providing theoretical guarantees for its convergence. This research can be seen as a preliminary step toward theoretically understanding how modifications aimed at improving scalability affect the inherently sequential Gibbs samplers, particularly in terms of their statistical guarantees on convergence to the correct stationary distribution, and toward the development of scalable Gibbs samplers capable of inferring the number of communities, a task that has so far been addressed a posteriori using model evaluation metrics. Combining literature on modifications of MCMC algorithms for scalability, Bayesian statistics on collapsed Gibbs samplers, and the Batched Gibbs Sampler, we modified the pre-existing algorithm to adapt it to a more complex model involving an additional latent variable. Moreover, by making a slightly more restrictive assumption regarding the balance of community sizes, we prove that the same mini-max optimality rate of the pre-existing algorithm holds for our modification. Through simulations on synthetic SBM data, we verified that the accuracy of our algorithm is comparable to that of the pre-existing sequential Gibbs sampler under suitable initializations, with enhanced scalability. We also identified an unusual behavior of the algorithm in cases where it fails to converge: it appears that the sampler still reaches a stationary distribution, which seems to coincide with the prior placed on the SBM. In conclusion, the algorithm developed in this thesis is a mini-max optimal and scalable alternative to currently used community detection algorithms, particularly its sequential counterpart. The theoretical guarantees discussed in this thesis suggest that future developments of Batched Gibbs Samplers built on it, particularly in more complex models involving community number inference, may maintain mini-max optimal performance. Moreover, the findings on the “convergence to the prior” phenomenon can have interesting implications for future developments: they could be used to develop actionable strategies for assessing convergence to the correct distribution and guide further investigation into the stationary distribution of the Markov chain of the Batched Gibbs Sampler.

Batched Gibbs Sampler: Theoretical and Computational Guarantees for Community Detection in Stochastic Block Models

TESTA, FEDERICO
2023/2024

Abstract

Networks feature in many fields: from social networks to biological systems such as gene and neuron interactions, as well as technological structures like the Internet and transportation systems. A prominent aspect of interest of networks is their observed tendency to exhibit a community structure: nodes in the same group tend to share features and are more strongly connected with other nodes within the same group. Community Detection refers to the task of identifying clusters of nodes based on patterns in the edges of networks in order to shed light on large-scale patterns of connection. Community Detection algorithms have been widely used in online and offline social networks, biological interaction networks, and even academic citation networks. A recent development in the field introduced the Batched Gibbs Sampler, a scalable modification of the Gibbs sampler, which leverages the Stochastic Block Model (SBM), a well-known probabilistic generative model for networks, to perform community detection. Using proof techniques from both Optimization and Statistics, existing literature provides theoretical guarantees on its mini-max rate, albeit within the context of a simplified Bayesian hierarchical SBM. The aim of this research is to further investigate this class of algorithms by proposing a variation that incorporates an additional layer in the hierarchical model and providing theoretical guarantees for its convergence. This research can be seen as a preliminary step toward theoretically understanding how modifications aimed at improving scalability affect the inherently sequential Gibbs samplers, particularly in terms of their statistical guarantees on convergence to the correct stationary distribution, and toward the development of scalable Gibbs samplers capable of inferring the number of communities, a task that has so far been addressed a posteriori using model evaluation metrics. Combining literature on modifications of MCMC algorithms for scalability, Bayesian statistics on collapsed Gibbs samplers, and the Batched Gibbs Sampler, we modified the pre-existing algorithm to adapt it to a more complex model involving an additional latent variable. Moreover, by making a slightly more restrictive assumption regarding the balance of community sizes, we prove that the same mini-max optimality rate of the pre-existing algorithm holds for our modification. Through simulations on synthetic SBM data, we verified that the accuracy of our algorithm is comparable to that of the pre-existing sequential Gibbs sampler under suitable initializations, with enhanced scalability. We also identified an unusual behavior of the algorithm in cases where it fails to converge: it appears that the sampler still reaches a stationary distribution, which seems to coincide with the prior placed on the SBM. In conclusion, the algorithm developed in this thesis is a mini-max optimal and scalable alternative to currently used community detection algorithms, particularly its sequential counterpart. The theoretical guarantees discussed in this thesis suggest that future developments of Batched Gibbs Samplers built on it, particularly in more complex models involving community number inference, may maintain mini-max optimal performance. Moreover, the findings on the “convergence to the prior” phenomenon can have interesting implications for future developments: they could be used to develop actionable strategies for assessing convergence to the correct distribution and guide further investigation into the stationary distribution of the Markov chain of the Batched Gibbs Sampler.
Batched Gibbs Sampler: Theoretical and Computational Guarantees for Community Detection in Stochastic Block Models
Networks feature in many fields: from social networks to biological systems such as gene and neuron interactions, as well as technological structures like the Internet and transportation systems. A prominent aspect of interest of networks is their observed tendency to exhibit a community structure: nodes in the same group tend to share features and are more strongly connected with other nodes within the same group. Community Detection refers to the task of identifying clusters of nodes based on patterns in the edges of networks in order to shed light on large-scale patterns of connection. Community Detection algorithms have been widely used in online and offline social networks, biological interaction networks, and even academic citation networks. A recent development in the field introduced the Batched Gibbs Sampler, a scalable modification of the Gibbs sampler, which leverages the Stochastic Block Model (SBM), a well-known probabilistic generative model for networks, to perform community detection. Using proof techniques from both Optimization and Statistics, existing literature provides theoretical guarantees on its mini-max rate, albeit within the context of a simplified Bayesian hierarchical SBM. The aim of this research is to further investigate this class of algorithms by proposing a variation that incorporates an additional layer in the hierarchical model and providing theoretical guarantees for its convergence. This research can be seen as a preliminary step toward theoretically understanding how modifications aimed at improving scalability affect the inherently sequential Gibbs samplers, particularly in terms of their statistical guarantees on convergence to the correct stationary distribution, and toward the development of scalable Gibbs samplers capable of inferring the number of communities, a task that has so far been addressed a posteriori using model evaluation metrics. Combining literature on modifications of MCMC algorithms for scalability, Bayesian statistics on collapsed Gibbs samplers, and the Batched Gibbs Sampler, we modified the pre-existing algorithm to adapt it to a more complex model involving an additional latent variable. Moreover, by making a slightly more restrictive assumption regarding the balance of community sizes, we prove that the same mini-max optimality rate of the pre-existing algorithm holds for our modification. Through simulations on synthetic SBM data, we verified that the accuracy of our algorithm is comparable to that of the pre-existing sequential Gibbs sampler under suitable initializations, with enhanced scalability. We also identified an unusual behavior of the algorithm in cases where it fails to converge: it appears that the sampler still reaches a stationary distribution, which seems to coincide with the prior placed on the SBM. In conclusion, the algorithm developed in this thesis is a mini-max optimal and scalable alternative to currently used community detection algorithms, particularly its sequential counterpart. The theoretical guarantees discussed in this thesis suggest that future developments of Batched Gibbs Samplers built on it, particularly in more complex models involving community number inference, may maintain mini-max optimal performance. Moreover, the findings on the “convergence to the prior” phenomenon can have interesting implications for future developments: they could be used to develop actionable strategies for assessing convergence to the correct distribution and guide further investigation into the stationary distribution of the Markov chain of the Batched Gibbs Sampler.
Autorizzo consultazione esterna dell'elaborato
File in questo prodotto:
File Dimensione Formato  
Testa_Federico_Thesis.pdf

non disponibili

Dimensione 2.92 MB
Formato Adobe PDF
2.92 MB Adobe PDF

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14240/9650