Partially supervised learning for the identification of mental illnesses
CERIA, ALBERTO
2016/2017
Abstract
Detecting mental health disorders (e.g. depression, anxiety and bipolar disorder) through the analysis of written linguistic markers expressed in online social network systems, such as Twitter, Facebook and Reddit, is an important application of text classification. In traditional binary classification, classifiers are trained on pre-labeled positive and negative instances. The feasibility of extracting positive samples from social networking data has been demonstrated, e.g. through self-reported diagnosis, but randomly labeling non-positive users as negative instances might introduce sample bias. For example, depressed users might be more likely to write about specific topics than the general population. Randomly selecting users as negative instances may thus result in a model that learns to detect topical differences but fails to identify the more relevant linguistic markers that truly characterize affected users. Here we outline a methodology that leverages the prevalent homophily of social networks to create representative samples of negative instances. We select users and comments that are related to positive instances in terms of their network connections and evaluate them as negative instances, thus equalizing irrelevant topical differences. This allows our classifiers to detect differences relevant to the condition under consideration. We compare the results of different selection methods in a controlled scenario, where true negatives are known but hidden from the process. We show that using the underlying network structure to select negative samples results in classification models that, in terms of selected linguistic markers, approximate models that rely on true negative samples.

File | Type | Size | Format | Availability
---|---|---|---|---
809520_main2.pdf | Other attached material | 6.37 MB | Adobe PDF | Not available
If you are interested in consulting the thesis, go to the Home section at the top right, where you will find information on how to request it. Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14240/91077