AI4Genomics: machine learning applications to human genetic variations for rare kidney diseases
PORTA, CAMILLA
2019/2020
Abstract
DNA mutations, called variants, make each individual unique, but some of them may lead to the development of rare diseases. The massively parallel sequencing technology known as next-generation sequencing (NGS) has improved the prospects of obtaining a genetic diagnosis thanks to its speed of delivery and cost. However, after DNA sequencing, identifying the genetic variations that may cause disease is not an easy task: domain experts need time to analyze the large amount of data produced. After introducing the state of the art, the thesis addresses this issue by presenting a new pipeline for variant filtering, the process of selecting potentially harmful mutations, that exploits machine learning techniques. The data come from a genomics lab in Turin, which provided variant tables resulting from exome sequencing of approximately 200 patients with rare kidney diseases. Public data from large genetic projects, such as 1000 Genomes and gnomAD, were also used in the analysis. First, variants are filtered following the standard guidelines; the number of variants is then further reduced using the external population-level datasets, first with a statistical test based on the odds ratio and then with a statistical machine learning method for feature selection. This approach can improve and speed up the diagnostic process, giving new hope to patients suffering from rare kidney diseases. Future research could address the direct prediction of the diagnosis from the genetic variations present in each individual.

File | Size | Format |
---|---|---|
811986_porta_camilla_thesis.pdf (not available; type: other attached material) | 7.6 MB | Adobe PDF |
Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/20.500.14240/156338