
Two New Dimensions for the Computational Analysis of Allergens: A Comparative Study of Databases and Symptomatology

HIRT, ADRIAN MARSILIUS
2022/2023

Abstract

The search for novel protein sources is complicated by rising allergy rates worldwide. Allergenicity testing poses an additional challenge in the risk assessment of novel food proteins, because it is still unknown whether any structural features determine allergenicity. Identifying allergens from common characteristics would significantly improve this risk assessment. Previous studies propose sequence similarity to the human proteome as a suitable measure: an analysis of a subset of allergens suggests that proteins sharing more than 62% sequence similarity with the human proteome are rarely allergenic. To date, these experiments have not been extended to all categories of allergens, and no large-scale analysis of allergen symptomatology has been published. This work introduces two overlooked aspects to the computational analysis of allergens: a comprehensive overview of allergen sequences in the existing literature, and symptomatology. Both aspects were evaluated for plant and animal food allergens. The underlying data come from a text mining effort that compiled information on allergens from more than 35 million scientific articles available in PubMed. The data can be considered comprehensive, as they are based on a complete list of all allergens currently available in the WHO/IUIS and FARRP databases. All protein sequences associated with these allergens were retrieved from the scientific literature. Symptomatology was inferred with CoMent from significant co-mentions in the scientific literature and is expressed in terms of the Human Phenotype Ontology. All sequences were classified with InterPro, aligned with ClustalO, and searched against the human proteome with BLAST.
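As an illustration of how a similarity threshold could be applied to the BLAST results, the following is a minimal sketch and not the procedure used in the thesis. It assumes the search was exported in BLAST's standard 12-column tabular format (`-outfmt 6`) and that "sequence similarity" is approximated by the percent identity (`pident`) of the best human hit; the abstract does not specify either. The function names and the example rows are hypothetical.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text):
    """Return {query: (subject, percent identity)} for the top-identity hit per query."""
    best = {}
    reader = csv.DictReader(
        io.StringIO(tabular_text), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    for row in reader:
        pident = float(row["pident"])
        current = best.get(row["qseqid"])
        if current is None or pident > current[1]:
            best[row["qseqid"]] = (row["sseqid"], pident)
    return best


def exceeds_threshold(hits, threshold=62.0):
    """Sorted list of queries whose best human hit lies above the threshold (percent)."""
    return sorted(q for q, (_, pident) in hits.items() if pident > threshold)


# Hypothetical rows for illustration only; these are not results from the thesis.
example = "\n".join([
    "allergenA\thumanP1\t78.5\t200\t43\t0\t1\t200\t1\t200\t1e-90\t310",
    "allergenA\thumanP2\t41.0\t150\t88\t2\t5\t154\t10\t159\t1e-20\t95",
    "allergenB\thumanP3\t35.2\t120\t77\t3\t1\t120\t1\t118\t1e-08\t52",
])

hits = best_hits(example)
flagged = exceeds_threshold(hits)
```

Percent identity of a local alignment ignores how much of the query is covered, so a real analysis would probably also filter on alignment length or e-value before comparing against a threshold.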
Multiple clinically relevant animal food allergens were found to be highly similar to the human proteome, exceeding the threshold described in the literature to date. This shows that a sequence similarity threshold of 62% to the human proteome is an inadequate measure of the allergenicity of animal food proteins. For plant food allergens, the results indicate that a threshold may be feasible: only two hits exceeded 50% similarity to the human proteome, and both are currently assumed to be of low clinical relevance. No statistically significant correlation was found between the sequence similarity of allergen pairs and their symptomatic similarity. These results suggest that the allergenicity of food proteins cannot be reliably determined by sequence alignment against the human proteome, as clinically relevant animal food allergens closely resemble their human counterparts. Sequence similarity to the human proteome could nevertheless be a useful tool for the allergy risk assessment of novel plant food proteins. Furthermore, allergen protein sequences alone are insufficient for predicting symptomatology. Future computational studies of allergen symptomatology should therefore investigate additional structural characteristics, such as the prevalence of disulfide bridges.
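The comparison of sequence similarity and symptomatic similarity across allergen pairs can be sketched as follows. This is an illustrative assumption, not the method reported in the thesis: the abstract names neither the symptom similarity measure nor the correlation statistic, so the sketch uses the Jaccard index over phenotype term sets and Spearman's rank correlation. All names and the toy data are hypothetical, and no significance test is included.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard index of two term sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on average ranks."""
    return pearson(ranks(x), ranks(y))


def paired_similarities(identity, symptoms):
    """Collect (sequence similarity, symptom similarity) for every scored pair."""
    xs, ys = [], []
    for a, b in combinations(sorted(symptoms), 2):
        key = frozenset((a, b))
        if key in identity:
            xs.append(identity[key])
            ys.append(jaccard(symptoms[a], symptoms[b]))
    return xs, ys


# Toy data for illustration only; term labels stand in for ontology terms.
symptoms = {"A": {"t1", "t2"}, "B": {"t2", "t3"}, "C": {"t4"}}
identity = {
    frozenset(("A", "B")): 80.0,
    frozenset(("A", "C")): 30.0,
    frozenset(("B", "C")): 25.0,
}
xs, ys = paired_similarities(identity, symptoms)
rho = spearman(xs, ys)
```

A rank-based statistic is chosen here because percent identity and set overlap are on different scales and need not be linearly related; a real analysis would also need a significance test and many more pairs.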
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/146227