
Advanced Transformer-based OCR techniques

GARRO, CHRISTIAN
2021/2022

Abstract

This thesis deals with automating a consumer retail operation in which operators manually check product placement on shelves, along with the corresponding prices. This task is time consuming: the goal of this thesis is to automate the process (or part of it) using computer vision algorithms based on Artificial Neural Networks (NNs). Over the last decade, NNs have seen increasing adoption thanks to advances in parallel computing hardware (GPUs, TPUs, etc.) and the growing availability of large annotated datasets. Among NN architectures, the Transformer, introduced in 2017, is today the state of the art in Natural Language Processing (NLP). It has also been successfully adapted to computer vision as the Vision Transformer, achieving state-of-the-art performance on a number of tasks. This thesis evaluates the use of Transformer NNs for the tasks of i) detecting price tags on shelves (object detection) and ii) reading the price on the tag (Optical Character Recognition, OCR). A key requirement of this application is that the entire process runs on the user's device, typically a smartphone or tablet with limited computational and memory resources. The challenge this thesis tackles is twofold. First, we assess the feasibility of deploying memory- and computation-hungry Transformer networks on resource-constrained mobile devices. Second, Transformers are data-hungry architectures that, due to their complexity, require large annotated training sets, and manually collecting enough annotated samples in our domain is simply unfeasible. We tackle these challenges as follows. For the first, we experiment with the ViTSTR Vision Transformer and compare it with SVTR (Scene Text Recognition with a Single Visual Model), an architecture designed to handle the various challenges of scene text recognition. For the second, we fine-tune these architectures, preliminarily trained on a synthetic dataset, to cope with the limited number of manually annotated real samples. We also consider a number of pre-processing and post-processing strategies to improve the accuracy of these architectures, especially for the OCR task.
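As an illustration of the kind of post-processing strategy the abstract mentions, the sketch below shows one plausible way to normalize the raw output of a Transformer OCR model and parse it into a price. The confusion table, the regular expression, and the parse_price helper are illustrative assumptions for this page, not the actual rules used in the thesis.

```python
import re

# Hypothetical normalization table for visually similar characters that
# OCR models often confuse on price tags (an assumption, not the thesis'
# actual mapping).
CONFUSIONS = str.maketrans({
    "O": "0", "o": "0",   # letter O misread for zero
    "l": "1", "I": "1",   # l/I misread for one
    "S": "5",
    "B": "8",
    ",": ".",             # European decimal comma -> dot
})

# Prices are assumed to have exactly two decimal digits.
PRICE_RE = re.compile(r"(\d+)\.(\d{2})")

def parse_price(raw: str) -> float | None:
    """Normalize an OCR string and extract a price, or None if no match."""
    cleaned = raw.translate(CONFUSIONS)
    m = PRICE_RE.search(cleaned)
    if m is None:
        return None
    return float(f"{m.group(1)}.{m.group(2)}")

if __name__ == "__main__":
    for raw in ["€ 2,99", "l,49", "EUR 1O.OO"]:
        print(raw, "->", parse_price(raw))
```

Rule-based cleanup of this kind is cheap enough to run on-device alongside the recognition model, which matches the abstract's constraint that the whole pipeline execute on a smartphone or tablet.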


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/51477