
Advanced Transformer-based OCR techniques

GARRO, CHRISTIAN
2021/2022

Abstract

This thesis deals with automating a consumer retail operation in which operators manually check product placement on shelves, along with the corresponding prices. This task is time consuming: the goal of this thesis is to automate the process (or part of it) using computer vision algorithms based on Artificial Neural Networks (NNs). Over the last decade, NNs have seen increasing adoption thanks to advances in parallel computing hardware (GPUs, TPUs, etc.) and the growing availability of large annotated datasets. Among NN architectures, the Transformer, introduced in 2017, is today the state of the art in Natural Language Processing (NLP). It has also been successfully adapted to computer vision as the Vision Transformer, achieving state-of-the-art performance on a number of tasks. This thesis evaluates the use of Transformer NNs for the tasks of i) detecting price tags on shelves (object detection) and ii) reading the price on the tag (Optical Character Recognition, OCR). A key requirement of this application is that the entire process runs on the user's device, typically a smartphone or tablet with limited computational and memory resources. The challenge this thesis tackles is twofold. First, we assess the feasibility of deploying memory- and computation-hungry Transformer networks on resource-constrained mobile devices. Second, Transformers are data-hungry architectures that, due to their complexity, require large annotated training sets, and manually collecting enough annotated samples in our domain is simply unfeasible. We tackle these challenges as follows. For the first, we experiment with the ViTSTR Vision Transformer and compare it with SVTR (Scene Text Recognition with a Single Visual Model), an architecture designed to handle the various challenges of scene text recognition. For the second, we fine-tune these architectures, preliminarily trained on a synthetic dataset, to cope with the limited number of manually annotated real samples. We also consider a number of pre-processing and post-processing strategies to improve the accuracy of these architectures, especially for the OCR task.
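As an illustration of the kind of post-processing strategy the abstract mentions, the sketch below shows one plausible way to normalize the raw output of a Transformer OCR model and parse it into a price. The confusion table, the regular expression, and the parse_price helper are illustrative assumptions for this page, not the actual rules used in the thesis.

```python
import re

# Hypothetical normalization table for visually similar characters that
# OCR models often confuse on price tags (an assumption, not the thesis'
# actual mapping).
CONFUSIONS = str.maketrans({
    "O": "0", "o": "0",   # letter O misread for zero
    "l": "1", "I": "1",   # l/I misread for one
    "S": "5",
    "B": "8",
    ",": ".",             # European decimal comma -> dot
})

# Prices are assumed to have exactly two decimal digits.
PRICE_RE = re.compile(r"(\d+)\.(\d{2})")

def parse_price(raw: str) -> float | None:
    """Normalize an OCR string and extract a price, or None if no match."""
    cleaned = raw.translate(CONFUSIONS)
    m = PRICE_RE.search(cleaned)
    if m is None:
        return None
    return float(f"{m.group(1)}.{m.group(2)}")

if __name__ == "__main__":
    for raw in ["€ 2,99", "l,49", "EUR 1O.OO"]:
        print(raw, "->", parse_price(raw))
```

Rule-based cleanup of this kind is cheap enough to run on-device alongside the recognition model, which matches the abstract's constraint that the whole pipeline execute on a smartphone or tablet.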


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/51477