
Decoupled Attention in Skeleton-based Action Recognition

BORELLO, NAZARENO MARIA
2020/2021

Abstract

A human action/activity is defined as a sequence of body-part movements of arbitrary length. Human Action/Activity Recognition (HAR) is the task of automatically classifying human actions from videos. The field of HAR has seen very active deep learning research due to its wide range of applications, from surveillance to video indexing and retrieval, video analysis, and even human-machine interaction. The task presents a series of difficult challenges. The first stems from the expensive nature of the data, both computationally and storage-wise. The second comes from the huge variation in the environments in which an action is performed. Both difficulties are addressed by using skeleton sequences instead of the more common RGB videos. A skeleton sequence is a series of coordinates representing the positions of a set of body joints through time, and it can be easily computed from RGB videos. This makes skeleton sequences far more efficient to store and process than other modalities, with the disadvantage of lacking a representation of visual context. The goal of this thesis is to build an effective architecture for skeleton-based HAR using the Transformer. This choice is motivated by the architecture's ability to model long-range sequence dependencies thanks to its reliance on self-attention. In particular, the Decoupled Spatial Temporal Attention Network (DSTA) is studied for its ability to decouple the attention mechanism and thus effectively consider both the spatial and temporal relations in a skeleton sequence. Improvements to DSTA are proposed in four phases: first an optimization of the embedding dimension and learning-rate scheduler; then the addition of ReZero; then the addition of late temporal modelling through BERT; and finally, the separation of the spatial and temporal streams.
Experiments on the public NTU 60 dataset are performed to test the effectiveness of each proposed improvement. The results indicate that the biggest leap forward comes from simply fine-tuning the model's parameters. ReZero and late temporal modelling do not seem to improve performance, possibly because further testing is needed to determine their appropriate integration into the model. Finally, decoupling the streams gives negligible improvements while increasing training time.
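The decoupled spatial/temporal attention described above can be illustrated as plain self-attention applied along two different axes of a skeleton tensor: over joints within each frame (spatial) and over frames for each joint (temporal). The following is a minimal NumPy sketch, not the DSTA implementation; the identity projections and the single attention head are simplifying assumptions, and only the tensor shapes (frames × joints × channels, with 25 joints as in NTU skeletons) follow the setting described above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (n, d). Single-head scaled dot-product self-attention with
    # identity Q/K/V projections, just to show which axis attends.
    scores = x @ x.T / np.sqrt(x.shape[-1])   # (n, n)
    return softmax(scores) @ x                # (n, d)

T, J, C = 4, 25, 64          # frames, joints (25 in NTU skeletons), channels
seq = np.random.randn(T, J, C)

# Spatial attention: joints attend to each other within a single frame.
spatial = np.stack([self_attention(seq[t]) for t in range(T)])

# Temporal attention: each joint attends to itself across frames.
temporal = np.stack([self_attention(seq[:, j]) for j in range(J)], axis=1)

assert spatial.shape == (T, J, C)
assert temporal.shape == (T, J, C)
```

Decoupling the two attention maps keeps the cost of each step at O(J²) or O(T²) instead of O((J·T)²) for joint attention over all joint-frame pairs, which is the motivation behind DSTA's design.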
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/78907