
Decoupled Attention in Skeleton-based Action Recognition

BORELLO, NAZARENO MARIA
2020/2021

Abstract

A human action/activity is defined as a sequence of body-part movements of arbitrary length. Human Action/Activity Recognition (HAR) is the task of automatically classifying human actions from videos. The field of HAR has seen very active deep learning research due to its wide range of applications, from surveillance to video indexing and retrieval, video analysis, and even human-machine interaction. The task presents a series of difficult challenges. The first stems from the expensive nature of the data, both computationally and storage-wise. The second comes from the huge variation in the environments in which an action is performed. Both difficulties are addressed by using skeleton sequences instead of the more common RGB videos. A skeleton sequence is a series of coordinates representing the positions of a set of body joints through time, and it can be easily computed from RGB videos. This makes skeleton sequences far more efficient to store and process than other modalities, with the disadvantage of lacking a representation of visual context. The goal of this thesis is to build an effective architecture for skeleton-based HAR using the Transformer. This choice is motivated by the architecture's ability to model long-range sequence dependencies thanks to its reliance on self-attention. In particular, the Decoupled Spatial Temporal Attention Network (DSTA) is studied for its ability to decouple the attention mechanism and thus effectively consider both the spatial and temporal relations in a skeleton sequence. Improvements to DSTA are proposed in four phases: first an optimization of the embedding dimension and learning-rate scheduler; then the addition of ReZero; then the addition of late temporal modelling through BERT; and finally, the separation of the spatial and temporal streams.
Experiments on the public NTU 60 dataset are performed to test the effectiveness of each proposed improvement. The results indicate that the biggest leap forward comes from simply fine-tuning the model's parameters. ReZero and late temporal modelling do not seem to improve performance, possibly because further testing is needed to determine their appropriate integration into the model. Finally, decoupling the streams gives negligible improvements while increasing training time.
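The decoupled spatial/temporal attention described above can be illustrated as plain self-attention applied along two different axes of a skeleton tensor: over joints within each frame (spatial) and over frames for each joint (temporal). The following is a minimal NumPy sketch, not the DSTA implementation; the identity projections and the single attention head are simplifying assumptions, and only the tensor shapes (frames × joints × channels, with 25 joints as in NTU skeletons) follow the setting described above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (n, d). Single-head scaled dot-product self-attention with
    # identity Q/K/V projections, just to show which axis attends.
    scores = x @ x.T / np.sqrt(x.shape[-1])   # (n, n)
    return softmax(scores) @ x                # (n, d)

T, J, C = 4, 25, 64          # frames, joints (25 in NTU skeletons), channels
seq = np.random.randn(T, J, C)

# Spatial attention: joints attend to each other within a single frame.
spatial = np.stack([self_attention(seq[t]) for t in range(T)])

# Temporal attention: each joint attends to itself across frames.
temporal = np.stack([self_attention(seq[:, j]) for j in range(J)], axis=1)

assert spatial.shape == (T, J, C)
assert temporal.shape == (T, J, C)
```

Decoupling the two attention maps keeps the cost of each step at O(J²) or O(T²) instead of O((J·T)²) for joint attention over all joint-frame pairs, which is the motivation behind DSTA's design.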
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/78907