A Strategy to Identify the Peculiarity of a Lexicon in the Analysis of a Corpus

One of the basic objectives of automatic text analysis is to select the most significant tokens within a corpus. In this context, the added value provided by the availability of statistical-linguistic resources is indisputable, both for the grammatical annotation of the forms of a corpus and for the extraction of content according to its over- or underuse relative to the occurrences in a reference frequency lexicon, which identifies the peculiar language. Selecting these terms does not guarantee their semantic disambiguation; for this reason, it is appropriate to define an automatic text analysis strategy that includes the recognition of the nominal multiword expressions found in a corpus, understood both as nominal idiomatic expressions and as linguistic collocations. Accurate identification of multiword expressions (MWE) makes it possible to disambiguate the meaning of words and to define or enrich the terminology glossary of a specific specialist sector. In this paper, we present two functions of the TaLTaC software: one selects the peculiar language and the other identifies the nominal multiword expressions contained in a corpus. In particular, the peculiarity of the forms under analysis is obtained through a measure of overuse with respect to a resource of standard Italian covering eight different genres of the language, grouped into five basic types: speech, web, fiction, press, and specialized languages. The recognition of MWE is obtained through an algorithm based on lexical-textual concepts. These functions are applied to a corpus of tweets about the Russian-Ukrainian war.
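To make the idea of overuse relative to a reference frequency lexicon concrete, the following is a minimal sketch in Python. It is not TaLTaC's implementation: the function names, the per-million normalization, the `floor` value for words missing from the reference, and the use of a standardized deviation as the overuse measure are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def relative_freqs(tokens, per=1_000_000):
    """Return each word's frequency normalized to occurrences per `per` tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count * per / total for word, count in counts.items()}


def overuse_scores(corpus_tokens, reference_tokens, floor=1.0):
    """Score each corpus word by its standardized deviation from the reference.

    Assumed measure (hypothetical, for illustration only):
        score = (f_corpus - f_reference) / sqrt(f_reference)
    Positive scores suggest overuse, negative scores underuse.
    Words absent from the reference get the assumed frequency `floor`.
    """
    corpus = relative_freqs(corpus_tokens)
    reference = relative_freqs(reference_tokens)
    scores = {}
    for word, freq in corpus.items():
        ref_freq = reference.get(word, floor)
        scores[word] = (freq - ref_freq) / sqrt(ref_freq)
    return scores


# Toy data standing in for a corpus and a reference lexicon.
corpus = ["war", "war", "war", "the", "the", "peace"]
reference = ["the", "the", "the", "the", "war", "peace"]

scores = overuse_scores(corpus, reference)
# List words from most overused to most underused.
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word}\t{score:.1f}")
```

In this toy example "war" is overused and "the" is underused relative to the reference, while "peace" has the same relative frequency in both lists. A real reference lexicon would supply the frequencies directly instead of a token list.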

Giovanni De Gasperis; Pasquale Pavone; Sergio Bolasco

2024-01-01

Year: 2024
ISBN: 978-3-031-55916-7
Keywords: Standard Italian · Peculiarity · Multiword expressions · TaLTaC · Text mining
Publication type: 2.1 Contribution in volume (chapter or essay)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12607/60441
