A Strategy to Identify the Peculiarity of a Lexicon in the Analysis of a Corpus
Pasquale Pavone
2024-01-01
Abstract
One of the basic objectives of automatic text analysis is to select the most significant tokens in a corpus. In this context, the value added by the availability of statistical-linguistic resources is indisputable, both for the grammatical annotation of the forms in a corpus and for the extraction of content according to its over- or underuse relative to the occurrences in a reference frequency lexicon, which identifies the peculiar language of the corpus. Selecting these terms does not guarantee their semantic disambiguation; for this reason, an automatic text analysis strategy should also include the recognition of the nominal multiword expressions found in a corpus, understood both as nominal idiomatic expressions and as linguistic collocations. An accurate identification of multiword expressions (MWE) makes it possible to disambiguate the meaning of words and to define or enrich the terminological glossary of a specialist sector. In this paper, we present two functions of the TaLTaC software: one selects the peculiar language of a corpus, and the other identifies the nominal multiword expressions it contains. In particular, the peculiarity of the forms under analysis is obtained through a measure of overuse with respect to a resource of standard Italian covering eight different genres of the language, grouped into five basic types: speech, web, fiction, press, and specialized languages. The recognition of MWE is obtained through an algorithm based on lexical-textual concepts. These functions are applied to a corpus of tweets about the Russian-Ukrainian war.
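The idea of measuring overuse against a reference frequency lexicon can be illustrated with a minimal sketch. The abstract does not state the formula used by TaLTaC, so the code below is an assumption: it scores each form by a standardized deviation between its observed count in the corpus and the count expected from the reference lexicon. The function name `overuse_scores`, its parameters, the add-one smoothing, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def overuse_scores(corpus_tokens, reference_counts, reference_size, min_count=1):
    """Score each form by its standardized deviation from a reference lexicon.

    Illustrative assumption, not TaLTaC's documented formula:
        z = (observed - expected) / sqrt(expected)
    where `expected` is the count the form would have in the corpus if it
    occurred at the same rate as in the reference lexicon.
    Positive scores suggest overuse, negative scores suggest underuse.
    """
    counts = Counter(corpus_tokens)
    corpus_size = len(corpus_tokens)
    scores = {}
    for form, observed in counts.items():
        if observed < min_count:
            continue
        # Add-one smoothing (an assumption) so that forms missing from the
        # reference lexicon do not produce a division by zero.
        ref_rate = (reference_counts.get(form, 0) + 1) / (reference_size + 1)
        expected = ref_rate * corpus_size
        scores[form] = (observed - expected) / math.sqrt(expected)
    return scores


if __name__ == "__main__":
    # Toy data, invented for the example.
    tokens = ["war"] * 4 + ["the"] * 6
    reference = {"the": 599}  # hypothetical reference counts
    ranked = sorted(
        overuse_scores(tokens, reference, reference_size=9999).items(),
        key=lambda item: item[1],
        reverse=True,
    )
    for form, score in ranked:
        print(f"{form}\t{score:.2f}")
```

Ranking forms by such a score and keeping those above a chosen threshold would give one possible selection of peculiar language. Both the threshold and the choice of reference genre are left open here.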