Appendix A — Pre-processing

Tokenisation

Different pre-processing methods were used depending on the task at hand. In all use cases, the corpora first needed to be split into simpler linguistic units, or tokens. This process, known as tokenisation, can be carried out with a variety of techniques. The predominant approach in this study is rule-based. It entails identifying the set of rules that signal where one word ends and the next begins. Once this first stage of desk research is done, the only remaining problem is translating the rules into machine-readable form, so that a computer can divide long-form text into words (or sets of words). In many Western languages, for instance, a blank space separates one word from the next, and sentences are separated by punctuation (e.g. full stops and question marks). In this specific case, a tokeniser designed for the English language was manually adjusted to Italian punctuation, using the tidytext package (Silge and Robinson 2016).

More advanced machine-learning techniques were also tested. Using spacyr, an R wrapper around the Python package spaCy, a pre-trained tagging and tokenising model was implemented. At the price of far higher computational demands, this method promises much richer results, complete with part-of-speech (PoS) tagging and even entity recognition. Even at first glance, however, the results appeared unreliable in more complicated tasks and indistinguishable from those of simpler approaches in simpler ones. This approach was therefore abandoned in favour of tidytext’s less demanding methods.
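To make the rule-based approach concrete, the following is a minimal sketch in Python using only the standard library. It is not the tidytext tokeniser used in the study, and the rule it encodes is an assumption made for illustration: a token is any run of letters or digits, while blank space, apostrophes (as in elided articles such as l') and all other punctuation act as separators. The function name tokenise is hypothetical.

```python
import re
import unicodedata

# Assumed rule: a token is a run of letters or digits (accented letters
# included); whitespace, apostrophes and other punctuation are separators.
TOKEN_PATTERN = re.compile(r"[^\W_]+")

def tokenise(text):
    """Normalise to NFC, lowercase, and return the tokens in order."""
    normalised = unicodedata.normalize("NFC", text).lower()
    return TOKEN_PATTERN.findall(normalised)

print(tokenise("Nel 2020, l'amico è arrivato."))
```

Under this rule, numbers survive tokenisation and are left for the later cleaning step, and an elided article is separated from the word that follows it.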

The last aspect of tokenisation to take into account is the definition of a token. In this study, tokens are defined as either individual words or bigrams (two consecutive words). Bigrams are included because a relational database of co-occurring words might be more informative, especially for dictionary-based analyses.
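As an illustration of the bigram definition, the sketch below pairs each word token with the one that follows it. It is a simplified Python example rather than the study's own code, and the function name bigrams and the sample sentence are hypothetical.

```python
def bigrams(tokens):
    """Return each pair of consecutive tokens as one space-joined string."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

tokens = ["il", "governo", "ha", "approvato", "la", "legge"]
print(bigrams(tokens))
```

A sequence of n word tokens yields n - 1 bigrams, so a single-word document produces none.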

Deleting unnecessary content

After tokenisation, the corpora were stripped of all unnecessary content. This includes the more or less obvious punctuation and numbers, but also all words that add no real value to the analysis. Many words are used in passing but add no semantic value to the text. Words that do not contribute to defining a document’s meaning and topic are called stopwords. All corpora were matched against a dictionary of Italian stopwords, using the stopwords package (Benoit, Muhr, and Watanabe 2021), and the matching tokens were removed.
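The cleaning step can be sketched as a simple filter over the tokens. The Python example below is illustrative only: the stopword set is a tiny hand-picked sample, not the Italian list from the stopwords package used in the study, and the function name clean is hypothetical.

```python
# Tiny illustrative stopword set; the full Italian list used in the
# study is not reproduced here.
STOPWORDS = {"il", "la", "di", "e", "che", "ha", "un", "per", "nel"}

def clean(tokens, stopwords=STOPWORDS):
    """Keep purely alphabetic tokens that are not stopwords."""
    return [t for t in tokens if t.isalpha() and t not in stopwords]

tokens = ["il", "governo", "ha", "approvato", "la", "legge", "nel", "2021"]
print(clean(tokens))
```

The isalpha check drops numbers and any token containing non-letter characters, while the set lookup removes stopwords.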

Stemming and lemmatisation

Stemming is a common pre-processing step, especially when using dictionary-based text mining methods. It entails reducing words to their stems, that is, keeping only their roots instead of their conjugated or declined forms. An obvious case in which this is useful is verbs: the Italian translation of “to eat”, mangiare, can be found in the conjugated forms mangio, mangi, mangiarono, mangerà, and so on, which can all be reduced to the stem mang-. There are, however, very common cases in which a verb’s conjugated forms do not share the root of the infinitive. This is the case, for instance, of the verb andare (“to go”), whose conjugated forms include vado, vai, vanno. One solution is a more sophisticated approach called lemmatisation, which consists of automatically recognising a word’s dictionary form and assigning it to each token. Once again, this is hardly a simple task for a computer, so it tends to be rather demanding in computational terms. An attempt to adopt lemmatisation was made in the early stages of this study, with very unsatisfactory results. The method of choice was thus to address the problem at a later stage, using tokens as they are and then adjusting the analytical dictionaries to accommodate this issue.
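One way to picture how an analytical dictionary might be adjusted to cope with unstemmed tokens is sketched below in Python. The text does not specify how the dictionaries were adjusted, so everything here is an assumption made for illustration: the category names, the convention that an entry ending in * matches any token with that prefix, and the function name match_categories are all hypothetical.

```python
# Hypothetical dictionary: entries ending in "*" match any token that
# starts with that prefix; other entries must match the token exactly.
DICTIONARY = {
    "eating": ["mang*"],
    "going": ["and*", "vado", "vai", "va", "vanno"],
}

def match_categories(token, dictionary=DICTIONARY):
    """Return the sorted list of categories whose entries match the token."""
    matched = set()
    for category, entries in dictionary.items():
        for entry in entries:
            if entry.endswith("*"):
                if token.startswith(entry[:-1]):
                    matched.add(category)
            elif token == entry:
                matched.add(category)
    return sorted(matched)

for token in ["mangiarono", "vanno", "andiamo", "legge"]:
    print(token, match_categories(token))
```

Regular forms are caught by the prefix entry, while irregular forms such as vado or vanno have to be listed explicitly. Prefix entries can over-match unrelated words that happen to share the prefix, so a dictionary of this kind would need manual checking.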