Many research questions in the humanities that relate to specific text resources can be reduced to an analysis of vocabulary. Comparing the vocabulary of different texts is often of central interest.
CLARIN-centre Leipzig lets you easily perform such comparative analyses using the resources and web tools provided here. You can compare either two text resources of your own or one text resource with a reference corpus provided by us. The result of the analysis is a list of words that appear significantly more often in one of the corpora.
Short guide:
- Configuration: Select two or more text corpora from the list or add your own text resources that you want to compare. Enter a title and start your analysis.
- Job Selection: Shortly thereafter, the results of the calculation will be available and can be selected for visualization.
- Visualization: You will be presented with an overview of the overall similarity of the vocabulary used in the corpora. The measure used for comparison is cosine similarity (for an evaluation of similarity measures, see: Improving Burrows' Delta - An empirical evaluation of text distance measures, Jannidis et al. 2015). Values range from 0 to 1, where a high value indicates a high similarity between the two compared text corpora.
Please select a corpus pair by clicking on the corresponding box for further details on word statistics.
- Word statistics: Finally, the words that are unevenly distributed between the corpora are displayed. You can select either the words that occur more frequently in one of the two text corpora or the words that occur in only one of them. By default, results are sorted by the ratio of the relative frequencies of the words. Depending on the language and source data, the analysis can be restricted to individual parts of speech (POS).
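To make the two measures mentioned in the guide more concrete, the following Python sketch computes a cosine similarity and a ratio of relative frequencies for two tiny word lists. It is only an illustration under stated assumptions: it works on raw word counts, and the function names (`cosine_similarity`, `frequency_ratios`, `exclusive_words`) are hypothetical and not part of the CLARIN tools. The actual service may tokenize, weight, or normalize the data differently.

```python
import math
from collections import Counter


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity of two word-count mappings.

    For non-negative counts the result lies between 0 and 1.
    Assumption: raw counts are used as vector components.
    """
    dot = sum(count * freq_b.get(word, 0) for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus shares nothing with any other corpus
    return dot / (norm_a * norm_b)


def frequency_ratios(freq_a, freq_b):
    """Ratio of relative frequencies (corpus A / corpus B).

    Only words present in both corpora are considered, so no division
    by zero can occur. Results are sorted with the largest ratio first.
    """
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    ratios = {
        word: (freq_a[word] / total_a) / (freq_b[word] / total_b)
        for word in freq_a.keys() & freq_b.keys()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


def exclusive_words(freq_a, freq_b):
    """Words that occur in corpus A but not in corpus B."""
    return sorted(set(freq_a) - set(freq_b))


# Toy corpora, tokenized by whitespace (an assumption for illustration).
corpus_a = Counter("the cat sat on the mat".split())
corpus_b = Counter("the dog sat on the log".split())

similarity = cosine_similarity(corpus_a, corpus_b)
only_in_a = exclusive_words(corpus_a, corpus_b)
ratios = frequency_ratios(Counter(x=3, y=1), Counter(x=1, y=3))
```

In this toy example the two corpora share three of their five word types, and `only_in_a` lists the words that appear exclusively in the first corpus. A ratio above 1 in `ratios` marks a word that is relatively more frequent in the first corpus, a ratio below 1 one that is relatively more frequent in the second.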
A more detailed guide based on a simple example can be found by following this link. This showcase covers the discovery and selection of resources, their processing, and finally their analysis. The aim is to show scholars how to answer their own research questions with the help of comparative text analysis within CLARIN.