Families and resemblances

(2010) Prokic, Jelena

Dialectometry is a multidisciplinary field that uses quantitative methods to analyze dialect data. From the outset, most research in dialectometry has focused on including large amounts of data in analyses and on offering researchers alternative views. Later it was used to identify dialect groups and to develop methods that tell us how similar (or different) one variety is compared to neighboring varieties. In this thesis we present advances in several techniques that allow the researcher to measure the differences between language varieties automatically. We test all methods on Bulgarian dialect pronunciation data.

Part of the research presented relies on the Levenshtein algorithm to aggregate over the numerous features found in the data and to infer the similarities and distances among groups of dialects. We investigate the application of clustering techniques to the detection of dialect groups, and propose several evaluation techniques that can be used to estimate the quality of the automatically obtained groups. To infer the distances between the phones in the data set automatically, we combine the Levenshtein algorithm with pointwise mutual information. Information on the distances between the phones helps us obtain better estimates of the distances between the strings, and consequently of the distances between language varieties.
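As a rough illustration of the aggregation step described above, the following sketch computes a Levenshtein distance between two segment sequences and averages it over the words shared by two sites. It is not the implementation used in the thesis: the function names, the unit insertion/deletion cost, the normalization by the longer string, and the toy word lists are all assumptions made for the example. The `sub_cost` parameter is a hypothetical hook where learned phone distances (for instance, ones derived from pointwise mutual information) could be plugged in.

```python
from typing import Callable, Dict, Optional, Sequence


def levenshtein(
    a: Sequence[str],
    b: Sequence[str],
    sub_cost: Optional[Callable[[str, str], float]] = None,
    indel_cost: float = 1.0,
) -> float:
    """Edit distance between two segment sequences (dynamic programming)."""
    if sub_cost is None:
        # Assumption: binary costs, 0 for identical segments and 1 otherwise.
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + indel_cost,          # delete x
                    curr[j - 1] + indel_cost,      # insert y
                    prev[j - 1] + sub_cost(x, y),  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def variety_distance(
    site_a: Dict[str, Sequence[str]],
    site_b: Dict[str, Sequence[str]],
) -> float:
    """Mean length-normalized Levenshtein distance over shared words."""
    shared = [word for word in site_a if word in site_b]
    if not shared:
        raise ValueError("the two sites share no words")
    total = 0.0
    for word in shared:
        pron_a, pron_b = site_a[word], site_b[word]
        # Assumption: normalize by the length of the longer pronunciation.
        longest = max(len(pron_a), len(pron_b), 1)
        total += levenshtein(pron_a, pron_b) / longest
    return total / len(shared)


# Toy, made-up data: keys are word labels, values are segment strings.
site_1 = {"w1": "abc", "w2": "abcd"}
site_2 = {"w1": "abd", "w2": "abcd"}
distance = variety_distance(site_1, site_2)
```

Computing `variety_distance` for every pair of sites gives a site-by-site distance matrix, which is the kind of input that clustering techniques operate on.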

In this thesis we also test an alternative approach to dialect variation that is more historically motivated. We employ a method taken from phylogenetics, namely Bayesian inference of phylogeny, which focuses on systematic shared innovations as a signal of common ancestry, and reexamine the relatedness among the Bulgarian dialect varieties. This method is applied to the automatically multiply aligned strings, which we produce and evaluate using two novel methods.

The results of applying different quantitative techniques to the Bulgarian dialect data show that some of the traditional divisions of this area have to be questioned if only pronunciation data is taken into account. A comparison of the divisions resulting from the geographic and historical approaches shows that these two perspectives give a very similar picture of Bulgarian dialect variation. None of the methods developed is language specific, nor are they applicable only to dialect data.

