Workshop
  7 Dec 2022
  • Where? Jura Soyfer Saal, Hofburg
  • Organizers: Hannes Fellner and Andreas Baumann


Nathan W. Hill & James Engels: "Modeling the contribution of a phonological hypothesis: Using graph theory to compare two Old Chinese reconstruction systems"

Abstract: Most Chinese characters consist of a semantic and a phonetic component. For example, 媽 mā “mother” consists of 女 nǚ “woman” on the left and 馬 mǎ “horse” on the right. Already in the Song dynasty (960–1279 CE), phonologists noticed that characters sharing the same phonetic component often rhymed with each other in the Book of Odes. In the hands of 段玉裁 Duàn Yùcái, this observation became the methodological assumption known as the “xiéshēng hypothesis”, namely 同聲必同部 ‘characters with the same phonetic component had the same Old Chinese rhyme category’. This principle is accepted by all researchers in Old Chinese phonology, but it has a different impact for different researchers, depending on their analysis of xiéshēng relationships, rhymes in the Book of Odes, and their overall reconstruction of Old Chinese.

This paper uses graph theory to computationally examine the impact of the xiéshēng hypothesis on the Old Chinese reconstructions of Wang Li (1980) and Baxter & Sagart (2014). In particular we measure the modularity of rhyme networks before and after the application of the xiéshēng hypothesis and use this measure to contrast the two reconstruction systems.  
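
As a rough illustration of the measure mentioned above (not the authors' implementation), Newman-Girvan modularity for an unweighted, undirected network can be computed as sketched below. The toy edge list, the node names and the function `modularity` are hypothetical; in a rhyme network one would expect nodes to be rhyme words, edges to link words that rhyme with each other, and the partition to group words by reconstructed rhyme category.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q = sum over groups c of L_c/m - (d_c / 2m)^2.

    edges     -- list of (u, v) pairs, one per undirected edge
    community -- dict mapping each node to its group label
    L_c is the number of edges inside group c, d_c the summed degree
    of its nodes, and m the total number of edges.
    """
    m = len(edges)
    intra = defaultdict(int)   # edges with both ends in the same group
    degree = defaultdict(int)  # summed node degree per group
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical toy network: two triangles joined by a single edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]

# Two candidate partitions of the same network.
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = modularity(toy_edges, split)
q_merged = modularity(toy_edges, merged)
print(q_split, q_merged)
```

A partition that follows the two triangles scores higher than one that lumps all nodes together; comparing such scores for the same network under different partitions is the kind of contrast a modularity measure makes possible.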


Tara Andrews: "Not just Stemmatology anymore: Creating a full critical edition with the StemmaREST graph model"

Abstract: The Stemmaweb service had its origin in a project at KU Leuven, running from 2010–2012, which sought to establish some principles of stemmatic method based on empirical study of existing edited texts and their stemmata. While the original impetus of the data model established for Stemmaweb was to examine the suitability of stemmata produced via different methods for a text transmitted in multiple copies, it has in the intervening years demonstrated its merit as a more general model for the examination of how these texts vary, the establishment of a canonical text based on these variants, and the annotation, either of the canonical text or of particular witnesses. In this talk I will discuss the recent advances in the data model, and how we are using it in the edition of the Chronicle of Matthew of Edessa to organize and publish our work.


Jeremy Bradley: "Digital infrastructures for low-resource languages: The case of Uralic"

Abstract: The Uralic language family consists of several dozen genealogically and typologically diverse languages spoken across Northeastern Europe and Siberia. With the exception of the three state languages Hungarian, Finnish, and Estonian, all Uralic languages are in a precarious situation, ranging from vulnerable (e.g., Northern Saami) through endangered (e.g., Udmurt) to moribund (e.g., Mansi) or dormant (e.g., Livonian).

While Uralic minority languages lack sturdy political support structures, their support structures in the digital world are comparatively sturdy: Uralic language technology, specifically with regard to the many minority languages, is a fruitful field of research, out of proportion to the otherwise scarce resources and low prestige of the languages in question. This talk will illustrate which resources have been and are being developed for Uralic languages, and provide an inside view into ongoing projects, especially in the field of corpus design.


Amelie Dorn: "Exploring lexical variation digitally – applications and examples of German in Austria"

Abstract: The rapid development of digital methods, tools and infrastructures in recent years has brought additional possibilities and an increased potential for the collection and analysis of humanities data. It has enabled and enhanced the structured exploration of large and complex collections of language data and of diverse linguistic phenomena. This presentation will provide insights into selected aspects and phenomena of a nation-wide survey (n=1049) on lexical variation of German in Austria supported by digital methods. The results are part of and contribute to the larger Special Research Programme (SFB) “German in Austria: Variation – Contact – Perception” (Lenz, 2018), a collaborative endeavour investigating variation and change of the German language in Austria.

Lenz, A.N. (2018). The Special Research Programme “German in Austria. Variation – Contact – Perception”. In: Ammon, U. / Costa, M. (eds.): Sprachwahl im Tourismus – mit Schwerpunkt Europa. (Sociolinguistica. Internationales Jahrbuch für europäische Soziolinguistik 32). Berlin/Boston, 269–277.


Stefan Hagel: "A digital edition of ancient musical documents: Challenges and chances"

Abstract: An ongoing project at the Austrian Academy of Sciences is establishing an updated edition of the extant documents of ancient ‘Greek’ music, preserved on papyrus, stone and in the manuscript tradition. Such an endeavour faces challenges beyond those regularly met by editors of texts. This concerns not only the disposition of information in two dimensions that is inherent to the notation of song, but also the ensuing requirement to record, in addition to what is there and what is not, what might have been there. All this requires special methods of encoding as well as presentation. It will also be discussed how creating a multimedia edition of this kind can enhance our methodological awareness, forcing us to focus on details that might be dismissed too easily in traditional working environments.


Alexandra N. Lenz, Johanna Fanta-Jende & Markus Pluschkovits: "DiÖ - A corpus on language variation and language contact in Austria" 

Abstract: The talk is situated in the paradigm of Digital Linguistics, more precisely Digital Variationist Linguistics. This paradigm will first be presented from a theoretical and methodological point of view. In a second part of the talk we will present the project "German in Austria" from the perspective of Digital Variationist Linguistics. Our discussion will focus on digital aspects of the extensive data collection in the project, the data processing and the data analyses. We will also present tools for data transcription and data visualization that have been developed within the project.


Theresa Matzinger & Irene Böhm: "Digital explorations of the emergence of Middle English sound patterns"

Abstract: This talk will focus on current research by the historical linguistics group of the University of Vienna’s English department. We will briefly sketch the general interests of our group before focusing on two current projects that investigate speakers’ sensitivity to the occurrence frequencies of phonotactic patterns in the lexicon and in use:

First, we will introduce a corpus study (Matzinger & Ritt 2022) on Open Syllable Lengthening, a sound change during which the short vowels in some Middle English words were lengthened. Investigating the reasons for this sound change, we compiled an extensive database of Middle English vowel lengths and found that this change made newly emerging words conform to pre-existing prototypical phonotactic shapes. The change thereby increased predictable majority patterns. In addition, our study suggests that vowel lengthening was sensitive to morphological patterns and changed phonotactic shapes in a way that they came to predictably signal morphological structure. This may facilitate the recognition, processing, learning, and retrieval of words (cf. Post et al. 2008).

To explore the causes underlying those findings further, we use an artificial language learning experiment to examine whether speakers prefer to use words whose phonotactic shapes unambiguously indicate their morphological structure (see also Dressler & Dziubalska-Kołaczyk 2006). Participants are asked to learn a miniature language in which both simple and complex word forms have the same sound shape. We aim to assess whether participants indeed experience difficulty in learning ambivalent phonotactic shapes and alter them in a systematic, morphotactically disambiguating manner.

Thus, our research links corpus studies on naturally occurring diachronic sound changes with systematic explorations in highly controlled experimental settings. Overall, this can help us to understand how cognitive processing biases may select for particular sound patterns in cultural evolution.

Dressler, Wolfgang U. & Dziubalska-Kołaczyk, Katarzyna & Pestal, Lina. 2010. Change and variation in morphonotactics. Folia Linguistica Historica 31. 51–68.

Matzinger, Theresa & Ritt, Nikolaus. 2022. Phonotactically probable word shapes represent attractors in the cultural evolution of sound patterns. Cognitive Linguistics 33(2). 415–446.

Post, Brechtje & Marslen-Wilson, William D. & Randall, Billi & Tyler, Lorraine K. 2008. The processing of English regular inflections: Phonological cues to morphological structure. Cognition 109. 1–17.


Claudia Resch & Nina C. Rastinger: "Whatʼs in the news? A corpus of historical newspapers under investigation"

Abstract: This contribution will focus on a historical newspaper corpus entitled DIGITARIUM (cf. digitarium.acdh.oeaw.ac.at, Resch & Kampkaspar 2020) containing more than 300 issues of the Wien[n]erisches Diarium (today: Wiener Zeitung), which is among the oldest newspapers still published today and whose form is currently being questioned again.

After a brief insight into the creation of this 18th century newspaper corpus, we would like to show how we (and others) can now examine this treasure trove of historical data and make it fruitful for philological research. Starting from the premise that corpora should ideally serve multiple research interests, we will demonstrate the enormous potential for the reuse of the DIGITARIUM. A selection of empirical results will be used to discuss the purposes for which the corpus data has been explored so far, and the questions we are currently working on. Those fields of investigation range from changes in spelling and printing in the 18th century to aspects of intermediality between the newspaper and other media to the closer evaluation of lists and advertisements, which are still an essential feature of the Wiener Zeitung today.


Renato Rocha Souza: "Crafting a system for knowledge discovery and organisation: A case-study on KOS for a non-standard German legacy dataset"

Abstract: We will present a case-study about the development of a knowledge organisation system (KOS), using the example of a non-standard German language legacy dataset, DBÖ [Datenbank der bairischen Mundarten in Österreich / Database of Bavarian Dialects in Austria]. A particular focus is placed on the 109 original data collection questionnaires contained in the collection, which are understood as an entry point to the entire collection.


Lukas Thoma: "Large pre-trained deep learning language models are capable of computing abstract sameness relations"

Abstract: As state-of-the-art deep neural language models display almost human-like performance in tasks such as text generation and next-word prediction, these can be considered as suitable holistic systems for exploring cognitive core mechanisms involved in human language processing. Firstly, it needs to be clarified whether modern NLP models already rely on some of the elementary mechanisms known from human cognition. In our efforts so far, we focused on the computation of “abstract sameness relations”, a mechanism which is already present in humans in early infancy and which allows the detection and abstract representation of repetition rules from (linguistic) input, e.g. from syllable sequences such as ag gu gu and ol ki ki.

In order to investigate this mechanism in pre-trained language models, we adapted an experimental design from psycholinguistic studies with human infants and created a novel generative experimental setting. Our results provide strong evidence that (large) deep neural language models are capable of detecting repetition rules in language input and of generalizing these to new elements in their vocabulary. Since this ability requires the computation of sameness at an abstract level beyond the item-specific context, our results imply that Transformer NLP models successful at this task may rely on a similar core mechanism as humans with regard to language processing.


Tanja Wissik: "The ParlaMint-AT corpus: An annotated corpus of the shorthand records of the Austrian National Council as part of comparable parliamentary corpora"

Abstract: Specialized corpora are an important language resource for research questions from different fields such as linguistics, language for special purposes or terminology science. For examining parliamentary discourse and political language, for example, specialized corpora from parliamentary records are of interest. Since in most countries official records of parliamentary sessions are freely available in electronic form on the websites of the respective parliaments, there is also a growing number of machine-readable and annotated parliamentary corpora (see e.g. Fišer, Lenardič and Erjavec 2018). Since the different parliamentary corpora are available in very different formats with very different annotations and metadata, the (re)use of such corpora is rather difficult, especially for comparative or contrastive studies. The ParlaMint project (Erjavec et al. 2022), in which the ParlaMint-AT corpus was created, aims to overcome this obstacle. This contribution will describe the ParlaMint-AT corpus, an annotated corpus of the shorthand records of the Austrian National Council, as well as the ParlaMint project as a whole.

Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D. de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darģis, Orsolya Ring, Ruben van Heusden, Maarten Marx, and Darja Fišer. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, 2022.

Fišer, Darja, Lenardič, Jakob, & Erjavec, Tomaž (2018). CLARIN’s Key Resource Families. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA). Available at www.aclweb.org/anthology/L18-1210.


Igor Yanovich: "Posterior predictive checking for the models of language diversification"

Abstract: In recent decades, computational-statistical methods have become popular in historical linguistics. In particular, computational phylogenetic analyses regularly generate hypotheses for the structure of different language families (and sometimes claim to have resolved old questions about the subgrouping within a family). However, not all the apparatus of computational statistics is used in practice. In this talk, I explain the method of posterior predictive checking; illustrate its application in a case study on the Bantu family that Silvia Ghirotto, Patricia Santos, Andrea Benazzo and I have conducted; and call for the method’s application elsewhere.
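
For readers unfamiliar with the term, the general idea of a posterior predictive check can be sketched on a deliberately simple, non-phylogenetic example; this is not the model or data of the Bantu case study. The sketch assumes a binary sequence modelled as independent draws with a uniform Beta(1, 1) prior, and uses the number of switches between adjacent values as the test statistic. The names `switches` and `posterior_predictive_pvalue` and the data are hypothetical.

```python
import random


def switches(seq):
    """Test statistic: number of adjacent positions whose values differ."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def posterior_predictive_pvalue(observed, n_draws=2000, seed=0):
    """Share of replicated datasets with at most as many switches as observed.

    Model (an assumption for illustration): independent Bernoulli(theta)
    values with a Beta(1, 1) prior, so the posterior is
    Beta(1 + successes, 1 + failures).
    """
    rng = random.Random(seed)
    n, k = len(observed), sum(observed)
    t_obs = switches(observed)
    at_most = 0
    for _ in range(n_draws):
        # 1. Draw a parameter value from the posterior.
        theta = rng.betavariate(1 + k, 1 + n - k)
        # 2. Simulate a replicated dataset under that parameter value.
        replicated = [1 if rng.random() < theta else 0 for _ in range(n)]
        # 3. Compare the statistic on replicated and observed data.
        if switches(replicated) <= t_obs:
            at_most += 1
    return at_most / n_draws


# Hypothetical data: strongly clustered, which an independence model
# should have trouble reproducing.
observed = [1] * 10 + [0] * 10
p_value = posterior_predictive_pvalue(observed)
print(p_value)
```

A value close to 0 or 1 indicates that data simulated from the fitted model rarely resemble the observed data with respect to the chosen statistic, which points to a misfit of the model in that respect.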