The Centre for Corpus and Experimental Research on Slavic Languages “Slavicus”


We focus mainly on variation among Slavic and Baltic languages in the expression and interpretation of functional categories, with the aim of finding semantic universals. One of our main goals is to create a large parallel corpus of 11 Slavic languages, 2 Baltic languages and English as a reference language, equipped with a user-friendly interface and sophisticated data search mechanisms. Translation mining and multidimensional scaling will be used to find semantic universals. The project will bring together theoretical and computational linguists and open up broad research areas for syntacticians, semanticists, morphologists, translators and lexicographers.

Our goal is to create a modern centre for Slavic research of worldwide scope, combining theoretical, computational and experimental linguistics in the comparative study of tense and grammatical aspect in Slavic and Baltic languages.

The research plans of the Centre are:

Creation of tools and a methodological foundation in computational linguistics for modern comparative research

Prof. Bożena Rozwadowska: We will build an innovative parallel corpus of 11 Slavic languages (Polish, Russian, Belarusian, Ukrainian, Czech, Slovak, Serbian, Slovenian, Croatian, Macedonian and Bulgarian), 2 Baltic languages (Lithuanian and Latvian) and English as a reference language. The corpus will consist of original texts translated into the Slavic and Baltic languages listed above, so we will look for texts that have been translated into many languages. All language pairs will be aligned, which will allow us to compare semantically equivalent constructions across all 14 languages. This corpus, with its user-friendly interface and a search engine tailored to the expectations of linguists, should attract researchers from all over the world. It will be integrated with the WordNet system in cooperation with CLARIN-PL. The corpus will be created in cooperation with corpus specialists from the Institute of Slavic Studies of the Polish Academy of Sciences and NLP specialists from the Faculty of Computer Science of the University of Wrocław.

Internationalisation

In collaboration with Prof. Henriette de Swart and Dr. Bert Le Bruyn of Utrecht University, Slavicus researchers will develop tools for data processing and visualization using multidimensional scaling.
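
As a rough illustration of how translation mining and multidimensional scaling can fit together, the sketch below takes a handful of aligned contexts, records which verb form each language uses in each context, measures how often two contexts differ in form across languages, and places the contexts on a two-dimensional map. Everything in it is an assumption made for illustration: the FORMS data and its labels are invented, the function names are hypothetical, and classical (Torgerson) scaling is only one of several MDS variants. It does not describe the tools that Slavicus and Utrecht University will actually build.

```python
import numpy as np

# Hypothetical toy data: the verb-form label each language uses in five
# aligned contexts. Labels are only compared within a language.
FORMS = {
    "English": ["PERFECT", "PAST", "PAST", "PERFECT", "PRESENT"],
    "Polish": ["PAST_PFV", "PAST_PFV", "PAST_IPFV", "PAST_PFV", "PRESENT"],
    "Lithuanian": ["PAST", "PAST", "PAST_FREQ", "PAST", "PRESENT"],
}


def context_distances(forms):
    """Share of languages in which two contexts receive different forms."""
    columns = list(forms.values())
    n = len(columns[0])
    dist = np.zeros((n, n))
    for labels in columns:
        arr = np.array(labels)
        # 1 where the language uses different forms for contexts i and j.
        dist += arr[:, None] != arr[None, :]
    return dist / len(columns)


def classical_mds(dist, k=2):
    """Classical (Torgerson) MDS: embed a distance matrix in k dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


if __name__ == "__main__":
    distances = context_distances(FORMS)
    coords = classical_mds(distances)
    for index, (x, y) in enumerate(coords):
        print(f"context {index}: ({x:.3f}, {y:.3f})")
```

On such a map, contexts that every language treats alike fall on the same point, while contexts that languages split in different ways drift apart, which is what makes the resulting clusters candidates for cross-linguistically stable meanings.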

Development of a semantic micro-typology of tense and aspect categories in Slavic and Baltic languages

There are about 6,500 languages in the world. Languages obviously differ from each other, but this variation is not accidental: languages from different families and from geographically unconnected areas share many properties. Assuming that language is an integral part of our mind, linguists compare languages in search of universals and parameters of variation. Theoretical studies usually rely on selective data. Recently, quantitative corpus-based methods exploring big data have begun to be used in theoretical research. Advanced big data search, mining and visualization mechanisms will allow the Centre's researchers to better understand the similarities and differences between languages in the phenomena under study.

Development of better NLP methods for aligning data from multilingual web resources and for machine translation between Slavic and Baltic languages

In collaboration with Natural Language Processing (NLP) specialists, the Centre's researchers will work on better methods for accessing multilingual web data, for aligning data in parallel corpora, and for raising the quality of machine translation.

Comparative neurolinguistic research: semantic universals versus the categories of tense and aspect in the mind

The researchers intend to verify the results of corpus-based studies with psycholinguistic experiments, including eye-tracking studies and ERP (event-related potential) studies.

Identification of semantic universals of tense and aspect categories in Slavic and Baltic languages in relation to language history

The researchers will seek to understand the observed micro-typological regularities in the semantics of tense and aspect by linking them to facts about linguistic change in Slavic and Baltic languages.

Development of formal semantic models

The overarching goal of Slavicus researchers is to develop formal models that explain the typological differences and similarities observed in the semantics of tense and aspect in Slavic and Baltic languages.

Currently, the researchers are selecting the literature for the parallel corpus, specifying the rules of cooperation with partners, and preparing the material for the preliminary survey of the micro-typology of aspect categories in Slavic and Baltic languages. They have also submitted their first grant application to the National Science Centre (NCN) and are exploring the possibility of applying for grants for international cooperation. From 2022 to 2024, they plan to collaborate with researchers at the University of Leipzig on Lusatian languages under DAAD funding.

The project "Zintegrowany Program Rozwoju Uniwersytetu Wrocławskiego 2018-2022" (Integrated Development Programme of the University of Wrocław 2018-2022) is co-financed by the European Union from the European Social Fund.