A1 Journal article (refereed)
The International Comparable Corpus : Challenges in building multilingual spoken and written comparable corpora (2021)
Čermáková, A., Jantunen, J., Jauhiainen, T., Kirk, J., Křen, M., Kupietz, M., & Uí Dhonnchadha, E. (2021). The International Comparable Corpus : Challenges in building multilingual spoken and written comparable corpora. Research in Corpus Linguistics, 9(1), 89-103. https://doi.org/10.32714/ricl.09.01.06
JYU authors or editors
Publication details
All authors or editors: Čermáková, Ann; Jantunen, Jarmo; Jauhiainen, Tommi; Kirk, John; Křen, Michal; Kupietz, Marc; Uí Dhonnchadha, Elaine
Journal or series: Research in Corpus Linguistics
eISSN: 2243-4712
Publication year: 2021
Volume: 9
Issue number: 1
Pages range: 89-103
Publisher: Asociación Española de Lingüística de Corpus
Publication country: Spain
Publication language: English
DOI: https://doi.org/10.32714/ricl.09.01.06
Persistent website address: http://ricl.aelinco.es/first-view/155-Article%20Text-1147-1-10-20210618.pdf
Publication open access: Openly available
Publication channel open access: Open Access channel
Publication is parallel published (JYX): https://jyx.jyu.fi/handle/123456789/79643
Abstract
This paper reports on the efforts of twelve national teams in building the International Comparable Corpus (ICC; https://korpus.cz/icc) that will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which is considered to be the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC corpus is committed to the idea of re-using existing multilingual resources as much as possible and the design is modelled, with various adjustments, on the International Corpus of English (ICE). As such, ICC will contain approximately the same balance of 40 percent of written language and 60 percent of spoken language distributed across 27 different text types and contexts. A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution.
Keywords: linguistics; corpora; comparative linguistics; contrastive research; copyright
Free keywords: ICC corpus; contrastive linguistics; comparable corpus; ICE corpus; data sustainability; copyright
Contributing organizations
Ministry reporting: Yes
VIRTA submission year: 2022
JUFO rating: 1