Finding data in linguistics
In linguistics, data is everywhere, but what is actually meant by data? The Open Handbook of Linguistic Data Management defines data as "entities used as evidence of phenomena for the purposes of research or scholarship" (Berez-Kroeker et al. 2022: 3, adopted from Borgman 2015). The range of linguistic data is discussed, for example, in Jeff Good's chapter "The Scope of Linguistic Data" in the same handbook.
Language data in digital form is often referred to as language resources. The ELRA Language Resources Association defines language resources as "a set of speech or language data and descriptions in machine readable form." Examples of language resources listed by ELRA include written or spoken corpora, speech collections, computational lexica, terminology databases and tools.
Depending on your research question, you may need to conduct an experiment or a survey to collect data, or you may need to analyze spoken or written data from existing corpora.
Before you start searching, determine what type of data your research question requires.
Where can I search?
The following overview lists some of the main language resource catalogues as well as databases, repositories and individual corpora. Further resources can be found in the Database Information System (DBIS) and in the ULB catalogue.
Please note that the listed resources may include datasets that are not freely available. Please observe the providers' terms of use as well as the license information for individual datasets.
When working with data, it is advisable to think about how to manage it. The Open Handbook of Linguistic Data Management addresses aspects of research data management such as archiving, sharing and citing data.
- re3data
The Registry of Research Data Repositories is an internationally recognized catalogue of research data repositories.
- Zenodo
The research data repository Zenodo provides access to data from all disciplines, including linguistics.
- LinguistList
This international communication platform for linguists also lists information about linguistic data and tools.
- CLARIN Virtual Language Observatory (VLO)
A research infrastructure for the arts and humanities, cultural studies and social sciences that archives data from these fields and makes it searchable via the VLO.
Link: https://vlo.clarin.eu/
- Linguistic Data Consortium (LDC)
A consortium of universities, libraries and research laboratories that creates and distributes language resources.
Link: https://catalog.ldc.upenn.edu/
- Open Language Archives Community (OLAC)
An international virtual library for language resources.
Link: http://www.language-archives.org/
- Sketch Engine
A corpus manager and text analysis software licensed by HHU.
- CoRD
A list of English corpora by the research group Variation, Contacts and Change in English at the University of Helsinki.
Link: https://varieng.helsinki.fi/CoRD/
- The Tromsø Repository of Language and Linguistics (TROLLing)
An Open Access repository and CLARIN centre for linguistic data and code.
Link: https://dataverse.no/dataverse/trolling
- OPUS the open parallel corpus
A collection of freely available parallel corpora compiled under the supervision of Jörg Tiedemann from the University of Helsinki.
Link: https://opus.nlpl.eu/
- TalkBank
An Open Access repository for language data, especially spoken language data.
Link: https://www.talkbank.org/
- GESIS Leibniz-Institut für Sozialwissenschaften
A collection of social media data as well as further digital behavioral data.
Link: https://www.gesis.org/
- Fachinformationsdienst (FID) Linguistik
The Lin|gu|is|tik-Portal offers subject information for all areas of linguistics.
- Annohub-Repository of FID Linguistik
Link: https://www.linguistik.de/de/lod/annohub-repository/
- Nationale Forschungsdateninfrastruktur Text+
Link: https://text-plus.org/
Link (Federated Content Search, CLARIN-FCS): https://fcs.text-plus.org/
- META-SHARE
A network of repositories with a catalogue that lists more than 2,500 language resources.
Link: http://www.meta-share.org/
- European Language Resources Association (ELRA) Catalogue
Link: https://catalog.elra.info/
- LRE Map
Data catalogue for some of the most important NLT conferences, such as LREC, COLING and LTC.
Link: https://lremap.elra.info/
- Kaggle
An online platform for knowledge exchange and competitions around data analysis, machine learning, data mining and big data.
- Hugging Face
An online platform from the machine learning area, where users can share datasets and models.
Link: https://huggingface.co/datasets
- Natural Language Toolkit (NLTK)
A collection of Python libraries and programs in computational linguistics with an interface to different corpora.
Link: https://www.nltk.org/data.html
- Glottolog
The database offers free access to scientific information about the languages of the world, especially about endangered languages, language families and dialects.
- Ethnologue
The database provides basic information on all known living languages and is licensed by HHU.
- The World Atlas of Language Structures (WALS)
An extensive database of structural properties of languages (e.g. phonological, grammatical, lexical properties).
- MLA Language Map
A resource by the Modern Language Association that lets users explore the linguistic environment of different regions of the USA in an easy and intuitive way.
- SIL Language and Culture Archives
A bibliography that includes more than 40,000 books, journal articles, dissertations and data about approximately 1,600 languages and cultures.
- Endangered Languages Archive (ELAR)
Provides multimedia material documenting everyday language use in endangered languages from all over the world.
- The Language Archive (TLA) of the Max Planck Institute
Link: https://archive.mpi.nl/tla/
- Language Data Commons of Australia (LDaCA)
Link: https://www.ldaca.edu.au/
- PHOIBLE
A collection of phonological data from around 2,200 different languages.
- Surrey Lexical Splits Database
Link: https://lexicalsplitsdb.surrey.ac.uk/
- The World Loanword Database (WOLD)
Link: https://wold.clld.org/
- UCLA Phonetics Lab Archive
Link: http://archive.phonetics.ucla.edu/
A list of corpus tools by Kristin Berberich and Ingo Kleiber
Link: https://corpus-analysis.com/
- English
- english-corpora.org: Corpora by Prof. em. Mark Davies (Brigham Young University, Utah), licensed by FID Linguistik
- Oxford Text Archive
Link: https://ota.bodleian.ox.ac.uk/repository/xmlui/
- German
- IDS Mannheim:
- Hamburger Zentrum für Sprachkorpora
- Bayerisches Archiv für Sprachsignale
- Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
- Arbeitsbereich Allgemeine Sprachwissenschaft & Computerlinguistik, Universität Tübingen
- Corpus Linguistics and Morphology, Humboldt-Universität zu Berlin
- Leipzig Corpora Collection
- Slavic languages
- List of corpora, provided by the Lehrstuhl für slavische Sprachenwissenschaft at the University of Tübingen, in cooperation with the Lehrstuhl für Slavische Philologie (Sprachwissenschaft) at the University of Freiburg.
Access to resources
Not all listed resources are freely available. If you are interested in a resource that is not yet licensed by ULB, contact the responsible subject librarian to clarify whether and how access can be provided.
Please observe the provider's terms of use as well as the license information for individual datasets.
Text and Data Mining for ULB resources
If you wish to use resources licensed by ULB for text and data mining (TDM), please note that the license might not cover such use. Contact the responsible subject librarian to clarify whether a license covers TDM.
In our Digital Collections you can find a large number of digital facsimiles that are freely available for TDM, in different formats and through a number of interfaces. If you have questions about ULB Digital Collections, please contact us.
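To illustrate what a very simple text and data mining step can look like, the following is a minimal sketch in Python that counts word frequencies in a plain-text string. It uses only the Python standard library. The function name word_frequencies and the sample text are made up for illustration and are not part of any ULB interface; text exported from a collection would take the place of the sample string, subject to the applicable license terms.

```python
import re
from collections import Counter


def word_frequencies(text: str, top_n: int = 5) -> list[tuple[str, int]]:
    """Return the top_n most frequent lowercased word tokens in text.

    Hypothetical helper for illustration only; real projects usually
    need language-specific tokenization and normalization.
    """
    # \w+ matches runs of letters, digits and underscores (Unicode-aware).
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(top_n)


# Made-up sample text standing in for a document from a collection.
sample = "The corpus is small. The corpus is only a sample, and the sample is made up."

if __name__ == "__main__":
    for word, count in word_frequencies(sample, top_n=3):
        print(f"{word}\t{count}")
```

A regular expression is used here only to keep the sketch short. For actual research, a dedicated toolkit such as NLTK or a corpus manager such as Sketch Engine, both listed above, is likely the better choice.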