
Finding data in linguistics

In linguistics, data is everywhere, but what is actually meant by data? The Open Handbook of Linguistic Data Management defines data as "entities used as evidence of phenomena for the purposes of research or scholarship" (Berez-Kroeker et al. 2022: 3, adopted from Borgman 2015). The range of linguistic data is discussed, for example, in the chapter "The Scope of Linguistic Data" by Jeff Good in the same handbook.

Language data in digital form is often referred to as language resources. The ELRA Language Resources Association defines language resources as "a set of speech or language data and descriptions in machine readable form." Examples of language resources listed by ELRA are written or spoken corpora, speech collections, computational lexica, terminology databases and tools.

Depending on your research question, you may need to conduct an experiment or a survey to collect data, or you may need to analyze spoken or written data from existing corpora.

Before you start searching, determine what type of data your research question requires.

Where can I search?

The following overview lists some of the main language resource catalogues as well as databases, repositories and individual corpora. Further resources can be found in the Database Information System (DBIS) and in the ULB catalogue.

Please note that the listed resources may include datasets that are not freely available. Observe the providers' terms of use as well as the license information for individual datasets.

When working with data, it is advisable to plan how you will manage it. The Open Handbook of Linguistic Data Management addresses aspects of research data management such as archiving, sharing and citing data.

  • re3data
    The Registry of Research Data Repositories is an internationally recognized catalogue for research data repositories.
  • Zenodo
    The research data repository Zenodo provides access to data from all disciplines including linguistics.
  • LinguistList
    This international communication platform for linguists also lists information about linguistic data and tools.
  • META-SHARE
    A network of repositories with a catalogue that lists more than 2,500 language resources.
    Link: http://www.meta-share.org/
  • European Language Resources Association (ELRA) Catalogue
    Link: https://catalog.elra.info/
  • LRE Map
    A data catalogue covering some of the most important NLP conferences, such as LREC, COLING and LTC.
    Link: https://lremap.elra.info/
  • Kaggle
    An online platform for knowledge exchange and competitions around data analysis, machine learning, data mining and big data.
  • Hugging Face
    An online platform from the machine learning community where users can share datasets and models.
    Link: https://huggingface.co/datasets
  • Natural Language Toolkit (NLTK)
    A collection of Python libraries and programs for computational linguistics with an interface to various corpora.
    Link: https://www.nltk.org/data.html

     
  • Glottolog
    The database offers free access to scientific information about the languages of the world, especially about endangered languages, language families and dialects.
  • Ethnologue
    The database provides basic information for all known living languages and is licensed by HHU.
  • The World Atlas of Language Structures (WALS)
    An extensive database of structural properties of languages (e.g. phonological, grammatical, lexical properties).
  • MLA Language Map
    A resource by the Modern Language Association that allows users to discover and investigate different regions of the USA with respect to their linguistic environment in an easy and intuitive way.
  • SIL Language and Culture Archives
    A bibliography that includes more than 40,000 books, journal articles, dissertations and datasets on about 1,600 languages and cultures.
  • Endangered Languages Archive (ELAR)
    Lists multimedia material documenting everyday language use in endangered languages from all over the world.
  • The Language Archive (TLA) of the Max Planck Institute
    Link: https://archive.mpi.nl/tla/
  • Language Data Commons of Australia (LDaCA)
    Link: https://www.ldaca.edu.au/

A list of corpus tools by Kristin Berberich and Ingo Kleiber
Link: https://corpus-analysis.com/

Access to resources

Not all listed resources are freely available. If you are interested in a resource that is not yet licensed by ULB, contact the responsible subject librarian to clarify whether or how access to the resources could be provided.

Please take note of the provider’s terms of use as well as the license information for individual datasets.
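
One way to keep track of access conditions while you search is to log each candidate dataset together with its license and the date you checked it. The following Python sketch is only an illustration: the record fields, the function name `needs_clarification` and the example entries are assumptions made for this example, not a ULB or provider format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class DatasetRecord:
    """One dataset found during the search (hypothetical record layout)."""
    name: str
    source_url: str
    license: str            # as stated by the provider, or "unknown"
    freely_available: bool  # False if access requires a license or request
    accessed: str           # ISO date on which the terms were checked


def needs_clarification(records):
    """Return the names of datasets whose access or license is still open."""
    return [
        r.name
        for r in records
        if not r.freely_available or r.license == "unknown"
    ]


# Placeholder entries; replace them with the datasets you actually found.
records = [
    DatasetRecord(
        name="Example corpus A",
        source_url="https://example.org/corpus-a",
        license="CC BY 4.0",
        freely_available=True,
        accessed=date(2024, 1, 15).isoformat(),
    ),
    DatasetRecord(
        name="Example corpus B",
        source_url="https://example.org/corpus-b",
        license="unknown",
        freely_available=False,
        accessed=date(2024, 1, 15).isoformat(),
    ),
]

# Serialize the log so it can be stored alongside the research data.
log_as_json = json.dumps([asdict(r) for r in records], indent=2)

# Datasets to raise with the subject librarian or the provider.
print(needs_clarification(records))
```

A log like this can also serve as a starting point for citing the datasets later, since it already holds the source and the access date.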

Text and Data Mining for ULB resources

If you wish to use resources licensed by ULB for text and data mining (TDM), please note that the license might not cover such use. Contact the responsible subject librarian to clarify whether a license covers TDM.

In our Digital Collections you can find a large number of digital facsimiles that are freely available for TDM, in different formats and through a number of interfaces. If you have questions about ULB Digital Collections, contact us at .
