OpenSoNaR

OpenSoNaR is an easy-to-use online corpus retrieval system that allows for analyzing and searching the SoNaR and CGN corpora.

Overview

  • OpenSoNaR is an online application for exploring and searching the SoNaR and CGN corpora. In the Exploration (Dutch: verken) interface one can investigate corpus distributions, request statistics from sub-corpora, retrieve n-grams from sub-corpora and search for specific documents. In the Search (Dutch: zoek) interface one can choose from four search strategies: simple (simpel), extended (uitgebreid), advanced (geavanceerd) or expert.
  • OpenSoNaR can be accessed at https://opensonar.ivdnt.org/.
  • To use OpenSoNaR, an account is required. Employees of universities or research institutes in the Netherlands can log in with the user ID and password of their own organization. If you do not have an account at an academic institute, please apply for an account at clarin.eu.
  • The SoNaR corpus contains more than 500 million words of text from various domains and genres. All texts were tokenised, POS tagged and lemmatised. The named entities were also labelled. All annotations of SoNaR were produced automatically.
  • The Corpus of Spoken Dutch (Corpus Gesproken Nederlands, CGN) is a collection of 900 hours (almost 9 million words) of contemporary Dutch speech, originating from Flemish and Dutch speakers. The speech fragments (spontaneous and prepared) are aligned with various transcriptions (including orthographic and phonetic) and annotations (lemma, POS tags). All annotations have been verified manually, except for the phonetic transcription, of which only 11.3% was verified.
  • Due to the size of the corpora, the number of hits shown in OpenSoNaR is limited to 8 million. If the results of your query exceed this limit, only the first 8,000,000 hits will be shown.
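The hit limit described above can be illustrated with a small sketch. This is not OpenSoNaR code; the function names `hits_shown` and `is_truncated` are hypothetical, and only the figure of 8,000,000 comes from the text.

```python
# Display cap on hits, as stated in the overview (8 million).
HIT_LIMIT = 8_000_000


def hits_shown(total_hits: int, limit: int = HIT_LIMIT) -> int:
    """Return how many hits would be displayed for a query with total_hits matches."""
    if total_hits < 0:
        raise ValueError("total_hits must be non-negative")
    return min(total_hits, limit)


def is_truncated(total_hits: int, limit: int = HIT_LIMIT) -> bool:
    """Return True when the query matches more hits than can be displayed."""
    return total_hits > limit


if __name__ == "__main__":
    for total in (42, 8_000_000, 12_500_000):
        print(total, hits_shown(total), is_truncated(total))
```

In practice this means that counts or frequency lists derived from a very frequent query may cover only part of the matches, so narrowing the query or selecting a sub-corpus is advisable when a result hits the cap.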

Learn

Instructional webpages and manuals

  • OpenSoNaR has a built-in page guide with four steps, found on the top right of the home page of the tool.
  • A more detailed application manual can be found here. Note: access to this manual requires the user to log in to the application.
  • A resource webpage including a manual to the SoNaR corpus can be found here.
  • A resource webpage including a manual to the CGN corpus can be found here.

Workshop and tutorial

The Week van het Nederlands hosted a tutorial on using OpenSoNaR on October 9, 2020. The videos of this tutorial can be found here.

The slides and exercises accompanying the tutorial can be found here.

Mentions

The application is a web-based frontend for the BlackLab search engine for corpora with token-based annotation. The current frontend is a further development of the corpus-frontend application developed by INT, and its design is inspired by the first version of the OpenSoNaR user interface by Tilburg University and Radboud University.
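To give an impression of what token-based annotation means for searching: BlackLab accepts queries in Corpus Query Language (CQL), in which each token is written as a set of constraints on its annotation layers. The sketch below only builds such query strings; the helper names `cql_token` and `cql_sequence` are hypothetical, and the layer names `lemma` and `pos` as well as the tag patterns are assumptions that may differ from the names actually used in OpenSoNaR.

```python
def cql_token(**constraints: str) -> str:
    """Build one CQL token pattern, e.g. [lemma="fiets" & pos="N.*"].

    Each keyword is an annotation layer name (assumed, not taken from
    OpenSoNaR) and each value is a regular expression for that layer.
    """
    if not constraints:
        return "[]"  # in CQL, empty brackets match any single token
    parts = []
    for layer, regex in constraints.items():
        escaped = regex.replace('"', '\\"')  # keep the quoted value well-formed
        parts.append(f'{layer}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens: str) -> str:
    """Join token patterns into a sequence of adjacent tokens."""
    return " ".join(tokens)


if __name__ == "__main__":
    # A token with an adjective-like tag followed by the lemma "fiets".
    print(cql_sequence(cql_token(pos="ADJ.*"), cql_token(lemma="fiets")))
```

A string built this way would presumably be entered in the expert search; the simpler search strategies offer form fields for the same kind of constraints.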

Publications

  • Oostdijk, N., Reynaert, M., Hoste, V., & Schuurman, I. (2013). The construction of a 500-million-word reference corpus of contemporary written Dutch. Essential speech and language technology for Dutch: Results by the STEVIN programme, 219-247.
  • Corpus Gesproken Nederlands - CGN (Version 2.0.3) (2014) [Data set]. Available at the Dutch Language Institute: (http://hdl.handle.net/10032/tm-a2-k6)
  • Reynaert, M., van de Camp, M., & van Zaanen, M. (2014). OpenSoNaR: user-driven development of the SoNaR corpus interfaces. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, 124–128. (https://aclanthology.org/C14-2027.pdf)

Webpages

  • (https://opensonar.ivdnt.org/)