Ucto
Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, that you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenization rules for several languages and can easily be extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor.
Overview
- Comes with tokenization rules for English, Dutch, French, Italian, and Swedish; easily extensible to other languages.
- Recognizes dates, times, units, currencies, and abbreviations.
- Recognizes paired quote spans, sentences, and paragraphs.
- Produces UTF-8 output with NFC normalization; optionally accepts other encodings as input.
- Optional conversion to all lowercase or uppercase.
- Supports FoLiA XML.
- Ucto is also available as a webservice.
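To make the terms in this overview concrete, here is a minimal Python sketch of what separating words from punctuation, NFC normalization, and lowercasing mean in practice. It is not Ucto's algorithm or rule set: the function name `naive_tokenize` and the regular expression are hypothetical, and the sketch ignores abbreviations, dates, quotes, and everything else Ucto's language-specific rules handle.

```python
import re
import unicodedata

# Hypothetical, deliberately simple pattern: a run of word characters,
# or a single character that is neither a word character nor whitespace.
# Ucto's real rules are far richer than this.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def naive_tokenize(text, lowercase=False):
    """Normalize text to NFC, optionally lowercase it, and split off punctuation."""
    # NFC composes base characters and combining marks where possible,
    # e.g. "e" followed by U+0301 becomes the single character U+00E9.
    text = unicodedata.normalize("NFC", text)
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(naive_tokenize("Hello, world! Cafe\u0301 time.", lowercase=True))
```

A rule-free splitter like this breaks on inputs such as "e.g." or "12:30", which is the kind of case the recognizers listed above are meant to cover.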
Learn
Webservice
Ucto is available as a webservice.
Download & Installation
Ucto is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.
To download and install Ucto:
- First check if there are up-to-date packages included in your distribution's package manager. There are packages for Arch Linux, Debian, FreeBSD and Ubuntu.
- If not, it is strongly recommended to use the LaMachine software distribution, which includes Ucto and all necessary dependencies, and runs on Linux, BSD and Mac OS X. It can also run as a virtual machine under any host OS.
- Alternatively, you can always download, compile and install Ucto manually, as described below.
Manual installation
To compile manually, consult the included INSTALL documents. You will need current versions of the following dependencies of our software:
As well as the following 3rd party dependencies:
- icu - A C++ library for Unicode and Globalization support. On Debian/Ubuntu systems, install the package libicu-dev.
- libxml2 - An XML library. On Debian/Ubuntu systems, install the package libxml2-dev.
- A sane build environment with a C++ compiler (e.g. gcc or clang), autotools, libtool, and pkg-config.
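Before starting a manual build, it can help to check whether the tools and libraries listed above are present. The following Python sketch does that under some assumptions: the executable names (`g++`, `clang++`, `autoconf`, `automake`, `libtoolize`, `glibtoolize`) and the pkg-config module names (`icu-uc`, `libxml-2.0`) are what these packages commonly install, but they may differ on your system. The function name `check_build_environment` is hypothetical and not part of Ucto.

```python
import shutil
import subprocess

# Each requirement maps to candidate executable names (assumptions; adjust as needed).
TOOL_CANDIDATES = {
    "C++ compiler": ["g++", "clang++", "c++"],
    "autoconf": ["autoconf"],
    "automake": ["automake"],
    "libtool": ["libtoolize", "glibtoolize", "libtool"],
    "pkg-config": ["pkg-config"],
}

# pkg-config module names assumed for the ICU and libxml2 development packages.
PKG_MODULES = ["icu-uc", "libxml-2.0"]


def check_build_environment():
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    report = {}
    for name, candidates in TOOL_CANDIDATES.items():
        report[name] = any(shutil.which(c) is not None for c in candidates)
    for module in PKG_MODULES:
        if report["pkg-config"]:
            # "pkg-config --exists" exits with status 0 when the module is known.
            result = subprocess.run(
                ["pkg-config", "--exists", module],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            report[module] = result.returncode == 0
        else:
            report[module] = False
    return report


if __name__ == "__main__":
    for requirement, found in check_build_environment().items():
        print(f"{requirement}: {'found' if found else 'MISSING'}")
```

A missing entry only means the check did not find it under the assumed name; the INSTALL documents remain the authoritative guide.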
Documentation
The Ucto documentation can be found here.
Python binding
Ucto can be used from Python through the python-ucto binding, which has to be obtained separately unless you are using LaMachine.
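As a rough illustration of the binding, the sketch below tokenizes a string and groups the tokens into sentences. The calls `ucto.Tokenizer`, `process`, iteration over the tokenizer, and `isendofsentence` follow the python-ucto README as best recalled, so treat them as assumptions and check them against the version you install. The configuration name `tokconfig-eng` is also an assumption, and `group_sentences` and `tokenize_with_ucto` are hypothetical helper names, not part of the binding.

```python
def group_sentences(tokens):
    """Group (text, ends_sentence) pairs into a list of sentences (lists of strings)."""
    sentences, current = [], []
    for text, ends_sentence in tokens:
        current.append(text)
        if ends_sentence:
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that were never closed by a sentence boundary.
        sentences.append(current)
    return sentences


def tokenize_with_ucto(text, config="tokconfig-eng"):
    """Tokenize text with python-ucto and return sentences as lists of token strings.

    Assumes python-ucto is installed and that the named configuration exists.
    """
    import ucto  # imported lazily so the helper above works without the binding

    tokenizer = ucto.Tokenizer(config)  # assumed constructor: configuration name
    tokenizer.process(text)
    return group_sentences(
        (str(token), token.isendofsentence()) for token in tokenizer
    )
```

With the binding installed, `tokenize_with_ucto("This is a test. Is it?")` would return one list of token strings per sentence. The pure-Python `group_sentences` helper is kept separate so the grouping logic can be tried without Ucto.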
Support
The development and improvement of Ucto also relies on your bug reports, suggestions, and comments. Use the GitHub issue tracker or mail lamasoftware (at) science.ru.nl.
Mentions
Publications
- Van Gompel, M., van der Sloot, K., & van den Bosch, A. (2012). Ucto: Unicode Tokeniser. Version 0.5, 3, 12-05.
Webpages
- Ucto home page
- Ucto is included in LaMachine, a unified software distribution for Natural Language Processing, also including (among others):
  - Alpino
  - Frog
  - FoLiA
  - Colibri Core
  - CLAM
  - FLAT
Credits, Contact Information and License
Ucto was written by Maarten van Gompel and Ko van der Sloot (Radboud University). Work on Ucto was funded by NWO, the Netherlands Organisation for Scientific Research, under the Implicit Linguistics project and the CLARIN-NL program.
The development and improvement of Ucto also relies on your bug reports, suggestions, and comments. Use the GitHub issue tracker or mail lamasoftware (at) science.ru.nl.
Ucto is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.