As the goal of Linguistica 5 is to perform automatic linguistic analysis for some given data, it is important to understand what data Linguistica 5 can handle.
Two types of data are recognized:
- raw corpus text
If you use Linguistica 5 via the Graphical user interface (GUI) or the Command line interface (CLI), then your data is a file on your local drive. However, if you use it as a Python library instead, your data can either be a file from the local drive or an in-memory Python object.
Raw corpus text¶
A raw corpus text is simply a plain text file.
An example is the
(Kučera and Francis 1967) with
about one million word tokens (for about 50,000 word types).
This corpus is a built-in dataset that comes with Linguistica 5 –
its file path is accessible as
brown as follows:
import linguistica as lxa from linguistica.datasets import brown lxa_object = lxa.read_corpus(brown)
In lieu of
brown, you may pass the file path as argument to
read_corpus() for your raw corpus text.
As long as your corpus is
one single text file encoded in Unicode UTF-8,
Linguistica 5 can handle it.
Alternatively, if you use Linguistica 5 as a Python library in your own
Python programs and would like to use an in-memory object as corpus data,
from_corpus() (see Data source) is probably what
A wordlist file is a plain text file in a specific format (more below)
that Linguistica 5 recognizes as a wordlist.
The Linguistica team has conventionally called this format
(think: a dictionary file; that’s “one” but not “L” at the end)
which is also the file extension name.
An example of a
.dx1 file is
(This file is essentially the
CMU Pronouncing Dictionary
with English words represented phonemically by Arpabet,
with the addition of word token counts—if available—based on the English
Brown corpus.) This wordlist is also a built-in dataset:
import linguistica as lxa from linguistica.datasets import cmudict lxa_object = lxa.read_wordlist(cmudict)
cmudict is the file path for the built-in
It can be replaced with the file path of a wordlist from your local drive.
As for the format of a wordlist file, each line of the file contains three pieces of information for a unique word type:
The word itself
By default, case does not matter, as Linguistica 5 internally processes all words in lowercase.
Token count of the word type
This is based on some corpus data. If unavailable, put down 1 here.
Phonemic representation (optional)
A list of phonemes/phones (separated by spaces) for the pronunciation of the word. If unavailable, the word itself with its list of letters is taken to be phonemic representation.
To illustrate what the required format looks like, here are a few lines
ABANDON 18 AH0 B AE1 N D AH0 N ABANDONED 26 AH0 B AE1 N D AH0 N D ABANDONING 8 AH0 B AE1 N D AH0 N IH0 NG
If phonemic representations were not available, these lines would be as follows:
ABANDON 18 ABANDONED 26 ABANDONING 8
Blank lines as well as those that begin with
# (for comments, metadata etc)
are ignored by Linguistica 5.
To use Linguistica 5 as a Python library with an in-memory wordlist object,
the relevant function is
from_wordlist() (see Data source).