Python library

To use Linguistica 5 as a Python library, first initialize a Linguistica object. How you do this depends on your data source:

Data source

read_corpus(file_path[, encoding])        Create a Linguistica object with a corpus data file.
read_wordlist(file_path[, encoding])      Create a Linguistica object with a wordlist file.
from_corpus(corpus_object, **kwargs)      Create a Linguistica object with a corpus object.
from_wordlist(wordlist_object, **kwargs)  Create a Linguistica object with a wordlist object.

For instance, if the Brown corpus is available on your local drive (see Raw corpus text):

>>> import linguistica as lxa
>>> lxa_object = lxa.read_corpus('path/to/english-brown.txt')

Use read_wordlist() if you have a wordlist text file instead (see Wordlist).

Use from_corpus() or from_wordlist() if your data is an in-memory Python object (either a corpus text or a wordlist).
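
As a minimal sketch of the in-memory route, the snippet below prepares both kinds of objects from a short made-up text using only the standard library: a list of word tokens of the kind from_corpus() accepts, and a dict of word types mapped to token counts of the kind from_wordlist() accepts. The sample text and variable names are illustrative assumptions, and the Linguistica calls are skipped when the package is not installed.

```python
from collections import Counter

# Made-up sample text, for illustration only.
text = "the cat sat on the mat and the dog sat too"

# A corpus object: a list of word tokens (a plain string would also do).
tokens = text.split()

# A wordlist object: word types mapped to their token counts.
word_counts = dict(Counter(tokens))

try:
    import linguistica as lxa
except ImportError:
    lxa = None  # Linguistica is not installed; only the data is prepared.

if lxa is not None:
    lxa_from_corpus = lxa.from_corpus(tokens)
    lxa_from_wordlist = lxa.from_wordlist(word_counts)
```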

Parameters

All the functions introduced in Data source accept optional keyword arguments that set parameters of the Linguistica object. Different Linguistica modules make use of different parameters; see Full API documentation.

For example, to deal with only the first 500,000 word tokens in the Brown corpus:

>>> import linguistica as lxa
>>> lxa_object = lxa.read_corpus('path/to/english-brown.txt', max_word_tokens=500000)

Parameter          Meaning                                               Default
max_word_tokens    maximum number of word tokens to be handled           0 (= all)
max_word_types     maximum number of word types to be handled            1000
min_stem_length    minimum stem length                                   4
max_affix_length   maximum affix length                                  4
min_sig_count      minimum number of stems for a valid signature         5
min_context_count  minimum number of occurrences for a valid context     3
n_neighbors        number of syntactic word neighbors                    9
n_eigenvectors     number of eigenvectors (in dimensionality reduction)  11
suffixing          whether the language is suffixing                     1 (= yes)
keep_case          whether case distinctions (“the” vs “The”) are kept   0 (= no)

The method parameters() returns the parameters and their values as a dict:

>>> from pprint import pprint
>>> pprint(lxa_object.parameters())
{'keep_case': 0,
 'max_affix_length': 4,
 'max_word_tokens': 0,
 'max_word_types': 1000,
 'min_context_count': 3,
 'min_sig_count': 5,
 'min_stem_length': 4,
 'n_eigenvectors': 11,
 'n_neighbors': 9,
 'suffixing': 1}
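
Because parameters are plain keyword arguments, a set of overrides can be kept in a dict, checked against the documented names, and unpacked with ** when the object is created. The sketch below assumes this workflow; check_overrides() and read_with_overrides() are hypothetical helpers (not part of Linguistica), and DEFAULTS simply copies the defaults listed above.

```python
# Documented defaults, copied from the parameter table above.
DEFAULTS = {
    "max_word_tokens": 0,
    "max_word_types": 1000,
    "min_stem_length": 4,
    "max_affix_length": 4,
    "min_sig_count": 5,
    "min_context_count": 3,
    "n_neighbors": 9,
    "n_eigenvectors": 11,
    "suffixing": 1,
    "keep_case": 0,
}


def check_overrides(overrides):
    """Return a copy of overrides, rejecting unknown parameter names.

    Hypothetical helper for illustration; not part of Linguistica.
    """
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(unknown))
    return dict(overrides)


def read_with_overrides(file_path, overrides):
    """Create a Linguistica object from a corpus file with checked overrides."""
    import linguistica as lxa  # requires Linguistica to be installed
    return lxa.read_corpus(file_path, **check_overrides(overrides))


overrides = check_overrides({"max_word_tokens": 500000, "min_stem_length": 3})

# The settings one would expect parameters() to report afterwards.
effective = {**DEFAULTS, **overrides}
```

A misspelled name such as min_stem_len is then caught by check_overrides() before any corpus is read.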

To change one or more parameters of a Linguistica object, use change_parameters() with keyword arguments:

>>> lxa_object.parameters()['min_stem_length']  # before the change
4
>>> lxa_object.change_parameters(min_stem_length=3)
>>> lxa_object.parameters()['min_stem_length']  # after the change
3
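
If a change is only needed for one experiment, the two methods shown so far, parameters() and change_parameters(), can be wrapped in a context manager that restores the previous values on exit. This is a sketch under stated assumptions: temporary_parameters() is a hypothetical helper that is not part of Linguistica, and it relies only on those two methods behaving as described above. How a parameter change affects results the object has already computed is not addressed here.

```python
from contextlib import contextmanager


@contextmanager
def temporary_parameters(lxa_object, **changes):
    """Apply parameter changes, then restore the earlier values on exit.

    Hypothetical helper; works with any object that offers parameters()
    and change_parameters(**kwargs) as described in this section.
    """
    current = lxa_object.parameters()
    saved = {name: current[name] for name in changes}
    lxa_object.change_parameters(**changes)
    try:
        yield lxa_object
    finally:
        # Put back exactly the values that were in place before.
        lxa_object.change_parameters(**saved)
```

Usage would look like `with temporary_parameters(lxa_object, min_stem_length=3): ...`, with min_stem_length returning to its earlier value after the block.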

To reset all parameters to their default values, use use_default_parameters():

>>> lxa_object.parameters()['min_stem_length']  # non-default value
3
>>> lxa_object.use_default_parameters()
>>> lxa_object.parameters()['min_stem_length']
4

linguistica.from_corpus(corpus_object, **kwargs)

Create a Linguistica object with a corpus object.

Parameters:
  • corpus_object – either a long string of text (with spaces separating word tokens) or a list of strings as word tokens.
  • kwargs – keyword arguments for parameters and their values.

linguistica.from_wordlist(wordlist_object, **kwargs)

Create a Linguistica object with a wordlist object.

Parameters:
  • wordlist_object – either a dict of word types (as strings) mapped to their token counts or an iterable of word types (as strings).
  • kwargs – keyword arguments for parameters and their values.

linguistica.read_corpus(file_path, encoding='utf8', **kwargs)

Create a Linguistica object with a corpus data file.

Parameters:
  • file_path – path of input corpus file.
  • encoding – encoding of the file at file_path. Default: 'utf8'
  • kwargs – keyword arguments for parameters and their values.

linguistica.read_wordlist(file_path, encoding='utf8', **kwargs)

Create a Linguistica object with a wordlist file.

Parameters:
  • file_path – path of input wordlist file where each line contains one word type (and, optionally, a whitespace plus the token count for that word).
  • encoding – encoding of the file at file_path. Default: 'utf8'
  • kwargs – keyword arguments for parameters and their values.
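
The wordlist file format described for read_wordlist() (one word type per line, optionally followed by whitespace and a token count) can be produced with a few lines of standard-library code. The sketch below is illustrative only: write_wordlist() is a hypothetical helper, not part of Linguistica, and the counts are made up. Sorting by descending count is a presentational choice here, not a requirement stated in this section.

```python
import os
import tempfile

# Made-up word types and token counts, for illustration only.
word_counts = {"the": 3, "sat": 2, "cat": 1}


def write_wordlist(word_counts, file_path, encoding="utf8"):
    """Write one 'word count' pair per line (hypothetical helper)."""
    # Most frequent words first, ties broken alphabetically.
    ordered = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    with open(file_path, "w", encoding=encoding) as f:
        for word, count in ordered:
            f.write("{} {}\n".format(word, count))


wordlist_path = os.path.join(tempfile.mkdtemp(), "wordlist.txt")
write_wordlist(word_counts, wordlist_path)

# Read the file back to show the resulting lines.
with open(wordlist_path, encoding="utf8") as f:
    lines = f.read().splitlines()
```

With Linguistica installed, the resulting file could then be passed as `lxa.read_wordlist(wordlist_path)`.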