Python library¶
To use Linguistica 5 as a Python library, the first step is to initialize a Linguistica object. How to do this depends on the nature of your data source:
Data source¶
Function | Description |
---|---|
read_corpus(file_path[, encoding]) | Create a Linguistica object with a corpus data file. |
read_wordlist(file_path[, encoding]) | Create a Linguistica object with a wordlist file. |
from_corpus(corpus_object, **kwargs) | Create a Linguistica object with a corpus object. |
from_wordlist(wordlist_object, **kwargs) | Create a Linguistica object with a wordlist object. |
For instance, if the Brown corpus is available on your local drive (see Raw corpus text):
>>> import linguistica as lxa
>>> lxa_object = lxa.read_corpus('path/to/english-brown.txt')
Use read_wordlist() if you have a wordlist text file instead (see Wordlist). Use from_corpus() or from_wordlist() if your data is an in-memory Python object (either a corpus text or a wordlist).
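As a minimal sketch of the in-memory route: per the from_wordlist() reference below, a wordlist object can be a dict mapping word types to their token counts, which is easy to build from raw text with collections.Counter (the sample text here is made up for illustration):

```python
from collections import Counter

# Build a wordlist: word types mapped to their token counts --
# the dict form accepted by lxa.from_wordlist().
text = "the cat sat on the mat"
wordlist = Counter(text.split())

print(wordlist["the"])  # 2

# Then (assuming linguistica is installed):
# import linguistica as lxa
# lxa_object = lxa.from_wordlist(wordlist)
```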
Parameters¶
The functions introduced in Data source all accept optional keyword arguments that set parameters of the Linguistica object. Different Linguistica modules make use of different parameters; see Full API documentation.
For example, to deal with only the first 500,000 word tokens in the Brown corpus:
>>> import linguistica as lxa
>>> lxa_object = lxa.read_corpus('path/to/english-brown.txt', max_word_tokens=500000)
Parameter | Meaning | Default |
---|---|---|
max_word_tokens | maximum number of word tokens to be handled | 0 (= all) |
max_word_types | maximum number of word types to be handled | 1000 |
min_stem_length | minimum stem length | 4 |
max_affix_length | maximum affix length | 4 |
min_sig_count | minimum number of stems for a valid signature | 5 |
min_context_count | minimum number of occurrences for a valid context | 3 |
n_neighbors | number of syntactic word neighbors | 9 |
n_eigenvectors | number of eigenvectors (in dimensionality reduction) | 11 |
suffixing | whether the language is suffixing | 1 (= yes) |
keep_case | whether case distinctions (“the” vs “The”) are kept | 0 (= no) |
The method parameters() returns the parameters and their values as a dict:
>>> from pprint import pprint
>>> pprint(lxa_object.parameters())
{'keep_case': 0,
'max_affix_length': 4,
'max_word_tokens': 0,
'max_word_types': 1000,
'min_context_count': 3,
'min_sig_count': 5,
'min_stem_length': 4,
'n_eigenvectors': 11,
'n_neighbors': 9,
'suffixing': 1}
To change one or more parameters of a Linguistica object, use change_parameters() with keyword arguments:
>>> lxa_object.parameters()['min_stem_length'] # before the change
4
>>> lxa_object.change_parameters(min_stem_length=3)
>>> lxa_object.parameters()['min_stem_length'] # after the change
3
To reset all parameters to their default values, use use_default_parameters():
>>> lxa_object.parameters()['min_stem_length'] # non-default value
3
>>> lxa_object.use_default_parameters()
>>> lxa_object.parameters()['min_stem_length']
4
linguistica.from_corpus(corpus_object, **kwargs)¶
Create a Linguistica object with a corpus object.
Parameters:
- corpus_object – either a long string of text (with spaces separating word tokens) or a list of strings as word tokens
- kwargs – keyword arguments for parameters and their values
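The two accepted forms of corpus_object are interchangeable: a whitespace-separated string, or the list of word tokens obtained by splitting it. A minimal sketch (the sample sentence is made up for illustration):

```python
# A corpus as one long string of space-separated tokens...
corpus_str = "the quick brown fox jumps over the lazy dog"
# ...or as the equivalent list of word tokens.
corpus_tokens = corpus_str.split()

print(len(corpus_tokens))  # 9

# Either form can be passed (assuming linguistica is installed):
# import linguistica as lxa
# lxa_object = lxa.from_corpus(corpus_str)
# lxa_object = lxa.from_corpus(corpus_tokens)
```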
linguistica.from_wordlist(wordlist_object, **kwargs)¶
Create a Linguistica object with a wordlist object.
Parameters:
- wordlist_object – either a dict of word types (as strings) mapped to their token counts or an iterable of word types (as strings)
- kwargs – keyword arguments for parameters and their values
linguistica.read_corpus(file_path, encoding='utf8', **kwargs)¶
Create a Linguistica object with a corpus data file.
Parameters:
- file_path – path of the input corpus file
- encoding – encoding of the file at file_path. Default: 'utf8'
- kwargs – keyword arguments for parameters and their values
linguistica.read_wordlist(file_path, encoding='utf8', **kwargs)¶
Create a Linguistica object with a wordlist file.
Parameters:
- file_path – path of the input wordlist file, where each line contains one word type (and, optionally, whitespace plus the token count for that word)
- encoding – encoding of the file at file_path. Default: 'utf8'
- kwargs – keyword arguments for parameters and their values
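To make the wordlist file format concrete, here is a sketch of parsing one line in that format: a word type, optionally followed by whitespace and its token count. (The parser and its default count of 1 for count-less lines are illustrative assumptions, not the library's actual reader; the counts shown are made up.)

```python
def parse_wordlist_line(line):
    """Parse 'word [count]' -- one line of a wordlist file."""
    parts = line.split()
    word = parts[0]
    # Assumption for this sketch: treat a missing count as 1.
    count = int(parts[1]) if len(parts) > 1 else 1
    return word, count

print(parse_wordlist_line("the 69971"))  # ('the', 69971)
print(parse_wordlist_line("cat"))        # ('cat', 1)
```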