Quick demonstration

This demonstration focuses on Linguistica 5 as a Python library. There are two other use modes: Graphical user interface (GUI) and Command line interface (CLI).

For the nature and format of datasets: Data

For more details about Linguistica 5 as a Python library: Python library and Full API documentation

The Basics

  • Importing the linguistica library
  • Using a built-in dataset (the English Brown corpus)
  • Creating a Linguistica object for a given dataset
import linguistica as lxa
from linguistica.datasets import brown

lxa_object = lxa.read_corpus(brown)

brown (str) is the file path of the English Brown corpus text file built in the Linguistica 5 library. If you would like to use a corpus text from the local drive, the argument for read_corpus() should take the file path (as a str) of your desired text file.

Methods with a Linguistica object

Sample uses: (1) word trigrams, (2) signatures to stems

(1) Word trigrams

trigrams = lxa_object.word_trigram_counter()

trigrams is a dict with word trigrams (each as a tuple) mapped to their respective counts.

for trigram, count_ in sorted(trigrams.items(), key=lambda x: x[1], reverse=True):
    print(trigram, count_)

    if count_ < 100:
        break
(',', 'and', 'the') 662
('one', 'of', 'the') 403
('the', 'united', 'states') 328
(',', 'however', ',') 321
(',', 'in', 'the') 266
('.', 's', '.') 266
(',', 'he', 'said') 257
('as', 'well', 'as') 238
('u', '.', 's') 235
(',', 'it', 'is') 234
(',', 'and', 'he') 225
('of', 'course', ',') 220
(',', 'of', 'course') 189
('some', 'of', 'the') 179
('the', 'u', '.') 176
('out', 'of', 'the') 174
('the', 'fact', 'that') 167
(',', 'but', 'the') 161
(',', 'mr', '.') 159
(',', 'and', 'a') 158
('for', 'example', ',') 153
('.', 'm', '.') 153
('the', 'end', 'of') 149
(',', 'but', 'he') 148
('part', 'of', 'the') 144
('he', 'said', ',') 143
('it', 'was', 'a') 143
('there', 'was', 'a') 142
('it', 'is', 'not') 136
('to', 'be', 'a') 133
('there', 'was', 'no') 132
(',', 'and', 'i') 132
(',', 'too', ',') 131
(',', 'it', 'was') 129
('there', 'is', 'a') 128
('of', 'the', 'united') 127
(',', 'with', 'the') 124
('a', 'number', 'of') 123
(',', 'mrs', '.') 121
('in', 'order', 'to') 120
(',', 'and', 'that') 120
(',', 'but', 'it') 120
(',', 'and', 'in') 119
('it', 'is', 'a') 114
('most', 'of', 'the') 114
('members', 'of', 'the') 110
(',', 'and', 'it') 109
(',', 'he', 'was') 109
('end', 'of', 'the') 108
('of', 'the', 'new') 107
('it', 'would', 'be') 107
(',', 'for', 'the') 106
('the', 'number', 'of') 104
('there', 'is', 'no') 104
('he', 'did', 'not') 103
('at', 'the', 'same') 103
('.', 'c', '.') 102
(',', 'and', 'then') 102
(',', 'she', 'said') 102
('the', 'use', 'of') 102
('in', 'fact', ',') 101
('on', 'the', 'other') 100
('he', 'said', '.') 100
(',', 'on', 'the') 99

Given trigrams is a dict that maps something to counts, it is natural to convert it to a Counter instance (via the collections module in the standard library) and take advantage of the methods available (e.g., most_common(k) for accessible the most common k items).

(2) Signatures to stems

sigs_to_stems = lxa_object.signatures_to_stems()
for sig, stems in sorted(sigs_to_stems.items(), key=lambda x: len(x[1]), reverse=True):
    print(sig, len(stems))

    if len(stems) < 50:
        break
('NULL', 's') 2327
("'s", 'NULL') 813
('NULL', 'ly') 587
('NULL', 'd', 's') 346
('NULL', 'd') 314
('ed', 'ing') 197
("'", 'NULL') 190
("'s", 'NULL', 's') 181
('d', 's') 175
('ies', 'y') 173
('NULL', 'ed', 'ing', 's') 151
('NULL', 'ed') 134
('NULL', 'ed', 'ing') 130
('e', 'ed', 'es', 'ing') 130
('NULL', 'ing') 105
('d', 'r') 98
('e', 'y') 95
('e', 'ed', 'ing') 88
('ng', 'on') 85
('NULL', 'ed', 's') 82
('NULL', 'ly', 'ness') 74
("'", 'g') 72
('d', 'r', 'rs') 66
('NULL', 'es') 63
('NULL', 'ness') 60
('ng', 'on', 'ons') 57
('NULL', 'e') 51
('NULL', 'ally') 47

For all methods available to a Linguistica objects: Full API documentation