Steps for building the database for Cantalausa's dictionary
Cantalausa project
The project covers two works by Abbot Louis Combes, known as Cantalausa: the dictionary and Lenga Viva.
Cantalausa dictionary
Cantalausa's dictionary comes as a set of .pdf files, one for each letter of the Occitan alphabet. For each letter we extract the content of the PDF as text, ending up with a batch of [letter].txt files in a dedicated directory.
These files are the raw material for populating the database.
cantalausa_cli usage
dc@xxx > target/debug/cantalausa_cli --help
Cantalausa dictionary text file Super Cleaner and Enhancer 1.0
Domenge Castel <domenge.chateau@free.fr>
Cleans, do some housekeeping to prepare a file for populating the Cantalausa dictionary database
USAGE:
cantalausa_cli [OPTIONS] --action <[lapexvwi] l => clean leading folio, a => assemble definition, p => populate database, e => populate term extension, x => process term extension, v => populate Lenga viva, w => wordization, i => indexation>
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
-a, --action <[lapexvwi]
l => clean leading folio,
a => assemble definition,
p => populate database,
e => populate term extension,
x => process term extension,
v => populate Lenga viva,
w => wordization,
i => indexation
>
l : removes the folio,
a => reassembles definition in one line when splitted,
p => populates database,
e => populates a table of term extensions,
x => term extension is cleaned and placed either in ext_term or pos field,
v => takes the categorie in the config file and populates the database w/ Cantalausa''s Lenga viva data,
w => populates an list of words and its occurences.
-c, --config <config> Get the basic configuration in a file
-f, --file <file> If set the name of the input file
-l, --letter <a lettera-z] Sets the letter to process, the text file will be [letter].txt ex: A.txt for letter 'a'
-o, --output <output>
Steps
- Remove the first line (the letter heading) from every letter text file, [letter].txt, in w/sources/Cantalausa/text_via_pdf.
- Detect and remove leading folios (page numbers), redirecting the output to [letter].out.
target/debug/cantalausa_cli -l Y -a l > Y.out
- Reassemble definitions that were split across several lines, redirecting the output to [letter].joined.out.
target/debug/cantalausa_cli -l A -a a -f A.out > A.joined.out
- Populate the database; set the flag on lines with multiple colons and add a trailing period where needed.
- Edit lines with multiple colons separately. Consider writing a JavaScript helper to detect them.
- Populate the table ext_term_pos with term extensions and their counts. A term may contain text in parentheses; this text is taken as an extension, and its counter is incremented on each occurrence.
target/debug/cantalausa_cli -a e
select count(*) from ext_term_pos; => 25345
- Export ext_term_pos to a spreadsheet, then manually set the POS (part of speech) value: 0 for an extension of the term, 1 for a POS, 2 for a reflexive pronoun.
- Set the flag dictionary_txt.to_check to true if a term extension exists. Scroll backward through ext_term_pos (the most frequent extension first) to check each extension and decide how to handle it. Depending on the pos value:
  - 0: the extension goes in the ext_term field;
  - 1: the extension goes in the pos field;
  - 2: nothing is done, the reflexive pronoun stays in place;
  - when done, the flag dictionary_txt.to_check is set to false.
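The extension handling described in the last steps can be sketched as follows. This is a minimal illustration, not the code of cantalausa_cli: the Entry struct, the function names, and the sample terms are all hypothetical, and it assumes that the extension is the text between the first "(" and the last ")" of the term.

```rust
/// Hypothetical entry shape; only the field names ext_term and pos come from the steps above.
#[derive(Debug, Default, PartialEq)]
struct Entry {
    term: String,
    ext_term: Option<String>,
    pos: Option<String>,
}

/// Splits "term (extension)" into the bare term and the extension, if any.
/// Assumption: the extension is whatever sits between the first '(' and the last ')'.
fn split_extension(raw: &str) -> (String, Option<String>) {
    match (raw.find('('), raw.rfind(')')) {
        (Some(open), Some(close)) if open < close => {
            let ext = raw[open + 1..close].trim().to_string();
            let bare = format!("{}{}", &raw[..open], &raw[close + 1..]);
            (bare.trim().to_string(), Some(ext))
        }
        _ => (raw.trim().to_string(), None),
    }
}

/// Routes the extension according to the manually assigned value:
/// 0 => ext_term field, 1 => pos field, 2 => the term is left as written.
fn apply_pos_value(raw_term: &str, pos_value: u8) -> Entry {
    let (bare, ext) = split_extension(raw_term);
    match (pos_value, ext) {
        (0, Some(ext)) => Entry { term: bare, ext_term: Some(ext), pos: None },
        (1, Some(ext)) => Entry { term: bare, ext_term: None, pos: Some(ext) },
        // 2 (reflexive pronoun), no extension, or an unknown value: keep the term untouched.
        _ => Entry { term: raw_term.trim().to_string(), ..Entry::default() },
    }
}

fn main() {
    // Made-up sample terms, one per pos value.
    for (raw, value) in [("mot (var.)", 0), ("mot (adj.)", 1), ("mot (se)", 2)] {
        println!("{:?}", apply_pos_value(raw, value));
    }
}
```

Keeping the split and the routing in two functions lets the counting step (-a e) and the processing step (-a x) share the same parsing rule.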
Cautions
- At step 4: always end an entry with a period.
- At step 5: in a multiple-colon line, a legal entry starts at the beginning of the line and further entries are delimited by '. ' (a period followed by a space). An entry MUST end with a period. Any remaining leading space must be trimmed.
- At step 5: watch for a colon erroneously written instead of a semicolon. Change the colon to a semicolon; the definition remains the same.
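As a rough aid for step 5, the rules above could be turned into a detection and splitting helper along these lines. The function names are hypothetical, and the sketch relies on one assumption that the source does not state: a chunk following '. ' starts a new entry only when it contains a colon; otherwise it is treated as another sentence of the previous definition. Colons written in place of semicolons still need a human eye.

```rust
/// Flags lines that need manual review because they hold more than one colon.
fn has_multiple_colons(line: &str) -> bool {
    line.matches(':').count() > 1
}

/// Splits a multiple-colon line into candidate entries.
/// Assumption: a chunk after ". " opens a new entry only if it contains a colon.
fn split_entries(line: &str) -> Vec<String> {
    let mut entries: Vec<String> = Vec::new();
    for chunk in line.split(". ") {
        // Trim the leading space left over by the split.
        let chunk = chunk.trim();
        if chunk.is_empty() {
            continue;
        }
        if chunk.contains(':') || entries.is_empty() {
            entries.push(chunk.to_string());
        } else if let Some(last) = entries.last_mut() {
            // No colon: keep it as a further sentence of the previous definition.
            last.push_str(". ");
            last.push_str(chunk);
        }
    }
    // An entry MUST end with a period.
    for entry in entries.iter_mut() {
        if !entry.ends_with('.') {
            entry.push('.');
        }
    }
    entries
}

fn main() {
    // Placeholder line, not taken from the dictionary.
    let line = "term1 : first definition. Second sentence. term2 : second definition";
    if has_multiple_colons(line) {
        for entry in split_entries(line) {
            println!("{}", entry);
        }
    }
}
```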
Lenga viva
- For every category a file is created containing all the phrases that belong to it.
- A routine populates the table lenga_viva (id, phrase, category). Trailing spaces are trimmed from phrase. The category is considered the primary one, since it is the author's original category. Lenga viva processing yields 11521 rows (phrases) and 27 categories.
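A minimal in-memory sketch of that routine is given below. It assumes one phrase per line in each category file and ids assigned sequentially from 1; the struct, the function name, and the sample data are illustrative only, and the real tool writes to the database rather than to a Vec.

```rust
/// Mirrors the lenga_viva (id, phrase, category) table.
#[derive(Debug, PartialEq)]
struct LengaVivaRow {
    id: u32,
    phrase: String,
    category: String,
}

/// Builds rows from (category, file content) pairs.
/// Assumptions: one phrase per line, blank lines skipped, ids start at 1.
fn build_rows(files: &[(&str, &str)]) -> Vec<LengaVivaRow> {
    let mut rows = Vec::new();
    let mut next_id = 1;
    for (category, content) in files {
        for line in content.lines() {
            // Trailing spaces are trimmed from the phrase.
            let phrase = line.trim_end();
            if phrase.is_empty() {
                continue;
            }
            rows.push(LengaVivaRow {
                id: next_id,
                phrase: phrase.to_string(),
                category: category.to_string(),
            });
            next_id += 1;
        }
    }
    rows
}

fn main() {
    // Placeholder categories and phrases.
    let files = [
        ("cat_a", "first phrase  \nsecond phrase\n"),
        ("cat_b", "third phrase "),
    ];
    for row in build_rows(&files) {
        println!("{:?}", row);
    }
}
```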
Indexing
Indexing populates the term_id table. Each row maps a term to a definition through the definition's id in the dictionary_txt table. The term field holds the masculine and feminine forms of the term in dictionary_txt, as well as other spellings or synonyms, all pointing to the same definition.
- Split the dictionary_txt.term field on "/"; one entry is created in term_id for each part, where term_id.term contains the split term and term_id.id contains the dictionary_txt.id.
- Split term_id.term into its masculine and feminine forms; the gender field is filled in and the id field is duplicated for both forms.
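The two splits can be sketched together as below. How the masculine and feminine forms are written inside a term is not specified here, so the sketch assumes a comma-separated pair ("masculine, feminine"); that convention, the struct, and the function name are assumptions made for illustration.

```rust
/// Mirrors the term_id table: several terms may share one dictionary_txt id.
#[derive(Debug, PartialEq)]
struct TermId {
    id: u32,
    term: String,
    /// 'm' or 'f' when the term was split by gender, None otherwise.
    gender: Option<char>,
}

/// Splits a dictionary_txt.term on "/" and then, by assumption,
/// splits each part on "," into a masculine and a feminine form.
fn index_term(dict_id: u32, raw_term: &str) -> Vec<TermId> {
    let mut rows = Vec::new();
    for variant in raw_term.split('/') {
        let forms: Vec<&str> = variant
            .split(',')
            .map(|f| f.trim())
            .filter(|f| !f.is_empty())
            .collect();
        match forms.as_slice() {
            [masc, fem] => {
                // The id is duplicated for both forms.
                rows.push(TermId { id: dict_id, term: masc.to_string(), gender: Some('m') });
                rows.push(TermId { id: dict_id, term: fem.to_string(), gender: Some('f') });
            }
            _ => {
                let term = variant.trim();
                if !term.is_empty() {
                    rows.push(TermId { id: dict_id, term: term.to_string(), gender: None });
                }
            }
        }
    }
    rows
}

fn main() {
    // Placeholder term with a gendered pair and one alternative spelling.
    for row in index_term(42, "form_a, form_b / form_c") {
        println!("{:?}", row);
    }
}
```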
Wordization
- Every word in a phrase of the lenga_viva table is inserted into the wordcount table if it does not already exist. For each occurrence of the word, lv_counter is incremented.
- Every word in a definition of the dictionary_txt table is inserted into the wordcount table if it does not already exist. For each occurrence of the word, dict_counter is incremented.
- Every word in the term_id table is inserted into the wordcount table if it does not already exist. For each occurrence of the word, dict_counter is incremented, since the term belongs to dictionary_txt.
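The counting logic can be illustrated with an in-memory map standing in for the wordcount table. The tokenization rule is an assumption (split on anything that is not a letter or digit, then lowercase), as are the type and function names; the actual tool may tokenize differently, in particular around apostrophes.

```rust
use std::collections::HashMap;

/// Mirrors one wordcount row, keyed by the word itself.
#[derive(Debug, Default, PartialEq)]
struct WordCount {
    lv_counter: u32,
    dict_counter: u32,
}

/// Where the text comes from: lenga_viva, or dictionary_txt / term_id.
#[derive(Clone, Copy)]
enum Source {
    LengaViva,
    Dictionary,
}

/// Assumed tokenization: split on non-alphanumeric characters and lowercase.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Inserts each word if missing, then increments the counter matching the source.
fn count_words(table: &mut HashMap<String, WordCount>, text: &str, source: Source) {
    for word in tokenize(text) {
        let entry = table.entry(word).or_default();
        match source {
            Source::LengaViva => entry.lv_counter += 1,
            Source::Dictionary => entry.dict_counter += 1,
        }
    }
}

fn main() {
    let mut table: HashMap<String, WordCount> = HashMap::new();
    // Made-up sample texts.
    count_words(&mut table, "lo can e lo gat", Source::LengaViva);
    count_words(&mut table, "Lo can.", Source::Dictionary);
    for (word, counts) in &table {
        println!("{} {:?}", word, counts);
    }
}
```

Terms from term_id would go through the same function with Source::Dictionary, since they increment dict_counter as well.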
To do
- Set up an interface to act on the definitions (the two actions are opposites):
  - Separate the definitions embedded in a big definition
  - Bring back the entries unduly separated from their original definition (the reverse of the previous action)