Steps for building the database for Cantalausa's dictionary
Cantalausa project
The project covers two works by the abbot Louis Combes, known as Cantalausa: the dictionary and Lenga Viva.
Cantalausa dictionary
Cantalausa's dictionary comes as a set of .pdf files, one for each letter of the Occitan alphabet. For each letter we extract the content of the PDF as text, ending up with a batch of [letter].txt files in a dedicated directory.
These files are the raw material for populating the database.
cantalausa_cli usage
dc@xxx > target/debug/cantalausa_cli --help
Cantalausa dictionary text file Super Cleaner and Enhancer 1.0
Domenge Castel <domenge.chateau@free.fr>
Cleans, do some housekeeping to prepare a file for populating the Cantalausa dictionary database
USAGE:
cantalausa_cli [OPTIONS] --action <[lapexvwi] l => clean leading folio, a => assemble definition, p => populate database, e => populate term extension, x => process term extension, v => populate Lenga viva, w => wordization, i => indexation>
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
-a, --action <[lapexvwi]
l => clean leading folio,
a => assemble definition,
p => populate database,
e => populate term extension,
x => process term extension,
v => populate Lenga viva,
w => wordization,
i => indexation
>
l : removes the folio,
a => reassembles definition in one line when splitted,
p => populates database,
e => populates a table of term extensions,
x => term extension is cleaned and placed either in ext_term or pos field,
v => takes the categorie in the config file and populates the database w/ Cantalausa''s Lenga viva data,
w => populates an list of words and its occurences.
-c, --config <config> Get the basic configuration in a file
-f, --file <file> If set the name of the input file
-l, --letter <a lettera-z] Sets the letter to process, the text file will be [letter].txt ex: A.txt for letter 'a'
-o, --output <output>
Steps
- Remove the first line (the letter heading) from every letter text file, [letter].txt, in w/sources/Cantalausa/text_via_pdf.
- Detect and remove leading folios, redirecting the output to [letter].out.
target/debug/cantalausa_cli -l Y -a l > Y.out
- Assemble split lines; the output is redirected to [letter].joined.out.
target/debug/cantalausa_cli -l A -a a -f A.out > A.joined.out
- Populate the database, setting the flag on lines with multiple colons and adding a trailing period where needed.
- Edit lines with multiple colons separately. Consider writing a JavaScript helper for detection.
- Populate a table ext_term_pos with term extensions and their count. A term may contain text in parentheses; this is taken as an extension, and its counter is incremented.
target/debug/cantalausa_cli -a e
select count(*) from ext_term_pos; => 25345
- Export ext_term_pos to a spreadsheet, then manually set the POS value: 0 for an extension of the term, 1 for a POS, 2 for a reflexive pronoun.
- Set the flag dictionary_txt.to_check to true if a term extension exists. Scroll backward (the most frequent extension first) through ext_term_pos to check each extension and decide how to handle it. Depending on the pos value:
  - 0: the extension goes in the ext_term field;
  - 1: the extension goes in the pos field;
  - 2: nothing is done, the reflexive pronoun stays in place;
  - when done, the flag dictionary_txt.to_check is set to false.
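The extension handling in the last steps can be sketched in Rust as follows. This is only an illustration under stated assumptions, not the code of cantalausa_cli: the names split_extension, count_extensions, apply_extension and the Entry struct are hypothetical, each term is assumed to carry at most one parenthesised group, and the sample terms used with it are invented rather than taken from the dictionary.

```rust
use std::collections::HashMap;

/// Splits a raw term into (base term, optional extension in parentheses).
/// Assumption: at most one "(...)" group per term.
fn split_extension(raw: &str) -> (String, Option<String>) {
    match (raw.find('('), raw.rfind(')')) {
        (Some(open), Some(close)) if open < close => {
            let ext = raw[open + 1..close].trim().to_string();
            // Rebuild the term without the parenthesised group,
            // then collapse any leftover whitespace.
            let joined = format!("{}{}", &raw[..open], &raw[close + 1..]);
            let base = joined.split_whitespace().collect::<Vec<_>>().join(" ");
            (base, Some(ext))
        }
        _ => (raw.trim().to_string(), None),
    }
}

/// Counts how often each extension occurs, as the ext_term_pos table does.
fn count_extensions<'a, I>(terms: I) -> HashMap<String, usize>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts = HashMap::new();
    for term in terms {
        if let (_, Some(ext)) = split_extension(term) {
            *counts.entry(ext).or_insert(0) += 1;
        }
    }
    counts
}

/// Hypothetical in-memory view of a dictionary_txt row.
#[derive(Debug, PartialEq)]
struct Entry {
    term: String,
    ext_term: Option<String>,
    pos: Option<String>,
    to_check: bool,
}

/// Routes the extension according to the manually assigned pos value:
/// 0 => ext_term field, 1 => pos field, 2 => leave the term untouched.
/// Assumption: any other value leaves the row flagged for checking.
fn apply_extension(raw_term: &str, pos_value: u8) -> Entry {
    let (base, ext) = split_extension(raw_term);
    let mut entry = Entry {
        term: raw_term.trim().to_string(),
        ext_term: None,
        pos: None,
        to_check: false,
    };
    match (ext, pos_value) {
        (Some(ext), 0) => {
            entry.term = base;
            entry.ext_term = Some(ext);
        }
        (Some(ext), 1) => {
            entry.term = base;
            entry.pos = Some(ext);
        }
        (Some(_), 2) => {} // reflexive pronoun stays in place
        (Some(_), _) => entry.to_check = true,
        (None, _) => {}
    }
    entry
}
```

For instance, apply_extension("ostal (n. m.)", 1) would move "n. m." into the pos field and leave "ostal" as the term, while a pos value of 2 returns the term unchanged.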
Cautions
- At step 4 (populating the database): always end an entry with a period.
- At step 5 (editing multiple-colon lines): a legal entry begins at the beginning of a line, and further entries are delimited by '. ' (a period followed by a space). An entry MUST end with a period. Any remaining leading space must be trimmed.
- At step 5: watch out for a colon written by mistake instead of a semicolon. Change the colon to a semicolon; the definition remains the same.
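The rules above for multiple-colon lines can be sketched as a small Rust helper. This is a minimal illustration, not the project's code: has_multiple_colons and split_entries are hypothetical names, and the heuristic that a fragment only starts a new entry when it contains a colon is an assumption made here so that sentences inside a definition are not cut off.

```rust
/// True when a line holds more than one colon, which suggests
/// that several entries share the same line.
fn has_multiple_colons(line: &str) -> bool {
    line.matches(':').count() > 1
}

/// Splits a line into entries delimited by ". " (period + space).
/// Assumption: a fragment without a colon is a continuation of the
/// previous definition, not a new entry.
fn split_entries(line: &str) -> Vec<String> {
    let mut entries: Vec<String> = Vec::new();
    for fragment in line.split(". ") {
        // Trim the leading space left over by the split.
        let fragment = fragment.trim();
        if fragment.is_empty() {
            continue;
        }
        if entries.is_empty() || fragment.contains(':') {
            entries.push(fragment.to_string());
        } else if let Some(last) = entries.last_mut() {
            // Restore the delimiter consumed by the split.
            last.push_str(". ");
            last.push_str(fragment);
        }
    }
    // Every entry must end with a period.
    for entry in entries.iter_mut() {
        if !entry.ends_with('.') {
            entry.push('.');
        }
    }
    entries
}
```

A misplaced colon inside a definition would still trigger a split here, which is why such lines are better reviewed by hand, as the cautions say.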
Lenga viva
- For every category a file is created containing all the phrases that belong to it.
- A routine is written to populate a table lenga_viva (id, phrase, category). Trailing spaces are trimmed from phrase. The category is considered the primary one, since it is the author's original category. Lenga viva processing gives 11521 rows/phrases & 27 categories.
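As a rough sketch of that routine, the Rust function below turns the content of one category file into rows shaped like lenga_viva (id, phrase, category). The names LengaVivaRow and rows_for_category are hypothetical, and it assumes one phrase per line and sequential ids, neither of which is stated above.

```rust
/// Hypothetical in-memory row mirroring lenga_viva (id, phrase, category).
#[derive(Debug, PartialEq)]
struct LengaVivaRow {
    id: u32,
    phrase: String,
    category: String,
}

/// Builds rows from the text of one category file.
/// Assumptions: one phrase per line, blank lines skipped,
/// ids assigned sequentially starting at `first_id`.
fn rows_for_category(category: &str, content: &str, first_id: u32) -> Vec<LengaVivaRow> {
    content
        .lines()
        .map(str::trim_end) // trailing spaces are trimmed from the phrase
        .filter(|line| !line.trim().is_empty())
        .zip(first_id..)
        .map(|(phrase, id)| LengaVivaRow {
            id,
            phrase: phrase.to_string(),
            category: category.to_string(),
        })
        .collect()
}
```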
Indexing
Indexing populates the term_id table. This table maps terms to a definition through its id in the dictionary_txt table. The term field holds the masculine and feminine forms of the term in dictionary_txt, as well as other spellings or synonyms, all pointing to the same definition.
- Split the dictionary_txt.term field on the "/" separator; an entry is created in term_id for each part: term_id.term contains the split term and term_id.id contains the dictionary_txt.id.
- Split term_id.term into its masculine and feminine forms; the gender field is filled in and the id field is duplicated for both forms.
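The first indexing step can be sketched in Rust as below. TermIdRow and index_rows are hypothetical names and the id is assumed to be an unsigned integer. The masculine/feminine split is left out on purpose, because the way the two forms are written in a term is not described here.

```rust
/// Hypothetical in-memory row mirroring term_id (term, id).
#[derive(Debug, PartialEq)]
struct TermIdRow {
    term: String,
    id: u32,
}

/// Splits a dictionary_txt.term value on "/" and creates one row per
/// variant, each pointing to the same dictionary_txt.id.
fn index_rows(dictionary_id: u32, term_field: &str) -> Vec<TermIdRow> {
    term_field
        .split('/')
        .map(str::trim)
        .filter(|variant| !variant.is_empty())
        .map(|variant| TermIdRow {
            term: variant.to_string(),
            id: dictionary_id,
        })
        .collect()
}
```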
Wordization
- Every word in a phrase of the lenga_viva table is inserted into the wordcount table if it does not already exist. For each occurrence of the word, lv_counter is incremented.
- Every word in a definition of the dictionary_txt table is inserted into the wordcount table if it does not already exist. For each occurrence of the word, dict_counter is incremented.
- Every word in the term_id table is inserted into the wordcount table if it does not already exist. For each occurrence of the word, dict_counter is incremented, as the term belongs to dictionary_txt.
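The counting logic above can be sketched in Rust with a map standing in for the wordcount table. This is an illustration only: WordCount, Source, words and count_words are hypothetical names, and the tokenisation (split on any non-alphabetic character, then lowercase) is an assumption that the real tool may well handle differently, for example around apostrophes or the interpunct.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory row of the wordcount table.
#[derive(Debug, Default, PartialEq)]
struct WordCount {
    lv_counter: u32,
    dict_counter: u32,
}

/// Where the text being counted comes from.
#[derive(Clone, Copy)]
enum Source {
    LengaViva,
    Dictionary, // dictionary_txt definitions and term_id terms
}

/// Assumed tokenisation: split on non-alphabetic characters, lowercase.
fn words(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Inserts each word if missing, then increments the counter
/// that matches the source of the text.
fn count_words(table: &mut HashMap<String, WordCount>, text: &str, source: Source) {
    for word in words(text) {
        let row = table.entry(word).or_default();
        match source {
            Source::LengaViva => row.lv_counter += 1,
            Source::Dictionary => row.dict_counter += 1,
        }
    }
}
```

Calling count_words once per phrase with Source::LengaViva and once per definition or term with Source::Dictionary fills both counters in a single table.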
To do
- Build an interface to act on the definitions (the two actions are opposites):
- Separate the definitions embedded in a big definition
- Bring back the entries unduly separated from their original definition (the reverse of the first action)