Steps for building the database for Cantalausa's dictionary


Categories: Development


Cantalausa project

The project covers two works by the abbot Louis Combes, aka Cantalausa: the dictionary and Lenga Viva.

Cantalausa dictionary

Cantalausa's dictionary comes as a set of .pdf files, one per letter of the Occitan alphabet. For each letter we extract the content of the PDF as text, ending up with a batch of [letter].txt files in a dedicated directory.

These files are the raw material for the process that populates the database.
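The post does not say which converter produces the text files; a minimal sketch of that step, assuming the poppler pdftotext tool, one PDF per letter, and the w/sources/Cantalausa/text_via_pdf directory used in the steps below (the pdf/ subdirectory is purely illustrative):

    use std::process::Command;

    fn main() -> std::io::Result<()> {
        // Assumed layout: one PDF per letter in pdf/, text output in text_via_pdf/.
        // Adjust the range for letters absent from the Occitan set.
        for letter in 'A'..='Z' {
            let pdf = format!("w/sources/Cantalausa/pdf/{letter}.pdf");
            let txt = format!("w/sources/Cantalausa/text_via_pdf/{letter}.txt");
            // -layout keeps the page layout, which makes the folio lines easier to spot later.
            let status = Command::new("pdftotext")
                .args(["-layout", &pdf, &txt])
                .status()?;
            if !status.success() {
                eprintln!("pdftotext failed for {pdf}");
            }
        }
        Ok(())
    }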

cantalausa_cli usage

dc@xxx > target/debug/cantalausa_cli --help

Cantalausa dictionary text file Super Cleaner and Enhancer 1.0
Domenge Castel <domenge.chateau@free.fr>
Cleans and does some housekeeping to prepare a file for populating the Cantalausa dictionary database

USAGE:
    cantalausa_cli [OPTIONS] --action <[lapexvwi] l =>  clean leading folio, a => assemble definition, p => populate database, e => populate term extension, x => process term extension, v => populate Lenga viva, w => wordization, i => indexation>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -a, --action <[lapexvwi]
    l => clean leading folio,
    a => assemble definition,
    p => populate database,
    e => populate term extension,
    x => process term extension,
    v => populate Lenga viva,
    w => wordization,
    i => indexation
    >
    l => removes the folio,
    a => reassembles a definition in one line when split,
    p => populates the database,
    e => populates a table of term extensions,
    x => term extension is cleaned and placed either in the ext_term or the pos field,
    v => takes the category from the config file and populates the database w/ Cantalausa's Lenga viva data,
    w => populates a list of words and their occurrences.
    -c, --config <config>              Get the basic configuration from a file
    -f, --file <file>                  If set, the name of the input file
    -l, --letter <a letter [a-z]>      Sets the letter to process; the text file will be [letter].txt, e.g. A.txt for letter 'a'
    -o, --output <output>

Steps

  1. Remove the first line (the letter heading) from every letter text file [letter].txt in w/sources/Cantalausa/text_via_pdf.
  2. Detect and remove the leading folios, piping the output into [letter].out.
    • target/debug/cantalausa_cli -l Y -a l > Y.out
  3. Assemble lines; the output is piped into [letter].joined.out.
    • target/debug/cantalausa_cli -l A -a a -f A.out > A.joined.out
  4. Populate the database, set the flag when an entry has multiple colons, and add a trailing period if needed.
  5. Edit the lines with multiple colons separately. Consider writing a JavaScript helper for detection (a detection sketch is given after this list).
  6. Populate a table ext_term_pos with the term extensions and their counts. A term may contain some text between parentheses; this is taken as an extension, and a counter is incremented for each occurrence (see the extraction sketch after this list).
    • target/debug/cantalausa_cli -a e
    • select count(*) from ext_term_pos; => 25345
  7. Export ext_term_pos to a spreadsheet, then manually set the POS value: 0 for an extension of the term, 1 for a POS, 2 for a reflexive pronoun.
  8. Set the flag dictionary_txt.to_check to true if a term extension exists. Scroll backward through ext_term_pos (most frequent extension first) to check each extension and decide how to handle it. Depending on the pos value:
    • 0: the extension goes into the ext_term field;
    • 1: the extension goes into the pos field;
    • 2: nothing is done, the reflexive pronoun stays in place;
    • when done, the flag dictionary_txt.to_check is set back to false.
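The helper mentioned in step 5 is only an idea in the note above; here is a minimal detection sketch, written in Rust rather than JavaScript, assuming that at this stage one entry occupies one line of a [letter].joined.out file. The file name is illustrative.

    use std::fs;
    use std::io;

    /// Report every line that contains more than one colon, with its line
    /// number, so that it can be edited by hand afterwards (step 5).
    fn report_multi_colon_lines(path: &str) -> io::Result<()> {
        let content = fs::read_to_string(path)?;
        for (n, line) in content.lines().enumerate() {
            let colons = line.matches(':').count();
            if colons > 1 {
                println!("{}: line {}: {} colons", path, n + 1, colons);
            }
        }
        Ok(())
    }

    fn main() -> io::Result<()> {
        report_multi_colon_lines("A.joined.out")
    }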
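For step 6, the code of the e action is not shown in this post; the sketch below only illustrates the idea of pulling the parenthesised part out of a term and counting how many times each extension occurs, which is what ends up in ext_term_pos. The sample terms are invented for the example.

    use std::collections::HashMap;

    /// Return the text between parentheses in a term, if any,
    /// e.g. "caminar (v.)" -> Some("v.").
    fn extension(term: &str) -> Option<&str> {
        let start = term.find('(')?;
        let end = term[start..].find(')')? + start;
        Some(term[start + 1..end].trim())
    }

    fn main() {
        // Invented sample terms; the real input is the term column of dictionary_txt.
        let terms = ["caminar (v.)", "polit (~ida)", "parlar (v.)"];
        let mut counts: HashMap<&str, u32> = HashMap::new();
        for term in terms {
            if let Some(ext) = extension(term) {
                *counts.entry(ext).or_insert(0) += 1;
            }
        }
        // Each (extension, count) pair corresponds to a row of ext_term_pos.
        for (ext, n) in &counts {
            println!("{ext}\t{n}");
        }
    }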

Cautions

  • At step 4: always end an entry with a period.
  • At step 5: in a multiple-colon line, a legal entry begins at the beginning of the line and further entries are delimited by '. ' (period followed by a space). An entry MUST end with a period. The remaining leading space must be trimmed.
  • At step 5: watch out for an erroneous colon written instead of a semicolon. Change the colon to a semicolon; the definition stays the same.

Lenga viva

  • For every category a file is created containing all the phrases that belong to it.
  • A routine populates a table lenga_viva (id, phrase, category); trailing spaces are trimmed from the phrase (see the sketch below). The category is considered the primary one since it is the original category from the author. Lenga viva processing yields 11521 rows/phrases and 27 categories.
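A minimal sketch of such a routine, assuming a SQLite database accessed through the rusqlite crate; the post does not say which database engine or driver is actually used, and the category name and phrase below are only illustrative.

    use rusqlite::{params, Connection, Result};

    /// Insert the phrases of one category file into lenga_viva (id, phrase, category),
    /// trimming the trailing spaces of each phrase; id is assumed to be auto-generated.
    fn insert_phrases(conn: &Connection, category: &str, phrases: &[&str]) -> Result<()> {
        let mut stmt = conn.prepare("INSERT INTO lenga_viva (phrase, category) VALUES (?1, ?2)")?;
        for phrase in phrases {
            stmt.execute(params![phrase.trim_end(), category])?;
        }
        Ok(())
    }

    fn main() -> Result<()> {
        // In-memory database only to keep the sketch self-contained.
        let conn = Connection::open_in_memory()?;
        conn.execute_batch("CREATE TABLE lenga_viva (id INTEGER PRIMARY KEY, phrase TEXT, category TEXT)")?;
        insert_phrases(&conn, "proverbis", &["Qual canta son mal encanta.  "])?;
        Ok(())
    }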

Indexing

Indexing populates the term_id table. This table maps terms to a definition by its id in the dictionary_txt table. It contains the masculine and feminine forms of the term in dictionary_txt, as well as other graphies or synonyms, all pointing to the same definition.

  1. Split the dictionary_txt.term field on "/"; an entry is created in term_id for each part, where term_id.term contains the split term and term_id.id contains the dictionary_txt.id (see the sketch after this list).
  2. Split term_id.term into its masculine and feminine forms; the gender field is filled in and the id field is duplicated for both forms.
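A sketch of the first split, under the assumption that the variants in dictionary_txt.term are simply separated by "/"; the example term and id are invented.

    /// One (id, term) pair per "/"-separated variant of dictionary_txt.term,
    /// each destined to become a row of term_id pointing at the same definition.
    fn split_term(id: i64, term: &str) -> Vec<(i64, String)> {
        term.split('/')
            .map(|variant| (id, variant.trim().to_string()))
            .collect()
    }

    fn main() {
        // Invented entry: two graphies sharing one definition id.
        for (id, term) in split_term(42, "primièr / primèr") {
            println!("term_id: id={id}, term={term}");
        }
    }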

Wordization

  • Every word of a phrase in the lenga_viva table is inserted into the wordcount table if it does not already exist. For each occurrence of the word, lv_counter is incremented.
  • Every word of a definition in the dictionary_txt table is inserted into the wordcount table if it does not already exist. For each occurrence of the word, dict_counter is incremented.
  • Every word of the term_id table is inserted into the wordcount table if it does not already exist. For each occurrence of the word, dict_counter is incremented, since the term belongs to dictionary_txt (a counting sketch follows below).
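A minimal in-memory sketch of that counting, assuming words are split on whitespace and stripped of surrounding punctuation (the post does not describe the exact tokenisation); the sample phrases are invented.

    use std::collections::HashMap;

    /// Per-word counters mirroring the wordcount table:
    /// .0 = lv_counter (Lenga viva), .1 = dict_counter (dictionary and term_id).
    fn count_words(counts: &mut HashMap<String, (u32, u32)>, text: &str, from_lenga_viva: bool) {
        for word in text.split_whitespace() {
            let w = word.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase();
            if w.is_empty() {
                continue;
            }
            let entry = counts.entry(w).or_insert((0, 0));
            if from_lenga_viva {
                entry.0 += 1;
            } else {
                entry.1 += 1;
            }
        }
    }

    fn main() {
        let mut counts = HashMap::new();
        // Invented inputs; the real inputs are lenga_viva.phrase, dictionary_txt definitions and term_id terms.
        count_words(&mut counts, "La lenga viva del país", true);
        count_words(&mut counts, "la lenga del trobador", false);
        for (w, (lv, dict)) in &counts {
            println!("{w}\tlv={lv}\tdict={dict}");
        }
    }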

To do

  • Set up an interface to act on the definitions (the two actions are contradictory):
    1. Separate the definitions embedded in a big definition.
    2. Bring back the entries unduly separated from their original definition (undo action 1).