Calelh

Domenge published on

Categories: Development


https://calelh.osca.dev

Calelh presentation

The Calelh application is a digitized edition of Louis Alibert's dictionary. The digital form gives new luster to the linguist's remarkable work. At present his dictionary is available only in a facsimile edition; however, a digital version of the second part (entries and definitions) was typed up by the Paul Valery University of Montpellier, and it is this data that was used to populate the database.

The first part of the dictionary, the booklet, was typed up by us as part of the Calelh project.

Alibert's work concentrates on listing lemmas as entries. Each lemma is expanded into all the terms it produces by derivation or composition. Data processing starts from the lemma and lays out everything it produces in ontological form.

The booklet (first introductory pages)

There is plenty of information in the first pages of a dictionary.

Those pages are organized according to an ensenhador (a table of contents). The site mimics the book by following the same organization.

The booklet has four parts:

  • Phonetic mutations of the Lengadocian dialect, which sets out its linguistic terminology;
  • Morphology (how popular Occitan words are formed);
  • How Greek and Latin words are used to form scientific and scholarly Occitan words;
  • The list of the abbreviations according to their type.

Throughout the booklet, the text is enriched for easier reading: recommended forms are highlighted and distinguished from the other forms in use. Under the hood, the markup language and CSS isolate and tag the terms so that they can be extracted easily and reliably.
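
As a rough illustration of how such markup can support extraction, the sketch below collects the text of every element tagged with a given CSS class. The class names `recommended` and `used`, the `span` elements, and the placeholder terms are assumptions made for this example; they are not the site's actual markup.

```python
from html.parser import HTMLParser


class TermExtractor(HTMLParser):
    """Collect the text of elements carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []    # text pieces of the term being read
        self.terms = []     # finished terms

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Enter a matching element, or track nesting once inside one.
        # Simplification: assumes no void elements (e.g. <br>) inside a term.
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.terms.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_terms(html, css_class="recommended"):
    parser = TermExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.terms


# Hypothetical snippet with placeholder terms.
sample = (
    '<p>Write <span class="recommended">forma-a</span>, '
    'not <span class="used">forma-b</span>.</p>'
)
print(extract_terms(sample))
```

Because each kind of term carries its own class, the same function can pull out the other forms simply by passing a different `css_class`.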

Types of abbreviation

  • POS for part of speech,
  • LOC for the word localization,
  • STRUCT to qualify the structure of the definitions,
  • ACCEPTION to disambiguate the different meanings,
  • META for contextual information not yet exploited.

STRUCT

derv: derivation
comp: composition
etym: etymology
pos: part of speech
loc: localization
vrnt: variant
syn: synonym
f: French
cmnt: comment
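
In code, the STRUCT list above amounts to a small lookup table. The sketch below is only one way to hold it; the names `STRUCT_TAGS` and `expand_tag` are assumptions for illustration, while the codes and their meanings come from the list itself.

```python
# STRUCT codes and their meanings, as listed above.
STRUCT_TAGS = {
    "derv": "derivation",
    "comp": "composition",
    "etym": "etymology",
    "pos": "part of speech",
    "loc": "localization",
    "vrnt": "variant",
    "syn": "synonym",
    "f": "French",
    "cmnt": "comment",
}


def expand_tag(tag):
    """Return the full name of a STRUCT code; raise KeyError if it is unknown."""
    return STRUCT_TAGS[tag]


print(expand_tag("etym"))
```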

Dictionary

By letter list

This menu is a list box showing each Occitan letter with two numeric badges alongside.

  • The first shows the total number of entries;
  • The second shows the number of corrected and validated entries.

Clicking an item displays the full list of entries for the corresponding letter: this is the letter's page. Entries still to be corrected are listed first, marked with an ❗ exclamation mark icon; validated ones show a ☑️ check mark icon.
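
The two badge counts and the ordering of a letter's page can be sketched as follows. The `Entry` record, its `corrected` field, the sample headwords, and the alphabetical order inside each group are assumptions for illustration; the real database schema is not described here.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    headword: str
    corrected: bool = False


def badges(entries):
    """Return (total entries, corrected entries) for one letter."""
    return len(entries), sum(e.corrected for e in entries)


def letter_page(entries):
    """Entries still to correct come first, then the validated ones.

    Inside each group the order is plain code-point order on the headword,
    which is a simplification for accented letters.
    """
    return sorted(entries, key=lambda e: (e.corrected, e.headword))


def icon(entry):
    return "☑️" if entry.corrected else "❗"


# Sample data for the letter A (illustrative).
entries = [Entry("abelha", True), Entry("abat"), Entry("abans", True)]
print(badges(entries))
for e in letter_page(entries):
    print(icon(e), e.headword)
```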

Clicking on an entry launches the editor page.

Entry visualization

The visualization page of an entry displays it in ontological form. The definition is broken down into its components; see STRUCT under the types of abbreviation.

Starting from the headword, or lemma, shown as a star at the center of the ontology, a network of links spreads out to each component (derv, etym, comp, pos) according to a hierarchy driven by a formal grammar. The hierarchy is encoded in a YAML formula that can be edited in an edit box on the left side of the window. Once the YAML formula has been corrected and saved, the ☑️ (corrected) flag is set.
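
To make the idea concrete, here is a minimal sketch that turns an already-parsed formula into the labeled links of such a network. The nested-dictionary layout, the placeholder terms, and the function name `ontology_edges` are assumptions; the actual YAML schema and grammar used by Calelh are not shown in this article.

```python
def ontology_edges(node, parent=None, label=None):
    """Yield (parent, label, child) links by walking a nested entry."""
    term = node["term"]
    if parent is not None:
        yield (parent, label, term)
    for key, value in node.items():
        if key == "term":
            continue
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, dict):
                # A sub-entry with its own components (e.g. a derived word).
                yield from ontology_edges(child, term, key)
            else:
                # A plain value such as a part of speech or an etymon.
                yield (term, key, child)


# Hypothetical parsed formula; in practice it would come from a YAML parser.
entry = {
    "term": "lemma",
    "pos": "s.",
    "etym": "etymon",
    "derv": [
        {"term": "derived-1", "pos": "adj."},
        {"term": "derived-2"},
    ],
    "comp": ["compound-1"],
}

for edge in ontology_edges(entry):
    print(edge)
```

Each tuple is one link of the graph, so the list can be handed to whatever drawing library renders the star-shaped network.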

Future developments

A paper edition may be generated from the database; a LaTeX document will be produced for it.

The ontological image (a bitmap) could be inserted if the entry structure is complex or interesting enough.

The YAML formula will be converted into a true, well-formed dictionary entry, with color-coded tags (loc, pos, etym, …) as a bonus.

In the digital edition, synonyms will be reachable by URL.

Statistics

Once all the terms have been encoded, statistics can easily be computed along the identified ontological links.

Morphological analyzer

For experimental use only.

Part III of the booklet lists the Greek and Latin prefixes and suffixes of the Occitan language. These are inflections or endings that attach to lemmas to coin new words: this is morphology.

In the digital edition, those lists are catalogued and supplemented with regular expressions that isolate the prefixes and suffixes within a word. Each entry in the list thus becomes a rule: the Occitan ending, etymon, and examples given by Alibert, plus the matching regex.

The morphological engine applies every rule to the word and keeps all the rules that match.
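
A minimal sketch of such an engine, assuming each rule is stored as a small record with a compiled regular expression. The `Rule` fields, the two sample rules, their patterns, and the test word are written for this illustration; they are not taken from Alibert's lists or from the Calelh database.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    affix: str            # Occitan prefix or suffix
    etymon: str           # Greek or Latin source
    pattern: re.Pattern   # regex isolating the affix within a word


# Illustrative rules only.
RULES = [
    Rule("oto-", "Greek ous, otos (ear)", re.compile(r"^(oto)(.+)$")),
    Rule("-logia", "Greek logos (discourse)", re.compile(r"^(.+)(logia)$")),
]


def analyze(word, rules=RULES):
    """Apply every rule once; return (rule, captured parts) for each match."""
    results = []
    for rule in rules:
        match = rule.pattern.match(word)
        if match:
            results.append((rule, match.groups()))
    return results


for rule, parts in analyze("otologia"):
    print(rule.affix, rule.etymon, parts)
```

Returning every matching rule, rather than stopping at the first, mirrors the description above: the word is tested against the whole list and all applicable analyses are kept.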

Limitations

The morphological analyzer is still at an early stage of development, and some rough edges rule out any serious use.

Regular expressions are applied only once, even when a word contains more than one prefix or suffix. For example, otorinolaringologia (otorhinolaryngology) cannot be resolved into oto rino laringo logia; only the prefix 'oto' is isolated, among other analyses.

Further algorithm tuning is expected.
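
The limitation can be reproduced in a few lines, along with one possible direction for the tuning. The prefix list and both functions are assumptions for illustration; they are neither the analyzer's actual rules nor a committed design.

```python
import re

# Illustrative prefix list, not Alibert's.
PREFIXES = ["oto", "rino", "laringo"]
PREFIX_RE = re.compile(r"^(" + "|".join(PREFIXES) + r")(.+)$")


def split_once(word):
    """Single pass: at most one prefix is isolated."""
    m = PREFIX_RE.match(word)
    return list(m.groups()) if m else [word]


def split_repeatedly(word):
    """Keep stripping a prefix from the remainder until none matches."""
    parts = []
    rest = word
    while (m := PREFIX_RE.match(rest)):
        parts.append(m.group(1))
        rest = m.group(2)
    parts.append(rest)
    return parts


word = "otorinolaringologia"
print(split_once(word))
print(split_repeatedly(word))
```

The single pass stops after 'oto', whereas re-applying the same expression to what is left peels off the prefixes one at a time.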
