One of the key challenges that lemon attempts to solve is the attachment of linguistic information to an ontology, which is essential for providing linguistic annotations to text. This is achieved by using a linguistic taxonomy, also known as a linguistic description ontology or data category registry (Romary, 2010). Examples of such resources include the GOLD ontology and the ISOcat data category registry. For example, indicating that “cat” has a singular canonical form “cat” and a plural form “cats” can be done as follows.
:cat lemon:canonicalForm [ lemon:writtenRep "cat"@en ;
                           lemon:property isocat:singular ] .
:cat lemon:otherForm [ lemon:writtenRep "cats"@en ;
                       lemon:property isocat:plural ] .
In practice it is neither feasible nor desirable to impose a complete set of all possible annotations, so lemon does not attempt to do so and is instead intended to be used with an external source of linguistic categories. lemon itself defines only the structure and relations of the lexicon; properties from data categories are introduced by asserting them as sub-properties of existing lemon properties (in particular lemon:property). The following example marks “ICBM” as a noun and an initialism, using sub-properties of lemon:property.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
:ICBM isocat:partOfSpeech isocat:noun ;
      isocat:termType isocat:initialism .
isocat:partOfSpeech rdfs:subPropertyOf lemon:property .
isocat:termType rdfs:subPropertyOf lemon:property .
These sub-property declarations only need to be made once per lexicon. An alternative is to use a controlled vocabulary for lemon, such as LexInfo 2, which defines a practical set of data categories for general NLP tasks.
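To illustrate why the sub-property declarations matter, the following minimal Python sketch treats the ICBM example above as a plain set of triples and recovers every linguistic annotation of an entry by following rdfs:subPropertyOf up to lemon:property. The tuple representation and the names `TRIPLES`, `is_subproperty_of` and `linguistic_properties` are assumptions made for illustration and are not part of lemon; a real application would more likely rely on an RDF library or a reasoner.

```python
# Illustrative only: triples are plain (subject, predicate, object) tuples,
# with the prefixed names from the ICBM example treated as opaque strings.
SUBPROPERTY = "rdfs:subPropertyOf"

TRIPLES = {
    (":ICBM", "isocat:partOfSpeech", "isocat:noun"),
    (":ICBM", "isocat:termType", "isocat:initialism"),
    ("isocat:partOfSpeech", SUBPROPERTY, "lemon:property"),
    ("isocat:termType", SUBPROPERTY, "lemon:property"),
}


def is_subproperty_of(prop, ancestor, triples):
    """Return True if prop equals ancestor or reaches it via rdfs:subPropertyOf."""
    seen = set()
    stack = [prop]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue  # guard against cyclic sub-property declarations
        seen.add(current)
        stack.extend(o for s, p, o in triples if s == current and p == SUBPROPERTY)
    return False


def linguistic_properties(entry, triples):
    """Collect (property, value) pairs of entry whose property is a lemon:property."""
    return sorted(
        (p, o)
        for s, p, o in triples
        if s == entry and is_subproperty_of(p, "lemon:property", triples)
    )


print(linguistic_properties(":ICBM", TRIPLES))
```

Because the lookup goes through lemon:property, the application does not need to know in advance which data category registry the lexicon draws on.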
It is of course possible to define a lexical entry or form as having several linguistic properties. For example, the form “eats” is third person, singular and present tense; as before, these data categories can be drawn from a source such as ISOcat or GOLD. Using a separate property for each category makes it much simpler for applications to query the individual properties.
:eat lemon:otherForm [ lemon:writtenRep "eats"@en ;
                       isocat:person isocat:thirdPerson ;
                       isocat:grammaticalNumber isocat:singular ;
                       isocat:tense isocat:present ] .
isocat:person rdfs:subPropertyOf lemon:property .
isocat:grammaticalNumber rdfs:subPropertyOf lemon:property .
isocat:tense rdfs:subPropertyOf lemon:property .
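The querying benefit can be sketched in a few lines of Python. Here each form is a dictionary keyed by property name, and a lookup filters on any combination of properties. The dictionary layout, the `find_forms` helper, and the second form “ate” with the value `isocat:past` are assumptions added for illustration; only the “eats” form comes from the listing above.

```python
# Illustrative only: each form is a dict mapping a property name to its value.
FORMS_OF_EAT = [
    {
        "lemon:writtenRep": "eats",
        "isocat:person": "isocat:thirdPerson",
        "isocat:grammaticalNumber": "isocat:singular",
        "isocat:tense": "isocat:present",
    },
    {
        # Assumed extra form, not in the original listing.
        "lemon:writtenRep": "ate",
        "isocat:tense": "isocat:past",
    },
]


def find_forms(forms, wanted):
    """Return the written representations of forms matching every wanted property."""
    return [
        form["lemon:writtenRep"]
        for form in forms
        if all(form.get(prop) == value for prop, value in wanted.items())
    ]


# Query on a single property, then on a combination of two properties.
print(find_forms(FORMS_OF_EAT, {"isocat:tense": "isocat:present"}))
print(find_forms(FORMS_OF_EAT, {"isocat:person": "isocat:thirdPerson",
                                "isocat:grammaticalNumber": "isocat:singular"}))
```

Had person, number and tense been folded into a single composite value, each of these queries would require parsing that value first.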
The same mechanism may also be used to describe the forms themselves, e.g. to indicate whether they are roots or stems. For example, the Spanish verb “pescar” has alternative stems “pesc-” and “pesqu-” that may be useful for generating inflectional variants.
:pescar lemon:abstractForm [ lemon:writtenRep "pesc"@es ;
                             isocat:morphologicalUnit isocat:stem ] ;
        lemon:abstractForm [ lemon:writtenRep "pesqu"@es ;
                             isocat:morphologicalUnit isocat:stem ] .
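As a rough illustration of how the two stems could be used to generate inflectional variants, the Python sketch below picks “pesqu-” before endings that begin with “e” or “i” and “pesc-” otherwise, following the usual Spanish spelling of the /k/ sound. The function `attach_ending` and the lists of endings are assumptions made for this example; lemon itself only records the stems and does not prescribe how they are combined.

```python
# Illustrative only: the two stems are the abstract forms stated for "pescar".
STEMS = ("pesc", "pesqu")

# Endings starting with these letters take the "pesqu-" stem.
FRONT_VOWELS = ("e", "é", "i", "í")


def attach_ending(ending, stems=STEMS):
    """Pick the stem by the ending's first letter and append the ending."""
    hard_stem, front_stem = stems
    stem = front_stem if ending.startswith(FRONT_VOWELS) else hard_stem
    return stem + ending


# Present indicative endings start with "o" or "a" and take "pesc-";
# present subjunctive endings start with "e" and take "pesqu-".
present_indicative = [attach_ending(e) for e in ("o", "as", "a", "amos", "áis", "an")]
present_subjunctive = [attach_ending(e) for e in ("e", "es", "e", "emos", "éis", "en")]

print(present_indicative)
print(present_subjunctive)
```

The same selection rule yields the first person singular preterite “pesqué” from the ending “é”.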
Morphology can also be partly handled by assigning the morphological pattern as a type of linguistic annotation. For example, the Latin verb “amare” may be stated to have the “first conjugation” morphological pattern as follows.
:amare isocat:morphologicalPattern :first_conjugation .
This modelling makes it possible to avoid stating all inflected forms of a word. Implementations should nevertheless ensure that explicitly stated forms override forms generated from a morphological pattern. This is useful because, for example, the verb “speak” has a regular third person singular present form “speaks” but irregular simple past and past participle forms “spoke” and “spoken”.
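The override behaviour can be sketched in Python as follows. The toy pattern in `regular_forms`, the dictionary keys, and the `resolve_forms` helper are hypothetical and stand in for whatever morphological pattern and form representation an implementation actually uses; the only point illustrated is that stated forms take precedence over generated ones.

```python
# Illustrative only: a toy pattern for regular English verbs.
def regular_forms(lemma):
    """Generate forms from a (hypothetical) regular English verb pattern."""
    return {
        "thirdPersonSingularPresent": lemma + "s",
        "simplePast": lemma + "ed",
        "pastParticiple": lemma + "ed",
    }


def resolve_forms(lemma, stated_forms):
    """Merge generated and stated forms; stated forms take precedence."""
    forms = regular_forms(lemma)
    forms.update(stated_forms)  # explicitly stated forms overwrite generated ones
    return forms


# Only the irregular forms of "speak" are stated in the lexicon.
speak = resolve_forms("speak", {"simplePast": "spoke", "pastParticiple": "spoken"})
print(speak)
```

With this arrangement the lexicon needs to state only “spoke” and “spoken”, while “speaks” is still derived from the pattern.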
John McCrae 2012-07-31