Memory-based Partial Parsing

The writeup below describes a very simple approach to memory-based parsing. The code that implements this simple approach is in parse.lisp. The accompanying file parse-tests.lisp has the toy examples given below.

We will model parsing for a MOP-based memory as a process that takes a list of words an input and returns a list of the MOPs in memory that the text might be referring to. We'll call this model memory-based because the parsing rules will use lexical, syntactic and semantic knowledge. We'll call this partial parsing because we won't try for completeness. The parser will accept and ignore words it doesn't know or concepts it can't combine together, on the assumption that it's knowledge is very incomplete.

Our parser will return multiple answers because input texts are often ambiguous and can be interpreted in several different ways. More specifically, our parser will return three values:

The first value is a memory structure (MOP instance) for the best interpretation, if there's just one. Otherwise, it's NIL.
The second value is a list of the best parse results, i.e., all results that are not worse than any other result.
The third value is a list of all parse results. This will include interpretations of fragments of the input, e.g., individual words and phrases.

A parse result is an object that contains the concept identified and any additional attributes specified, the input words thatwere used to construct the result (in reverse order), and any words that followed the words used.

Here are some simple examples. First, an example that, in this small knowledge base is unambiguous:

> (parse '(catch a butterfly))
      M-CATCH-ANIMAL7      an instance of catching an animal
      (#<M-CATCH-ANIMAL (:OBJECT (M-BUTTERFLY)) Used: (A BUTTERFLY CATCH)>)
      (#<M-CATCH-ANIMAL (:OBJECT (M-BUTTERFLY)) Used: (A BUTTERFLY CATCH)>
      #<M-BUTTERFLY Used: (A BUTTERFLY)>)

We can see the memory structure that resulted with show-frame:

> ((show-frame 'm-catch-animal7)
         TYPE :INSTANCE
         ISA M-CATCH-ANIMAL
         :OBJECT M-BUTTERFLY
            TYPE :MOP
            ISA M-INSECT

Another unambiguous sentence that uses "catch" to mean something very different:

>  (parse '(catch a cold))
      M-GET-DISEASE8
      (#<M-GET-DISEASE (:OBJECT (M-COLD-DISEASE)) Used: (COLD CATCH)>)
      (#<M-GET-DISEASE (:OBJECT (M-COLD-DISEASE)) Used: (COLD CATCH)>
      #<M-COLD-DISEASE Used: (COLD)>)

Again, to see the contents of the new instance:

> ((show-frame 'm-get-disease8)
         TYPE :INSTANCE
         ISA M-GET-DISEASE
         :OBJECT M-COLD-DISEASE
            TYPE :MOP
            ISA M-DISEASE

Finally, an ambiguous sentence, using the word "bug" which might mean an insect or a disease:

>  (parse '(catch a bug))
      NIL            no instance because of ambiguity
      (#<M-CATCH-ANIMAL (:OBJECT (M-INSECT)) Used: (BUG CATCH)>
      #<M-GET-DISEASE (:OBJECT (M-DISEASE)) Used: (BUG CATCH)>)
      (#<M-GET-DISEASE (:OBJECT (M-DISEASE)) Used: (BUG CATCH)>
      #<M-CATCH-ANIMAL (:OBJECT (M-INSECT)) Used: (BUG CATCH)>
      #<M-DISEASE Used: (BUG)>
      #<M-INSECT Used: (BUG)>)

Basic Concepts

We're going to attach phrases directly to concepts in memory. A phrase is a list of items. The simplest items are words. Phrases are attached to concepts with defphrase, e.g.,

(defphrase m-tiger tiger)
          
         (defphrase m-kangaroo-rat kangaroo rat)

The general syntax is (defphrase concept item1 item2 ...). The general rule for processing a phrase is:

If you see all the item of phrase in the input, in the given order, then you've recognized the phrase, and therefore have seen a reference to the concept to which the phrase is attached.

The kicker is that another kind of item can appear in a phrase: a role name. When a role name appears, it represents whatever kind of concept fills that role in the concept to which the phrase is attached. So if the phrase (:actor eat :object) is attached to a MOP for eating, then :actor means whatever concept can be an actor for eating, e.g., an animal concept, and :object means whatever concept can be the object of eating, e.g., a food concept.

A concept can have many phrases attached, using multiple defphrase's, and the same phrase can be attached to many concepts. The latter is where many, though not all, ambiguities arise.

A Parsing Algorithm

The basic algorithm uses two steps:

Given an input sentence (list of words), collect all the phrases that might match it. This "fetch" phase should be quick and fast and rule out 99% of the phrases.
Apply each of the phrases that were fetched to the input. This is a more expensive process, and involves both structure building and recursive parsing.

This two-phase approach is common in large knowledge bases using potentially expensive matching operations.

The fetch phase implements the following rules to decide what phrases might apply to a given list of input words. For a phrase to be a potential parser of the input, it must:

be no longer than the input
have no words not in the input
have words in the same relative order as the input

Note that the input may have words not in the phrase, that keywords are almost ignored, and that a phrase with all keywords only rejects inputs shorter than it.

The rules for the match phase are slightly more complicated. Matching is done from left to right:

If the phrase starts with a word, then that word must appear in the input, and the rest of the phrase must match the rest of the input. Intervening words in the input are thrown away.
If the phrase starts with the name of a role, e.g., :OBJECT, then, recursively, see if the input matches some phrase that is attached to the concept that fills that role in the phrase's base concept.

For example, when matching (catch :object) with the base concept m-get-disease against the input (catch a cold):

The two catch's match immediately.
To match :object, the matcher has to find some phrase that
- matches the input, and
- has the base concept m-disease, which is the filler of :object in m-get-disease

This is where the list generated by the fetch phrase comes in handy. The only phrases that might satisfy the first condition have already been fetched. Therefore, all the matcher has to do is try those phrases in that list whose base concept satisfies the second condition. There's no need to search all of phrase memory again.