- match any thing that does not match pattern
- match a list beginning with item1 followed by any number of other items followed by item2
- match a number
We could add these features directly to the code but
- the code will become more complex and harder to maintain
- there will always be new features we want to add
The better way to handle this is to make our matcher extensible, using data-driven programming.
But first we need a coherent design for what these more powerful patterns will look like.
An Extensible Pattern Syntax
The easiest format to use to specify the kinds of patterns mentioned above is
(?pattern-type arg1 arg2 ...)
- match any thing that does not match pattern
?notto indicate "not this pattern"
- match a list beginning with item1 followed by any
number of other items followed by item2 becomes
(item1 (?* var) item2)
?*to indicate "match zero or more items"
- match a number becomes
?isto indicate "match anything that satisfies this predicate
Two other useful extensions that fit this format are:
(?and pat1 pat2 ...)
to mean "match anything that can matches all these patterns with one set of bindings"
(?or pat1 pat2 ...)
to mean "match anything that matches any of the patterns, and each match can have different bindings"
This format has two nice properties:
- It's easy to give any number of "arguments" to the pattern, as can be seen in the examples above.
- It's a flexible format for future extensions that might come up.
It does have a downside however. The simple pattern variable form
we used before, e.g.,
?X, doesn't fit the above format,
and in fact can cause patterns to be ambiguous and hard to maintain:
(?foo ...)a list beginning with the variable
?fooan extension to the pattern language?
?foois not an extension when we wrote the pattern, but someone defines it as such later, our pattern that used to work will now behave differently, with no warning.
To avoid these problems, we drop the convenient but unstable
symbolic form for pattern variables in favor of the form
(? X). Now there will be no ambiguity:
- If we want the variable
foo, we write
- If we want a pattern extension, we write
- If someone else defines an extended meaning for
?foo, it doesn't affect patterns with
Implementing Extensible Patterns
One way to implement all these extensions to our matcher would be to use a conditional, e.g.,
(defun pat-match (pat form &optional (blists '(nil))) (cond ((eql pat form) blists) ((or (atom pat) (atom form)) nil) ((eql (first pat) '?) (match-variable pat form blists)) ((eql (first pat) '?is) (match-predicate pat form blists)) ((eql (first pat) '?and) (match-and pat form blists)) (t match car and cdr)))
The problem is that
pat-match will get pretty long,
but never long enough. If we or some programmer using our matcher
wants a new extension, they're going to have to edit this definition.
The right way to do this is to replace the above with
(defun pat-match (pat form &optional (blists '(nil))) (cond ((pat-extension-p pat) (match-extension pat form blists)) ((eql pat form) blists) ((or (atom pat) (atom form)) nil) (t (pat-match (cdr pat) (cdr form) (pat-match (car pat) (car form) blists)))))
This code presumes that
pat-extension-p can determine
if an item is a defined pattern extension. This is the approach taken
pat-match is simpler, don't we have to
match-extension with a conditional, in fact with two,
one each function? Won't this actually be harder to maintain than the
The key is to define
match-extension to look in a table. The table associates
a symbol, e.g.,
?is, with a function that knows how to
match patterns involving that symbol. How we implement this table
isn't important. We could use a-lists or hashtables or whatever.
For example, if the function
) returns the pattern
matcher for the symbol, if any, or
pat-extension-p is pretty easy to define:
(defun pat-extension-p (name) (not (null (pat-function name)))
match-extension is only a little more complex:
(defun match-extension (pat form blists) (let ((fn (pat-function (first pat)))) (when (null fn) (error "Undefined pattern extension ~S" pat)) (funcall fn (rest pat) form blists)))
This will call the matching function attached to the pattern
?is, passing it the arguments to the
extension, e.g., the list
(numberp) if the pattern were
(?is numberp), the form to be matched against, and the
current list of binding lists. This should be enough for the matching
function to do its job.
Putting things in the table is also easy. We define a function add-extension to associate a pattern matching function with a pattern label. If match-predicate is the function that does the matching work for ?is, we'd write
(add-extension '?is :single 'match-predicate)
The :single says that an ?is pattern matches one input pattern. Most patterns are of this type.
This approach is called data-driven programming because how our code runs (which code gets called) is driven by the data (in this case, symbols in the pattern).
Data-driven programming is a very useful technique. Consider using it whenever you start writing a
CASEthat chooses between various pieces of code based on some input data. It's a technique that simultaneously makes code simpler and more flexible.
Segment Matching Patterns
Segment matching is involved for any pattern that can match zero or more items. Two examples are:
(?* L)which can match any number of items
)which can zero or one occurrences of pat
Just adding segment variables, i.e, patterns of the form
L), changes how things work in several ways.
First, we can have more than one answer:
> (pat-match '((?* l) (?* r)) '(a b c)) (((L) (R A B C)) ((L A) (R B C)) ((L A B) (R C)) ((L A B C) (R)))
There are four binding lists returned, each representing a different way that the two segment variables can be matched to the input list.
Second, if a segment variable appears twice, it must match the same sequence of items as it did before, e.g.,
> (pat-match '((?* l) (?* r) (?* l)) '(a b c))) (((R A B C) (L)))
There is only one answer now, because the only way the first
segment variable can match a segment before and after some other
(A B C) is if the first segment variable
matches zero items. On the other hand,
> (pat-match '((?* l) (?* r) (?* l)) '(a b c a b))) (((R A B C A B) (L)) ((R C) (L A B)))
has two possible answers.
An important point about segment patterns is that they can only
occur inside lists. The following call to
pat-match makes no sense:
(pat-match '(?* l) '(a b c))
However, the following does:
(pat-match '((?* l)) '(a b c))
The latter example will bind
L to the list
C). It's not a particularly efficient way to do it, but it
does make sense conceptually.
Implementing Segment Matching
In order for the pattern matcher to match a segment pattern against zero or more input items, the matcher has to check to see if it has a pattern that is a list whose first element is a segment matching pattern.
Here's the code from extend-match.lisp:
(defun pat-match (pat form &optional (blists '(nil))) (cond ((pat-extension-p pat) (match-extension pat form blists)) ((eql pat form) blists) ((atom pat) nil) ((segment-pat-extension-p (first pat)) (match-segment-extension pat form blists)) ((atom form) nil) (t (pat-match (cdr pat) (cdr form) (pat-match (car pat) (car form) blists)))))
The fourth branch is the important one. It checks for a list
starting with a pattern labelled as a segment matching pattern. If
found, it calls
match-segment-extension, the segment
(defun match-segment-extension (pats form blists) (let ((pat (first pats)) (rest-pats (rest pats))) (let ((fn (pat-function (first pat)))) (when (null fn) (error "Undefined pattern extension ~S" pat)) (funcall fn (rest pat) rest-pats form blists))))
pats is a list of a patterns, starting
with a segment pattern.
form is a list of input items.
The code above gets the matching function and passes it
- the segment pattern's arguments, e.g.,
- the remaining patterns in the list
- the list of input items
- the list of binding lists
The list of remaining patterns is needed because a segment matcher
?* will consider many different bindings, each
differing in how many input items are matched, but only the bindings
that allow the rest of the patterns to match will be kept.
Here's the code for the segment variable matcher:
(defun match-segment-variable (args pats input blists) (destructuring-bind (&optional name) args (loop for tail = input then (rest tail) append (match-tail tail name pats input blists) until (null tail)))) (defun match-tail (tail name pats input blists) (let ((blists (pat-match pats tail blists))) (if (null blists) nil (bind-variable name (ldiff input tail) blists))))
match-tail with the input list, then the input list
minus the first element, then the input list minus the first two
elements, and so on, up to and including the input list minus all its
cdr's of a list are called the
tails of the list.
match-tail simply calls
pat-match to see
if the remaining patterns match the given tail of the input list. If
they do, then
match-tail tries to bind the segment
variable to the elements in the input list preceding the tail.
ldiff is one of the great unknown functions of Common
Lisp. Given a list and a tail of that list (not a list that
has the same elements, but an actual tail),
returns a list of the items preceding the tail. This function turns
out to be surprisingly handy but often overlooked, even by veteran
To see why this works, consider
(pat-match '((?* l) (? x)) '(a b c))
match-tail will be called with
(A B C),
(). It will
try matching each of these with the remaining patterns, namely
((? X)). Following the normal rules of matching, the
only match that succeeds is the one that binds
C, i.e., where the tail was
L will be bound to
The importance of
bind-variable becomes clear when we
(pat-match '((?* l) (?* r) (?* l)) '(a b c)))
The sequence of events is roughly this:
- The first occurrence of
(?* L)first tries passing the entire list
(A B C)to the remaining patterns
((?* R) (?* L)).
(?* R)in turn first tries passing
(A B C)to
- After numerous tries,
(?* L)binds itself to
(A B C). The list of binding lists
(A B C)))is returned to
(?* R)adds the binding
(R)to each of these binding lists.
(?* R)then tries
(B C) to ((?* L))and gets back the answer
(?* R)adds the binding
(R A)to each of these binding lists.
(?* R)has tried every tail, and returns the list of binding lists
(((R) (L A B C)) ((R A) (L B C)) ((R A B) (L C)) ((R A B C) (L))).
(?* L)now tries to add the binding
(L)to each of these lists. This succeeds only on the last binding list.
- The steps above now repeat with
(B C)to the remaining patterns
((?* R) (?* L)).
- Eventually, the binding lists returned are
(((R) (L B C)) ((R B) (L C)) ((R B C) (L))).
(?* L)is unable to add the binding
(L A)to any of these lists. And similarly for the remaining passes.