1997 Qual

Programming Question

Analysis

How to Grade Your Answer

I graded on the following items:

flexibility and open-ness of the design
exploration of possible metrics and what makes a good one
Lisp competence

These items and possible answers are discussed in detail below. The best possible score is 10 points. I graded the 5 answers submitted this year as follows:

2 answers got 8 points.
They lost 2 points for clumsy recursion (note how simple the metric code can be) and/or not questioning the two obvious metrics, both of which are actually useless
2 answers got 5 points.
The code let you change how important the two (useless) metrics were, but didn't support things like " deprecate nesting more when it occurs in the tests of cond's and if's and do's"
1 answer got 3 points.
The code had no flexibility and had numerous Lisp mistakes.

In other words, a recursive function that simply calculated maximum list length and nesting depth and scaled the results using user-adjustable weights for length and depth got 5 points out of 10, assuming the Lisp code was competent otherwise. There needed to be more flexibility than that.

Analysis Point 1: Flexibility

Program for flexibility. It should be easy to change the scoring algorithm and for a Lisp programmer like me to specify context-specific rules, such as "deprecate nesting more when it occurs in the tests of cond's and if's and do's.

There were two techniques used to get flexibility:

penalty weights to multiply depth and length by when calculating a total score
pattern-matching rules to invoke scoring functions or specialized penalty weights

Another viable option would be

data-driven code, using a table of scoring functions, indexed by names of Lisp functions and special forms

Analysis Point 2: Metrics

difference-one-do is significantly worse than replace-hex-escape even though it only has a few more atoms and doesn't nest as deeply.

There was several hours of empirical experience hidden in this one sentence. All but one qual-taker used the following obvious metrics:

maximum nesting
length of long list

Unfortunately, they're both terrible, as hinted at by "doesn't nest as deeply". If you look at the table below, you'll see that

These metrics produce values in a very narrow range, making it very difficult to specify a "threshold of badness."
They're unreliable. stream-subst scores better than replace-hex-escapebut it's obviously longer.

Multiplying unreliable values with weights only makes things worse.

Two other simple metrics do a lot better:

total number of atoms
sum of all list lengths

The first metric was hinted at by " has a few more atoms" and was mentioned by one qual-taker. The table below shows a pretty good correlation with reasonable spread for both of these. difference-one-do does indeed have fewer atoms than replace-hex-escape, but if you don't count the parameter list, which isn't really the function definition's fault, then difference-one-do has a few more atoms than replace-hex-escape.

The second metric penalizes parentheses, e.g.,

(a b c) scores 3
((a b c)) scores 4
((a) (b) (c)) scores 6

These differences add up fast.

Baseline Metric Results

Here's a table showing how the above metrics score the example pieces of code.

	MAX-LENGTH	MAX-DEPTH	ATOM-COUNT	LIST-COUNT
REPLACE-HEX-ESC	5	10	36	58
DIFFERENCE-ONE-DO	4	8	42	71
RIEGER-SIMPLE	5	12	58	96
INTERSECTP	6	9	111	175
STREAM-SUBST	5	8	89	138

As a further test, when I applied list-count to several of my Lisp files that I use a "model" code in CS C25, most functions scored only 20 to 60. Since the scores for the example "too long" functions start at 70 and quickly pass 100, list-count seems like the very reasonable baseline metric.

Baseline Metric Code

Lisp code to do the above metrics is pretty simple. Most of the qual-takers wrote pretty clumsy recursive code.

(defun max-length (form)
  (cond ((atom form) 1)
        (t (reduce #'max form
                   :key #'max-length
                   :initial-value (length form)))))
 
 
(defun max-depth (form &optional (depth 0))
  (cond ((atom form) depth)
        (t (max (max-depth (car form) (1+ depth))
                (max-depth (cdr form) depth)))))
 
(defun atom-count (form)
  (cond ((atom form) 1)
        (t (reduce #'+ form :key #'atom-count))))
 
(defun list-count (form)
  (cond ((atom form) 0)
        (t (reduce #'+ form
                   :key #'list-count
                   :initial-value (length form)))))

Handling the special cases

A Data-Driven Approach

My original answer for this question was data-driven. It had a table associating a scoring function with Lisp function names. For example, the following scoring function(s) for cond (one of the more complex cases), penalizes forms in the tests of each branch nested over a certain amount and penalizes long branches.

(setf (score-fn 'cond)
      #'(lambda (form)
          (reduce #'max (rest form) :key 'score-branch)))
 
(defun score-branch (branch)
  (max (penalize (max-depth (first branch)) 1 2)
       (penalize (length (rest branch)) 1 2)))
 
(defun penalize (n &optional (threshold 1) (factor 2))
  (+ 1 (* (max 0 (- n threshold)) factor)))

Implementing a data-driven solution was quite easy. Here was the top-level scoring function. Note that implemeting function-too-long-p simply means "max-score > threshold."

(defun max-score (defn)
  (cond ((atom defn) 1)
        (t (reduce #'max defn
                   :initial-value (score-form defn)
                   :key #'max-score))))
 
(defun score-form (form)
  (cond ((atom form) 1)
        (t (funcall (get-scorer (car form)) form))))

However, I liked the pattern-directed rules two qual-takers used, and especially liked the suggestion by one to use a syntax similar to Scheme's define-syntax special form. Patterns are certainly easier to maintain than Lisp code. Furthermore, I've never tried implementing define-syntax style of patterns, so that's what I've implemented here. The approach is inspired by one of the qual answers, though the code is totally different.

You can test my answer by calling score-file on a file of Lisp code. The examples in the original question are here.

A Pattern-Driven Approach

A rule pattern is basically a template for a Lisp expression, with numbers for subexpressions. The numbers are penalty weights. Whatever score the matching code is given will be multiplied by the corresponding weight.

The special symbol etc in a pattern means "repeat the previous pattern until the end of the list." (This is equivalent to ... in define-syntax in Scheme, but Common Lisp doesn't allow ... as a symbol name.) This kind of syntax is not as general as segment variables in a full pattern matcher, but it's simpler to use and implement, and provides all the generality we need here.

Here's three rules, to show the format:

(defparameter *score-pats* 
  '((if 2 1 etc)
    (cond (2 1 etc) etc) 
    (let ((1 2) etc) 1 etc)
    ))

The first rule says that when scoring IF forms, double the score in the test position. The second rule says to double the score for the test in each branch of a COND form. The third rule says to double the score of the value forms in LET variable specifications.

Matching

The matcher for these patterns is similar to standard pattern matchers, where

a number matches (and binds) to anything
a number can be bound many times to different subexpressions
any other atom matches only itself
a list of the form (pat etc) matches if pat matches every item in list of expressions -- all the bindings are collected
any other list matches if the car andcdr match -- all the bindings are collected

Life is simpler in that we never need to check the bindings during the match, just save them for later. The only complexity is handling etc and this is still simpler than handling full-fledged segment variables.

As usual, we need to make sure that we can distinguish a successful match with no bindings from a failed match. Here, we'll use a binding list of the form (pat binding binding ...) where pat is the top-level pattern that matched. Each binding will be a pair whose car is some penalty weight from the pattern and whose cdr is the subexpression that the weight matched.

Here's an example result:

? (match-score-pat '(if 2 1 etc) '(if (null x) t (car x)))
((IF 2 1 ETC) (1 . T) (1 CAR X) (2 NULL X))

This says that

(IF 2 1 ETC) is the pattern
the score for T should be multiplied by 1
the score for (CAR X) should be multiplied by 1
the score for (NULL X) should be multiplied by 2

Notice how the weight 1 was assigned to the two subexpressions following the test.

Adding Up the Scores

The algorithm for combining scores as implemented by the code is as follows.

It assumes that there are two user-changeable scoring functions:

a baseline code metric that takes an expression and assigns a score non-recursively
- the default is length
- another reasonable choice is count-top-atoms, i.e., (count-if #'atom l)
a binding penalizer that scores for number of bindings produced
- the default is length

The algorithm to score a piece of code, code, is:

Match code against the stored rule patterns.
Let bindings be the bindings returned by the first match, if any, or NIL.
Recursively score the subexpressions in bindings.
Add
- the subexpression scores multiplied by the corresponding weights
- the result of applying the binding penalizer to bindings
- the result of applying the baseline code metric to code

Notice that if no patterns apply to any part of the code, then the above algorithm does

list-count if the baseline code metric is length
atom-count if the baseline code metric is count-top-atoms

Notice that patterns with internalize parentheses don't penalize those parentheses, in this algorithm. This could be considered a bug or a feature. The binding penalizer partially compensates for this.

Testing the Extensions

So, how much better do those patterns make our Lisp scoring function? Here's what I get for the example functions, along side the original values returned by list-count:

	LIST-COUNT	SCORE-CODE
REPLACE-HEX-ESC	58	62
DIFFERENCE-ONE-DO	71	99
RIEGER-SIMPLE	96	115
INTERSECTP	175	192
STREAM-SUBST	138	160

Are these better? They're spread out more but I leave it as an exercise for the reader to decide if anything substantial has been gained (beyond learning, of course).