D64 Assignment #4 Solutions

1. Iterators

Here's one solution to the problem.  The key was to be careful about not losing the first row of a new group - we have to save the row that ends a group for use in the next group.

/* State variables - would be dynamically allocated in a real system */
boolean atEOF;
Row nextRow;
int nextGroupByVal;
void open(void)
{
    /* in real code, allocate state variables */
    inputRelation.open();
    nextRow = inputRelation.getNext();
    atEOF = (nextRow == NULL);
    if (!atEOF)
        nextGroupByVal = getAttributeFromRow(nextRow,groupByColumnNum);
}
Row getNext(void)
{
    int count=1;
    int total;
    int thisGroupByValue = nextGroupByVal;
    if (atEOF)
        return NULL;
    total = getAttributeFromRow(nextRow,averageColumnNum);
    while (true) {
        nextRow = inputRelation.getNext();
        if (nextRow == NULL) {
            atEOF = true;
            break;
        }
        nextGroupByValue = getAttributeFromRow(nextRow,groupByColumnNum);
        if (nextGroupByValue == thisGroupByVal) {
            count++;
            total += getAttributeFromRow(nextRow,averageColumnNum);
        }
        else
            break;
    }
    return(composeRow((float)total/count,thisGroupByValue));
}
void close(void)
{
    /* in real code, deallocate state variables */
    inputRelation.close();
}
 

2. Query Evaluation

A few values we'll need later on:

M = 20

sort-merge join

To perform sort-merge join, we need both relations to be smaller than M2. In this case, the Student table is too large. If we perform the selection before sorting, however, it will be only 125 pages long which is small enough. Thus the cost of the sort-merge is 3*B(Dorm)+B(Student)+2*B(Student)/4. This computes to 740 I/Os.

hash join

To perform a hash join, we need at least one of the relations to be smaller than M2 which Dorm is. We can also push the selection down on Student and so our cost is the same as with sort-merge join (740 I/Os).

index join

We can use the index on Dorm.name to perform an index join with Dorm as the inner relation. Each lookup in the Dorm table will require an index lookup plus a read of the actual record. If we push the selection on Student, we will only have to look in the Dorm table for 2500 of the rows in the Student table. The total cost will be B(Student) + 2500 * (index lookup + 1). How deep is the Dorm.name index? We have to solve our favorite equation:

12*(n+1)+100*n <= 4096

which tells us n=36 (at least in the leaves). At 80% occupancy we need roughly 11 leaves and 1 root. Assuming the root is cached, each lookup is only 1 I/O so the total cost of the index join is 5500 I/Os.

nested loop join

Of the 20 buffers we have to work with, we can use 19 to read blocks from Dorm and the remaining buffer to scan through Student. This will require two passes through Student and so the cost is B(Dorm) + 2*B(Student) = 1030 I/Os.

M = 100

sort-merge join

To perform sort-merge join, we need both relations to be smaller than M2 which they are. If they were both less than M we could use the single pass algorithm. Using the two pass algorithm, we get the same result as in part 1.

hash join

Since B(Dorm) is smaller than M, we can use the single pass algorithm. The cost is therefore just B(Student)+B(Dorm) = 530 I/Os.

index join

This is the same as in part 1. The extra memory does not help.

nested loop join

Since Dorm fits completely in memory, we only need one pass through Student so the cost is 530 I/Os.

Notes:

  1. We end up not using the index on Student.year in this problem. This is not a surprise since it's a secondary index and it will return 1/4 of the rows. Finding all of the freshman using the index would require 2500 random I/Os whereas scanning the entire table and discarding the non-freshmen would only require 500 sequential I/Os.
  2. Although the index join was much worse than the other joins, it only requires 2 buffers so if memory had been really tight it would have won.