D64 Fall '98 Midterm Solutions

1. True/False

  1. False. Seek time and rotational latency dominate.
  2. True.
  3. True. The larger the blocks are, the fewer fit into a fixed amount of memory, which limits the fan-in of the merge step.
  4. True.
  5. True. The insertion could cause a split of a bucket that has overflowed.

2. Mirroring versus striping

  1. Block read - The two approaches are similar, but mirroring is a little faster since we can read from whichever disk happens to have its head assembly closer to the desired cylinder. With striping, we're stuck reading from only one disk, which means we'll pay the average seek time.
  2. Large read - Striping is faster since we can read the two disks in parallel.
  3. Block write - The two approaches are similar, but mirroring is a little slower since we have to write to both disks and the write is not complete until the slower of the two is finished.
  4. Large write - Striping is faster for the same reason as in (2).
  5. Other differences - Mirroring is fault tolerant whereas striping is not. Striping provides twice the effective storage of mirroring.

3. Sorting

Using the new technique for writing sorted runs, runs will now be 2M blocks long. In the merge phase, we can still only merge M files. This means that we can sort a file that is less than 2M² blocks long.
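
As a quick sanity check on that bound, here is a small Python sketch; the buffer size M is hypothetical, and the 2M-block run length and M-way merge limit are taken from the argument above:

    # Capacity of the two-phase sort when run formation produces 2M-block runs.
    def max_sortable_blocks(M):
        run_length = 2 * M      # each initial run is 2M blocks long
        max_runs = M            # the merge phase can merge at most M runs
        return run_length * max_runs   # = 2 * M**2 blocks

    print(max_sortable_blocks(100))    # with M = 100 buffer blocks: 20000 blocks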

4. Sorted Files

There's no one right answer to this one. The key things to think about are:

  1. We can't break database pointers (record ids refer to slot numbers), so we cannot reorder the slot directory entries themselves.
  2. The solution of keeping the records physically sorted at the end of the block will work, but it requires moving large chunks of memory on every insertion and also makes scanning difficult.
  3. Any approach should handle insertion of a record that does not fit on the page it should be placed on.
  4. Any approach should handle updates to the attribute on which the file is sorted.

I think the best approach is to augment the directory entries with sort order info. For instance, treat each entry as a (start_offset, next_record) pair where start_offset is the offset of the record within the block and next_record is the slot number of the next record in sort order. The header will also need a first_record field that contains the slot number of the first record in the block in sort order. Here's a simple example:

first_record = 2

slot_directory = (400,1),(600,-1),(800,0)

This means that record 2 is first in sort order, followed by record 0, and then record 1.
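
Here is a minimal Python sketch of the scheme; the in-memory representation and helper name are illustrative only, not a prescribed page layout:

    # Slot directory augmented with sort-order links. Each slot is a
    # (start_offset, next_record) pair; next_record is the slot number of the
    # next record in sort order, or -1 at the end of the chain.
    first_record = 2
    slot_directory = [(400, 1), (600, -1), (800, 0)]

    def scan_in_sort_order(first_record, slot_directory):
        """Yield slot numbers in sort order by following the links."""
        slot = first_record
        while slot != -1:
            yield slot
            slot = slot_directory[slot][1]

    print(list(scan_in_sort_order(first_record, slot_directory)))   # [2, 0, 1]

Note that inserting a record only requires patching one next_record link (or first_record), so no record data has to move and existing database pointers stay valid.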

5. Extensible Hashing

Part A: It is possible to drop the nub info since it can be reconstructed from the directory array. If the directory uses i bits and a bucket uses all i bits to determine membership, only one directory entry points to it. If a bucket uses i-1 bits, two directory entries point to it, corresponding to the remaining high-order bit being 0 or 1. In general, a bucket that uses j bits is pointed to by 2^(i-j) directory entries, so we could write code to quickly compute the nub value for every bucket just by examining the directory.
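
A hedged sketch of that reconstruction in Python; the directory is modeled as a plain list of bucket ids, and "nub" here means the number of bits a bucket uses:

    from collections import Counter
    from math import log2

    def nub_values(directory):
        """Recover each bucket's nub from the directory alone."""
        i = int(log2(len(directory)))   # the directory has 2**i entries
        counts = Counter(directory)     # directory entries pointing at each bucket
        # A bucket pointed to by 2**(i - j) entries uses j bits.
        return {bucket: i - int(log2(n)) for bucket, n in counts.items()}

    # A 4-entry directory (i = 2): bucket 'A' is shared by two entries (1 bit),
    # while buckets 'B' and 'C' each have a single entry (2 bits).
    print(nub_values(['A', 'A', 'B', 'C']))   # {'A': 1, 'B': 2, 'C': 2}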

Part B: While there are different answers to this question, the main one is that duplicate values will cause unnecessary splitting of buckets. If there are more records with the same key value than will fit in a single block, splitting cannot spread them apart, so the directory will keep doubling until it exhausts available storage.

6. Range Queries

Part A: 10 records fit on a page, so the relation will be 4000 blocks long. Since the query will return 10% of the students ((4 - 3.8)/(4 - 2)) and the file is sorted on gpa, this will require reading 400 datafile blocks. We must also calculate how many I/Os are required to read the index to find the appropriate 400 blocks of the datafile. The parameter n for the B-tree is 314, and we've stated that the tree is 80% full, so each node will have 251 entries. Assuming a dense index, there will be 160 leaf nodes, which can hang directly off the root. 16 of them will contain pointers for the 10% of the students we're looking for, so the total I/O required for the query is 1 + 16 + 400 = 417. If you assume a sparse index, the total is 1 + 2 + 400 = 403.
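
Written out as a quick Python sketch, using the figures quoted in the solution above:

    # Part A: I/O count with the file sorted on gpa and a dense B-tree index.
    records = 40_000                          # 4000 blocks * 10 records per block
    data_blocks = records // 10               # 4000
    selectivity = (4.0 - 3.8) / (4.0 - 2.0)   # 0.10
    data_ios = int(selectivity * data_blocks)      # 400 sorted data blocks

    entries_per_node = int(314 * 0.8)              # 251 entries in an 80%-full node
    leaves = -(-records // entries_per_node)       # ceil(40000 / 251) = 160 leaves
    leaves_touched = int(selectivity * leaves)     # 16 leaves hold our pointers

    print(1 + leaves_touched + data_ios)           # root + leaves + data = 417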

Part B: Now we know we need a dense index, so we'll have 17 index I/Os (the root plus 16 leaves). The amount of I/O required to read the data records is greater than in Part A, however. Since the file is not sorted on gpa, each dbptr we follow from the index will probably lead us to a different block. In the worst case, each of the 4000 matching tuples will be in a different block, so the I/O total will be 17 + 4000 = 4017.
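
And the Part B worst case, continuing the same sketch:

    # Part B: unsorted data file, dense index; worst case of one block per tuple.
    matching_tuples = 4_000          # 10% of the 40,000 students
    index_ios = 1 + 16               # root plus 16 dense-index leaf nodes
    print(index_ios + matching_tuples)   # 4017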