Using the new technique for writing sorted runs, the runs will now be 2M blocks long. In the merge phase, we can still merge only M runs at a time. This means that we can sort a file of up to 2M * M = 2M^2 blocks.
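The capacity arithmetic can be sketched directly (the function name and the choice of M = 100 below are illustrative, not from the problem):

```python
def max_sortable_blocks(M):
    """With the new run-writing technique, each sorted run is ~2M blocks
    (vs. M blocks when sorting one memory-load at a time). The merge phase
    is still limited to M input runs, so one merge pass can sort a file of
    up to 2M * M = 2M^2 blocks."""
    run_length = 2 * M      # length of each sorted run, in blocks
    runs_mergeable = M      # merge phase can still combine only M runs
    return run_length * runs_mergeable

print(max_sortable_blocks(100))  # 2 * 100^2 = 20000 blocks
```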
There's no one right answer to this one. The key is to think about how to represent a sort order over the records without physically moving them within the block.
I think the best approach is to augment the directory entries with sort order info. For instance, treat each entry as a (start_offset, next_record) pair where start_offset is the offset of the record within the block and next_record is the slot number of the next record in sort order. The header will also need a first_record field that contains the slot number of the first record in the block in sort order. Here's a simple example:
first_record = 2
slot_directory = (400,1),(600,-1),(800,0)
Which means that record 2 is first, followed by record 0, and then record 1.
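The scheme above can be sketched as a short walk over the linked slot directory (the function name is mine; the data is the example from the text):

```python
def records_in_sort_order(first_record, slot_directory):
    """Walk the linked slot directory, yielding (slot, start_offset) pairs
    in sort order. Each directory entry is (start_offset, next_record);
    a next_record of -1 marks the end of the chain."""
    slot = first_record
    order = []
    while slot != -1:
        start_offset, next_record = slot_directory[slot]
        order.append((slot, start_offset))
        slot = next_record
    return order

# The example from the text: first_record = 2, then record 0, then record 1.
print(records_in_sort_order(2, [(400, 1), (600, -1), (800, 0)]))
# [(2, 800), (0, 400), (1, 600)]
```

An insertion only has to splice one slot into this chain, which is the point of the design: sort order is maintained without ever sliding record bytes around inside the block.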
Part A: It is possible to drop the nub info since it can be reconstructed from the directory array. If a bucket uses i bits to determine membership, there will be only one entry in the directory that points to it. If a bucket uses i-1 bits to determine membership, there will be two entries in the directory that point to it, corresponding to the high-order bit being 0 and 1. Thus we could write code to quickly compute the nub value given the directory.
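That reconstruction generalizes: with a directory of 2^d entries, a bucket pointed to by k entries has nub d - log2(k). A minimal sketch (function name and sample directory are mine):

```python
import math
from collections import Counter

def nub_values(directory):
    """Given an extendible-hashing directory (a list of 2**d bucket ids),
    recover each bucket's nub: a bucket pointed to by k directory entries
    uses d - log2(k) bits to determine membership."""
    d = int(math.log2(len(directory)))
    counts = Counter(directory)
    return {bucket: d - int(math.log2(k)) for bucket, k in counts.items()}

# Directory of size 4 (d = 2): bucket 'A' uses 1 bit, 'B' and 'C' use 2.
print(nub_values(['A', 'A', 'B', 'C']))  # {'A': 1, 'B': 2, 'C': 2}
```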
Part B: While there are different answers to this question, the main one is that duplicate values will cause unnecessary splitting of buckets. When there are more duplicates of a value than will fit on a single block, splitting can never separate them (they all hash to the same bits), so the directory will keep doubling until storage is exceeded.
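A toy simulation of the pathology, under the assumption that all duplicates hash identically (the function name, capacity, and depth cap are illustrative):

```python
def directory_doublings(duplicates, block_capacity, max_depth=20):
    """Simulate splitting when one key value has more duplicates than fit
    in a block. Splitting on another hash bit cannot separate identical
    values, so the overfull bucket stays overfull and the directory keeps
    doubling; returns the global depth reached (capped at max_depth)."""
    global_depth = 1
    bucket_size = duplicates  # identical keys always land in the same bucket
    while bucket_size > block_capacity and global_depth < max_depth:
        global_depth += 1     # directory doubles to 2**global_depth entries
    return global_depth

print(directory_doublings(duplicates=10, block_capacity=4))  # runs to the cap: 20
print(directory_doublings(duplicates=3, block_capacity=4))   # fits, no doubling: 1
```

Only the artificial max_depth cap stops the loop; in a real system the directory growth ends when storage is exhausted.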
Part A: 10 records fit on a page, so the relation will be 4000 blocks long. Since the query will return 10% of the students ((4-3.8)/(4-2)) and the file is sorted on gpa, this will require reading 400 datafile blocks. We must also calculate how many I/Os are required to read the index to find the appropriate 400 blocks of the datafile. The parameter n for the B-tree is 314, and we've stated that the tree is 80% full, so each node will have 251 entries. Assuming a dense index, there will be 160 leaf nodes, which can hang directly off of the root. 16 of them will contain pointers for the 10% of the students we're looking for, so the total I/O required for the query is 417 (1 root + 16 leaves + 400 data blocks). If you assume a sparse index, the total is 403 (1 + 2 + 400).
Part B: Now we know we need a dense index, so we'll have 17 index I/Os (the root plus 16 leaves). The amount of I/O required to read the data records is greater than in Part A, however. Since the file is not sorted on gpa, each dbptr we follow from the index will probably lead us to a different block. In the worst case, each of the 4000 matching tuples will be in a different block, so the I/O total will be 4017.
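The arithmetic for both parts can be checked with a short script (the function and parameter names are mine; the numbers are those given in the problem):

```python
import math

def query_io(num_records=40_000, recs_per_block=10, n=314, fill=0.8,
             selectivity=0.1, dense=True, sorted_on_gpa=True):
    """Estimate total I/Os for the gpa range query: root + matching leaf
    nodes for the index, plus data-block reads. With the file sorted on
    gpa the matching records are clustered; otherwise we assume (worst
    case) one data I/O per matching tuple."""
    data_blocks = num_records // recs_per_block             # 4000
    entries_per_node = int(n * fill)                        # 251
    index_entries = num_records if dense else data_blocks   # dense vs. sparse
    leaves = math.ceil(index_entries / entries_per_node)    # 160 dense / 16 sparse
    index_io = 1 + math.ceil(leaves * selectivity)          # root + leaves read
    if sorted_on_gpa:
        data_io = int(data_blocks * selectivity)            # 400 clustered blocks
    else:
        data_io = int(num_records * selectivity)            # 4000, one per tuple
    return index_io + data_io

print(query_io(dense=True,  sorted_on_gpa=True))   # Part A, dense:  417
print(query_io(dense=False, sorted_on_gpa=True))   # Part A, sparse: 403
print(query_io(dense=True,  sorted_on_gpa=False))  # Part B:         4017
```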