# D64 Study Question Solutions

## 1. Optimization

### Part 1:

To compute the join order, we follow the dynamic programming algorithm described in the book. Of course, the first thing we'd do is push the selections down below the joins, so the tables below show the values for the base relations with those selections applied. In particular, the selection on the Student relation will yield T(Student)/V(Student,Class) = 2500 tuples, and the selection on the Enrollment relation will yield T(Enrollment)/V(Enrollment,Grade) = 2500 tuples.
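The estimate used here is the standard one for an equality selection: T(R)/V(R,a). A quick sketch; the statistics T(Student)=10000, V(Student,Class)=4, T(Enrollment)=25000, and V(Enrollment,Grade)=10 are the ones implied by the numbers used throughout this solution:

```python
# Textbook cardinality estimate for an equality selection sigma_{a=c}(R):
# T(R) / V(R, a). The statistics below are the ones implied by this problem.

def eq_selection_size(t_r, v_r_a):
    """Estimated number of tuples of R satisfying a = c."""
    return t_r // v_r_a

print(eq_selection_size(10000, 4))    # Student after its selection: 2500
print(eq_selection_size(25000, 10))   # Enrollment after its selection: 2500
```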

The entries for all of the singletons are:

|           | {Student} | {Enrollment} | {Course} |
|-----------|-----------|--------------|----------|
| Size      | 2500      | 2500         | 500      |
| Cost      | 0         | 0            | 0        |
| Best Plan | Student   | Enrollment   | Course   |

The entries for all of the pairs of relations are shown below. Note that where the two relations are of different sizes, the smaller relation is made the outer relation of the join. Also, since there's no join symbol available in my HTML editor, I used the ¤ symbol instead.

|           | {Student,Enrollment} | {Enrollment,Course} | {Student,Course} |
|-----------|----------------------|---------------------|------------------|
| Size      | 2500                 | 2500                | 1250000          |
| Cost      | 0                    | 0                   | 0                |
| Best Plan | Student ¤ Enrollment | Course ¤ Enrollment | Course ¤ Student |

We can now compute the optimal join plan by considering using any of the three pairs and joining the last relation to them. Remember that we're only considering left deep join plans.

1. (Student ¤ Enrollment) ¤ Course: The estimated cost of this plan is 2500, as is its size.
2. (Course ¤ Enrollment) ¤ Student: The estimated cost of this plan is 2500, as is its size.
3. (Course ¤ Student) ¤ Enrollment: The estimated cost of this plan is 1250000. Its size will be 2500 tuples.

This means that there is a tie for the best plan between plans 1 and 2. We'll break the tie by considering row size, in which case plan 2 has the smaller cost.
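The whole enumeration is small enough to replay in a few lines. This is a minimal sketch of the left-deep dynamic program, using the sizes from the tables above and this solution's cost model, in which only intermediate (non-base, non-final) result sizes count toward cost. The smaller-relation-as-outer rule and the row-size tie-break are not modeled, so where plans tie the printed plan string may differ from the one chosen above; the cost matches.

```python
# Left-deep DP join enumeration over the three relations. Sizes come from
# the tables above; cost(plan) = sum of its intermediate result sizes.

sizes = {
    frozenset({'Student'}): 2500,
    frozenset({'Enrollment'}): 2500,
    frozenset({'Course'}): 500,
    frozenset({'Student', 'Enrollment'}): 2500,
    frozenset({'Enrollment', 'Course'}): 2500,
    frozenset({'Student', 'Course'}): 1250000,
    frozenset({'Student', 'Enrollment', 'Course'}): 2500,
}

# singletons: cost 0, plan is the relation itself
best = {s: (0, next(iter(s))) for s in sizes if len(s) == 1}

for n in (2, 3):
    for s in (x for x in sizes if len(x) == n):
        options = []
        for r in s:                        # r joins as the new inner relation
            rest = s - {r}
            c_rest, p_rest = best[rest]
            # only intermediate (non-base) results contribute to cost
            cost = c_rest + (sizes[rest] if len(rest) > 1 else 0)
            options.append((cost, f'({p_rest} ¤ {r})'))
        best[s] = min(options)

cost, plan = best[frozenset({'Student', 'Enrollment', 'Course'})]
print(cost)   # 2500
```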

### Part 2:

We need to determine the best join algorithm for the two joins in the query. Let's look at the join between Course and Enrollment first.

#### Index Join

Enrollment only has an unclustered index on grade, which is not part of the join predicate (Enrollment.courseNum = Course.number), so we can't perform an index join. In reality, there would be a B-tree index on Course.number since it's the key of Course, but I didn't explicitly say that in the problem, so you're off the hook.

#### Nested-loop Join

The smaller of the two relations, Course, will require 500/20 = 25 blocks on disk. Assuming we can use 26 buffer pages for the join, we can read all of Course into memory and then scan through Enrollment once. The question is: how should we perform the scan on Enrollment? If we use the unclustered index, we will need 2500 I/Os plus the cost of reading the index itself. If we perform a full scan of Enrollment, we will need ceil(25000/40) = 625 I/Os. This is the clear winner. The total cost for the nested-loop join is thus 25+625 = 650 I/Os, and it needs 26 buffer pages, leaving us with 474 pages for the other join.
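The arithmetic above can be double-checked in a few lines; the blocking factors (20 Course tuples per page, 40 Enrollment tuples per page) are the ones assumed in this problem:

```python
import math

# I/O costs for the two ways of scanning Enrollment in the nested-loop join.
course_pages = math.ceil(500 / 20)        # outer relation, kept in memory
full_scan = math.ceil(25000 / 40)         # one sequential pass over Enrollment
index_probe = 2500                        # unclustered index: ~1 I/O per match

print(course_pages + full_scan)           # 650 -- the winner
print(course_pages + index_probe)         # 2525, before even counting index pages
```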

#### Sort-merge Join

Since both relations are small, we can consider loading them completely into memory and then sorting them. The Course table will occupy 25 pages as calculated above, and the Enrollment table (after the selection, which can be applied as we read the relation in) will occupy ceil(625/10) = 63 pages. Thus if we have 88 pages available to us, we can perform the sort and merge in one pass, requiring just 650 I/Os as above.

Now let us consider the join between (Course ¤ Enrollment) and Student. Before we start, let's calculate the size of (Course ¤ Enrollment) in case we need to materialize it. Its cardinality is 2500 and the row size is roughly 300 bytes (13/page). The temporary relation would therefore require ceil(2500/13) = 193 pages.

#### Index Join

Student only has a clustered index on class, which is not part of the join predicate (Enrollment.id = Student.id), so we can't perform an index join. In reality, there would be a B-tree index on Student.id, just as there would be on Course.number.

#### Nested-loop Join

If we have enough buffer pages, we can materialize the outer relation and then perform the join by scanning the inner relation in a single pass. As calculated above, this will require 193 buffer pages plus 1 page to read the Student relation. From the calculation of the first join, we will have enough memory to perform the nested-loop join in this manner.

The next question is: how should we perform the scan on Student? If we use the clustered index, the 2500 tuples we're looking for will be contiguous, so they'll occupy 2500/10 = 250 blocks. We'll also have to account for the I/Os needed to read the index itself, but a full scan would require 10000/10 = 1000 I/Os, so the index looks like a win. To calculate the index cost we need to compute how many entries fit on a page. Since a DBPtr is 5 bytes and the key (class) is probably 1 byte, a node with n keys and n+1 pointers requires n*1 + (n+1)*5 <= 4096 bytes, so n = 681. If the nodes are 80% occupied, each will hold 544 entries. Since the index is clustered it can be sparse, with one entry per data page; with 1000 data pages there will only be 2 leaf nodes, and all of the freshmen will probably be covered by one of them, so the index will only cost 2 I/Os. The total cost is thus 252 I/Os and the buffering requirement is 194 pages.
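The index sizing argument can be replayed numerically. A sketch, assuming 5-byte DBPtrs, 1-byte class keys, 4096-byte pages, 80% occupancy, and (as the two-leaf count requires) a sparse clustered index with one entry per data page:

```python
import math

key_bytes, ptr_bytes, page_bytes = 1, 5, 4096
# a node with n keys and n+1 child pointers must fit: n*1 + (n+1)*5 <= 4096
n = (page_bytes - ptr_bytes) // (key_bytes + ptr_bytes)
entries_per_node = int(n * 0.8)              # 80% occupancy
data_pages = 10000 // 10                     # Student: 10 tuples per page
leaves = math.ceil(data_pages / entries_per_node)

print(n, entries_per_node, leaves)           # 681 544 2
```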

#### Sort-merge Join

The first thing we should check is whether we can load both relations into memory. We have already calculated that the outer relation will require 193 pages, and the Student relation, after selection, will require 250 blocks. The total is 443 pages, which is within our limit. Thus we can buffer both relations in memory and sort and merge them without writing run files. The I/O cost is only 252 to read Student using its index.

Combining our answers from both parts, it looks like nested-loop join is the best bet for both joins since it's cheap and only uses 26 buffer pages. We could also use sort-merge for one of the two joins (although not both) at the same cost but with a higher buffer page requirement. It may be worth it, though, if there were an order by or group by in the query. Also, our cost model is unrealistically simple, so we can't definitively decide between the two. In either case, the total cost is 650+252 = 902 I/Os.

For added fun, try the problem again with different (smaller) values for the number of available buffer pages.

## 2. Recovery

If you see anything that seems fishy below let me know - there's lots of room to make mistakes when typing this stuff in.

### Part 1:

1. After step 5.  The log must be flushed to disk at least through the record marked with '>>'.

| LSN | prevLSN | TID | Type   | PageID |
|-----|---------|-----|--------|--------|
| 0   | null    | 1   | Insert | 1      |
| 1   | 0       | 1   | Update | 3      |
| 2   | null    | 2   | Update | 1      |
| >>3 | 1       | 1   | Commit |        |
| 4   | 2       | 2   | Update | 2      |

| TID | lastLSN | Status     |
|-----|---------|------------|
| 1   | 3       | Committing |
| 2   | 4       | Running    |

| pageID | recLSN |
|--------|--------|
| 1      | 0      |
| 2      | 4      |
| 3      | 1      |

| pageID | LSN in mem | LSN on disk |
|--------|------------|-------------|
| 1      | 2          | ?           |
| 2      | 4          | ?           |
| 3      | 1          | ?           |

2. After step 8.

| LSN | prevLSN | TID | Type   | PageID |
|-----|---------|-----|--------|--------|
| 0   | null    | 1   | Insert | 1      |
| 1   | 0       | 1   | Update | 3      |
| 2   | null    | 2   | Update | 1      |
| >>3 | 1       | 1   | Commit |        |
| 4   | 2       | 2   | Update | 2      |
| 5   | 4       | 2   | Delete | 1      |
| 6   | null    | 3   | Delete | 3      |
| 7   | 3       | 1   | End    |        |

| TID | lastLSN | Status  |
|-----|---------|---------|
| 2   | 5       | Running |
| 3   | 6       | Running |

| pageID | recLSN |
|--------|--------|
| 1      | 0      |
| 2      | 4      |
| 3      | 1      |

| pageID | LSN in mem | LSN on disk |
|--------|------------|-------------|
| 1      | 5          | ?           |
| 2      | 4          | ?           |
| 3      | 6          | ?           |

3.  After step 11.  The numbers in parentheses after CLR are the nextUndoLSNs.
| LSN | prevLSN | TID | Type      | PageID |
|-----|---------|-----|-----------|--------|
| 0   | null    | 1   | Insert    | 1      |
| 1   | 0       | 1   | Update    | 3      |
| 2   | null    | 2   | Update    | 1      |
| >>3 | 1       | 1   | Commit    |        |
| 4   | 2       | 2   | Update    | 2      |
| 5   | 4       | 2   | Delete    | 1      |
| 6   | null    | 3   | Delete    | 3      |
| 7   | 3       | 1   | End       |        |
| 8   | 5       | 2   | Abort     |        |
| 9   | 8       | 2   | CLR(4)    | 1      |
| 10  | 9       | 2   | CLR(2)    | 2      |
| 11  | 10      | 2   | CLR(null) | 1      |
| 12  | 6       | 3   | Update    | 1      |
| 13  | 11      | 2   | End       |        |

| TID | lastLSN | Status  |
|-----|---------|---------|
| 3   | 12      | Running |

| pageID | recLSN |
|--------|--------|
| 1      | 0      |
| 2      | 4      |
| 3      | 1      |

| pageID | LSN in mem | LSN on disk |
|--------|------------|-------------|
| 1      | 12         | ?           |
| 2      | 10         | ?           |
| 3      | 6          | ?           |
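The CLR chain in step 11 is mechanical once you see the rule: undo walks the aborting transaction's prevLSN chain backward, and each CLR's nextUndoLSN is the prevLSN of the record it compensates. A small sketch for transaction 2's abort:

```python
# Transaction 2's undoable log records, keyed by LSN: (prevLSN, type, pageID).
log = {
    2: (None, 'Update', 1),
    4: (2, 'Update', 2),
    5: (4, 'Delete', 1),
}

clrs = []           # (LSN being undone, nextUndoLSN written into the CLR)
lsn = 5             # transaction 2's lastLSN when the abort begins
while lsn is not None:
    prev, _type, _page = log[lsn]
    clrs.append((lsn, prev))
    lsn = prev      # follow the prevLSN chain backward

print(clrs)   # [(5, 4), (4, 2), (2, None)] -- matching CLR(4), CLR(2), CLR(null)
```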
4. After step 8 assuming two buffer pages.

| LSN | prevLSN | TID | Type   | PageID |
|-----|---------|-----|--------|--------|
| 0   | null    | 1   | Insert | 1      |
| 1   | 0       | 1   | Update | 3      |
| 2   | null    | 2   | Update | 1      |
| 3   | 1       | 1   | Commit |        |
| >>4 | 2       | 2   | Update | 2      |
| 5   | 4       | 2   | Delete | 1      |
| 6   | null    | 3   | Delete | 3      |
| 7   | 3       | 1   | End    |        |

| TID | lastLSN | Status  |
|-----|---------|---------|
| 2   | 5       | Running |
| 3   | 6       | Running |

| pageID | recLSN |
|--------|--------|
| 1      | 0      |
| 3      | 6      |

| pageID | LSN in mem | LSN on disk |
|--------|------------|-------------|
| 1      | 5          | ?           |
| 2      | --         | 4           |
| 3      | 6          | 1           |

### Part 2:

1. Suppose that the log has only been flushed up through LSN 3, which was the minimum.  After analysis, the tables will look like:

| TID | lastLSN | Status     |
|-----|---------|------------|
| 1   | 3       | Committing |
| 2   | 2       | Running    |

| pageID | recLSN |
|--------|--------|
| 1      | 0      |
| 3      | 1      |

After redo:
| TID | lastLSN | Status   |
|-----|---------|----------|
| 2   | 2       | Aborting |

| pageID | recLSN |
|--------|--------|
| 1      | 0      |
| 3      | 1      |

| pageID | LSN in mem | LSN on disk |
|--------|------------|-------------|
| 1      | 2          | ?           |
| 2      | --         | ?           |
| 3      | 1          | 1           |

After undo:
| LSN | prevLSN | TID | Type      | PageID |
|-----|---------|-----|-----------|--------|
| 0   | null    | 1   | Insert    | 1      |
| 1   | 0       | 1   | Update    | 3      |
| 2   | null    | 2   | Update    | 1      |
| >>3 | 1       | 1   | Commit    |        |
| 4   | 3       | 1   | End       |        |
| 5   | 4       | 2   | CLR(null) | 1      |
| 6   | 5       | 2   | End       |        |

| TID | lastLSN | Status |
|-----|---------|--------|

| pageID | recLSN |
|--------|--------|
| 1      | 0      |
| 3      | 1      |

| pageID | LSN in mem | LSN on disk |
|--------|------------|-------------|
| 1      | 5          | ?           |
| 3      | 1          | ?           |

### Part 3:

If the log is not written before the crash, the answer will be the same as in part 2.  If the new log records are flushed to disk, the answers will be:

After analysis:
| TID | lastLSN | Status   |
|-----|---------|----------|
| 2   | 5       | Aborting |

| pageID | recLSN |
|--------|--------|
| 1      | 0      |
| 3      | 1      |

After redo:
| TID | lastLSN | Status   |
|-----|---------|----------|
| 2   | 5       | Aborting |

| pageID | recLSN |
|--------|--------|
| 1      | 0      |
| 3      | 1      |

| pageID | LSN in mem | LSN on disk |
|--------|------------|-------------|
| 1      | 5          | ?           |
| 2      | --         | ?           |
| 3      | 1          | ?           |

After undo:
| LSN | prevLSN | TID | Type      | PageID |
|-----|---------|-----|-----------|--------|
| 0   | null    | 1   | Insert    | 1      |
| 1   | 0       | 1   | Update    | 3      |
| 2   | null    | 2   | Update    | 1      |
| >>3 | 1       | 1   | Commit    |        |
| 4   | 3       | 1   | End       |        |
| 5   | 4       | 2   | CLR(null) | 1      |
| 6   | 5       | 2   | End       |        |

| TID | lastLSN | Status |
|-----|---------|--------|

| pageID | recLSN |
|--------|--------|
| 1      | 0      |
| 3      | 1      |

| pageID | LSN in mem | LSN on disk |
|--------|------------|-------------|
| 1      | 5          | ?           |
| 3      | 1          | ?           |

## 3. Schedules

1. Since T2 aborts and T1 commits, the schedule is equivalent to running just T1.  Thus the schedule is both serializable and conflict serializable.  Since no other transaction read the value of X written by T2 before it aborted, the schedule must be recoverable and must avoid cascading aborts.  The schedule is not strict, however, since the value of X written by T2 is overwritten by T1 before T2 commits.
2. The schedule is not strict, since T2 reads X after T1 has written it but before T1 has committed.  It's also not recoverable, since T2 commits before a transaction that wrote a value it read.  Since it is not recoverable, it doesn't even make sense to talk about it avoiding cascading aborts.  The schedule is serializable, since T2 is read-only, but it is not conflict serializable.
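Conflict serializability claims like these can be checked mechanically with a precedence graph. A minimal sketch; the schedule `s` below is hypothetical (the actual schedules are given in the problem statement), constructed so its graph contains the cycle T1 -> T2 -> T1:

```python
from itertools import combinations

def conflict_serializable(schedule):
    """schedule: list of (tid, op, item) in execution order, op in {'R', 'W'}."""
    # precedence graph: edge t -> u for each conflicting pair with t's op first
    edges = set()
    for (t, o1, x), (u, o2, y) in combinations(schedule, 2):
        if t != u and x == y and 'W' in (o1, o2):
            edges.add((t, u))

    def reachable(a, b, seen=()):
        return any(v == b or (v not in seen and reachable(v, b, seen + (a,)))
                   for u, v in edges if u == a)

    # conflict serializable iff the precedence graph is acyclic
    return not any(reachable(t, t) for t, _, _ in schedule)

s = [('T1', 'W', 'X'), ('T2', 'R', 'X'), ('T2', 'R', 'Y'), ('T1', 'W', 'Y')]
print(conflict_serializable(s))   # False: the graph has the cycle T1 -> T2 -> T1
```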

## 4. Two transactions

1. Both serial orders of the transactions yield consistent final states:
- T1;T2: {A=0,B=0} T1 {A=0,B=1} T2 {A=0,B=1}.
- T2;T1: {A=0,B=0} T2 {A=1,B=0} T1 {A=1,B=0}.

2. One possibility is:
| T1                   | T2                   | A | B |
|----------------------|----------------------|---|---|
| read(A)              |                      | 0 | 0 |
|                      | read(B)              | 0 | 0 |
|                      | read(A)              | 0 | 0 |
|                      | if (B==0) then A=A+1 | 0 | 0 |
|                      | write(A)             | 1 | 0 |
| read(B)              |                      | 1 | 0 |
| if (A==0) then B=B+1 |                      | 1 | 0 |
| write(B)             |                      | 1 | 1 |
3. There is no concurrent schedule of T1 and T2 that is serializable.  Since the question asked about serializability and not conflict serializability, we cannot reason about conflicts.  Instead, we have to consider all possible concurrent executions:
- Suppose that T1 starts first.  Its first instruction will read A=0, which means that it will eventually write B=1.  If the executions of the two transactions are concurrent, T2 must execute its first instruction before T1 completes, which means that its read(B) will fetch the value 0.  This means that T2 will eventually write A=1.  Thus, if T1 starts first, the final state will be inconsistent.
- If T2 starts first, a similar argument shows that the final state will have A=B=1, which violates the database constraint.
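Because both transactions are only a few operations long, the case analysis above can be verified exhaustively. A sketch, modeling each transaction as a generator that skips its update when its test fails; the consistency constraint assumed here is that at least one of A and B stays 0:

```python
from itertools import permutations

# T1: read(A); if A was 0, read and increment B.
def T1(db, local):
    local['a'] = db['A']; yield
    if local['a'] == 0:
        local['b'] = db['B']; yield
        db['B'] = local['b'] + 1; yield

# T2: read(B); if B was 0, read and increment A.
def T2(db, local):
    local['b'] = db['B']; yield
    if local['b'] == 0:
        local['a'] = db['A']; yield
        db['A'] = local['a'] + 1; yield

def run(order):
    """Execute one interleaving (a sequence of 'T1'/'T2' steps)."""
    db = {'A': 0, 'B': 0}
    gens = {'T1': T1(db, {}), 'T2': T2(db, {})}
    for t in order:
        try:
            next(gens[t])
        except StopIteration:
            pass                  # this transaction already finished
    return db['A'], db['B']

orders = set(permutations(['T1'] * 3 + ['T2'] * 3))
# keep only the truly concurrent orders (neither serial order)
concurrent_finals = {run(o) for o in orders
                     if list(o) not in (sorted(o), sorted(o, reverse=True))}
print(concurrent_finals)   # {(1, 1)}: every concurrent schedule ends inconsistent
```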

## 5. Conflict serializability and two phase locking

One answer is the schedule: T1:R(A); T2:R(A); T1:W(A).  The schedule is conflict equivalent to the serial schedule T2;T1, but it could not be produced by a two phase locking scheduler, since T1 could not release its lock on A until after the write to A, which would mean that T2:R(A) would have to be delayed.
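The argument can be replayed mechanically under the simplification it relies on: each transaction takes one lock per item, exclusive if it ever writes that item (no lock upgrades), and cannot release any lock before its own last operation. This is a sketch of that specific argument, not a general 2PL test:

```python
schedule = [('T1', 'R', 'A'), ('T2', 'R', 'A'), ('T1', 'W', 'A')]

writes = {(t, x) for t, op, x in schedule if op == 'W'}
mode = lambda t, x: 'X' if (t, x) in writes else 'S'   # lock mode per (txn, item)
# index of each transaction's last operation: its locks are held at least this long
last = {t: max(i for i, (u, _, _) in enumerate(schedule) if u == t)
        for t, _, _ in schedule}

producible = True
for i, (t, _op, x) in enumerate(schedule):
    for u, _, y in schedule[:i]:
        # an incompatible lock from an earlier transaction is still held
        # if that transaction has not yet reached its last operation
        if u != t and y == x and last[u] > i and 'X' in (mode(t, x), mode(u, y)):
            producible = False

print(producible)   # False: T1 holds X(A) until after its write, delaying T2:R(A)
```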