Some Practice Problems
The questions appear in this font
And the answers in this font
SQL Overview
Beer SQL and Relational Algebra
Consider the beer drinkers' database with the tables
FREQUENT(Drinker, Bar)
SERVES(Bar, Beer)
LIKES(Drinker, Beer)
SELLS(Bar, Beer, Amount)
- Write a view PAIRS with schema (Drinker, Friend). The table contains all pairs of Drinker (D) and Friend (F), where F is a Drinker who
- Likes at least one beer that D also likes, and
- Frequents in at least one bar that D also frequents and this bar serves one or more of the beers they both like
- Write a relational algebra expression for the view. (Recall a relational algebra expression for a view is identical to the expression for the query defining the view.) The expression you will write must have the following properties
- No
cartesian products
- Every selection or join operator must be associated with an atomic condition, i.e., a condition of the form A=B or A!=B, but no condition of the form, say, C1 AND C2.
- Because of the transitivity of equality there are many equivalent ways to write the following query CREATE VIEW PAIRS(Drinker, Friend) AS SELECT F1.Drinker, F2.Drinker FROM FREQUENT F1 F2, SERVES, LIKES L1 L2 WHERE F1.Drinker!=F2.Drinker AND F1.Bar=F2.Bar AND F1.Bar=SERVES.Bar AND L1.Beer=L2.Beer AND L1.Drinker=F1.Drinker AND L2.Drinker=F2.Drinker AND L1.Beer=SERVES.Beer
- Relational algebra expression…
Beer Aggregation
Find the Bar that sells the highest total Amount of Beer. Hint: You can use a nested query in the HAVING clause of an SQL query.
SELECT Bar FROM SELLS
GROUPBY Bar
HAVING SUM(Amount) >= ALL
(SELECT SUM(Amount) FROM SELLS GROUPBY Bar)
Storage
Sequential Vs Random Access
Consider a table R which is stored in consecutive pages of the disk. We know that reading pages of the disk in the sequential order in which they are stored is at least ten times faster than reading pages in some random order. Give two reasons why we may still prefer to read in random order instead of reading sequentially.
- Using an index structure we may have to read very few pages in a random order
- If we want to do a merge join on attribute A we may have to read the tuples in increasing order of A. However, the relation may not be stored in the same order with A.
Indexing
No More Indexes Please…
Provide reasons why we should not keep an index on an attribute A of some relation R.
- The index is useless for the queries that are typically submitted to the database. For example, if no query requests a selection or a join on A there is no reason to maintain an index on A.
- Indexes make deletions and insertions more expensive because the index has to be updated.
- Indexes require storage space.
Estimating the Cost of Operations on a Dense Index
Consider a relation R(K,A,B,…) where the attribute K is an integer key. The relation is sorted along the key K, i.e., K is a primary key. Then there is a multilevel dense index on K. Assume that the top level of the multilevel index always has just one page.
Each tuple has a block pointer pointing to the next tuple. Also, each index entry has a pointer to the next entry. As usual, the system attempts to "pack" consecutive tuples in contiguous blocks and uses the overflow when this is not possible.
Assume that n tuples are inserted in R.
- The size of a disk page is d.
- The size of a tuple of R is r.
- The size of each K value is k bytes.
- Block pointers are used and each one is p bytes long.
- No use of main memory buffers.
- Records (tuples and index entries) do not span blocks.
Questions
- Assume that initially the tuples with keys m*i, i=1,…,n and the corresponding index entries are stored in consecutive blocks, no overflow is needed and space utilization is 66%. Display the state of the indexes and the relation assuming p=6, r=44, k=20, d=160, m=1000, n=30.
- Assume that at every minute j there's a burst of inserting 100 tuples with keys m*j+k, k=1,…,l in ascending order. If you were the administrator how often would you reorganize this database?
The Weird Properties of C+ Trees
The C+ trees are a variation of B+ trees. They have the following properties:
- Each page/node of the C+ tree has m to 3m/2 elements (m is even). The root may have any number of elements.
- m > 2
- has more than one levels
- All other properties of the C+ tree are identical with the B+ tree.
Write an O(logn) procedure for insertion of a new element e into a C+ tree.
The problem with insertion in a C+ tree is that if the page p, where e should appear, is full we cannot simply allocate one more page and split the elements of p because then each page will have (3m+1)/4 elements - unacceptable. Furthermore, it may not be sufficient to find adjacent sibling pages p1 p2 … pa merge their contents and split them in a+1 pages. Here goes the solution
locate the leaf page p where e should appear
if p has less than 3m/2 elements (before the insertion)
insert e in p
if e is now the first element in p
update the appropriate parent node
else if p has exactly 3m/2 elements and there is an adjacent sibling p' with less than 3m/2 elements
if p' is the left adjacent sibling of p
transfer the first element of p to p'
insert e in p
update the appropriate parent node of p to reflect the change of the first element of p
if p' is the right adjacent sibling of p
if e is the largest than any element of p
insert e in p'
update the parent node to reflect that e is the first element of p'
else
transfer the last element of p to p'
update the parent node to reflect the new first element of p'
insert e in p
else if p has no adjacent sibling with less than 3m/2 elements
pick randomly one adjacent sibling p' of p
(Note that p' definitely has 3m/2 elements. For simplicity let us assume that p' is the right adjacent sibling of p. The case for left adjacent sibling is similar.)
concatenate the elements of p', p, and the element e into a list l and then sort l.
(note that l has exactly 3m/2 + 3m/2 + 1 = 3m+1 elements)
allocate a new page p''
put the first m elements of l in p
put the second m elements of l in p'
put the last m+1 elements of l in p''
update the parent to reflect the new first element of p'
insert in the parent node of p the first element of p'' along with a pointer to p''
Query Processing
B+ trees for joins
Consider
- a B+ tree index on some attribute A of a relation R
- the join query SELECT * FROM R, S WHERE R.A=S.B
Check the join techniques, from the list below, that can make use of the index. For each one, describe very briefly why the index is useful (i.e., how it is used) if it is useful.
Consider only the following four join techniques:
- nested loops join
- index join
- merge join
- hash join
A B+ tree index is useful for
- index join: given a tuple r of R the index provides the tuples of S that have the value r.A in attribute B
- merge join: Indexes on A and B can allow us to retrieve the tuples of R and S in increasing order of A and B.
Relational Algebra Transformation Rules
For each one of the two rules below state whether the rule holds or not. If the rule holds show why every tuple generated by the left hand expression will also be generated by the right hand expression, for every R and S. Otherwise generate an example where the rule breaks down.
- p
B(R1 -R2) = (pB R1) - (pB R2)
- S JOINA=B(R1 È R2) = (S JOINA=B R1) È (S JOINA=B R2)
Assume R1 and R2 have the same schema, they have an attribute named B, and they do not have an A attribute. S has an A attribute.
- The first rule does not hold. Consider
R={(C=2,B=1)} and S={(C=3,B=1)}.
- The second rule holds because
S JOINA=B(R1 È R2) = sA=B(S x (R1 È R2)) = sA=B((S x R1) È (S x R2)) = sA=B(S x R1) È sA=B(S x R2) = (S JOINA=B R1) È (S JOINA=B R2)
Choosing Selection Technique
System R-CSE232 supports the techniques of System R for performing selections and in addition it supports a selection execution technique based on pointer intersection. This is the complete list of selections that System R-CSE232 can do:
- s
CSCANR : Scans the whole table R and outputs the tuples that satisfy condition C, where C is a condition that refers to R attributes only. Apparently, C can be arbitrarily complicated (e.g., it could be Year*1.5 > Salary + 2*Children).
- s
CINDEXR : Uses an index to find the tuples that satisfy condition C. C may be of the form A=2 or A>2 (A>=2, etc) if there is B+ tree on A. If the index is a hash structure on A then C is of the form A=2.
- s
CPINTR : It uses the pointer intersection method that we have discussed in class. C has the form C1 AND C2 AND … AND Cm where Ci has the form A=2 or A>2 if there is a B+ tree index on A or the form A=2 if there is a hash index on A. The technique retrieves the sets S1 , S2 , … , Sm of pointers to the tuples that satisfy the corresponding conditions C1 , C2 , … , Cm, then it intersects the sets, and retrieves the tuples that correspond to the intersection.
Consider the following query
SELECT * FROM R WHERE B=10 AND C=20 AND D=30
Consider the following statistics
- R has 108 tuples, is not sorted, and each page contains as many tuples as possible.
- The page size is 8192 bytes but only 8004 bytes can be used by the database.
- The tuple size is 80 bytes.
- Each one of the attributes B, C, and D of R is 4 bytes long.
- The index structure uses record pointers to the table and they are 4 bytes long. For simplicity, assume that the block pointers are also 4 bytes long.
- There are B+ indexes on B, C, D and they are as full as possible. Do not confuse C with the condition C in the above paragraph.
- The conditions of the query are independent, i.e., the probability that one condition is satisfied for one tuple is independent of the probability that it is satisfied for another condition.
State which is the best plan for each of the following combinations of V(B,R), V(C,R), and V(D,R). Provide approximate calculations that prove your statement (do not worry for non-dominating factors in the cost.)
- V(B,R)=5 107 , V(C,R)=104 , V(D,R)=104
- V(B,R)=103 , V(C,R)=104 , V(D,R)=102
- V(B,R)=102 , V(C,R)=102 , V(D,R)=102
For example, you may say sD=30SCAN(sB=10 AND C=20PINTR) is the optimal plan for the third case.
Hint: It is only the first selection that has a cost and is important. Whatever conditions are not handled by the first selection can be put in a subsequent scan selection that is executed directly on the result of the first selection. In the example, sD=30SCANR does not cost anything.
- s
C=20 AND D=30SCAN(sB=10 INDEXR) Since V(B,R)= 5 107 we expect that only two tuples will satisfy the condition B=10. A B+ tree index will be 3 levels high (because 102x3 < 5 107 < 103x3). Hence retrieving the two tuples that satisfy the condition B=10 requires 5 disk accesses. Any PINT plan will have to retrieve 104 pointers, which occupy at least 10 pages. Hence no PINT plan can be better than the suggested one.
- s
D=30SCAN(sB=10 AND C=20PINTR) Using an INDEX plan can not be very efficient anymore because even the most selective condition (C=20) requires that 108-4 tuples are retrieved. Among the PINT plans with 2 conditions the best one will be the suggested plan because it has the two most selective conditions. Assuming full B+ trees this plan requires. Retrieval of 108-3 pointers that satisfy the B condition. This takes around 108-3-3=102 disk accesses. Similarly, the B condition requires 10 disk accesses. The intersection of the two pointer lists results in 10 common pointers, i.e., we have to access another 10 pages to get the tuples pointed by these pointers.
- s
B=10 AND C=20 AND D=30PINTR If we use a PINT plan with two conditions we’ll end up retrieving 108-2-2=104 tuples. It is worthwhile then considering the PINT with 3 conditions. Retrieving the pointers that satisfy B, C, and D, requires 3x108-2-3=3x103 pages. The intersection consists of 100 pointers only. Hence it is clear this is the best plan.