Parallel Databases
Explain Inter-operational and Intra-operational parallelism with relevant examples.
Intraoperation Parallelism- Relational operations work on relations that contain large sets of tuples, that we can parallelize the operations by executing them in parallel on different subsets of the relations.
- The number of tuples in a relation can be large so the degree of parallelism is potentially enormous.
- Hence, we can say intraoperation parallelism is natural in a database system.
The parallel versions of some common relational operations are as follows:
Parallel Sort- For example, we want to sort a relation that resides on n disks D0,D1,......Dn-1.
- If this relation is range partitioned on the attributes then each partition is sorted out separately and can concatenate the results to get the full sorted relation.
- As the tuples are partitioned on the n disks the time that is required for reading the entire relation is reduced by the parallel access.
- If the relation is partitioned in any other way in can be sorted out by using any of the following ways:
1. Range-partition it on the sort attributes and then sort each partition separately.
2. Use the parallel version of the external sort-merge algorithm.
Range-partitioning sort- It basically works in two steps: first is to range partition the relation and second is to sort out each partition separately.
- When we sort the relation it is not necessary to range -partition the relation on the same set of processors or disks as those on which that relation is stored.
- The range-partitioning should be done with a good range-partition vector so that each partition will approximately have the same number of tuples.
Parallel External Sort-Merge- It is an alternative to range partitioning.
- Suppose a relation has already been partitioned among the disks D0,D1,....Dn-1.
The parallel sort-merge will work in the following manner:
1. Each processor Pi will locally sort the data on the disk Di.
2. To get the final sorted output the system merges the sorted runs on each processor.
Parallel Join- The join operation tests the pairs of tuples to see whether they satisfy the join condition and if they do the system adds the pair to the join output.
- The parallel join algorithms attempt to split the pairs that are to be tested over several processors.
- Each processor then checks part of the join locally.
- After this the system collects the results from each of the processor for producing the final result.
The types of joins are:- Partitioned join
- Fragment and Replicate join
- Partitioned Parallel Hash join
- Parallel Nested-Loop join
Other relational operators- Selection
- Duplicate elimination
The duplicates can be eliminated by sorting by using either of the parallel sort techniques. The duplicate elimination can also be parallelized by partitioning the tuples and eliminating the duplicates locally at each processor.- Projection
The projection without the duplicate elimination can be performed as the tuples are read from the disk in parallel. To eliminate the duplicates any of the techniques can be used.- Aggregation
the operation can be parallelized by partitioning the relation on the grouping attributes and computing the aggregate values locally at each processor. Either hash partitioning or range partitioning can be used.
Interoperation parallelism
It has two types of parallelism:
1. Pipelined Parallelism- The parallel systems use the pipelining mainly for the same that the sequential systems do.
- The pipelines are a source of parallelism in the same way that the instructions pipelines are a source of parallelism in hardware design.
- Two operations can be run simultaneously on different processors so that the tuple consumes the tuples in parallel to the one producing them.
- This form of parallelism is known as pipelined parallelism.
Independent parallelism- The operations in a query expression that do not depend on one another can be executed in parallel. This is known as independent parallelism.
- The independent parallelism does not provide a high degree of parallelism and is less useful in a highly parallel system, even if it is useful with a lower degree of parallelism.
Explain I/O parallelism? Define Parallelism on Multicore processor.
- For the purpose of parallel I/O the data can be partitioned across multiple disks.
- The relational operators such as the sort, join, aggregation can be executed in parallel.
- The data here can be partitioned in such a manner that each processor can work independently on its own partition.
- The queries are expressed in the high level language which helps in making the parallelization easier.
- Different queries can be made run in parallel with each other the conflicts can be taken care by the concurrency control.
- Hence, we can say that the databases lend themselves to parallelism.
I/O Parallelism- It helps in reducing the time that is required to retrieve the relations from the disk by partitioning.
- All the relations are maintained on multiple disks.
- Horizontal partitioning is where the tuples of a relation are divided among many other disks such that each of the tuple resides on one disk.
- Partitioning techniques used in I/O parallelism.
- Assume that the number of disks = n.
The partitioning techniques are as follows:
Round Robin- It scans the relation in any order and sends the ith tuple to disk number Di mod n.
- The scheme ensures an even distribution of tuples across disks; that is, each disk has approximately the same number of tuples as the others.
Hash partitioning- It is a declustering strategy that designates one or more attributes from the given relation’s schema as the partitioning attributes.
- A hash function is chosen whose range is {0, 1, . . . , n - 1}.
- Each tuple of the original relation is hashed on the partitioning attributes.
- If 'i' is returned by the hash function, then the tuple is placed on disk Di 1.
Range partitioning- It distributes the tuples by assigning contiguous attribute-value ranges to each disk.
- It selects a partitioning attribute, A, and a partitioning vector [v0, v1, . . . , vn-2], such that, if i < j, then vi < vj.
- The relation is partitioned as follows: Consider a tuple 't' such that t[A] = x. If x < v0, then 't' goes on disk D0. If x = vn-2, then 't' goes on disk Dn-1. If vi = x < vi+1, then 't' goes on disk Di+1.
- Example of this can be with three disks numbered 0, 1, and 2 that may assign tuples with values less than 5 to disk 0, values between 5 and 40 to disk 1, and values greater than 40 to disk 2.
Comparison of Partitioning Techniques- A relation can be retrieved in parallel by using all the disks once a relation has been partitioned among several disks.
- Similarly, when a relation is being partitioned, it can be written to multiple disks in parallel.
- The rate transfer for reading or writing an entire relation are much faster with I/O parallelism than without it.
- However, it is only one kind of access to data for reading an entire relation, or scanning a relation.
Access to data can be classified as follows:- The entire relation is scanned.
- A tuple is located associatively (example, employee name = “Pooja”); these queries, also known as point queries, seek tuples that have a specified value for a specific attribute.
- Locating all tuples for which the value of a given attribute lies within a specified range (example, 10000 < salary < 20000); these queries are called range queries.