Big Data Analytics Spring 2022 Homework #3

Big Data Analytics
Spring 2022
Homework #3
Due Date: March 2
nd, 2022, 10PM
Consider the following set of transactions and
answer the following questions.
1. (30) Using the data in this table show the
execution of the two parallel algorithms,
Count Distribution and Data Distribution,
described in the paper discussed in class on
Feb 17
th. Assume a minimum support level
of 3. Show the execution of each algorithm
only till the computation of frequent 2-
itemsets.
2. (10) Which of the above two algorithms, in
general, will require the most
communication among the servers? Justify
your answer and you can use instances
from answer to Q#1 above.
3. (10) Which of the above two algorithms, in
general, will require the smallest memory
at servers? Justify your answer and you can
use instances from answer to Q#1 above.
4. (15) What issues will need to be addressed if we want to use the third algorithm, the
Candidate distribution algorithm, on this data. Specifically, address the issues of which
data and which candidates will go on each processor. Do not show the execution of the
algorithm for this data.
5. (20) Use FP-Tree algorithm to generate a separate FP-tree at each of the servers. Now
merge the two FP-Trees to form a single FP-Tree.
6. (15) Propose your intuitive idea for an algorithm to, in general, merge two FP-trees to
form a merged FP-Tree, and give a pseudo-code for your proposed algorithm. For a
context, assume that each Map server uses a combiner to generate an FP-Tree for its
data chunk. The Reducer then takes the trees from all Mapper servers and generates
the combined global FP-Tree.

Trans ID Items in
Transaction
Server in
which the
item
resides
T1 A B C E M R Server1
T2 B C G M P Server1
T3 A C E T Server1
T4 B E G P Server1
T5 A B G P Q Server2
T6 B C G M Server2
T7 A B C G Server2
T8 A C G S T Server2