Advances in Database Technology- P13


582 S. Ganguly, M. Garofalakis, and R. Rastogi

The second observation is that the subjoin size $J_{dd} = \hat{f} \cdot \hat{g}$ between the dense frequency components can be computed accurately (that is, with zero error), since $\hat{f}$ and $\hat{g}$ are known exactly. Thus, sketches are only needed to compute subjoin sizes for the cases when one of the components is sparse.

Let us consider the problem of estimating the subjoin size $J_{ds} = \hat{f} \cdot g'$, where $g'$ denotes the sparse component of $g$. For each domain value $u$ that is non-zero in $\hat{f}$, an estimate for the quantity $\hat{f}(u) \cdot g'(u)$ can be generated from each hash table $p$ by multiplying $\hat{f}(u) \cdot \xi_p(u)$ with $H_G[p, q]$, where $q = h_p(u)$ and $\xi_p(u)$ is the $\pm 1$ sign that table $p$ assigns to $u$. Thus, by summing these individual estimates for hash table $p$, we can obtain an estimate $\hat{J}^p_{ds}$ for $J_{ds}$ from hash table $p$. Finally, we can boost the confidence of the final estimate $\hat{J}_{ds}$ by selecting it to be the median of the set of per-table estimates $\{\hat{J}^p_{ds}\}$. Estimating the subjoin size $J_{sd} = f' \cdot \hat{g}$ is completely symmetric; see the pseudo-code for procedure EstSubJoinSize in Figure 4.

To estimate the subjoin size $J_{ss} = f' \cdot g'$ (Steps 3-7 of procedure EstSkimJoinSize), we again generate estimates $\hat{J}^p_{ss}$ for each hash table $p$, and then select the median of the estimates to boost confidence. Since the $p$th hash tables in the two hash sketches $H_F$ and $H_G$ employ the same hash function $h_p$, the domain values that map to a bucket $q$ in each of the two hash tables are identical. Thus, the estimate $\hat{J}^p_{ss}$ for each hash table $p$ can be generated by simply summing $H_F[p, q] \cdot H_G[p, q]$ over all the buckets $q$ of hash table $p$.

Analysis. We now give a sketch of the analysis for the accuracy of the join size estimate $\hat{J}$ returned by procedure EstSkimJoinSize. First, observe that, in expectation, $\hat{J} = J$. This is because $\hat{J}_{dd} = J_{dd}$ and, for all other $i, j$, $E[\hat{J}_{ij}] = J_{ij}$ (shown in [4]). Thus, $E[\hat{J}] = J_{dd} + J_{ds} + J_{sd} + J_{ss} = J$. In the following, we show that, with high probability, the additive error in each of the estimates $\hat{J}_{ij}$ (and thus also in the final estimate $\hat{J}$) is at most $O\big((n^2/s_2)\,(\log n)^{1/2}\big)$.
Intuitively, the reason for this is that these errors depend on hash bucket self-join sizes, and since every residual frequency $f'(u)$ in $H_F$ and $H_G$ is at most […], each bucket self-join size is […].
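
The estimation steps described above (exact dense-dense subjoin, per-table estimates for the mixed and sparse-sparse subjoins, then a median over hash tables) can be illustrated with a small Python sketch. It is not the authors' pseudo-code from Figure 4. The helper names `make_hashes`, `build_sketch`, and `split_dense_sparse`, the `threshold` parameter, and the use of plain random tables for the hash and sign functions are all assumptions made for illustration. In particular, the dense/sparse split here is taken from exact frequency vectors for simplicity, which a streaming implementation would not have available.

```python
import random
from statistics import median


def make_hashes(s1, s2, m, seed=0):
    """Hypothetical helper: for each of s1 tables, a bucket hash h[p][u]
    in [0, s2) and a +/-1 sign xi[p][u] for every domain value u < m."""
    rng = random.Random(seed)
    h = [[rng.randrange(s2) for _ in range(m)] for _ in range(s1)]
    xi = [[rng.choice((-1, 1)) for _ in range(m)] for _ in range(s1)]
    return h, xi


def build_sketch(freq, h, xi, s2):
    """H[p][q] = sum of xi_p(u) * freq[u] over all u with h_p(u) == q."""
    H = [[0] * s2 for _ in h]
    for p in range(len(h)):
        for u, c in enumerate(freq):
            H[p][h[p][u]] += xi[p][u] * c
    return H


def split_dense_sparse(freq, threshold):
    """Assumed split: frequencies >= threshold are dense, the rest sparse."""
    dense = {u: c for u, c in enumerate(freq) if c >= threshold}
    sparse = [0 if u in dense else c for u, c in enumerate(freq)]
    return dense, sparse


def est_sub_join_size(dense, H_sparse, h, xi):
    """Dense-times-sparse subjoin: per-table sums, then the median."""
    per_table = []
    for p in range(len(h)):
        per_table.append(
            sum(c * xi[p][u] * H_sparse[p][h[p][u]] for u, c in dense.items())
        )
    return median(per_table)


def est_sparse_join_size(HF, HG):
    """Sparse-times-sparse subjoin: bucket-wise products, then the median."""
    return median(
        sum(a * b for a, b in zip(row_f, row_g)) for row_f, row_g in zip(HF, HG)
    )


def est_skim_join_size(f, g, h, xi, s2, threshold):
    f_dense, f_sparse = split_dense_sparse(f, threshold)
    g_dense, g_sparse = split_dense_sparse(g, threshold)
    HF = build_sketch(f_sparse, h, xi, s2)
    HG = build_sketch(g_sparse, h, xi, s2)
    # Dense-dense subjoin is computed exactly, with no sketch involved.
    j_dd = sum(c * g_dense[u] for u, c in f_dense.items() if u in g_dense)
    j_ds = est_sub_join_size(f_dense, HG, h, xi)
    j_sd = est_sub_join_size(g_dense, HF, h, xi)
    j_ss = est_sparse_join_size(HF, HG)
    return j_dd + j_ds + j_sd + j_ss
```

When every bucket holds at most one domain value (for example, an injective `h[p]` with `s2` equal to the domain size), the sign factors cancel and each subjoin estimate equals the true subjoin size. With fewer buckets, collisions introduce error, which is what the median over tables is meant to dampen.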