The Big Data Bootstrap
Kleiner et al. (2012) proposed a method that extends the original bootstrap (Efron, 1979) to very large datasets, commonly called big data. Kleiner's method, the Bag of Little Bootstraps (BLB), combines the original bootstrap with the subsampling technique (Politis et al., 1999) to reduce the computation in the bootstrap process. As a result, BLB requires computation only on subsets much smaller than those the original bootstrap uses. Efron (1979) showed that each bootstrap resample contains approximately 0.632n distinct data points, which is very large when n is large, whereas each BLB subsample contains far fewer. The BLB subset size may be chosen as b = n^γ, where γ ∈ [0.5, 1]. The paper compared BLB and the bootstrap on simulated datasets drawn from several generating distributions (Bernoulli, Normal, and Gamma) and on real datasets from the UCI repository. Two settings were considered, regression and classification, with n set to 20,000. The results showed that BLB normally requires less time than the bootstrap while achieving the same high accuracy. Because each bootstrap resample recalls about 63% of the data, the computation may overflow a computer's memory when the dataset is very large. In contrast, BLB reduces the amount of data handled in each computation. For example, if the dataset is 1 TB, each bootstrap resample contains approximately 632 GB, whereas a BLB subsample with γ = 0.6 requires only about 4 GB per computation. The researchers also suggested that BLB can exploit parallel computing architectures.
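The BLB procedure described above can be sketched in a few lines. This is only a minimal illustration under my own assumptions, not the authors' implementation: the function name, the default parameter values, and the choice of the sample mean (with a percentile confidence-interval width as the quality measure) are mine. The key memory trick is that each "resample" is stored as b multinomial counts rather than n data points:

```python
import numpy as np

def blb_mean_ci(data, gamma=0.7, num_subsets=10, num_resamples=50,
                alpha=0.05, rng=None):
    """Bag of Little Bootstraps sketch: average percentile-CI width for
    the mean.  Each subset holds b = n**gamma points; each resample is a
    vector of multinomial counts simulating a draw of n points from the
    b-point subset, so no resample ever materialises n values.  (For the
    plain bootstrap, each size-n resample would contain an expected
    1 - (1 - 1/n)**n ~ 1 - 1/e ~ 0.632 fraction of distinct points.)
    """
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    n = len(data)
    b = int(n ** gamma)                      # small-subset size, gamma in [0.5, 1]
    widths = []
    for _ in range(num_subsets):
        # draw one subset of b distinct points without replacement
        subset = rng.choice(data, size=b, replace=False)
        estimates = []
        for _ in range(num_resamples):
            # counts summing to n over the b subset points: a size-n
            # bootstrap resample represented with only b integers
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(np.dot(counts, subset) / n)
        lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
        widths.append(hi - lo)
    # average the per-subset CI widths, as BLB averages per-subset results
    return float(np.mean(widths))
```

Because each subset is processed independently, the outer loop is exactly the part the researchers suggest distributing across parallel workers.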
This study provides both experimental results and a theoretical investigation, including an analysis of BLB's statistical performance. However, there are some limitations.
First, all computations in the bootstrap process are performed in RAM, while the very large original dataset must be stored on a computer's hard drive. Problems may therefore arise when a dataset is much larger than those used in the experiments.
Second, the parameter γ is an important factor in this paper. To find an optimal value, γ was varied in the experiments, and γ = 0.7 appeared to be a reasonable and effective choice for many datasets. However, the researchers did not explain or justify that choice. In practice, γ should not be a fixed constant; it should be chosen appropriately for each dataset.
Third, the researchers did not give in-depth details, such as how the parameters were set in each tool. As a result, the experiments could not be repeated in the same way, and the results were shown only in graphs of relative error versus time (sec).
The strength of this study is that although the bootstrap method is widely known and used in much research, no previous work had reduced its computational time while keeping high accuracy on big data. The results of this study showed that BLB uses less computational time than the original bootstrap because the amount of data resampled in the bootstrap process is reduced.
Reference List
Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. I. (2012). The big data bootstrap. Proceedings of the 29th International Conference on Machine Learning (pp. 1759-1766). Edinburgh, Scotland: Omnipress.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1-26.
Politis, D., Romano, J., & Wolf, M. (1999). Subsampling. New York: Springer.
My comments on friends' blogs:
#1
http://suwatthikul.blogspot.com/2015/04/major-project.html?showComment=1429632701204#c6401955663896746197
#2
http://woratouch.blogspot.com/2015/04/draft-2-major-project.html?showComment=1429633631474#c9084318632999150175
#3
http://sujitratc.blogspot.com/2015/04/major-project-draft-1.html?showComment=1429633809559#c5705402777656864666
#4
http://edwardkrit.blogspot.com/2015/04/my-major-project.html?showComment=1429634244560#c7011589826254288019
#5
http://suphatka.blogspot.com/2015/04/major-project-draft-1.html?showComment=1429634839678#c5891116359063417312
Comments
#1: I'm not sure. I think "Efron (1979) shows that..." may be changed to "Efron (1979) showed that...".
Reply: Thank you.
#2: In my opinion, paragraph 3 may be changed to "All computing in...".
#3: Hi "P". In the phrase "All computations in the bootstrap process is used on RAM", shouldn't "is" be changed to "are"? And in the sentence "As a result, the experiments could not repeat in the same way", the experiments can't act by themselves, so I think you should use the passive voice, right? Thx
Reply: Thank you for your suggestions ^ ^