The big data bootstrap
Kleiner et al. (2012) proposed a method that extends the original bootstrap (Efron, 1979) to very large datasets, commonly called big data. Kleiner's method, called the Bag of Little Bootstraps (BLB), combines the original bootstrap with the subsampling technique of Politis et al. (1999) to reduce the computation in the bootstrap process. As a result, BLB only requires computation on small subsets of the data, far less than the original bootstrap. Efron (1979) showed that each bootstrap re-sample contains approximately 0.632n distinct data points, which is very large when n is large, whereas each BLB re-sample contains far fewer than 0.632n distinct points. The size of each BLB subsample may be chosen as b = n^γ, where γ ∈ [0.5, 1]. The paper used simulated datasets drawn from different generating distributions (Bernoulli, Normal, and Gamma) and real datasets from the UCI repository to compare BLB with the bootstrap experimentally. Two different settings were considered: regression and classification, with n set to 20,000. The results showed that BLB normally requires less time than the bootstrap while keeping the same high accuracy. Because each large bootstrap re-sample recalls about 63% of the data, a very large dataset may overflow the computer's memory during re-sampling. In contrast, BLB reduces the amount of data handled in each computation: for example, if the dataset occupies 1 TB, each bootstrap re-sample contains approximately 632 GB, whereas each BLB subsample and re-sample (with γ = 0.6) occupies only about 4 GB. The researchers also suggested that BLB is well suited to parallel computing architectures.
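To make the procedure concrete, here is a minimal sketch of BLB in Python (NumPy). The function names and default values are mine, not the authors': BLB draws s small subsamples of size b = n^γ, simulates full-size re-samples over each subsample with multinomial counts, and averages the per-subsample quality assessments (here, a standard error).

```python
import numpy as np

def blb_stderr(x, estimator, gamma=0.6, s=10, r=50, rng=None):
    """Bag of Little Bootstraps: estimate the standard error of `estimator`.

    A minimal sketch of the procedure described by Kleiner et al. (2012);
    names and defaults here are illustrative, not the authors' own code.
    """
    rng = np.random.default_rng(rng)
    n = len(x)
    b = int(n ** gamma)                  # subsample size b = n^gamma, gamma in [0.5, 1]
    per_subsample = []
    for _ in range(s):                   # s independent small subsamples
        sub = rng.choice(x, size=b, replace=False)
        estimates = []
        for _ in range(r):               # r re-samples of nominal size n
            # A re-sample is represented by multinomial counts over only the
            # b distinct subsample points, so it needs O(b), not O(n), memory.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(estimator(sub, counts))
        per_subsample.append(np.std(estimates))
    return np.mean(per_subsample)        # average the per-subsample assessments

def weighted_mean(values, counts):       # a weighted estimator, as BLB requires
    return np.average(values, weights=counts)

x = np.random.default_rng(0).normal(size=20_000)
print(blb_stderr(x, weighted_mean))      # roughly 1 / sqrt(20_000) ≈ 0.0071
```

Because each re-sample is stored as counts over only b distinct points, the memory needed per computation scales with n^γ rather than 0.632n, which is exactly the 632 GB versus 4 GB contrast above. The s subsample loops are also independent of one another, which is why BLB parallelizes so naturally.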
This study provides both experimental results and a theoretical investigation, including a study of BLB's statistical performance. However, there are some limitations.
All computations in the bootstrap process are performed in RAM, whereas the very large original dataset has to be stored on the computer's hard drive. The researchers may therefore run into problems when a dataset is larger than the ones used in their experiments.
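One way to mitigate this, sketched below under my own assumptions (the dataset as a flat binary file of float64 values; none of this is from the paper), is to memory-map the file on disk and pull only the b = n^γ subsampled points into RAM:

```python
import numpy as np

path, n = "big_dataset.bin", 5_000_000   # demo size; the idea is the same at any scale

# One-off setup so the sketch runs end to end (a real dataset would already exist).
mm = np.memmap(path, dtype=np.float64, mode="w+", shape=(n,))
mm[:] = np.random.default_rng(0).normal(size=n)
mm.flush()

data = np.memmap(path, dtype=np.float64, mode="r", shape=(n,))  # nothing loaded yet

rng = np.random.default_rng(1)
b = int(n ** 0.6)                        # subsample size n^gamma for gamma = 0.6
idx = rng.choice(n, size=b, replace=False)
subsample = np.asarray(data[idx])        # only these b points are copied into RAM
```

With this pattern, even a 1 TB file would only ever materialize the subsample in memory, though the paper does not state whether the authors' implementation works this way.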
The parameter γ is an important factor in this paper. To find an optimal value, γ was varied in the experiments, and the results suggest that γ = 0.7 is a reasonable and effective choice for many datasets. However, the researchers did not explain or prove this claim. In the real world, γ should not be a constant; it should be chosen appropriately for each dataset.
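The paper does not say how to do this; one simple, admittedly ad hoc heuristic would be to increase γ until the BLB output stabilizes. A sketch, reusing the hypothetical blb_stderr function from the earlier example:

```python
def pick_gamma(x, estimator, gammas=(0.5, 0.6, 0.7, 0.8), tol=0.05):
    """Ad hoc heuristic, not from the paper: step gamma upward until the
    BLB estimate changes by less than `tol` relative to the previous one."""
    prev = None
    for g in gammas:
        est = blb_stderr(x, estimator, gamma=g, rng=0)  # sketch defined earlier
        if prev is not None and abs(est - prev) <= tol * abs(prev):
            return g, est                # estimate has stabilized at this gamma
        prev = est
    return gammas[-1], prev              # fall back to the largest gamma tried
```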
The researchers did not give in-depth details, such as how the parameters were set for each tool. As a result, the experiments could not be repeated in the same way, and the results were only shown in a few graphs of relative error versus time (seconds).
The strength of this study is that, although the bootstrap method is widely used in research, no one had previously managed to reduce its computational time while keeping high accuracy on big data. The results of this study showed that BLB takes less computational time than the original bootstrap because the amount of data processed per re-sample is reduced in the bootstrap process.
Reference List
Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. I. (2012). The big data bootstrap. Proceedings of the 29th International Conference on Machine Learning (pp. 1759-1766). Edinburgh, Scotland: Omnipress.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1-26.
Politis, D. N., Romano, J. P., & Wolf, M. (1999). Subsampling. New York: Springer.
Note:
I changed some tenses following my friends' suggestions. For example, Ajaree (suwatthikul.blogspot) suggested that "Efron (1979) shows that …" in line 6 of paragraph 1 should be in the past tense.
In paragraph 5, Edward (edwardkrit.blogspot) pointed out other mistakes. His first suggestion concerned the sentence "As a result, the experiments could not repeat in the same way": he thought "the experiments" cannot act by themselves, so he suggested that I use the passive voice. I agree with him, so I changed it to "As a result, the experiments could not be repeated in the same way."
In paragraph 3, Edward's second suggestion concerned the phrase "All computations in the bootstrap process is used on RAM": "is" should be changed to "are". Quanta (qquanta.blogspot) had a suggestion about the same sentence; she told me that "All computations in …" could be changed to "All computing in …".
I also edited line 11 of paragraph 1 myself, changing "Two different setting were considered: regression and classification" to "Two different settings were considered: Regression and classification".