The Big Data Bootstrap
Kleiner et al. (2012) proposed a method that extends the original bootstrap (Efron, 1979) to very large datasets, commonly called big data. Kleiner's method, the Bag of Little Bootstraps (BLB), combines the original bootstrap with the subsampling technique (Politis et al., 1999) to reduce the computation in the bootstrap process. As a result, BLB requires computation only on subsets much smaller than those the original bootstrap uses. Efron (1979) showed that each bootstrap resample contains approximately 0.632n distinct data points, which is very large when n is large, whereas each BLB subsample contains far fewer. The BLB subset size may be chosen as b = n^γ, where γ ∈ [0.5, 1]. The paper compared BLB and the bootstrap on simulated datasets drawn from several generating distributions (Bernoulli, Normal, and Gamma) and on real datasets from the UCI repository. Two settings were considered, regression and classification, with n set to 20,000. The results showed that BLB normally requires less time than the bootstrap while achieving the same high accuracy. Because each bootstrap resample recalls about 63% of the data, the computation may overflow a computer's memory when the dataset is very large. In contrast, BLB reduces the amount of data handled in each computation. For example, if the dataset is 1 TB, each bootstrap resample contains approximately 632 GB, whereas a BLB subsample with γ = 0.6 requires only about 4 GB per computation. The researchers also suggested that BLB can exploit parallel computing architectures.
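The BLB procedure described above can be sketched in a few lines. This is only a minimal illustration under my own assumptions, not the authors' implementation: the function name, the default parameter values, and the choice of the sample mean (with a percentile confidence-interval width as the quality measure) are mine. The key memory trick is that each "resample" is stored as b multinomial counts rather than n data points:

```python
import numpy as np

def blb_mean_ci(data, gamma=0.7, num_subsets=10, num_resamples=50,
                alpha=0.05, rng=None):
    """Bag of Little Bootstraps sketch: average percentile-CI width for
    the mean.  Each subset holds b = n**gamma points; each resample is a
    vector of multinomial counts simulating a draw of n points from the
    b-point subset, so no resample ever materialises n values.  (For the
    plain bootstrap, each size-n resample would contain an expected
    1 - (1 - 1/n)**n ~ 1 - 1/e ~ 0.632 fraction of distinct points.)
    """
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    n = len(data)
    b = int(n ** gamma)                      # small-subset size, gamma in [0.5, 1]
    widths = []
    for _ in range(num_subsets):
        # draw one subset of b distinct points without replacement
        subset = rng.choice(data, size=b, replace=False)
        estimates = []
        for _ in range(num_resamples):
            # counts summing to n over the b subset points: a size-n
            # bootstrap resample represented with only b integers
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(np.dot(counts, subset) / n)
        lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
        widths.append(hi - lo)
    # average the per-subset CI widths, as BLB averages per-subset results
    return float(np.mean(widths))
```

Because each subset is processed independently, the outer loop is exactly the part the researchers suggest distributing across parallel workers.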
This study provides both experimental results and a theoretical investigation, including an analysis of BLB's statistical performance. However, there are some limitations.
First, all computations in the bootstrap process are performed in RAM, while the very large original dataset must be stored on a computer's hard drive. Problems may therefore arise when a dataset is much larger than those used in the experiments.
Second, the parameter γ is an important factor in this paper. To find an optimal value, γ was varied in the experiments, and γ = 0.7 appeared to be a reasonable and effective choice for many datasets. However, the researchers did not explain or justify that choice. In practice, γ should not be a fixed constant; it should be chosen appropriately for each dataset.
Third, the researchers did not give in-depth details, such as how the parameters were set in each tool. As a result, the experiments could not be repeated in the same way, and the results were shown only in graphs of relative error versus time (sec).
The strength of this study is that although the bootstrap method is widely known and used in much research, no previous work had reduced its computational time while keeping high accuracy on big data. The results of this study showed that BLB uses less computational time than the original bootstrap because the amount of data resampled in the bootstrap process is reduced.
Reference List
Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. I. (2012). The big data bootstrap. Proceedings of the 29th International Conference on Machine Learning (pp. 1759-1766). Edinburgh, Scotland: Omnipress.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1-26.
Politis, D., Romano, J., & Wolf, M. (1999). Subsampling. New York: Springer.
My comments on friends' blogs:
#1
http://suwatthikul.blogspot.com/2015/04/major-project.html?showComment=1429632701204#c6401955663896746197
#2
http://woratouch.blogspot.com/2015/04/draft-2-major-project.html?showComment=1429633631474#c9084318632999150175
#3
http://sujitratc.blogspot.com/2015/04/major-project-draft-1.html?showComment=1429633809559#c5705402777656864666
#4
http://edwardkrit.blogspot.com/2015/04/my-major-project.html?showComment=1429634244560#c7011589826254288019
#5
http://suphatka.blogspot.com/2015/04/major-project-draft-1.html?showComment=1429634839678#c5891116359063417312
Comments
#1: I'm not sure. I think "Efron (1979) shows that..." may be changed to "Efron (1979) showed that...".
Reply: Thank you.
#2: In my opinion, paragraph 3 may be changed to "All computing in...".
#3: Hi "P". In the phrase "All computations in the bootstrap process is used on RAM", shouldn't "is" be changed to "are"? And in the sentence "As a result, the experiments could not repeat in the same way", the experiments can't act by themselves, so I think you should use the passive voice, right? Thx
Reply: Thank you for your suggestions ^ ^