Friday, April 24, 2015

Major Project (2nd draft)

The big data bootstrap

Kleiner et al. (2012) proposed a method that extends the original bootstrap (Efron, 1979) to very large datasets, commonly called big data. Kleiner's method, called the Bag of Little Bootstraps (BLB), combines the original bootstrap with the subsampling technique (Politis et al., 1999) to reduce the computation in the bootstrap process. As a result, BLB requires computation only on small subsets of the data, much less than the original bootstrap. Efron (1979) showed that each bootstrap resample contains approximately 0.632n distinct data points, which is still very large when n is large, while each BLB resample contains far fewer than 0.632n distinct points. The size of each BLB subsample may be chosen as b = n^γ, where γ ∈ [0.5, 1]. The paper compared BLB and the bootstrap on simulated datasets drawn from several generating distributions (Bernoulli, Normal, and Gamma) and on real datasets from the UCI repository. Two different settings were considered: regression and classification, where n was set to 20,000. The results showed that BLB normally requires less time than the bootstrap while achieving the same high accuracy. Because each bootstrap resample recalls about 63% of the distinct data points, a very large dataset may overflow the computer's memory during each resampling computation. In contrast, BLB reduces the amount of data handled in each computation. For example, if the dataset is 1 TB, each bootstrap resample contains approximately 632 GB of distinct data, but each BLB subsample (with γ = 0.6) occupies only about 4 GB. The researchers also suggested that BLB is well suited to parallel computing architectures.
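To make the procedure concrete, the following is a minimal sketch of BLB in Python, based only on the summary above; it is my own illustration, not the authors' code, and the estimator (a weighted sample mean) and all parameter defaults are assumptions chosen for demonstration.

    import numpy as np

    rng = np.random.default_rng(42)

    def blb_standard_error(data, gamma=0.7, n_subsets=10, n_resamples=50):
        # Bag of Little Bootstraps sketch: each subset holds only
        # b = n**gamma distinct points; full-size resamples are simulated
        # with multinomial counts, so n points are never materialized.
        n = len(data)
        b = int(n ** gamma)                     # subset size, gamma in [0.5, 1]
        subset_results = []
        for _ in range(n_subsets):
            subset = rng.choice(data, size=b, replace=False)  # subsample
            estimates = []
            for _ in range(n_resamples):
                # How many times each of the b points appears in a
                # simulated resample of the full size n.
                counts = rng.multinomial(n, np.full(b, 1.0 / b))
                estimates.append(np.dot(counts, subset) / n)  # weighted mean
            subset_results.append(np.std(estimates))  # per-subset quality
        return np.mean(subset_results)                # the "bag": average

    data = rng.normal(size=20_000)                    # n = 20,000 as in the paper
    print("BLB standard error of the mean:", blb_standard_error(data))

Note that each multinomial draw stores only b counts rather than n data points, which is exactly where the memory saving described above comes from.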

This study provides both experimental results and a theoretical investigation, including a study of BLB's statistical performance. However, there are some limitations.

All computing in the bootstrap process is done in RAM; however, a very large original dataset has to be stored on a computer's hard drive. The researchers may therefore face problems when the dataset is much larger than the ones used in the experiments.

The parameter γ is an important factor in this paper. To find an optimal value, γ was varied in the experiments, and the results suggest that γ = 0.7 is a reasonable and effective choice for many datasets. However, the researchers did not explain or prove that claim. In the real world, the parameter γ should not be constant; it should be chosen appropriately for each dataset.
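To see concretely how sensitive the subset size is to γ, here is a small calculation of my own (assuming the 1 TB dataset from the example above holds n = 1,000,000 records of 1 MB each, which is an assumption, not a figure from the paper):

    # Sketch (my own assumptions): subsample size b = n**gamma versus gamma,
    # for a 1 TB dataset assumed to contain n = 1,000,000 records.
    n = 1_000_000
    dataset_gb = 1_000

    for gamma in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
        b = n ** gamma                     # records per BLB subsample
        subset_gb = dataset_gb * b / n     # storage for one subsample
        print(f"gamma = {gamma:.1f}:  b = {b:>9,.0f} records  (~{subset_gb:,.1f} GB)")

    # For comparison, a full bootstrap resample holds about 0.632 * n
    # distinct records, i.e. roughly 632 GB of the 1 TB dataset.

Under these assumptions, γ = 0.6 gives about 4 GB per subsample while γ = 0.7 gives about 16 GB, so the choice of γ directly trades memory against how much of the data each subset sees, which supports the point that it should be tuned per dataset.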

The researchers did not give in-depth details, such as how the parameters of each tool were set. As a result, the experiments could not be repeated in the same way. Moreover, the results were shown only as graphs of relative error versus time (seconds).


The strength of this study is that, although the bootstrap method is widely known and used in many research fields, no previous researchers had managed to reduce its computational time while keeping high accuracy on big data. The results of this study showed that BLB needs less computational time than the original bootstrap because the amount of data processed in each resample is reduced.
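As a rough check on this claim, here is a toy timing comparison at the paper's n = 20,000; again this is my own sketch rather than the paper's experiment, and the replicate count and mean estimator are assumptions:

    import time
    import numpy as np

    rng = np.random.default_rng(1)
    n = 20_000
    data = rng.normal(size=n)
    b = int(n ** 0.7)                        # ~1,000 distinct points per subset
    subset = rng.choice(data, size=b, replace=False)

    t0 = time.perf_counter()
    for _ in range(1_000):                   # bootstrap: draw n points each time
        rng.choice(data, size=n, replace=True).mean()
    t_boot = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(1_000):                   # BLB: draw only b counts instead
        counts = rng.multinomial(n, np.full(b, 1.0 / b))
        np.dot(counts, subset) / n
    t_blb = time.perf_counter() - t0

    print(f"bootstrap replicates: {t_boot:.2f} s, BLB replicates: {t_blb:.2f} s")

On a typical machine the BLB loop should run faster, since each replicate touches b rather than n values, and the gap widens as n grows.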

Reference List

Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. I. (2012). The big data bootstrap. In Proceedings of the 29th International Conference on Machine Learning (pp. 1759-1766). Edinburgh, Scotland: Omnipress.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1-26.

Politis, D. N., Romano, J. P., & Wolf, M. (1999). Subsampling. New York: Springer.


Note:
I changed some tenses following my friends' suggestions. For example, Ajaree (suwatthikul.blogspot) suggested that "Efron (1979) shows that …" in line 6 of paragraph 1 should be in the past tense.
In paragraph 5, Edward (edwardkrit.blogspot) pointed out other mistakes. His first suggestion concerned the sentence "As a result, the experiments could not repeat in the same way": he thought "the experiments" cannot act by themselves, so he suggested that I use the passive voice. I agree with him, so I changed it to "As a result, the experiments could not be repeated in the same way."
In paragraph 3, Edward's second suggestion concerned the phrase "All computations in the bootstrap process is used on RAM": "is" should be changed to "are". Quanta (qquanta.blogspot) also had a suggestion for the same sentence: she told me that "All computations in …" could be changed to "All computing in …".
I also edited line 11 of paragraph 1 myself, from "Two different setting were considered: regression and classification" to "Two different settings were considered: Regression and classification".


My comments on friends' blogs:
#1
http://suwatthikul.blogspot.com/2015/04/major-project.html?showComment=1429632701204#c6401955663896746197
#2
http://woratouch.blogspot.com/2015/04/draft-2-major-project.html?showComment=1429633631474#c9084318632999150175
#3
http://sujitratc.blogspot.com/2015/04/major-project-draft-1.html?showComment=1429633809559#c5705402777656864666
#4
http://edwardkrit.blogspot.com/2015/04/my-major-project.html?showComment=1429634244560#c7011589826254288019
#5
http://suphatka.blogspot.com/2015/04/major-project-draft-1.html?showComment=1429634839678#c5891116359063417312