Friday, April 24, 2015

Major Project (2nd draft)

                          The big data bootstrap

Kleiner et al. (2012) proposed a method that extends the original bootstrap (Efron, 1979) to very large datasets, often called big data. Kleiner's method, the Bag of Little Bootstraps (BLB), combines the original bootstrap with the subsampling technique (Politis et al., 1999) to reduce the computation required in the bootstrap process. As a result, BLB only requires computation on small subsets of the data, far less than the original bootstrap. Efron (1979) showed that each bootstrap resample contains approximately 0.632n distinct data points (0.632 ≈ 1 − 1/e), which is very large if n is large, while each BLB resample contains far fewer than 0.632n distinct points. Each BLB subsample may be chosen to have size b = n^γ, where γ ∈ [0.5, 1]. The paper compared BLB with the bootstrap on simulated datasets drawn from several generating distributions (Bernoulli, Normal, and Gamma) and on real datasets from the UCI repository. Two settings were considered, regression and classification, with n set to 20,000. The results showed that BLB normally requires less time than the bootstrap while achieving the same high accuracy. Because each bootstrap resample contains about 63% of the original data points, the resamples may overflow a computer's memory when n is very large. In contrast, BLB reduces the amount of data handled in each computation. For example, if the dataset occupies 1 TB, each bootstrap resample contains approximately 632 GB of data, whereas each BLB subsample (with γ = 0.6) occupies only about 4 GB. The researchers also suggested that BLB can exploit parallel computing architectures.
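To make the procedure concrete, here is a minimal Python sketch of the BLB idea applied to estimating the standard error of a mean. The function and parameter names are my own, and this is only an illustration of the description above, not Kleiner et al.'s implementation:

```python
import numpy as np

def blb_stderr(data, gamma=0.7, n_subsamples=10, n_resamples=50, rng=None):
    """Bag of Little Bootstraps estimate of the standard error of the mean.

    Draw small subsamples of size b = n**gamma without replacement, then
    simulate full-size (n) resamples from each subsample using multinomial
    counts, so memory stays O(b) rather than O(n) per resample.
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    b = int(n ** gamma)  # subsample size, gamma in [0.5, 1]
    estimates = []
    for _ in range(n_subsamples):
        subsample = rng.choice(data, size=b, replace=False)
        means = []
        for _ in range(n_resamples):
            # A resample of size n is represented by how many times each of
            # the b subsample points appears, not by n stored values.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            means.append(np.dot(counts, subsample) / n)
        estimates.append(np.std(means, ddof=1))
    # Average the per-subsample estimates, as BLB prescribes.
    return float(np.mean(estimates))

# Example with n = 20,000, as in the paper's experiments
data = np.random.default_rng(0).gamma(shape=2.0, scale=1.0, size=20_000)
print(blb_stderr(data, gamma=0.7))
```

The essential saving is visible in the inner loop: each simulated resample of size n is stored as b multinomial counts rather than as n data points.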

This study provides the results of experiments and a theoretical investigation, including a study of the method's statistical performance. However, there are some limitations.

All computation in the bootstrap process is carried out in RAM, whereas a very large original dataset must be stored on a computer's hard drive. The method may therefore run into problems when the dataset is much larger than those used in the experiments.

The parameter γ is an important factor in this paper. To find an optimal value, γ was varied in the experiments, and γ = 0.7 appears to be a reasonable and effective choice on many datasets. However, the researchers did not explain or prove why this should be so. In practice, γ should not be a fixed constant; it should be chosen appropriately for each dataset.
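For a sense of scale (my own arithmetic for the paper's n = 20,000, not numbers reported in the experiments), the subsample size b = n^γ varies sharply with γ:

```python
# Subsample size b = n**gamma for the paper's experimental n.
n = 20_000
for gamma in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(f"gamma = {gamma:.1f}  ->  b = {int(n ** gamma):,}")
# gamma = 0.5 gives b ≈ 141; gamma = 0.7 gives b ≈ 1,024; gamma = 1.0 gives b = 20,000.
```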

The researchers did not give in-depth details, such as how the parameters of each tool were set. As a result, the experiments cannot be reproduced exactly, and the results were shown only as graphs of relative error versus time (in seconds).


The strength of this study is that although the bootstrap method is widely used in research, no previous work had reduced its computational time while retaining high accuracy on big data. The results showed that BLB requires less computation time than the original bootstrap because the amount of data resampled in each step is reduced.

Reference List

Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. I. (2012). The big data bootstrap. Proceedings of the 29th International Conference on Machine Learning (pp. 1759-1766). Edinburgh, Scotland: Omnipress.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1-26.


Politis, D. N., Romano, J. P., & Wolf, M. (1999). Subsampling. New York: Springer.


Note:
I changed some tenses following my friends' suggestions. For example, Ajaree (suwatthikul.blogspot) suggested that "Efron (1979) shows that …" in line 6 of paragraph 1 should be in the past tense.
In paragraph 5, Edward (edwardkrit.blogspot) pointed out other mistakes. His first suggestion concerned the sentence "As a result, the experiments could not repeat in the same way": he thought that "the experiments" cannot act by themselves, so he suggested that I use the passive voice. I agree with him, so I changed it to "As a result, the experiments could not be repeated in the same way."
In paragraph 3, Edward's second suggestion concerned the phrase "All computations in the bootstrap process is used on RAM", where "is" should be changed to "are". Quanta (qquanta.blogspot) made a suggestion about the same sentence: she told me that "All computations in …" could be changed to "All computing in …".
I also edited line 11 of paragraph 1 myself, changing "Two different setting were considered: regression and classification" to "Two different settings were considered: Regression and classification".

Monday, April 20, 2015

Major Project (1st draft)

The big data bootstrap

Kleiner et al. (2012) proposed a method that extends the original bootstrap (Efron, 1979) to very large datasets, often called big data. Kleiner's method, the Bag of Little Bootstraps (BLB), combines the original bootstrap with the subsampling technique (Politis et al., 1999) to reduce the computation required in the bootstrap process. As a result, BLB only requires computation on small subsets of the data, far less than the original bootstrap. Efron (1979) showed that each bootstrap resample contains approximately 0.632n distinct data points, which is very large if n is large, while each BLB resample contains far fewer than 0.632n distinct points. Each BLB subsample may be chosen to have size b = n^γ, where γ ∈ [0.5, 1]. The paper compared BLB with the bootstrap on simulated datasets drawn from several generating distributions (Bernoulli, Normal, and Gamma) and on real datasets from the UCI repository. Two settings were considered, regression and classification, with n set to 20,000. The results showed that BLB normally requires less time than the bootstrap while achieving the same high accuracy. Because each bootstrap resample contains about 63% of the original data points, the resamples may overflow a computer's memory when n is very large. In contrast, BLB reduces the amount of data handled in each computation. For example, if the dataset occupies 1 TB, each bootstrap resample contains approximately 632 GB of data, whereas each BLB subsample (with γ = 0.6) occupies only about 4 GB. The researchers also suggested that BLB can exploit parallel computing architectures.

This study provides the results of experiments and a theoretical investigation, including a study of the method's statistical performance. However, there are some limitations.

All computation in the bootstrap process is carried out in RAM, whereas a very large original dataset must be stored on a computer's hard drive. The method may therefore run into problems when the dataset is much larger than those used in the experiments.

The parameter γ is an important factor in this paper. To find an optimal value, γ was varied in the experiments, and γ = 0.7 appears to be a reasonable and effective choice on many datasets. However, the researchers did not explain or prove why this should be so. In practice, γ should not be a fixed constant; it should be chosen appropriately for each dataset.

The researchers did not give in-depth details, such as how the parameters of each tool were set. As a result, the experiments cannot be reproduced exactly, and the results were shown only as graphs of relative error versus time (in seconds).

The strength of this study is that although the bootstrap method is widely used in research, no previous work had reduced its computational time while retaining high accuracy on big data. The results showed that BLB requires less computation time than the original bootstrap because the amount of data resampled in each step is reduced.

Reference List

Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. I. (2012). The big data bootstrap. Proceedings of the 29th International Conference on Machine Learning (pp. 1759-1766). Edinburgh, Scotland: Omnipress.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1-26.


Politis, D. N., Romano, J. P., & Wolf, M. (1999). Subsampling. New York: Springer.




My comments on friends' blogs.
#1
http://suwatthikul.blogspot.com/2015/04/major-project.html?showComment=1429632701204#c6401955663896746197
#2
http://woratouch.blogspot.com/2015/04/draft-2-major-project.html?showComment=1429633631474#c9084318632999150175
#3
http://sujitratc.blogspot.com/2015/04/major-project-draft-1.html?showComment=1429633809559#c5705402777656864666
#4
http://edwardkrit.blogspot.com/2015/04/my-major-project.html?showComment=1429634244560#c7011589826254288019
#5
http://suphatka.blogspot.com/2015/04/major-project-draft-1.html?showComment=1429634839678#c5891116359063417312


Wednesday, February 25, 2015

Minor Project



My research question is whether the traditional bootstrap method can be adapted to classify streaming data. In many applications, such as face detection, the incoming data are so large that it may be impossible to store and classify the whole dataset. Therefore, several methods have been researched and developed to handle these problems.

Researchers who have looked at this subject include Kleiner et al. and Wang et al. The former proposed how to apply the bootstrap method to large-scale data, and the latter adapted Kleiner's approach to the clustering problem.

Kleiner et al. (2012) proposed the Bag of Little Bootstraps (BLB), which combines the original bootstrap method with the subsampling technique in order to reduce computation in the bootstrap process. "BLB only requires repeated computation on small subsets of the original dataset and avoids the bootstrap’s problematic need for repeated computation of estimates on re-samples," they wrote.

Wang et al. (2014) proposed Bag of Little Bootstraps Clustering (BLBC), which combines clustering with Kleiner's approach. Their study was inspired by the BLB technique. BLBC decreases the total computation required for clustering massive (very large) data.

The debate on this issue shows that the bootstrap can be applied to very large datasets, but neither study considers streaming data. Massive data (or big data) and streaming data differ in an important detail: a stream arrives continuously and cannot be stored in full before processing.

My work will be closer to Wang's because I would like to improve the original bootstrap method in order to classify streaming data. I will use BLBC's idea of inserting a clustering method into the bootstrap as the starting point for a new approach to my classification problem.

Hopefully, my contribution will be a proposed method that keeps statistical correctness and achieves high accuracy on classification problems.
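As a concrete starting point for the streaming setting, the sketch below uses the well-known Poisson approximation to bootstrap resampling (the idea behind Oza and Russell's online bagging), in which each arriving datum receives an independent Poisson(1) weight in every resample. This is my own illustration, not a method from the studies cited above:

```python
import numpy as np

def streaming_bootstrap_se(stream, n_resamples=100, rng=None):
    """Approximate bootstrap standard error of the mean over a data stream.

    Sampling with replacement needs the whole dataset, which a stream does
    not allow. Instead, each incoming datum is counted Poisson(1) times in
    each resample; for large n this approximates the multinomial bootstrap.
    Memory is O(n_resamples), independent of the stream length.
    """
    rng = np.random.default_rng(rng)
    sums = np.zeros(n_resamples)
    counts = np.zeros(n_resamples)
    for x in stream:
        w = rng.poisson(1.0, size=n_resamples)  # weight of x in each resample
        sums += w * x
        counts += w
    means = sums / np.maximum(counts, 1)  # guard against an empty resample
    return float(np.std(means, ddof=1))

# Example with a simulated stream of 50,000 values
stream = iter(np.random.default_rng(1).normal(size=50_000))
print(streaming_bootstrap_se(stream))
```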

Reference List (proceedings)

Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. I. (2012). The big data bootstrap. Proceedings of the 29th International Conference on Machine Learning (pp. 1759-1766). Edinburgh, Scotland: Omnipress.

Wang, H., Zhuang, F., Ao, X., He, Q., & Shi, Z. (2014). Scalable bootstrap clustering for massive data. Proceedings of the 15th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD) (pp. 1-6). Las Vegas: CPS.


My comments on my friends' blogs:

#1  http://qquanta.blogspot.com/2015/02/minor-project-before-midterm.html?showComment=1424882247626#c3691086820542128134

#2 http://kanokudomsit.blogspot.com/2015/02/minor-project.html?showComment=1424882848921#c7303694604604936908

Tuesday, February 3, 2015

Assignment 2 : Writing an introduction

Bootstrap Method for Streaming Data

Perasut Rungcharassang


Stage 1: Typical statistical methods work with static datasets. A static dataset can be characterized as follows: the data do not change over time, the size of the dataset is fixed (so it can be stored), the dataset has a clear distribution (such as a normal or uniform distribution), and so on. The whole static dataset is processed in order to obtain statistical values (mean, standard deviation, etc.). In recent years, however, the format of data has changed: many applications need to work with non-static datasets. A non-static dataset of this type is called a data stream, or streaming data, and its properties are the opposite of those of a static dataset.

Stage 2: Efron (1979) introduced the bootstrap method, a statistical tool for estimating statistical quantities. The bootstrap is a very simple method for estimating the sampling distribution of a sample statistic: it generates many resamples by sampling the original data with replacement to represent that sampling distribution. The bootstrap is applied when little statistical information about the dataset is known, when only a small amount of data is available, or when standard methods cannot be applied. It has been used to handle several problems, such as signal processing (Zoubir & Boashash, 1998; Zoubir & Iskander, 2007) and the class imbalance problem (Thanathamathee & Lursinsap, 2013).
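A minimal sketch of Efron's resampling idea, estimating the standard error of a sample mean (the function and variable names here are my own):

```python
import numpy as np

def bootstrap_se(data, n_resamples=1000, rng=None):
    """Efron's bootstrap: resample the data with replacement many times and
    take the spread of the resample means as the standard-error estimate."""
    rng = np.random.default_rng(rng)
    means = [rng.choice(data, size=len(data), replace=True).mean()
             for _ in range(n_resamples)]
    return float(np.std(means, ddof=1))

data = np.random.default_rng(2).normal(loc=5.0, scale=2.0, size=200)
print(bootstrap_se(data))  # close to the theoretical 2 / sqrt(200) ≈ 0.141
```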

Stage 3: The original bootstrap method needs the whole dataset in order to generate its many resamples. When the dataset is huge, this takes more time to compute and more storage to hold. Since the data of interest in this paper are streaming data, the original bootstrap method cannot be applied to the whole stream.

Stage 4&5: The purpose of this paper is to improve the original bootstrap method so that it can be applied to classifying streaming data.

     

My comments on my friends' blogs:


#1 
http://sornjarodoonsiri.blogspot.com/2015/01/introduction.html?showComment=1423053397044#c1028524126658686890
#2 
http://suwatthikul.blogspot.com/2015/02/assignment2-writing-introduction.html?showComment=1423055970697#c6274032821322178279

                    

Saturday, January 24, 2015

Citation assignment 1


A Very Fast Neural Learning for Classification
Using Only New Incoming Datum

 Abstract -- This paper proposes a very fast 1-pass-throw-away learning algorithm based on a hyperellipsoidal function that can be translated and rotated to cover the data set during learning process. The translation and rotation of hyperellipsoidal function depends upon the distribution of the data set. In addition, we present versatile elliptic basis function (VEBF) neural network with one hidden layer. The hidden layer is adaptively divided into subhidden layers according to the number of classes of the training data set. Each subhidden layer can be scaled by incrementing a new node to learn new samples during training process. The learning time is O(n), where n is the number of data. The network can independently learn any new incoming datum without involving the previously learned data. There is no need to store all the data in order to mix with the new incoming data during the learning process.

Reference

Jaiyen, S., Lursinsap, C., & Phimoltares, S. (2010). A very fast neural learning for classification using only new incoming datum. IEEE Transactions on Neural Networks, 21(3), 381-392.

Results/Findings
          
1. A Versatile Elliptic Basis Function (VEBF) neural network with one hidden layer, which is a new method for classification problems, is proposed.

2. The learning time of this method is O(n), where n is the number of data points.

3. There is no need to store the whole of the previous data in order to mix it with the new incoming data during the learning process.

         
Citations

1. According to Jaiyen et al. (2010), the versatile elliptic basis function can learn the data set very fast and save data storage during the learning process (p. 381).

2. Jaiyen et al. (2010) present that "[the versatile elliptic basis function does not] need to store all the data in order to mix with the new incoming data during the learning process" (p. 381).