Wednesday, February 25, 2015

Minor Project


            My research question is whether the traditional bootstrap method can be adapted to classify streaming data. In many applications, such as face detection, the incoming data is so large that it may be impossible to store and classify all of it. Several methods have therefore been researched and developed to handle this problem.

            Researchers who have looked at this subject include Kleiner et al. and Wang et al. The former proposed a way to apply the bootstrap method to large-scale data, and the latter adapted Kleiner et al.'s study to the clustering problem.

         Kleiner et al. (2012) proposed the Bag of Little Bootstraps (BLB), which combines the original bootstrap method with a sub-sampling technique in order to reduce the computation in the bootstrap process. “BLB only requires repeated computation on small subsets of the original dataset and avoids the bootstrap’s problematic need for repeated computation of estimates on re-samples,” they said.
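The idea behind BLB can be illustrated with a rough sketch. This is my own simplified reading of the technique, not code from Kleiner et al.: the function names, the parameters (n_subsets, gamma, n_resamples) and the choice of the mean as the estimator are all assumptions made for illustration. Each small subset holds only b = n^gamma points, and each re-sample of nominal size n is represented by counts over those b points.

```python
import random
import statistics
from collections import Counter


def weighted_mean(values, counts):
    """Mean of `values` where each value is repeated `counts[i]` times."""
    total = sum(counts)
    return sum(v * c for v, c in zip(values, counts)) / total


def blb_standard_error(data, n_subsets=5, gamma=0.7, n_resamples=50, seed=0):
    """Sketch of a BLB-style estimate of the standard error of the mean.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    b = min(n, max(2, int(n ** gamma)))  # size of each small subset
    per_subset = []
    for _ in range(n_subsets):
        subset = rng.sample(data, b)  # small subset, drawn without replacement
        estimates = []
        for _ in range(n_resamples):
            # A re-sample of nominal size n over only b distinct points,
            # stored as counts. Drawing n indices is a simplification; a real
            # implementation would sample the multinomial counts directly.
            draws = Counter(rng.choices(range(b), k=n))
            counts = [draws.get(i, 0) for i in range(b)]
            estimates.append(weighted_mean(subset, counts))
        per_subset.append(statistics.stdev(estimates))
    # Average the quality assessments obtained from the subsets.
    return statistics.mean(per_subset)


data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(2000)]
se = blb_standard_error(data)
print(f"BLB-style standard error of the mean: {se:.4f}")
```

For standard normal data with n = 2000, the result should be close to 1/sqrt(2000), which is about 0.022, even though every computation touches only about 200 distinct points.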

            Wang et al. (2014) proposed Bag of Little Bootstraps Clustering (BLBC), which is inspired by the BLB technique and combines clustering with Kleiner et al.'s study. BLBC decreases the total computation needed to cluster massive (very large) data.

The debate centers on applying the bootstrap to very large datasets: both studies show that this is possible, but neither of them addresses streaming data. Massive data (or big data) and streaming data differ slightly in their details.

My work will be closer to Wang et al.'s because I would like to improve the original bootstrap method in order to classify streaming data. I will use BLBC's idea of how to insert a clustering method into the bootstrap as a starting point for a new approach to my classification problem.

Hopefully my contribution will be to ensure that my proposed method maintains statistical correctness and high accuracy in the classification problem.

Reference List (proceedings)

Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M.I. (2012). The big data
        bootstrap. Proceedings of the 29th International Conference
        on Machine Learning (pp. 1759-1766), Scotland: Omnipress.

Wang, H., Zhuang, F., Ao, X., He, Q., & Shi, Z. (2014). Scalable bootstrap
        clustering massive data. 2014 15th IEEE/ACIS International Conference
        on Software Engineering, Artificial Intelligence, Networking and
        Parallel/Distributed Computing (SNPD) (pp. 1-6), Las Vegas: CPS.


My comments on my friends' blogs:

#1  http://qquanta.blogspot.com/2015/02/minor-project-before-midterm.html?showComment=1424882247626#c3691086820542128134

#2  http://kanokudomsit.blogspot.com/2015/02/minor-project.html?showComment=1424882848921#c7303694604604936908

Tuesday, February 3, 2015

Assignment 2 : Writing an introduction

Bootstrap Method for Streaming Data

Perasut  Rungcharassang


   Stage 1 :    Typical statistical methods work with static data sets. A static data set has the following characteristics: the data are unchanged (they do not depend on time), the size of the data set is fixed (so it can be stored), the data follow a clear distribution (such as a normal or uniform distribution), and so on. The whole static data set is used to compute statistical values (mean, standard deviation, etc.). In recent years, however, the form of data sets has changed. Many applications need to work with non-static data sets. This type of data set is called a data stream or streaming data. Its properties are the opposite of those of a static data set.

  Stage 2 :    Efron (1979) introduced the bootstrap method, a statistical tool for estimating statistical values. The bootstrap is a very simple method for estimating the sampling distribution of a statistic computed from sample data. It generates many re-samples by sampling the original training data with replacement in order to approximate that sampling distribution. The bootstrap method is applied when little statistical information about the data set is known, when only a small amount of data is available, or when standard methods cannot be applied. It has been used in several problems, such as signal processing (Zoubir & Boashash, 1998; Zoubir & Iskander, 2007) and the class imbalance problem (Thanathamathee & Lursinsap, 2013).
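The re-sampling step described above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the function name, the default of 1000 re-samples and the small example data are hypothetical, and the statistic here is simply the mean.

```python
import random
import statistics


def bootstrap_standard_error(data, statistic=statistics.mean,
                             n_resamples=1000, seed=0):
    """Estimate the standard error of `statistic` by re-sampling `data`
    with replacement (names and defaults are illustrative assumptions)."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # One re-sample: n points drawn from the original data with replacement.
        resample = rng.choices(data, k=n)
        estimates.append(statistic(resample))
    # The spread of the re-sampled estimates approximates the sampling
    # variability of the statistic.
    return statistics.stdev(estimates)


sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
se = bootstrap_standard_error(sample)
print(f"bootstrap standard error of the mean: {se:.3f}")
```

Note that every re-sample has the same size as the original data, which is why this procedure needs the whole data set to be available at once.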

  Stage 3 :    The original bootstrap method needs the whole data set in order to generate many re-samples. However, when the data set is huge, this takes more time to compute and more storage. Since the data set of interest in this paper is streaming data, the original bootstrap method cannot be applied to the whole stream.

 Stage 4&5 :  The purpose of this paper is to improve the original bootstrap method so that it can be applied to classifying streaming data.

     

My comments on my friends' blogs:


#1 
http://sornjarodoonsiri.blogspot.com/2015/01/introduction.html?showComment=1423053397044#c1028524126658686890
#2 
http://suwatthikul.blogspot.com/2015/02/assignment2-writing-introduction.html?showComment=1423055970697#c6274032821322178279