HDFS was born in 2006 to address the distributed storage/computing problem, and then Spark arrived in 2012, riding on cheap hardware, to tackle the distributed processing problem in disruptive style.
One problem space that has seen a lot of innovative tools is how to build information systems for huge data. Cheap hardware has made it possible to build fast analytics systems using in-memory techniques.
Data is growing every moment, and it is becoming very challenging for analytics tools to keep up with it. Some systems use column stores or special indexes such as bitmap indexes to stay responsive.
Another technique, summarization, is also used in many systems to solve this problem. In this blog I will discuss some ideas around summarization and the trade-offs of using it.
The definition of summarization as per vocabulary.com is:
To summarize means to sum up the main points of something — a summarization is this kind of summing up. Elementary school book reports are big on summarization.
When you're a trial lawyer, the last part of the argument you make before the court is called a summation. It included a summarization of the points you have made in the trial so far, but it goes one step further, to push this summarization toward a conclusion you hope the judge or jury will accept.
This picture explains it: you take something big and make it small by retaining the important attributes.
Another technique is sampling, which also reduces something unmanageable to a manageable size. It trades accuracy for this, but many systems can tolerate some percentage of error.
Sampling is also used to validate tests or assumptions, so it lets you consume information using limited resources.
Summarization and sampling can be used together to build fast information systems.
One simple summarization approach is to group the data into different sets, with each record/row belonging to exactly one set.
Let's take a toy data set to build groups.
This data shows what type of course people study. It can be split into groups by year, sex, and type of course.
1993,Male,Humanities & Social Science
1994,Female,Education
1994,Male,Law
1993,Male,Law
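As a rough illustration of grouping, here is a minimal Python sketch that counts the four toy rows above by different keys. The helper name group_counts and the tuple layout are my own assumptions for this example, not part of the app mentioned later.

```python
from collections import Counter

# The four toy rows from above: (year, sex, type of course).
rows = [
    (1993, "Male", "Humanities & Social Science"),
    (1994, "Female", "Education"),
    (1994, "Male", "Law"),
    (1993, "Male", "Law"),
]


def group_counts(rows, key):
    """Hypothetical helper: count rows per group, where key(row) names the group."""
    return Counter(key(row) for row in rows)


# Each row lands in exactly one group for a given key.
by_year = group_counts(rows, key=lambda r: r[0])
by_year_sex = group_counts(rows, key=lambda r: (r[0], r[1]))
by_course = group_counts(rows, key=lambda r: r[2])

for group, count in sorted(by_year_sex.items()):
    print(group, count)
```

The finer the key (year only vs. year and sex), the more groups you get and the less each group summarizes.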
Grouping converts a huge data set into a small set of groups, and sampling can then be done per group to pick items from each group.
One way is to take a random sample, but its limitation is that some groups may be missed. The sample will then not be representative of the original dataset, and many queries will be impossible to answer.
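To get a feel for how easily a plain random sample misses a small group, here is a small sketch that computes the chance of drawing no record at all from a group. The function name miss_probability is made up for this example, and the group of 10 records is a hypothetical size, not a number from the dataset.

```python
from math import comb


def miss_probability(total, group_size, sample_size):
    """Chance that a simple random sample (without replacement)
    contains no record from a group of the given size."""
    # Ways to pick the sample only from records outside the group,
    # divided by all possible ways to pick the sample.
    return comb(total - group_size, sample_size) / comb(total, sample_size)


# Hypothetical: a group with only 10 of 1530 records, sample of 77 records.
p = miss_probability(total=1530, group_size=10, sample_size=77)
print(f"chance of missing the small group: {p:.2f}")
```

Under these assumed numbers the chance of missing the group is roughly 0.6, so any query about that group would often have no data to work with.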
We need a smarter sampler, and statisticians have already figured it out: it is called stratified sampling.
Stratified sampling involves these steps:
- Create strata, which is nothing but creating groups.
- Decide the sample size.
Let's take a 5% sample, which comes to 77 records (5% of 1530 is 76.5, rounded up).
- Compute the contribution (weight) of each group.
- Select sample records.
77 records are required, and the breakdown of those 77 is based on the weight of each individual group.
The number of records selected from a group is based on its weight, so the sample contains data from every group.
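The steps above can be sketched in Python as follows. This is only a minimal sketch and not the code of the app. The function name stratified_sample, the largest-remainder rounding, and the synthetic group sizes (1000, 500, 30) are all assumptions I picked for illustration. Only the totals 1530 and 77 come from the text.

```python
import math
import random
from collections import defaultdict


def stratified_sample(records, key, sample_size, seed=0):
    """Proportional stratified sample: each stratum contributes by its weight."""
    # Step 1: create strata (groups).
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    # Step 3: ideal (fractional) share of the sample for each stratum.
    total = len(records)
    shares = {k: sample_size * len(v) / total for k, v in strata.items()}

    # Largest-remainder rounding so the allocations add up to sample_size.
    # Note: a very small stratum can still end up with 0 records.
    alloc = {k: math.floor(s) for k, s in shares.items()}
    leftover = sample_size - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda k: shares[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1

    # Step 4: pick records at random inside each stratum.
    rng = random.Random(seed)
    return {k: rng.sample(strata[k], alloc[k]) for k in strata}


# Synthetic data: 1530 records in three made-up groups.
records = [("A", i) for i in range(1000)]
records += [("B", i) for i in range(500)]
records += [("C", i) for i in range(30)]

# Step 2: sample size of 77 (5% of 1530, rounded up).
sample = stratified_sample(records, key=lambda r: r[0], sample_size=77)
for group, picked in sample.items():
    print(group, len(picked))
```

With these made-up group sizes, even the small group C gets a couple of records, which is exactly what a plain random sample cannot promise.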
With this approach 1530 records are reduced to 77. Let's try to answer some queries using this sample data. We will also estimate what the answer would look like at 100% and calculate the error.
1% value = sample answer / sample %
100% estimate = 1% value * 100
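The scaling and error calculation can be written as a small Python sketch. The function names are my own, and the numbers in the usage lines are purely illustrative, not results from the dataset.

```python
def estimate_full(sample_answer, sample_pct):
    """Scale an answer computed on the sample up to 100% of the data."""
    one_pct = sample_answer / sample_pct   # value of 1%
    return one_pct * 100                   # value at 100%


def percent_error(estimate, actual):
    """Error of the estimate relative to the true answer, in percent."""
    return abs(estimate - actual) * 100 / actual


# Illustrative numbers only: a query returns 10 on a 5% sample,
# and the true answer on the full data is assumed to be 206.
estimate = estimate_full(sample_answer=10, sample_pct=5)
error = percent_error(estimate, actual=206)
print(estimate, round(error, 2))
```

The actual answer is only needed to measure the error while evaluating the approach. At query time only the sample answer and the sample percentage are used.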
Look at the error column: the worst answer has a 3% error, and with this small error a query can be answered in milliseconds instead of seconds.
I used the Singapore open data set graduates-from-university-first-degree-courses-by-type-of-course to build an app that answers queries based on sampling.
The code is available on GitHub.