As a continuation of the big data query system blog post, I want to share more techniques for building an analytics engine.
Take a problem where you have to build a system that will be used for analyzing customer data at scale.
What options are available to solve this problem?
- Load the data into your favorite database with the right indexes.
This works when the data is small; by small I mean less than 1 TB, or even less.
- Another option is to use something like Elasticsearch.
Elasticsearch works, but it comes with the overhead of managing another cluster and shipping data into it.
- Use Spark SQL or Presto.
Using these for interactive queries is tricky, because the minimum overhead required just to execute a query can exceed the query's latency budget, which might be 1 or 2 seconds.
- Use a distributed in-memory database.
This looks like a good option, but it has issues too: many solutions are proprietary, and the open source ones carry cluster-management overhead similar to Elasticsearch.
- Use Spark SQL with the job-start overhead removed.
I will deep dive into this option. Spark has become the number one choice for building ETL pipelines because of its simplicity and big community support, and Spark SQL can connect to a wide range of data sources (JDBC, Hive, ORC, JSON, Avro, etc.).
Analytics queries generate a different type of load: they typically need only a few columns from the whole set and run some aggregate function over them, so a column-oriented store is a good fit. For example, a query such as SELECT region, AVG(amount) FROM customers GROUP BY region reads only two columns out of possibly hundreds.
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
So the data can be converted to Parquet using Spark, and Spark SQL can then be used on top of it to answer analytics queries.
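As a rough sketch of that conversion step (the HDFS paths and the JSON input format are assumptions for illustration, not from the original post):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetConverter {
    public static void main(String[] args) {
        // A one-off batch job: read the raw HDFS data and rewrite it as Parquet.
        SparkSession spark = SparkSession.builder()
                .appName("parquet-converter")
                .getOrCreate();

        // JSON is only an example input format; substitute csv/orc/etc. as needed.
        Dataset<Row> raw = spark.read().json("hdfs:///data/customers/raw");
        raw.write().mode("overwrite").parquet("hdfs:///data/customers/parquet");

        spark.stop();
    }
}
```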
To put it all in context: convert the HDFS data to Parquet (i.e. a column store), then build a microservice that opens a SparkSession, pins the data in memory, and keeps the session open forever, just like a database connection pool.
The connection pool is a more-than-a-decade-old trick, and the same idea can be applied to the Spark session to build an analytics engine.
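A sketch of the "pin data in memory" warm-up step at service startup (the path and the customers view name are hypothetical):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class WarmUp {

    private WarmUp() {}

    // Called once at service startup: register the Parquet data as a view
    // and pin it in memory so later queries skip the disk read entirely.
    public static void pinData(SparkSession spark) {
        Dataset<Row> customers = spark.read()
                .parquet("hdfs:///data/customers/parquet"); // hypothetical path

        // A *global* temp view stays visible to the per-thread sessions
        // created later via newSession(), under the global_temp namespace.
        customers.createOrReplaceGlobalTempView("customers");

        // cache() registers the plan in the shared cache manager; count()
        // forces materialization now rather than on the first user query.
        customers.cache();
        customers.count();
    }
}
```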
A high-level diagram of how this looks:
SparkSession is thread-safe, so there is no need to add any locks or synchronization.
Depending on the use case, a single SparkContext can serve one or many SparkSessions in the same JVM.
Spark 2.x has a simple API for creating a singleton SparkContext, and it also supports per-thread SparkSessions.
A code snippet for creating the Spark session:
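The post's original snippet lives in the repository linked at the end; below is a minimal sketch, assuming Spark 2.x, of the singleton-context-plus-per-thread-session pattern just described (the class and method names are mine):

```java
import org.apache.spark.sql.SparkSession;

public final class SparkSessionFactory {

    private SparkSessionFactory() {}

    // getOrCreate() returns the JVM's existing session (and SparkContext)
    // if one is already running, which gives singleton behaviour for free.
    public static SparkSession base() {
        return SparkSession.builder()
                .appName("analytics-engine")
                .getOrCreate();
    }

    // newSession() shares the SparkContext and cached data, but gives the
    // calling thread its own SQL configuration and temp-view namespace.
    public static SparkSession forRequest() {
        return base().newSession();
    }
}
```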
Caution
All of this works fine if the microservice runs on a single machine, but if the microservice is load balanced then each instance will have its own context.
If a single Spark context requests thousands of cores, then some strategy is needed to load balance Spark context creation. This is the same issue as with database pools: you can only request resources that are physically available.
Another thing to remember is that the driver is now running inside the web container, so allocate enough memory to the process that the web server does not blow up with an out-of-memory error.
I have created a microservice application using Spring Boot that hosts the SparkSession via a REST API. The code supports two types of query (a minimal controller sketch follows the list):
- Single query per HTTP thread
- Multiple queries per HTTP thread. This model is very powerful and can be used to answer complex queries.
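The real controllers live in the repository linked below; as a rough sketch of the single-query-per-HTTP-thread model, reusing the hypothetical SparkSessionFactory from the earlier snippet:

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class QueryController {

    // "Single query per HTTP thread": each request gets an isolated
    // SparkSession that shares the one SparkContext and the cached data.
    @GetMapping("/query")
    public List<String> runQuery(@RequestParam("sql") String sql) {
        Dataset<Row> result = SparkSessionFactory.forRequest().sql(sql);
        // Collecting as JSON is fine for the small result sets that
        // aggregate queries typically return.
        return result.toJSON().collectAsList();
    }
}
```

A request against the warmed-up view could then look like GET /query?sql=SELECT region, AVG(amount) FROM global_temp.customers GROUP BY region (the column names are made up for illustration).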
The code is available on GitHub @ sparkmicroservices