Saturday 26 May 2018

Spark Microservices

As a continuation of the big data query system blog post, I want to share more techniques for building an analytics engine.

Take a problem where you have to build a system that will be used for analyzing customer data at scale.

What options are available to solve this problem?
 - Load the data into your favorite database and add the right indexes.
   This works when the data is small; by small I mean less than 1 TB.

 - Use something like Elasticsearch.
   Elasticsearch works, but it comes with the overhead of managing another cluster and shipping data to it.

 - Use Spark SQL or Presto.
   Using these for interactive queries is tricky because the minimum overhead of executing a query can exceed the latency the query is allowed, which could be 1 or 2 seconds.

 - Use a distributed in-memory database.
   This looks like a good option, but it has issues too: many solutions are proprietary, and the open source ones carry overhead similar to Elasticsearch.

 - Use Spark SQL, but remove the job start overhead.
   I will dive deep into this option. Spark has become the number one choice for building ETL pipelines because of its simplicity and big community support, and Spark SQL can connect to many data sources (JDBC, Hive, ORC, JSON, Avro, etc.).

Analytics queries generate a different type of load: they need only a few columns from the whole set and run some aggregate function over them, so a column-based store is a good choice for analytics queries.
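To make the column-based idea concrete, here is a toy sketch in plain Java. It is not how Parquet or any real engine is implemented; the class name `ColumnStoreSketch` and the sample column names are made up for illustration. Each column is kept in its own array, so an aggregate over one column never reads the others.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy column store (hypothetical, for illustration only): each column lives
// in its own array, so an aggregate over one column skips all other columns.
final class ColumnStoreSketch {
    private final Map<String, double[]> columns = new LinkedHashMap<>();

    // Stores a defensive copy of the values under the given column name.
    void addColumn(String name, double[] values) {
        columns.put(name, values.clone());
    }

    // Sums a single column; only that column's array is scanned.
    double sum(String name) {
        double[] values = columns.get(name);
        if (values == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        ColumnStoreSketch store = new ColumnStoreSketch();
        // Made-up customer data: three rows, two numeric columns.
        store.addColumn("order_amount", new double[] {10.0, 20.0, 12.5});
        store.addColumn("customer_age", new double[] {31, 45, 27});
        // Only the order_amount array is read to answer this aggregate.
        System.out.println(store.sum("order_amount"));
    }
}
```

In a row-based layout the same sum would have to step over every field of every row, which is the cost a columnar format avoids.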

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
So Spark can be used to convert the data to Parquet, and then Spark SQL can be used on top of it to answer analytics queries.

To put it all in context: convert the HDFS data to Parquet (i.e. a column store), then have a microservice that opens a SparkSession, pins the data in memory, and keeps the Spark session open forever, just like a database connection pool.

Connection pooling is a trick more than a decade old, and it can be applied to the Spark session to build an analytics engine.

[Diagram: high-level view of how this looks]

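SparkSession is thread safe, so there is no need to add any locks or synchronization.
Depending on the use case, one or multiple Spark sessions can be created in a single JVM; they share one SparkContext, since Spark supports only one active SparkContext per JVM.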

Spark 2.x has a simple API that creates a singleton SparkContext and also handles thread-based SparkSession.
Code snippet for creating the Spark session:
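The sketch below is not the code from the repository. It shows the reuse pattern with a stand-in type so that it runs without Spark on the classpath. `SharedSessionHolder` is a hypothetical name; in a real service the type parameter would be `SparkSession` and the factory would call `SparkSession.builder().appName(...).getOrCreate()`, which already returns an existing session when one is available.

```java
import java.util.Objects;
import java.util.function.Supplier;

// Lazily creates one shared instance and hands the same one to every caller.
// T is a stand-in (assumption): in the service it would be
// org.apache.spark.sql.SparkSession, built via
// SparkSession.builder().appName(...).getOrCreate().
final class SharedSessionHolder<T> {
    private final Supplier<T> factory;
    private volatile T session;

    SharedSessionHolder(Supplier<T> factory) {
        this.factory = Objects.requireNonNull(factory);
    }

    // Double-checked locking: the factory runs at most once, even when the
    // first calls arrive concurrently; later calls are a plain volatile read.
    T get() {
        T current = session;
        if (current == null) {
            synchronized (this) {
                current = session;
                if (current == null) {
                    current = Objects.requireNonNull(factory.get());
                    session = current;
                }
            }
        }
        return current;
    }

    public static void main(String[] args) {
        SharedSessionHolder<StringBuilder> holder =
            new SharedSessionHolder<>(() -> new StringBuilder("session"));
        // Both calls return the same object, like a pooled connection.
        System.out.println(holder.get() == holder.get());
    }
}
```

Every HTTP request handler would call `get()` and run its query on the shared session, so the session start cost is paid once instead of per request.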

All this works fine if the microservice runs on a single machine, but if the microservice is load balanced then each instance will have its own context.
If a single Spark context requests thousands of cores, then some strategy is required to balance resources across the Spark contexts that get created. This is the same issue as with a database pool: you can only request resources that are physically available.
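One simple strategy, sketched below under the assumption that every instance gets an equal share, is to divide the cluster's cores by the number of service instances and cap each Spark context at that number. `CoreBudget` and its methods are hypothetical helpers, not part of Spark; on a standalone cluster the result could be fed into a cap such as `spark.cores.max`.

```java
// Hypothetical helper: splits a fixed pool of cores across load-balanced
// service instances, each of which owns one Spark context.
final class CoreBudget {
    private CoreBudget() {}

    // Largest equal per-instance request that still fits in the cluster.
    static int coresPerInstance(int clusterCores, int instances) {
        if (clusterCores < 0 || instances <= 0) {
            throw new IllegalArgumentException("clusterCores must be >= 0 and instances > 0");
        }
        return clusterCores / instances;
    }

    // True when all instances can hold their requested cores at the same time.
    static boolean fits(int clusterCores, int instances, int requestedPerInstance) {
        return (long) instances * requestedPerInstance <= clusterCores;
    }

    public static void main(String[] args) {
        // Made-up numbers: 1000 cores shared by 4 service instances.
        System.out.println(coresPerInstance(1000, 4));
        System.out.println(fits(1000, 4, 300));
    }
}
```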

Another thing to remember is that the driver now runs inside the web container, so allocate enough memory to the process so that the web server does not blow up with an out-of-memory error.

I have created a microservice application using Spring Boot that hosts the Spark session behind a REST API.

This code has 2 types of queries:
 - Single query per HTTP thread.
 - Multiple queries per HTTP thread. This model is very powerful and can be used for answering complex queries.
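One way to read the second model is that a single request fans out several queries and combines their results. The sketch below assumes the queries run in parallel, which the post does not state, and it is not the repository code. `MultiQueryRunner` is a hypothetical name, and the `Function<String, R>` stands in for whatever actually executes a query, for example a call to `spark.sql(...)` on the shared session.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

// Runs several queries for one request in parallel and returns the results
// in the same order as the input queries.
final class MultiQueryRunner<R> {
    private final Function<String, R> runQuery; // stand-in for the real query call
    private final ExecutorService pool;

    MultiQueryRunner(Function<String, R> runQuery, ExecutorService pool) {
        this.runQuery = runQuery;
        this.pool = pool;
    }

    List<R> runAll(List<String> queries) {
        // Submit every query first so that they can overlap...
        List<CompletableFuture<R>> futures = queries.stream()
            .map(q -> CompletableFuture.supplyAsync(() -> runQuery.apply(q), pool))
            .collect(Collectors.toList());
        // ...then wait for each result, preserving input order.
        return futures.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // String::length is a placeholder "query engine" for the demo.
            MultiQueryRunner<Integer> runner = new MultiQueryRunner<>(String::length, pool);
            System.out.println(runner.runAll(Arrays.asList("select 1", "select 22")));
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the Spark session is thread safe, the worker threads can share it without extra locking; the pool size then limits how many queries one request can have in flight.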

Code is available on GitHub @ sparkmicroservices

