Scaling TB's of data with Apache Spark and Scala DSL at Production
Apache Spark is one of the leading big-data processing platforms and has driven the adoption of Scala in many industry and academic settings. Because the Apache Spark framework itself is written in Scala, it is a real pleasure to explore the beauty of the functional Scala DSL through Spark.
This talk intends to present:
- The primary data structures (RDD, Dataset, DataFrame) and their use in general large-scale data processing with HBase (data lake) and Hive (analytical engine).
- Case study: the importance of physical data-splitting techniques such as coalesce, partitioning, and repartition, along with other important Spark internals, in scaling terabytes of data (~17 billion records).
- The crucial and most interesting aspects of parallel and concurrent distributed data processing: tuning memory, cache, and disk I/O, memory leaks, internal shuffle, Spark executors, the Spark driver, and so on.
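As a rough illustration of the partitioning and caching topics listed above, here is a minimal Scala sketch. It is not taken from the talk: the object name `PartitioningSketch`, the helper `partitionCounts`, the synthetic one-million-row table, and the partition counts (16 and 4) are all hypothetical choices made for the example, and it assumes Spark SQL is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical example object; names and sizes are illustrative assumptions.
object PartitioningSketch {

  // Returns (partitions after repartition, partitions after coalesce, row count).
  def partitionCounts(spark: SparkSession): (Int, Int, Long) = {
    // Synthetic stand-in for a large table: one million ids.
    val records = spark.range(0L, 1000000L).toDF("id")

    // repartition triggers a full shuffle and can raise or lower the partition count.
    val widened = records.repartition(16)

    // coalesce merges existing partitions without a full shuffle,
    // so it is only useful for lowering the partition count.
    val narrowed = widened.coalesce(4)

    // Cache a dataset that will be reused; spill to disk if memory is short.
    narrowed.persist(StorageLevel.MEMORY_AND_DISK)

    val result = (
      widened.rdd.getNumPartitions,
      narrowed.rdd.getNumPartitions,
      narrowed.count()
    )

    // Release the cached blocks once the dataset is no longer needed.
    narrowed.unpersist()
    result
  }

  def main(args: Array[String]): Unit = {
    // Local master for illustration only; a real job would normally
    // receive its master and resources from spark-submit.
    val spark = SparkSession.builder()
      .appName("partitioning-sketch")
      .master("local[4]")
      .getOrCreate()

    val (afterRepartition, afterCoalesce, rows) = partitionCounts(spark)
    println(s"partitions after repartition: $afterRepartition")
    println(s"partitions after coalesce: $afterCoalesce")
    println(s"row count: $rows")

    spark.stop()
  }
}
```

The general trade-off the sketch hints at: `repartition` pays for a shuffle but rebalances data evenly, while `coalesce` is cheaper but can leave partitions uneven. Which one suits a given job depends on the data and the cluster.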
Chetankumar Khatri
Accion Labs Inc. / India
Chetan Khatri is a Technical Lead at Accion Labs with diverse experience in data science and machine learning. He is an open source contributor to Apache Spark, Apache HBase, the Apache Spark - HBase Connector, Elixir Lang, and many other open source projects. He has authored curricula for Artificial Intelligence, Data Science, and Distributed Computing at KSKV Kachchh University, Government of Gujarat, India. He has also reviewed a couple of books for Packt Publishing, covering Scala machine learning, TensorFlow deep learning, and machine learning for the web. He has delivered talks at PyCon India 2016, PyKutch 2016, and FOSSASIA 2018, including:
- Distributing Machine learning with Apache Spark - PyCon India 2016
- Think Machine learning with Scikit-learn - PyKutch 2016
Open Source Contributor:
- Apache Spark
- Apache HBase
- Apache MXNet
- ParlAI
- Spark HBase Connector