3 days / 40+ speakers
12 workshops

May 17-19, 2017 | Vilnius, Lithuania
TantusData, Poland

Marcin Szymaniuk

Data developer, data infrastructure administrator, and consultant at TantusData. Companies I have worked or consulted for include Spotify, TrueCaller and, most recently, Apple.


Apache Spark? If Only It Worked

Do you plan to start working with Apache Spark? Are you already working with Spark but not getting the performance and stability you expected, and not sure where to look for a fix?
Spark has a very nice API and it promises high performance for crunching large data sets. It's really easy to write an app in Spark. Unfortunately, it's also easy to write one that doesn't perform the way you would expect or just fails for no obvious reason.
This talk will cover several common problems you might face when running Spark at full scale and, of course, how to solve them. Each problem will come with well-described background and examples, so it can be understood by people with no Spark experience. However, people who are already working with Spark are the main audience. The ultimate objective is to give the audience a practical framework for diagnosing and fixing the most common problems with Spark applications.
Classes of problems covered in the presentation:

  • Dealing with skewed data;
  • Spark on YARN and its memory model;
  • Caching;
  • Sizing executors;
  • Locality.





Apache Spark – Crash course

The course is intended for people who have no previous Spark experience. The ultimate goal is to provide an overview of the most important Spark features so that attendees gain enough knowledge to start building their first Spark applications.

Marcin is a data developer and architect with experience in data infrastructure administration. His knowledge comes from the real-life big data problems that he solves for his clients on a daily basis. The course is an introduction to Spark led by a hands-on practitioner, and it emphasizes the practical aspects of Spark along with the common problems and misconceptions he encounters when helping clients.

The training is designed for developers who want to start their adventure with Spark. No prior experience with Spark is required. All the hands-on exercises will be in Scala, but they will be simple enough for anybody with good knowledge of any modern programming language. Every trainee should have VirtualBox installed on their laptop so they can do the hands-on exercises and fully benefit from the workshop.

Programme overview

  1. Introduction to Spark
  • What is Spark?
  • Spark vs Hadoop
  • Spark with HDFS: quick overview
  • Spark on YARN: quick overview

  2. Basic building blocks in Spark
  • Introduction to Resilient Distributed Datasets (RDDs)
  • Spark shell
  • Overview of RDD operations
  • Key-value pair RDDs
  • Aggregating data with pair RDDs

Hands-on exercises:

  • Word count

  3. Writing and deploying Spark applications
  • Spark context
  • Building Spark applications
  • Submitting a Spark application to a cluster
  • Spark Web UI
  • Spark config: important options
  • Logging, YARN log aggregation

Hands-on exercises:

  • Joining RDDs

  4. Spark on a cluster
  • RDD partitions: HDFS, local file, shuffle
  • Data locality
  • Execution model overview: stages, tasks, executors
  • RDD persistence
  • Fault tolerance

Hands-on exercises:

  • Spark SQL aggregations

  5. Spark use cases
  • Data analysis
  • Machine learning
  • Iterative algorithms

Hands-on exercises:

  • PageRank

  6. Spark performance tips
  • Controlling parallelism
  • Dealing with skewed data
  • Broadcast variables

Hands-on exercises:

  • Performance tuning challenge!