Data developer, data infrastructure administrator, and consultant at TantusData. Companies I have worked or consulted for include Spotify, TrueCaller and, most recently, Apple.
Apache Spark? If Only It Worked
Do you plan to start working with Apache Spark? Or are you already working with Spark, but you haven’t gotten the performance and stability you expected and are not sure where to look for a fix?
Spark has a very nice API and promises high performance for crunching large data sets. It’s really easy to write an app in Spark. Unfortunately, it’s also easy to write one that doesn’t perform the way you would expect or just fails for no obvious reason.
This talk walks through several common problems you might face when running Spark at full scale and, of course, how to solve them. Each problem comes with well-described background and examples, so it can be understood by people with no Spark experience. However, people who already work with Spark are the main audience. The ultimate objective is to give the audience a practical framework for tackling the most common problems with Spark applications.
Classes of problems covered in the presentation:
- Dealing with skewed data;
- Spark on YARN and its memory model;
- Sizing executors;