Compute Canada / IRMACS Spark workshop

Date:

Monday, January 25, 2016 - 17:00 PST to Tuesday, January 26, 2016 - 0:30 PST

Speaker Info:

Félix-Antoine Fortin
Calcul Québec

Description

Join Compute Canada and IRMACS for a hands-on workshop exploring the use of Apache Spark, a fast and general engine for large-scale data processing. Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Abstract:
With over 450 contributors, Apache Spark has become one of the most important open-source projects for Big Data. Based on the concept of in-memory data processing, Spark can achieve performance up to 100 times that of Hadoop. During this course, we will introduce Apache Spark and explain how to use cluster computing resources, like the ones available through Compute Canada, to analyze large amounts of data. Special attention will be given to interactive analysis with Jupyter (IPython Notebook).

Duration:
One day, divided into two 3-hour blocks

Requirements

Skills:

  • Functional knowledge of Unix command line
    • manipulating files and archives (tarball)
    • connecting to a remote server with ssh
    • retrieving files from a distant server
  • Python programming language
    • importing a module
    • manipulating strings
    • creating functions
    • familiarity with anonymous functions (lambda)
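
As a rough self-check of the Python skills listed above, the short sketch below touches each of them: importing a module, manipulating strings, creating a function, and using anonymous functions. It is only an illustration and not part of the official course material; the sample sentences and the `tokenize` helper are made up for this example.

```python
# Illustrative self-check for the Python prerequisites (not course material).
from functools import reduce  # importing a module


def tokenize(line):
    """Creating a function: lowercase a line and split it into words."""
    return line.lower().strip().split()  # manipulating strings


# Made-up sample data, used only for this sketch.
lines = ["Spark is fast", "Spark is general"]

# Anonymous functions (lambda) passed to reduce and map.
words = reduce(lambda acc, ws: acc + ws, map(tokenize, lines), [])
lengths = list(map(lambda w: len(w), words))

print(words)
print(lengths)
```

If each line of this snippet reads naturally to you, you should be comfortable with the Python used during the workshop.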

Computer and software requirements:

  • Laptop with internet access
  • Recent web browser (Chrome or Firefox)
  • VirtualBox 5.0 (https://www.virtualbox.org/)
  • PuTTY (Windows) or SSH client (Mac OS X / Linux)

Course plan:

1. Introduction

1.1 Introduction to Big Data
1.2 Presentation of the Map-Reduce paradigm
1.3 Presentation of Spark founding principles

2. Working with Spark

2.1 Spark Introduction, working with RDD
2.2 Working with key-value pairs
2.3 Spark SQL
2.4 Running and monitoring a Spark cluster
2.5 Machine learning with Spark

About the instructor:
Félix-Antoine Fortin has been an advanced computing analyst at Calcul Québec since 2015. He is also completing a PhD in computer engineering on estimation of distribution algorithms for multimodal optimization. He has a strong interest in everything related to data analytics, advanced computing and big data. Python has been his language of choice since 2010, and he is one of the main contributors to the open-source framework DEAP (Distributed Evolutionary Algorithms in Python).
