Hadoop

Introduction

Hadoop is an open-source project, written in Java, that implements MapReduce for scalable, reliable, and distributed computing. It is a framework for the distributed processing of large datasets across clusters of computers using a simple programming model.
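For a rough sense of the MapReduce model, the classic word-count task can be sketched outside of Hadoop with an ordinary shell pipeline on a single machine; input.txt below is just a placeholder for any text file:

# single-machine sketch of the MapReduce word-count flow (not Hadoop itself):
#   map:     split the text into one word per line, lower-cased
#   shuffle: bring identical words together by sorting
#   reduce:  count how often each word occurs
tr -cs '[:alpha:]' '\n' < input.txt | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn | head

Hadoop performs the same kind of computation, but with the map and reduce steps distributed across the nodes of a cluster and the data stored in the Hadoop Distributed File System (HDFS).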

Restrictions / License Information

Hadoop is distributed under the Apache License, Version 2.0.

Running Instructions

Running Hadoop on Bugaboo

Hadoop was originally designed to run on a cluster dedicated exclusively to Hadoop applications. Since Compute Canada facilities are shared among users running many different applications, the Hadoop installation on the Bugaboo system has been modified so that the Hadoop cluster is created dynamically by the job submission script and the Hadoop configuration is removed at the end of the job. Currently only hadoop-1.2.1 is supported, using the myHadoop implementation. This requires loading the Hadoop module and running a configuration script. Here is an example submission script:

#!/bin/bash
################################################################################
# hadoop.pbs - A sample submit script for Torque that illustrates how to
# spin up a Hadoop cluster for a map/reduce task using myHadoop
#
# Glenn K. Lockwood, San Diego Supercomputer Center February 2014
# (modified for the Bugaboo cluster (WestGrid/ComputeCanada) by M. Siegert)
################################################################################
#PBS -l nodes=4:ppn=2          # 4 nodes with 2 processors each
#PBS -l pmem=1000m             # 1000 MB of memory per process
#PBS -l walltime=1:00:00       # 1 hour of walltime
#PBS -M youremail@address.ca   # replace with your own email address

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# load the hadoop module to set the required environment variables
module load hadoop

# use a local scratch directory: TMPDIR is created by Torque uniquely for
# each job and removed after the job finishes.
myhadoop-configure.sh -s $TMPDIR

# fetch a sample input text (Project Gutenberg ebook #2701) if it is not already present
if [ ! -f ./pg2701.txt ]; then
    echo "*** Retrieving some sample input data"
    wget 'http://www.gutenberg.org/cache/epub/2701/pg2701.txt'
fi

date
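# start the Hadoop daemons (HDFS and the MapReduce framework) on the nodes
# assigned to this job; "hadoop dfsadmin -report" then shows the state of
# the newly created distributed filesystem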
start-all.sh
hadoop dfsadmin -report

# copy the sample input into HDFS and verify that it is there
hadoop dfs -mkdir data
hadoop dfs -put ./pg2701.txt data/
hadoop dfs -ls data
# run the WordCount example that ships with Hadoop
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount data wordcount-output

# copy the results from HDFS back to the working directory
hadoop dfs -ls wordcount-output
hadoop dfs -get wordcount-output ./

# shut down the Hadoop daemons and remove the job-specific configuration
stop-all.sh
myhadoop-cleanup.sh
date

In short: all of your hadoop commands must appear between the setup command myhadoop-configure.sh -s $TMPDIR and the cleanup command myhadoop-cleanup.sh.
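
To try this out, save the script under the name used in its header (hadoop.pbs is assumed here) and submit it with qsub. After the job finishes, the results copied back from HDFS appear in the wordcount-output directory; the part-* file names below reflect Hadoop's usual output naming and may differ slightly:

qsub hadoop.pbs                         # submit the job to Torque
qstat -u $USER                          # check on it while it is queued or running

# once the job has finished:
ls wordcount-output/                            # one part-* output file per reducer
sort -k2 -nr wordcount-output/part-* | head     # ten most frequent words ("word<TAB>count" lines)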

The Hadoop implementation is still at an experimental stage and we are very much interested in feedback from users; please email your comments to support@westgrid.ca, thanks! We are also working on an implementation of Hadoop-2; if you are interested, please contact us as well.

For More Information

Created: 2015-02-04

System: Bugaboo
Version: 1.2.1