Monday, June 17, 2013

AWS - Elastic Map Reduce Tutorial

MapReduce has become a very common technique for processing large datasets in parallel.

Let's say you have a database table with username and description columns, and you want to replace the HTML tags in the description column with empty spaces. Now suppose the table holds petabytes of data; a single machine would take forever to do this job.

MapReduce works by distributing this job among multiple machines. Each machine processes a different slice of the data in parallel, and the outputs are then aggregated. A job that might take days on one machine can finish in minutes.
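
To make the idea concrete, here is a tiny single-machine Python sketch of the same map-then-aggregate pattern (illustrative only; the chunks here stand in for the data splits that would normally live on different machines):

# Toy, single-machine illustration of map/aggregate.
# Each map_chunk() call would normally run on a different machine
# over a different chunk of the data; reduce_counts() sums the results.
from collections import Counter

def map_chunk(lines):
    # Emit (word, 1) pairs for every word in this chunk of input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_counts(pairs):
    # Sum the counts for each distinct word.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

chunk1 = ["the quick brown fox", "the lazy dog"]
chunk2 = ["the quick dog"]
pairs = list(map_chunk(chunk1)) + list(map_chunk(chunk2))
print(reduce_counts(pairs))   # e.g. Counter({'the': 3, 'quick': 2, 'dog': 2, ...})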

In this tutorial, we will experiment with Amazon's Elastic MapReduce.

Let's get started.


Create an S3 bucket

Elastic MapReduce uses S3 to store its input and output. We will first create a bucket.

Log into the Amazon S3 console and create a bucket, say my_map_reduce_data. Amazon S3 bucket names need to be unique across all Amazon S3 buckets, so it's best to prefix yours with your company name.


Create input data

Let's create a text file and put some random data into it. We will create a MapReduce function to count word frequencies.

Ex.
apple apple orange orange orange
pear pear pear pear pear pear pear pineapple pineapple

Save this file as input.txt.

Create a folder inside my_map_reduce_data called input and upload input.txt into it.
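
If you prefer scripting over the console, the bucket creation and upload can also be done with the boto library. This is a sketch only, assuming boto is installed and your AWS credentials are configured:

# Create the bucket and upload the input file with boto.
import boto

conn = boto.connect_s3()
bucket = conn.create_bucket('my_map_reduce_data')   # name must be globally unique
key = bucket.new_key('input/input.txt')             # the "input" folder
key.set_contents_from_filename('input.txt')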


Implementing the mapper function

Download the following file and save it as wordSplitter.py

https://s3.amazonaws.com/elasticmapreduce/samples/wordcount/wordSplitter.py

It's a script that reads the input line by line and, for every word it encounters, emits the word along with a count of 1; the aggregate reducer later sums these counts to produce the total number of occurrences of each distinct word.
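
A minimal sketch of such a streaming mapper (not necessarily the exact contents of wordSplitter.py) looks roughly like this:

#!/usr/bin/python
# Sketch of a streaming word-count mapper.
# Reads lines from stdin and emits "LongValueSum:<word><TAB>1" for every word;
# the built-in aggregate reducer sums these 1s to produce per-word totals.
import sys
import re

WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

for line in sys.stdin:
    for word in WORD.findall(line):
        print("LongValueSum:" + word.lower() + "\t1")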

Upload wordSplitter.py to the top level of my_map_reduce_data.


Launch the Elastic MapReduce Cluster

Sign in to the Elastic MapReduce Console.

Click on Create New Job Flow.

Name the Job Flow WordSplitter.

Choose Amazon Distribution for the Hadoop version.

Choose Streaming as the job flow type. With streaming, you can write the mapper and reducer scripts in any of the following languages: Ruby, Perl, Python, PHP, R, Bash, or C++.

Click Continue.


Input, output locations

Fill in the following:
Input Location: my_map_reduce_data/input
Output Location: my_map_reduce_data/output (do not create this folder ahead of time; the job creates it and will fail if it already exists)
Mapper: my_map_reduce_data/wordSplitter.py
Reducer: aggregate

Click Continue.
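
If you would rather launch the same streaming job flow from code instead of the console, the boto library supports this. The following is a sketch under the same assumptions as before (boto installed, AWS credentials configured), using the S3 paths from this tutorial:

# Launch an equivalent streaming job flow with boto.
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()   # uses AWS credentials from your environment

step = StreamingStep(
    name='Word count step',
    mapper='s3n://my_map_reduce_data/wordSplitter.py',
    reducer='aggregate',
    input='s3n://my_map_reduce_data/input',
    output='s3n://my_map_reduce_data/output')   # must not already exist

jobflow_id = conn.run_jobflow(
    name='WordSplitter',
    log_uri='s3n://my_map_reduce_data/logs',
    steps=[step])
print(jobflow_id)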


Configure EC2 Instances

Leave the options as they are.

Click Continue.


Advanced Options

If you want to ssh into the master node, specify the Amazon EC2 Key Pair.

For the Amazon S3 Log Path, enter my_map_reduce_data/logs.

Check Yes for Enable debugging. It will create an index of the log files in Amazon SimpleDB.

Leave the other options set to No.


Bootstrap Actions

Proceed without bootstrap actions. Bootstrap actions allow additional software to be installed on the cluster nodes before MapReduce processes any data.


Review the information and start the job flow. You should see it start running.

You can monitor the status of the job and its nodes in the Elastic MapReduce web console.

Check the output folder in S3 after the job is completed.
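
For the sample input above, the output files (typically named part-00000, part-00001, and so on) should contain tab-separated word/count pairs along these lines:

apple	2
orange	3
pear	7
pineapple	2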

Remember to delete the bucket and its contents when you are done to avoid getting charged.
