Wednesday, July 10, 2013

AWS Elastic MapReduce - EMR MySQL DBInputFormat

In this post, we will build a MapReduce program as an executable JAR. To make this example more interesting than most of the other posts out there, we will modify the common WordCount example to read its input from MySQL instead of a text file.

You will need at least a basic understanding of what the mapper and the reducer are to follow this post. You may want to read the MapReduce tutorial from Apache.

We will use Maven to build the project. If you have no idea how to do this, read Building a JAR Executable with Maven and Spring. We will then feed this JAR to Amazon Elastic MapReduce (EMR) and save the output in Amazon S3.

EMR supports a specific set of Hadoop versions. We will be using 1.0.3.


What we will do:

Assume we have a database called Company and there is a table called Employee with two columns: id and title.

We will count the number of employees with the same titles.

This is the same as the WordCount example you see in other tutorials, except that we are fetching the input from a database.


Install Hadoop Library

First, in your Java project, add the Hadoop dependency to the pom.xml file.

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.0.3</version>
</dependency>


The File Structure

The program will be very basic and contain the following files. The filenames should be self-explanatory.

Main.java
Map.java
Reduce.java


The mapred library VS the mapreduce library

When you read other Hadoop examples online, you will see them use either the mapred or the mapreduce library. mapred is the older API, while mapreduce is the cleaner, newer one. To upgrade from mapred to mapreduce, read Hadoop - mapred VS mapreduce libraries.

This example will use the org.apache.hadoop.mapreduce library.


EmployeeRecord

We will need to serialize the object of our interest by implementing Writable and DBWritable, as shown below.
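A minimal sketch of EmployeeRecord, assuming the two columns are id (an int) and title (a varchar); the accessor names are my own choice, not from the original post:

```java
// EmployeeRecord.java -- one row of the Employee table.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class EmployeeRecord implements Writable, DBWritable {

    private int id;
    private String title;

    public int getId() { return id; }
    public String getTitle() { return title; }

    // Writable: how Hadoop serializes the record between tasks.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(title);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readInt();
        this.title = in.readUTF();
    }

    // DBWritable: how the record maps to a SQL statement / result set.
    @Override
    public void write(PreparedStatement statement) throws SQLException {
        statement.setInt(1, id);
        statement.setString(2, title);
    }

    @Override
    public void readFields(ResultSet resultSet) throws SQLException {
        this.id = resultSet.getInt("id");
        this.title = resultSet.getString("title");
    }
}
```

Hadoop needs both interfaces: Writable so the record can move through the shuffle, and DBWritable so DBInputFormat can populate it from a ResultSet.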




The Mapper
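A sketch of Map.java: it receives one EmployeeRecord per database row and emits (title, 1), just like WordCount emits (word, 1). The getTitle() accessor is an assumption about how EmployeeRecord exposes its fields.

```java
// Map.java -- emits (title, 1) for every employee row.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, EmployeeRecord, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text title = new Text();

    @Override
    protected void map(LongWritable key, EmployeeRecord value, Context context)
            throws IOException, InterruptedException {
        // The input key is the row offset supplied by DBInputFormat; we only
        // care about the record itself.
        title.set(value.getTitle());
        context.write(title, ONE);
    }
}
```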




The Reducer
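A sketch of Reduce.java: for each title it sums the 1s emitted by the mapper, producing the employee count per title.

```java
// Reduce.java -- sums the counts for each title.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```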




Main.java

We will hook everything up here. The steps are simple.

Create a Job.
Set output format.
Set input format.
Set Mapper class.
Set Reducer class.
Set input. (In our case, it will be from the database)
Set output.
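The steps above can be sketched as follows. The JDBC URL, database credentials, and SQL queries are placeholders of my own, not values from the original post; swap in your own connection details. The output path comes in as the first program argument.

```java
// Main.java -- wires the job together.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Main {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Point the job at the database (placeholder values).
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://{your_db_host}:3306/Company",
                "{db_user}", "{db_password}");

        // 1. Create a Job.
        Job job = new Job(conf, "employee-title-count");
        job.setJarByClass(Main.class);

        // 2. Set output format and the output key/value types.
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 3. Set input format: rows come from MySQL, not a text file.
        job.setInputFormatClass(DBInputFormat.class);

        // 4. and 5. Set the Mapper and Reducer classes.
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // 6. Set input: the Employee table, plus a count query that
        //    DBInputFormat uses to plan the input splits.
        DBInputFormat.setInput(job, EmployeeRecord.class,
                "SELECT id, title FROM Employee",
                "SELECT COUNT(id) FROM Employee");

        // 7. Set output: the S3 (or HDFS) path passed as the first argument.
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```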


Run the Job via the AWS EMR console

Compile the project and generate a self-contained JAR file. If you are using Maven, read Building a JAR Executable with Maven and Spring.

Upload your JAR file to your S3 bucket.

In the AWS EMR console, specify the location of the JAR file.

JAR location: {your_bucket_name}/{jar_name}

Arguments: s3n://{your_bucket_name}/output

The program above takes in the output location as an argument.

Read AWS - Elastic Map Reduce Tutorial for more details on how to create a job flow in EMR.

If you encounter the MySQL driver missing error, read Amazon Elastic MapReduce (EMR) ClassNotFoundException: com.mysql.jdbc.Driver.
