Thursday, January 31, 2013

Cassandra Hector - May not be enough replicas present to handle consistency level.

If you get this error, make sure your data center and rack names match the ones you used when you created your keyspace.

Make sure everything is lowercase.

For example:

In /etc/cassandra/cassandra-topology.properties, change all to lowercase.
10.216.218.73=dc1:rac1
10.31.2.31=dc2:rac1
default=dc1:rac1
Then create the keyspace in lowercase.
create keyspace helloworld with replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1, 'dc2': 1};
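For intuition, the failure can be modeled in a few lines. This is a simplified sketch, not Hector's actual logic: with a replication factor of 1 per data center, a mismatched dc name means zero replicas get placed there, so no consistency level can be satisfied.

```python
# Sketch: why "not enough replicas" happens. A QUORUM operation needs
# floor(RF/2) + 1 live replicas; if the keyspace's dc name doesn't
# match the topology, 0 replicas are placed in that data center.
# (Assumption: simplified model, ignores LOCAL_/EACH_ semantics.)

def quorum(replication_factor):
    """Number of replicas a QUORUM operation must reach."""
    return replication_factor // 2 + 1

def can_satisfy(consistency, rf, live_replicas):
    needed = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[consistency]
    return live_replicas >= needed

# With 'dc1': 1 there is a single replica; if the dc name is wrong,
# no replicas exist there at all and even ONE fails.
print(can_satisfy("QUORUM", 1, 0))  # False -> the Hector error
print(can_satisfy("QUORUM", 1, 1))  # True
```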

Setting up GitHub on AWS

sudo apt-get install git -y

Choose a location to store your code. I like to put my code on a separate EBS volume for portability. In the following, I will use /vol for this location.

mkdir /vol/src
cd /vol/src

git config --global user.name "your_name"
git config --global user.email "your_email"
git config --global github.user "your_github_login"
git clone ssh://git@github.com/username/repo.git

You will want to connect to GitHub over ssh rather than https: if you are building an image for auto-scaling, you don't want to enter a username and password every time. See Generating SSH Keys for more details.

Your project should be located in /vol/src/{your_project}.

Tuesday, January 29, 2013

Setting up Cassandra Multi Nodes on Amazon EC2

Cassandra is a NoSQL database. It is designed to be launched in a cluster of machines, providing high availability and fault tolerance.

Before starting, make sure you scan through Node and Cluster Initialization Properties.


Important Node attributes:

cluster_name
All nodes in a cluster must have the same name.

commitlog_directory
DataStax recommends putting this on a separate disk partition (perhaps an EBS volume).

data_file_directories
Stores the column family data.

partitioner
Defaults to RandomPartitioner.

rpc_address
Set to 0.0.0.0 to listen on all configured interfaces.

rpc_port
Port for Thrift server. Default is 9160.

saved_caches_directory
Location where column family key and row caches will be stored.

seeds
Nodes that hold information about the ring topology and from which new nodes obtain gossip information.

storage_port
Port for inter-node communication. Default is 7000.
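To sanity-check these attributes across nodes, you can pull them out of cassandra.yaml with a quick script. A minimal sketch (assumption: a line-based parse of top-level `key: value` entries only, not a full YAML parser):

```python
# Sketch: extract top-level settings from cassandra.yaml for a quick
# sanity check across nodes (e.g. that cluster_name matches everywhere).
import re

SAMPLE = """\
cluster_name: 'my_cluster'
commitlog_directory: /vol/cassandra/commitlog
rpc_address: 0.0.0.0
rpc_port: 9160
storage_port: 7000
"""

def top_level_settings(text):
    settings = {}
    for line in text.splitlines():
        m = re.match(r"^([a-z_]+):\s*(.+)$", line)
        if m:
            settings[m.group(1)] = m.group(2).strip().strip("'\"")
    return settings

conf = top_level_settings(SAMPLE)
print(conf["cluster_name"])  # my_cluster
print(conf["rpc_port"])      # 9160
```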


Create a Large Instance

You will need to launch at least a large instance. See Cassandra Hardware for more details. On a smaller instance, you may not be able to use the /etc/init.d/cassandra start command, and you will see a JVM heap memory error.

Ideal Cassandra Instance Specs:

  • 32 GB RAM (Minimum 4GB)
  • 8-core cpu
  • 2 disks (one for CommitLogDirectory, one for DataFileDirectories)
  • RAID 0 for DataFileDirectories disk when disk capacity is 50% full
  • XFS file system
  • Minimum 3 replications (instances)
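The heap error mentioned above comes from Cassandra's startup script sizing the JVM heap from system memory. A sketch of the default calculation (assumption: this is the 1.2-era formula from cassandra-env.sh; check your own copy):

```python
# Sketch of Cassandra's default max-heap sizing (assumption: the
# 1.2-era cassandra-env.sh formula: max of half-RAM capped at 1 GB
# and quarter-RAM capped at 8 GB).

def default_max_heap_mb(system_memory_mb):
    half = min(system_memory_mb // 2, 1024)
    quarter = min(system_memory_mb // 4, 8192)
    return max(half, quarter)

print(default_max_heap_mb(613))    # t1.micro: ~306 MB -> too small
print(default_max_heap_mb(7680))   # large (7.5 GB RAM): 1920 MB
print(default_max_heap_mb(32768))  # 32 GB RAM: 8192 MB
```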


We will NOT be using EBS, due to its poor I/O throughput and reliability. We will use the ephemeral volume instead.


Security Group

Port       Description

Public Facing Ports
22         SSH port.
8888       OpsCenter website port.

Cassandra Inter-node Ports
1024+      JMX reconnection/loopback ports. See the description for port 7199.
7000       Cassandra inter-node cluster communication.
7199       Cassandra JMX monitoring port. After the initial handshake, the JMX protocol requires that the client reconnect on a randomly chosen port (1024+).
9160       Cassandra client port (Thrift).

OpsCenter Ports
61620      OpsCenter monitoring port. The opscenterd daemon listens on this port for TCP traffic coming from the agent.
61621      OpsCenter agent port. The agents listen on this port for SSL traffic initiated by OpsCenter.

Create a Security Group with the settings above.

Ports 22 and 8888 will be open to 0.0.0.0/0.
Ports 1024-65535 will be restricted to your security group id. (Click on the Details tab of the Security Group to check your group id.)
All other ports will also be restricted to your group id.
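Once the group is in place, you can verify each port is reachable from another instance. A small stdlib-only checker (the host and port below are placeholders; run it from a machine inside the security group):

```python
# Check that a Cassandra port is reachable before digging deeper.
# (Assumption: host/port are placeholders for your own instances.)
import socket

def port_open(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. from an app server, port_open("10.31.2.31", 9160) should be True
print(port_open("127.0.0.1", 7199))
```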


Mounting the ephemeral drive

We will begin by formatting the ephemeral drive with XFS.

Use fdisk -l to check which device is your ephemeral drive. It may come formatted with ext3 already.

umount /dev/xvdb
mkfs.xfs -f /dev/xvdb

vi /etc/fstab

Remove the original entry for the ephemeral drive and put

/dev/xvdb /vol xfs noatime 0 0

sudo mkdir /vol
sudo mount /vol

You may also want to use RAID 0 to stripe across a set of ephemeral volumes.


Install Oracle Sun Java

Do not use OpenJDK. Cassandra works only with Oracle Sun Java.

Download jdk-6u38-linux-x64.bin.

mkdir /usr/java/latest

Upload or wget the JDK in this folder.

chmod +x jdk-6u38-linux-x64.bin

sudo ./jdk-6u38-linux-x64.bin

sudo update-alternatives --install "/usr/bin/java" "java" "/usr/java/latest/jdk1.6.0_38/bin/java" 1

sudo update-alternatives --set java /usr/java/latest/jdk1.6.0_38/bin/java


java -version
java version "1.6.0_38"
Java(TM) SE Runtime Environment (build 1.6.0_38-b05)
Java HotSpot(TM) 64-Bit Server VM (build 20.13-b02, mixed mode)

Make sure JNA is installed. With JNA, Linux will not swap out the JVM, and performance can improve.

sudo apt-get install libjna-java

vi /etc/security/limits.conf

Add:

cassandra soft memlock unlimited
cassandra hard memlock unlimited


Install Cassandra

Begin by installing a single node Cassandra. Read Cassandra - installing on Ubuntu 12.04 Amazon EC2.

Make sure Cassandra is at version 1.2.x and cqlsh is 2.3.x
cassandra -version 
cqlsh --version
We want to save the data in the ephemeral drive. The mount point we created earlier is /vol

mkdir /vol/cassandra
mkdir /vol/cassandra/commitlog
mkdir /vol/cassandra/data
mkdir /vol/cassandra/saved_caches

chown cassandra:cassandra -R /vol/cassandra

vi /etc/cassandra/cassandra.yaml

Point these directories to the ones we created above.

  • commitlog_directory
  • data_file_directories
  • saved_caches_directory

Kill Cassandra if you started it with the cassandra -f command. We want to start it from init.d instead.

sudo /etc/init.d/cassandra start
sudo /etc/init.d/cassandra stop
sudo /etc/init.d/cassandra status

Use nodetool to check the status:
nodetool -h localhost -p 7199 ring
Reboot and check that it's still running with "netstat -tupln".

If it's not starting, check the log /var/log/cassandra/output.log

If it's complaining about oldClusterName != newClusterName, just remove everything in the data_file_directories.


Create a Cassandra AMI

We will be setting up a ring (multi-node Cassandra). Before you create an AMI, umount /dev/xvdb and comment out the xfs entry in /etc/fstab; otherwise you won't be able to ssh into instances launched from this image.

Launch a second instance in another availability zone


Setting up a Cassandra Ring

Before you begin, make sure you have the following:

  • Cassandra on each node
  • a cluster name
  • IP of each node
  • seed nodes
  • snitches (EC2Snitch, EC2MultiRegionSnitch)
  • open required firewalls

A snitch determines which data centers and racks are written to and read from, and distributes replicas by grouping machines into data centers and racks.

For Ec2Snitch, a region is treated as a data center, and an availability zone is treated as a rack within that data center.
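For example, the AZ-to-dc/rack mapping can be sketched as follows (assumption: this mirrors the documented Ec2Snitch naming; the real snitch reads the availability zone from EC2 instance metadata):

```python
# Sketch of how Ec2Snitch derives data center and rack names from
# the availability zone string (assumption: documented behavior only).

def ec2_snitch(availability_zone):
    # "us-east-1a" -> data center "us-east", rack "1a"
    dc, rack = availability_zone.rsplit("-", 1)
    return dc, rack

print(ec2_snitch("us-east-1a"))  # ('us-east', '1a')
print(ec2_snitch("us-east-1d"))  # ('us-east', '1d')
```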


Setting up Multi Data Center Cassandra Ring

We will begin by tweaking the first node.

vi /etc/cassandra/cassandra.yaml

Set the following:
cluster_name: my_cluster
initial_token: 0
Start Cassandra. If you face any problems starting it, delete all the files in commitlog_directory and data_file_directories.

We will now add a second node in a different availability zone (e.g., if the first node is in us-east-1a, put the second one in us-east-1d).

ssh into your second instance. Remember to mount the partition back. Use "df" to make sure /dev/xvdb is mounted.
umount /mnt
vi /etc/fstab
uncomment "/dev/xvdb /vol xfs noatime 0 0" and remove any other entries that use /dev/xvdb
mkfs.xfs -f /dev/xvdb
mount /vol
mkdir /vol/cassandra
mkdir /vol/cassandra/commitlog
mkdir /vol/cassandra/data
mkdir /vol/cassandra/saved_caches
chown cassandra:cassandra -R /vol/cassandra
Now edit /etc/cassandra/cassandra.yaml.

The following needs to be changed on all nodes:
  • seeds
  • rpc_address
  • listen_address
The following needs to be changed on the new node:
  • initial_token
  • auto_bootstrap

Seeds
Add the private IPs of all nodes
- seeds: "10.31.2.31,10.216.218.73"

RPC Address
The address that clients connect to.


Listen Address
The address nodes use to communicate with each other.

For the first node,
listen_address: 10.31.2.31
rpc_address: 10.31.2.31
For the second node,
listen_address: 10.216.218.73
rpc_address: 10.216.218.73

Initial token (skip to Virtual nodes if you are using Cassandra 1.2.x and above)
This is used for load balancing. The first node should have a value of zero. All other nodes need this value recalculated every time a new node joins the cluster.

Calculate it based on the number of nodes, using the following Python program.

Create a file called token_generator.py. Paste the following in the file.
#! /usr/bin/python
import sys

if len(sys.argv) > 1:
    num = int(sys.argv[1])
else:
    num = int(raw_input("How many nodes are in your cluster? "))
for i in range(0, num):
    print('node %d: %d' % (i, i * (2 ** 127) // num))
Make it executable.
chmod +x token_generator.py
Execute the program with the number of nodes as the first argument. In our case, it's 2.
./token_generator.py 2
The output should be similar to the following
node 0: 0
node 1: 85070591730234615865843651857942052864
Put 85070591730234615865843651857942052864 as the initial token for the second node.
initial_token: '85070591730234615865843651857942052864'
If you get DatabaseDescriptor.java (line 509) Fatal configuration error, you are probably using Cassandra 1.2.x.


Virtual nodes (Cassandra 1.2.x or above)
vnodes were introduced in 1.2.x.

Set num_tokens to 256 and leave initial_token empty.


Auto bootstrapping
When a new node is added, the cluster will automatically migrate the correct range of data from existing nodes.

Do not both set auto_bootstrap: true and include the node in the seed list.

After all the above setup, start both nodes. Then check if they are up.
nodetool status

PropertyFileSnitch

Set endpoint_snitch: PropertyFileSnitch

We will be using PropertyFileSnitch to define our data centers and racks.

We will use dc1 to represent data center 1 and rac1 to represent rack 1.

Create /etc/cassandra/cassandra-topology.properties on all nodes and place the following:

10.216.218.73=dc1:rac1
10.31.2.31=dc2:rac1
default=dc1:rac1


default=dc1:rac1 applies when a node first joins and is not yet listed in the file.

Keep in mind that when creating our schema, we will use NetworkTopologyStrategy with the dc and rac names defined above.
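As a sanity check, the topology file can be parsed into the dc/rack map the snitch will see. A minimal stdlib sketch with the file contents inlined:

```python
# Sketch: parse cassandra-topology.properties into the dc/rack map
# used by PropertyFileSnitch (file contents inlined for illustration).

SAMPLE = """\
10.216.218.73=dc1:rac1
10.31.2.31=dc2:rac1
default=dc1:rac1
"""

def parse_topology(text):
    topology = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        node, location = line.split("=", 1)
        dc, rack = location.split(":", 1)
        topology[node] = (dc, rack)
    return topology

topo = parse_topology(SAMPLE)
print(topo["10.31.2.31"])  # ('dc2', 'rac1')
print(topo["default"])     # ('dc1', 'rac1')
```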

You may want to create an image again.


Testing the cluster replication 

Start Cassandra for both nodes:
service cassandra start
Check the status:


nodetool status

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address           Load       Tokens  Owns   Host ID                               Rack
UN  10.32.6.31        28.94 KB   256     48.2%  eab0379f-2ac6-408a-b6dc-0ad475337a28  rac1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address           Load       Tokens  Owns   Host ID                               Rack
UN  10.108.23.52      47.66 KB   256     51.8%  35f1f17c-84c1-4e10-83b5-857feba03f4d  rac1


In both nodes, try executing the following:

cqlsh 10.216.218.73
cqlsh 10.31.2.31

You should not have a problem connecting to both of these machines. Make sure you have the latest cqlsh (2.3.0 at the time of this post).

We will be executing a script. For a production machine, I recommend setting up Git and pulling your code from GitHub.

Create a script called test.cql.

Paste the following:
create keyspace helloworld with replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1, 'dc2': 1};
use helloworld;
create table activity (
activity_key int,
activity_time timeuuid,
activity_type varint,
primary key (activity_key, activity_time)
)
with clustering order by (activity_time desc);
Execute your script by running:
cqlsh 10.216.218.73 -f test.cql
Check to see if the keyspace "helloworld" exists:
cqlsh 10.216.218.73
describe keyspaces;

Connecting your Application to Cassandra

In the Security Group for Cassandra, open up 9160 to your application security group id.

Check that the connection is okay with telnet:
telnet 10.31.2.31 9160

What is RAID?

RAID (redundant array of independent disks) is a storage technology that combines multiple disk components into a logical unit. Different RAID levels define different configurations to employ striping, mirroring, or parity.

RAID levels

RAID 0 - block-striped volume
  • splits data across 2 or more disks
  • no data redundancy
  • no parity
  • used to increase performance
  • used to create a large logical disk out of 2 or more physical ones
  • size = n x min(all drive sizes)

RAID 1 - mirror
  • exact copy of data on 2 or more disks
  • size = min(all drive sizes)
  • does not provide protection against data corruption due to viruses, accidental file changes or deletions, or any other data-specific changes
  • read speed up to n x disk speed
  • use independent disk controllers to increase speed

RAID 2 - bit-striped volume
  • stripes data at the bit level; uses Hamming code for error correction
  • no commercial applications of RAID 2 today

RAID 3
  • byte-level striping with a dedicated parity disk
  • cannot serve multiple requests simultaneously
  • not commonly used

RAID 4
  • block-level striping with a dedicated parity disk
  • not commonly used

RAID 5
  • block-level striping with parity data

RAID 6
  • block-level striping with two parity blocks
  • penalty on write operations

RAID 10 - stripe of mirrors
  • top-level RAID 0 array composed of 2 or more RAID 1 arrays

RAID 0+1 - mirror of stripes
  • top-level RAID 1 mirror composed of 2 or more RAID 0 stripe sets
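The capacity rules above can be summarized for equal-size disks. A small sketch (sizes in GB; these are the standard formulas, ignoring filesystem overhead):

```python
# Usable capacity for common RAID levels with equal-size disks.

def raid_capacity(level, disks, size):
    if level == 0:
        return disks * size          # striping, no redundancy
    if level == 1:
        return size                  # mirror: one disk's worth
    if level == 5:
        return (disks - 1) * size    # one disk's worth of parity
    if level == 6:
        return (disks - 2) * size    # two disks' worth of parity
    if level == 10:
        return (disks // 2) * size   # stripe of mirrored pairs
    raise ValueError("level not modeled")

print(raid_capacity(0, 2, 500))   # 1000
print(raid_capacity(1, 2, 500))   # 500
print(raid_capacity(10, 4, 500))  # 1000
```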

Friday, January 25, 2013

Setting up Lighttpd Load Balancer on EC2 Ubuntu

Lighttpd is an asynchronous server. Along with Nginx, Lighttpd is one of the fast, event-driven servers designed to address the C10k problem. If you want to set up Nginx, read Setting up Nginx on EC2 Ubuntu.

This tutorial will demonstrate how to use Lighttpd to load balance application servers.


Creating an EC2 Instance

In the AWS Management Console, begin by creating a t1.micro Ubuntu Server 12.04.1 LTS 64-bit instance. (If you don't know how to create an instance, read Amazon EC2 - Launching Ubuntu Server 12.04.1 LTS step by step guide.)

Here are some guidelines:
  • Uncheck Delete on Termination for the root volume
  • Add ports 22, 80 and 443 to the Security Group; call it lighttpd.

Install Lighttpd

ssh -i {key} ubuntu@{your_ec2_public_address}

sudo apt-get update -y

sudo apt-get install -y lighttpd

Lighttpd should be running. To check its status, run

service lighttpd status

All the configuration files are located in /etc/lighttpd

To enable/disable a module
  • Use /usr/sbin/lighty-enable-mod and /usr/sbin/lighty-disable-mod
  • Or create a symbolic link from /etc/lighttpd/conf-available/{module} to /etc/lighttpd/conf-enabled/{module}
To load balance application servers, we will be using the 10-proxy.conf file as a template.

cd /etc/lighttpd/conf-available
cp 10-proxy.conf 11-proxy.conf
vi 11-proxy.conf

We are interested in the following two variables:
  • proxy.balance - choose from hash, round-robin or fair
  • proxy.server - put the servers you want to load balance to
For example:
proxy.balance     = "hash"
proxy.server     = ( "" =>
                     (
                       ( "host" => "10.204.199.85",
                         "port" => 80
                       ),
                       ( "host" => "10.202.111.140",
                         "port" => 80
                       )
                     )
                    )
The above settings will load balance requests across the two servers.
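The two balancing strategies can be sketched in a few lines. This is a simplified model for intuition only, not lighttpd's actual implementation; "hash" maps a key derived from the request consistently to one backend, while "round-robin" takes turns:

```python
# Simplified model of proxy.balance strategies (assumption: for
# intuition only; consult the lighttpd docs for exact behavior).
import itertools
import zlib

BACKENDS = ["10.204.199.85", "10.202.111.140"]

def hash_balance(key, backends):
    # The same key always maps to the same backend. zlib.crc32 is a
    # stable stdlib hash (Python's hash() is salted per run).
    return backends[zlib.crc32(key.encode()) % len(backends)]

round_robin = itertools.cycle(BACKENDS)  # "round-robin": take turns

print(hash_balance("/index.html", BACKENDS))
print(next(round_robin), next(round_robin), next(round_robin))
```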

Restart the server.
service lighttpd restart
Test the server.

To check the status:
netstat -ntulp

Thursday, January 24, 2013

Setting up a Java Tomcat7 Production Server on Amazon EC2

This tutorial will demonstrate how to build a Tomcat7 server running a Java application on Amazon EC2.

Here are the tools we will set up:
  • Apache Tomcat7
  • Open JDK7
  • GitHub
  • Maven 3.0.4
  • MySQL

Creating an EC2 Instance

In the AWS Management Console, begin by creating a t1.micro Ubuntu Server 12.04.1 LTS 64-bit machine. (If you don't know how to create an instance, read Amazon EC2 - Launching Ubuntu Server 12.04.1 LTS step by step guide.)

Here are some guidelines:
  • Uncheck Delete on Termination for the root volume
  • Add ports 22, 80, and 443 to the Security Group.

Create an EBS volume

We will create a 20GB volume to store our Java code. The EBS volume will be formatted with XFS.


If attaching the volume keeps getting stuck, restart the EC2 instance until it's attached.


Configure the EC2 instance

ssh into the instance (ssh -i {key} ubuntu@{your_ec2_public_address})

sudo apt-get update -y

My mounting point for /dev/xvdf is called /vol.
cd /vol
mkdir src
mkdir webapps
mkdir war_backups

/vol/src is where we will place the application code. /vol/webapps is where we will deploy the WAR file. /vol/war_backups is for making war backups, as the name implies.


Deploying code from GitHub

Skip this if you are using other source control. The idea is that we will put the Java application code in the /vol/src folder.

sudo apt-get install git -y
mkdir /vol/src
cd /vol/src

git config --global user.name "your_name"
git config --global user.email "your_email"
git config --global github.user "your_github_login"
git clone ssh://git@github.com/username/repo.git

You will want to connect to GitHub over ssh rather than https: if you are building an image for auto-scaling, you don't want to enter a username and password every time. See Generating SSH Keys for more details.

Your project should be located in /vol/src/{your_project}


Set up the Tomcat7 server

Begin by installing OpenJDK 7. Read Install Java OpenJDK 7 on Amazon EC2 Ubuntu.

Run "echo $JAVA_HOME" to check that it's set.

Install Tomcat 7. Read Install Tomcat 7 on Amazon EC2 Ubuntu.

Remember to change the ports to 80 and 443, and to set the root web directory as the Tomcat WAR root path.

Check http://{your_ec2_public_address} in your browser to make sure Tomcat7 is running.

Make sure Tomcat7 is still up after you reboot the machine.


Generating the war file

We will be using Maven to compile our Spring Java project. If you are using other build frameworks, skip this.

Read Install Maven 3 on Amazon EC2 Ubuntu.

Run "mvn --version" to make sure it's using OpenJDK 7 and running the latest version of Maven.

cd /vol/src/{your_project}
mvn clean install

A WAR file should be built.

Move this WAR file into the Tomcat webapps directory. If you are following this tutorial, it should be at /vol/webapps.

Remember to name this WAR file ROOT.war; it makes the load balancing mapping easier later.

Browse to check you can access the site.


Using Amazon SES as the SMTP email service

Using SES will increase the likelihood of email delivery. Read Using Amazon SES to send emails.

Recompile your project and test it.


Moving MySQL to Amazon RDS

If you are using MySQL, you should move to Amazon RDS, as it simplifies a lot of management and backup operations for you.

Read Using MySQL on Amazon RDS.

To interact with RDS through your EC2 instance, install MySQL Server or just the MySQL client interface.
sudo apt-get install -y mysql-server
Stop the local MySQL server. We won't be using it.
sudo /etc/init.d/mysql stop
Connect to your RDS instance
mysql -h {rds_public_address} -P 3306 -u{username} -p{password}
Do NOT use the following form; you will get an access denied error.
mysql -u{username} -p{password} -h {rds_public_address} -P 3306
Update the JDBC settings in your application, recompile and test it.


Load Balancing Tomcat7

If you are planning to run multiple instances, read Setting up Lighttpd Load Balancer on EC2 Ubuntu or Setting up Nginx on EC2 Ubuntu.

Using Amazon SES to send emails

Amazon SES provides an easy way to send emails from your application. The strongest point of SES is that it reduces the likelihood that your emails will be marked as spam by ISPs.

You will need your AWS Access Keys and the SMTP Credentials.


Getting your AWS Access Keys

If you want to send email directly by using the Amazon SES API or the AWS SDK, you will need to use these keys.

In the AWS console, click on My Account/Console -> Security Credentials.

Obtain your access key id and the secret access key. (Scroll to the section called Access Credentials).


Getting the SMTP Credentials

Go to the Amazon SES console.

Click Create My SMTP Credentials.

Enter a new name for the IAM user or just use the default. Click Create.

Make sure you download the credentials or record them somewhere. You will NOT be able to see them again.


Testing Email Sending

In the SES console, click Verified Senders on the left sidebar.

Add a few email addresses by clicking on Verify a New Email Address.

Test sending a few emails and make sure you get them. Both senders and receivers need to be verified.


Request Production Access

Fill in the form below. Usually, this process takes a day, so it's better to request access as early as possible.

https://portal.aws.amazon.com/gp/aws/html-forms-controller/contactus/SESProductionAccess2011Q3


Configuring your SMTP settings in your application

In the SES console, click SMTP Settings on the left sidebar.

You will need to configure your application with the following settings (also the SMTP credentials we set up above).

Server Name: email-smtp.us-east-1.amazonaws.com
Port: 25, 465 or 587
Use Transport Layer Security (TLS): Yes
Authentication: Your SMTP credentials (see above).


For my setup, I use Spring's email provider configured with the above settings.
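Outside the JVM, the same SMTP settings work with any client. A minimal Python sketch using the stdlib smtplib (the addresses and credentials below are placeholders; while in the sandbox, both sender and recipient must be verified):

```python
# Sketch: send through SES over SMTP (assumption: placeholder
# addresses and credentials; replace them with your verified sender
# and the SMTP credentials created above).
import smtplib
from email.mime.text import MIMEText

def build_message(sender, recipient, subject, body):
    msg = MIMEText(body)
    msg["From"], msg["To"], msg["Subject"] = sender, recipient, subject
    return msg

def send_via_ses(msg, smtp_user, smtp_password):
    # Port 587 with STARTTLS, matching the settings above.
    with smtplib.SMTP("email-smtp.us-east-1.amazonaws.com", 587) as server:
        server.starttls()
        server.login(smtp_user, smtp_password)
        server.sendmail(msg["From"], [msg["To"]], msg.as_string())

msg = build_message("sender@example.com", "recipient@example.com",
                    "SES test", "Hello from SES")
print(msg["Subject"])  # SES test
```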

That's it. Test sending emails from your application.

If you want to directly access the Amazon SES API, look at Using the Amazon SES API to Send Email.