GitHub - sunghoon/hadoop2solr: Code and data for dzone article

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
doc		doc
ec2		ec2
lib		lib
solr-conf/conf		solr-conf/conf
src/main/java/com/bixolabs		src/main/java/com/bixolabs
.gitignore		.gitignore
README		README
build.properties		build.properties
build.xml		build.xml
pom.xml		pom.xml

Repository files navigation

This project contains the code and data for the dzone article on using Solr as a NoSQL endpoint for a Hadoop workflow.

Below are the steps that would get you up and running, assuming you have an AWS account set up as per http://openbixo.org/documentation/running-bixo-in-ec2/, and a git client installed.

====================================================
Building the Hadoop job jar
====================================================

% mkdir hadoop2solr-home
% cd hadoop2solr-home
% git clone git://github.com/bixolabs/hadoop2solr.git 
% cd hadoop2solr
% ant job

====================================================
Setting up the Hadoop cluster
====================================================

% cd hadoop2solr-home
% git clone git://github.com/bixo/bixo.git bixo
% cd bixo/ec2
% . setenv.sh
% hadoop-ec2 launch-cluster hadoop2solr 1 m1.large
% hadoop-ec2 push hadoop2solr ../hadoop2solr/build/ hadoop2solr-job-1.0-SNAPSHOT.jar
% hadoop-ec2 screen hadoop2solr

====================================================
Running the Hadoop jop (on the hadoop2solr cluster)
====================================================

% hadoop fs -mkdir /user/root/working
% hadoop distcp s3n://bixolabs-dzone/urls /user/root/working/crawldb

Monitor the job progress using your browser until the job completes successfully

% hadoop jar hadoop2solr-job-1.0-SNAPSHOT.jar com.bixolabs.tools.IndexTool -input working/crawldb -output working/solr-index

Monitor the job progress using your browser until the job completes successfully

The output directory will contain a single 'part-00000' directory, which contains a set of Lucene files for a single index.

% hadoop fs -copyToLocal /user/root/working/solr-index /mnt/solr-index

====================================================
Setting up Solr (on the hadoop2solr master)
====================================================

% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr.zip
% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr-conf.zip
% unzip solr.zip
% unzip solr-conf.zip
% mkdir solr-data
% cd solr-data
% ln -s /mnt/solr-index/part-00000 index
% cd ../solr
% java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar > jetty.log 2>&1 &

====================================================
Cleaning up/testing
====================================================

Use the AWS Console to kill off the slave server.

Use the AWS Console to open up TCP port 8983 on the master server.

Open a browser window on http://<ec2 server public name>:8983/solr/admin