sunghoon/hadoop2solr
This project contains the code and data for the DZone article on using Solr as a NoSQL endpoint for a Hadoop workflow.

The steps below will get you up and running. They assume that you have an AWS account set up as described at http://openbixo.org/documentation/running-bixo-in-ec2/ and that a git client is installed.

====================================================
Building the Hadoop job jar
====================================================

% mkdir hadoop2solr-home
% cd hadoop2solr-home
% git clone git://github.com/bixolabs/hadoop2solr.git 
% cd hadoop2solr
% ant job

====================================================
Setting up the Hadoop cluster
====================================================

% cd hadoop2solr-home
% git clone git://github.com/bixo/bixo.git bixo
% cd bixo/ec2
% . setenv.sh
% hadoop-ec2 launch-cluster hadoop2solr 1 m1.large
% hadoop-ec2 push hadoop2solr ../hadoop2solr/build/hadoop2solr-job-1.0-SNAPSHOT.jar
% hadoop-ec2 screen hadoop2solr

====================================================
Running the Hadoop job (on the hadoop2solr cluster)
====================================================

% hadoop fs -mkdir /user/root/working
% hadoop distcp s3n://bixolabs-dzone/urls /user/root/working/crawldb

Monitor the progress of the distcp job in your browser until it completes successfully.

% hadoop jar hadoop2solr-job-1.0-SNAPSHOT.jar com.bixolabs.tools.IndexTool -input working/crawldb -output working/solr-index

Monitor the progress of the indexing job in your browser until it completes successfully.

The output directory will contain a single 'part-00000' directory, which contains a set of Lucene files for a single index.

% hadoop fs -copyToLocal /user/root/working/solr-index /mnt/solr-index
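
Before pointing Solr at the copied index, it can be worth confirming that the local 'part-00000' directory really looks like a Lucene index. The sketch below is one possible check, not part of this project: check_lucene_index is a hypothetical helper, and it relies only on the fact that a Lucene index stores its commit point in a file named segments_N.

```shell
# Hypothetical helper (not shipped with hadoop2solr): succeed only if the
# given directory exists and contains at least one Lucene segments_N file.
check_lucene_index() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    # If nothing matches, the glob stays literal and the -f test fails.
    for f in "$dir"/segments_*; do
        if [ -f "$f" ]; then
            echo "ok: $dir"
            return 0
        fi
    done
    echo "no segments_N file in: $dir" >&2
    return 1
}
```

With the function defined in the current shell, the copied index could be checked like this:

% check_lucene_index /mnt/solr-index/part-00000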

====================================================
Setting up Solr (on the hadoop2solr master)
====================================================

% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr.zip
% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr-conf.zip
% unzip solr.zip
% unzip solr-conf.zip
% mkdir solr-data
% cd solr-data
% ln -s /mnt/solr-index/part-00000 index
% cd ../solr
% java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar > jetty.log 2>&1 &

====================================================
Cleaning up/testing
====================================================

Use the AWS Console to kill off the slave server.

Use the AWS Console to open up TCP port 8983 on the master server.

Open a browser window on http://<ec2 server public name>:8983/solr/admin
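
The index can also be checked from the command line rather than the admin page. The sketch below assumes that the configuration in solr-conf.zip keeps Solr's standard /select request handler, which has not been verified here. solr_select_url is a hypothetical helper that only builds a match-all query URL, and the host name passed to it is a placeholder.

```shell
# Hypothetical helper: print a match-all query URL for the Solr instance
# listening on port 8983 of the given host. The second argument is the
# number of rows to return (default 0, so only the hit count comes back).
solr_select_url() {
    host="$1"
    rows="${2:-0}"
    printf 'http://%s:8983/solr/select?q=*:*&rows=%s\n' "$host" "$rows"
}
```

With the function defined in the current shell, a query against the master might look like this. If the handler is present, the numFound attribute in the response gives the number of indexed documents.

% curl "$(solr_select_url <ec2 server public name>)"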
