forked from juanplopes/hadoop2solr
-
Notifications
You must be signed in to change notification settings - Fork 0
sunghoon/hadoop2solr
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This project contains the code and data for the dzone article on using Solr as a NoSQL endpoint for a Hadoop workflow. Below are the steps that would get you up and running, assuming you have an AWS account set up as per http://openbixo.org/documentation/running-bixo-in-ec2/, and a git client installed. ==================================================== Building the Hadoop job jar ==================================================== % mkdir hadoop2solr-home % cd hadoop2solr-home % git clone git://github.com/bixolabs/hadoop2solr.git % cd hadoop2solr % ant job ==================================================== Setting up the Hadoop cluster ==================================================== % cd hadoop2solr-home % git clone git://github.com/bixo/bixo.git bixo % cd bixo/ec2 % . setenv.sh % hadoop-ec2 launch-cluster hadoop2solr 1 m1.large % hadoop-ec2 push hadoop2solr ../hadoop2solr/build/ hadoop2solr-job-1.0-SNAPSHOT.jar % hadoop-ec2 screen hadoop2solr ==================================================== Running the Hadoop jop (on the hadoop2solr cluster) ==================================================== % hadoop fs -mkdir /user/root/working % hadoop distcp s3n://bixolabs-dzone/urls /user/root/working/crawldb Monitor the job progress using your browser until the job completes successfully % hadoop jar hadoop2solr-job-1.0-SNAPSHOT.jar com.bixolabs.tools.IndexTool -input working/crawldb -output working/solr-index Monitor the job progress using your browser until the job completes successfully The output directory will contain a single 'part-00000' directory, which contains a set of Lucene files for a single index. % hadoop fs -copyToLocal /user/root/working/solr-index /mnt/solr-index ==================================================== Setting up Solr (on the hadoop2solr master) ==================================================== % wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr.zip % wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr-conf.zip % unzip solr.zip % unzip solr-conf.zip % mkdir solr-data % cd solr-data % ln -s /mnt/solr-index/part-00000 index % cd ../solr % java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar > jetty.log 2>&1 & ==================================================== Cleaning up/testing ==================================================== Use the AWS Console to kill off the slave server. Use the AWS Console to open up TCP port 8983 on the master server. Open a browser window on http://<ec2 server public name>:8983/solr/admin
About
Code and data for dzone article
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published