Commit c4ca413

Author: Justin Lee
Message: reformat a bit
Parent: b612382

1 file changed: streaming/README.md (+34, -17)

Streaming support + MongoDB **requires** that your Hadoop distribution include the patches for the following issues:

* [HADOOP-5450 - Add support for application-specific typecodes to typed bytes](https://issues.apache.org/jira/browse/HADOOP-5450)
* [MAPREDUCE-764 - TypedBytesInput's readRaw() does not preserve custom type codes](https://issues.apache.org/jira/browse/MAPREDUCE-764)

For the mainline Apache Hadoop distribution, these patches were merged for the 0.21.0 release. We have also verified that the [Cloudera](http://cloudera.com) distribution (while still based on 0.20.x) includes these patches as of CDH3 Update 1; anecdotal evidence (which needs confirmation) suggests they may have been present since CDH2, and they likely exist in CDH3 as well.

Building Streaming
------------------

[…]

This will create a new “fat” jar in: `streaming/target/mongo-hadoop-streaming…`

This jar file is runnable with `hadoop jar`, and contains all of the dependencies necessary to run the job.

Each individual scripting language will have different requirements for working with MongoDB + Hadoop Streaming. Once you have the jar file built for mongo-hadoop-streaming, you will need to build and deploy the support libraries for your chosen language.

It will also be necessary to ensure these libraries are available on each Hadoop node in your cluster, along with the mongo-hadoop-core driver, as outlined in the main setup instructions. However, you do not need to distribute the mongo-hadoop-streaming jar anywhere.

### Overview

For distributions of Hadoop which support streaming, you can use MongoDB collections as the input or output of these jobs as well. Here is a description of the arguments needed to run a Hadoop streaming job with MongoDB support.

* Launch the job with `$HADOOP/bin/hadoop jar <location of streaming jar> …`
* Depending on which Hadoop release you use, the jar needed for streaming may be located in a different directory; commonly it is found somewhere like `$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar`
* Provide dependencies for the job by either placing each jar file in one of your Hadoop classpath directories, or listing it explicitly on the command line using `-libjars <jar file>`. You will need to do this for the Java MongoDB driver as well as the Mongo-Hadoop core library.
* When using a MongoDB collection as input, add the arguments `-jobconf mongo.input.uri=<input mongo URI>` and `-inputformat com.mongodb.hadoop.mapred.MongoInputFormat`
* When using a MongoDB collection as output, add the arguments `-jobconf mongo.output.uri=<output mongo URI>` and `-outputformat com.mongodb.hadoop.mapred.MongoOutputFormat`
* When using BSON as input, use `-inputformat com.mongodb.hadoop.mapred.BSONFileInputFormat`.
* Specify locations for the `-input` and `-output` arguments. Even when using MongoDB for input or output, these are required; you can point them at temporary directories in that case.
* Always add the arguments `-jobconf stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver` and `-io mongodb` so that the decoders/encoders needed by streaming can be resolved at run time.
* Specify a `-mapper <mapper script>` and `-reducer <reducer script>`.
* Pass any other arguments needed via `-jobconf`, for example `mongo.input.query` or other options for controlling splitting behavior or filtering.

Here is a full example of a streaming command broken into separate lines for readability:

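(A sketch assembling the arguments listed above; the jar locations, URIs, directories, and script names are placeholders, not values from this repository.)

    $HADOOP_HOME/bin/hadoop jar \
        $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
        -libjars /path/to/mongo-hadoop-core.jar,/path/to/mongo-java-driver.jar \
        -io mongodb \
        -jobconf stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver \
        -jobconf mongo.input.uri=mongodb://localhost/testdb.in \
        -jobconf mongo.output.uri=mongodb://localhost/testdb.out \
        -inputformat com.mongodb.hadoop.mapred.MongoInputFormat \
        -outputformat com.mongodb.hadoop.mapred.MongoOutputFormat \
        -input /tmp/streaming-in \
        -output /tmp/streaming-out \
        -mapper mapper.py \
        -reducer reducer.py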

Also, refer to the `TestStreaming` class in the test suite for more concrete examples.

**Important note**: if you need to use `print` or any other kind of text output when debugging streaming Map/Reduce scripts, be sure that you are writing the debug statements to `stderr` or to some kind of log file. Using `stdin` or `stdout` for any purpose other than communicating with the Hadoop Streaming layer will interfere with the encoding and decoding of data.
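
In a Python streaming script, for instance, debug output can be routed to `stderr` like this (a minimal sketch; the helper name is illustrative):

    import sys

    def debug(message):
        # stderr is safe to use: Hadoop Streaming only owns stdin and stdout
        sys.stderr.write(message + '\n')

    debug('mapper saw a batch of documents')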

### Python

##### Setup

To use Python for streaming, first install the Python package `pymongo_hadoop` using pip or easy_install. (For best performance, ensure that you are using the C extensions for BSON.)
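
For example, with pip:

    pip install pymongo_hadoop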

##### Mapper

To implement a mapper, write a function which accepts an iterable sequence of documents and calls `yield` to produce each output document, then call `BSONMapper()` against that function. For example:

    from pymongo_hadoop import BSONMapper
    def mapper(documents):
        for doc in documents:
            # NOTE: illustrative body, not the original example's; emit one
            # output document per input document
            yield {'_id': doc['_id'], 'count': 1}
    BSONMapper(mapper)

##### Reducer

To implement a reducer, write a function which accepts two arguments: a key and an iterable sequence of documents matching that key. Compute your reduce output and pass it back to Hadoop with `return`, then call `BSONReducer()` against this function. For example:

    from pymongo_hadoop import BSONReducer

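A minimal sketch of how such a reducer continues, assuming the map phase emits documents carrying a `count` field (the aggregation shown is illustrative, not the original example):

    def reducer(key, values):
        # combine all documents sharing this key into one output document
        count = 0
        for doc in values:
            count += doc.get('count', 1)
        return {'_id': key, 'count': count}

    BSONReducer(reducer)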

[…]

### Node.js

Install the nodejs mongo-hadoop lib with `npm install node_mongo_hadoop`.

##### Mapper

Write a function that accepts two arguments: the input document and a callback function. Call the callback function with the output of your map function, then pass that function as an argument to `node_mongo_hadoop.MapBSONStream`. For example:

    function mapFunc(doc, callback){
        if(doc.headers && doc.headers.From && doc.headers.To){

[…]