Streaming support + MongoDB **requires** your Hadoop distribution include the patches for the following issues:

* [HADOOP-5450 - Add support for application-specific typecodes to typed bytes](https://issues.apache.org/jira/browse/HADOOP-5450)
* [MAPREDUCE-764 - TypedBytesInput's readRaw() does not preserve custom type codes](https://issues.apache.org/jira/browse/MAPREDUCE-764)

For the mainline Apache Hadoop distribution, these patches were merged for the 0.21.0 release. We have verified as well that the [Cloudera](http://cloudera.com) distribution (while still based on 0.20.x) includes these patches in CDH3 Update 1; anecdotal evidence (which needs confirmation) indicates they may have been there since CDH2, and likely exist in CDH3 as well.

Building Streaming
------------------

This will create a new “fat” jar in: `streaming/target/mongo-hadoop-streaming…`

This jar file is runnable with `hadoop jar`, and contains all of the dependencies necessary to run the job.

Each individual scripting language will have different requirements for working with MongoDB + Hadoop Streaming. Once you have the jar file built for mongo-hadoop-streaming, you will need to build and deploy the support libraries for your chosen language.

It will also be necessary to ensure these libraries are available on each Hadoop node in your cluster, along with the mongo-hadoop-core driver as outlined in the main setup instructions. However, you do not need to distribute the mongo-hadoop-streaming jar anywhere.

### Overview

For distributions of Hadoop which support streaming, you can also use MongoDB collections as the input or output for these jobs. Here is a description of the arguments needed to run a Hadoop streaming job including MongoDB support.

* Launch the job with `$HADOOP/bin/hadoop jar <location of streaming jar> …`
* Depending on which Hadoop release you use, the jar needed for streaming may be located in a different directory; commonly it is found somewhere like `$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar`
* Provide dependencies for the job either by placing each jar file in one of your Hadoop classpath directories, or by explicitly listing it on the command line using `-libjars <jar file>`. You will need to do this for the Java MongoDB driver as well as the Mongo-Hadoop core library.
* When using a MongoDB collection as input, add the arguments `-jobconf mongo.input.uri=<input mongo URI>` and `-inputformat com.mongodb.hadoop.mapred.MongoInputFormat`
* When using a MongoDB collection as output, add the arguments `-jobconf mongo.output.uri=<output mongo URI>` and `-outputformat com.mongodb.hadoop.mapred.MongoOutputFormat`
* When using BSON as input, use `-inputformat com.mongodb.hadoop.mapred.BSONFileInputFormat`.
* Specify locations for the `-input` and `-output` arguments. Even when using MongoDB for input or output, these are required; you can use temporary directories for these in such a case.
* Always add the arguments `-jobconf stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver` and `-io mongodb` so that decoders/encoders needed by streaming can be resolved at run time.
* Specify a `-mapper <mapper script>` and `-reducer <reducer script>`.
* Pass any other arguments needed using `-jobconf`, for example `mongo.input.query` or other options for controlling splitting behavior or filtering.

Here is a full example of a streaming command broken into separate lines for readability:
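
(The command below is a sketch assembled from the arguments described above; the jar and library file names, the `mapper.py`/`reducer.py` scripts, and the MongoDB URIs are placeholders to adapt to your own setup.)

    $HADOOP_HOME/bin/hadoop jar mongo-hadoop-streaming-assembly.jar \
        -libjars mongo-java-driver.jar,mongo-hadoop-core.jar \
        -io mongodb \
        -jobconf stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver \
        -jobconf mongo.input.uri=mongodb://127.0.0.1/testdb.in \
        -jobconf mongo.output.uri=mongodb://127.0.0.1/testdb.out \
        -inputformat com.mongodb.hadoop.mapred.MongoInputFormat \
        -outputformat com.mongodb.hadoop.mapred.MongoOutputFormat \
        -input /tmp/streaming-input \
        -output /tmp/streaming-output \
        -mapper mapper.py \
        -reducer reducer.py
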
Also, refer to the `TestStreaming` class in the test suite for more concrete examples.

**Important note**: if you need to use `print` or any other kind of text output when debugging streaming Map/Reduce scripts, be sure that you are writing the debug statements to `stderr` or some kind of log file. Using `stdin` or `stdout` for any purpose other than communicating with the Hadoop Streaming layer will interfere with the encoding and decoding of data.
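
For instance, in Python, debug output can be sent to `stderr` like this (a minimal sketch; the counter and message are illustrative):

    import sys

    processed = 1042  # illustrative counter your script might maintain
    # Diagnostics go to stderr, leaving stdout free for the streaming protocol.
    sys.stderr.write("processed %d documents\n" % processed)
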
### Python

##### Setup

To use Python for streaming, first install the Python package `pymongo_hadoop` using pip or easy_install. (For best performance, ensure that you are using the C extensions for BSON.)
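
One way to verify that the C extensions are active is PyMongo's `has_c()` helpers (a quick check, assuming a standard PyMongo install):

    import pymongo
    import bson

    # Both should print True when the C extensions were built and installed.
    print(pymongo.has_c())
    print(bson.has_c())
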
##### Mapper

To implement a mapper, write a function which accepts an iterable sequence of documents and calls `yield` to produce each output document, then call `BSONMapper()` against that function. For example:

    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        for doc in documents:
            # Illustrative map step: emit one output document per input document.
            yield {'_id': doc['_id'], 'count': 1}

    BSONMapper(mapper)
##### Reducer

To implement a reducer, write a function which accepts two arguments: a key and an iterable sequence of documents matching that key. Compute your reduce output and pass it back to Hadoop with `return`. Then call `BSONReducer()` against this function. For example:

    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        # Illustrative reduce step: sum the counts emitted by the mapper for this key.
        return {'_id': key, 'count': sum(v['count'] for v in values)}

    BSONReducer(reducer)

### Node.js

##### Setup

Install the nodejs mongo-hadoop lib with `npm install node_mongo_hadoop`.

##### Mapper

Write a function that accepts two arguments: the input document, and a callback function. Call the callback function with the output of your map function. Then pass that function as an argument to `node_mongo_hadoop.MapBSONStream`. For example: