Commit 87bdd44

Author: Justin Lee

update docs to indicate gradle as the build mechanism
remove sbt references; resolves HADOOP-123

1 parent c4ca413

10 files changed, +240 -169 lines changed


BSON_README.md (+35 -14)
@@ -1,56 +1,77 @@
## Using Backup Files (.bson)

Static .bson files (which is the format produced by the [mongodump](http://docs.mongodb.org/manual/reference/program/mongodump/) tool for backups) can also be used as input to Hadoop jobs, or written to as output files.

###Using .bson files for input

#####Setup

To use a BSON file as the input for a Hadoop job, you must set `mongo.job.input.format` to `"com.mongodb.hadoop.BSONFileInputFormat"` or use `MongoConfigUtil.setInputFormat(com.mongodb.hadoop.BSONFileInputFormat.class)`.

Then set `mapred.input.dir` to indicate the location of the .bson input file(s). The value for this property may be:

* the path to a single file,
* the path to a directory (all files inside the directory will be treated as BSON input files),
* located on the local file system (`file://...`), on Amazon S3 (`s3n://...`), on a Hadoop Filesystem (`hdfs://...`), or any other FS protocol your system supports,
* a comma delimited sequence of these paths.
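
A minimal driver sketch is shown below; the class name, input path, and mapper/reducer wiring are placeholders, and the exact `Job` construction may differ across Hadoop versions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BsonInputJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Read static .bson dump files instead of a live MongoDB collection.
            conf.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
            // A single file, a directory of .bson files, or a comma delimited list of paths.
            conf.set("mapred.input.dir", "hdfs:///backups/mydb/mycollection.bson");

            Job job = Job.getInstance(conf, "bson input example");
            job.setJarByClass(BsonInputJob.class);
            // ... configure mapper, reducer, and output key/value classes here ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }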

#####Code

BSON objects loaded from a .bson file do not necessarily have an `_id` field, so no key is supplied to the `Mapper`. Because of this, you should use `NullWritable` or simply `Object` as your input key for the map phase, and ignore the key variable in your code. For example:

    public void map(NullWritable key, BSONObject val, final Context context){
       // …
    }
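
A fuller, purely illustrative mapper that ignores the key and counts a hypothetical `user` field might look like this (the field name and output types are assumptions, not part of the connector):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.bson.BSONObject;

    public class BackupFileMapper extends Mapper<NullWritable, BSONObject, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(NullWritable key, BSONObject val, Context context)
                throws IOException, InterruptedException {
            // The key carries no information for .bson input, so it is ignored.
            // "user" is a hypothetical field name; substitute one from your own documents.
            Object user = val.get("user");
            if (user != null) {
                context.write(new Text(user.toString()), ONE);
            }
        }
    }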

#####Splitting .BSON files for parallelism

Because BSON contains headers and length information, a .bson file cannot be split at arbitrary offsets, since that would create incomplete document fragments. Instead it must be split at the boundaries between documents. To facilitate this, the mongo-hadoop adapter refers to a small metadata file which contains information about the offsets of documents within the file. This metadata file is stored in the same directory as the input file, with the same name but starting with a "." and ending with ".splits". If this metadata file already exists when the job runs, the `.splits` file will be read and used to directly generate the list of splits. If the `.splits` file does not yet exist, it will be generated automatically so that it is available for subsequent runs. To disable saving of this file, set `bson.split.write_splits` to `false`; splits will still be calculated and used. To disable calculating splits, set `bson.split.read_splits` to `false`.

The default split size is determined from the default block size on the input file's filesystem, or 64 megabytes if this is not available. You can set lower and upper bounds for the split size by setting values (in bytes) for `mapred.min.split.size` and `mapred.max.split.size`. The `.splits` file contains BSON objects which list the start positions and lengths for portions of the file, not exceeding the split size, which can then be read directly into a `Mapper` task.

However, for optimal performance, it's faster to build this file locally before uploading to S3 or HDFS if possible. You can do this by running the script `tools/bson_splitter.py`. The default split size is 64 megabytes, but you can set any value you want for split size by changing the value for `SPLIT_SIZE` in the script source, and re-running it.

###Producing .bson files as output

By using `BSONFileOutputFormat` you can write the output data of a Hadoop job into a .bson file, which can then be fed into a subsequent job or loaded into a MongoDB instance using `mongorestore`.

#####Setup

To write the output of a job to a .bson file, set `mongo.job.output.format` to `"com.mongodb.hadoop.BSONFileOutputFormat"` or use `MongoConfigUtil.setOutputFormat(com.mongodb.hadoop.BSONFileOutputFormat.class)`.

Then set `mapred.output.file` to be the location where the output .bson file should be written. This may be a path on the local filesystem, HDFS, S3, etc. Only one output file will be written, regardless of the number of input files used.

#####Writing splits during output

If you intend to feed this output .bson file into subsequent jobs, you can generate the `.splits` file on the fly as it is written, by setting `bson.output.build_splits` to `true`. This will save time over building the `.splits` file on demand at the beginning of another job. By default, this setting is set to `false` and no `.splits` files will be written during output.
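
Putting the output settings together, a rough sketch (assuming `conf` is the job's `Configuration`; the output path is a placeholder) could be:

    // Write job output as a .bson file rather than to a live MongoDB collection.
    conf.set("mongo.job.output.format", "com.mongodb.hadoop.BSONFileOutputFormat");
    // Destination for the single output .bson file (local FS, HDFS, S3, ...).
    conf.set("mapred.output.file", "hdfs:///output/results.bson");
    // Optionally build the .splits file while writing, for faster reuse as input.
    conf.setBoolean("bson.output.build_splits", true);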

####Settings for BSON input/output

* `bson.split.read_splits` - When set to `true`, will attempt to read + calculate split points for each BSON file in the input. When set to `false`, will create just *one* split for each input file, consisting of the entire length of the file. Defaults to `true`.
* `mapred.min.split.size` - Set a lower bound on acceptable size for file splits (in bytes). Defaults to 1.
* `mapred.max.split.size` - Set an upper bound on acceptable size for file splits (in bytes). Defaults to LONG_MAX_VALUE.
* `bson.split.write_splits` - Automatically save any split information calculated for input files, by writing to corresponding `.splits` files. Defaults to `true`.
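
These properties can be tuned on the job configuration before submission; the values below are arbitrary examples, not recommendations:

    // Assumes `conf` is the org.apache.hadoop.conf.Configuration used to build the job.
    // Keep splits between 8 MB and 32 MB.
    conf.setLong("mapred.min.split.size", 8L * 1024 * 1024);
    conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);
    // Calculate splits, but do not persist .splits files next to the input.
    conf.setBoolean("bson.split.read_splits", true);
    conf.setBoolean("bson.split.write_splits", false);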

README.md (+34 -39)
@@ -20,66 +20,57 @@ See the [release](https://github.com/mongodb/mongo-hadoop/releases) page.

## Building

-To build, first edit the value for `hadoopRelease in ThisBuild` in the build.sbt file to select the distribution of Hadoop that you want to build against. For example to build for CDH4:
+The mongo-hadoop connector currently supports the following versions of Hadoop: 0.23, 1.0, 1.1, 2.2, 2.3, and CDH 4. The default build will compile against the latest Apache Hadoop (currently 2.3). If you would like to build against a specific version of Hadoop, you simply need to pass `-Phadoop_version=<your version>`.

-    hadoopRelease in ThisBuild := "cdh4"
+Then run `./gradlew jar` to build the jars. The jars will be placed into `build/libs` for each module, e.g. for the core module it will be generated in the `core/build/libs` directory.

-or for Hadoop 1.0.x:
-
-    hadoopRelease in ThisBuild := "1.0"
-
-To determine which value you need to set in this file, refer to the list of distributions below.
-Then run `./sbt package` to build the jars, which will be generated in the `core/target/` directory.

After successfully building, you must copy the jars to the lib directory on each node in your hadoop cluster. This is usually one of the following locations, depending on which Hadoop release you are using:

* `$HADOOP_HOME/lib/`
* `$HADOOP_HOME/share/hadoop/mapreduce/`
* `$HADOOP_HOME/share/hadoop/lib/`

## Supported Distributions of Hadoop

###Apache Hadoop 1.0
Does **not** support Hadoop Streaming.

-Build using `"1.0"` or `"1.0.x"`
+Build using `-Phadoop_version=1.0`

###Apache Hadoop 1.1
Includes support for Hadoop Streaming.

-Build using `"1.1"` or `"1.1.x"`
+Build using `-Phadoop_version=1.1`

-* ###Apache Hadoop 0.20.*
-Does **not** support Hadoop Streaming
-
-Includes Pig 0.9.2.
-
-Build using `"0.20"` or `"0.20.x"`

###Apache Hadoop 0.23
-Includes Pig 0.9.2.
Includes support for Streaming

-Build using `"0.23"` or `"0.23.x"`
+Build using `-Phadoop_version=0.23`

###Cloudera Distribution for Hadoop Release 4

This is the newest release from Cloudera which is based on Apache Hadoop 2.0. The newer MR2/YARN APIs are not yet supported, but MR1 is still fully compatible.

-Includes support for Streaming and Pig 0.11.1.
+Includes support for Streaming.

-Build with `"cdh4"`
+Build with `-Phadoop_version=cdh4`

###Apache Hadoop 2.2
-Includes Pig 0.9.2
Includes support for Streaming

-Build using `"2.2"` or `"2.2.x"`
+Build using `-Phadoop_version=2.2`

+###Apache Hadoop 2.3
+Includes support for Streaming
+
+Build using `-Phadoop_version=2.3`

## Configuration

@@ -99,11 +90,13 @@ After successfully building, you must copy the jars to the lib directory on each

## Usage with Amazon Elastic MapReduce

Amazon Elastic MapReduce is a managed Hadoop framework that allows you to submit jobs to a cluster of customizable size and configuration, without needing to deal with provisioning nodes and installing software.

Using EMR with the MongoDB Connector for Hadoop allows you to run MapReduce jobs against MongoDB backup files stored in S3.

Submitting jobs using the MongoDB Connector for Hadoop to EMR simply requires that the bootstrap actions fetch the dependencies (MongoDB Java driver, mongo-hadoop-core libs, etc.) and place them into the Hadoop distribution's `lib` folders.

For a full example (running the enron example on Elastic MapReduce) please see [here](examples/elastic-mapreduce/README.md).

@@ -115,12 +108,15 @@ For examples on using Pig with the MongoDB Connector for Hadoop, also refer to t

## Notes for Contributors

-If your code introduces new features, please add tests that cover them if possible and make sure that the existing test suite still passes. If you're not sure how to write a test for a feature or have trouble with a test failure, please post on the google-groups with details and we will try to help.
+If your code introduces new features, please add tests that cover them if possible and make sure that `./gradlew check` still passes. If you're not sure how to write a test for a feature or have trouble with a test failure, please post on the google-groups with details and we will try to help.

### Maintainers
-Mike O'Brien (mikeo@10gen.com)
+Justin Lee (justin.lee@mongodb.com)

### Contributors
+* Mike O'Brien ([email protected])
* Brendan McAdams [email protected]
* Eliot Horowitz [email protected]
* Ryan Nitz [email protected]
@@ -141,5 +137,4 @@ Mike O'Brien ([email protected])

Issue tracking: https://jira.mongodb.org/browse/HADOOP/

Discussion: http://groups.google.com/group/mongodb-user/

build.gradle (+1 -1)
@@ -2,7 +2,7 @@ apply plugin: 'java'
apply plugin: 'download-task'

ext.configDir = new File(rootDir, 'config')
-ext.hadoop_version = project.getProperties().get('hadoop_version', 'cdh4')
+ext.hadoop_version = project.getProperties().get('hadoop_version', '2.3')
ext.versionMap = [
    '0.23': '0.23.10',
    '1.0' : '1.0.4',
