## Using Backup Files (.bson)

Static .bson files (the format produced by the [mongodump](http://docs.mongodb.org/manual/reference/program/mongodump/) tool for backups) can also be used as input to Hadoop jobs, or written to as output files.

### Using .bson files for input

##### Setup

To use a .bson file as the input for a Hadoop job, you must set `mongo.job.input.format` to `"com.mongodb.hadoop.BSONFileInputFormat"` or use `MongoConfigUtil.setInputFormat(com.mongodb.hadoop.BSONFileInputFormat.class)`.

Then set `mapred.input.dir` to indicate the location of the .bson input file(s); a configuration sketch follows the list below. The value for this property may be:

* the path to a single file,
* the path to a directory (all files inside the directory will be treated as BSON input files),
* located on the local file system (`file://...`), on Amazon S3 (`s3n://...`), on a Hadoop Filesystem (`hdfs://...`), or any other FS protocol your system supports,
* a comma-delimited sequence of these paths.
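
For example, a minimal configuration sketch might look like the following. (The helper class, its method name, and the HDFS path are illustrative placeholders, not part of the connector.)

    import org.apache.hadoop.conf.Configuration;

    public final class BsonInputConfig {
        // Returns a Configuration wired up to read static .bson files as job input.
        public static Configuration buildInputConf() {
            Configuration conf = new Configuration();
            // Use BSONFileInputFormat instead of reading from a live MongoDB collection.
            conf.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
            // A single file, a directory, or a comma-delimited list of paths on any
            // supported filesystem (file://, s3n://, hdfs://, ...).
            conf.set("mapred.input.dir", "hdfs://namenode:9000/backups/mydb/mycollection.bson");
            return conf;
        }
    }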

##### Code

BSON objects loaded from a .bson file do not necessarily have an `_id` field, so no key is supplied to the `Mapper`. Because of this, you should use `NullWritable` or simply `Object` as your input key for the map phase, and ignore the key variable in your code. For example:

    public void map(NullWritable key, BSONObject val, final Context context){
       // …
    }
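
A slightly fuller sketch of the same pattern is below; it assumes documents carrying a hypothetical `category` field and is not part of the connector's API.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.bson.BSONObject;

    // Counts documents per "category"; the input key is ignored because BSON input
    // supplies no meaningful key.
    public class BsonCategoryMapper
            extends Mapper<NullWritable, BSONObject, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        public void map(NullWritable key, BSONObject val, Context context)
                throws IOException, InterruptedException {
            Object category = val.get("category");   // hypothetical field name
            if (category != null) {
                outKey.set(category.toString());
                context.write(outKey, ONE);
            }
        }
    }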

##### Splitting .bson files for parallelism

Because BSON contains headers and length information, a .bson file cannot be split at arbitrary offsets, because doing so would create incomplete document fragments. Instead, it must be split at the boundaries between documents. To facilitate this, the mongo-hadoop adapter refers to a small metadata file which contains information about the offsets of documents within the file. This metadata file is stored in the same directory as the input file, with the same name but starting with a "." and ending with ".splits". If this metadata file already exists when the job runs, the `.splits` file will be read and used to directly generate the list of splits. If the `.splits` file does not yet exist, it will be generated automatically so that it is available for subsequent runs. To disable saving of this file, set `bson.split.write_splits` to `false`; splits will still be calculated and used. To disable calculating of splits, set `bson.split.read_splits` to `false`.

The default split size is determined from the default block size on the input file's filesystem, or 64 megabytes if this is not available. You can set lower and upper bounds for the split size by setting values (in bytes) for `mapred.min.split.size` and `mapred.max.split.size`. The `.splits` file contains BSON objects which list the start positions and lengths for portions of the file, not exceeding the split size, which can then be read directly into a `Mapper` task.
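
For example, a short sketch of tightening those bounds on a Hadoop `Configuration` object like the one in the earlier sketch (the 16 MB and 128 MB values are arbitrary illustrations):

    // Constrain split sizes to between 16 MB and 128 MB (values are in bytes).
    conf.setLong("mapred.min.split.size", 16L * 1024 * 1024);
    conf.setLong("mapred.max.split.size", 128L * 1024 * 1024);
    // Optionally control the .splits metadata behavior described above.
    conf.setBoolean("bson.split.write_splits", true);  // persist calculated splits
    conf.setBoolean("bson.split.read_splits", true);   // calculate/read split points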

However, for optimal performance, it's faster to build this file locally before uploading to S3 or HDFS, if possible. You can do this by running the script `tools/bson_splitter.py`. The default split size is 64 megabytes, but you can set any value you want by changing the value of `SPLIT_SIZE` in the script source and re-running it.

### Producing .bson files as output

By using `BSONFileOutputFormat` you can write the output data of a Hadoop job into a .bson file, which can then be fed into a subsequent job or loaded into a MongoDB instance using `mongorestore`.

##### Setup

To write the output of a job to a .bson file, set `mongo.job.output.format` to `"com.mongodb.hadoop.BSONFileOutputFormat"` or use `MongoConfigUtil.setOutputFormat(com.mongodb.hadoop.BSONFileOutputFormat.class)`.

Then set `mapred.output.file` to the location where the output .bson file should be written. This may be a path on the local filesystem, HDFS, S3, etc. Only one output file will be written, regardless of the number of input files used.
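
A minimal output configuration sketch, again on a Hadoop `Configuration` object (the output path is an illustrative placeholder):

    // Write the job's output as a single .bson file.
    conf.set("mongo.job.output.format", "com.mongodb.hadoop.BSONFileOutputFormat");
    conf.set("mapred.output.file", "hdfs://namenode:9000/output/results.bson");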

##### Writing splits during output

If you intend to feed this output .bson file into subsequent jobs, you can generate the `.splits` file on the fly as it is written by setting `bson.output.build_splits` to `true`. This saves time over building the `.splits` file on demand at the beginning of another job. By default, this setting is `false` and no `.splits` files will be written during output.
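
For example:

    // Also emit the .splits metadata file alongside the output .bson file.
    conf.setBoolean("bson.output.build_splits", true);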

#### Settings for BSON input/output

* `bson.split.read_splits` - When set to `true`, will attempt to read and calculate split points for each BSON file in the input. When set to `false`, will create just *one* split for each input file, consisting of the entire length of the file. Defaults to `true`.
* `mapred.min.split.size` - Set a lower bound on acceptable size for file splits (in bytes). Defaults to 1.
* `mapred.max.split.size` - Set an upper bound on acceptable size for file splits (in bytes). Defaults to LONG_MAX_VALUE.
* `bson.split.write_splits` - Automatically save any split information calculated for input files, by writing to corresponding `.splits` files. Defaults to `true`.