$ plastron import --help
usage: plastron import [-h] [-m MODEL] [-l LIMIT] [-% PERCENTAGE]
[--validate-only] [--make-template FILENAME]
[--convert-from {ndnp}] [--convert-option NAME VALUE]
[--access URI|CURIE] [--member-of URI]
[--binaries-location LOCATION] [--container PATH]
[--job-id JOB_ID] [--resume]
[--extract-text-from MIME_TYPES]
[--group-by {rootname,none}]
[--publish]
[import_file]
Import data to the repository
positional arguments:
import_file name of the file to import from
options:
-h, --help show this help message and exit
-m MODEL, --model MODEL
data model to use
-l LIMIT, --limit LIMIT
limit the number of rows to read from the import file
-% PERCENTAGE, --percent PERCENTAGE
select an evenly spaced subset of items to import;
the size of this set will be as close as possible
to the specified percentage of the total items
--validate-only only validate, do not do the actual import
--make-template FILENAME
create a CSV template for the given model
--convert-from {ndnp}
use a pre-processor to transform another data
format into an import job
--convert-option NAME VALUE, -o NAME VALUE
set a parameter to used by the --convert-from
pre-processor; repeatable
--access URI|CURIE URI or CURIE of the access class to apply to new items
--member-of URI URI of the object that new items are PCDM members of
--binaries-location LOCATION
where to find binaries; either a path to a directory,
a "zip:<path to zipfile>" URI, an SFTP URI in the
form "sftp://<user>@<host>/<path to dir>", or a URI
in the form "zip+sftp://<user>@<host>/<path to zipfile>"
--container PATH parent container for new items; defaults to the
RELPATH in the repo configuration file
--job-id JOB_ID unique identifier for this job; defaults to
"import-{timestamp}"
--resume resume a job that has been started; requires
--job-id {id} to be present
--extract-text-from MIME_TYPES, -x MIME_TYPES
extract text from binaries of the given MIME types,
and add as annotations
--group-by {rootname,none}
method for grouping related files into file groups;
"rootname" (default) groups files by shared base name,
"none" treats each file as a separate group
--publish automatically publish all items in this import
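The four --binaries-location forms listed in the help text above can be told apart by their URI scheme. The following is a minimal Python sketch of that distinction, not Plastron's actual implementation; the function name classify_binaries_location and the keys of the returned dictionary are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


def classify_binaries_location(location: str) -> dict:
    """Classify a --binaries-location value into one of the documented forms.

    Hypothetical helper for illustration; Plastron may parse this differently.
    """
    parts = urlsplit(location)
    if parts.scheme == 'zip':
        # "zip:<path to zipfile>"
        return {'kind': 'zip', 'path': parts.path}
    if parts.scheme in ('sftp', 'zip+sftp'):
        # "sftp://<user>@<host>/<path to dir>" or
        # "zip+sftp://<user>@<host>/<path to zipfile>"
        return {
            'kind': parts.scheme,
            'user': parts.username,
            'host': parts.hostname,
            'path': parts.path,
        }
    # Anything else is treated as a plain directory path
    return {'kind': 'directory', 'path': location}
```

For the SFTP forms, the private key is configured separately through the SSH_PRIVATE_KEY key described below.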
The following keys are used in the COMMANDS/IMPORT section of the config file:
| Name | Purpose |
|---|---|
| JOBS_DIR | Base directory for storing job information. Defaults to "jobs" in the working directory |
| SSH_PRIVATE_KEY | Path to the private key to use when retrieving binaries over SFTP |
Every run of the import command takes place in the context of a job. For each job, Plastron stores the configuration specified when running the import, a copy of the source CSV file, a log of successfully imported items, and, for each run, logs of the items that were dropped during that run.
After starting a job, you can use its job ID to resume it at a later time. When resuming a job, Plastron checks the completed log for the job and skips any items recorded there.
The completed.log.csv has the following columns:
| Name | Purpose |
|---|---|
| id | Unique identifier for this item (within the context of this job) |
| timestamp | Date and time when this item was successfully imported |
| title | Title of the item |
| uri | URI of the item in the target repository |
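The skip-on-resume behavior can be pictured as a set lookup against the id column of the completed log. This is a minimal sketch under assumptions, not Plastron's code: the function names are hypothetical, and it assumes each source row carries an "id" value that matches the id column of completed.log.csv.

```python
import csv
import io


def completed_ids(completed_log) -> set:
    """Collect the id column from an open completed.log.csv file."""
    return {row['id'] for row in csv.DictReader(completed_log)}


def remaining_rows(source_rows, done: set) -> list:
    """Keep only the rows whose id is not yet in the completed log.

    Assumption: each row is a dict with an 'id' key (hypothetical).
    """
    return [row for row in source_rows if row['id'] not in done]


# Made-up sample data, for illustration only
sample_log = io.StringIO(
    'id,timestamp,title,uri\n'
    'item-1,2024-01-01T12:00:00,First item,http://localhost:8080/rest/objects/1\n'
)
sample_rows = [{'id': 'item-1'}, {'id': 'item-2'}]
todo = remaining_rows(sample_rows, completed_ids(sample_log))
```

Using a set keeps the membership check cheap even for jobs with many completed items.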
You may specify a job ID on the command line using the --job-id argument. If
you do not provide one, Plastron will generate one using the current timestamp.
Items that cannot be imported during a run are categorized as either "invalid" or "failed".
Invalid items are items that fail metadata validation, and are recorded in the "dropped-invalid" log for that run, along with the reason for the failure.
Invalid items will likely require changes to the source CSV file, or some other action on the part of the user (such as adding missing files).
Failed items are items that could not be imported due to problems adding records to the repository, and are recorded in the "dropped-failed" log for that run, along with the reason for the failure.
Some failures may occur due to transient network issues. In those cases, resuming the import should allow those items to be added.
Both the "dropped-invalid" and "dropped-failed" item logs have the following columns:
| Name | Purpose |
|---|---|
| id | Unique identifier for this item (within the context of this job) |
| timestamp | Date and time when this item failed to import |
| title | Title of the item |
| uri | URI of the item in the target repository; this may be empty if the item is new |
| reason | Short description of the error leading to failure to import |
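Because both dropped-item logs share these columns, a quick tally of the reason column can show whether a run was dominated by one kind of problem. The sketch below is illustrative only: summarize_dropped is a hypothetical helper, and the sample rows and reason strings are made up rather than taken from real Plastron output.

```python
import csv
import io
from collections import Counter


def summarize_dropped(log_file) -> Counter:
    """Count dropped items per reason in an open dropped-item log."""
    return Counter(row['reason'] for row in csv.DictReader(log_file))


# Made-up sample log, for illustration only
sample_dropped = io.StringIO(
    'id,timestamp,title,uri,reason\n'
    'item-3,2024-01-01T12:00:05,Third item,,example reason A\n'
    'item-7,2024-01-01T12:00:09,Seventh item,,example reason A\n'
    'item-9,2024-01-01T12:00:11,Ninth item,,example reason B\n'
)
summary = summarize_dropped(sample_dropped)
for reason, count in summary.most_common():
    print(f'{count:3d}  {reason}')
```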
Start a new job:
plastron -c repo.yml import \
--model Item \
--binaries-location /path/to/binaries \
--member-of http://localhost:8080/rest/collections/foo \
--container /objects \
--job-id import-foo-1 \
metadata.csv
Plastron will create the following structure in the JOBS_DIR:
{JOBS_DIR}
+- import-foo-1 # job ID
+- completed.log.csv # completed item log
+- config.yml # command-line options
+- source.csv # copy of metadata.csv
Resume that job later:
plastron -c repo.yml import \
--job-id import-foo-1 \
--resume
Any dropped items from a particular run will be recorded in:
{JOBS_DIR}/import-foo-1/dropped-failed-{run_timestamp}.csv
{JOBS_DIR}/import-foo-1/dropped-invalid-{run_timestamp}.csv
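Since each run writes its own pair of dropped-item logs, a job that has been resumed several times can accumulate many such files. One way to collect them, sketched here with a hypothetical dropped_logs helper, is to glob on the documented file name prefixes. The exact format of {run_timestamp} is not assumed; the sketch only relies on the "dropped-failed-" and "dropped-invalid-" prefixes and the .csv suffix.

```python
from pathlib import Path


def dropped_logs(jobs_dir, job_id) -> dict:
    """Find the dropped-item logs for every run of a job.

    Returns a dict with 'failed' and 'invalid' keys, each holding a
    list of paths sorted by file name.
    """
    job_dir = Path(jobs_dir) / job_id
    return {
        kind: sorted(job_dir.glob(f'dropped-{kind}-*.csv'))
        for kind in ('failed', 'invalid')
    }
```

For example, dropped_logs('jobs', 'import-foo-1') would look inside jobs/import-foo-1 and return empty lists if no items have been dropped.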
You may use the -% or --percent option to import only a subset of the items
in the import metadata CSV. Repeated use of this option with the same job will
select new subsets of items that have not yet been imported.
For example, start a job that has 50 items total, but only load 10% at first:
plastron -c repo.yml import \
--model Item \
--binaries-location /path/to/binaries \
--member-of http://localhost:8080/rest/collections/foo \
--container /objects \
--job-id percentile-job \
--percent 10
Plastron will only import 5 items (10% of 50), as evenly spaced within the set of uncompleted items as possible.
If you resume the job with the --percent 10 option again:
plastron -c repo.yml import \
--job-id percentile-job \
--resume \
--percent 10
Plastron will import 5 more items, selected from the 45 items that were not imported during the first run of the job.
If you specify a percentage that would generate a subset larger than the number of remaining items, Plastron will import all the remaining items.
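The selection behavior described above can be sketched as follows. This is one possible reading of the description, not Plastron's actual algorithm: the subset size is taken as the given percentage of the total item count, capped at the number of remaining items, and the chosen items are spread at a constant stride across the remaining ones. The function name and the rounding choice are assumptions.

```python
def evenly_spaced_subset(remaining, total, percentage):
    """Pick an evenly spaced subset of the remaining items.

    remaining:  items not yet imported, in source order
    total:      number of items in the whole import file
    percentage: value given to --percent
    """
    # Target size is a percentage of the total, capped at what is left
    count = min(len(remaining), round(total * percentage / 100))
    if count <= 0:
        return []
    # Constant (possibly fractional) stride across the remaining items
    step = len(remaining) / count
    return [remaining[int(i * step)] for i in range(count)]


# First run: 50 items, 10 percent
items = list(range(50))
first = evenly_spaced_subset(items, total=50, percentage=10)

# Second run: the 45 items left over, again 10 percent of the total
left = [i for i in items if i not in first]
second = evenly_spaced_subset(left, total=50, percentage=10)
```

Under these assumptions both runs select 5 items, and a request larger than the remaining set simply returns every remaining item.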
To import from data sources other than its standard CSV
spreadsheet format, the import command can use a
pre-processor to convert data from another format into a standard import
CSV spreadsheet.
Currently available pre-processors:
- ndnp
To use a pre-processor, provide the --convert-from option to the import
command, along with the name of the pre-processor to use.
Some pre-processors take initialization parameters. These are provided
using the --convert-option or -o switches, followed by the parameter
name and value as two separate strings. For example, -o batch_file small.xml
sets the batch_file parameter to 'small.xml'.
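A repeatable two-value switch like this maps naturally onto argparse's nargs=2 combined with action='append'. The sketch below is a stand-alone illustration of that pattern and is not taken from Plastron's source; parse_convert_options and the 'import-sketch' program name are hypothetical.

```python
import argparse


def parse_convert_options(argv) -> dict:
    """Collect repeated "-o NAME VALUE" pairs into a dictionary."""
    parser = argparse.ArgumentParser(prog='import-sketch')
    parser.add_argument('--convert-from', choices=['ndnp'])
    parser.add_argument(
        '--convert-option', '-o',
        nargs=2,                     # each use consumes NAME and VALUE
        action='append',             # repeated uses accumulate in a list
        metavar=('NAME', 'VALUE'),
        default=[],
    )
    args = parser.parse_args(argv)
    # Each entry is a [NAME, VALUE] pair, which dict() accepts directly
    return dict(args.convert_option)


options = parse_convert_options([
    '--convert-from', 'ndnp',
    '-o', 'dir', 'student_newspapers/data',
    '-o', 'batch_file', 'small.xml',
])
```

With this pattern, a later -o with the same NAME overrides an earlier one when the pairs are turned into a dictionary.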
Jobs initiated using a pre-processor do not need special handling when resuming. At that point, it is just a standard job with a standard import CSV spreadsheet.
- Name: ndnp
- Options:
  - dir: Base directory of the NDNP file tree
  - batch_file: Name of the XML file that describes this NDNP package; relative to the dir parameter
Example:
plastron --config plastron.yml import \
--convert-from ndnp \
-o dir student_newspapers/data \
-o batch_file small.xml \
--model Issue \
--member-of http://fcrepo-local:8080/fcrepo/rest/dc/2016/1 \
--container /dc/2016/1