Configuration
RecordManager configuration is divided into two categories: the general RecordManager settings and the data source settings. The default distribution contains sample configuration files in the conf directory. Copy datasources.ini.sample to datasources.ini and recordmanager.ini.sample to recordmanager.ini before editing them.
General settings are in recordmanager.ini.
This section contains general settings.
| Setting | Description |
|---|---|
| timezone | Local time zone used to convert date stamps to/from OAI-PMH providers. |
| abbreviations | Name of a file containing abbreviations. When removing trailing periods, any abbreviations are left intact. |
| full_title_prefixes | Name of a file containing title prefixes. If a title starts with a listed title prefix, it will not be shortened in title_keys (for deduplication). Add frequently found titles, such as "visual approach chart", to the list. |
| articles | Name of a file containing articles that should be removed from the beginning of a title for sorting. |
| dedup_handler | Name of the class and .php file containing the methods for handling record deduplication. Default is DedupHandler, which can be subclassed for modifications and the subclass specified here. |
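As an illustration, the general settings might look like the following in recordmanager.ini. This is only a sketch: the section name `[Site]`, the time zone value and all file names are assumptions made for the example, so check recordmanager.ini.sample for the actual names.

```ini
; Hypothetical values; the [Site] section name is an assumption
[Site]
timezone = "Europe/Helsinki"
; The three file names below are placeholders for your own list files
abbreviations = "abbreviations.lst"
full_title_prefixes = "full_title_prefixes.lst"
articles = "articles.lst"
; DedupHandler is the documented default; name a subclass here to customize
dedup_handler = "DedupHandler"
```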
This section contains settings controlling OAI-PMH harvesting.
| Setting | Description |
|---|---|
| max_tries | Number of attempts to fetch data from the OAI-PMH provider. Default is 5. RecordManager will try a harvesting request at most max_tries times if it fails for any reason. |
| retry_wait | Delay between request attempts in seconds. Default is 30. |
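A minimal sketch of these settings, using the documented defaults. The section name `[Harvesting]` is an assumption; with these values a failing request would be attempted at most 5 times with 30 seconds between attempts.

```ini
; The [Harvesting] section name is an assumption; values are the documented defaults
[Harvesting]
max_tries = 5
retry_wait = 30
```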
This section specifies how to connect to the Mongo database.
| Setting | Description |
|---|---|
| url | Mongo connection string in the format mongodb:///tmp/mongodb-27017.sock (preferred) or mongodb://username:password@server. In a typical default installation with Mongo residing on the same server, username and password are not needed, and mongodb:///tmp/mongodb-27017.sock can be used. Using Unix sockets provides a significant performance advantage over TCP/IP. |
| database | Mongo database to be used |
| counts | Whether to fetch counts from the Mongo database when processing records. Defaults to false because fetching counts can be slow in a large database, but setting this to true gives more feedback during operations. |
| compress_records | Whether to compress record metadata when it is stored in MongoDB. Compression/decompression increases CPU usage slightly but is offset by reduced disk space and I/O demand. Compression is enabled by default. Turn off if you use TokuMX instead of MongoDB (TokuMX has built-in compression). |
| connect_timeout | Connection timeout in milliseconds. Default is 300 000 ms. |
| cursor_timeout | Cursor timeout in milliseconds. Might be needed if a cursor doesn't live long enough for the whole operation to complete. Default is 300 000 ms. |
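For a local installation, the database settings could look roughly like this. The section name `[Mongo]` and the database name are assumptions for the sake of the example; the remaining values are the defaults described in the table.

```ini
; The [Mongo] section name and the database name "recman" are assumptions
[Mongo]
; Unix socket connection, preferred when Mongo runs on the same server
url = "mongodb:///tmp/mongodb-27017.sock"
database = "recman"
; Documented defaults below
counts = false
compress_records = true
; Timeouts are in milliseconds (300000 ms = 5 minutes)
connect_timeout = 300000
cursor_timeout = 300000
```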
This section contains settings used when running direct Solr updates from RecordManager. These settings are not needed if the updatesolr function is not used. Note that RecordManager uses the JSON update method, which requires a fairly recent Solr version and in some cases must be enabled separately. See http://wiki.apache.org/solr/UpdateJSON for more information.
| Setting | Description |
|---|---|
| update_url | The url used for the JSON update in Solr |
| max_commit_interval | Maximum number of record updates to send to Solr between commits. Note that Solr also has settings for automatic commit that may override this and cause more frequent commits. Committing changes means that the updated version of the search index is brought online, which requires some resources for warmup etc. Therefore it is recommended to keep the commit interval at a fairly high value. A commit is always done at the end of the Solr update process regardless of this setting, if there were changes and the --nocommit parameter was not used. |
| username | User name if basic http authentication is required to connect to the Solr index for update |
| password | Password if basic http authentication is required to connect to the Solr index for update |
| background_update | Number of background tasks to be used for making Solr http calls. Can improve indexing performance as batches of records can be created and sent to Solr in parallel. Disabled (0) by default. Requires the pcntl extension in PHP. |
| threaded_merged_record_update | Whether merged record update is run in parallel with individual record update. Default is false. Enabling this setting may speed up indexing as server resources are utilized by two processes instead of one (especially when Solr is running on a separate server). Note that this effectively doubles background_update value as long as the two processes run in parallel. Requires the pcntl extension in PHP. |
| max_update_tries | Maximum number of tries to send an update to Solr. Default is 15. Useful for keeping a RecordManager solrupdate task running when Solr is restarted. |
| update_retry_wait | Delay between Solr update request attempts in seconds. Default is 60. |
| merge_records | If true, a merged record is created for duplicate records. This merged record is indexed alongside normal records. The merged record is marked with field merged_boolean=true and the normal records belonging to it with merged_child_boolean=true. This allows the merged child records to be excluded from search results and the merged record to be replaced in the result list with the appropriate original record. This requires support in VuFind: it is included since VuFind 2.3, but for VuFind 1.x see sys/Solr.php for our customization. |
| merged_fields | A comma-separated list of multivalued fields to be added to the merged records. Default contains normal VuFind multivalued fields. There is one special case, "author=author2": if two records to be merged have different values in the author field, the other one is copied to author2 since author is a single-valued field. |
| single_fields | A comma-separated list of single-valued fields to be added to the merged records. Default contains normal VuFind single-valued fields apart from fullrecord. For single-valued fields only the first occurrence is taken. |
| suffixed_merged_fields | A comma-separated list of merged fields to which the data source id is appended. Default is empty. |
| ignore_in_comparison | A comma-separated list of fields that are ignored in comparesolr function (typically fields that are created with Solr's copyField command or where stored="false"). |
| format_in_allfields | Whether the format (e.g. "Book") should be added to allfields. Default is false. |
| unicode_normalization_form | Unicode normalization form to use. Valid values: NFC, NFD, NFKC and NFKD. See e.g. the Wikipedia entry for more information. |
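The following sketch shows how a subset of these settings might be combined. The section name `[Solr]`, the update URL and the commit interval are assumptions chosen for illustration; the retry values are the documented defaults.

```ini
; The [Solr] section name, URL and commit interval are hypothetical
[Solr]
update_url = "http://localhost:8080/solr/biblio/update/json"
; Keep this fairly high, since each commit triggers index warmup
max_commit_interval = 50000
; Two background tasks for Solr calls; requires the pcntl extension in PHP
background_update = 2
threaded_merged_record_update = false
; Documented defaults: up to 15 tries, 60 seconds apart
max_update_tries = 15
update_retry_wait = 60
merge_records = true
format_in_allfields = false
unicode_normalization_form = "NFKC"
```

With `max_update_tries = 15` and `update_retry_wait = 60`, an update task would keep retrying for roughly a quarter of an hour, which is what lets it survive a Solr restart.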
These settings are specific to the OAI-PMH provider. It is not a mandatory part of RecordManager, but with it RecordManager can be used as an OAI-PMH aggregator. See Setting up the OAI-PMH Provider for more information on setting up the OAI-PMH provider.
| Setting | Description |
|---|---|
| repository_name | Name of the repository displayed in the Identify response |
| base_url | Base url of the provider (e.g. http://x.y.z/oai-pmh with the default configuration) |
| admin_email | Email address displayed in the Identify response |
| result_limit | Limit of results per single response (additional results are requested with a resumptionToken) |
| format_definitions | File that contains the descriptions of the available metadata formats |
| set_definitions | File that contains the set definitions (for selective harvesting) |
| transformation_to_[format] | XSL transformation to be used for outputting records in the given [format] in OAI-PMH provider |
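A provider configuration might be sketched as follows. The section name `[OAI-PMH]`, the repository name, the email address, the result limit and every file name are assumptions for this example only.

```ini
; The [OAI-PMH] section name and all values are hypothetical
[OAI-PMH]
repository_name = "Example Repository"
base_url = "http://x.y.z/oai-pmh"
admin_email = "admin@example.org"
; Further results are requested with a resumptionToken
result_limit = 1000
format_definitions = "oai-pmh-formats.ini"
set_definitions = "oai-pmh-sets.ini"
; One transformation_to_[format] line per output format; the file name is a placeholder
transformation_to_oai_dc = "oai_dc.xsl"
```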
These settings provide mappings between formats and the record classes used to process them. By default the class used is FormatRecord where Format is the record format with first letter capitalized. The section contains a list of key=value pairs, where key is the format and value is the class name (e.g. marc=MyOwnMarcRecord). An example of creating a custom record class that can override or add functionality to the original one can be found in classes/NdlEadRecord.php.
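Using the example from the paragraph above, a mapping section could look like this sketch. The section name `[Record Classes]` is an assumption, and `MyOwnMarcRecord` stands for a custom class you would provide yourself.

```ini
; The [Record Classes] section name is an assumption
[Record Classes]
; Process records in marc format with a custom class instead of the default
marc = MyOwnMarcRecord
```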
| Setting | Description |
|---|---|
| log_file | File where RecordManager writes its log |
| log_level | The level of information written to the log file. It is recommended to keep this at least at level 2, and level 3 is also safe for production use, but level 4 might cause the log file size to increase rapidly. See table below for log levels. |
| error_email | An optional email address, or a comma-separated list of email addresses, where a message is sent if any fatal errors are encountered |
| Level | Description |
|---|---|
| 4 | Debug, the most verbose level |
| 3 | Info, some extra information in addition to errors and warnings |
| 2 | Warning, only errors and warning messages |
| 1 | Error, only errors are logged |
| 0 | Fatal, only fatal errors that prevent continuing the current function are logged |
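A logging setup for production use might be sketched like this. The section name `[Log]`, the log file path and the email addresses are assumptions.

```ini
; The [Log] section name, path and addresses are hypothetical
[Log]
log_file = "/var/log/recman.log"
; 3 = Info: errors, warnings and some extra information
log_level = 3
; Comma-separated list of recipients for fatal error messages
error_email = "admin@example.org,ops@example.org"
```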
Data source settings are further divided into two categories. The first category of settings is used for all data sources, and the second one is specific to OAI-PMH harvesting. All data source settings always belong to a section that identifies the data source. The section name is used as the "source" parameter in the command line programs.
| Setting | Description |
|---|---|
| idPrefix | By default the section name in datasources.ini is used as an identifier prefix for the institution. idPrefix can be used to override this e.g. in case multiple OAI-PMH sets need to be harvested from the same data source (which requires multiple uniquely named sections in datasources.ini). |
| institution | The institution code mapped to the data source. Used e.g. to fill an organization field in the Solr index. |
| recordXPath | An xpath expression used when loading records from a file to identify a single record (e.g. //record) |
| oaiIDXPath | An xpath expression used when loading records from a file to find record's OAI ID, if it's present in the file (typically when importing a file containing an OAI-PMH listRecords response). Relative to recordXPath (e.g. ../../header/identifier). |
| format | Record format in RecordManager (e.g. dc, ead, lido or marc) |
| preTransformation | Optional transformation to be applied to files to be imported (just the name of the xsl file in transformations directory, e.g. to strip namespaces) |
| recordSplitter | Optional XSL transformation or PHP class used to split records in import or OAI-PMH harvest (just the name of the xsl file in transformations directory). See classes/EadSplitter.php for an example implementation of a PHP-based splitter or transformations/EadSplit.xsl for an example of XSL transformation. Specify only the .xsl or .php file name without path. |
| normalization | Optional XSL Transformation to be applied to each record. Points to a properties file in transformations directory (enter only the file name, no path). The properties file further defines the actual XSL transformation and any PHP-based helper functions or classes used in the transformation. |
| solrTransformation | XSL Transformation to be used when converting a record for import to Solr. Must be specified if the record driver does not provide a usable toSolrArray method. Points to a properties file in transformations directory. |
| dedup | Whether this data source needs deduplication (true/false, defaults to false) |
| keepMissingHierarchyMembers | Whether members of a hierarchical record that are not present in the imported or harvested records are kept instead of being deleted (true/false, defaults to false). Normally it is assumed that an imported hierarchical record contains all the child records, and those not present anymore need to be deleted, but if a record hierarchy is imported in multiple parts, this setting can be enabled to keep the previously imported parts intact. The downside is that another way to handle any deletions (e.g. OAI-PMH harvest with the [[reharvest |
| componentParts | How component parts, if any, are handled in the data source during load to Solr. See the table below for possible values. |
| indexMergedParts | Whether to index merged component parts also separately with hidden_component_boolean field set to true. Defaults to true. |
| {field}_mapping[,regexp] | A mapping file in mappings directory to be used to map values of {field} when updating Solr index. Useful for e.g. mapping multiple location codes to one. See below for an explanation of mapping files. |
| institutionInBuilding | How institution is converted to building field. See below for possible values. |
| extraFields[] | An array of static fields to add to each record when sending them to Solr. Format is fieldname:value, e.g. extraFields[] = "building:mainLibrary" or extraFields[] = "sector_str_mv:library" |
| driverParams[] | An array of driver-specific parameters that control driver behavior. Format is fieldname:value, e.g. driverParams[] = "holdingsInBuilding:true". See below for available driver parameters. |
| enrichments[] | An array of enrichment classes to use for the records, e.g. enrichments[] = "MarcOnkiLightEnrichment" |
| Setting | Description |
|---|---|
| as_is | No special handling (default) |
| merge_all | Merge all component parts to their host records |
| merge_non_articles | Merge to host record unless article (including e-journal articles) |
| merge_non_earticles | Merge to host record unless e-journal article |
| Setting | Description |
|---|---|
| default | Use institution setting from datasources.ini |
| "none" | No mapping. Note that due to PHP ini file handling, the quotes are required. |
| driver | Use whatever the record driver provided in institution field |
| source | Use source id |
| institution/source | Use institution and source id separated with a slash |
| Setting | Description |
|---|---|
| splitTitles=true | Lido: Split titles at the end of the first sentence. Some heuristics are applied when searching for the end of the sentence. If a title is split, the full title is recorded in description field. |
| holdingsInBuilding | Marc: Include holdings locations (852b) in building field. |
There are further parameters specific to NDL record drivers, and they are documented below for completeness, but the NDL drivers generally include functionality not useful for others or not compatible with the standard VuFind index.
| Setting | Description |
|---|---|
| institutionInBuilding=true | NdlLido: Add institution information into building field. |
| collectionInBuilding=true | NdlLido: Add collection information into building field. |
| 003InLinkingID=true | NdlMarc: Whether links from component parts to the host records include 003 field. |
| projectIdIn960=true | NdlMarc: 960 field contains a project id |
| categoriesIn650=true | NdlMarc: Whether 650 field contains categories (typically MetaLib records) |
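To show how the general data source settings fit together, here is a sketch of one section in datasources.ini. The section name, institution code and mapping file name are hypothetical; the setting names and allowed values come from the tables above.

```ini
; Hypothetical data source; "sample_library" is also the value of the
; "source" parameter in the command line programs
[sample_library]
institution = "SampleLibrary"
format = "marc"
; Each element matched by this expression is loaded as one record
recordXPath = "//record"
dedup = true
; Merge component parts to their host records unless they are articles
componentParts = "merge_non_articles"
indexMergedParts = true
; Building becomes institution and source id separated with a slash
institutionInBuilding = "institution/source"
; {field}_mapping with field = building; the file lives in the mappings directory
building_mapping = "building.map"
extraFields[] = "sector_str_mv:library"
```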
Normal mapping files are simple .ini-style files where the original value is on the left side of an equals sign and the resulting value on the right side. Mappings are case-sensitive, and if multiple values in a multivalued field map to the same result, only one is kept. There is a simple example mapping file in the mappings directory.
There are a couple of special mapping strings that can be used to provide default values:
```ini
; A default value of xyz is used if none of the other strings match
##default = xyz
; A default for a single-valued field where no original value exists
##empty = xyz
; A default for a multivalued field where no original value exists
##emptyarray = xyz
```
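Putting ordinary entries and a default together, a mapping file for location codes might look like the following sketch. The codes and names are invented for illustration.

```ini
; Hypothetical building mapping: several location codes collapse to one value
MAIN1 = "Main Library"
MAIN2 = "Main Library"
SCI = "Science Library"
; Any value not listed above
##default = "Other"
```

Because duplicates are dropped after mapping, a multivalued field containing both MAIN1 and MAIN2 would end up with a single "Main Library" value.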
It is also possible to use mapping files with regular expressions by adding ,regexp after the mapping file name. With regexp files, the left-hand side is used as a regexp pattern and the right-hand side as the replacement for strings that match the pattern. The expressions are tested one by one, and the process ends when a match is found. Slashes must not be escaped in the pattern. In the replacement, $1 .. $9 refer to the corresponding capture groups of the pattern. An example:
```ini
; Remove a number from the beginning
\d+(.*) = "$1"
; Convert a string to hierarchical using the first character as the top hierarchy level (e.g. h12 becomes h/h12)
(.)(.*) = "$1/$1$2"
```
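To tie this back to the data source settings, a regexp mapping file could be referenced from datasources.ini roughly as in this sketch. The section name, field and file name are hypothetical, and the exact quoting of the value is an assumption based on the description above.

```ini
; Hypothetical data source section in datasources.ini
[sample_library]
; ",regexp" after the file name marks the file as a regular expression mapping
building_mapping = "building_regexp.map,regexp"
```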
| Setting | Description |
|---|---|
| url | OAI-PMH provider base URL |
| set | Identifier of a set to harvest (normally found in the setSpec tag of an OAI-PMH ListSets response). Omit this setting to harvest all records. |
| metadataPrefix | Format to harvest. The default is oai_dc. |
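| idSearch[] | Used together with idReplace[] to manipulate record IDs with regular expressions. |
| idReplace[] | Used together with idSearch[] to manipulate record IDs with regular expressions. |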
| dateGranularity | The granularity used by the server for representing dates. This may be "YYYY-MM-DDThh:mm:ssZ", "YYYY-MM-DD" or "auto" (to query the server for details). The default is "auto". |
| verbose | Can be set to true in order to log more detailed output while harvesting; this may be useful for troubleshooting purposes, but it defaults to false. |
| debugLog | Can be set to a file where all the OAI-PMH requests and responses are written. There is also a splitlog.php utility that can be used to split the responses from the debug log so that they can be reloaded with the import program. This is especially useful when testing record splitters. |
| oaipmhTransformation | An XSL transformation that is applied to OAI-PMH responses before they are processed (just the name of the xsl file in the transformations directory, e.g. to strip namespaces). |
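An OAI-PMH data source could be sketched as follows. The section name, URL, set name, metadata prefix and log path are hypothetical. The pattern syntax shown for idSearch[] (a pattern wrapped in slash delimiters, with the replacement in idReplace[]) is an assumption, since the table above does not spell it out.

```ini
; Hypothetical OAI-PMH data source in datasources.ini
[sample_oai]
institution = "SampleLibrary"
format = "marc"
url = "http://oai.example.org/provider"
; Omit "set" to harvest all records
set = "sample_set"
metadataPrefix = "marc21"
; Assumed syntax: strip a fixed prefix from the harvested record IDs
idSearch[] = "/^oai:example:/"
idReplace[] = ""
; Ask the server which date granularity it supports
dateGranularity = "auto"
verbose = false
; Requests and responses are written here for troubleshooting
debugLog = "/tmp/sample_oai_debug.log"
```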
Note that MetaLib IRD Harvest uses MetaLib X-Server. While easy to set up, it doesn't include categories in the records.
| Setting | Description |
|---|---|
| type | Only valid value is metalib. This tells RecordManager to harvest from MetaLib X-Server instead of OAI-PMH. |
| url | MetaLib X-Server address |
| xUser | User name for X-Server login |
| xPassword | Password for X-Server login |
| query | X-Server source_locate query used to identify records to be harvested (e.g. "WIN=INSTITUTE") |
See MetaLib documentation at EL Commons for more information on the X-Server call used and the syntax used in query (locate_command).
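A MetaLib IRD data source might be configured along these lines. The section name, server address and credentials are placeholders; the query value is the example from the table above. The general data source settings described earlier would be added to the same section.

```ini
; Hypothetical MetaLib X-Server data source
[sample_metalib]
type = "metalib"
url = "http://metalib.example.org/X"
xUser = "xserver_user"
xPassword = "changeme"
; source_locate query identifying the records to harvest
query = "WIN=INSTITUTE"
```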
MetaLib CKB harvest is actually "fetch export files and import them". MetaLib export files are fetched according to their time stamps and processed in RecordManager.
| Setting | Description |
|---|---|
| type | Only valid value is metalib_export. This tells RecordManager to harvest MetaLib export files via HTTP. |
| url | HTTP address of the export directory on the MetaLib server. Remember to include trailing slash. |
| filePrefix | File name prefix used to distinguish the files to be processed from any other export files |
| fileSuffix | File name suffix used to distinguish the files to be processed from any other export files |
The MetaLib export harvest requires that a MetaLib export be scheduled to run on the MetaLib server and the results exposed via Apache. See [Harvesting MetaLib CKB Export](Harvesting MetaLib CKB Export) for information on how to set up the MetaLib side.
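For the export-based harvest, the data source section could look like this sketch. The section name, address, prefix and suffix are all hypothetical.

```ini
; Hypothetical MetaLib export data source
[sample_metalib_export]
type = "metalib_export"
; Note the trailing slash
url = "http://metalib.example.org/export/"
; Only files named ckb_export_*.xml would be processed
filePrefix = "ckb_export_"
fileSuffix = ".xml"
```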
SFX KB harvest is actually "fetch export files and import them". SFX export files are fetched according to their time stamps and processed in RecordManager.
| Setting | Description |
|---|---|
| type | Only valid value is sfx. This tells RecordManager to harvest SFX exports via HTTP. |
| url | HTTP address of the export directory on the SFX server |
| filePrefix | File name prefix used to distinguish the files to be processed from any other export files |
The SFX harvest requires that an SFX export be scheduled to run on the SFX server and the results exposed via the proxy Apache on the SFX server. See [Harvesting SFX Objects](Harvesting SFX Objects) for information on how to set up the SFX side.
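An SFX data source could be sketched in the same way. The section name, address and prefix are hypothetical.

```ini
; Hypothetical SFX export data source
[sample_sfx]
type = "sfx"
url = "http://sfx.example.org/export/"
; Only files whose names start with this prefix would be processed
filePrefix = "sfx_export_"
```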