Ere Maijala edited this page Nov 14, 2016 · 62 revisions

Configuration

RecordManager configuration can be divided into two categories: general RecordManager settings and data source settings. The default distribution contains sample configuration files in the conf directory. Copy datasources.ini.sample to datasources.ini and recordmanager.ini.sample to recordmanager.ini before use.

General Settings

General settings are in recordmanager.ini.

Site

This section contains general settings.

Setting Description
timezone Local time zone used to convert date stamps to/from OAI-PMH providers.
abbreviations Name of a file containing abbreviations. When removing trailing periods, any abbreviations are left intact.
full_title_prefixes Name of a file containing title prefixes. If a title starts with a listed title prefix, it will not be shortened in title_keys (for deduplication). Add frequently found titles, such as "visual approach chart", to the list.
articles Name of a file containing articles that should be removed from the beginning of a title for sorting.
dedup_handler Name of the class and .php file containing the methods for handling record deduplication. Default is DedupHandler, which can be subclassed for modifications and the subclass specified here.
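
As a rough illustration, the Site settings might look like this in recordmanager.ini. The [Site] section name is assumed from the heading above, and the time zone and file names are placeholders, not shipped defaults. Only dedup_handler uses the documented default.

```ini
; recordmanager.ini (sketch, hypothetical values)
[Site]
; Local time zone for OAI-PMH date stamp conversion
timezone = "Europe/Helsinki"
; File names below are placeholders
abbreviations = "abbreviations.lst"
full_title_prefixes = "full_title_prefixes.lst"
articles = "articles.lst"
; Documented default; replace with a subclass name to customize
dedup_handler = DedupHandler
```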

Harvesting

This section contains settings controlling OAI-PMH harvesting.

Setting Description
max_tries Number of attempts to fetch data from the OAI-PMH provider. Default is 5. RecordManager will try a harvesting request at most max_tries times if it fails for any reason.
retry_wait Delay between request attempts in seconds. Default is 30.
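
A minimal sketch of this section with the documented defaults written out explicitly (the [Harvesting] section name is assumed from the heading):

```ini
; recordmanager.ini (sketch)
[Harvesting]
; Try each harvesting request at most 5 times
max_tries = 5
; Wait 30 seconds between attempts
retry_wait = 30
```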

Mongo

This section specifies how to connect to the Mongo database.

Setting Description
url Mongo connection string in format mongodb:///tmp/mongodb-27017.sock (preferred) or mongodb://username:password@server. In a typical default installation with Mongo residing on the same server, username and password are not needed, and mongodb:///tmp/mongodb-27017.sock can be used. Using Unix sockets provides a significant performance advantage over TCP/IP.
database Mongo database to be used
counts Whether to fetch counts from the Mongo database when processing records. Defaults to false because fetching counts can be slow in a large database, but setting this to true gives more feedback during operations.
compress_records Whether to compress record metadata when it is stored in MongoDB. Compression/decompression increases CPU usage slightly but is offset by reduced disk space and I/O demand. Compression is enabled by default. Turn off if you use TokuMX instead of MongoDB (TokuMX has built-in compression).
connect_timeout Connection timeout in milliseconds. Default is 300 000 ms.
cursor_timeout Cursor timeout in milliseconds. Might be needed if a cursor doesn't live long enough for the whole operation to complete. Default is 300 000 ms.
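
For a local installation using a Unix socket, the section could look roughly like the following. The [Mongo] section name is assumed from the heading and the database name is a placeholder. The remaining values are the defaults described above.

```ini
; recordmanager.ini (sketch)
[Mongo]
; Unix socket connection (preferred on a single-server setup)
url = "mongodb:///tmp/mongodb-27017.sock"
; Placeholder database name
database = "recman"
; Skip record counts for speed in a large database
counts = false
; Keep enabled unless the storage engine compresses by itself (e.g. TokuMX)
compress_records = true
; Timeouts in milliseconds
connect_timeout = 300000
cursor_timeout = 300000
```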

Solr

This section contains settings used when running direct Solr updates from RecordManager. These settings are not needed if the updatesolr function is not used. Note that RecordManager uses the JSON update method, which requires a fairly recent Solr version and in some cases needs to be enabled separately. See http://wiki.apache.org/solr/UpdateJSON for more information.

Setting Description
update_url The url used for the JSON update in Solr
max_commit_interval Maximum number of record updates to send to Solr between commits. Note that Solr also has settings for automatic commit that may override this and cause more frequent commits. Committing changes means that the updated version of the search index is brought online, which requires some resources for warmup etc. Therefore it is recommended to keep the commit interval at a fairly high value. A commit is always done at the end of the Solr update process regardless of this setting, if there were changes and the --nocommit parameter was not used.
username User name if basic http authentication is required to connect to the Solr index for update
password Password if basic http authentication is required to connect to the Solr index for update
background_update Number of background tasks to be used for making Solr http calls. Can improve indexing performance as batches of records can be created and sent to Solr in parallel. Disabled (0) by default. Requires the pcntl extension in PHP.
threaded_merged_record_update Whether merged record update is run in parallel with individual record update. Default is false. Enabling this setting may speed up indexing as server resources are utilized by two processes instead of one (especially when Solr is running on a separate server). Note that this effectively doubles background_update value as long as the two processes run in parallel. Requires the pcntl extension in PHP.
max_update_tries Maximum number of tries to send an update to Solr. Default is 15. Useful for keeping a RecordManager solrupdate task running when Solr is restarted.
update_retry_wait Delay between Solr update request attempts in seconds. Default is 60.
merge_records If true, a merged record is created for duplicate records. This merged record is indexed alongside normal records. The merged record is marked with field merged_boolean=true and the normal records belonging to it with merged_child_boolean=true. This allows the merged child records to be excluded from search results and the merged record in a result list to be replaced with the appropriate original record. This requires support in VuFind: it is included since VuFind 2.3, but for VuFind 1.x see sys/Solr.php for our customization.
merged_fields A comma-separated list of multivalued fields to be added to the merged records. Default contains normal VuFind multivalued fields. There is one special case, "author=author2": if two records to be merged have different values in the author field, the other one is copied to author2 since author is a single-valued field.
single_fields A comma-separated list of single-valued fields to be added to the merged records. Default contains normal VuFind single-valued fields apart from fullrecord. For single-valued fields only the first occurrence is taken.
suffixed_merged_fields A comma-separated list of merged fields to which the data source id is appended. Default is empty.
ignore_in_comparison A comma-separated list of fields that are ignored in comparesolr function (typically fields that are created with Solr's copyField command or where stored="false").
format_in_allfields Whether the format (e.g. "Book") should be added to allfields. Default is false.
unicode_normalization_form Unicode normalization form to use. Valid values: NFC, NFD, NFKC and NFKD. See e.g. the Wikipedia entry for more information.
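
The following sketch pulls a few of these settings together. The [Solr] section name is assumed from the heading. The update URL, commit interval and normalization form are illustrative values only. The retry settings, background_update, threaded_merged_record_update and format_in_allfields use the defaults described above.

```ini
; recordmanager.ini (sketch, hypothetical values)
[Solr]
; Hypothetical Solr JSON update endpoint
update_url = "http://localhost:8080/solr/biblio/update/json"
; Illustrative value: commit after this many record updates at most
max_commit_interval = 50000
; Documented defaults
max_update_tries = 15
update_retry_wait = 60
background_update = 0
threaded_merged_record_update = false
; Create merged records for duplicates
merge_records = true
format_in_allfields = false
; One of NFC, NFD, NFKC, NFKD
unicode_normalization_form = NFKC
```

If Solr requires basic HTTP authentication, username and password would be added to the same section.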

OAI-PMH

These settings are specific to the OAI-PMH provider. It is not a mandatory part of RecordManager, but with it RecordManager can be used as an OAI-PMH aggregator. See Setting up the OAI-PMH Provider for more information on setting up the OAI-PMH provider.

Setting Description
repository_name Name of the repository displayed in the Identify response
base_url Base url of the provider (e.g. http://x.y.z/oai-pmh with the default configuration)
admin_email Email address displayed in the Identify response
result_limit Limit of results per single response (additional results are requested with a resumptionToken)
format_definitions File that contains the descriptions of the available metadata formats
set_definitions File that contains the set definitions (for selective harvesting)
transformation_to_[format] XSL transformation to be used for outputting records in the given [format] in OAI-PMH provider
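
A possible shape for this section is sketched below. The [OAI-PMH] section name is assumed from the heading, and every value except the base_url example is a placeholder invented for illustration.

```ini
; recordmanager.ini (sketch, hypothetical values)
[OAI-PMH]
repository_name = "Sample Repository"
; Example address from the table above
base_url = "http://x.y.z/oai-pmh"
admin_email = "admin@example.org"
; Illustrative page size; further results are fetched with a resumptionToken
result_limit = 1000
; Placeholder file names
format_definitions = "oai-pmh-formats.ini"
set_definitions = "oai-pmh-sets.ini"
```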

Record Classes

These settings provide mappings between formats and the record classes used to process them. By default the class used is FormatRecord where Format is the record format with first letter capitalized. The section contains a list of key=value pairs, where key is the format and value is the class name (e.g. marc=MyOwnMarcRecord). An example of creating a custom record class that can override or add functionality to the original one can be found in classes/NdlEadRecord.php.
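
For instance, overriding the class used for MARC records could be sketched as follows. The section name is assumed from the heading, and MyOwnMarcRecord is the illustrative class name used in the text above, not a class shipped with RecordManager.

```ini
; recordmanager.ini (sketch)
[Record Classes]
; format = class name; formats not listed use the default FormatRecord class
marc = MyOwnMarcRecord
```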

Log

Setting Description
log_file File where RecordManager writes its log
log_level The level of information written to the log file. It is recommended to keep this at least at level 2, and level 3 is also safe for production use, but level 4 might cause the log file size to increase rapidly. See table below for log levels.
error_email An optional email address, or a comma-separated list of email addresses, where a message is sent if any fatal errors are encountered

Log Levels

Level Description
4 Debug, the most verbose level
3 Info, some extra information in addition to errors and warnings
2 Warning, only errors and warning messages
1 Error, only errors are logged
0 Fatal, only fatal errors that prevent continuing the current function are logged
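
A logging setup following the recommendation above might look like this. The [Log] section name is assumed from the heading, and the file path and email addresses are placeholders.

```ini
; recordmanager.ini (sketch, hypothetical values)
[Log]
; Placeholder path
log_file = "/var/log/recordmanager/recordmanager.log"
; 3 = Info: errors, warnings and some extra information
log_level = 3
; Optional, comma-separated list of recipients for fatal errors
error_email = "admin@example.org,ops@example.org"
```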

Data Source Settings

Data source settings are further divided into two categories. The first category of settings is used for all data sources, and the second one is specific to OAI-PMH harvesting. All data source settings always belong to a section that identifies the data source. The section name is used as the "source" parameter in the command line programs.

Common Settings

Setting Description
idPrefix By default the section name in datasources.ini is used as an identifier prefix for the institution. idPrefix can be used to override this e.g. in case multiple OAI-PMH sets need to be harvested from the same data source (which requires multiple uniquely named sections in datasources.ini).
institution The institution code mapped to the data source. Used e.g. to fill an organization field in the Solr index.
recordXPath An xpath expression used when loading records from a file to identify a single record (e.g. //record)
oaiIDXPath An xpath expression used when loading records from a file to find record's OAI ID, if it's present in the file (typically when importing a file containing an OAI-PMH listRecords response). Relative to recordXPath (e.g. ../../header/identifier).
format Record format in RecordManager (e.g. dc, ead, lido or marc)
preTransformation Optional transformation to be applied to files to be imported (just the name of the xsl file in transformations directory, e.g. to strip namespaces)
recordSplitter Optional XSL transformation or PHP class used to split records in import or OAI-PMH harvest (just the name of the xsl file in transformations directory). See classes/EadSplitter.php for an example implementation of a PHP-based splitter or transformations/EadSplit.xsl for an example of XSL transformation. Specify only the .xsl or .php file name without path.
normalization Optional XSL Transformation to be applied to each record. Points to a properties file in transformations directory (enter only the file name, no path). The properties file further defines the actual XSL transformation and any PHP-based helper functions or classes used in the transformation.
solrTransformation XSL Transformation to be used when converting a record for import to Solr. Must be specified if the record driver does not provide a usable toSolrArray method. Points to a properties file in transformations directory.
dedup Whether this data source needs deduplication (true/false, defaults to false)
keepMissingHierarchyMembers Whether members of a hierarchical record that are not present in an imported or harvested record are kept and not deleted (true/false, defaults to false). Normally it is assumed that an imported hierarchical record contains all the child records, and those not present anymore need to be deleted, but if a record hierarchy is imported in multiple parts, this setting can be enabled to keep the previously imported parts intact. The downside is that another way to handle any deletions (e.g. OAI-PMH harvest with the [[reharvest
componentParts How component parts, if any, are handled in the data source during load to Solr. See the table below for possible values.
indexMergedParts Whether to index merged component parts also separately with hidden_component_boolean field set to true. Defaults to true.
{field}_mapping[,regexp] A mapping file in mappings directory to be used to map values of {field} when updating Solr index. Useful for e.g. mapping multiple location codes to one. See below for an explanation of mapping files.
institutionInBuilding How institution is converted to building field. See below for possible values.
extraFields[] An array of static fields to add to each record when sending them to solr. Format is fieldname:value, e.g. extraFields[] = "building:mainLibrary" or extraFields[] = "sector_str_mv:library"
driverParams[] An array of driver-specific parameters that control driver behavior. Format is fieldname:value, e.g. driverParams[] = "holdingsInBuilding:true". See below for available driver parameters.
enrichments[] An array of enrichment classes to use for the records, e.g. enrichments[] = "MarcOnkiLightEnrichment"
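
To show how these settings fit together, here is a sketch of a single data source section in datasources.ini. The section name and institution code are hypothetical. The remaining values are taken from the examples and option lists on this page.

```ini
; datasources.ini (sketch, hypothetical data source)
[sample_source]
; Hypothetical institution code
institution = "SampleLibrary"
format = marc
; Used when loading records from a file
recordXPath = "//record"
; Enable deduplication for this source
dedup = true
; Merge component parts to host records unless they are articles
componentParts = merge_non_articles
; Quotes are required for "none" due to PHP ini file handling
institutionInBuilding = "none"
; Static fields added to every record sent to Solr
extraFields[] = "sector_str_mv:library"
; Driver-specific parameters
driverParams[] = "holdingsInBuilding:true"
; Enrichment classes
enrichments[] = "MarcOnkiLightEnrichment"
```

With this section in place, sample_source would be the value passed as the "source" parameter to the command line programs.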

Possible Settings for componentParts

Setting Description
as_is No special handling (default)
merge_all Merge all component parts to their host records
merge_non_articles Merge to host record unless article (including e-journal articles)
merge_non_earticles Merge to host record unless e-journal article

Possible Settings for institutionInBuilding

Setting Description
default Use institution setting from datasources.ini
"none" No mapping. Note that due to PHP ini file handling, the quotes are required.
driver Use whatever the record driver provided in institution field
source Use source id
institution/source Use institution and source id separated with a slash

Possible Settings for driverParams

Setting Description
splitTitles=true Lido: Split titles at the end of the first sentence. Some heuristics are applied when searching for the end of the sentence. If a title is split, the full title is recorded in description field.
holdingsInBuilding Marc: Include holdings locations (852b) in building field.

There are further parameters specific to NDL record drivers, and they are documented below for completeness, but the NDL drivers generally include functionality not useful for others or not compatible with the standard VuFind index.

Setting Description
institutionInBuilding=true NdlLido: Add institution information into building field.
collectionInBuilding=true NdlLido: Add collection information into building field.
003InLinkingID=true NdlMarc: Whether links from component parts to the host records include 003 field.
projectIdIn960=true NdlMarc: 960 field contains a project id
categoriesIn650=true NdlMarc: Whether 650 field contains categories (typically MetaLib records)

Mapping Files

Normal mapping files are simple .ini-style files where the original value is on the left side of an equals sign and the resulting value on the right side. Mappings are case-sensitive, and if multiple values in a multivalued field map to the same result, only one is kept. There is a simple example mapping file in the mappings directory.

There are a couple of special mapping strings that can be used to provide default values:

; A default value of xyz is used if none of the other strings match
##default = xyz
; A default for a singlevalued field where no original value exists
##empty = xyz
; A default for a multivalued field where no original value exists
##emptyarray = xyz
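
As an illustration of mapping several location codes to one value with a fallback, a mapping file could look like the sketch below. The codes and labels are invented for the example.

```ini
; mappings/sample_building.map (sketch, hypothetical codes and labels)
; Two location codes collapse into a single result value
MAIN1 = "Main Library"
MAIN2 = "Main Library"
BRANCH = "Branch Library"
; Anything else falls back to this value
##default = "Other"
```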

It is also possible to use mapping files with regular expressions by adding ,regexp after the mapping file name. With regexp files, the left-hand side is used as a regexp pattern and the right-hand side as the replacement for strings that match the pattern. The expressions are tested one by one and the process ends when a match is found. Slashes must not be escaped in the pattern. In the replacement, $1 .. $9 can be used to refer to the captured groups of the pattern. An example:

; Remove a number from the beginning
\d+(.*) = "$1"

; Convert a string to hierarchical using the first character as the hierarchy separator (e.g. h12 becomes h/h12)
(.)(.*) = "$1/$1$2"
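
Referencing mapping files from a data source might then look like the following sketch. The section name and file names are hypothetical, building and format are picked only as example fields, and quoting the value that contains a comma is a precaution, not a documented requirement.

```ini
; datasources.ini (sketch, hypothetical data source and file names)
[sample_source]
; Plain mapping file applied to the building field
building_mapping = "sample_building.map"
; Regexp mapping file applied to the format field: note the ,regexp suffix
format_mapping = "sample_format_regexp.map,regexp"
```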

OAI-PMH Harvesting Specific Settings

Setting Description
url OAI-PMH provider base URL
set Identifier of a set to harvest (normally found in the setSpec tag of an OAI-PMH ListSets response). Omit this setting to harvest all records.
metadataPrefix Format to harvest. The default is oai_dc.
idSearch[], idReplace[] Can be used to manipulate record IDs with regular expressions.
dateGranularity The granularity used by the server for representing dates. This may be "YYYY-MM-DDThh:mm:ssZ", "YYYY-MM-DD" or "auto" (to query the server for details). The default is "auto".
verbose Can be set to true in order to log more detailed output while harvesting; this may be useful for troubleshooting purposes, but it defaults to false.
debugLog Can be set to a file where all the OAI-PMH requests and responses are written. There is also a splitlog.php utility that can be used to split the responses from the debug log so that they can be reloaded with the import program. This is especially useful when testing record splitters.
oaipmhTransformation An XSL transformation that is applied to OAI-PMH responses before they are processed (just the name of the xsl file in the transformations directory, e.g. to strip namespaces).
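
A harvested data source could be sketched as below. The section name, URL, set name and identifier prefix are hypothetical. The idSearch[] pattern assumes PHP-style delimited regular expressions, which this page does not state explicitly, so verify the expected syntax before relying on it.

```ini
; datasources.ini (sketch, hypothetical OAI-PMH source)
[sample_oai]
institution = "SampleLibrary"
format = dc
; Hypothetical provider base URL
url = "http://x.y.z/oai"
; Omit to harvest all records
set = "sample_set"
; Documented default
metadataPrefix = oai_dc
; Assumed syntax: strip a hypothetical prefix from record IDs
idSearch[] = "/^oai:sample:/"
idReplace[] = ""
; Let RecordManager query the server for its date granularity
dateGranularity = auto
verbose = false
```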

MetaLib IRD Harvest Specific Settings

Note that MetaLib IRD Harvest uses MetaLib X-Server. While easy to set up, it doesn't include categories in the records.

Setting Description
type Only valid value is metalib. This tells RecordManager to harvest from MetaLib X-Server instead of OAI-PMH.
url MetaLib X-Server address
xUser User name for X-Server login
xPassword Password for X-Server login
query X-Server source_locate query used to identify records to be harvested (e.g. "WIN=INSTITUTE")
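
A sketch of such a data source, with a hypothetical section name, server address and credentials (the query value is the example from the table above):

```ini
; datasources.ini (sketch, hypothetical MetaLib IRD source)
[sample_metalib]
type = metalib
; Hypothetical X-Server address
url = "http://metalib.example.org/X"
xUser = "xserver_user"
xPassword = "xserver_password"
query = "WIN=INSTITUTE"
```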

See MetaLib documentation at EL Commons for more information on the X-Server call used and the syntax used in query (locate_command).

MetaLib CKB Harvest Specific Settings

MetaLib CKB harvest is actually "fetch export files and import them". MetaLib export files are fetched according to their time stamps and processed in RecordManager.

Setting Description
type Only valid value is metalib_export. This tells RecordManager to harvest MetaLib export files via HTTP.
url HTTP address of the export directory on the MetaLib server. Remember to include trailing slash.
filePrefix File name prefix used to distinguish the files to be processed from any other export files
fileSuffix File name suffix used to distinguish the files to be processed from any other export files
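
For example, assuming a hypothetical export directory and file naming scheme:

```ini
; datasources.ini (sketch, hypothetical MetaLib CKB source)
[sample_metalib_ckb]
type = metalib_export
; Note the trailing slash
url = "http://metalib.example.org/export/"
; Placeholder prefix and suffix identifying the relevant export files
filePrefix = "ckb_export_"
fileSuffix = ".xml"
```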

The MetaLib export harvest requires that a MetaLib export be scheduled to run on the MetaLib server and the results exposed via Apache. See Harvesting MetaLib CKB Export for information on how to set up the MetaLib side.

SFX KB Harvest Specific Settings

SFX KB harvest is actually "fetch export files and import them". SFX export files are fetched according to their time stamps and processed in RecordManager.

Setting Description
type Only valid value is sfx. This tells RecordManager to harvest SFX exports via HTTP.
url HTTP address of the export directory on the SFX server
filePrefix File name prefix used to distinguish the files to be processed from any other export files
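
A corresponding sketch for SFX, again with a hypothetical section name, address and prefix:

```ini
; datasources.ini (sketch, hypothetical SFX source)
[sample_sfx]
type = sfx
; Hypothetical export directory on the SFX server
url = "http://sfx.example.org/export/"
; Placeholder prefix identifying the relevant export files
filePrefix = "sfx_export_"
```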

The SFX harvest requires that an SFX export be scheduled to run on the SFX server and the results exposed via the proxy Apache on the SFX server. See Harvesting SFX Objects for information on how to set up the SFX side.
