solr-loader3

A Clojure tool that creates Solr indexes from relational data in databases. It can be used as an alternative to the Data Import Handler (DIH) provided in the Solr distribution. Its main advantages over DIH are performance and simple configuration.

Usage

From the command line, run:

java -jar solr-loader3-standalone.jar --upload <config-file>

Configuration

The configuration is specified in an EDN file, Clojure's native data format. The file defines the database connection parameters, the entities to index, and the SQL used to extract each entity. See the sample configuration file, sample-solr-config.edn, for details.

Requirements

Java 7 is required for this tool
Solr 4.x is required
Oracle 10 or above (support for other databases coming soon)

Design

The tool is designed to load large data sets efficiently. It does so by using a combination of multiple threads and asynchronous processing.

Design Principles

Never read the full data set from the database at once. Read only the minimum data required to process each document record.
Use concurrency by processing data in threads and tasks that can be run in parallel
All blocking IO should be processed concurrently on a separate thread pool
All non-blocking IO and processing tasks should be processed concurrently in separate thread pools or asynchronous channels
Simplify configuration to make the tool easy to use

Java concurrency classes are used for blocking IO tasks, and Clojure core.async functions are used for non-blocking CPU tasks. IO and CPU tasks run concurrently, making full use of the cores on the machine.

Java concurrency

A thread pool executes the blocking IO tasks, which limits resource usage compared with creating a new thread per task.
Semaphores are used to control the number of records that can be processed at any given time.
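The combination of a bounded thread pool and a semaphore can be sketched as follows. This is a minimal illustration, not code from solr-loader3: the class name BoundedIoSketch, the method run, and its parameters are hypothetical, and Thread.sleep stands in for a blocking database read. The semaphore caps how many records are submitted but not yet finished, while the fixed pool caps the number of IO threads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not code from solr-loader3.
class BoundedIoSketch {

    /**
     * Runs `records` simulated blocking IO tasks on a fixed pool of `poolSize`
     * threads while allowing at most `maxInFlight` records to be submitted but
     * unfinished. Returns the highest number of in-flight records observed.
     */
    static int run(int records, int poolSize, int maxInFlight) throws InterruptedException {
        ExecutorService ioPool = Executors.newFixedThreadPool(poolSize);
        final Semaphore permits = new Semaphore(maxInFlight);
        final AtomicInteger inFlight = new AtomicInteger();
        int peak = 0; // updated only by the submitting thread

        for (int i = 0; i < records; i++) {
            permits.acquire(); // blocks the submitter once the limit is reached
            peak = Math.max(peak, inFlight.incrementAndGet());
            ioPool.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        Thread.sleep(5); // stand-in for a blocking database read
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release(); // free a slot for the next record
                    }
                }
            });
        }

        ioPool.shutdown();
        if (!ioPool.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("IO tasks did not finish in time");
        }
        return peak;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak in-flight records: " + run(50, 4, 8));
    }
}
```

Because a permit is taken before the counter is incremented and returned only after it is decremented, the observed peak can never exceed maxInFlight, regardless of how the pool schedules the tasks.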

Clojure concurrency

Clojure processing mostly involves transforming SQL data records to Solr document update records. Since the processing is CPU bound, core.async channels are used to process data concurrently.
To upload data to Solr, non-blocking IO with http-kit and core.async channels is used.

Other notes

The core.async thread macro is not used for blocking IO because the number of IO threads needs to be bounded. One could write pure Clojure functions to provide a thread pool for blocking IO tasks, but this functionality is already provided by the java.util.concurrent package.
The (go (>! ...)) macro is not used to synchronize blocking and non-blocking tasks, because go macro synchronization only works with other go blocks and blocking IO tasks do not run as go blocks. Instead, semaphores synchronize the blocking and non-blocking tasks that run concurrently.
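One way to picture semaphore-based coordination between the two kinds of task is a permit that is acquired before the blocking read and released only when the downstream stage has finished with the record. The sketch below is an assumption about the general technique, not the tool's actual code: HandoffSketch, run, and the pool sizes are hypothetical, and a second Java thread pool stands in for the core.async side so that the example stays in plain Java.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not code from solr-loader3.
class HandoffSketch {

    /**
     * Pushes `records` items through a blocking read stage and a separate
     * transform stage. A permit is taken before the read and returned only
     * after the transform, so at most `maxInFlight` records are in the
     * pipeline at once. Returns the number of records fully processed.
     */
    static int run(int records, final int maxInFlight) throws InterruptedException {
        final ExecutorService ioPool = Executors.newFixedThreadPool(4);
        // Stand-in for the asynchronous (core.async) side of the pipeline.
        final ExecutorService cpuPool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        final Semaphore permits = new Semaphore(maxInFlight);
        final AtomicInteger processed = new AtomicInteger();

        for (int i = 0; i < records; i++) {
            final int id = i;
            permits.acquire(); // reader side: wait until the pipeline has room
            ioPool.execute(new Runnable() {
                @Override
                public void run() {
                    final String row = "row-" + id; // stand-in for a blocking SQL read
                    cpuPool.execute(new Runnable() {
                        @Override
                        public void run() {
                            try {
                                // Stand-in for building a Solr update document.
                                String doc = row.toUpperCase();
                                if (doc.startsWith("ROW-")) {
                                    processed.incrementAndGet();
                                }
                            } finally {
                                permits.release(); // downstream stage frees the slot
                            }
                        }
                    });
                }
            });
        }

        // Every permit is back only once every record has left the pipeline.
        permits.acquire(maxInFlight);
        ioPool.shutdown();
        cpuPool.shutdown();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("processed " + run(100, 8) + " records");
    }
}
```

The semaphore is the only object both stages share, so neither side needs to know how the other schedules its work. That is the property that makes it usable between a thread pool and code that does not run on that pool.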

License

Distributed under the Eclipse Public License, either version 1.0 or any later version.
