Commit

Merge branch 'master' of github.com:nddipiazza/tika-fork
nddipiazza committed Jun 1, 2019
2 parents c3c188f + 2da5283 commit e749246
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -1,20 +1,20 @@
# tika-fork

Utility that allows you to run Tika as a forked JVM to minimize memory issues.
Utility that allows you to run Tika as a pool of forked JVMs to minimize memory issues.

## Motivation

It is a common issue when dealing with Tika to have parses that cause your entire JVM to crash due to out-of-memory conditions.
It is a common issue when dealing with Tika to have parses that cause your entire JVM to crash due to out-of-memory conditions. There are some parameters that are intended to prevent these issues but the issues can still happen from time to time as described in https://issues.apache.org/jira/browse/TIKA-2575

There are also cases where a Tika parse does not return in a reasonable time because of GC hell or some other CPU-intensive process.

This program attempts to deal with these problems:

* Launches a pool of forked JVMs that are all limited by the amount of memory they can use.
* Uses sockets (not HTTP) to send a stream your document content to the Tika parser, and to receive back a stream of metadata and a stream of the parsed content.
* Uses sockets (not HTTP) to send a stream of your document content to the Tika parser, and to receive back a stream of metadata and a stream of the parsed content.
* Uses commons-pool to provide fine-grained control over the pool of forked Tika JVMs.
* Provides a very simple "abortAfterMs" parameter to the parse that will throw a TimeoutException if too much time is taken. This will result in the forked JVM to be aborted. This is useful in the situations where the JVM went into GC hell eating tons of CPU and never returning.
* Provides an "abortAfterMs" parameter to the parse method that throws a TimeoutException if the parse takes too long. The forked JVM is then aborted. This is useful when the JVM has gone into GC hell, eating tons of CPU and never returning.
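
The timeout-then-abort behavior in the last bullet can be sketched with a plain `Future` and a deadline. This is a minimal illustration and not tika-fork's actual code: the class `AbortAfterSketch`, the method `callWithAbort`, and the `onAbort` callback are hypothetical names, and the parse is simulated with a sleep. In a real pool the callback would presumably kill the forked process, for example with `Process.destroyForcibly()`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: names here are illustrative, not part of tika-fork.
class AbortAfterSketch {

    /**
     * Runs the task and waits at most abortAfterMs milliseconds for its result.
     * If the deadline passes, cancels the task, runs onAbort (assumed to tear
     * down the forked JVM), and rethrows the TimeoutException to the caller.
     */
    static <T> T callWithAbort(Callable<T> task, long abortAfterMs, Runnable onAbort)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(task);
            try {
                return future.get(abortAfterMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the waiting worker thread
                onAbort.run();       // e.g. process.destroyForcibly() in a real pool
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A parse that finishes well inside the deadline.
        String text = callWithAbort(() -> "parsed content", 2000, () -> { });
        System.out.println("fast parse returned: " + text);

        // A simulated stuck parse that exceeds the deadline.
        try {
            callWithAbort(() -> {
                Thread.sleep(5000);
                return "never returned";
            }, 100, () -> System.out.println("aborting forked JVM (simulated)"));
        } catch (TimeoutException e) {
            System.out.println("parse timed out");
        }
    }
}
```

Keeping the abort action as a callback separates the deadline logic from the process handling, so the same wrapper works whether the worker is a thread, a socket read, or a forked JVM.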

## Usage

See the [Tika Fork Process Unit Test](fork/src/test/java/org/apache/tika/fork/TikaProcessTest.java) for several detailed examples of how to use the program.
See the [Tika Fork Process Unit Test](tika-fork/src/test/java/org/apache/tika/fork/TikaProcessTest.java) for several detailed examples of how to use the program.
