Merge pull request #17 from camelot-dev/rules-manager
[MRG] Add rules manager
vinayak-mehta authored Nov 12, 2018
2 parents 0eb63a4 + 75a60eb commit 9e26ea1
Showing 12 changed files with 198 additions and 157 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -36,7 +36,7 @@ That's it! Now you can go to http://localhost:5000 and extract data tables from

- **Excalibur gives you complete control over your data**. All file storage and processing happens on your own local or remote machine.
- Excalibur can be configured with **MySQL and Celery** for parallel and distributed workloads. By default, sqlite and multiprocessing are used for sequential workloads.
-- You can save table extraction [rules](https://excalibur-py.readthedocs.io/en/master/user/concepts.html#rule) as **presets** and apply them on different PDFs to extract tables with similar structures. (*in v0.3.0*)
+- You can save table extraction [rules](https://excalibur-py.readthedocs.io/en/master/user/concepts.html#rule) as **presets** and apply them on different PDFs to extract tables with similar structures.
- You can extract tables from **multiple PDFs in one go** using an extraction rule by starting [jobs](https://excalibur-py.readthedocs.io/en/master/user/concepts.html#job). (*in v0.4.0*)

Excalibur uses [Camelot](https://camelot-py.readthedocs.io/) under the hood. You can check out its [comparison with other PDF table extraction libraries and tools](https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -55,7 +55,7 @@ Why Excalibur?

- **Excalibur gives you complete control over your data**. All file storage and processing happens on your own local or remote machine.
- Excalibur can be configured with **MySQL and Celery** for parallel and distributed workloads. By default, sqlite and multiprocessing are used for sequential workloads.
-- You can save table extraction :ref:`rules <concepts>` as **presets** and apply them on different PDFs to extract tables with similar structures. (*in v0.3.0*)
+- You can save table extraction :ref:`rules <concepts>` as **presets** and apply them on different PDFs to extract tables with similar structures.
- You can extract tables from **multiple PDFs in one go** using an extraction rule by starting :ref:`jobs <concepts>`. (*in v0.4.0*)

Excalibur uses `Camelot <https://camelot-py.readthedocs.io/>`_ under the hood. You can check out its `comparison with other PDF table extraction libraries and tools`_.
8 changes: 6 additions & 2 deletions docs/user/concepts.rst
@@ -36,14 +36,18 @@ You can check out Camelot's `read_pdf`_ documentation to see a list of all confi

Inside Excalibur, a rule can be specified by selecting a flavor and its corresponding options in the rule box on the workspace. (As shown on the right)

-From *v0.2.0*, you will be able to give each rule a name and save them as a preset for use on different PDFs to extract tables with similar structures.
+When you create an extraction rule and start an extraction job, the rule is saved as a preset that can be used in the future for PDFs with the same table structure as the one you created the rule on. A saved rule can be loaded on the workspace by selecting it from the "Saved Rules" dropdown.
+
+.. image:: ../_static/gifs/saved-rule.gif
+    :scale: 65%
+    :align: center

Job
---

When you create a rule and apply it on a PDF, a table extraction job is created.

-From *v0.2.0*, you will be able to apply a rule on multiple PDFs at once.
+From *v0.4.0*, you will be able to apply a rule on multiple PDFs at once.

----

151 changes: 80 additions & 71 deletions docs/user/howto.rst
@@ -1,71 +1,80 @@
.. _howto:

How-to Guides
=============

Excalibur's architecture is heavily inspired by Airflow, so you may get a feeling of déjà vu while reading this page of the documentation. `Airflow LICENSE`_.

.. _Airflow LICENSE: https://github.com/apache/incubator-airflow/blob/master/LICENSE

Setting Configuration Options
-----------------------------

The first time you run Excalibur, it will create a file called ``excalibur.cfg`` in your ``$EXCALIBUR_HOME`` directory (``~/excalibur`` by default). This file contains Excalibur’s configuration and you can edit it to change any of the settings.

For example, the metadata database connection string can be set in ``excalibur.cfg`` like this::

    [core]
    sql_alchemy_conn = my_conn_string
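
To make the layout of ``excalibur.cfg`` concrete, here is a minimal sketch that reads the setting back with Python's standard ``configparser``. It is an illustration only, not how Excalibur itself loads its configuration; the helper names ``read_conn_string`` and ``config_path`` are hypothetical, and the fallback directory simply mirrors the ``~/excalibur`` default mentioned above.

```python
import configparser
import os

# Assumption: mirrors the default location described above (~/excalibur).
DEFAULT_HOME = os.path.join(os.path.expanduser("~"), "excalibur")


def read_conn_string(cfg_text):
    """Return the [core] sql_alchemy_conn value from excalibur.cfg-style text."""
    # Interpolation is disabled so that '%' characters in a connection
    # string are treated literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.get("core", "sql_alchemy_conn")


def config_path(environ=os.environ):
    """Build the path to excalibur.cfg under $EXCALIBUR_HOME (hypothetical helper)."""
    home = environ.get("EXCALIBUR_HOME", DEFAULT_HOME)
    return os.path.join(home, "excalibur.cfg")


sample = "[core]\nsql_alchemy_conn = my_conn_string\n"
print(read_conn_string(sample))
```

Passing the environment as an argument keeps ``config_path`` easy to try with a custom ``EXCALIBUR_HOME`` without changing the real environment.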

Resetting the Metadata Database
-------------------------------

.. warning:: The following command will wipe your Excalibur metadata database, removing all information about uploaded files, saved extraction rules and finished/in-progress jobs.

You can reset the metadata database using::

    $ excalibur resetdb

Using the MySQL Database Backend
--------------------------------

Excalibur uses SQLAlchemy to connect to a database backend. By default, it stores all metadata in a SQLite database. To use MySQL, you need to first install MySQL and then create a database and a user.

Installing MySQL
^^^^^^^^^^^^^^^^

To use the MySQL database backend, you need to install Excalibur using::

    $ pip install excalibur-py[mysql]

You can install MySQL using your system's package manager. For Ubuntu::

    $ sudo apt update
    $ sudo apt install mysql-server libmysqlclient-dev

And then set it up using::

    $ mysql_secure_installation

Setup
^^^^^

Now you can create a database and a user for Excalibur::

    > CREATE DATABASE excalibur CHARACTER SET utf8 COLLATE utf8_unicode_ci;
    > GRANT ALL ON excalibur.* TO 'excalibur'@'%' IDENTIFIED BY '1234';

Finally, you need to change the ``sql_alchemy_conn`` in ``excalibur.cfg`` to::

    [core]
    sql_alchemy_conn = mysql://excalibur:1234@localhost:3306/excalibur
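
The connection string above packs the user, password, host, port, and database created in the previous step into one URL. The sketch below pulls those parts apart using only the standard library, as a quick way to check a string before saving it. SQLAlchemy has its own URL handling; ``describe_conn`` is a hypothetical helper for illustration and is not part of Excalibur.

```python
from urllib.parse import urlsplit


def describe_conn(conn):
    """Split a database URL into its named components (illustrative helper)."""
    parts = urlsplit(conn)
    return {
        "dialect": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The path component is "/<database>"; drop the leading slash.
        "database": parts.path.lstrip("/"),
    }


info = describe_conn("mysql://excalibur:1234@localhost:3306/excalibur")
print(info["user"], info["host"], info["port"], info["database"])
```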

And initialize the metadata database using::

    $ excalibur initdb

Scaling Out with Celery
-----------------------

``CeleryExecutor`` is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend (RabbitMQ, Redis, …) and change your ``excalibur.cfg`` to point the executor parameter to ``CeleryExecutor`` and provide the related Celery settings.

For more information about setting up a Celery broker, refer to the exhaustive `Celery documentation on the topic`_.

.. _Celery documentation on the topic: http://docs.celeryproject.org/en/latest/getting-started/brokers/index.html

To kick off a worker, you need to set up Excalibur and run the worker subcommand::

    $ excalibur worker

Your worker should start picking up tasks as soon as they get fired in its direction.
2 changes: 1 addition & 1 deletion docs/user/usage.rst
@@ -53,7 +53,7 @@ Optionally, you can also select a column separator by clicking on "Add Separator
    :scale: 40%
    :align: center

-Finally, you can click on "Extract" to start a table extraction *job*.
+Finally, you can click on "Extract" to start a table extraction *job*. This will save the extraction rule that you created above as a preset, which you can use in the future on PDFs with table structures similar to the one you created the rule on.

.. note:: The Lattice flavor for tables with lines doesn't have an "Add Separator" button. It also doesn't need a table area (though you can specify it) since it reliably detects table boundaries and column separators on its own. In most cases, you won't need to tweak any of its configuration options.

2 changes: 1 addition & 1 deletion excalibur/__version__.py
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-

-VERSION = (0, 2, 1)
+VERSION = (0, 3, 0)

__title__ = 'excalibur-py'
__description__ = 'A web interface for Camelot (PDF Table Extraction for Humans).'
9 changes: 9 additions & 0 deletions excalibur/www/app.py
@@ -1,12 +1,21 @@
+import json
+
 from flask import Flask, Blueprint
 from werkzeug.utils import find_modules, import_string
 
 from .. import configuration as conf
 from .views import views
 
 
+def to_pretty_json(value):
+    value = json.loads(value)
+    return json.dumps(value, sort_keys=True,
+                      indent=4, separators=(',', ': '))
+
+
 def create_app(config=None):
     app = Flask(__name__)
     app.config.from_object(conf)
     app.register_blueprint(views)
+    app.jinja_env.filters['pretty'] = to_pretty_json
     return app
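
To see what the new ``pretty`` filter produces, the sketch below runs the same ``to_pretty_json`` logic outside Flask. The sample rule options are hypothetical and chosen only for illustration; the real stored rule may use different keys. In a Jinja template, a filter registered under this name would be applied with the usual pipe syntax, for example ``{{ value | pretty }}``.

```python
import json


def to_pretty_json(value):
    """Parse a JSON string and re-serialize it with sorted keys and indentation."""
    value = json.loads(value)
    return json.dumps(value, sort_keys=True,
                      indent=4, separators=(',', ': '))


# Hypothetical rule options; the key names are assumptions for this example.
raw = '{"pages": "1", "flavor": "lattice"}'
pretty = to_pretty_json(raw)
print(pretty)
```

Because the filter calls ``json.loads`` first, it expects a JSON string rather than an already parsed dictionary.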