NGSTACK-809 index file content #21
base: master
…tract text from some file and add configuration for that service
…e able to modify java executable path and allowed mime types
… a new Definition 'apache_tika.client' based on the config params
…ites\ApacheTika\Client to extract text from file
5309943
to
0cea80a
Compare
public function getIndexDefinition(): array
{
    $innerDef = $this->innerField->getIndexDefinition();
    $innerDef['file_text'] = new Search\FieldType\FullTextField();
Since full-text fields are "proxy" fields that are not separately indexed (instead they are mapped to a common full-text field in the field mapper), we do not return them in this method. Check other implementations to see that's the case.
4a53f1a Removed $innerDef['file_text'] = new Search\FieldType\FullTextField(); the method now just returns the inner field's index definition.
{
    $nodeDefinition
        ->children()
            ->arrayNode('apache_tika')
Put this under a new file_indexing node. Also, introduce a file_indexing.enabled option (default false), and inject it into the SearchField to control whether the feature is used or not.
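A minimal sketch of what gating on the injected flag could look like. FileIndexingGate and its extractor callable are hypothetical stand-ins, not the actual SearchField or FileTextExtractor from this PR; the only point is that text extraction is skipped entirely when the flag is false.

```php
<?php

// Hypothetical stand-in for the search field; all names here are illustrative.
final class FileIndexingGate
{
    /** @var bool Value of the file_indexing.enabled option */
    private $enabled;

    /** @var callable Takes a file path, returns the extracted text */
    private $extractText;

    public function __construct(bool $enabled, callable $extractText)
    {
        $this->enabled = $enabled;
        $this->extractText = $extractText;
    }

    public function getFileText(string $path): string
    {
        // With the feature disabled, never call the extractor.
        if (!$this->enabled) {
            return '';
        }

        return ($this->extractText)($path);
    }
}
```

Because the default is false, installations that do not configure file indexing would never invoke the extractor.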
  "symfony/messenger": "^5.4",
- "symfony/proxy-manager-bridge": "^5.4"
+ "symfony/proxy-manager-bridge": "^5.4",
+ "vaites/php-apache-tika": "^1.4"
We can make this optional if we check in the bundle Extension whether the package exists. The dependency should then be moved to the suggest section of composer.json.
d0e8115 Moved the vaites/php-apache-tika package to the "suggest" section of the composer.json file. In the bundle Extension, I check whether the CLIClient and WebClient classes from the vaites/php-apache-tika package exist and, if they do, load search/file_indexing.yaml.
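The optional-dependency check described above can be sketched in plain PHP. The helper allClassesExist() is hypothetical, and the fully qualified class names are an assumption about the package's namespace layout; the thread itself only names CLIClient and WebClient.

```php
<?php

/**
 * Returns true only when every listed class exists (autoloading is attempted).
 */
function allClassesExist(array $classNames): bool
{
    foreach ($classNames as $className) {
        if (!class_exists($className)) {
            return false;
        }
    }

    return true;
}

// Assumed namespaces for the two client classes mentioned in the thread.
$tikaClients = [
    'Vaites\\ApacheTika\\Clients\\CLIClient',
    'Vaites\\ApacheTika\\Clients\\WebClient',
];

if (allClassesExist($tikaClients)) {
    // In the bundle Extension, search/file_indexing.yaml would be loaded here.
    echo "Apache Tika client found, file indexing services can be loaded\n";
} else {
    echo "vaites/php-apache-tika not installed, skipping file indexing services\n";
}
```

Guarding the service definitions this way keeps the container compilable when the suggested package is absent.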
        ->end();
}

private function addApacheTikaSection(ArrayNodeDefinition $nodeDefinition): void
Rename to addFileIndexingSection.
f9e5121
Renamed to addFileIndexingSection()
@AntePrkacin also add some documentation for the feature, and update |
…d 'enabled' and rename function names
…earchField and check if its value before extracting Text from binaryFile
…xtractor into separate yaml file
…of the composer.json file
796d58c Updated |
…BinaryFile/SearchField class
- FileTextExtractor - a service that contains logic for extracting text from some file (pdf for now) using Apache Tika jar file. I've put this file in lib/Core/Search/Common/PageIndexing/TextExtractor directory (if it needs to be moved somewhere else, let me know)
- BinaryFile/SearchField - overrides the usual Ibexa\Core\FieldType\BinaryFile\SearchField file by adding file_text info to also be indexable
- lib/Resources/config/search/common.yaml - this is where I've put the configuration for FileTextExtractor and SearchField services. If this config needs to be somewhere else, let me know

Added multiple configuration parameters under the apache_tika node: mode, path, host, port and allowed_mime_types. The mode param determines whether to use Apache Tika as a JAR file (mode: cli) or as a server that is already up and running (mode: server). If mode is set to "cli", then the path param needs to be defined - this is the path to the Apache Tika JAR file (added validation for this). The params host and port are by default '127.0.0.1' and '9998'.
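The validation rules described above can be illustrated with a small plain-PHP sketch. resolveTikaConfig() is a hypothetical helper, not the PR's code, which presumably expresses the same rules through the Symfony configuration tree. Treating mode as required, using an integer port, and defaulting allowed_mime_types to an empty list are assumptions made for this example.

```php
<?php

/**
 * Applies defaults to an apache_tika config array and validates it.
 * Hypothetical helper for illustration only.
 */
function resolveTikaConfig(array $config): array
{
    // Keys present in $config win; missing keys fall back to the defaults.
    $resolved = $config + [
        'path' => null,
        'host' => '127.0.0.1',
        'port' => 9998,
        'allowed_mime_types' => [], // assumed default, not stated in the PR
    ];

    $mode = $resolved['mode'] ?? null;
    if (!in_array($mode, ['cli', 'server'], true)) {
        throw new InvalidArgumentException('mode must be "cli" or "server"');
    }

    // In cli mode the path to the Apache Tika JAR file is mandatory.
    if ($mode === 'cli' && ($resolved['path'] ?? '') === '') {
        throw new InvalidArgumentException('path must be set when mode is "cli"');
    }

    return $resolved;
}
```

For example, resolveTikaConfig(['mode' => 'server']) falls back to the default host and port, while resolveTikaConfig(['mode' => 'cli']) is rejected because no path is given.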