@AntePrkacin AntePrkacin commented Jul 9, 2025

  • FileTextExtractor - a service that extracts text from a file (PDF only for now) using the Apache Tika JAR. I've put it in the lib/Core/Search/Common/PageIndexing/TextExtractor directory (let me know if it should be moved elsewhere).
  • BinaryFile/SearchField - overrides the default Ibexa\Core\FieldType\BinaryFile\SearchField so that the extracted text is also indexed as file_text.
  • lib/Resources/config/search/common.yaml - contains the service configuration for FileTextExtractor and SearchField. Let me know if this config belongs somewhere else.

Added several configuration parameters under the apache_tika node: mode, path, host, port and allowed_mime_types. The mode parameter determines whether Apache Tika is used as a JAR file (mode: cli) or as a server that is already up and running (mode: server). If mode is set to "cli", the path parameter must be defined (validation added for this); it is the path to the Apache Tika JAR file. The host and port parameters default to '127.0.0.1' and '9998'.
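For reviewers who want the validation rule in one place, here is a minimal sketch of the behaviour described above, written as plain PHP rather than as the bundle's actual Symfony configuration tree. The function name `resolveApacheTikaConfig`, the exception type and the empty default for `allowed_mime_types` are assumptions made for illustration; only the parameter names, the two modes and the host/port defaults come from the description.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the PR): applies the defaults and the
 * "cli requires path" rule described for the apache_tika node.
 */
function resolveApacheTikaConfig(array $config): array
{
    // Defaults for host and port are taken from the PR description.
    // The empty allowed_mime_types default is an assumption.
    $resolved = array_merge(
        [
            'path' => null,
            'host' => '127.0.0.1',
            'port' => 9998,
            'allowed_mime_types' => [],
        ],
        $config
    );

    $mode = $resolved['mode'] ?? null;
    if (!in_array($mode, ['cli', 'server'], true)) {
        throw new InvalidArgumentException('apache_tika.mode must be "cli" or "server".');
    }

    // In cli mode the path to the Apache Tika JAR file is mandatory.
    if ($mode === 'cli' && (!is_string($resolved['path']) || $resolved['path'] === '')) {
        throw new InvalidArgumentException('apache_tika.path must be set when mode is "cli".');
    }

    return $resolved;
}

// Usage: server mode falls back to the default host and port.
$serverConfig = resolveApacheTikaConfig(['mode' => 'server']);
```

Calling the helper with `['mode' => 'cli']` and no `path` throws, which mirrors the validation mentioned in the description.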

@AntePrkacin AntePrkacin requested a review from pspanja July 9, 2025 13:54
@AntePrkacin AntePrkacin self-assigned this Jul 9, 2025
@pspanja pspanja force-pushed the NGSTACK-809-index-file-content branch from 5309943 to 0cea80a Compare August 11, 2025 09:16
public function getIndexDefinition(): array
{
    $innerDef = $this->innerField->getIndexDefinition();
    $innerDef['file_text'] = new Search\FieldType\FullTextField();
Member

Since full-text fields are "proxy" fields that are not separately indexed (instead they are mapped to a common full-text field in the field mapper), we do not return them in this method. Check other implementations to see that's the case.

Author

4a53f1a Removed $innerDef['file_text'] = new Search\FieldType\FullTextField(); and now just return the inner field's index definition.

{
    $nodeDefinition
        ->children()
            ->arrayNode('apache_tika')
Member

Put this under a new file_indexing node. Also, introduce a file_indexing.enabled option (default false), and inject it into the SearchField to control whether the feature is used or not.

Author

f9e5121
I've put all of this under a new file_indexing node and added an enabled node as its child (default: false).

ac0745c 57a884b
I've also injected the file_indexing.enabled config parameter into SearchField. If it is set to false, file indexing (extracting text from the file) is skipped.
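The gating described in this reply can be pictured with a small standalone sketch. It does not reproduce the real SearchField class; `TextExtractor`, `FileTextProvider` and their methods are hypothetical names used only to show an injected boolean flag short-circuiting the extraction.

```php
<?php

declare(strict_types=1);

/** Hypothetical stand-in for the FileTextExtractor service. */
interface TextExtractor
{
    public function extract(string $filePath): string;
}

/** Hypothetical class showing how an injected "enabled" flag gates extraction. */
final class FileTextProvider
{
    /** @var TextExtractor */
    private $extractor;

    /** @var bool */
    private $enabled;

    public function __construct(TextExtractor $extractor, bool $enabled)
    {
        $this->extractor = $extractor;
        $this->enabled = $enabled;
    }

    /** Returns the extracted text, or null when file indexing is disabled. */
    public function getFileText(string $filePath): ?string
    {
        if (!$this->enabled) {
            // Feature switched off: the extractor is never called.
            return null;
        }

        return $this->extractor->extract($filePath);
    }
}
```

With the flag set to false the extractor is not touched at all, so a missing or unreachable Tika installation has no effect on indexing.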

  "symfony/messenger": "^5.4",
- "symfony/proxy-manager-bridge": "^5.4"
+ "symfony/proxy-manager-bridge": "^5.4",
+ "vaites/php-apache-tika": "^1.4"
Member

We can make this optional if we check in the bundle Extension whether the package exists. The dependency should then be moved to the suggest section of composer.json.

Author

@AntePrkacin AntePrkacin Sep 24, 2025

d0e8115
Moved the vaites/php-apache-tika package to the "suggest" section of composer.json. In the bundle Extension, I check whether the CLIClient and WebClient classes from vaites/php-apache-tika exist, and only then load search/file_indexing.yaml.
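The optional-dependency check can be sketched as follows. This is a simplified illustration, not the bundle's Extension code: the function `configFilesToLoad` is hypothetical, and the fully qualified class names in the usage lines are an assumption about where the package declares CLIClient and WebClient.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns the config files to load, adding the
 * file indexing config only when every required class is available.
 *
 * @param string[] $requiredClasses fully qualified class names
 * @return string[]
 */
function configFilesToLoad(array $requiredClasses): array
{
    $files = ['search/common.yaml'];

    foreach ($requiredClasses as $class) {
        if (!class_exists($class)) {
            // Optional package not installed: skip the file indexing services.
            return $files;
        }
    }

    $files[] = 'search/file_indexing.yaml';

    return $files;
}

// Usage; the namespace below is assumed, not confirmed by the PR.
$files = configFilesToLoad([
    'Vaites\\ApacheTika\\Clients\\CLIClient',
    'Vaites\\ApacheTika\\Clients\\WebClient',
]);
```

Keeping the check in one place means the rest of the container configuration never references Tika classes unless the suggested package is actually installed.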

->end();
}

private function addApacheTikaSection(ArrayNodeDefinition $nodeDefinition): void
Member

Rename to addFileIndexingSection.

Author

f9e5121
Renamed to addFileIndexingSection()

@pspanja
Member

pspanja commented Aug 11, 2025

@AntePrkacin also add some documentation for the feature, and update README.md.

@AntePrkacin
Author

> @AntePrkacin also add some documentation for the feature, and update README.md.

796d58c Updated README.md with info and configuration for file_indexing.
In my opinion, also documenting the file_indexing feature in the docs/ folder would be redundant. One place is enough, so I only edited README.md.
