Temporary PHP7 Elasticsearch fix for MediaWiki 1.26.2
-----------------------------------------------------

Notice: this is not recommended for production environments! You are strongly
advised to wait for the MediaWiki 1.27 release and use the new version of the
extension. Stop using this fix when MediaWiki 1.27 is released!

What I did:

1. Downloaded the 1.26 stable extension releases from
   https://www.mediawiki.org/wiki/Special:ExtensionDistributor/CirrusSearch and
   https://www.mediawiki.org/wiki/Special:ExtensionDistributor/Elastica
2. Used this version of Elastica and put it under /wiki/extensions/Elastica/Elastica:
   https://github.com/ruflin/Elastica/commit/e078baafe853fa7db7b0381f5280f775af17b16c
3. Modified /wiki/extensions/CirrusSearch/includes/Searcher.php to remove any
   class that PHP7 does not support.

This extension now barely works with MediaWiki 1.26.2 and Elasticsearch 1.7.1.
(This is a temporary PHP7 compatibility change; it should be removed once the
official 1.27 release of the extension is out.)

MediaWiki extension: CirrusSearch
---------------------------------

Installation
------------

Get Elasticsearch up and running somewhere. 1.3.2 and above are all fine,
though soon the requirement will be 1.6+.

CirrusSearch still uses dynamic scripts. It's very close to not needing them
any more - probably by August 2015 it won't. Anyway, for now you'll have to
reenable dynamic scripting in Elasticsearch. Do that by adding the contents
of the elasticsearch.yml file in this directory to the elasticsearch.yml file
that comes with your installation.

Place the CirrusSearch extension in your extensions directory. Make sure you
have the curl PHP library installed (sudo apt-get install php5-curl in
Debian). You also need to install the Elastica MediaWiki extension. Add this
to LocalSettings.php:

    require_once( "$IP/extensions/Elastica/Elastica.php" );
    require_once( "$IP/extensions/CirrusSearch/CirrusSearch.php" );
    $wgDisableSearchUpdate = true;

Configure your search servers in LocalSettings.php if you aren't running
Elasticsearch on localhost:

    $wgCirrusSearchServers = array( 'elasticsearch0', 'elasticsearch1', 'elasticsearch2', 'elasticsearch3' );

There are other $wgCirrusSearch variables that you might want to change from
their defaults.
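For example, a hypothetical tuning sketch. The variable names below are
assumptions based on the extension's defaults file, not something documented
in this README, so verify names and defaults in CirrusSearch.php before
relying on them:

    // Hypothetical example - confirm these variables in CirrusSearch.php.
    $wgCirrusSearchShardCount = array( 'content' => 4, 'general' => 4 ); // shards per index type
    $wgCirrusSearchReplicas = '0-2'; // allowed range of replicas per index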
Now run this script to generate your elasticsearch index:

    php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php

Now remove $wgDisableSearchUpdate = true from LocalSettings.php. Updates
should start heading to Elasticsearch.

Next bootstrap the search index by running:

    php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip
    php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse

Note that this can take some time. For large wikis read "Bootstrapping large
wikis" below.

Once that is complete add this to LocalSettings.php to funnel queries to
Elasticsearch:

    $wgSearchType = 'CirrusSearch';

Bootstrapping large wikis
-------------------------

Since most of the load involved in indexing is parsing the pages in PHP, we
provide a few options to split the process into multiple processes. Don't
worry too much about the database during this process. It can generally
handle more indexing processes than you are likely to be able to spawn.

General strategy:

0. Make sure you have a good job queue setup. It'll be doing most of the
   work. In fact, Cirrus won't work well on large wikis without it.
1. Generate scripts to add all the pages without link counts to the index.
2. Execute them any way you like.
3. Generate scripts to count all the links.
4. Execute them any way you like.

Step 1: In bash I do this:

    export PROCS=5 #or whatever number you want
    rm -rf cirrus_scripts
    mkdir cirrus_scripts
    mkdir cirrus_log
    pushd cirrus_scripts
    php extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 10000 --pauseForJobs 1000 \
        --skipLinks --indexOnSkip --buildChunks 250000 |
        sed -e 's/$/ | tee -a cirrus_log\/'$wiki'.parse.log/' | split -n r/$PROCS
    for script in x*; do sort -R $script > $script.sh && rm $script; done
    popd

Step 2: Just run all the scripts that step 1 made. Best to run them in screen
or something, and in the directory above cirrus_scripts. So like this:

    bash cirrus_scripts/xaa.sh

Step 3: In bash I do this:

    pushd cirrus_scripts
    rm *.sh
    php extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 10000 --pauseForJobs 1000 \
        --skipParse --buildChunks 250000 |
        sed -e 's/$/ | tee -a cirrus_log\/'$wiki'.parse.log/' | split -n r/$PROCS
    for script in x*; do sort -R $script > $script.sh && rm $script; done
    popd

Step 4: Same as step 2 but for the new scripts. These scripts put more load
on Elasticsearch, so you might want to run them just one at a time if you
don't have a huge Elasticsearch cluster or want to make sure not to cause
load spikes.

If you don't have a good job queue you can try the above, but lower the
buildChunks parameter significantly and remove the --queue parameter.

Handling elasticsearch outages
------------------------------

If for some reason in-process updates to Elasticsearch begin failing, you can
immediately set $wgDisableSearchUpdate = true; in your LocalSettings.php file
to stop trying to update Elasticsearch. Once you figure out what is wrong
with Elasticsearch you should turn those updates back on and then run the
following:

    php ./maintenance/forceSearchIndex.php --from <whenever the outage started in ISO 8601 format> --deletes
    php ./maintenance/forceSearchIndex.php --from <whenever the outage started in ISO 8601 format>

The first command picks up all the deletes that occurred during the outage
and should complete quite quickly. The second command picks up all the
updates that occurred during the outage and might take significantly longer.
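For example, if updates stopped flowing at 15:00 UTC on 2015-07-01 (a made-up
timestamp, purely for illustration of the --from format):

    php ./maintenance/forceSearchIndex.php --from 2015-07-01T15:00:00Z --deletes
    php ./maintenance/forceSearchIndex.php --from 2015-07-01T15:00:00Z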
PoolCounter
-----------

CirrusSearch can leverage the PoolCounter extension to limit the number of
concurrent searches to Elasticsearch. You can do this by installing the
PoolCounter extension and then configuring it in LocalSettings.php like so:

    require_once( "$IP/extensions/PoolCounter/PoolCounterClient.php" );
    // Configuration for standard searches.
    $wgPoolCounterConf[ 'CirrusSearch-Search' ] = array(
        'class' => 'PoolCounter_Client',
        'timeout' => 30,
        'workers' => 25,
        'maxqueue' => 50,
    );
    // Configuration for prefix searches. These are usually quite quick and
    // plentiful.
    $wgPoolCounterConf[ 'CirrusSearch-Prefix' ] = array(
        'class' => 'PoolCounter_Client',
        'timeout' => 10,
        'workers' => 50,
        'maxqueue' => 100,
    );
    // Configuration for regex searches. These are slow and use lots of
    // resources so we only allow a few at a time.
    $wgPoolCounterConf[ 'CirrusSearch-Regex' ] = array(
        'class' => 'PoolCounter_Client',
        'timeout' => 30,
        'workers' => 10,
        'maxqueue' => 10,
    );
    // Configuration for funky namespace lookups. These should be reasonably
    // fast and reasonably rare.
    $wgPoolCounterConf[ 'CirrusSearch-NamespaceLookup' ] = array(
        'class' => 'PoolCounter_Client',
        'timeout' => 10,
        'workers' => 20,
        'maxqueue' => 20,
    );

If you have a version of the pool counter with
Ie282b8486c7bad451fbc5fb9a8274c6e01a728a7 you can also enable per user
request limits:

    // Each user (or remote IP) can only do 5 concurrent queries.
    $wgPoolCounterConf[ 'CirrusSearch-PerUser' ] = array(
        'class' => 'PoolCounter_Client',
        'timeout' => 0,
        'workers' => 5,
        'maxqueue' => 5,
    );

Upgrading
---------

When you upgrade there are four possible cases for maintaining the index:

1. You must update the index configuration and reindex from source documents.
2. You must update the index configuration and reindex from already indexed
   documents.
3. You must update the index configuration but no reindex is required.
4. No changes are required.

If you must do (1) you have two options:

A. Blow away the search index and rebuild it from scratch. Marginally faster
   and uses less disk space in Elasticsearch, but it empties the index
   entirely and rebuilds it, so search will be down for a while:

    php updateSearchIndexConfig.php --startOver
    php forceSearchIndex.php

B. Build a copy of the index, reindex to it, and then force a full reindex
   from source documents. Uses more disk space but search should be up the
   entire time:

    php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now
    php forceSearchIndex.php

If you must do (2) you really have only one option:

A. Build a copy of the index and reindex to it:

    php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now
    php forceSearchIndex.php --from <time when you started updateSearchIndexConfig.php in YYYY-mm-ddTHH:mm:ssZ> --deletes
    php forceSearchIndex.php --from <time when you started updateSearchIndexConfig.php in YYYY-mm-ddTHH:mm:ssZ>

or for the Bash inclined (note that %M is minutes, and TZ=UTC has to apply to
the date command itself so the timestamp comes out in UTC):

    export REINDEX_START=$(TZ=UTC date +%Y-%m-%dT%H:%M:%SZ)
    php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now
    php forceSearchIndex.php --from $REINDEX_START --deletes
    php forceSearchIndex.php --from $REINDEX_START

If you must do (3) you again only have one option:

A. Same as (2.A).

(4) is easy!

The safest thing, if you don't know what is required for your update, is to
execute (1.B).

Production suggestions
----------------------

Elasticsearch

All the general rules for making Elasticsearch production ready apply here.
So that you don't have to go round them up, below is a list. Some of these
steps are obvious, others will take some research.

** NOTE: this list was written for 0.90 so it may not work well for 1.0.
It'll be revised when I have more experience with 1.0. --Nik

1. Have >= 3 nodes.
2. Configure Elasticsearch for memlock (a sketch follows this list).
3. Change each node's elasticsearch.yml file in a few ways.
   3a. Change the node name to the real host name.
   3b. Turn off autocreation and some other scary stuff by adding this
       (tested with 0.90.4):

    ################################### Actions #################################
    ## Modulo some small changes to comments this section comes directly from the
    ## wonderful Elasticsearch mailing list, specifically Dan Everton.
    ##
    # Require explicit index creation. ES never autocreates the indexes the way
    # we like them.
    ##
    action.auto_create_index: false
    ##
    # Protect against accidental close/delete operations on all indices. You can
    # still close/delete individual indices.
    ##
    action.disable_close_all_indices: true
    action.disable_delete_all_indices: true
    ##
    # Disable ability to shutdown nodes via REST API.
    ##
    action.disable_shutdown: true
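A rough sketch of steps 2 and 3a above. These settings come from general
Elasticsearch 0.90/1.x practice rather than from this extension, so treat
them as assumptions and check them against your Elasticsearch version:

    # elasticsearch.yml
    node.name: "search01.example.org"  # 3a: the real host name
    bootstrap.mlockall: true           # 2: lock the JVM heap into memory

    # /etc/default/elasticsearch (Debian package) - allow the lock to succeed
    MAX_LOCKED_MEMORY=unlimited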
Testing
-------

See tests/browser/README.

Job Queue
---------

Cirrus makes heavy use of the job queue. You can run it without any job queue
customization, but if you switch the job queue to Redis with checkDelay
enabled then Cirrus's results will be more correct. The reason for this is
that this configuration allows Cirrus to delay link counts until
Elasticsearch has appropriately refreshed. This is an example of configuring
it:

    $redisPassword = '<password goes here>';
    $wgJobTypeConf['default'] = array(
        'class' => 'JobQueueRedis',
        'order' => 'fifo',
        'redisServer' => 'localhost',
        'checkDelay' => true,
        'redisConfig' => array(
            'password' => $redisPassword,
        ),
    );
    $wgJobQueueAggregator = array(
        'class' => 'JobQueueAggregatorRedis',
        'redisServer' => 'localhost',
        'redisConfig' => array(
            'password' => $redisPassword,
        ),
    );

Note: some MediaWiki setups have trouble running the job queue. It can be
finicky. The most surefire way to get it to work is also the slowest. Add
this to your LocalSettings.php:

    $wgRunJobsAsync = false;

Development
-----------

The fastest way to get started with CirrusSearch development is to use
MediaWiki-Vagrant.

1. Follow the steps here: https://www.mediawiki.org/wiki/MediaWiki-Vagrant#Quick_start
2. Now execute the following:

    vagrant enable-role cirrussearch
    vagrant provision

This can take some time but it produces a clean development environment in a
virtual machine that has everything required to run Cirrus.

Hooks
-----

CirrusSearch provides hooks that other extensions can use to extend the core
schema and modify documents. There are currently two phases to building
Cirrus documents: the parse phase and the links phase. Both phases - parse,
then links - run when the article's rendered text would change (an actual
article change or a template change). Only the links phase runs when an
article is newly linked or unlinked. Note that this whole thing is a somewhat
experimental feature at this point and the API hasn't really been settled.

'CirrusSearchAnalysisConfig': Allows hooking into the configuration for
analysis.
  &$config - multi-dimensional configuration array for analysis of various
      languages and fields
  $builder - instance of MappingConfigBuilder, for easier use of utility
      methods to build fields

'CirrusSearchMappingConfig': Allows configuration of the mapping of fields.
  &$config - multi-dimensional configuration array that contains the
      Elasticsearch document configuration. The 'page' index contains
      configuration for Elasticsearch documents representing pages. The
      'namespace' index contains configuration for Elasticsearch documents
      representing namespaces.
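As a minimal sketch of wiring up one of these hooks from another extension.
The extension name and the added field are hypothetical; the hook name and
its documented &$config parameter come from the list above, but the exact
array layout is an assumption (standard Elasticsearch mapping structure), so
verify it against MappingConfigBuilder before relying on it:

    // Hypothetical handler registration in another extension's setup file.
    $wgHooks['CirrusSearchMappingConfig'][] = 'MyExtensionHooks::onCirrusSearchMappingConfig';

    class MyExtensionHooks {
        /**
         * Add a hypothetical 'coordinates' field to the page document mapping.
         * @param array &$config multi-dimensional mapping configuration
         * @return bool true so later handlers still run
         */
        public static function onCirrusSearchMappingConfig( array &$config ) {
            // Assumed layout: the 'page' key holds the page document mapping.
            $config['page']['properties']['coordinates'] = array( 'type' => 'geo_point' );
            return true;
        }
    }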
Validating a new version of Elasticsearch
-----------------------------------------

The simplest way to validate that CirrusSearch works with a new version of
Elasticsearch is to use MediaWiki-Vagrant. Get it set up and make sure the
tests are passing (see tests/browser/README) and then:

1. Connect to your vagrant instance with ```vagrant ssh```.
2. Get the deb distribution of the Elasticsearch you want to validate with
   ```wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.0.deb```
   or something like that.
3. Install it with ```sudo dpkg -i elasticsearch-1.7.0.deb```.
4. Disconnect from your vagrant instance. You are done with it.
5. Make a change like the following to MediaWiki-Vagrant:

```
diff --git a/puppet/modules/elasticsearch/manifests/init.pp b/puppet/modules/elasticsearch/manifests/init.pp
index 08bc4bd..a9fea2e 100644
--- a/puppet/modules/elasticsearch/manifests/init.pp
+++ b/puppet/modules/elasticsearch/manifests/init.pp
@@ -5,7 +5,7 @@
 #
 class elasticsearch {
     package { 'elasticsearch':
-        ensure => '1.6.0',
+        ensure => '1.7.0',
     }
     require_package('openjdk-7-jre-headless')
diff --git a/puppet/modules/role/manifests/cirrussearch.pp b/puppet/modules/role/manifests/cirrussearch.pp
index 549104a..c81a22f 100644
--- a/puppet/modules/role/manifests/cirrussearch.pp
+++ b/puppet/modules/role/manifests/cirrussearch.pp
@@ -16,19 +16,19 @@ class role::cirrussearch {
     ## Analysis
     elasticsearch::plugin { 'icu':
         name    => 'elasticsearch-analysis-icu',
-        version => '2.6.0',
+        version => '2.7.0',
     }
     elasticsearch::plugin { 'kuromoji':
         name    => 'elasticsearch-analysis-kuromoji',
-        version => '2.6.0',
+        version => '2.7.0',
     }
     elasticsearch::plugin { 'stempel':
         name    => 'elasticsearch-analysis-stempel',
-        version => '2.6.0',
+        version => '2.7.0',
     }
     elasticsearch::plugin { 'smartcn':
         name    => 'elasticsearch-analysis-smartcn',
-        version => '2.6.0',
+        version => '2.7.0',
     }
     elasticsearch::plugin { 'hebrew': # Less stable then icu plugin
@@ -39,13 +39,13 @@ class role::cirrussearch {
     elasticsearch::plugin { 'highlighter':
         group   => 'org.wikimedia.search.highlighter',
         name    => 'experimental-highlighter-elasticsearch-plugin',
-        version => '1.6.0',
+        version => '1.7.0',
     }
     ## Trigram accelerated regular expressions, "safer" query, and friends
     elasticsearch::plugin { 'extra':
         group   => 'org.wikimedia.search',
         name    => 'extra',
-        version => '1.6.0',
+        version => '1.7.0',
     }
```

6. Run ```vagrant provision```. You should notice it installing the new
   Elasticsearch plugins but not installing a new Elasticsearch - you already
   did that. It'll restart Elasticsearch to pick up the new plugins.
7. Run the integration test suite again.

Licensing information
---------------------

CirrusSearch makes use of the Elastica extension containing the Elastica
library to connect to Elasticsearch <http://elastica.io/>. It is Apache
licensed and you can read the license in Elastica/LICENSE.txt.