|
12 | 12 | "\n",
|
13 | 13 | "This notebook explains the steps of exporting an App Search engine together with its configurations in Elasticsearch. This is not meant to be an exhaustive example for all App Search features as those will vary based on your instance, but is meant to give a sense of how you can export, migrate, and enhance your application.\n",
|
14 | 14 | "\n",
|
| 15 | + "NOTE: This notebook is designed to work with Elasticsearch **8.18** or higher. If you are running this notebook against an older version of Elasticsearch, we note commands that will need to be modified.\n", |
| 16 | + "\n", |
15 | 17 | "We will look at:\n",
|
16 | 18 | "\n",
|
17 | 19 | "- how to export synonyms\n",
|
|
57 | 59 | "source": [
|
58 | 60 | "## Connect to Elasticsearch\n",
|
59 | 61 | "\n",
|
60 | | - "ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=search&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. \n", |
| 62 | + "ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=search&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. This notebook is designed to be run against an Elasticsearch deployment running on version 8.18 or higher.\n", |
61 | 63 | "\n",
|
62 | 64 | "We'll use the **Cloud ID** to identify our deployment, because we are using Elastic Cloud deployment. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment. \n",
|
63 | 65 | "\n",
|
|
66 | 68 | },
|
67 | 69 | {
|
68 | 70 | "cell_type": "code",
|
69 | | - "execution_count": 2, |
| 71 | + "execution_count": null, |
70 | 72 | "metadata": {},
|
71 | 73 | "outputs": [],
|
72 | 74 | "source": [
|
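The body of the connection cell is elided in this diff. For reference, here is a minimal sketch of such a cell, assuming the credentials are read interactively with `getpass` (the prompt strings and variable handling are assumptions, not the notebook's exact code):

```python
from getpass import getpass

from elasticsearch import Elasticsearch

# Assumed prompts; the notebook may load these values differently.
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API key: ")

# Create the client used throughout the notebook and confirm connectivity.
elasticsearch = Elasticsearch(cloud_id=ELASTIC_CLOUD_ID, api_key=ELASTIC_API_KEY)
print(elasticsearch.info())
```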
|
95 | 97 | "\n",
|
96 | 98 | "You can find your App Search endpoint and your search private key from the `Credentials` menu inside your App Search instance in Kibana.\n",
|
97 | 99 | "\n",
|
98 | | - "Also note here, we define our `ENGINE_NAME`. For this examplem we are using the `national-parks-demo` sample engine that is available within App Search." |
| 100 | + "Also note here, we define our `ENGINE_NAME`. For this example, we are using the `national-parks-demo` sample engine that is available within App Search." |
99 | 101 | ]
|
100 | 102 | },
|
101 | 103 | {
|
102 | 104 | "cell_type": "code",
|
103 | | - "execution_count": 3, |
| 105 | + "execution_count": null, |
104 | 106 | "metadata": {},
|
105 | 107 | "outputs": [],
|
106 | 108 | "source": [
|
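This cell's body is also elided. A hedged sketch of the App Search client setup using the `elastic-enterprise-search` package (the variable names and prompts are assumptions):

```python
from getpass import getpass

from elastic_enterprise_search import AppSearch

# Both values come from the Credentials menu in App Search (see above).
APP_SEARCH_ENDPOINT = getpass("App Search endpoint: ")
APP_SEARCH_PRIVATE_KEY = getpass("App Search private key: ")

app_search = AppSearch(APP_SEARCH_ENDPOINT, bearer_auth=APP_SEARCH_PRIVATE_KEY)

# The sample engine this notebook exports.
ENGINE_NAME = "national-parks-demo"
```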
|
129 | 131 | },
|
130 | 132 | {
|
131 | 133 | "cell_type": "code",
|
132 | | - "execution_count": 4, |
| 134 | + "execution_count": null, |
133 | 135 | "metadata": {
|
134 | 136 | "id": "kpV8K5jHvRK6"
|
135 | 137 | },
|
|
173 | 175 | "\n",
|
174 | 176 | "Next, we will export any curations that may be in our App Search engine.\n",
|
175 | 177 | "\n",
|
176 | | - "To export App Search curations we will use Elasticsearch [query rules](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-using-query-rules.html).\n", |
177 | | - "At the moment of writing this notebook Elasticsearch query rules only allow for pinning results unlike App Search curations that also allow excluding results.\n", |
178 | | - "For this reason we will only export pinned results. The code below will create the necessary `query_rules` to achieve this. Note that there is a default soft limit of 100 curations for `query_rules` that can be configured up to a hard limit of 1,000." |
| 178 | + "To export App Search curations we will use Elasticsearch [query rules](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-using-query-rules.html). The code below will create the necessary `query_rules` to achieve this. Note that there is a default soft limit of 100 curations for `query_rules` that can be configured up to a hard limit of 1,000.\n", |
| 179 | + "\n", |
| 180 | + "NOTE: This example outputs query rules requiring `exact` matches, which are case-sensitive. If you need typo tolerance, consider using `fuzzy`. If you need different case values consider adding multiple values to your criteria. " |
179 | 181 | ]
|
180 | 182 | },
|
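To illustrate the `fuzzy` alternative mentioned in the note above, here is a hedged sketch of a single rule; the ids and values are hypothetical, not taken from the sample engine:

```python
# A "fuzzy" criteria type matches user queries within an edit distance,
# unlike the case-sensitive "exact" type used in the export code below.
fuzzy_rule = {
    "rule_id": "my-curation-pinned-fuzzy",  # hypothetical rule id
    "type": "pinned",
    "criteria": [
        {
            "type": "fuzzy",
            "metadata": "user_query",
            "values": ["national parks"],  # would also match e.g. "natinal parks"
        }
    ],
    "actions": {"ids": ["park_yosemite"]},  # hypothetical document id
}
```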
181 | 183 | {
|
|
187 | 189 | "query_rules = []\n",
|
188 | 190 | "\n",
|
189 | 191 | "for curation in app_search.list_curations(engine_name=ENGINE_NAME).body[\"results\"]:\n",
|
190 | | - " query_rules.append(\n", |
191 | | - " {\n", |
192 | | - " \"rule_id\": curation[\"id\"],\n", |
193 | | - " \"type\": \"pinned\",\n", |
194 | | - " \"criteria\": [\n", |
195 | | - " {\n", |
196 | | - " \"type\": \"exact\",\n", |
197 | | - " \"metadata\": \"user_query\",\n", |
198 | | - " \"values\": curation[\"queries\"],\n", |
199 | | - " }\n", |
200 | | - " ],\n", |
201 | | - " \"actions\": {\"ids\": curation[\"promoted\"]},\n", |
202 | | - " }\n", |
203 | | - " )\n", |
| 192 | + " if curation[\"promoted\"]:\n", |
| 193 | + " query_rules.append(\n", |
| 194 | + " {\n", |
| 195 | + " \"rule_id\": curation[\"id\"] + \"-pinned\",\n", |
| 196 | + " \"type\": \"pinned\",\n", |
| 197 | + " \"criteria\": [\n", |
| 198 | + " {\n", |
| 199 | + " \"type\": \"exact\",\n", |
| 200 | + " \"metadata\": \"user_query\",\n", |
| 201 | + " \"values\": curation[\"queries\"],\n", |
| 202 | + " }\n", |
| 203 | + " ],\n", |
| 204 | + " \"actions\": {\"ids\": curation[\"promoted\"]},\n", |
| 205 | + " }\n", |
| 206 | + " )\n", |
| 207 | + " if curation[\"hidden\"]:\n", |
| 208 | + " query_rules.append(\n", |
| 209 | + " {\n", |
| 210 | + " \"rule_id\": curation[\"id\"] + \"-exclude\",\n", |
| 211 | + " \"type\": \"exclude\",\n", |
| 212 | + " \"criteria\": [\n", |
| 213 | + " {\n", |
| 214 | + " \"type\": \"exact\",\n", |
| 215 | + " \"metadata\": \"user_query\",\n", |
| 216 | + " \"values\": curation[\"queries\"],\n", |
| 217 | + " }\n", |
| 218 | + " ],\n", |
| 219 | + " \"actions\": {\"ids\": curation[\"hidden\"]},\n", |
| 220 | + " }\n", |
| 221 | + " )\n", |
204 | 222 | "\n",
|
205 | 223 | "elasticsearch.query_rules.put_ruleset(ruleset_id=ENGINE_NAME, rules=query_rules)"
|
206 | 224 | ]
|
207 | 225 | },
|
| 226 | + { |
| 227 | + "cell_type": "markdown", |
| 228 | + "metadata": {}, |
| 229 | + "source": [ |
| 230 | + "Let's take a quick look at the query rules we've migrated. We'll do this via the `GET _query_rules/ENGINE_NAME` endpoint. Note that curations with both pinned and hidden documents will be represented as two rules in the ruleset." |
| 231 | + ] |
| 232 | + }, |
| 233 | + { |
| 234 | + "cell_type": "code", |
| 235 | + "execution_count": null, |
| 236 | + "metadata": {}, |
| 237 | + "outputs": [], |
| 238 | + "source": [ |
| 239 | + "print(\n", |
| 240 | + " json.dumps(\n", |
| 241 | + " elasticsearch.query_rules.get_ruleset(ruleset_id=ENGINE_NAME).body, indent=2\n", |
| 242 | + " )\n", |
| 243 | + ")" |
| 244 | + ] |
| 245 | + }, |
208 | 246 | {
|
209 | 247 | "cell_type": "markdown",
|
210 | 248 | "metadata": {
|
|
215 | 253 | "\n",
|
216 | 254 | "We recommend reindexing your App Search engine data into a new Elasticsearch index instead of reusing the existing one. This allows you to update the index mapping to take advantage of modern features like semantic search and the newly created Elasticsearch synonym set.\n",
|
217 | 255 | "\n",
|
218 | | - "App Search has the following data types: `text`, `number`, `date` and `geolocation`. Each of these types is mapped to Elasticsearch field types.\n", |
| 256 | + "App Search has the following data types:\n", |
| 257 | + "\n", |
| 258 | + "- `text`\n", |
| 259 | + "- `number`\n", |
| 260 | + "- `date`\n", |
| 261 | + "- `geolocation`\n", |
| 262 | + " \n", |
| 263 | + "Each of these types is mapped to Elasticsearch field types.\n", |
| 264 | + "\n", |
219 | 265 | "We can take a closer look at how App Search field types are mapped to Elasticsearch fields, by using the [`GET mapping API`](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-mapping.html).\n",
|
220 | 266 | "For App Search engines, the associated Elasticsearch index name is `.ent-search-engine-documents-[ENGINE_NAME]`, e.g. `.ent-search-engine-documents-national-parks-demo` for the App Search sample engine `national-parks-demo`.\n",
|
221 | 267 | "One thing to notice is how App Search uses [multi-fields](https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html) in Elasticsearch that allow for quickly changing the field type in App Search without requiring reindexing by creating subfields for each type of supported field:\n",
|
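A minimal sketch of that lookup, reusing the `elasticsearch` client and `ENGINE_NAME` defined earlier:

```python
import json

# Fetch the mapping of the hidden index backing the App Search engine.
mapping_response = elasticsearch.indices.get_mapping(
    index=".ent-search-engine-documents-" + ENGINE_NAME
)
print(json.dumps(mapping_response.body, indent=2))
```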
|
578 | 624 | "source": [
|
579 | 625 | "# Add semantic text fields for semantic search (optional)\n",
|
580 | 626 | "\n",
|
581 | | - "One of the advantages of exporting our index directly to Elasticsearch is that we can easily perform semantic search with ELSER. To do this, we'll need to add an inference endpoint using ELSER, and a `semantic_text` field to our index to use it.\n", |
| 627 | + "One of the advantages of exporting our index directly to Elasticsearch is that we can easily perform semantic search with ELSER. To do this, we'll need to add a `semantic_text` field to our index to use it. We will set up a `semantic_text` field using our default ELSER endpoint.\n", |
582 | 628 | "\n",
|
583 | | - "Note that to use this feature, your cluster must have at least one ML node set up with enough resources allocated to it.\n", |
| 629 | + "Note that to use this feature, your cluster must be running at least version 8.15.0 and have at least one ML node set up with enough resources allocated to it.\n", |
584 | 630 | "\n",
|
585 | | - "If you have not already, be sure that your ELSER v2 model is [setup and deployed](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html).\n", |
586 | | - "\n", |
587 | | - "Let's first start by creating our inference endpoint using the [Create inference API]](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-inference-api.html)." |
588 | | - ] |
589 | | - }, |
590 | | - { |
591 | | - "cell_type": "code", |
592 | | - "execution_count": null, |
593 | | - "metadata": {}, |
594 | | - "outputs": [], |
595 | | - "source": [ |
596 | | - "# delete our inference endpoint if it is already created\n", |
597 | | - "if elasticsearch.inference.get(inference_id=\"elser_inference_endpoint\"):\n", |
598 | | - " elasticsearch.inference.delete(inference_id=\"elser_inference_endpoint\")\n", |
599 | | - "\n", |
600 | | - "# and create our endpoint using the ELSER v2 model\n", |
601 | | - "elasticsearch.inference.put(\n", |
602 | | - " inference_id=\"elser_inference_endpoint\",\n", |
603 | | - " inference_config={\n", |
604 | | - " \"service\": \"elasticsearch\",\n", |
605 | | - " \"service_settings\": {\n", |
606 | | - " \"model_id\": \".elser_model_2_linux-x86_64\",\n", |
607 | | - " \"num_allocations\": 1,\n", |
608 | | - " \"num_threads\": 1,\n", |
609 | | - " },\n", |
610 | | - " },\n", |
611 | | - " task_type=\"sparse_embedding\",\n", |
612 | | - ")" |
| 631 | + "If you do not have an ELSER endpoint running, it will be automatically downloaded, deployed and started for you when you use `semantic_text`. This means the first few commands may take a while as the model loads. For Elasticsearch versions below 8.17, you will need to create an inference endpoint and add it to the `semantic_text` mapping." |
613 | 632 | ]
|
614 | 633 | },
|
615 | 634 | {
|
|
618 | 637 | "source": [
|
619 | 638 | "## Using semantic text fields for ingest and query\n",
|
620 | 639 | "\n",
|
621 | | - "Next, we'll augment our text fields with `semantic_text` fields in our index. We'll do this by creating a `semtantic_text` field, and providing a `copy_to` directive from the original source field to copy the text into our semantic text fields.\n", |
| 640 | + "First, we'll augment our text fields with `semantic_text` fields in our index. We'll do this by creating a `semtantic_text` field, and providing a `copy_to` directive from the original source field to copy the text into our semantic text fields.\n", |
622 | 641 | "\n",
|
623 | 642 | "In the example below, we are using the `description` and `title` fields from our example index to add semantic search on those fields."
|
624 | 643 | ]
|
|
636 | 655 | "# add the semantic_text field to our mapping for each field defined\n",
|
637 | 656 | "for field_name in SEMANTIC_TEXT_FIELDS:\n",
|
638 | 657 | " semantic_field_name = field_name + \"_semantic\"\n",
|
639 | | - " mapping[semantic_field_name] = {\n", |
640 | | - " \"type\": \"semantic_text\",\n", |
641 | | - " \"inference_id\": \"elser_inference_endpoint\",\n", |
642 | | - " }\n", |
| 658 | + " mapping[semantic_field_name] = {\"type\": \"semantic_text\"}\n", |
643 | 659 | "\n",
|
644 | 660 | "# and for our text fields, add a \"copy_to\" directive to copy the text to the semantic_text field\n",
|
645 | 661 | "for field_name in SEMANTIC_TEXT_FIELDS:\n",
|
|
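The remainder of this cell (the `copy_to` loop body and the subsequent reindex) is elided in the diff. A hedged sketch of what it might look like, assuming the source fields are plain `text` fields and that the destination index has not been created yet:

```python
# Copy each source text field into its new semantic_text counterpart.
for field_name in SEMANTIC_TEXT_FIELDS:
    semantic_field_name = field_name + "_semantic"
    # Assumption: source fields are mapped as plain "text" fields.
    mapping[field_name] = {"type": "text", "copy_to": [semantic_field_name]}

# Create the destination index with the augmented mapping and reindex into
# it; SOURCE_INDEX and DEST_INDEX are defined in the reindexing section.
elasticsearch.indices.create(index=DEST_INDEX, mappings={"properties": mapping})
elasticsearch.reindex(
    source={"index": SOURCE_INDEX},
    dest={"index": DEST_INDEX},
    wait_for_completion=True,
)
```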
778 | 794 | "\n",
|
779 | 795 | "For the results, we sort on our score descending as the primary sort, with the document id as the secondary.\n",
|
780 | 796 | "\n",
|
781 | | - "We apply highlights to our results, request a return size of the top 10 hits, and for each hit, return the result fields." |
| 797 | + "We apply highlights to returned text search descriptions, request a return size of the top 10 hits, and for each hit, return the result fields." |
782 | 798 | ]
|
783 | 799 | },
|
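The sort clause itself is elided from the payload shown below; a hedged sketch, assuming the documents carry a sortable `id` keyword field:

```python
# Primary sort: relevance score descending; tiebreaker: document id ascending.
sort = [{"_score": {"order": "desc"}}, {"id": {"order": "asc"}}]
```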
784 | 800 | {
|
|
826 | 842 | " \"order\": \"score\",\n",
|
827 | 843 | " \"encoder\": \"html\",\n",
|
828 | 844 | " \"require_field_match\": False,\n",
|
829 | | - " \"fields\": {},\n", |
| 845 | + " \"fields\": {\"description\": {\"pre_tags\": [\"<em>\"], \"post_tags\": [\"</em>\"]}},\n", |
830 | 846 | " },\n",
|
831 | 847 | " \"size\": 10,\n",
|
832 | 848 | " \"_source\": result_fields,\n",
|
|
849 | 865 | "outputs": [],
|
850 | 866 | "source": [
|
851 | 867 | "results = elasticsearch.search(\n",
|
852 | | - " index=SOURCE_INDEX,\n", |
| 868 | + " index=DEST_INDEX,\n", |
853 | 869 | " query=app_search_query_payload[\"query\"],\n",
|
854 | 870 | " highlight=app_search_query_payload[\"highlight\"],\n",
|
855 | 871 | " source=app_search_query_payload[\"_source\"],\n",
|
|
866 | 882 | "### How to do semantic search using ELSER with semantic text fields\n",
|
867 | 883 | "\n",
|
868 | 884 | "If you [enabled and reindexed your data with ELSER](#add-sparse_vector-fields-for-semantic-search-optional), we can now use this to do semantic search.\n",
|
869 | | - "For each `semantic_text` field type, we can define a [semantic query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-semantic-query.html) to easily perform a semantic search on these fields.\n" |
| 885 | + "For each `semantic_text` field type, we can define a [match query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html) to easily perform a semantic search on these fields.\n", |
| 886 | + "\n", |
| 887 | + "NOTE: For Elasticsearch versions prior to 8.18, a [semantic query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-semantic-query.html) should be used to perform a semantic search on these fields.\n" |
870 | 888 | ]
|
871 | 889 | },
|
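For completeness, the pre-8.18 form is exactly what the removed lines in the following cell show; as a standalone sketch:

```python
# Pre-8.18 alternative: the dedicated "semantic" query type instead of
# "match", using the same field and query-string variables as below.
semantic_text_queries.append(
    {
        "semantic": {
            "field": semantic_field_name,
            "query": QUERY_STRING,
        }
    }
)
```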
872 | 890 | {
|
|
881 | 899 | "\n",
|
882 | 900 | "for field_name in SEMANTIC_TEXT_FIELDS:\n",
|
883 | 901 | " semantic_field_name = field_name + \"_semantic\"\n",
|
884 | | - " semantic_text_queries.append(\n", |
885 | | - " {\n", |
886 | | - " \"semantic\": {\n", |
887 | | - " \"field\": semantic_field_name,\n", |
888 | | - " \"query\": QUERY_STRING,\n", |
889 | | - " }\n", |
890 | | - " }\n", |
891 | | - " )\n", |
| 902 | + " semantic_text_queries.append({\"match\": {semantic_field_name: QUERY_STRING}})\n", |
892 | 903 | "\n",
|
893 | 904 | "semantic_query = {\"bool\": {\"should\": semantic_text_queries}}\n",
|
894 | 905 | "print(f\"Elasticsearch query:\\n{json.dumps(semantic_query, indent=2)}\\n\")"
|
|
926 | 937 | " \"should\": [\n",
|
927 | 938 | " // multi_match query with best_fields from App Search generated query\n",
|
928 | 939 | " // multi_match query with cross_fields from App Search generated query\n",
|
929 | | - " // text_expansion queries for sparse_vector fields\n", |
| 940 | + " // match queries for semantic_text fields\n", |
930 | 941 | " ]\n",
|
931 | 942 | " }\n",
|
932 | 943 | " } \n",
|
|
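A hedged sketch of assembling that combined payload in Python, assuming the App Search generated query keeps its lexical clauses under a top-level `bool.should` as the outline above suggests:

```python
# Merge the lexical should-clauses with the semantic_text match queries.
combined_query = {
    "bool": {
        "should": (
            app_search_query_payload["query"]["bool"]["should"]
            + semantic_text_queries
        )
    }
}
payload = {**app_search_query_payload, "query": combined_query}
```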
960 | 971 | "outputs": [],
|
961 | 972 | "source": [
|
962 | 973 | "results = elasticsearch.search(\n",
|
963 | | - " index=SOURCE_INDEX,\n", |
| 974 | + " index=DEST_INDEX,\n", |
964 | 975 | " query=payload[\"query\"],\n",
|
965 | 976 | " highlight=payload[\"highlight\"],\n",
|
966 | 977 | " source=payload[\"_source\"],\n",
|
|
969 | 980 | " min_score=1,\n",
|
970 | 981 | ")\n",
|
971 | 982 | "\n",
|
972 | | - "print(f\"Text expansion query results:\\n{json.dumps(results.body, indent=2)}\\n\")" |
| 983 | + "print(f\"Semantic query results:\\n{json.dumps(results.body, indent=2)}\\n\")" |
973 | 984 | ]
|
974 | 985 | }
|
975 | 986 | ],
|
|