Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index span.db.statement as text #6586

Open
matschaffer opened this issue Nov 11, 2021 · 24 comments
Open

Index span.db.statement as text #6586

matschaffer opened this issue Nov 11, 2021 · 24 comments

Comments

@matschaffer
Copy link

The db statement is really useful, but it would be even more amazingly useful if we could search on it.

So by adding this mapping:

Screen Shot 2021-11-11 at 13 59 51

We should be able to search for spans which have a certain db query structure.

Screen Shot 2021-11-11 at 14 05 04

One gotcha I seem to be hitting in this test is that I can't search for jvm or heap . Seems like maybe I'm catching an "ignore_above" but I didn't think they were supposed to kick in on text fields.

Either way, I imagine it'd be quite useful to do things like highlight traces that involved specific query structures (indices, tables, etc).

@axw
Copy link
Member

axw commented Nov 11, 2021

@matschaffer thanks for opening.

Originally this field was intentionally not indexed, to avoid incurring high storage cost. I would like to revisit this decision, ideally with some data to show the cost is worthwhile.

Would match_only_text be reasonable for your needs?

@matschaffer
Copy link
Author

For the purpose I had above, definitely.

Though I could imagine cases where ordering of word in the statement is important. For example if it were a sql query, you might be interested in if the table appeared before or after the join.

Just conjecture on that though.

@matschaffer
Copy link
Author

As for the cost, have we gotten hard numbers on with/without? Seems like maybe we could add the mapping to a large index and check before/after. I'm sure ESS internal clusters have lots of large indices we could borrow for such a purpose.

@axw
Copy link
Member

axw commented Nov 15, 2021

As for the cost, have we gotten hard numbers on with/without?

No, that's really the only thing holding us back at the moment.

@matschaffer
Copy link
Author

Cool. I was able to simply add the mapping to get the functionality. I didn't check on-disk size before/after, but I could. Is there a particular disk number you think would be most useful? Is the store size on https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html sufficient you think?

@axw
Copy link
Member

axw commented Nov 15, 2021

I think indexing and store are most interesting. It would also be helpful to know the proportion of docs that have a span.db.statement field.

@matschaffer
Copy link
Author

I did a little poking at this today. I think we can probably get the ratio with:

GET apm-7.15.1-span-000001/_search
{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}

And the store sizing info with:

GET apm-7.15.1-span-000001/_stats/indexing,store

@matschaffer
Copy link
Author

Here's one from us-east-1 staging metrics cluster

Before:

GET apm-7.15.0-span-000048/_stats/indexing,store
{
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "store" : {
        "size_in_bytes" : 3665198005,
        "total_data_set_size_in_bytes" : 3665198005,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    },
    "total" : {
      "store" : {
        "size_in_bytes" : 7330877145,
        "total_data_set_size_in_bytes" : 7330877145,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    }
  },
  "indices" : {
    "apm-7.15.0-span-000048" : {
      "uuid" : "KMHvD2_mQ9OYg2Eli4cq9w",
      "primaries" : {
        "store" : {
          "size_in_bytes" : 3665198005,
          "total_data_set_size_in_bytes" : 3665198005,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      },
      "total" : {
        "store" : {
          "size_in_bytes" : 7330877145,
          "total_data_set_size_in_bytes" : 7330877145,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      }
    }
  }
}

Mapping:

PUT apm-7.15.0-span-000048/_mapping
{
  "properties": {
    "span": {
      "properties": {
        "db": {
          "properties": {
            "statement": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}

After:

GET apm-7.15.0-span-000048/_stats/indexing,store
{
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "store" : {
        "size_in_bytes" : 3665198005,
        "total_data_set_size_in_bytes" : 3665198005,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    },
    "total" : {
      "store" : {
        "size_in_bytes" : 7330877145,
        "total_data_set_size_in_bytes" : 7330877145,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    }
  },
  "indices" : {
    "apm-7.15.0-span-000048" : {
      "uuid" : "KMHvD2_mQ9OYg2Eli4cq9w",
      "primaries" : {
        "store" : {
          "size_in_bytes" : 3665198005,
          "total_data_set_size_in_bytes" : 3665198005,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      },
      "total" : {
        "store" : {
          "size_in_bytes" : 7330877145,
          "total_data_set_size_in_bytes" : 7330877145,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      }
    }
  }
}

Ratio:

GET apm-7.15.0-span-000048/_search

Request:

{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}

Response:

{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}

Sadly it looks like this index had no db statement spans 😆

I'll see if I can dig around for a sample apm span index with more db statements around.

@matschaffer
Copy link
Author

Wonder if it might make sense to use @dgieselaar 's apm synthtrace to just generate a bunch of random db statements as part of the span data. Not as nice as "real" but looks like our biggest "real" data source doesn't record db statements at all today.

@axw
Copy link
Member

axw commented Nov 17, 2021

Sadly it looks like this index had no db statement spans laughing

@matschaffer oh, bummer. Thanks for digging into it anyway.

Wonder if it might make sense to use @dgieselaar 's apm synthtrace to just generate a bunch of random db statements as part of the span data. Not as nice as "real" but looks like our biggest "real" data source doesn't record db statements at all today.

++

I'd like to use synthtrace (some kind of generator) to generate APM events, pass them through apm-server, and then use the resulting docs in Rally (#6115). Then we can measure the impact of mapping changes like, while keeping the results in sync with the changes in how apm-server structures docs.

@henrikno
Copy link

Btw. from 7.15 you can get the actual disk usage per field, which is really useful to test mapping changes. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-disk-usage.html

@henrikno
Copy link

Also wanted to note, match_only_text does support phrase queries, it'll just be slower since it needs to look at _source to verify ordering.

@tobiasstadler
Copy link
Contributor

I wanted to note that span.db.statement may contain sensitive information.

@axw
Copy link
Member

axw commented Nov 18, 2021

@tobiasstadler we already store span.db.statement in the Elasticsearch document, it's just that we don't index it for fast searching. It's already possible to search the field using runtime fields by defining a runtime field in the search request, so I think the only difference would be the speed of the search. Is there something else about indexing it that increases security risk?

@tobiasstadler
Copy link
Contributor

No, I don't thin so. Sorry I didn't know/forgot that the value is stored already.

@axw
Copy link
Member

axw commented Nov 18, 2021

No worries! Thanks for raising your concern anyway.

@tobiasstadler
Copy link
Contributor

I did the following in our test cluster:

PUT /apm-7.15.0-span-000001/_mapping
{
  "properties": {
    "span": {
      "properties": {
        "db": {
          "properties": {
            "statement": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}

and

 POST /apm-7.15.0-span-000001/_update_by_query

Now

POST /apm-7.15.0-span-000001/_disk_usage?run_expensive_tasks=true

results in

...

"span.db.statement" : {
        "total" : "49.9mb",
        "total_in_bytes" : 52365481,
        "inverted_index" : {
          "total" : "35mb",
          "total_in_bytes" : 36720077
        },
        "stored_fields" : "0b",
        "stored_fields_in_bytes" : 0,
        "doc_values" : "0b",
        "doc_values_in_bytes" : 0,
        "points" : "0b",
        "points_in_bytes" : 0,
        "norms" : "14.9mb",
        "norms_in_bytes" : 15645404,
        "term_vectors" : "0b",
        "term_vectors_in_bytes" : 0
      },
...

apm-7.15.0-span-000001 has 13610386 docs with span.db.statement.

I hope this helps.

@axw
Copy link
Member

axw commented Nov 18, 2021

Thanks @tobiasstadler! Those numbers suggest we have nothing to be concerned about, storage wise. It would be helpful to know what the indexing overhead is (I expect it's also fine).

@tobiasstadler
Copy link
Contributor

Is there a way to measure to overhead?

@axw
Copy link
Member

axw commented Nov 19, 2021

@tobiasstadler using the Elasticsearch _stats API like in #6586 (comment) (comparing before & after). However, I just saw that Mat's results in #6586 (comment) and the indexing stats are all zero, so I'm not exactly sure what steps are necessary here.

@tobiasstadler
Copy link
Contributor

Sorry, but I do not have any "before" data.

@axw
Copy link
Member

axw commented Nov 19, 2021

No worries, we'll run some experiments. Thanks all the same :)

@simitt simitt added this to the 8.3 milestone Apr 4, 2022
@simitt simitt removed this from the 8.3 milestone May 24, 2022
@axw
Copy link
Member

axw commented Nov 23, 2023

This will kinda be handled as a side-effect of #11528, and the general move to dynamic mapping. I say "kinda" because the dynamic mapping rules will map this field as a keyword field, and not text. Still, it will be searchable (albeit more slowly), and folks will still be able to override this with a custom component template.

@axw
Copy link
Member

axw commented Nov 23, 2023

See also #12098

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants