@kaituo
Contributor

@kaituo kaituo commented Nov 17, 2025

Description

This PR adds the content for the blog post about the Anomaly Detection (AD) imputation customer success story.

Issues Resolved

Closes #4005

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.

This PR contains the content for publishing the blog on AD imputation customer success story

Signed-off-by: kaituo <[email protected]>
@github-actions

Thank you for submitting a blog post!

The blog post review process is: Submit a PR -> (Optional) Peer review -> Doc review -> Editorial review -> Marketing review -> Published.

@github-actions

Hi @kaituo,

It looks like you're adding a new blog post but don't have an issue mentioned. Please link this PR to an open issue using one of these keywords in the PR description:

  • Closes #issue-number
  • Fixes #issue-number
  • Resolves #issue-number

If an issue hasn't been created yet, please create one and then link it to this PR.

@kaituo
Contributor Author

kaituo commented Nov 17, 2025

> Hi @kaituo,
>
> It looks like you're adding a new blog post but don't have an issue mentioned. Please link this PR to an open issue using one of these keywords in the PR description:
>
> * Closes #issue-number
> * Fixes #issue-number
> * Resolves #issue-number
>
> If an issue hasn't been created yet, please create one and then link it to this PR.

Added `Closes #issue-number` to the PR description.

@kaituo kaituo changed the title from "AD imputation customer success story" to "[Blog post] AD imputation customer success story" on Nov 18, 2025
@pajuric

pajuric commented Nov 18, 2025

@kolchfa-aws - Adding you for tech review.

## Introduction

Anomaly detection in Amazon OpenSearch Service enables users to automatically identify unusual patterns and behaviors in their data streams. This powerful capability has become an essential tool for many organizations seeking to monitor system health, detect issues early, and maintain operational excellence.
However, through continuous customer feedback and real-world usage, we have identified areas where the Anomaly Detection plugin could be further improved, particularly in how it handles scenarios with missing or insufficient input data.

Member

"continuous customer feedback and real-world usage" seems a bit hand-wavy. This whole sentence could be more clearly focused on "identifying customer use cases which are not well handled" or some similar tone.

Contributor Author

fixed

- **`PREVIOUS` (last known value):** This is the best choice if you want to effectively ignore missing data by carrying the last observation forward.
- **`ZERO` or `FIXED_VALUES`:** These methods are similar and should be used when you want missing data to be treated as a potential anomaly. By filling in a rare or out-of-range value (like zero or a specific constant), you make the imputed point stand out to the detector. This approach contrasts with `PREVIOUS`, which aims to make missing data blend in.
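
The difference between the options above can be illustrated with a minimal Python sketch. It is not the plugin's implementation: the `impute` function, its `fixed_value` parameter, and the use of `None` to mark a missing interval are hypothetical names chosen for illustration. Only the option names `PREVIOUS`, `ZERO`, and `FIXED_VALUES` come from the text.

```python
from typing import List, Optional


def impute(
    values: List[Optional[float]],
    method: str = "PREVIOUS",
    fixed_value: float = 0.0,  # hypothetical parameter, used only by FIXED_VALUES
) -> List[Optional[float]]:
    """Fill missing points (None) in a series using one imputation option."""
    filled: List[Optional[float]] = []
    last_seen: Optional[float] = None
    for value in values:
        if value is not None:
            # Real observation: keep it and remember it for PREVIOUS.
            last_seen = value
            filled.append(value)
        elif method == "PREVIOUS":
            # Carry the last observation forward so the gap blends in.
            # If nothing has been observed yet, the point stays missing.
            filled.append(last_seen)
        elif method == "ZERO":
            # Fill with zero so the gap stands out when zero is rare.
            filled.append(0.0)
        elif method == "FIXED_VALUES":
            # Fill with a caller-chosen constant, typically out of range.
            filled.append(fixed_value)
        else:
            raise ValueError(f"unknown imputation method: {method}")
    return filled


series = [10.0, None, None, 12.0]
print(impute(series, "PREVIOUS"))                        # [10.0, 10.0, 10.0, 12.0]
print(impute(series, "ZERO"))                            # [10.0, 0.0, 0.0, 12.0]
print(impute(series, "FIXED_VALUES", fixed_value=-1.0))  # [10.0, -1.0, -1.0, 12.0]
```

With `PREVIOUS`, the filled series looks like an uninterrupted flat stretch, while `ZERO` and `FIXED_VALUES` produce a visible drop that a detector can score as unusual.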

### Algorithm sketch

Member

I'm curious who the audience of this post is. Up until here it read like a layman's interpretation and I understood it well, then suddenly we're jumping into math. Is a blog post the right place for these equations? Is there a way to publish a smaller technical-focused blog and link to it from a layman's blog?

Contributor Author

The target audience is engineers and scientists: both those who just want to know how to use imputation and advanced users who want to understand the underlying implementation. I don't know of an option to publish a more technically focused blog. What I can do is move most of the math into an appendix so that the main flow stays friendly to entry-level users.


For all $t$, we have $0 \le f_t \le 1$ and $0 \le q_t \le 1$.

*Proof.* In the binary model, $0 \le n^{\mathrm{imp}}_t \le L$ by construction, hence $0\le f_t\le 1$ and $q_t=1-f_t\in[0,1]$. In the fractional model, the window sum $S_t$ is the sum of the last $L$ mass terms, each of which lies in $[0,1]$. Therefore, $0 \le S_t \le L$, which implies $f_t \in [0,1]$ and $q_t \in [0,1]$.
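
The bound can also be checked numerically. The following Python sketch assumes that $f_t = S_t / L$, where $S_t$ is the sum of the last $L$ imputed-mass terms, and that $q_t = 1 - f_t$. The excerpt above implies these definitions but does not state the first one explicitly, and the function name `quality_signals` is hypothetical.

```python
from collections import deque
from typing import Iterable, List, Tuple


def quality_signals(masses: Iterable[float], window_len: int) -> List[Tuple[float, float]]:
    """Return (f_t, q_t) per step, assuming f_t = S_t / L and q_t = 1 - f_t."""
    if window_len <= 0:
        raise ValueError("window_len must be positive")
    window = deque(maxlen=window_len)  # keeps only the last L mass terms
    signals = []
    for mass in masses:
        if not 0.0 <= mass <= 1.0:
            raise ValueError("each mass term must lie in [0, 1]")
        window.append(mass)
        s_t = sum(window)       # 0 <= S_t <= L because each term is in [0, 1]
        f_t = s_t / window_len  # imputed fraction of the window
        q_t = 1.0 - f_t         # complementary quality signal
        signals.append((f_t, q_t))
    return signals


# Binary model: mass is 0 (real) or 1 (imputed). Fractional model: any value in [0, 1].
signals = quality_signals([0.0, 1.0, 1.0, 0.5, 0.0], window_len=4)
assert all(0.0 <= f <= 1.0 and 0.0 <= q <= 1.0 for f, q in signals)
print(signals[3])  # (0.625, 0.375)
```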

Member

Continuing my other comment, a mathematical "proof" really seems out of place in a non-technical-audience blog.

Contributor Author

Answered in the previous comment.


Also, because $q_t$ is always in the range $[0, 1]$, the smoothed statistic $\mathrm{DQ}_t$ is guaranteed to remain within $[0, 1]$, provided that its initial value also lies in $[0, 1]$. This follows directly from the standard exponential-smoothing recurrence $\mathrm{DQ}_t = (1-\lambda)\mathrm{DQ}_{t-1}+\lambda q_t$ with $0 \lt \lambda \lt 1$. The new value is a convex combination (a weighted average with nonnegative weights that sum to 1, so it lies between its inputs) of the previous smoothed value $\mathrm{DQ}_{t-1}$ and the current observation $q_t$, so it never leaves the bounds defined by the input signal (see Wikipedia's article on ["Convex combination"](https://en.wikipedia.org/wiki/Convex_combination#:~:text=As%20a%20particular%20example%2C%20every,1)). During a sustained period of missing data, as $f_t$ trends up, $q_t$ trends down, and $\mathrm{DQ}_t$ follows suit, decreasing smoothly. Conversely, when real data returns and $f_t$ trends down, $q_t$ trends up, and $\mathrm{DQ}_t$ reliably recovers toward 1. This ensures that the gating mechanism, which relies on these signals, is stable and responds to persistent changes in data quality rather than short-term noise.
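
The behavior described above can be reproduced with a short Python sketch of the recurrence $(1-\lambda)\mathrm{DQ}_{t-1}+\lambda q_t$. The function name `smooth_data_quality`, the default $\lambda = 0.2$, and the initial value of 1 are assumptions made for illustration and are not taken from the plugin.

```python
from typing import Iterable, List


def smooth_data_quality(
    q_values: Iterable[float],
    lam: float = 0.2,      # smoothing weight lambda, assumed value
    dq_init: float = 1.0,  # assumed starting point: full data quality
) -> List[float]:
    """Exponentially smooth q_t: DQ_t = (1 - lam) * DQ_{t-1} + lam * q_t."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must satisfy 0 < lam < 1")
    dq = dq_init
    smoothed = []
    for q_t in q_values:
        # Convex combination of the previous value and the new observation,
        # so the result stays between the two inputs.
        dq = (1.0 - lam) * dq + lam * q_t
        smoothed.append(dq)
    return smoothed


# 20 steps of fully imputed data (q_t = 0), then 20 steps of real data (q_t = 1).
dq = smooth_data_quality([0.0] * 20 + [1.0] * 20)
outage, recovery = dq[:20], dq[20:]

assert all(0.0 <= value <= 1.0 for value in dq)          # never leaves [0, 1]
assert all(a > b for a, b in zip(outage, outage[1:]))    # decays during the gap
assert all(a < b for a, b in zip(recovery, recovery[1:]))  # recovers toward 1
```

A single missing interval moves $\mathrm{DQ}_t$ by at most $\lambda$, which is why a gate based on it reacts to sustained gaps rather than to isolated ones.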

## System architecture

Member

Here we get back into the practical nature of a scalable design, but I almost missed it while skimming through math I didn't quite understand despite being a math major :)

Contributor Author

Feel free to comment on how to make the math more understandable :)


## Conclusion

Klarna’s experience underscored a simple but easily overlooked truth: in real-world monitoring, **“no data” is sometimes the most important data point of all**. By treating silent intervals as a first-class signal rather than a gap to ignore, we were able to close a blind spot where critical outages could otherwise slip by undetected.

Member

I thought about commenting near the top when "no data" was first introduced, but it's really just too vague a term and I'm not sure it should be the key word you focus on. There are really multiple types of "no data":

  • data missing because you didn't collect it (not the data's problem, truly missing and we shouldn't impute anything)
  • data missing because that's the anomaly itself (the focus here, possibly could mention survivorship bias and the need to impute "something")
  • data present that's just the baseline (treated as "no anomaly" and provides useful data in a Bayesian context)

I think the attempt to squeeze too much out of "no data" doesn't address the complexities of data quality well enough.

Contributor Author

I tried to distinguish the first two cases you mentioned in the Solution part. Let me know what you think.

@kolchfa-aws kolchfa-aws added the New blog (New blog post) and Tech review (The blog is under tech review) labels on Nov 18, 2025
@kolchfa-aws kolchfa-aws self-assigned this Nov 18, 2025

Member

@dbwiddis dbwiddis left a comment

LGTM! Thanks for addressing my observations!

@kaituo
Contributor Author

kaituo commented Nov 19, 2025

@kolchfa-aws We are ready for Doc review.

kaituo and others added 2 commits November 19, 2025 12:56
Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws kolchfa-aws added the Editorial review (The blog is under editorial review) label and removed the Tech review (The blog is under tech review) label on Nov 20, 2025
@kolchfa-aws kolchfa-aws removed their assignment Nov 20, 2025

Collaborator

@natebower natebower left a comment

Editorial review

@natebower natebower removed the Editorial review (The blog is under editorial review) label on Nov 20, 2025
natebower previously approved these changes Nov 20, 2025

Collaborator

@natebower natebower left a comment

Thanks @kaituo! LGTM

@pajuric This should be ready to publish.

@natebower natebower added the Done and ready to publish (The blog is approved and ready to publish) label and removed the New blog (New blog post) label on Nov 20, 2025
@natebower natebower assigned pajuric and unassigned natebower Nov 20, 2025
@kaituo
Contributor Author

kaituo commented Nov 20, 2025

@pajuric Please hold off on publishing for now, as we’re waiting for approval from the customer’s leadership team.

Signed-off-by: Kaituo Li <[email protected]>
@kaituo
Contributor Author

kaituo commented Nov 21, 2025

@natebower I updated the photo link. Do you mind approving again?

Collaborator

@natebower natebower left a comment

LGTM

@pajuric

pajuric commented Dec 9, 2025

Per @kaituo - blog is still holding for customer approval.

Labels

Done and ready to publish (The blog is approved and ready to publish)

Development

Successfully merging this pull request may close these issues.

[BLOG] A customer impact journey: When no data is still important data

5 participants