[Blog post] AD imputation customer success story #4006
Conversation
This PR contains the content for publishing the blog on the AD imputation customer success story.

Signed-off-by: kaituo <[email protected]>
Thank you for submitting a blog post! The blog post review process is: Submit a PR -> (Optional) Peer review -> Doc review -> Editorial review -> Marketing review -> Published.
Hi @kaituo, it looks like you're adding a new blog post but don't have a linked issue. Please link this PR to an open issue using one of GitHub's closing keywords in the PR description. If an issue hasn't been created yet, please create one and then link it to this PR.

added `Closes #issue-number`
Signed-off-by: kaituo <[email protected]>
@kolchfa-aws - Adding you for tech review.
## Introduction
Anomaly detection in Amazon OpenSearch Service enables users to automatically identify unusual patterns and behaviors in their data streams. This powerful capability has become an essential tool for many organizations seeking to monitor system health, detect issues early, and maintain operational excellence.

However, through continuous customer feedback and real-world usage, we have identified areas where the Anomaly Detection plugin could be further improved, particularly in how it handles scenarios with missing or insufficient input data.
"continuous customer feedback and real-world usage" seems a bit hand-wavy. This whole sentence could be more clearly focused on "identifying customer use cases which are not well handled" or some similar tone.
fixed
- **`PREVIOUS` (last known value):** This is best if you want to effectively ignore missing data by carrying the last observation forward.
- **`ZERO` or `FIXED_VALUES`:** These methods are similar and should be used when you want missing data to be treated as a potential anomaly. By filling in a rare or out-of-range value (like zero or a specific constant), you make the imputed point stand out to the detector. This approach contrasts with `PREVIOUS`, which aims to make missing data blend in. A minimal sketch of all three strategies appears after this list.
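To make the behavioral difference concrete, here is a minimal Python sketch of the three strategies. The function and names are illustrative assumptions for this post, not the Anomaly Detection plugin's actual implementation or API.

```python
# Minimal sketch of the three imputation strategies (illustrative only,
# not the Anomaly Detection plugin's actual code).
def impute(series, method="previous", fixed_value=0.0):
    """Fill None entries in `series` according to the chosen method."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
            filled.append(v)
        elif method == "previous":
            # Carry the last observation forward so gaps blend in.
            filled.append(last)
        elif method == "zero":
            # Fill with zero so the gap stands out as a potential anomaly.
            filled.append(0.0)
        elif method == "fixed_values":
            # Fill with a chosen constant, e.g. an out-of-range sentinel.
            filled.append(fixed_value)
    return filled

print(impute([10.0, None, 12.0, None], method="previous"))  # [10.0, 10.0, 12.0, 12.0]
print(impute([10.0, None, 12.0, None], method="zero"))      # [10.0, 0.0, 12.0, 0.0]
```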
### Algorithm sketch
I'm curious who the audience of this post is. Up until here it read like a layman's interpretation and I understood it well, then suddenly we're jumping into math. Is a blog post the right place for these equations? Is there a way to publish a smaller technical-focused blog and link to it from a layman's blog?
The target audience is engineers and scientists: both those who just want to know how to use imputation and advanced users who want to understand the underlying implementation. I don't know of an option to publish a more technically focused blog. What I can do is move most of the math into an appendix so that the main flow stays friendly to entry-level users.
For all $t$, we have $0 \le f_t \le 1$ and $0 \le q_t \le 1$.

*Proof.* In the binary model, $0 \le n^{\mathrm{imp}}_t \le L$ by construction, hence $0 \le f_t \le 1$ and $q_t = 1 - f_t \in [0,1]$. In the fractional model, the window sum $S_t$ is the sum of the last $L$ mass terms, each of which lies in $[0,1]$. Therefore, $0 \le S_t \le L$, which implies $f_t \in [0,1]$ and $q_t \in [0,1]$.
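The following hypothetical Python sketch mirrors the binary model: it counts how many of the last $L$ points were imputed, so $f_t$ and $q_t$ stay in $[0, 1]$ by construction. Names such as `quality_signal` are assumptions for illustration, not plugin code.

```python
from collections import deque

def quality_signal(imputed_flags, L):
    """Binary model sketch: flag is 1 if point t was imputed, 0 if observed.

    Returns q_t = 1 - f_t at each step, where f_t is the imputed
    fraction over the last L points. Illustrative only.
    """
    window = deque(maxlen=L)  # keeps at most the last L flags
    qs = []
    for flag in imputed_flags:
        window.append(flag)
        f_t = sum(window) / L  # 0 <= n_imp_t <= L, so 0 <= f_t <= 1
        qs.append(1.0 - f_t)   # hence 0 <= q_t <= 1
    return qs

# Three consecutive imputed points inside a window of L = 4:
print(quality_signal([0, 0, 1, 1, 1, 0], L=4))
# [1.0, 1.0, 0.75, 0.5, 0.25, 0.25]
```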
Continuing my other comment, a mathematical "proof" really seems out of place in a non-technical-audience blog.
answered in previous comment.
Also, because $q_t$ is always in the range $[0, 1]$, the smoothed statistic $\mathrm{DQ}_t$ is also guaranteed to remain within $[0, 1]$. This follows directly from the standard exponential smoothing recurrence, where the new value is a convex combination (a weighted average with nonnegative weights that sum to 1, so it lies between its inputs) of the previous smoothed value $\mathrm{DQ}_{t-1}$ and the current observation $q_t$ (specifically $(1-\lambda)\mathrm{DQ}_{t-1}+\lambda q_t$ with $0 < \lambda < 1$), ensuring it never leaves the bounds defined by the input signal (see Wikipedia's article on ["Convex combination"](https://en.wikipedia.org/wiki/Convex_combination#:~:text=As%20a%20particular%20example%2C%20every,1)).

During a sustained period of missing data, as $f_t$ trends up, $q_t$ trends down, and $\mathrm{DQ}_t$ follows suit, decreasing smoothly. Conversely, when real data returns and $f_t$ trends down, $q_t$ trends up, and $\mathrm{DQ}_t$ reliably recovers towards 1. This ensures the gating mechanism, which relies on these signals, is stable and responds to persistent changes in data quality rather than short-term noise.
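As a hypothetical illustration of the recurrence $\mathrm{DQ}_t=(1-\lambda)\mathrm{DQ}_{t-1}+\lambda q_t$, the sketch below shows $\mathrm{DQ}_t$ decaying smoothly during a simulated outage and recovering when data returns. The value of $\lambda$ here is an assumption; the plugin's actual constant may differ.

```python
def smooth_quality(q_values, lam=0.2, dq_init=1.0):
    """Exponential smoothing: DQ_t = (1 - lam) * DQ_{t-1} + lam * q_t.

    Each step is a convex combination of values in [0, 1], so DQ_t
    can never leave [0, 1]. Illustrative sketch only.
    """
    dq, history = dq_init, []
    for q in q_values:
        dq = (1 - lam) * dq + lam * q
        history.append(round(dq, 3))
    return history

# q_t drops to 0 during a simulated outage, then recovers to 1:
print(smooth_quality([1, 1, 0, 0, 0, 1, 1]))
# [1.0, 1.0, 0.8, 0.64, 0.512, 0.61, 0.688]
```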
## System architecture
Here we get back into the practical nature of a scalable design, but I almost missed it skimming through math I didn't quite understand despite being a math major :)
Feel free to comment on how to make the math more understandable :)
## Conclusion
Klarna’s experience underscored a simple but easily overlooked truth: in real-world monitoring, **“no data” is sometimes the most important data point of all**. By treating silent intervals as a first-class signal rather than a gap to ignore, we were able to close a blind spot where critical outages could otherwise slip by undetected.
I thought about commenting near the top when "no data" was first introduced, but it's really just too vague a term and I'm not sure it should be the key word you focus on. There are really multiple types of "no data":
- data missing because you didn't collect it (not the data's problem, truly missing and we shouldn't impute anything)
- data missing because that's the anomaly itself (the focus here, possibly could mention survivorship bias and the need to impute "something")
- data present that's just the baseline (treated as "no anomaly" and provides useful data in a Bayesian context)
I think the attempt to squeeze too much out of "no data" doesn't address the complexities of data quality well enough.
I tried to distinguish the first two cases you mentioned in the Solution part. Let me know what you think.
Signed-off-by: kaituo <[email protected]>
dbwiddis left a comment
LGTM! Thanks for addressing my observations!
@kolchfa-aws We are ready for Doc review.
Signed-off-by: kaituo <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
natebower left a comment
Editorial review
Signed-off-by: Nathan Bower <[email protected]>
natebower left a comment
@pajuric Please hold off on publishing for now, as we’re waiting for approval from the customer’s leadership team.
Signed-off-by: Kaituo Li <[email protected]>
@natebower I updated the photo link. Do you mind approving again?
natebower left a comment
LGTM
Per @kaituo - the blog is still on hold pending customer approval.
Description
This PR contains the content for publishing the blog on the AD imputation customer success story.
Issues Resolved
Closes #4005
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.