-
Notifications
You must be signed in to change notification settings - Fork 470
feat(profiling): make Profile upload lock-free #15302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(profiling): make Profile upload lock-free #15302
Conversation
|
|
Bootstrap import analysisComparison of import times between this PR and base. SummaryThe average import time from this PR is: 256 ± 5 ms. The average import time from base is: 262 ± 5 ms. The import time difference between this PR and base is: -5.6 ± 0.2 ms. Import time breakdownThe following import paths have shrunk:
|
Performance SLOsComparing candidate kowalski/reduce-locking-on-profile-object (a201ec6) with baseline main (67a9db7) ❌ Test Failures (1 suite)❌ telemetryaddmetric - 29/30✅ 1-count-metric-1-timesTime: ✅ 3.279µs (SLO: <20.000µs 📉 -83.6%) vs baseline: 📈 +10.2% Memory: ✅ 34.524MB (SLO: <35.500MB -2.7%) vs baseline: +4.7% ✅ 1-count-metrics-100-timesTime: ✅ 200.919µs (SLO: <220.000µs -8.7%) vs baseline: ~same Memory: ✅ 34.505MB (SLO: <35.500MB -2.8%) vs baseline: +4.8% ✅ 1-distribution-metric-1-timesTime: ✅ 3.628µs (SLO: <20.000µs 📉 -81.9%) vs baseline: +9.2% Memory: ✅ 34.524MB (SLO: <35.500MB -2.7%) vs baseline: +4.9% ❌ 1-distribution-metrics-100-timesTime: ❌ 221.239µs (SLO: <220.000µs +0.6%) vs baseline: +0.8% Memory: ✅ 34.524MB (SLO: <35.500MB -2.7%) vs baseline: +4.9% ✅ 1-gauge-metric-1-timesTime: ✅ 2.171µs (SLO: <20.000µs 📉 -89.1%) vs baseline: -1.2% Memory: ✅ 34.505MB (SLO: <35.500MB -2.8%) vs baseline: +4.6% ✅ 1-gauge-metrics-100-timesTime: ✅ 137.239µs (SLO: <150.000µs -8.5%) vs baseline: +0.8% Memory: ✅ 34.485MB (SLO: <35.500MB -2.9%) vs baseline: +4.7% ✅ 1-rate-metric-1-timesTime: ✅ 3.098µs (SLO: <20.000µs 📉 -84.5%) vs baseline: -0.8% Memory: ✅ 34.465MB (SLO: <35.500MB -2.9%) vs baseline: +4.8% ✅ 1-rate-metrics-100-timesTime: ✅ 214.579µs (SLO: <250.000µs 📉 -14.2%) vs baseline: -0.6% Memory: ✅ 34.465MB (SLO: <35.500MB -2.9%) vs baseline: +4.4% ✅ 100-count-metrics-100-timesTime: ✅ 20.416ms (SLO: <22.000ms -7.2%) vs baseline: +1.1% Memory: ✅ 34.505MB (SLO: <35.500MB -2.8%) vs baseline: +4.9% ✅ 100-distribution-metrics-100-timesTime: ✅ 2.262ms (SLO: <2.300ms 🟡 -1.6%) vs baseline: -0.8% Memory: ✅ 34.505MB (SLO: <35.500MB -2.8%) vs baseline: +4.7% ✅ 100-gauge-metrics-100-timesTime: ✅ 1.414ms (SLO: <1.550ms -8.8%) vs baseline: +0.6% Memory: ✅ 34.524MB (SLO: <35.500MB -2.7%) vs baseline: +5.0% ✅ 100-rate-metrics-100-timesTime: ✅ 2.208ms (SLO: <2.550ms 📉 -13.4%) vs baseline: ~same Memory: ✅ 34.583MB (SLO: <35.500MB -2.6%) vs baseline: +4.9% ✅ flush-1-metricTime: ✅ 4.889µs (SLO: <20.000µs 📉 -75.6%) vs baseline: +6.3% Memory: ✅ 34.544MB (SLO: <35.500MB -2.7%) vs baseline: +4.8% ✅ flush-100-metricsTime: ✅ 176.180µs (SLO: <250.000µs 📉 -29.5%) vs baseline: +0.5% Memory: ✅ 34.564MB (SLO: <35.500MB -2.6%) vs baseline: +5.2% ✅ flush-1000-metricsTime: ✅ 2.122ms (SLO: <2.500ms 📉 -15.1%) vs baseline: ~same Memory: ✅ 35.370MB (SLO: <36.500MB -3.1%) vs baseline: +5.1% 🟡 Near SLO Breach (2 suites)🟡 flasksimple - 18/18✅ appsec-getTime: ✅ 4.606ms (SLO: <4.750ms -3.0%) vs baseline: +0.4% Memory: ✅ 63.903MB (SLO: <66.500MB -3.9%) vs baseline: +4.6% ✅ appsec-postTime: ✅ 6.634ms (SLO: <6.750ms 🟡 -1.7%) vs baseline: +0.3% Memory: ✅ 64.399MB (SLO: <66.500MB -3.2%) vs baseline: +5.0% ✅ appsec-telemetryTime: ✅ 4.593ms (SLO: <4.750ms -3.3%) vs baseline: -0.1% Memory: ✅ 64.067MB (SLO: <66.500MB -3.7%) vs baseline: +5.0% ✅ debuggerTime: ✅ 1.856ms (SLO: <2.000ms -7.2%) vs baseline: ~same Memory: ✅ 47.950MB (SLO: <49.500MB -3.1%) vs baseline: +4.9% ✅ iast-getTime: ✅ 1.856ms (SLO: <2.000ms -7.2%) vs baseline: -0.3% Memory: ✅ 44.507MB (SLO: <49.000MB -9.2%) vs baseline: +5.2% ✅ profilerTime: ✅ 1.929ms (SLO: <2.100ms -8.1%) vs baseline: +0.1% Memory: ✅ 48.378MB (SLO: <50.000MB -3.2%) vs baseline: +4.3% ✅ resource-renamingTime: ✅ 3.382ms (SLO: <3.650ms -7.3%) vs baseline: +0.2% Memory: ✅ 54.828MB (SLO: <56.000MB -2.1%) vs baseline: +5.9% ✅ tracerTime: ✅ 3.373ms (SLO: <3.650ms -7.6%) vs baseline: +0.3% Memory: ✅ 54.480MB (SLO: <56.500MB -3.6%) vs baseline: +5.3% ✅ tracer-nativeTime: ✅ 3.372ms (SLO: <3.650ms -7.6%) vs baseline: +0.4% Memory: ✅ 54.421MB (SLO: <60.000MB -9.3%) vs baseline: +5.0% 🟡 otelspan - 22/22✅ add-eventTime: ✅ 39.332ms (SLO: <47.150ms 📉 -16.6%) vs baseline: +1.8% Memory: ✅ 39.027MB (SLO: <47.000MB 📉 -17.0%) vs baseline: +5.2% ✅ add-metricsTime: ✅ 256.691ms (SLO: <344.800ms 📉 -25.6%) vs baseline: ~same Memory: ✅ 43.294MB (SLO: <47.500MB -8.9%) vs baseline: +4.9% ✅ add-tagsTime: ✅ 316.980ms (SLO: <321.000ms 🟡 -1.3%) vs baseline: +1.0% Memory: ✅ 43.313MB (SLO: <47.500MB -8.8%) vs baseline: +4.9% ✅ get-contextTime: ✅ 78.786ms (SLO: <92.350ms 📉 -14.7%) vs baseline: -0.1% Memory: ✅ 39.255MB (SLO: <46.500MB 📉 -15.6%) vs baseline: +4.9% ✅ is-recordingTime: ✅ 36.203ms (SLO: <44.500ms 📉 -18.6%) vs baseline: ~same Memory: ✅ 38.713MB (SLO: <47.500MB 📉 -18.5%) vs baseline: +4.3% ✅ record-exceptionTime: ✅ 56.961ms (SLO: <67.650ms 📉 -15.8%) vs baseline: -0.2% Memory: ✅ 39.391MB (SLO: <47.000MB 📉 -16.2%) vs baseline: +4.8% ✅ set-statusTime: ✅ 42.467ms (SLO: <50.400ms 📉 -15.7%) vs baseline: +0.6% Memory: ✅ 38.827MB (SLO: <47.000MB 📉 -17.4%) vs baseline: +4.8% ✅ startTime: ✅ 35.500ms (SLO: <43.450ms 📉 -18.3%) vs baseline: +1.0% Memory: ✅ 38.717MB (SLO: <47.000MB 📉 -17.6%) vs baseline: +4.6% ✅ start-finishTime: ✅ 82.876ms (SLO: <88.000ms -5.8%) vs baseline: +1.4% Memory: ✅ 36.648MB (SLO: <46.500MB 📉 -21.2%) vs baseline: +5.1% ✅ start-finish-telemetryTime: ✅ 83.130ms (SLO: <89.000ms -6.6%) vs baseline: ~same Memory: ✅ 36.530MB (SLO: <46.500MB 📉 -21.4%) vs baseline: +4.5% ✅ update-nameTime: ✅ 37.347ms (SLO: <45.150ms 📉 -17.3%) vs baseline: ~same Memory: ✅ 38.910MB (SLO: <47.000MB 📉 -17.2%) vs baseline: +4.8%
|
9adf42e to
4c297d4
Compare
4c297d4 to
1d245aa
Compare
0adb927 to
a201ec6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.

Description
https://datadoghq.atlassian.net/browse/PROF-13044
This PR reduces locking around the libdatadog Profile object to make uploading non-blocking for sampling. Previously, the Scheduler thread (which uploads every once in a while) would take the Lock on the Profile object, preventing any new Samples from being pushed to it until the upload finishes. Although we don't have any numbers on that, it is possible that this would trigger adaptive sampling or make us lose some data.
The new implementation doesn't make uploading completely lock-free, but already improves it by eliminating the need for locking during the upload. We still need to lock the
Profileobject when we encode it. It may be possible to optimise this further by having twoProfileobjects (one we would be writing to, and one that would be being uploaded) but this requires more complicated lock handling to actually be safe. (Reason is, having two Profiles is still a problem if the previous encoding/uploading hasn't finished when the next needs to be encoded/uploaded.)In the new implementation, the Profile Lock is acquired just before encoding is done, and released right after. Profiler Stats are also copied and reset while the Lock is held.
In order to make that work, I had to slightly adapt the interface of
Uploaderfor it to work anEncodedProfileinstead of aProfile, which allowed me to hoist the encoding outside of theUploader(to theUploaderBuilder), and thus only hold the Profile Lock duringUploaderBuilder::Build(as opposed to during the whole ofUploader::upload).ProfilerStatsare captured and reset bystd::swap-ping them with an emptyProfilerStatsobject, eliminating the previous "singleton-like"ProfilerStatswe had. (This works and is trivial becauseProfilerStatsis a C++ object; we cannot do the same for theProfileitself, as it is managed by the Rust library.)Uploader's interface has also slightly changed – it is now constructed with theProfilerStatsandEncodedProfiledirectly (not taken as argument forupload).