
feat: compute tbs entry size before writing and flush to avoid limit error #12120

Merged: 26 commits into elastic:main on Jan 18, 2024

Conversation

kruskall
Member

@kruskall kruskall commented Nov 29, 2023

Motivation/summary

Keep track of pending writes in each ReadWriter, and use an atomic counter for the overall size of pending writes in the backing storage.
This catches storage issues earlier and avoids losing events.
The size of an entry is estimated by hardcoding values from badger's internal code.

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

Set up one APM server with TBS enabled and a low storage limit.
Configure one APM agent to send transactions containing spans at high throughput.
Before this fix, traces may look incomplete, with spans missing from the waterfall.
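A minimal sketch of the server-side setup for the steps above (key names follow the documented apm-server tail-based sampling settings; the values are illustrative):

```yaml
apm-server:
  sampling:
    tail:
      enabled: true
      # Deliberately low limit to exercise the storage-limit path.
      storage_limit: 1MB
      policies:
        - sample_rate: 0.1
```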

Related issues

Closes #11857

Contributor

mergify bot commented Nov 29, 2023

This pull request does not have a backport label. Could you fix it @kruskall? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the digit.

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Nov 29, 2023
x-pack/apm-server/sampling/eventstorage/storage.go — several review threads (outdated, resolved)
err := rw.txn.Commit()
rw.txn = rw.s.db.NewTransaction(true)
rw.pendingWrites = 0
rw.pendingSize = defaultPendingSize
Member
Like in ReadWriter.writeEntry, this should be coordinated across all ReadWriters.

Member Author

I've changed this a bit.

IIUC, when flushing we're just committing the current ReadWriter's transaction, which is independent from the other ReadWriters. So we need to track both the total pendingSize of the underlying storage (the sum of all ReadWriters' pendingSizes) and the per-ReadWriter pendingSize.

When a ReadWriter flushes we reset its current pendingSize and subtract it from the storage pendingSize.

txn: s.db.NewTransaction(true),
s: s,
txn: s.db.NewTransaction(true),
pendingSize: defaultPendingSize,
Member

Instead of defaulting to 1024, should we perhaps use s.db.Size()? And rather than calculating pendingSize, maybe we could track the remaining capacity? i.e. StorageLimitInBytes - Size()

Member Author

I've removed the default 1024. For tracking, I'd prefer to use pendingSize, but I have no strong opinion on it.

@kruskall kruskall marked this pull request as ready for review December 5, 2023 21:19
@kruskall kruskall requested a review from a team as a code owner December 5, 2023 21:19
@kruskall kruskall requested a review from axw December 5, 2023 21:19
@simitt (Contributor)

simitt commented Jan 4, 2024

@kruskall I moved this to ms-8.13; can you please pick this up again? The PR looks stale.

@carsonip carsonip left a comment (Member)

The correctness of the code is becoming harder to reason about. Maybe refactor it a bit?

@kruskall kruskall requested a review from carsonip January 11, 2024 12:30
@carsonip carsonip left a comment (Member)

I've written a test for the error scenario:

func TestStorageLimitWritesPersisted(t *testing.T) {
	// Ensure that if WriteTraceEvent does not return an error,
	// the event is eventually written to storage.
	const storageLimitInBytes = int64(10000)
	tempdir := t.TempDir()
	opts := func() badger.Options {
		opts := badgerOptions()
		opts = opts.WithInMemory(false)
		opts = opts.WithDir(tempdir).WithValueDir(tempdir)
		return opts
	}

	// Open and close the database to create a non-empty value log file,
	// which will cause writes below to fail due to the storage limit being
	// exceeded. We would otherwise have to rely on Badger's one minute
	// timer to refresh the size.
	db := newBadgerDB(t, opts)
	db.Close()
	db = newBadgerDB(t, opts)

	store := eventstorage.New(db, eventstorage.ProtobufCodec{})

	rw1 := store.NewReadWriter()
	defer rw1.Close()

	rw2 := store.NewReadWriter()
	defer rw2.Close()

	write := func(rw *eventstorage.ReadWriter) error {
		traceID := uuid.Must(uuid.NewV4()).String()
		transactionID := uuid.Must(uuid.NewV4()).String()
		transaction := modelpb.APMEvent{Transaction: &modelpb.Transaction{Id: transactionID}}
		err := rw.WriteTraceEvent(traceID, transactionID, &transaction, eventstorage.WriterOpts{
			TTL:                 time.Minute,
			StorageLimitInBytes: storageLimitInBytes,
		})
		return err
	}

	var (
		rw1Written, rw2Written int
		wg                     sync.WaitGroup
	)
	wg.Add(2)
	go func() {
		for {
			err := write(rw1)
			if err != nil {
				assert.ErrorIs(t, err, eventstorage.ErrLimitReached)
				break
			}
			rw1Written++
		}
		wg.Done()
	}()

	go func() {
		for {
			err := write(rw2)
			if err != nil {
				assert.ErrorIs(t, err, eventstorage.ErrLimitReached)
				break
			}
			rw2Written++
		}
		wg.Done()
	}()
	wg.Wait()

	getEventCount := func() (count int) {
		opts := badger.DefaultIteratorOptions
	txn := db.NewTransaction(false) // read-only transaction for iteration
	defer txn.Discard()
	iter := txn.NewIterator(opts)
	defer iter.Close()
		for iter.Rewind(); iter.Valid(); iter.Next() {
			item := iter.Item()
			if item.IsDeletedOrExpired() {
				continue
			}
			switch item.UserMeta() {
			case 'e':
				count++
			default:
				// Unknown entry meta: ignore.
				continue
			}
		}
		return
	}

	assert.Equal(t, rw1Written+rw2Written, getEventCount())
}

this will fail before the PR and will pass with the fix:

=== RUN   TestStorageLimitWritesPersisted
    storage_test.go:405: 
        	Error Trace:	/home/carson/projects/apm-server/x-pack/apm-server/sampling/eventstorage/storage_test.go:405
        	Error:      	Not equal: 
        	            	expected: 5256598
        	            	actual  : 4925758
        	Test:       	TestStorageLimitWritesPersisted
--- FAIL: TestStorageLimitWritesPersisted (68.69s)

Expected :5256598
Actual   :4925758
=== RUN   TestStorageLimitWritesPersisted
--- PASS: TestStorageLimitWritesPersisted (0.04s)
PASS

@kruskall (Member Author)

@carsonip I updated your test to print the error returned by err := write(rw1) and err := write(rw2)

Before this PR:

--- FAIL: TestStorageLimitWritesPersisted (75.54s)
    storage_test.go:336: 
                Error Trace:    /apm-server/x-pack/apm-server/sampling/eventstorage/storage_test.go:336
                                                        /usr/lib/go/src/runtime/asm_amd64.s:1650
                Error:          Received unexpected error:
                                failed to flush pending writes: configured storage limit reached (current: 7816729157, limit: 10000)
                Test:           TestStorageLimitWritesPersisted
                Messages:       configured storage limit reached
    storage_test.go:323: 
                Error Trace:    /apm-server/x-pack/apm-server/sampling/eventstorage/storage_test.go:323
                                                        /usr/lib/go/src/runtime/asm_amd64.s:1650
                Error:          Received unexpected error:
                                failed to flush pending writes: configured storage limit reached (current: 7816729157, limit: 10000)
                Test:           TestStorageLimitWritesPersisted
                Messages:       configured storage limit reached
    storage_test.go:367: 
                Error Trace:    /apm-server/x-pack/apm-server/sampling/eventstorage/storage_test.go:367
                Error:          Not equal: 
                                expected: 28609598
                                actual  : 24648084
                Test:           TestStorageLimitWritesPersisted

After this PR:

--- FAIL: TestStorageLimitWritesPersisted (0.01s)
    storage_test.go:371: 
                Error Trace:    /apm-server/x-pack/apm-server/sampling/eventstorage/storage_test.go:371
                                                        /usr/lib/go/src/runtime/asm_amd64.s:1650
                Error:          Received unexpected error:
                                configured storage limit reached (current: 10063, limit: 10000)
                Test:           TestStorageLimitWritesPersisted
                Messages:       configured storage limit reached
    storage_test.go:358: 
                Error Trace:    /apm-server/x-pack/apm-server/sampling/eventstorage/storage_test.go:358
                                                        /usr/lib/go/src/runtime/asm_amd64.s:1650
                Error:          Received unexpected error:
                                configured storage limit reached (current: 10200, limit: 10000)
                Test:           TestStorageLimitWritesPersisted
                Messages:       configured storage limit reached
    storage_test.go:403: 
FAIL

Note: because we are counting the entrySize before writing it, the "true" current usage size is actually current-entrySize. I updated the error message to show that; here's the output:

With true-current:

--- FAIL: TestStorageLimitWritesPersisted (0.01s)
    storage_test.go:358: 
                Error Trace:    /apm-server/x-pack/apm-server/sampling/eventstorage/storage_test.go:358
                                                        /usr/lib/go/src/runtime/asm_amd64.s:1650
                Error:          Received unexpected error:
                                configured storage limit reached (current: 10200, true-current: 10063, limit: 10000)
                Test:           TestStorageLimitWritesPersisted
                Messages:       configured storage limit reached
    storage_test.go:371: 
                Error Trace:    /apm-server/x-pack/apm-server/sampling/eventstorage/storage_test.go:371
                                                        /usr/lib/go/src/runtime/asm_amd64.s:1650
                Error:          Received unexpected error:
                                configured storage limit reached (current: 10063, true-current: 9926, limit: 10000)
                Test:           TestStorageLimitWritesPersisted
                Messages:       configured storage limit reached
    storage_test.go:403: 
FAIL

I've added log lines to show what is going on:

(rw2)current: 199 | true-current: 62, pending: 179, lsm: 0, vlog: 20, entry: 137
(rw2)current: 336 | true-current: 199, pending: 316, lsm: 0, vlog: 20, entry: 137
(rw2)current: 473 | true-current: 336, pending: 453, lsm: 0, vlog: 20, entry: 137
(rw2)current: 610 | true-current: 473, pending: 590, lsm: 0, vlog: 20, entry: 137
(rw2)current: 747 | true-current: 610, pending: 727, lsm: 0, vlog: 20, entry: 137
(rw2)current: 884 | true-current: 747, pending: 864, lsm: 0, vlog: 20, entry: 137
(rw2)current: 1021 | true-current: 884, pending: 1001, lsm: 0, vlog: 20, entry: 137
(rw2)current: 1158 | true-current: 1021, pending: 1138, lsm: 0, vlog: 20, entry: 137
(rw2)current: 1295 | true-current: 1158, pending: 1275, lsm: 0, vlog: 20, entry: 137
(rw2)current: 1432 | true-current: 1295, pending: 1412, lsm: 0, vlog: 20, entry: 137
(rw2)current: 1569 | true-current: 1432, pending: 1549, lsm: 0, vlog: 20, entry: 137
(rw2)current: 1706 | true-current: 1569, pending: 1686, lsm: 0, vlog: 20, entry: 137
(rw2)current: 1843 | true-current: 1706, pending: 1823, lsm: 0, vlog: 20, entry: 137
(rw1)current: 1980 | true-current: 1843, pending: 1960, lsm: 0, vlog: 20, entry: 137
(rw1)current: 2117 | true-current: 1980, pending: 2097, lsm: 0, vlog: 20, entry: 137
(rw1)current: 2254 | true-current: 2117, pending: 2234, lsm: 0, vlog: 20, entry: 137
(rw1)current: 2391 | true-current: 2254, pending: 2371, lsm: 0, vlog: 20, entry: 137
(rw1)current: 2665 | true-current: 2528, pending: 2645, lsm: 0, vlog: 20, entry: 137
(rw1)current: 2802 | true-current: 2665, pending: 2782, lsm: 0, vlog: 20, entry: 137
(rw2)current: 2528 | true-current: 2391, pending: 2508, lsm: 0, vlog: 20, entry: 137
(rw1)current: 2939 | true-current: 2802, pending: 2919, lsm: 0, vlog: 20, entry: 137
(rw1)current: 3076 | true-current: 2939, pending: 3056, lsm: 0, vlog: 20, entry: 137
(rw1)current: 3213 | true-current: 3076, pending: 3193, lsm: 0, vlog: 20, entry: 137
(rw1)current: 3350 | true-current: 3213, pending: 3330, lsm: 0, vlog: 20, entry: 137
(rw1)current: 3487 | true-current: 3350, pending: 3467, lsm: 0, vlog: 20, entry: 137
(rw1)current: 3624 | true-current: 3487, pending: 3604, lsm: 0, vlog: 20, entry: 137
(rw1)current: 3761 | true-current: 3624, pending: 3741, lsm: 0, vlog: 20, entry: 137
(rw1)current: 3898 | true-current: 3761, pending: 3878, lsm: 0, vlog: 20, entry: 137
(rw1)current: 4035 | true-current: 3898, pending: 4015, lsm: 0, vlog: 20, entry: 137
(rw1)current: 4172 | true-current: 4035, pending: 4152, lsm: 0, vlog: 20, entry: 137
(rw1)current: 4309 | true-current: 4172, pending: 4289, lsm: 0, vlog: 20, entry: 137
(rw1)current: 4583 | true-current: 4446, pending: 4563, lsm: 0, vlog: 20, entry: 137
(rw1)current: 4720 | true-current: 4583, pending: 4700, lsm: 0, vlog: 20, entry: 137
(rw1)current: 4857 | true-current: 4720, pending: 4837, lsm: 0, vlog: 20, entry: 137
(rw1)current: 4994 | true-current: 4857, pending: 4974, lsm: 0, vlog: 20, entry: 137
(rw1)current: 5131 | true-current: 4994, pending: 5111, lsm: 0, vlog: 20, entry: 137
(rw1)current: 5268 | true-current: 5131, pending: 5248, lsm: 0, vlog: 20, entry: 137
(rw1)current: 5405 | true-current: 5268, pending: 5385, lsm: 0, vlog: 20, entry: 137
(rw1)current: 5542 | true-current: 5405, pending: 5522, lsm: 0, vlog: 20, entry: 137
(rw1)current: 5679 | true-current: 5542, pending: 5659, lsm: 0, vlog: 20, entry: 137
(rw1)current: 5816 | true-current: 5679, pending: 5796, lsm: 0, vlog: 20, entry: 137
(rw1)current: 5953 | true-current: 5816, pending: 5933, lsm: 0, vlog: 20, entry: 137
(rw1)current: 6090 | true-current: 5953, pending: 6070, lsm: 0, vlog: 20, entry: 137
(rw2)current: 4446 | true-current: 4309, pending: 4426, lsm: 0, vlog: 20, entry: 137
(rw2)current: 6364 | true-current: 6227, pending: 6344, lsm: 0, vlog: 20, entry: 137
(rw1)current: 6227 | true-current: 6090, pending: 6207, lsm: 0, vlog: 20, entry: 137
(rw2)current: 6501 | true-current: 6364, pending: 6481, lsm: 0, vlog: 20, entry: 137
(rw2)current: 6638 | true-current: 6501, pending: 6618, lsm: 0, vlog: 20, entry: 137
(rw2)current: 6775 | true-current: 6638, pending: 6755, lsm: 0, vlog: 20, entry: 137
(rw2)current: 7049 | true-current: 6912, pending: 7029, lsm: 0, vlog: 20, entry: 137
(rw1)current: 6912 | true-current: 6775, pending: 6892, lsm: 0, vlog: 20, entry: 137
(rw2)current: 7186 | true-current: 7049, pending: 7166, lsm: 0, vlog: 20, entry: 137
(rw2)current: 7460 | true-current: 7323, pending: 7440, lsm: 0, vlog: 20, entry: 137
(rw2)current: 7597 | true-current: 7460, pending: 7577, lsm: 0, vlog: 20, entry: 137
(rw2)current: 7734 | true-current: 7597, pending: 7714, lsm: 0, vlog: 20, entry: 137
(rw2)current: 7871 | true-current: 7734, pending: 7851, lsm: 0, vlog: 20, entry: 137
(rw2)current: 8008 | true-current: 7871, pending: 7988, lsm: 0, vlog: 20, entry: 137
(rw2)current: 8145 | true-current: 8008, pending: 8125, lsm: 0, vlog: 20, entry: 137
(rw1)current: 7323 | true-current: 7186, pending: 7303, lsm: 0, vlog: 20, entry: 137
(rw2)current: 8282 | true-current: 8145, pending: 8262, lsm: 0, vlog: 20, entry: 137
(rw1)current: 8419 | true-current: 8282, pending: 8399, lsm: 0, vlog: 20, entry: 137
(rw2)current: 8556 | true-current: 8419, pending: 8536, lsm: 0, vlog: 20, entry: 137
(rw2)current: 8830 | true-current: 8693, pending: 8810, lsm: 0, vlog: 20, entry: 137
(rw2)current: 8967 | true-current: 8830, pending: 8947, lsm: 0, vlog: 20, entry: 137
(rw2)current: 9104 | true-current: 8967, pending: 9084, lsm: 0, vlog: 20, entry: 137
(rw2)current: 9241 | true-current: 9104, pending: 9221, lsm: 0, vlog: 20, entry: 137
(rw1)current: 8693 | true-current: 8556, pending: 8673, lsm: 0, vlog: 20, entry: 137
(rw1)current: 9515 | true-current: 9378, pending: 9495, lsm: 0, vlog: 20, entry: 137
(rw1)current: 9652 | true-current: 9515, pending: 9632, lsm: 0, vlog: 20, entry: 137
(rw1)current: 9789 | true-current: 9652, pending: 9769, lsm: 0, vlog: 20, entry: 137
(rw1)current: 9926 | true-current: 9789, pending: 9906, lsm: 0, vlog: 20, entry: 137
(rw2)current: 9378 | true-current: 9241, pending: 9358, lsm: 0, vlog: 20, entry: 137
(rw1)FLUSH: 10063 | true-current: 9926, pending: 10043, lsm: 0, vlog: 20, entry: 137
(rw2)FLUSH: 10200 | true-current: 10063, pending: 10180, lsm: 0, vlog: 20, entry: 137

rw2 has a true-current size of 10063. This happens because the two ReadWriters can race when writing an entry: the size is an atomic int and each ReadWriter can write one entry in parallel, so the size can be off by one entry: 9926+137=10063.

I think this is an acceptable tradeoff because of two reasons:

  • we are already being careful when estimating size (see the first flush which flushes before writing the entry because it would overflow the storage)
  • the previous behaviour was way worse and actually incorrect: the buffered writes weren't counted, leading to a huge overuse of storage

@kruskall kruskall requested a review from carsonip January 17, 2024 13:51
@carsonip (Member)

I think this is an acceptable tradeoff because of two reasons:

we are already being careful when estimating size (see the first flush which flushes before writing the entry because it would overflow the storage)

The main reason is that the badger size reconciliation happens once per minute, so that will end up being the greatest source of error (as in inaccuracy) in our storage limit check.

the previous behaviour was way worse and actually incorrect: the buffered writes weren't counted, leading to a huge overuse of storage

Agreed. We are definitely not making things worse here; a critical bug is being fixed. Initially I did not think about the overuse of storage, but when I tested it out last week, I realized it was dangerous.

@carsonip carsonip left a comment (Member)

Looks good. If you could update the PR description and respond to the comments, I think it should be good to go. @axw, do you want to glance over it?

@simitt (Contributor)

simitt commented Jan 18, 2024

@kruskall one last nit from my side: please update the "How to test these changes" section, as it will be crucial for the tester to know, especially if we end up releasing this as a patch release.

@kruskall kruskall added the backport-8.12 Automated backport with mergify label Jan 18, 2024
@kruskall kruskall enabled auto-merge (squash) January 18, 2024 13:12
@mergify mergify bot removed the backport-skip Skip notification from the automated backport with mergify label Jan 18, 2024
@kruskall kruskall merged commit c8b7108 into elastic:main Jan 18, 2024
10 of 11 checks passed
mergify bot pushed a commit that referenced this pull request Jan 18, 2024
…error (#12120)

* feat: compute tbs entry size before writing and flush to avoid limit error

* fix: move pendingSize to storage and make it atomic

* refactor: remove default pending size

* fix: track readwriters pending size and resolve race conditions

* lint: remove unused method

* fix: add base transaction size and add more comments

* test: fix storage limit test

* lint: remove unused ErrLimitReached

* fix: do not add entrySize twice

Co-authored-by: Vishal Raj <[email protected]>

* docs: add pendingSize comment

* lint: fix linter issues

* fix: respect storage limit

* fix: handle 0 storage limit (unlimited)

* fix: flush what we have before discarding the transaction

* fix: do not discard txn twice

* test: fix error message typo

* fix: update pendingSize after flush and refactor for clarity

* docs: update test comment

Co-authored-by: Carson Ip <[email protected]>

* docs: readd db.size comment

---------

Co-authored-by: Vishal Raj <[email protected]>
Co-authored-by: Carson Ip <[email protected]>
(cherry picked from commit c8b7108)
mergify bot added a commit that referenced this pull request Jan 18, 2024
…error (#12120) (#12452)

(cherry picked from commit c8b7108)

Co-authored-by: kruskall <[email protected]>
@carsonip (Member)

carsonip commented Jan 18, 2024

@kruskall this is missing a changelog. Sorry I missed it during the review.

@kruskall kruskall deleted the feat/discarded-txn branch January 19, 2024 21:30
@carsonip (Member)

carsonip commented Feb 1, 2024

Testing notes

✔️ test-plan-ok

Tested with a sampling rate of 0.1 and a 1KB TBS storage limit on two ESS versions: 8.12.0 (affected by the bug) and 8.12.1 (with the bugfix).

Code snippet

Go agent code

package main

import (
	"context"
	"fmt"
	"go.elastic.co/apm/module/apmhttp/v2"
	"go.elastic.co/apm/v2"
	"golang.org/x/net/context/ctxhttp"
	"io/ioutil"
	"net"
	"net/http"
	"time"
)

var tracingClient = apmhttp.WrapClient(http.DefaultClient)

func serverHandler(w http.ResponseWriter, req *http.Request) {
	// Propagate the transaction context contained in req.Context().
	span, _ := apm.StartSpan(req.Context(), "serverHandler", "spantype")
	defer span.End()
	sleep(req.Context())
	resp, err := ctxhttp.Get(req.Context(), tracingClient, "https://httpbin.org/status/200")
	if err != nil {
		apm.CaptureError(req.Context(), err).Send()
		http.Error(w, "failed to query backend", 500)
		return
	}
	defer resp.Body.Close()
	ioutil.ReadAll(resp.Body)
	sleep(req.Context())
	return
}

func sleep(ctx context.Context) {
	span, _ := apm.StartSpan(ctx, "sleepspan", "sleep")
	defer span.End()
	time.Sleep(100 * time.Millisecond)
}

const CONCURRENCY = 10

func main() {
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	for i := 0; i < CONCURRENCY; i++ {
		go func() {
			for {
				//select {
				//case <-time.Tick(3 * time.Second):
				//}
				if _, err := http.Get(fmt.Sprintf("http://%s/", listener.Addr())); err != nil {
					panic(err)
				}
			}
		}()
	}
	if err := http.Serve(listener, apmhttp.Wrap(http.HandlerFunc(serverHandler))); err != http.ErrServerClosed {
		panic(err)
	}
}

Expected outcome

Traces should be complete like this:

[Screenshot: complete trace waterfall (GET / transaction) in the APM UI]

8.12.0 affected by bug

Traces are sometimes incomplete under high throughput.

[Screenshots: incomplete trace waterfalls (GET / transaction) in the APM UI, 8.12.0]

8.12.1 with TBS bug fix

Did not find any incomplete traces after scrolling through 50+ traces.

[Screenshot: complete trace waterfall (GET / transaction) in the APM UI, 8.12.1]


Successfully merging this pull request may close these issues.

Potential missing spans and non-root transactions when TBS storage limit is reached
5 participants