Commit b327ca9

Update RFC with more info and address feedback
1 parent a463510 commit b327ca9

1 file changed, +192 -43 lines changed

text/0000-build-observability.md (+192 -43)

@@ -12,10 +12,10 @@
 # Summary
 [summary]: #summary
 
-This RFC proposes leveraging [OpenTelemetry](https://opentelemetry.io/) to
-grant platform operators and buildpack operators more insight into buildpack
+This RFC proposes leveraging [OpenTelemetry](https://opentelemetry.io/) to
+grant platform operators and buildpack operators more insight into buildpack
 performance and behavior. This RFC describes new opt-in functionality
-for both pack and the buildpack spec such that OpenTelemetry data may be
+for both pack and the buildpack spec such that OpenTelemetry data may be
 exported to the build file system.
 
 # Definitions
@@ -29,17 +29,20 @@ exported to the build file system.
 # Motivation
 [motivation]: #motivation
 
-Buildpack authors and platform operators desire insight into usage and
-performance of builds and buildpacks on their platform. Questions like
-"How long does each buildpack compile phase take?", "Which buildpacks
-commonly fail to compile?", "How often is a certain buildpack used?",
-"Which versions of Go are being installed?", and "How long does it take to
-download node_modules?" are important questions for authors and operators that
-are currently difficult to answer.
+Buildpack authors and platform operators want insight into the usage, error
+scenarios, and performance of builds and buildpacks on their platform. The
+following questions are important to them, but currently difficult to answer:
+
+- "Which buildpacks commonly fail to compile?"
+- "How often does a particular error scenario occur?"
+- "How long does each buildpack compile phase take?"
+- "How often is a certain buildpack used?"
+- "Which versions of Go are being installed?"
+- "How long does it take to download node_modules?"
 
 Instrumenting lifecycle and buildpacks with opt-in OpenTelemetry tracing will
-allow platform operators to better understand performance and behavior of their
-builds and buildpacks and as a result, provide better service and build
+allow platform operators to better understand the performance and behavior of their
+builds and buildpacks and, as a result, provide better service and build
 experiences.
 
 To protect privacy and prevent unnecessary collection of data, this
@@ -50,8 +53,8 @@ functionality should be optional and anonymous.
 
 This RFC aims to provide a solution for two types of OpenTelemetry traces:
 
-1) Lifecycle tracing: Buildpack-agnostic trace data like which buildpacks were
-available, which buildpacks were detected, how long the detect, build, or
+1) Lifecycle tracing: Buildpack-agnostic trace data like which buildpacks were
+available, which buildpacks were detected, how long the detect, build, or
 export phase took, and so on. This telemetry data may be exported by lifecycle.
 2) Buildpack tracing: Telemetry data specific to a buildpack like how long it
 took to download a language binary, which language version was selected, and so
@@ -61,20 +64,21 @@ Though the sources and contents of the telemetry data differ, both types may
 be emitted to the build file system in OpenTelemetry's [File Exporter
 Format](https://opentelemetry.io/docs/specs/otel/protocol/file-exporter/).
 
+In this solution, each lifecycle phase would write a `.jsonl` file with
+tracing data for that phase. For example, `lifecycle detector --telemetry`
+would write to `/cnb/telemetry/lifecycle-detect.jsonl`. Additionally, each
+buildpack may write tracing data to its own `.jsonl` file (at
+`/cnb/telemetry/{BUILDPACK_ID}.jsonl`).
 
-For example, `lifecycle detector --telemetry` might save a file like this:
-
-```json
-{"resourceSpans":[{"resource":{"attributes":[{"key":"lifecycle.version","value":{"stringValue":"0.17.1"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"buildpack-detect","startTimeUnixNano":"1581452772000000321","endTimeUnixNano":"1581452773000000789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000000123","name":"detect-pass"}],"attributes":[{"key":"buildpack-id","value":{"stringValue":"heroku/nodejs-engine"}}],"droppedAttributesCount":2,"droppedEventsCount":1}]}]}]}
-{ // additional spans... // }
-```
-
-And a buildpack's compile phase might save a file like this:
+These `.jsonl` files may be read by platform operators for consumption,
+transformation, enrichment, and/or export to an OpenTelemetry backend. Given
+that builds may crash or fail at any point, these files must be written to
+frequently to prevent data loss.
 
-```json
-{"resourceSpans":[{"resource":{"attributes":[{"key":"buildpack.version","value":{"stringValue":"1.0.0"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"install-nodejs","startTimeUnixNano":"1581452772000001321","endTimeUnixNano":"1581452773000004789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000002123","name":"restored-from-cache"}],"attributes":[{"key":"nodejs.version","value":{"stringValue":"20.0.0"}}]}]}]}]}
-{ // additional spans... // }
-```
+Platform operators will likely want to view or analyze this data. These
+telemetry files use an OTLP-compatible format, so they may be exported to one or
+more OpenTelemetry backends like Honeycomb, Prometheus, and [many
+others](https://opentelemetry.io/ecosystem/vendors/).
 
 
 # How it Works
@@ -84,39 +88,159 @@ And a buildpack's compile phase might save a file like this:
 
 If `lifecycle` is provided the telemetry opt-in flag (such as `--telemetry`),
 `lifecycle` phases (such as `detect`, `build`, `export`) may emit an
-OpenTelemetry File Export with tracing data to a known location, such as
+OpenTelemetry File Export with tracing data to a known location, such as
 `/cnb/telemetry/lifecycle-detect.jsonl` with records like this (pretty-printed here; in the file, each record occupies a single line):
 
 ```json
-{"resourceSpans":[{"resource":{"attributes":[{"key":"lifecycle.version","value":{"stringValue":"0.17.1"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"buildpack-detect","startTimeUnixNano":"1581452772000000321","endTimeUnixNano":"1581452773000000789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000000123","name":"detect-pass"}],"attributes":[{"key":"buildpack-id","value":{"stringValue":"heroku/nodejs-engine"}}],"droppedAttributesCount":2,"droppedEventsCount":1}]}]}]}
-{ // additional spans... // }
+{
+  "resourceSpans": [
+    {
+      "resource": {
+        "attributes": [
+          {
+            "key": "lifecycle.version",
+            "value": {
+              "stringValue": "0.17.1"
+            }
+          }
+        ]
+      },
+      "scopeSpans": [
+        {
+          "scope": {},
+          "spans": [
+            {
+              "traceId": "",
+              "spanId": "",
+              "parentSpanId": "",
+              "name": "buildpack-detect",
+              "startTimeUnixNano": "1581452772000000321",
+              "endTimeUnixNano": "1581452773000000789",
+              "droppedAttributesCount": 2,
+              "events": [
+                {
+                  "timeUnixNano": "1581452773000000123",
+                  "name": "detect-pass"
+                }
+              ],
+              "attributes": [
+                {
+                  "key": "buildpack-id",
+                  "value": {
+                    "stringValue": "heroku/nodejs-engine"
+                  }
+                }
+              ],
+              "droppedEventsCount": 1
+            }
+          ]
+        }
+      ]
+    }
+  ]
+}
 ```
 
 
 ### Buildpack telemetry files
 
 During a buildpack's `detect` or `build` execution, a buildpack may emit
 an OpenTelemetry File Export with tracing data to `/cnb/telemetry/{BUILDPACK_ID}.jsonl`
-with contents like this:
+with records like this (pretty-printed here; in the file, each record occupies a single line):
 
 ```json
-{"resourceSpans":[{"resource":{"attributes":[{"key":"lifecycle.version","value":{"stringValue":"0.17.1"}}]},"scopeSpans":[{"scope":{},"spans":[{"traceId":"","spanId":"","parentSpanId":"","name":"buildpack-detect","startTimeUnixNano":"1581452772000000321","endTimeUnixNano":"1581452773000000789","droppedAttributesCount":1,"events":[{"timeUnixNano":"1581452773000000123","name":"detect-pass"}],"attributes":[{"key":"buildpack-id","value":{"stringValue":"heroku/nodejs-engine"}}],"droppedAttributesCount":2,"droppedEventsCount":1}]}]}]}
-{ // additional spans... // }
+{
+  "resourceSpans": [
+    {
+      "resource": {
+        "attributes": [
+          {
+            "key": "buildpack.version",
+            "value": {
+              "stringValue": "1.0.0"
+            }
+          }
+        ]
+      },
+      "scopeSpans": [
+        {
+          "scope": {},
+          "spans": [
+            {
+              "traceId": "",
+              "spanId": "",
+              "parentSpanId": "",
+              "name": "install-nodejs",
+              "startTimeUnixNano": "1581452772000001321",
+              "endTimeUnixNano": "1581452773000004789",
+              "droppedAttributesCount": 1,
+              "events": [
+                {
+                  "timeUnixNano": "1581452773000002123",
+                  "name": "restored-from-cache"
+                }
+              ],
+              "attributes": [
+                {
+                  "key": "nodejs.version",
+                  "value": {
+                    "stringValue": "20.0.0"
+                  }
+                }
+              ],
+              "droppedEventsCount": 1
+            }
+          ]
+        }
+      ]
+    }
+  ]
+}
 ```
 
 ### Lifetime
 
-The telemetry files may be written at any point during the build. They should
-exist as a part of the build file system for the duration of the build.
-Telemetry files will not be included in the final image.
+Telemetry files may be written at any point during the build, so that data is
+persisted even when detection fails, a build fails, the process is terminated,
+or a crash occurs. The `jsonl` format allows telemetry libraries to safely
+append additional JSON objects to the end of a telemetry file, so telemetry
+data can be flushed to the file frequently. Telemetry files should not be
+truncated or deleted, so that a platform can process telemetry during or
+after a build. Telemetry files should not be included in the build result,
+as they are not relevant to it and would likely increase image size and
+harm reproducibility.
 
 ### Access
 
-The telemetry files should remain readable so that they may be analyzed by
-the user and/or platform. However, they should be write protected in some way to prevent
-malicious buildpacks from injecting tracing data into other buildpack's
-telemetry file.
+The telemetry files should be readable so that they may be analyzed by
+the user and/or platform. However, they should be write-protected
+to prevent malicious buildpacks from injecting tracing data into other
+buildpacks' or lifecycle's telemetry files.
+
+
+### Consumption
+
+This RFC leaves the consumption of telemetry files to the platform operator.
+Platform operators choosing to use this telemetry need to read the files during
+or after the build, which can be done with existing OpenTelemetry libraries.
+Platform operators may optionally enrich or modify the tracing data as they
+see fit (for example, with data like `instance_id` or `build_id`). Platform
+operators will likely want to export this data to an OpenTelemetry backend for
+persistence and analysis, which may also be done with existing OpenTelemetry
+libraries.
 
+### Viewing and Analyzing
+
+Once the lifecycle and buildpack traces are exported to an OpenTelemetry
+backend, platform operators should be able to (depending on the features of the
+backend):
+
+- View the complete trace for a build
+- View or query attributes attached to spans (e.g. `buildpack-id`,
+  `nodejs.version`)
+- View or query span durations
+- View or query error types and/or messages
+- and more
 
 # Migration
 [migration]: #migration
@@ -142,6 +266,13 @@ design:
 usernames, IP addresses, etc.), so the telemetry data emitted by `lifecycle`
 will also be free of user-identifiable data.
 
+### File Export Format Status
+
+While the [File Exporter
+Format](https://opentelemetry.io/docs/specs/otel/protocol/file-exporter/) is
+an official format, and matches the OTLP format nearly exactly (and thus seems
+unlikely to change), its status is currently listed as experimental.
+
 # Alternatives
 [alternatives]: #alternatives
 
@@ -154,6 +285,28 @@ provide statistical information in aggregate. Since `lifecycle` and `pack`
 only run one build at a time, there is no way to aggregate information about
 multiple builds in `pack` or `lifecycle`.
 
+### OTLP
+
+The [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/specs/otlp/) is a
+network delivery protocol for OpenTelemetry data. Instead of emitting files as
+this RFC describes, lifecycle and buildpacks could connect to an
+OpenTelemetry collector provided by the platform operator. This pattern is
+well supported and well known.
+
+However, there are drawbacks:
+
+- In local `pack build` scenarios, it's unlikely that users would have an
+  OpenTelemetry collector running. The solution in this RFC does not require a
+  collector.
+- lifecycle and buildpacks would need to know where the OpenTelemetry collector
+  is and how to authenticate with it. Lifecycle and buildpacks that wish to
+  emit telemetry may not want to deal with the considerable configuration needed to
+  support various collectors.
+- Platform operators may have complex network topologies that make supporting
+  this feature challenging (e.g. a firewall between lifecycle and the collector
+  could cause telemetry failures that are perceived as lifecycle malfunctions).
+
+There is an [RFC for this alternative](https://github.com/buildpacks/rfcs/pull/300).
 
 # Prior Art
 [prior-art]: #prior-art
@@ -177,10 +330,6 @@ Discuss prior art, both the good and bad.
 shouldn't be a part of the build result image.
 
 
-- What parts of the design do you expect to be resolved before this gets merged?
-- What parts of the design do you expect to be resolved through implementation of the feature?
-- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?
-
 # Spec. Changes (OPTIONAL)
 [spec-changes]: #spec-changes
 