\section{Profiling Infrastructure}
\begin{figure*}
\begin{center}
\includegraphics[scale=0.3]{eventbench.png}
\end{center}
\caption{Synthetic event benchmark}
\label{f:event-bench}
\end{figure*}
Our profiling framework comprises three parts:
\begin{itemize}
\item Support in GHC's runtime for tracing events to a log file at
runtime. The tracing is designed to be as lightweight as possible,
so as not to have any significant impact on the behaviour of the
program being measured.
\item A Haskell library \codef{ghc-events} that can read the trace file
generated by the runtime and build a Haskell data structure
representing the trace.
\item A collection of tools that use the \codef{ghc-events} library to
read and analyze trace files.
\end{itemize}
Having a single trace-file format and a library that parses it means
that it is easy to write a new tool that works with GHC trace files:
just import the \codef{ghc-events} package and write code that uses the
Haskell data structures directly. We have already built several such
tools ourselves, some of which are merely proof-of-concept
experiments, but the \codef{ghc-events} library makes it almost trivial to
create new tools:
\begin{itemize}
\item A simple program that just prints out the (sorted) contents of
the trace file as text. Useful for checking that a trace file can
be parsed, and for examining the exact sequence of events.
\item The ThreadScope graphical viewer.
\item A tool that parses a trace file and generates a PDF format
timeline view, similar to the ThreadScope view.
\item A tool that generates input in the format expected by the
GtkWave circuit waveform viewer. This was used as an early
prototype for ThreadScope, since the timeline view that we want to
display has a lot in common with the waveform diagrams that GtkWave
displays and browses.
\end{itemize}
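As an illustration of how little code such a tool needs, here is a
minimal sketch of the first of these, a textual dump of a trace file.
It assumes the \codef{readEventLogFromFile} entry point and record
layout of the \codef{ghc-events} library; the precise constructor and
field names have changed between versions of the library, so treat the
details as indicative rather than definitive.
\begin{verbatim}
import GHC.RTS.Events
import System.Environment (getArgs)

-- Read a trace file and print one line per event,
-- in file order (sorting omitted for brevity).
main :: IO ()
main = do
  [file] <- getArgs
  result <- readEventLogFromFile file
  case result of
    Left err ->
      putStrLn ("Cannot parse " ++ file ++ ": " ++ err)
    Right (EventLog _header (Data evs)) ->
      mapM_ print evs
\end{verbatim}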
\subsection{Fast runtime tracing}
The runtime system generates trace files that log certain events and
the time at which they occurred. The events are typically those
related to thread activity; for example, ``HEC 0 started to run thread
3'', or ``thread 5 blocked on an MVar''. The kinds of events we can
log are limited only by the extra overhead incurred by the act of
logging them. Minimizing the overhead of event logging is something
we care about: the goal is to profile the actual runtime behaviour of
the program, so it is important that, as far as possible, we avoid
disturbing the behaviour that we are trying to profile.
In the GHC runtime, each HEC writes the events it generates into its
own pre-allocated event buffer. This avoids any dynamic memory
allocation overhead and requires no locks, since the buffers are
HEC-local. It does mean that a buffer has to be flushed to the
filesystem once it becomes full, but because the buffer has a fixed
size, each flush imposes a near-constant penalty and a deterministic
delay on the GHC runtime.
The HEC-local buffers are flushed independently, which means that
events in the log file appear out-of-order and have to be sorted.
Sorting of the events is easily performed by the profiling tool after
reading in the log file using the \codef{ghc-events} library.
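For example, assuming the \codef{Event} type exposed by
\codef{ghc-events} carries its timestamp in a field named
\codef{time} (the field names have varied between versions of the
library), the sort is essentially a one-liner:
\begin{verbatim}
import GHC.RTS.Events
import Data.List (sortBy)
import Data.Ord (comparing)

-- Merge the per-HEC buffer contents into a single
-- global timeline by sorting on the timestamps.
sortByTime :: [Event] -> [Event]
sortByTime = sortBy (comparing time)
\end{verbatim}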
To measure the speed at which the GHC runtime can log events, we used
a C program (no Haskell code, just using the GHC runtime system as a
library) that simply generates 2,000,000 events, alternating between
``thread start'' and ``thread stop'' events. Our program generates a
34MB trace file and runs in 0.31 seconds elapsed time:
\begin{verbatim}
  INIT  time    0.00s  (  0.02s elapsed)
  MUT   time    0.22s  (  0.29s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.22s  (  0.31s elapsed)
\end{verbatim}
which gives a rough figure of 150ns for each event on average.
Looking at the ThreadScope view of this program
(Figure~\ref{f:event-bench}) we can clearly see where the buffer
flushes are happening, and that each one is about 5ms long.
An alternative approach is to use memory-mapped files, and write our
events directly into memory, leaving the actual file writing to the
OS. This would allow writing to be performed asynchronously, which
would hopefully reduce the impact of the buffer flush. According to
\codef{strace} on Linux, the above test program is spending 0.7s
writing buffers, so making this asynchronous would save us about 30ns
per event on average. However, on a 32-bit machine where we can't
afford to reserve a large amount of address space for the whole log
file, we would still have to occasionally flush and remap new portions
of the file. This alternative approach is something we plan to
explore in the future.
% how much impact does this have on runtimes?
To see how much impact event logging has on real execution times, we
took a parallel version of the canonical Fibonacci function,
\codef{parfib}, and compared the time elapsed with and without event
logging enabled for 50 executions of parfib on an Intel(R) Core(TM)2
Duo CPU T5250 1.50GHz, using both cores. The program generates about
2,000,000 events during the run, producing a 40MB log file.
\begin{verbatim}
parfib eventlog
./Main 40 10 +RTS -N2 -l -RTS
Avg Time Elapsed      Standard Deviation
20.582757s            0.789547s

parfib without eventlog
./Main 40 10 +RTS -N2 -RTS
Avg Time Elapsed      Standard Deviation
17.447493s            1.352686s
\end{verbatim}
Considering the significant number of events generated in the traces
and the very detailed profiling information these traces make
available, the overhead is not severe: approximately a 10--25\%
increase in elapsed time. In the case of parfib, the event
representing the creation of a new spark is dominant, comprising at
least 80\% of the events generated. In fact, it is debatable whether
we should be logging the creation of a spark at all, since the cost
of logging this event is likely to be larger than the cost of
creating the spark itself: a spark creation is simply a write into a
circular buffer.
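For reference, \codef{parfib} is essentially the textbook sparked
Fibonacci. The following sketch is indicative only: the depth
threshold and its interpretation are our assumption rather than taken
from the benchmark's source, but the \codef{par} application is what
produces the spark-creation events that dominate the trace.
\begin{verbatim}
import Control.Parallel (par, pseq)
import System.Environment (getArgs)

-- parfib n d: compute fib n, sparking the first
-- recursive call until the depth bound d runs out.
parfib :: Int -> Int -> Int
parfib n d
  | n <= 1    = n
  | d <= 0    = nfib n
  | otherwise = x `par` (y `pseq` (x + y))
  where x = parfib (n-1) (d-1)
        y = parfib (n-2) (d-1)

-- Plain sequential Fibonacci for small subproblems.
nfib :: Int -> Int
nfib n | n <= 1    = n
       | otherwise = nfib (n-1) + nfib (n-2)

main :: IO ()
main = do
  [n, d] <- getArgs
  print (parfib (read n) (read d))
\end{verbatim}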
For parallel quicksort, far fewer sparks are created and most of the
computation is spent in garbage collection; as a result, the overhead
of event tracing is almost unnoticeable. The parallel quicksort
example sorts a list of 100,000 randomly generated integers and was
measured in the same manner as parfib, comparing runs with and
without event logging, except that for this test we performed 100
executions on an Intel(R) Core(TM)2 Quad CPU 3.0GHz.
\begin{verbatim}
parquicksort eventlog
./Main +RTS -N4 -l -RTS
Avg Time Elapsed      Standard Deviation
14.201385s            2.954869s

parquicksort without eventlog
./Main +RTS -N4 -RTS
Avg Time Elapsed      Standard Deviation
15.187529s            3.385293s
\end{verbatim}
Since parallel quicksort spends the majority of the computation doing
useful work, in particular garbage collection of the created lists,
the run produces a trace file of only about 5MB containing nearly
300,000 events, and the overhead of event tracing is not noticeable.
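For comparison, the quicksort benchmark is in essence the familiar
sparked quicksort over lists; the following is a minimal sketch of
such a program rather than the exact code measured above.
\begin{verbatim}
import Control.Parallel (par, pseq)

-- Spark one recursive call and evaluate the other;
-- most of the real cost is building and collecting
-- the intermediate lists, so few spark events appear.
parSort :: Ord a => [a] -> [a]
parSort []     = []
parSort (x:xs) = forceList greater `par`
                 (forceList lesser `pseq`
                  (lesser ++ x : greater))
  where lesser  = parSort [y | y <- xs, y <  x]
        greater = parSort [y | y <- xs, y >= x]

-- Force the spine of a list so the spark does real work.
forceList :: [a] -> ()
forceList []     = ()
forceList (_:xs) = forceList xs
\end{verbatim}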
The crux is that even for a poorly performing program that generates
a large number of events, the overhead of event tracing should not be
devastating to the program's performance, while for a program with
high utilization event tracing should barely affect performance at
all.
\subsection{An extensible file format}
We believe it is essential that the trace file format is both
backwards and forwards compatible, and architecture independent. In
particular, this means that:
\begin{itemize}
\item If you build a newer version of a tool, it will still work with
the trace files you already have, and trace files generated by
programs compiled with older versions of GHC.
\item If you upgrade your GHC and recompile your programs, the trace
files will still work with any profiling tools you already have.
\item Trace files do not have a shelf life. You can keep your trace
files around, safe in the knowledge that they will work with future
versions of profiling tools. Trace files can be archived, and
shared between machines.
\end{itemize}
Nevertheless, we don't expect the form of trace files to remain
completely static. In the future we will certainly want to add new
events, and add more information to existing events. We therefore
need an extensible file format. Informally, our trace files are
structured as follows (a sketch of this layout as Haskell records
follows the list):
\begin{itemize}
\item A list of \emph{event types}. An event-type is a
variable-length structure that describes one kind of event. The
event-type structure contains
\begin{itemize}
\item A unique number for this event type
\item A field describing the length in bytes of an instance of the
event, or zero for a variable-length event.
\item A variable-length string (preceded by its length) describing
this event (for example ``thread created'')
\item A variable-length field (preceded by its length) for future
expansion. We might in the future want to add more fields to
the event-type structure, and this field allows for that.
\end{itemize}
\item A list of \emph{events}. Each event begins with an event number
that corresponds to one of the event types defined earlier, and the
length of the event structure is given by the event type (or it has
variable length). The event also contains
\begin{itemize}
\item A nanosecond-resolution timestamp.
\item For a variable-length event, the length of the event.
\item Information specific to this event, for example which CPU it
occurred on. If the parser knows about this event, then it can
parse the rest of the event's information, otherwise it can skip
over this field because its length is known.
\end{itemize}
\end{itemize}
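The following Haskell records give one rendering of that layout. The
names and field widths are ours and are illustrative only; the real
\codef{ghc-events} types differ in detail.
\begin{verbatim}
import Data.Word (Word16, Word64)
import qualified Data.ByteString as B

-- One entry in the event-type list in the header.
data EventTypeDecl = EventTypeDecl
  { etNum   :: Word16        -- unique event-type number
  , etSize  :: Word16        -- size in bytes; 0 = variable
  , etDesc  :: String        -- e.g. "thread created"
  , etExtra :: B.ByteString  -- reserved for expansion
  }

-- One event in the body of the file.
data RawEvent = RawEvent
  { evNum  :: Word16         -- event-type number
  , evTime :: Word64         -- nanosecond timestamp
  , evBody :: B.ByteString   -- event-specific fields;
  }                          -- length from the declaration,
                             -- or stored inline if variable
\end{verbatim}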
The unique numbers that identify events are shared knowledge between
GHC and the \codef{ghc-events} library. When creating a new event, a new
unique identifier is chosen; identifiers can never be re-used.
Even when parsing a trace file that contains new events, the parser
can still give a timestamp and a description of the unknown events.
The parser might encounter an event-type that it knows about, but the
event-type might contain new unknown fields. The parser can recognize
this situation and skip over the extra fields, because it knows the
length of the event from the event-type structure. Therefore when a
tool encounters a new log file it can continue to provide consistent
functionality.
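In terms of the hypothetical records sketched above, this graceful
degradation amounts to something like:
\begin{verbatim}
import qualified Data.Map as M
import Data.Word (Word16)

-- Render any event, known or not: the timestamp is
-- always available, and the event-type table from the
-- header supplies a description even for numbers this
-- tool was not written to understand.
describeEvent :: M.Map Word16 EventTypeDecl
              -> RawEvent -> String
describeEvent table ev =
  show (evTime ev) ++ ": " ++
  case M.lookup (evNum ev) table of
    Just decl -> etDesc decl
    Nothing   -> "unknown event type " ++ show (evNum ev)
\end{verbatim}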
Of course, there are scenarios in which it isn't possible to provide
this ideal graceful degradation. For example, we might construct a
tool that profiles a particular aspect of the behaviour of the
runtime, and in the future the runtime might be redesigned to behave
in a completely different way, with a new set of events. The old
events simply won't be generated any more, and the old tool won't be
able to display anything useful with the new trace files. Still, we
expect that our extensible trace file format will allow us to smooth
over the majority of forwards- and backwards-compatibility issues that
will arise between versions of the tools and GHC runtime. Moreover,
extensibility costs almost nothing, since the extra fields are all in
the event-types header, which has a fixed size for a given version of
GHC.