TC-BPF (on egress) for the packet capture logic.
## Simple description
Passive Ping (PPing) is a simple tool for passively measuring per-flow RTTs. It
can be used on endhosts as well as on any (BPF-capable Linux) device which can
see both directions of the traffic (e.g. a router or middlebox). Currently it
works for TCP traffic which uses the TCP timestamp option and for ICMP echo
messages, but it could be extended to also work with, for example, TCP seq/ACK
numbers, the QUIC spinbit and DNS queries. See the [TODO-list](./TODO.md) for
more potential features (which may or may not ever get implemented).

The fundamental logic of pping is to timestamp a pseudo-unique identifier for
packets, and then look for matches in the reply packets. If a match is found,
the RTT is simply calculated as the time difference between the current time
and the stored timestamp.
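
As a rough illustration of that logic, here is a minimal sketch in plain C (the
table, hashing and helper names are illustrative stand-ins; the real
implementation keeps its timestamps in a BPF hash map and runs per packet in
the kernel):

```c
#include <stdint.h>

#define SLOTS 1024

/* One stored timestamp per (flow, identifier) pair. */
struct entry {
	uint64_t flow;   /* flow the identifier was seen on */
	uint32_t id;     /* pseudo-unique identifier, e.g. a TCP TSval */
	uint64_t ts_ns;  /* time the identifier was first seen */
	int used;
};

static struct entry table[SLOTS];

/* Timestamp an identifier seen on a packet (first instance only). */
static void timestamp_id(uint64_t flow, uint32_t id, uint64_t now_ns)
{
	unsigned slot = (unsigned)((flow ^ id) % SLOTS);

	if (!table[slot].used)
		table[slot] = (struct entry){ flow, id, now_ns, 1 };
}

/* If a reply identifier matches a timestamp stored for the reverse flow,
 * the RTT is simply now - stored timestamp. Returns -1 on no match. */
static int64_t match_reply(uint64_t rev_flow, uint32_t reply_id,
			   uint64_t now_ns)
{
	unsigned slot = (unsigned)((rev_flow ^ reply_id) % SLOTS);
	struct entry *e = &table[slot];

	if (e->used && e->flow == rev_flow && e->id == reply_id) {
		e->used = 0;            /* each entry matches only once */
		return (int64_t)(now_ns - e->ts_ns);
	}
	return -1;
}
```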

This tool, just as Kathie's original pping implementation, uses TCP timestamps
as identifiers for TCP traffic. The TSval (which is a timestamp in and of
itself) is used as an identifier and timestamped. Reply packets in the reverse
flow are then parsed for the TSecr, which holds the echoed TSval value from the
receiver. The TCP timestamps are not necessarily unique for every packet (they
have a limited update frequency, which appears to be 1000 Hz on modern Linux
systems), so only the first instance of an identifier is timestamped and
matched against the first incoming packet with a matching reply identifier. The
mechanism to ensure only the first packet is timestamped and matched differs
from the one in Kathie's pping, and is further described in
[SAMPLING_DESIGN](./SAMPLING_DESIGN.md).
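
For reference, the TCP timestamp option (kind 8, length 10) carries both
values, and extracting them boils down to a walk over the TCP options. Below is
a minimal userspace-style sketch (the function name is illustrative; the BPF
version additionally has to satisfy the verifier's bounds checks):

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/tcp.h>

/* Walk the TCP options looking for the timestamp option (kind 8, len 10)
 * and pull out TSval and TSecr. Returns 0 on success, -1 otherwise. */
static int parse_tcp_timestamps(const struct tcphdr *tcp,
				uint32_t *tsval, uint32_t *tsecr)
{
	const uint8_t *opt = (const uint8_t *)(tcp + 1);
	const uint8_t *end = (const uint8_t *)tcp + tcp->doff * 4;

	while (opt + 1 < end) {
		uint8_t kind = opt[0];

		if (kind == 0)          /* end-of-options list */
			break;
		if (kind == 1) {        /* NOP padding, single byte */
			opt++;
			continue;
		}
		if (opt[1] < 2 || opt + opt[1] > end)
			return -1;      /* malformed option */

		if (kind == 8 && opt[1] == 10) {
			memcpy(tsval, opt + 2, sizeof(*tsval));
			memcpy(tsecr, opt + 6, sizeof(*tsecr));
			*tsval = ntohl(*tsval);
			*tsecr = ntohl(*tsecr);
			return 0;
		}
		opt += opt[1];          /* skip to next option */
	}
	return -1;                      /* no timestamp option */
}
```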

For ICMP echo, pping uses the echo identifier in place of port numbers, and the
echo sequence number as the identifier to match against. Linux systems will
typically use different echo identifiers for different instances of ping, and
thus each ping instance will be recognized as a separate flow. Windows systems
typically use a static echo identifier, and thus all instances of ping
originating from a particular Windows host to the same target host will be
considered a single flow.
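
To sketch how the ICMP echo fields might slot into the same flow/identifier
scheme (the struct and function names here are illustrative, not pping's actual
ones):

```c
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/ip_icmp.h>

/* Simplified flow tuple; addresses would be filled in from the IP header. */
struct flow_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t proto;
};

/* The echo identifier stands in for the port pair, and the echo sequence
 * number is the identifier to timestamp/match, since an echo reply carries
 * back the same identifier and sequence number as the request. */
static void icmp_echo_to_flow(const struct icmphdr *icmp,
			      struct flow_tuple *ft, uint32_t *id)
{
	uint16_t echo_id = ntohs(icmp->un.echo.id);

	ft->sport = echo_id;
	ft->dport = echo_id;
	*id = ntohs(icmp->un.echo.sequence);
}
```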

## Output formats
pping currently supports 3 different formats, *standard*, *ppviz* and *json*. In
general, the output consists of two different types of events: flow-events which
signal that a flow has opened or closed, and RTT-events which report a
calculated RTT within the flow.

### Standard format
The standard format prints a single line per event.

An example of the format is provided below:
```shell
16:00:46.142279766 TCP 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from dest
16:00:46.147705205 5.425439 ms 5.425439 ms TCP 10.11.1.1:5201+10.11.1.2:59528
16:00:47.148905125 5.261430 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
16:00:48.151666385 5.972284 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
16:00:49.152489316 6.017589 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
16:00:49.878508114 TCP 10.11.1.1:5201+10.11.1.2:59528 closing due to RST from dest
```

### ppviz format

### json format

An example of a (pretty-printed) flow-event is provided below:
8997 "protocol" : " TCP" ,
9098 "flow_event" : " opening" ,
9199 "reason" : " SYN-ACK" ,
92- "triggered_by" : " src "
100+ "triggered_by" : " dest "
93101}
94102```

An example of a (pretty-printed) RTT-event is provided below:
```json
{
  "sent_packets": 9393,
  "sent_bytes": 492457296,
  "rec_packets": 5922,
  "rec_bytes": 37,
  "match_on_egress": false
}
```

## Design and technical description

### Files:
- **pping.c:** Userspace program that loads and attaches the BPF programs, polls
  the perf-buffer `events` to print out RTT messages, and periodically cleans
  up the hash-maps from old entries. Also passes user options to the BPF
  programs by setting a "global variable" (stored in the programs' .rodata
  section). A minimal sketch of the perf-buffer polling follows this file list.
- **pping_kern.c:** Contains the BPF programs that are loaded on egress (tc) and
  ingress (XDP or tc), as well as several common functions, a global constant
  `config` (set from userspace) and map definitions. Essentially the same pping
  program is loaded on both ingress and egress. All packets are parsed for both
  an identifier that can be used to create a timestamp entry in `packet_ts`, and
  a reply identifier that can be used to match the packet with a previously
  timestamped one in the reverse flow. If a match is found, an RTT is calculated
  and an RTT-event is pushed to userspace through the perf-buffer `events`. For
  each packet with a valid identifier, the program also keeps track of and
  updates the state of the flow and its reverse flow, stored in the `flow_state`
  map.
- **pping.h:** Common header file included by `pping.c` and
  `pping_kern.c`. Contains some common structs used by both (some of which are
  part of the maps).
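
As a sketch of the userspace side, polling the `events` perf-buffer with
libbpf might look roughly like this (the event struct is a placeholder for the
real layouts in `pping.h`, and the map fd is assumed to come from the loaded
BPF object):

```c
#include <stdio.h>
#include <bpf/libbpf.h>

/* Hypothetical event layout; the real structs live in pping.h. */
struct rtt_event {
	__u64 rtt;
	/* ... flow tuple, min_rtt, counters, match_on_egress, etc. */
};

static void handle_event(void *ctx, int cpu, void *data, __u32 size)
{
	const struct rtt_event *e = data;

	printf("rtt: %llu ns\n", (unsigned long long)e->rtt);
}

static void handle_lost(void *ctx, int cpu, __u64 lost)
{
	fprintf(stderr, "lost %llu events on CPU %d\n",
		(unsigned long long)lost, cpu);
}

static int poll_events(int events_map_fd)
{
	/* 64 pages per CPU for the perf ring buffers */
	struct perf_buffer *pb = perf_buffer__new(events_map_fd, 64,
						  handle_event, handle_lost,
						  NULL, NULL);
	if (!pb)
		return -1;

	while (perf_buffer__poll(pb, 100 /* ms */) >= 0)
		; /* handle_event() runs for each pushed event */

	perf_buffer__free(pb);
	return 0;
}
```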

### BPF Maps:
- **flow_state:** A hash-map storing some basic state for each flow, such as the
  last seen identifier for the flow and when the last timestamp entry for the
  flow was created. Entries are created, updated and deleted by the BPF pping
  programs. Leftover entries are eventually removed by userspace (`pping.c`).
- **packet_ts:** A hash-map storing a timestamp for a specific packet
  identifier. Entries are created by the BPF pping program if a valid identifier
  is found, and removed if a match is found. Leftover entries are eventually
  removed by userspace (`pping.c`).
- **events:** A perf-buffer used by the BPF programs to push flow or RTT events
  to `pping.c`, which continuously polls the map and prints them out. A sketch
  of how these maps might be declared follows this list.
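
The maps described above might be declared in `pping_kern.c` along these lines,
using libbpf's BTF-style map definitions (the map sizes are illustrative, and
the key/value struct names are assumed to come from `pping.h`):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "pping.h" /* assumed to define the key/value structs used below */

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct network_tuple); /* flow (addrs, ports, proto) */
	__type(value, struct flow_state);  /* last id, last timestamp, stats */
} flow_state SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct packet_id);     /* flow + identifier */
	__type(value, __u64);              /* timestamp in ns */
} packet_ts SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} events SEC(".maps");
```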

## Similar projects
Passively measuring the RTT for TCP traffic is not a novel concept, and there