Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(stdlib)!: parse_logfmt performs escaping (#777) #924

Merged
merged 3 commits into from
Jul 23, 2024

Conversation

drmason13
Copy link
Contributor

So that parse_logmt(encode_logfmt(X)) == X, parse_logfmt now processes the escape sequences currently added by encode_logfmt:

  • \n
  • \"
  • \\

This is a breaking change, since the output of the parse_logfmt function is now different for certain inputs (that contain the above escape sequences).

A possible alternative is to avoid escaping while encoding, but that is also a breaking change and makes less sense to me at least.

Another alternative is to simply do nothing. Leave round trip encoding-decoding of logfmt "broken", however this is surprising for users and I don't think it was originally intended to work this way.

I do think it's worth considering other parse/encode pairs and whether they can and should roundtrip, but I haven't yet done this.

As for the implementation, parsers now return a Cow<'a, str> instead of just a &'a str because they may need to remove escape characters from the input, and thus it is no longer possible to represent as a string slice.

Performance

Returning Cow as opposed to plain String avoids allocation in the "happy path" of no escapes being present, however there is always a performance cost as each str is checked upfront for the presence of any escape characters to determine whether to allocate a Cow::Owned (regardless of whether any of the escape characters are actually removed during processing).
See parse_key_value::escape_str for the implementation.

"invalid" escape sequences

I've not (yet?) added an error for trailing escape characters. At the moment they would be accepted and kept as-is, the same goes for other escape sequences such as \r, \t, \u{0123} and "invalid sequences" such as say \z. If they're not one of the escape sequences encoded by encode_logfmt then they are ignored.
See parse_key_value::escape_char for the implementation.

@drmason13 drmason13 force-pushed the parse-encode-inverse-fix branch from b6be400 to c7b2f2b Compare July 7, 2024 22:00
@jszwedko jszwedko requested a review from pront July 8, 2024 19:57
Copy link
Member

@bruceg bruceg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code makes sense to me, as does introducing a breaking change for it. The match in escape_char is a little awkward but I can't think of a cleaner way of doing it myself. I think the changelog filename needs to change, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a breaking change, the filename for the changelog fragment needs to be X.breaking.md.

@bruceg bruceg changed the title fix(stdlib): parse_logfmt performs escaping (#777) fix(stdlib)!: parse_logfmt performs escaping (#777) Jul 15, 2024
@drmason13
Copy link
Contributor Author

The code makes sense to me, as does introducing a breaking change for it. The match in escape_char is a little awkward but I can't think of a cleaner way of doing it myself. I think the changelog filename needs to change, though.

I've looked at it again, and don't want to swap out the match for some sort of and_then or chain.

Some(_) => c

This branch is the most confusing. Might be worth a comment to highlight that we are ignoring the escape sequence entirely in that case by returning c (the backslash), and that this is because we only concern ourselves with escapes that are added by encode_key_value

@jszwedko jszwedko requested a review from bruceg July 23, 2024 15:01
@bruceg
Copy link
Member

bruceg commented Jul 23, 2024

I've looked at it again, and don't want to swap out the match for some sort of and_then or chain.

No, that would definitely be a step backwards. I was more thinking of how to deduplicate the repeated _ = rest.next(); but I couldn't see a good way, so my comment was not a complaint.

@bruceg bruceg added this pull request to the merge queue Jul 23, 2024
Merged via the queue into vectordotdev:main with commit 8691870 Jul 23, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants