directive %r containing \n #1

Closed
kyn3s opened this issue Aug 4, 2020 · 12 comments

Comments

@kyn3s

kyn3s commented Aug 4, 2020

Hi,
Thanks for your parser, working great.
Please look at this access log entry where "%r" contains \n. I get an "InvalidEntryError : Could not match log entry..."
[23/Jul/2020:11:21:48 +0100] 66.240.192.138 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384 "\n" 226

LogParser('%t %h %{tls}x %{encr}x \"%r\" %b')

Is there something wrong with my regexp, or with the way apachelogs handles \n?

@jwodder
Owner

jwodder commented Aug 4, 2020

This works fine for me:

from apachelogs import LogParser

line = '[23/Jul/2020:11:21:48 +0100] 66.240.192.138 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384 "\\n" 226'
parser = LogParser('%t %h %{tls}x %{encr}x \"%r\" %b')

entry = parser.parse(line)
print(vars(entry))

Note that I doubled the backslash before the n so that line would contain a literal backslash followed by an n instead of a newline character. How are you entering the line in your program? If the log entry contains a literal backslash + n, it should be entered in Python either as '\\n' or else using a raw string. On the other hand, if the log entry contains an actual newline character in place of the %r directive, that is not something I was aware could happen, and I would like to know your Apache version.
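
As a small illustration of the distinction described above, the following plain-Python sketch compares the three ways of writing the sequence in source code. The variable names are only illustrative and nothing here depends on apachelogs.

```python
# In Python source code, '\n' is ONE character (a newline), while '\\n'
# and r'\n' are both TWO characters: a backslash followed by the letter n.
escaped = "\n"     # actual newline character
doubled = "\\n"    # literal backslash + n, written with a doubled backslash
raw = r"\n"        # literal backslash + n, written as a raw string

print(len(escaped), len(doubled), len(raw))
print(doubled == raw)
```

Only the last two forms match what Apache writes into the access log for a request line that contained a newline.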

@kyn3s
Author

kyn3s commented Aug 8, 2020

Thanks for your reply!
I'm not sure of the Apache version, because this is a Cisco product running Apache as an appliance. I can have my team check this if you need it.
However, the lines were imported from an Apache access.log file and contain a literal backslash + n. This would require preprocessing to double the backslash in each '\n' entry. I am surprised this cannot be handled as a standard case.
To give you some context: we started checking access.log after our systems suddenly became reachable from the Internet (an error in the Internet firewall traffic matrix settings) and were scanned by some bots. There are a lot of entries in the Apache access log containing a backslash character, so we decided to run some forensics.

@jwodder
Owner

jwodder commented Aug 8, 2020

When you say "imported from an apache access.log file", is your Python code opening & reading from the file (which is what I would generally recommend & expect and which requires no special processing to deal with backslashes), or are you copying & pasting the contents of the file into the Python program? If the latter, making the log file entries into raw strings (if you have a whole big block of them, by enclosing the block in r""" ... """) should be sufficient to keep Python from interpreting backslash escapes.
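
To make the point about file reading concrete, here is a minimal sketch using only the standard library. It writes one of the log lines from this thread to a temporary file (the file name `access.log` and the variable names are illustrative) and reads it back, showing that the backslash and the n arrive as two ordinary characters with no escape processing.

```python
import os
import tempfile

# A log line as Apache would write it: the quotes contain a literal
# backslash followed by n. The raw string keeps Python from turning the
# sequence into a newline in this source file.
log_line = r'[18/Jun/2020:15:30:20 +0000] 71.6.199.23 TLSv1 ECDHE-RSA-AES256-SHA "\n" 226'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "access.log")
    with open(path, "w") as f:
        f.write(log_line + "\n")
    with open(path) as f:
        # Strip only the line terminator added above.
        read_back = f.readline().rstrip("\n")

# Reading from a file performs no backslash-escape processing.
print(repr(read_back))
print(read_back == log_line)
```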

@kyn3s
Author

kyn3s commented Aug 11, 2020

I am simply opening and reading from a file. I also tried enclosing the lines in something like r""" ... """.
Here is a code sample.

with open(output, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    header = ['Server', 'IP', 'Date', 'Volume']
    writer.writerow(header)

    parser = LogParser("%t %h %{tls}x %{encr}x \"%r\" %b")
    with open(input, newline='\n') as f:

        reader = csv.reader(f, delimiter=',',quotechar='"')
        for entry in parser.parse_lines(f):

            dtObj = entry.directives["%t"]
            dtObjStr = dtObj.strftime("%d/%m/%Y")
            dtYear = dtObj.strftime("%Y")

            if dtYear == "2019" or dtYear == "2020":
                if entry.directives["%b"] is None:
                    writer.writerow( (in_dir, entry.remote_host, dtObjStr, "-") )

And here are some lines from the input log file that contain \n:

[18/Jun/2020:15:30:20 +0000] 71.6.199.23 TLSv1 ECDHE-RSA-AES256-SHA "\n" 226
[02/Jul/2020:19:38:10 +0000] 185.234.217.8 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384 "t3 11.1.2\n" 226

@jwodder
Owner

jwodder commented Aug 11, 2020

Try this: Wrap the for entry in parser.parse_lines(f): block in a try: ... except block, where the except part is:

except apachelogs.InvalidEntryError as e:
    print(repr(e.entry))

Then re-run your script and tell me what exactly it outputs.

@kyn3s
Author

kyn3s commented Aug 12, 2020

Hi, I have tested again and here is what I can tell:

Case A) using parser.parse_lines(f)
This actually works almost fine. The problem is that the \n is interpreted in the result: I would like entry.request_line not to interpret \n
The result is as follows:
90.115.177.7,185.234.217.8,02/07/2020,226,"t3 11.1.2
"

Case B) using parser.parse(...), here is the error I get:
raise InvalidEntryError(entry, self.format)
apachelogs.errors.InvalidEntryError: Could not match log entry '[02/Jul/2020:19:38:10 +0000] 185.234.217.8 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384 "t3 11.1.2\n" 226' against log format '%{isoutc}x %{srv}x %{daemon}x: %t %h %{tls}x %{encr}x "%r" %b'
Removing the \n from the log entry makes it work.

Case B) is not my use case, so we can ignore it for the moment. Let's focus on case A).

@jwodder
Owner

jwodder commented Aug 12, 2020

> I would like entry.request_line not to interpret \n

That's not going to change. When Apache writes out backslash + n in an access log entry, it means that the respective field (%r/request_line in your case) contained an actual newline. apachelogs translates the escape sequence back to a newline so that the user gets the actual string value of the request line (just like it also translates \t and \xA0, among other sequences). The request the client sent to Apache had a newline on the end, so apachelogs includes that newline. (I'm not sure why Apache isn't stripping that newline before logging the request line, but that's beyond the scope of this library.) If you don't want a newline at the end of request_line, you can strip it yourself with entry.request_line.rstrip('\n').

Why exactly do you want the newline to be escaped? A literal newline in the middle of a field in a CSV file isn't a problem as long as the field is quoted. If the problem is that whatever's reading the CSV file can't handle such newlines, you may be better served by configuring the CSV dialect into something compatible with that program.
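
The claim about quoted fields can be checked with a short round trip through the standard csv module. This is only a sketch: the row values are taken from the sample output earlier in the thread, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Write one row whose last field ends with an actual newline character.
buf = io.StringIO(newline="")
writer = csv.writer(buf, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(["185.234.217.8", "02/07/2020", "226", "t3 11.1.2\n"])

# QUOTE_MINIMAL quotes the field because it contains a newline, so the
# row spans two physical lines in the CSV text.
print(repr(buf.getvalue()))

# A CSV reader using the same dialect reassembles it into a single row.
buf.seek(0)
rows = list(csv.reader(buf, delimiter=",", quotechar='"'))
print(rows)
```

The row looks broken when the file is viewed as plain text, but a reader that honours the quoting recovers the field unchanged.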

@kyn3s
Author

kyn3s commented Aug 12, 2020

Thanks for the tip about .rstrip('\n'), but that simply omits the \n from the result file.

I want the result to be printed exactly as it was in the input file because I am doing some forensics. The output file will contain all requests sent from suspicious IPs to my servers (more than 20 access.log files, actually). I can compute how many requests were sent, from which IPs, and on which dates, but I cannot easily name the top 5 requests that generated the most bytes out. These requests are sent by bots scanning reachable IPs for vulnerabilities. They are crafted on purpose to get some data out (e.g. /etc/passwd).
If you are not going to change the way \n is interpreted, I guess I have to split the row myself to get the original request and write it to the output file.

@kyn3s
Author

kyn3s commented Aug 12, 2020

By the way, .rstrip() is not documented:

  • when you have nothing after \n, this newline is not printed in the output file
  • if you have something after \n in the GET request, it will be printed on a new line

@jwodder
Owner

jwodder commented Aug 12, 2020

I don't understand why you want one representation of the request line and not another, but OK.

> By the way, .rstrip() is not documented:

rstrip() is a str method, as entry.request_line is just a str. If you want to remove medial newlines as well, perhaps you want replace()?

@kyn3s
Author

kyn3s commented Aug 12, 2020

Ah OK! I thought it was part of your library. I'm not used to Python enough yet ^^

Well, I just want the request line to be printed out as it was sent, so that I can say, for instance, that the request t3 11.1.2\n was sent 100 times and that this may have caused an issue.

Thanks for your quick replies.
I will also look into the CSV dialect.

@kyn3s
Author

kyn3s commented Aug 12, 2020

> I don't understand why you want one representation of the request line and not another, but OK.

Probably for the same reason that the Apache project stores the request this way in access.log.

I can't put it better than this:

For security reasons, starting with Apache 2.0.46, non-printable and other special characters are escaped mostly by using \xhh sequences, where hh stands for the hexadecimal representation of the raw byte. Exceptions from this rule are " and \ which are escaped by prepending a backslash, and all whitespace characters which are written in their C-style notation (\n, \t etc). In httpd 2.0 versions prior to 2.0.46, no escaping was performed on the strings from %…r, %…i and %…o, so great care was needed when dealing with raw log files, since clients could have inserted control characters into the log.

What about adding an option to enable/disable writing special characters as mentioned above?
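
Absent such an option, the escaping rules quoted above can be approximated in user code and applied to the decoded request line before it is written out. The sketch below is not part of apachelogs: `reescape` is a hypothetical helper, the set of C-style escapes it handles is an assumption based on the quoted description, and its output is not guaranteed to match Apache's byte for byte.

```python
# Hypothetical helper (not an apachelogs API): re-escape a decoded string
# roughly the way the quoted Apache documentation describes.
_C_STYLE = {"\n": "\\n", "\t": "\\t", "\r": "\\r", "\v": "\\v", "\b": "\\b"}


def reescape(value: str) -> str:
    out = []
    for ch in value:
        if ch in _C_STYLE:
            # Whitespace-like control characters use C-style notation.
            out.append(_C_STYLE[ch])
        elif ch in ('"', "\\"):
            # Double quote and backslash are escaped with a backslash.
            out.append("\\" + ch)
        elif ch.isascii() and ch.isprintable():
            out.append(ch)
        else:
            # Assumption: everything else becomes one \xhh per UTF-8 byte.
            out.extend("\\x%02x" % b for b in ch.encode("utf-8"))
    return "".join(out)


# The decoded request line from this thread ends with a real newline;
# re-escaping turns it back into backslash + n for the output file.
print(reescape("t3 11.1.2\n"))
```

In the script shown earlier, this would mean writing `reescape(entry.request_line)` instead of `entry.request_line` to the CSV file.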

jwodder closed this as not planned Sep 24, 2023