User-2019917520 posted
I've been looking at parsing the logs produced by IIS and have noticed issues when trying to obtain the value of cs(User-Agent) using both the default logging and Advanced Logging modules.
1) Issue with default logging - it's not possible to get the original value of the User Agent.
IIS replaces all spaces with a +, so it's impossible to tell whether a User Agent originally had a + or a space at a given position. Some user agents, for instance that of Googlebot, contain a +, e.g.:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Becomes:
Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
So simply replacing all +'s with spaces won't get the original user agent string back.
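To make the ambiguity concrete, here is a minimal Python sketch. The iis_encode function is my own hypothetical stand-in for the behaviour described above (every space becomes a +); it is not IIS code. It shows that two different user agents end up as the same logged string, so no decoder can tell them apart.

```python
def iis_encode(value: str) -> str:
    # Hypothetical model of the default logging behaviour described above:
    # every space is replaced with '+', and a literal '+' is left as-is.
    return value.replace(" ", "+")

original = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
logged = iis_encode(original)

# Naive decoding: turn every '+' back into a space.
naive = logged.replace("+", " ")

# The naive result differs from the original (the literal '+' is lost),
# yet both strings encode to exactly the same log entry.
print(naive == original)            # False
print(iis_encode(naive) == logged)  # True
```

Since the mapping is many-to-one, the only way to recover the original would be for the logger to escape literal + characters differently from spaces.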
2) Issue with Advanced Logging - spaces in field values cause incorrect tokenisation.
It's logged like this:
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Which is actually 4 tokens, not 1:
"Mozilla/5.0
(compatible;
Googlebot/2.1;
+http://www.google.com/bot.html)"
So to parse this correctly, we've got to add some extra tokenisation logic, outside of the W3C standard, to work around this.
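The kind of extra logic I mean could look roughly like the Python sketch below. The log line is made up and shortened to four fields for illustration, and split_fields is a hypothetical helper. It also assumes a field value never contains a double quote itself; I don't know how Advanced Logging would write one.

```python
import re

# Made-up log line, shortened to four fields for illustration.
line = 'GET /index.html "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 200'

# Plain whitespace splitting, which is all the W3C format calls for.
naive_tokens = line.split()

# Quote-aware pattern: either a double-quoted run or a run of non-whitespace.
# Assumption: values never contain an embedded double quote.
FIELD = re.compile(r'"([^"]*)"|(\S+)')

def split_fields(text):
    # findall yields (quoted, unquoted) pairs; exactly one side is non-empty,
    # except for an empty quoted field, where both are "".
    return [quoted if unquoted == "" else unquoted
            for quoted, unquoted in FIELD.findall(text)]

print(len(naive_tokens))         # 7 tokens, the user agent is split in four
print(len(split_fields(line)))   # 4 tokens, the user agent stays whole
```

This works for the example, but it is a workaround layered on top of the format rather than something a generic W3C log parser would do.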
The W3C Extended Log File Format spec says:
Fields are separated by whitespace, the use of tab characters for this purpose is encouraged.
Surely if tabs had been used as the separator, neither of these issues would exist? I wanted to file bugs with Microsoft about these, but this was the best place I could find to discuss them.
Thoughts?