Common Log Format

Original link: https://blog.othree.net/log/2023/08/14/common-log-format/

Common Log File Format

This post is a bit of software archaeology. I recently made some changes to the blog, one of which was removing Google Analytics. Part of the reason is that UA (Universal Analytics) has been retired in favor of GA4, but I had wanted to remove it for a long time anyway, so I took the opportunity. I am still curious about how much interest different articles attract, though, so I looked into the older tools that compile website statistics from HTTP server logs. One reason I had not removed GA earlier is that I could not find a good alternative. AWStats is the tool I remember best, but I really cannot stand its interface, and my search for other tools never went well. That changed during this round of research, when I came across the keyword Common Log Format. It sounds generic, but it has become a proper term in the software engineering world.

The Wikipedia entry for Common Log Format calls it a standard format for HTTP server logs, but I think it can only be regarded as a de facto (industry) standard, because no organization has actually defined and published it as a standard. The format description from W3C that can be found online today is really the documentation of the httpd software from the CERN era. I took this opportunity to sort out a few log-related words that have long puzzled me: common, mentioned above, plus combined and extended. These keywords show up whenever I configure Apache HTTPD, and other web servers use them too, but I never really understood them, and the wording struck me as odd: what exactly is combined combining?

It turns out that almost all the answers are in the documentation of HTTPd, the software developed by NCSA (National Center for Supercomputing Applications, in the US). NCSA HTTPd was the first HTTP server to propose this log format, and by default it actually writes three logs, namely:

  • TransferLog
  • AgentLog
  • RefererLog

TransferLog is what everyone now calls the access log, and its format is the so-called CLF, although at the time it was written out as Common Log File (CLF) Format. Each record has the format:

 host rfc931 authuser [DD/Mon/YYYY:hh:mm:ss] "request" ddd bbbb

  • host: Either the DNS name or the IP number of the remote client
  • rfc931: Any information returned by identd for this person, - otherwise.
  • authuser: If user sent a userid for authentication, the user name, - otherwise.
  • DD: Day
  • Mon: Month (calendar name)
  • YYYY: Year
  • hh: hour (24-hour format, the machine's timezone)
  • mm: minutes
  • ss: seconds
  • request: The first line of the HTTP request as sent by the client.
  • ddd: the status code returned by the server, - if not available.
  • bbbb: the total number of bytes sent, *not including the HTTP/1.0 header*, - if not available
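To make the field list above concrete, here is a minimal Python sketch that splits one line in this format into named fields. The names CLF_PATTERN and parse_clf and the sample line are made up for illustration; they do not come from NCSA HTTPd or any other server, and the regular expression is only one plausible reading of the description.

```python
import re
from datetime import datetime

# One named group per field in the format description above.
# The timestamp carries no time zone, matching the NCSA-era layout.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}|-) (?P<bytes>\d+|-)$'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    m = CLF_PATTERN.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # A lone "-" stands for "no value" in these fields.
    for key in ("rfc931", "authuser"):
        if rec[key] == "-":
            rec[key] = None
    rec["status"] = None if rec["status"] == "-" else int(rec["status"])
    rec["bytes"] = None if rec["bytes"] == "-" else int(rec["bytes"])
    # %b expects English month abbreviations under the default C locale.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S")
    return rec

# Hypothetical sample line, invented for this example.
sample = 'example.org - - [14/Aug/2023:09:30:00] "GET /index.html HTTP/1.0" 200 1043'
record = parse_clf(sample)
```

Here record["rfc931"] and record["authuser"] both come back as None, since both fields hold - in the sample line.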

The documentation then defines an extended version, the Extended CLF Format, which allows additional data to be appended to the end of each log entry. If LogOptions is set to Combined, the three logs are combined into one: the Extended CLF Format is used, with the referrer and user-agent information appended. This is where the format name Combined comes from. There is one more source of confusion here: W3C has a very old Working Draft called Extended Log File Format, and the format it defines has nothing to do with CLF. More careful documentation will therefore state whether it means the W3C extended format or the NCSA one.
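As a rough illustration of "CLF plus referrer and user-agent at the end", the following Python sketch separates a Combined-style line into its CLF prefix and the two appended fields. COMBINED_PATTERN, split_combined, and the sample line are hypothetical, and the sketch assumes the two extra fields are double-quoted and contain no embedded quotes, which real logs do not guarantee.

```python
import re

# The CLF part comes first, then two extra quoted fields:
# the referrer, followed by the user agent.
COMBINED_PATTERN = re.compile(
    r'^(?P<clf>\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \S+ \S+)'
    r' "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)

def split_combined(line):
    """Return (clf_part, referrer, user_agent), or None if there is no match."""
    m = COMBINED_PATTERN.match(line.strip())
    if m is None:
        return None
    return m.group("clf"), m.group("referrer"), m.group("user_agent")

# Hypothetical sample line, invented for this example.
sample = ('203.0.113.7 - - [14/Aug/2023:09:30:00 +0800] '
          '"GET /log/ HTTP/1.1" 200 5120 '
          '"https://example.org/" "ExampleBrowser/1.0"')
clf_part, referrer, user_agent = split_combined(sample)
```

A plain CLF line without the two trailing fields makes split_combined return None, which gives a simple way to tell the two formats apart.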

I did not check the history carefully, but the CERN version of HTTPd appears to have implemented the NCSA log format later. Its documentation calls it Common Logfile Format, also abbreviated CLF; the wording differs slightly, but the format is the same. CERN HTTPd also kept its own older log format:

 time remotehost request

The implementation is:

 fprintf(log, "%24.24s %s %s\n", asctime(gorl), HTClientHost, n_noth(HTReqLine));

It took me a while to understand %24.24s. The first 24 is the minimum field width: if the data is shorter, it is padded with spaces. The second 24, after the ., is the precision, which for a string means the maximum length: anything beyond it is not output. asctime is a standard library function that converts a time into a string of the form:

 Www Mmm dd hh:mm:ss yyyy

That is exactly 24 characters long, not counting the newline that asctime appends, which the %24.24s conversion cuts off. The variable name gorl took me the longest to figure out. It means “GMT time or Local time”, but it is not a boolean-style indicator: the variable holds the time itself, which may be expressed in either GMT or local time.
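The padding and cutting behaviour of %24.24s can be tried out quickly. The sketch below uses Python's % operator, which follows the same width and precision rules as C's printf for %s; it is not the CERN code, and the host name and request line are made up. Python's time.asctime leaves out the trailing newline that the C function adds, so the newline is added back by hand to mimic the C string.

```python
import time

# Mimic C's asctime result: 24 visible characters plus a trailing newline.
stamp = time.asctime(time.gmtime(0)) + "\n"

# The precision of 24 keeps only the first 24 characters, dropping the newline.
trimmed = "%24.24s" % stamp

# The width of 24 pads shorter strings on the left with spaces.
padded = "%24.24s" % "short"

# Same shape as the old CERN log line: time, remote host, request.
# The host and request values are hypothetical.
log_line = "%24.24s %s %s" % (stamp, "client.example.org", "GET / HTTP/1.0")
```

Without the precision, the newline inside the timestamp would split every log entry across two lines.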

With that, many small questions of detail were answered. For example, when reading logs I often saw two - in a row. They are simply two consecutive fields with no value, one of which is almost useless today: the Identification Protocol (RFC 1413) is itself an ancient thing, though from a quick look it still seems to see some use on IRC. Because there is no formal standard, the date format also differs between then and now: adding a time zone is common today, whereas NCSA, and later CERN, recorded none. The Apache HTTPD documentation examples also mention RefererLog and AgentLog. With this proper term in hand, I followed the trail and found more web log analysis tools. For now I am going with goaccess.
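Since the timestamp may or may not carry a time zone, a log reader has to accept both forms. Here is a small Python sketch of one way to do that; the function name parse_log_time and both sample timestamps are hypothetical, and the zone form assumed here is a numeric offset such as +0800.

```python
from datetime import datetime

def parse_log_time(text):
    """Parse a log timestamp (without brackets), with or without a zone offset."""
    # Try the zoned form first, then fall back to the older zone-less form.
    for fmt in ("%d/%b/%Y:%H:%M:%S %z", "%d/%b/%Y:%H:%M:%S"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognised timestamp: %r" % text)

# Hypothetical timestamps in the older and the newer style.
old_style = parse_log_time("14/Aug/2023:09:30:00")
new_style = parse_log_time("14/Aug/2023:09:30:00 +0800")
```

The older form yields a naive datetime with no zone attached, so the reader has to know the server's time zone from somewhere else.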

Finally, let’s sort out these three keywords in the context of web logs:

  • Common: usually refers to the Common Log File (CLF) Format;
  • Extended: leaving the W3C version aside, this refers to the NCSA Extended CLF Format. When the fields defined by CLF are not enough and more information is needed, this format allows it to be appended to the end of each log entry;
  • Combined: a web log with referrer and user-agent added, using the NCSA Extended CLF Format.

In fact, NCSA HTTPd offers not only Common and Combined but also ServerName, which likewise uses the Extended format; the first two are simply the ones that have been handed down and remain common.
