The case for Structured, Contextual Logs

By Morten Jensen

| 8 minutes read

Introduction

If correctly composed, logs can be an extremely useful resource to tap into in the following use cases:

  • Support end-users
  • Derive business metrics (how many users used our service yesterday, over the last 7 days and in the past month?)
  • Derive operational metrics (service uptime and failures)
  • Feed metrics to generate alerts during abnormal events or to trigger capacity increases and decreases based on service loads
  • Find and fix bugs

Business & Operational Dashboards can today be built to aggregate and chart metrics derived from logs in near-realtime. Dashboards can be tailored not only to IT but also to the wider organisation.

Alarms can be fired to alert people either by thresholds or by anomaly from the log-derived metrics.

Logs and metrics are enablers for feedback loops that enable organisations to react quickly.

In other words, proper logs and supporting services are a business-critical necessity in any organisation today. They need to be treated as such as they can give an organisation that is geared towards consuming actionable metrics a competitive advantage through quick response.

Structure & Correlation

To support the wide variety of log use cases, logs should meet certain criteria:

  • Log entries structured for easy searches with predicates (almost like SQL)
  • Log entries & messages formats that support easy filtering and aggregation
  • Sufficient log entry context to identify Who, What, Why, When & Where
  • Correlation - or linking - of entries that pertain to the same event or action

Some log entry fields will always be present. For instance consider a public-facing web service - HTTP request method & path should always be present in order to identify the call type and operation. Same goes for timestamp, message and a log-entry linking request id.

Other fields are added dynamically. For instance, a web service HTTP response status, which is only known towards the end of the processed request.

Traditional logs have in the past consisted of one-line text where fields were separated by spaces or tab, commas or other arbitrary delimiters like brackets etc. Such log formats all suffer from the challenge of effectively needing to write or configure a parser to ingest them into a store where they can be searched, filtered and aggregated. Furthermore, the formatting is often found to be inconsistent across different services and therefore service log correlation can be hard to achieve.

JSON has evolved to be an excellent format to emit logs in. It can be searched, filtered and events aggregated in a way that supports all of the above-mentioned use cases.

Example

Here is a simplified example, which shows 2 log entries for the same request to a hypothetical AWS Lambda Calendar web service:

# Request log start entry
{
    "thread": "main",
    "level": "INFO",
    "loggerName": "com.example.lambda.Handler",
    "message": "handleRequest: start",
    "contextMap": {
        "client-id": "calendar-android-app-v1.23",
        "user-id": "87654321-1234-5678-90ab-1234567890ab",
        "request-id": "12345678-7da1-49e3-8a7b-207a12a740b6",
        "request-method": "POST",
        "request-path": "/calendar/event",
        "start-free-memory": "52084216",
        "start-max-memory": "883884032",
        "start-time": "2019-09-23 06:50:16.461 +0000",
        "start-total-memory": "62849024",
        "viewer-country": "IE"
    }
}

# Request log end entry
{
    "thread": "main",
    "level": "INFO",
    "loggerName": "com.example.lambda.Handler",
    "message": "handleRequest: end",
    "contextMap": {
        "client-id": "calendar-android-app-v1.23",
        "user-id": "87654321-1234-5678-90ab-1234567890ab",
        "duration": "4645",
        "end-free-memory": "40031544",
        "end-max-memory": "883884032",
        "end-time": "2019-09-23 06:50:21.106 +0000",
        "end-total-memory": "62980096",
        "request-id": "12345678-7da1-49e3-8a7b-207a12a740b6",
        "request-method": "POST",
        "request-path": "/calendar/event",
        "response-status": "200",
        "start-free-memory": "52084216",
        "start-max-memory": "883884032",
        "start-time": "2019-09-23 06:50:16.461 +0000",
        "start-total-memory": "62849024",
        "viewer-country": "IE"
    }
}

From the two log entries we can derive any number of insights - for example:

  • The two entries were from the same request (same request-id UUID)
  • The request succeeded (response-status: 200) - the user successfully added a new calendar event
  • The user resides in Ireland (viewer-country = IE)
  • The request took 4645 milliseconds to complete (more than 4.5 seconds - perhaps itself a source of concern)
  • The user used version v1.23 of the Calendar Android app (client-id generally refers to the OAuth client id)
  • The user UUID can be used to identify the user for support request purposes
  • Processing of the request reduced free runtime memory by 12MB (end-free-memory - start-free-memory)

Furthermore, the log entries satisfy practically all of the established criteria:

  • It is possible to search for the log entries with e.g. AWS Cloudwatch Logs.
  • Using filters it’s possible to derive metrics from the entries (e.g. how many calendar events were created yesterday).
  • It’s also possible to derive uptime and error counts from the logs - simplistically, consider splitting the response-status field into HTTP 200 and non-200 responses and counting those events e.g. over 5 minutes.
  • For support purposes it’s possible to identify individual user actions
  • Linking all log entries pertaining to a single request or event is possible with the request-id

Tools

Achieving contextual and structured logs like this is possible and indeed reasonably simple to implement across programming languages like Java and Python. A couple of examples are:

Additional work to standardise log emission and formats across code bases and services through e.g. Decorator or Adapter patterns is definitely warranted.

Getting started

Code bases today can contain 100’s or 1000’s of different log calls. Many code bases use older log frameworks, which do not natively emit JSON - for instance Log4j 1.x.

Furthermore, many log calls may not emit particularly relevant logging information. Often, some log entries have been left in from a past debug session.

It can therefore be daunting to get started refactoring all of the logs.

We would make the following recommendations to get started:

  • Start simple; decide on a logging framework that supports JSON structured logging + context.

  • Carefully define and structure log levels such that each level is well-defined and actionable. An example:

    • ERROR: Server-side issue
    • WARN: Client-side issue
    • INFO: General log entry information to help understand the execution flow
    • DEBUG: Additional logging needed to support particular issues
  • Define relevant information to add to the context (request-id, start-time, end-time, request method/path etc.)

  • Implement a context per request or event; in many runtime environments a single thread processes a given request. In these runtimes use the Thread context to store information to support contextual information.

  • Implement a start and end log entry call at the top of the call stack (with all contextual information captured at each of the two points), which captures all request and response events.

  • Implement log filters that capture important business, operational and development events.

  • Use the log filters to aggregate and chart metrics from the events.

  • Build business and operational Dashboards from the metrics.

  • From learnings above, start adding standards to log entries - in particular for messages and how they should be structured

  • Refactor existing log calls over a number of sprints to use the new log format and log levels.

Personally Identifiable Information

Some considerations are warranted if storing personally identifiable information in logs. The considerations include:

  • It is helpful (and in many cases indeed necessary) to store user identifiers in logs in order to better support users. Ensure that this use is documented in the Privacy Policy
  • Try to avoid email addresses as user identifiers in logs - consider instead hashing the identifier or using a “surrogate key” like a UUID to uniquely identify each user in the logs.
  • Encrypt the logs at rest. With AWS Cloudwatch it is really simple to enable encryption per Log Group.
  • Per default log services like AWS Cloudwatch Logs store log entries indefinitely. This generally goes against privacy policies and should be avoided - it also comes at a considerable cost. Setting up a reasonable retention period (e.g. 1-3 months) is generally easy.
  • Ensure that there is a proper access policy to the logs both in terms of authentication and authorisation. For instance, consider requiring MFA for log access. MFA authorisation requirements can be enforced with AWS IAM policies.
  • Consider a policy on handling log data outside of the log service. For instance, while tools like AWS Logs for Humans can be helpful, consider that extracting logs to laptops etc. may not offer the same level of protection (like encryption and retention policies) as services like Cloudwatch Logs. For most operational & support purposes the AWS Cloudwatch Logs, Metrics, Dashboard & Insights Management Consoles would be sufficient.
  • Client IP addresses say a lot about users because they can identify a location very accurately. If - for instance - only the country is needed consider not logging client IP addresses (or hashing them) and instead use managed services like AWS API Gateway or AWS CloudFront to provide country information.
  • As per previous point, consider whether logging the full User-Agent header is necessary or whether managed services headers (e.g. AWS CloudFront/API Gateway Is-*-Viewer) are sufficient.
  • Unlike Facebook, do not under any circumstances store passwords and secrets in logs.
  • To further improve security, consider hashing user identifiers - and perhaps even things like client IP address. The hashed identifier can remain the same across log entries and requests (for searchability) but without the original user identifiers it is a lot harder to compromise user data through the logs.

Conclusion

Structured, contextual logs solve a number of business, operational and development challenges, which means that giving logs the attention they deserve will help organisations immensely.

There are several tools to handle storage, usage and the lifecycle of logs. In particular AWS provides a comprensive, Cloud-native set of services that satisfy most organisational log requirements.

Finally, in the past many complaints have been made suggesting that JSON is unreadable to humans. However, managed services like the AWS Cloudwatch Logs & Insights Consoles have gone a long way to make it easier to read the logs. While it takes some getting used to reading JSON logs it should therefore be less of a concern today.