Logging

AMP uses the SLF4J logging facade, which allows use of many popular frameworks including logback, java.util.logging and log4j.

The convention for log levels is as follows:

  • ERROR and above: exceptional situations which indicate that something has unexpectedly failed or some other problem has occurred which the user is expected to attend to
  • WARN: exceptional situations which the user may wish to know about, but which do not necessarily indicate failure or require a response.
  • INFO: a synopsis of activity, but which should not generate large volumes of events nor overwhelm a human observer.
  • DEBUG: detail of activity which might merit closer inspection under certain circumstances.
  • TRACE and lower: detail of activity which is not normally of interest, and which may include sensitive information (e.g. secrets) that should not be exposed at higher log levels. A configuration example for TRACE level is present in the log configuration file, but is commented out because of security concerns.

Loggers follow the package.ClassName naming standard.
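
For instance, a custom entity class would obtain its logger through the SLF4J API; a minimal sketch (the package and class names are hypothetical) looks like:

    package com.acme.example;  // hypothetical package, for illustration only

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class MyEntityImpl {
        // The logger takes the fully-qualified class name: com.acme.example.MyEntityImpl
        private static final Logger LOG = LoggerFactory.getLogger(MyEntityImpl.class);

        public void start() {
            LOG.info("Starting entity");                          // INFO: synopsis of activity
            LOG.debug("Connecting to host {}", "db.example.com"); // DEBUG: closer detail
        }
    }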

The default logging is to write INFO+ messages to amp.info.log, and DEBUG+ to amp.debug.log. Each is a rolling log file, where the past 10 files will be kept. Messages at INFO level and above are also logged to the karaf console; use the log: commands in the karaf client, e.g. log:tail, to read these messages.

Using Logback through OSGi Pax Logging

In the OSGi-based Cloudsoft AMP, logging is configured through OPS4J Pax Logging.

See: https://ops4j1.jira.com/wiki/display/paxlogging/Configuration

Standard Configuration

An org.ops4j.pax.logging.cfg file is included in the etc/ directory of the AMP distro; this is read by brooklyn at launch time. Changes to the logging configuration, such as new appenders or different log levels, can be made directly in this file.
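
For example, to raise the level for a particular log category, an entry in the Pax Logging log4j2 property style can be added; a minimal sketch (the category chosen is illustrative):

    # Illustrative: report entity activity at DEBUG level
    log4j2.logger.entities.name = org.apache.brooklyn.entity
    log4j2.logger.entities.level = DEBUG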

Karaf logging is highly configurable. For example, you can enable the sift appender to log to separate log files for each bundle, as described under Advanced Configuration below.

Using the default configuration, log entries are reported in UTC time. If you want logging to be reported using the server's local time, you can replace the log4j2.pattern, removing the UTC flag and the Z suffix:

log4j2.pattern = %d{ISO8601} %X{task.id}-%X{entity.ids} %-5.5p %3X{bundle.id} %c{1.} [%.16t] %m%n

Advanced Configuration

The default logback.xml file references a collection of other log configuration files included in the AMP jars; to customize these it is necessary to understand the source structure in the logback-includes project.

For example, to change the debug log inclusions, create a folder brooklyn under conf and create a file logback-debug.xml based on the brooklyn/logback-debug.xml from that project.
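
Logback include files use the <included> root element; a minimal sketch of such a logback-debug.xml (the logger name is illustrative) might be:

    <included>
        <!-- Illustrative: capture extra detail from a chosen category in the debug log -->
        <logger name="org.apache.brooklyn.entity" level="DEBUG"/>
    </included>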

A full explanation of logging in karaf is available here.

Karaf Log Commands

Logging commands are available through the karaf console. These let you interact with the logs and dynamically change logging configuration in a running application.

Some useful log: commands are:

log:display mylogger -p "%d - %c - %m%n" - Show the log entries for a specific logger with a different pattern.

log:get/set - Show / set the currently configured log levels

log:tail - As log:display, but shows new entries continuously

log:exception-display - Display the last exception
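
For example, an illustrative console session that temporarily raises a category to DEBUG, watches the output, and then restores the inherited level:

    karaf@root()> log:set DEBUG org.apache.brooklyn
    karaf@root()> log:tail
    karaf@root()> log:set DEFAULT org.apache.brooklyn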

Log File Backup

This sub-section is a work in progress; feedback from the community is extremely welcome.

The default rolling log files can be backed up periodically, e.g. using a CRON job.

Note however that the rolling log file naming scheme renames the historic zipped log files, such that amp.debug-1.log.zip is always the most recent zipped log file: when the current amp.debug.log is zipped, the previous zip file is renamed amp.debug-2.log.zip. This renaming of files can make rsync or other backup tools tricky to use.

An option is to convert/move the file to a name that includes the last-modified timestamp. For example (on macOS):

    # Rename the newest zipped log to include its last-modified time (epoch seconds)
    LOG_FILE=amp.debug-1.log.zip
    TIMESTAMP=$(stat -f '%Um' "$LOG_FILE")
    mv "$LOG_FILE" /path/to/archive/amp.debug-$TIMESTAMP.log.zip
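
To run this regularly, the snippet above could be saved as a script and scheduled with cron; an illustrative crontab entry (the script path is hypothetical):

    # Archive rolled logs at 02:00 every day
    0 2 * * * /opt/amp/scripts/archive-logs.sh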

Logging Aggregators

Integration with systems like Logstash and Splunk is possible using standard log4j configuration. Log4j can be configured to write to syslog using the SyslogAppender which can then feed its logs to Logstash.
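
As a sketch, a Syslog appender could be declared in etc/org.ops4j.pax.logging.cfg using the log4j2 property style; the host and port below assume a local syslog daemon listening on the default UDP port:

    # Illustrative syslog appender forwarding log events to a local syslog daemon
    log4j2.appender.syslog.type = Syslog
    log4j2.appender.syslog.name = SyslogAppender
    log4j2.appender.syslog.host = localhost
    log4j2.appender.syslog.port = 514
    log4j2.appender.syslog.protocol = UDP
    log4j2.rootLogger.appenderRef.syslog.ref = SyslogAppender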

For More Information

The following resources may be useful when configuring logging:

AMP usage and audit trails

There are several different areas of requirements:

  • Retrieving usage information programmatically - i.e. the applications, and the resources used by each.
  • Configuring AMP logging (through log4j), and log shipping (e.g. with Logstash)
  • Audit trails, for post-hoc offline analysis
  • Integration with monitoring dashboards (e.g. Nagios)

Usage information

The AMP REST API provides access to usage information. It lists the applications (including those now terminated), showing the start/end time and state transitions for each. It also lists the machines used (including those now terminated), linking each machine back to an application id.

Documentation for the REST API can be found within AMP itself (in the web-console, go to the Script -> REST API tab, and then browse the API). The annotations on the Java interfaces also make it easy to browse the API: UsageApi.java.

An example of retrieving all applications is shown below. For each application, it shows start/end time for each phase (e.g. when it was starting, when it was running, and when it stopped).

    curl http://localhost:8081/v1/usage/applications
[
  {
    "statistics": [
      {
        "status": "STARTING",
        "id": "htStRkN7",
        "applicationId": "htStRkN7",
        "start": "2014-10-09T11:00:13+0100",
        "end": "2014-10-09T11:00:15+0100",
        "duration": 2313,
        "metadata": {}
      },
      {
        "status": "RUNNING",
        "id": "htStRkN7",
        "applicationId": "htStRkN7",
        "start": "2014-10-09T11:00:15+0100",
        "end": "2014-10-09T11:00:22+0100",
        "duration": 6495,
        "metadata": {}
      }
    ],
    "links": {}
  },
  {
    "statistics": [
      {
        "status": "STARTING",
        "id": "Z3TTK4sM",
        "applicationId": "Z3TTK4sM",
        "start": "2014-10-09T10:59:55+0100",
        "end": "2014-10-09T10:59:55+0100",
        "duration": 33,
        "metadata": {}
      },
      {
        "status": "UNKNOWN",
        "id": "Z3TTK4sM",
        "applicationId": "Z3TTK4sM",
        "start": "2014-10-09T10:59:55+0100",
        "end": "2014-10-09T11:00:22+0100",
        "duration": 26634,
        "metadata": {}
      }
    ],
    "links": {}
  }
]

An example of retrieving all machines is shown below, with each machine giving the associated application id:

    curl http://localhost:8081/v1/usage/machines
    
[
  {
    "statistics": [
      {
        "status": "ACCEPTED",
        "id": "yqqA9Moy",
        "applicationId": "rhsLVvJs",
        "start": "2014-10-09T11:11:13+0100",
        "end": "2014-10-09T11:13:45+0100",
        "duration": 151750,
        "metadata": {
          "id": "yqqA9Moy",
          "displayName": "159.8.33.134",
          "provider": "softlayer",
          "account": "cloudsoft",
          "serverId": "6488686",
          "imageId": "4343926",
          "instanceTypeName": "br-ked-aled-rhsl-j8mq-fe",
          "instanceTypeId": "6488686",
          "ram": "1024",
          "cpus": "1",
          "osName": "ubuntu",
          "osArch": "x86_64",
          "64bit": "true"
        }
      }
    ],
    "links": {}
  }
]

The start/end times to be retrieved can also be constrained, and it is possible to retrieve information about a specific machine or specific location. An example is below (note that the URL must be quoted for the shell, and the + in the timestamps URL-encoded as %2B):

    curl "http://localhost:8081/v1/usage/applications/rhsLVvJs?start=2014-10-09T11:07:18%2B0100&end=2014-10-09T11:11:14%2B0100"
{
  "statistics": [
    {
      "status": "STARTING",
      "id": "rhsLVvJs",
      "applicationId": "rhsLVvJs",
      "start": "2014-10-09T11:07:18+0100",
      "end": "2014-10-09T11:11:14+0100",
      "duration": 236354,
      "metadata": {}
    },
    {
      "status": "RUNNING",
      "id": "rhsLVvJs",
      "applicationId": "rhsLVvJs",
      "start": "2014-10-09T11:11:14+0100",
      "end": "2014-10-09T11:16:13+0100",
      "duration": 298443,
      "metadata": {}
    }
  ],
  "links": {}
}
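
Usage for a single machine can be requested in the same way, by machine id (shown here with the machine id from the machines example above):

    curl http://localhost:8081/v1/usage/machines/yqqA9Moy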

Audit trails

Audit trails are essential for determining why an event happened, and who was responsible.

Persisting the AMP logs is one approach to storing the audit trail; subsequent offline analysis can then be performed. For example, Logstash (via syslog) could be used to collect all logs.

Integration with monitoring dashboards

AMP provides a web-console for monitoring the applications and drilling into the current state and the actions being performed. However, this is more of a debug console; most enterprises prefer to use their existing single pane of glass for their operations staff. The AMP REST API provides access to all information shown in the web-console.

Integration with the monitoring dashboard could involve the dashboard making REST API calls into AMP to retrieve the required information.
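
For example, a dashboard might periodically poll the applications endpoint to display deployment status; a minimal sketch (the credentials are placeholders and depend on your security configuration):

    # Illustrative poll of current applications and their states
    curl -u admin:password http://localhost:8081/v1/applications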

Alternatively, AMP could push events to a given endpoint. This requires wiring up a listener for the desired events, and potentially bespoke code for pushing to the given endpoint (depending on the technology used).

Logbook

The logbook offers the possibility to query and view logs in the UI. By default, logs are stored in files as per the configuration in etc/org.ops4j.pax.logging.cfg. The logbook can be configured against different log aggregation sources by setting the brooklyn.logbook.logStore parameter in brooklyn.cfg and, depending on which backend is used, other parameters.

For example, to use the local log files written by Cloudsoft AMP in a production environment, assuming they are written under /var/logs, you could configure:

        brooklyn.logbook.logStore=org.apache.brooklyn.util.core.logbook.file.FileLogStore
        brooklyn.logbook.fileLogStore.path=/var/logs/brooklyn/amp.debug.log

The default mode is to use the local log file in data/log/ relative to the launch directory.

The FileLogStore implementation is not compatible with multiline log entries; only the first line of each entry will be printed.

In production environments where log data is desired to be retained, Cloudsoft AMP supports Elasticsearch backends. This can be a dedicated ES environment for use by Cloudsoft AMP or a shared/managed ES facility that handles many logs, or – for lightweight usage – a simple local ES server running on the same instance as Cloudsoft AMP. As with any log storage requirement, the sizing, scaling, backup and maintenance of the logging environment requires careful attention. Elasticsearch includes numerous options to configure these, with one suggested configuration outlined in more detail below.

By default, only users with the root or powerUser entitlement, or an explicit logViewer entitlement, are able to see log info through Cloudsoft AMP.
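
If the per-user entitlements manager is in use, such an entitlement could be granted in brooklyn.cfg along these lines (a sketch; the username is illustrative):

    # Illustrative: user "ops" may view logs without other elevated rights
    brooklyn.entitlements.global=org.apache.brooklyn.core.mgmt.entitlement.PerUserEntitlementManager
    brooklyn.entitlements.perUser.ops=logViewer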

The Logbook UI widget can be found throughout the product:

  • on the About section, where all logs can be viewed
  • on the Inspector Entity Summary and Activity views, where logs filtered by entity ID and activity ID respectively are shown
  • on the Debug Console, below the controls and output

Suggested Elasticsearch Setup

OpenSearch (OpenDistro for Elasticsearch) is an Apache-licensed open-source backend that works well with Cloudsoft AMP, with this configuration in brooklyn.cfg:

        brooklyn.logbook.logStore=org.apache.brooklyn.util.core.logbook.opensearch.OpenSearchLogStore
        brooklyn.logbook.openSearchLogStore.host=https://localhost:9200
        brooklyn.logbook.openSearchLogStore.index=brooklyn
        brooklyn.logbook.openSearchLogStore.user=admin
        brooklyn.logbook.openSearchLogStore.password=admin
        brooklyn.logbook.openSearchLogStore.verifySsl=false

Routing Logs to Elasticsearch

There are many solutions for routing log messages from Cloudsoft AMP to Elasticsearch, either plugging in to the log4j subsystem or routing the log files from disk. Fluentd, with the following configuration in td-agent.conf, is a simple and effective way to forward content appended to the info and debug log files:

<source>
  @type tail
  @id input_tail_amp.info
  @log_level info
  <parse>
    @type multiline
    format_firstline /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{3}Z/
    format1 /^(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{3}Z) (?<taskId>\S+)?-(?<entityIds>\S+)? (?<level>\w{4} |\w{5})\W{1,4}(?<bundleId>\d{1,3}) (?<class>(?:\S\.)*\S*) \[(?<threadName>\S+)\] (?<message>.*)/
    time_format %Y-%m-%dT%H:%M:%S,%L
  </parse>
  path /var/logs/brooklyn/amp.info.log
  pos_file /var/log/td-agent/amp.info.log.pos
  tag amp.info
</source>

<source>
  @type tail
  @id input_tail_amp.debug
  @log_level debug
  <parse>
    @type multiline
    format_firstline /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{3}Z/
    format1 /^(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{3}Z) (?<taskId>\S+)?-(?<entityIds>\S+)? (?<level>\w{4} |\w{5})\W{1,4}(?<bundleId>\d{1,3}) (?<class>(?:\S\.)*\S*) \[(?<threadName>\S+)\] (?<message>.*)/
    time_format %Y-%m-%dT%H:%M:%S,%L
  </parse>
  path /var/logs/brooklyn/amp.debug.log
  pos_file /var/log/td-agent/amp.debug.log.pos
  tag amp.debug
</source>

<filter amp.debug>
  @type grep
  regexp1 level DEBUG
</filter>

<match amp.*>
  @type elasticsearch
  hosts https://localhost:9200
  user admin
  password admin
  ssl_verify false
  logstash_format false
  index_name brooklyn
</match>

The filter block is needed so that only DEBUG-level entries are picked up from the debug source, as INFO and higher levels are already captured from the info file.

Sizing and Rotating Logs

Keeping log data obviously consumes disk storage, and serving the data requires memory. The log levels in Cloudsoft AMP can be configured on a fine-grained log-category basis, and different levels and categories can be routed to different indexes with different retention policies.

When designing your strategy for maintaining logs, some good rules of thumb are:

  • Allocate 4 GB RAM for a production ES instance plus 2 GB RAM for every TB of log data that is searchable
  • Consider a small number of tiers with different retention policies, e.g. INFO and selected DEBUG messages (logins) to one index, and all other DEBUG and lower messages sent to another index
  • Consider using rolling indices on a nightly or weekly basis, and an auto-close job and/or an auto-delete job to keep memory and disk usage at a steady state; for example the INFO and selected DEBUG messages might go to an index which rotates weekly and is deleted or moved to cold storage after two years, whereas DEBUG and lower messages might rotate daily and be deleted after a week
  • The amount of log data can vary depending on how Cloudsoft AMP is used, so monitor usage to get a feel for what is maintainable, and put notifications in place if disk and/or memory usage become high
  • Review the logs and turn off unnecessary categories

Instructions and links to assist with this are below.

Index partitioning

It's possible to configure Fluentd to send data to an index whose name is generated from date-time markers. This example creates and sends the data to a new index every day:

<match amp.*>
  @type elasticsearch
  hosts https://localhost:9200
  user admin
  password admin
  ssl_verify false

  include_timestamp true
  index_name ${tag}-%Y.%m.%d
  flush_interval 5s
  <buffer tag, time>
    timekey 60 # one buffer chunk per minute ("3600" for hourly chunks)
    flush_interval 5s
  </buffer>
</match>

Cloudsoft AMP can then be configured to query using an index pattern that matches these daily indexes, e.g.:

    brooklyn.logbook.openSearchLogStore.index = amp*

Index lifecycle management

Index State Management (ISM) policies also allow handling the lifecycle of the indexes. For example, to delete debug indexes 60 days after creation, based on the index naming pattern used on this page:

{
  "policy": {
    "description": "Delete workflow",
    "default_state": "new",
    "schema_version": 1,
    "states": [
      {
        "name": "new",
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "60d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "delete": {}
          }
        ]
      }
    ],
    "ism_template": {
        "index_patterns": ["amp.debug*"],
        "priority": 100
      }
  }
}

With these building blocks, and others linked below, you can configure the retention policy that suits your environment, balancing the trade-off between data availability and resource usage.