Production Scalability

Scalability with Cloudsoft AMP falls into two categories:

  • Horizontal scalability refers to the ability to scale out by adding AMP Servers. See AMP Cluster for how to scale horizontally.
  • Vertical scalability refers to the amount of load one individual AMP server can handle.

This section deals with vertical scalability. The guiding principles to improve the maximum load are:

  • Reduce the kinds of work (e.g. offloading monitoring to an external dedicated system).
  • Reduce the level of work (e.g. increase poll periods).
  • Increase the size of the machine (in particular CPU and RAM) used for the AMP Server.

Monitoring

AMP does not install an agent on the machines it is managing. It commonly monitors application components through protocols such as SSH or HTTP for both health and performance, and this can require regular polling. There are several ways to reduce the impact of this polling, detailed below.

External Monitoring

An external system can instead be used for monitoring application components. When deploying application components, AMP can install and configure a monitoring agent. AMP can then retrieve the appropriate metrics from a central monitoring server in a single call that gives metrics for all of the desired components.

Efficient Polling

If using AMP to monitor an application, consider avoiding the use of ssh for polling. Making ssh connections and executing commands is slow and CPU intensive compared to other monitoring techniques, such as an HTTP-based management interface. See Health Check Sensors, and the sshMonitoring.enabled configuration option available on some entities.

One can disable non-essential polling. For example, the configuration below will, for some entities, limit the metrics retrieved to just health metrics:

metrics.usage.retrieve: false

One can decrease the polling frequency:

  • softwareProcess.serviceProcessIsRunningPollPeriod controls the poll period for ssh-based health checks (if not disabled)
  • If you are defining polling in YAML (e.g. using HttpRequestSensor or JmxAttributeSensor), the period configuration option controls how frequently it polls.
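As a rough sketch of how these options might be combined on a single entity, the fragment below sets them in the entity's brooklyn.config. The entity type, its name and the five-minute period are illustrative assumptions rather than recommended values, and the other configuration the entity needs is omitted.

```yaml
services:
- type: org.apache.brooklyn.entity.software.base.VanillaSoftwareProcess
  name: my-app                     # hypothetical entity name
  brooklyn.config:
    # Retrieve health metrics only (applies to some entities)
    metrics.usage.retrieve: false
    # Poll the ssh-based health check less often (illustrative value)
    softwareProcess.serviceProcessIsRunningPollPeriod: 5m
```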

When using enrichers, consider including enricher.suppressDuplicates: true. Similarly for HttpRequestSensor, consider using suppressDuplicates: true. This ensures that the same sensor value is not published repeatedly, which can cause a chain of events when processing the value. This is particularly important for health metrics that repeatedly say the same thing. However, for performance and load metrics it can be necessary to repeat the value (for example, a duplicate “request count” implies that there have been no requests handled since the last event).
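The fragment below is a minimal sketch of a YAML-defined HttpRequestSensor that combines a longer poll period with suppressDuplicates. The sensor name, port, URL path and JSON path are hypothetical, and the entity's remaining configuration is omitted. An enricher would take enricher.suppressDuplicates: true in its own brooklyn.config in the same way.

```yaml
services:
- type: org.apache.brooklyn.entity.software.base.VanillaSoftwareProcess
  name: my-app                             # hypothetical entity name
  brooklyn.initializers:
  - type: org.apache.brooklyn.core.sensor.http.HttpRequestSensor
    brooklyn.config:
      name: app.health.status              # hypothetical sensor name
      # Hypothetical health endpoint on the managed host
      uri: $brooklyn:formatString("http://%s:8080/health", $brooklyn:attributeWhenReady("host.address"))
      jsonPath: $.status                   # hypothetical response field
      period: 60s                          # poll once a minute (illustrative)
      suppressDuplicates: true             # publish only when the value changes
```

Because a health status rarely changes, suppressing duplicates here avoids re-triggering downstream processing on every poll. A load or request-count sensor would usually leave it unset, for the reason given above.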

Machine Provisioning

In most cases, Cloudsoft AMP uses Apache jclouds for machine provisioning. Several optimisations can be made to the configuration of Apache jclouds through AMP, as detailed below.

Key Pairs and Security Groups

A number of configuration options reduce the amount of work involved in provisioning a VM:

  • Use keyPair to define a pre-existing key-pair (e.g. in EC2). Depending on the cloud and whether it is Linux or Windows, you may also have to specify loginUser.privateKeyData for the private side of the key-pair. If keyPair is not supplied, jclouds will (depending on the cloud) auto-generate a new key-pair for each VM being provisioned.
  • Use a pre-existing security group, via the securityGroups configuration option. Also configure inboundPorts: []. If inboundPorts is not explicitly empty, jclouds will auto-create a security group to open these ports.

VM Image and Size

The VM image and size can either be declared explicitly or inferred from guidelines. To minimise the work required to provision machines, specify explicit values:

  • imageId: explicit image, rather than having to query and infer from osFamily: centos etc.
  • hardwareId: explicit hardware profile (e.g. m4.large), rather than having to query and infer from minCores etc.
  • region: this should be specified in the location, rather than inferred. It is commonly specified in the location type, such as aws-ec2:us-east-1.

VM Status Polling

When a VM is provisioned or destroyed, jclouds will poll for the VM entering the desired state. The poll interval is configurable:

  • jclouds.compute.poll-status.initial-period: initial polling interval in milliseconds when waiting for the VM to be running (defaults to 50ms). The interval increases exponentially up to max-period.
  • jclouds.compute.poll-status.max-period: maximum time between polls (defaults to 1000ms).

For additional jclouds configuration options, see the jclouds code including ComputeServiceConstants and ComputeServiceProperties.

Rate Limiting

Rate limiting (by the cloud provider) is an important consideration when provisioning or deleting many VMs concurrently. For some clouds (e.g. AWS) the response gives no indication of how long one must back off. In jclouds, exponential backoff is supported for each individual request. The backoff time is configurable and is calculated as Math.min(2^failureCount * retriesDelayStart, retriesDelayStart * 10) + random(10%):

  • jclouds.retries-delay-start controls the number of milliseconds for the first backoff.
  • jclouds.max-retries controls the maximum number of attempts for a given command when rate limited.
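As a worked illustration, the arithmetic below takes the formula above at face value and ignores the random jitter term. It assumes a retriesDelayStart of 3000 ms (an illustrative value, written s) and that failureCount (written k) starts at 1.

```latex
\begin{align*}
D_k &= \min\left(2^{k} \cdot s,\; 10 \cdot s\right), \qquad s = 3000\ \text{ms} \\
D_1 &= \min(6000,\ 30000) = 6000\ \text{ms} \\
D_2 &= \min(12000,\ 30000) = 12000\ \text{ms} \\
D_3 &= \min(24000,\ 30000) = 24000\ \text{ms} \\
D_4 &= \min(48000,\ 30000) = 30000\ \text{ms}
\end{align*}
```

Under these assumptions the delay reaches its cap of ten times the starting delay at the fourth failure and stays there, so raising jclouds.retries-delay-start lengthens both the individual steps and the cap.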

In some clouds, rate limiting is handled differently. For example, in Azure the HTTP 429 response includes the time when the request can be retried.

Machine Provisioning Retries

Sometimes a machine is “dead on arrival” - for example, it immediately transitions from starting to failed, or it is never ssh’able. In such circumstances, one can retry. The relevant configuration options are:

  • machineCreateAttempts: maximum number of attempts to provision a desired VM.
  • destroyOnFailure: whether to delete the VM if provisioning fails.

The defaults for these options are fine for most use-cases.
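If the defaults do need overriding, a sketch along the following lines sets both options on the location. The values shown are illustrative assumptions only.

```yaml
location:
  aws-ec2:us-east-1:
    machineCreateAttempts: 3    # try up to three times per VM (illustrative)
    destroyOnFailure: true      # delete the VM if provisioning fails
```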

Limiting Concurrent Requests

Provisioning or deleting many VMs concurrently can cause extreme rate limiting (where even the exponential backoffs fail).

It is possible to constrain the maximum number of concurrent creations and deletions:

  • maxConcurrentMachineCreations: max number of concurrent VM provision requests within a single app.
  • maxConcurrentMachineDeletions: max number of concurrent VM deletion requests within a single app.

These configuration options apply within a single app (e.g. if set to 10, then deploying 10 large apps concurrently would result in up to 100 concurrent VM creations). To limit concurrency across all apps within an AMP instance, it is possible to use machineCreationSemaphore and machineDeletionSemaphore. However, this requires some Java coding and would also not apply across AMP Worker instances when using AMP Cluster.

Limiting Machine Configuration Actions

If ssh’ing to the machine is enabled, a number of additional configuration actions will optionally be executed over ssh:

  • Create and configure a user, with appropriate RSA key, DSA key or password. This can be disabled, so that the initial login user is used instead, with dontCreateUser: true and user set to that login user.
  • lookupAwsHostname: whether to ssh to the VM to find its hostname (if false, just uses the IP address).
  • openIptables: whether to open the “inbound ports” in iptables or firewalld
  • installDevUrandom: whether to use /dev/urandom instead of /dev/random (to work around a lack of entropy)

Even less work is needed if no ssh or WinRM access is required; see No SSH Access.
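As a sketch, the location fragment below switches off all of these optional ssh actions. The user value is hypothetical and should be whatever initial login user the chosen image provides. Whether each action can safely be skipped depends on the image and the application.

```yaml
location:
  aws-ec2:us-east-1:
    dontCreateUser: true        # keep using the initial login user
    user: ec2-user              # hypothetical: the image's initial login user
    lookupAwsHostname: false    # use the IP address rather than ssh'ing for the hostname
    openIptables: false         # assumes the image's firewall already permits the ports
    installDevUrandom: false    # assumes entropy is not a problem on this image
```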

Example Usage

Some suggested values are shown below, but the best configuration will vary according to the use-case and environment. Note these configuration options can be declared on the location in the catalog, in-line in the location section of the application yaml, or for a specific entity in its provisioning.properties configuration:

location:
  aws-ec2:us-east-1:
    keyPair: my-keypair-name
    loginUser.privateKeyData: $brooklyn.external("creds", "aws-keypair-privateKeyData")
    imageId: us-east-1/ami-12345678
    hardwareId: m4.large
    securityGroups: my-security-group
    inboundPorts: []
    jclouds.retries-delay-start: 3s
    jclouds.max-retries: 10
    jclouds.compute.poll-status.initial-period: 2000
    jclouds.compute.poll-status.max-period: 10000
    maxConcurrentMachineCreations: 50
    maxConcurrentMachineDeletions: 50

No SSH Access

One can use entities in AMP that do not require any SSH access. Example use-cases for this include:

  • Immutable infrastructure: the VM (or container) image is pre-configured with everything required. This works well with tools such as Docker, UForge and Packer.
  • Use of an external configuration management system: the VM is pre-configured with an appropriate agent (e.g. for Puppet or Chef), and the software is installed by one of these systems.

Relevant location configuration options include:

  • pollForFirstReachableAddress: whether to poll for the VM to be reachable on ssh port 22 (or the WinRM port for Windows); this just checks network connectivity, rather than forming an ssh connection.
  • waitForSshable: whether to block provisioning until the VM is ssh’able (polling to execute a trivial ssh command).
  • waitForWinRmAvailable: same as waitForSshable, but for Windows machines.

Additional entity configuration options include:

  • onbox.base.dir.skipResolution: whether to skip inferring (via ssh) the base directory.
  • sshMonitoring.enabled: whether to poll over ssh for a health-check (applies to some entities).
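Putting the location and entity options together, a blueprint with no ssh access might look roughly like the sketch below. The entity type and name are assumptions chosen for illustration, and credentials and image settings are omitted.

```yaml
location:
  aws-ec2:us-east-1:
    pollForFirstReachableAddress: false   # do not probe port 22
    waitForSshable: false                 # do not block provisioning on ssh
services:
- type: org.apache.brooklyn.entity.software.base.EmptySoftwareProcess
  name: preconfigured-vm                  # hypothetical entity name
  brooklyn.config:
    onbox.base.dir.skipResolution: true   # skip ssh-based base directory lookup
    sshMonitoring.enabled: false          # no ssh health-check polling
```

For Windows machines, waitForWinRmAvailable: false would take the place of waitForSshable.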

For more information, see Provisioning Machine Requirements.

Persistence

AMP persists its state to either an object store or the file system (e.g. an NFS mount). This is essential for service restart and for high availability (see Persistence).

Each time an entity’s state changes, it is re-persisted. This can lead to heavy load, particularly for a public cloud object store. Relevant configuration options include the following:

  • persistPeriod: maximum frequency for writing changes (defaults to 1 second).
  • persister.threadpool.maxSize: maximum number of concurrent threads writing the persisted state (defaults to 10).

Web Console Usage

When operating at large scale (e.g. 1000s of apps in an AMP Server), the web-console can become sluggish or unresponsive.

Use of the CLI and REST API is recommended to work around such problems.

AMP Server Setup

For further details of setting up a production AMP Server, see Requirements and Production Installation.