James's Ramblings

AWS: EC2

Created: September 16, 2020 (Updated: October 03, 2020)

Metadata and User Data

  • User data is a script that runs when an instance launches.
  • Both metadata and user data are accessible from within an instance.
  • Instance metadata is available at:
    http://169.254.169.254/latest/meta-data/
    
  • Instance user data is available at:
    http://169.254.169.254/latest/user-data
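
A minimal Python sketch of reading these endpoints from inside an instance, using only the standard library. The helper names are hypothetical. It assumes the token-based IMDSv2 flow, which these notes do not cover; where IMDSv1 is still permitted, a plain GET without the token header should also work.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"  # link-local address, plain HTTP


def token_request(ttl_seconds: int = 60) -> urllib.request.Request:
    """Build the PUT request that asks IMDSv2 for a session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def data_request(path: str, token: str) -> urllib.request.Request:
    """Build a GET request for e.g. 'meta-data/instance-id' or 'user-data'."""
    return urllib.request.Request(
        f"{IMDS_BASE}/{path.lstrip('/')}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch(path: str, timeout: float = 2.0) -> str:
    """Fetch one metadata or user-data path. Only works on an EC2 instance."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(data_request(path, token), timeout=timeout) as resp:
        return resp.read().decode()
```

On an instance, fetch("meta-data/instance-id") would return the instance ID and fetch("user-data") the launch script.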
    

Change the EC2 Instance Type

  • EBS-backed instances only.
  • Stop the instance.
  • Right click the instance in the AWS Management Console and choose Instance Settings > Change Instance Type.
  • Start the instance.
  • Data is retained.
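
The same stop, change, start sequence can be scripted. This is a sketch rather than a definitive implementation: the function name is hypothetical, and `ec2` is assumed to be a boto3 EC2 client created elsewhere (for example with boto3.client("ec2")).

```python
def change_instance_type(ec2, instance_id: str, new_type: str) -> None:
    """Stop an EBS-backed instance, change its type, and start it again.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    ec2.stop_instances(InstanceIds=[instance_id])
    # The type can only be changed once the instance is fully stopped.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )
    ec2.start_instances(InstanceIds=[instance_id])
```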

Placement Groups

  • Cluster. All instances in a single AZ. Low latency. Downside: capacity might not be available.

  • Spread. Instances are on distinct hardware. Maximum of 7 running instances per placement group per AZ.

  • Partition. Each partition is a logical group of instances. Each partition is on distinct hardware, but members of the same partition may share hardware. Maximum of 7 partitions per AZ. Instances can ascertain the partition they belong to using instance metadata.

Shutdown Behaviour (from within the OS)

  • Default: stopped. Can be set to terminate at instance creation or after the fact.
  • CLI Attribute: InstanceInitiatedShutdownBehavior
  • Termination Protection does not protect from in-OS shutdown signals.

Errors, Status Checks, and Monitoring

  • Status checks are performed every minute and return pass or fail.
  • Healthy instances have a status of “OK”.
  • Unhealthy instances have a status of “Impaired”.

  • InstanceLimitExceeded: reached the maximum number of vCPUs per region. The default limit is 32. The limit can be raised by contacting AWS. Previously there was a limit of 20 instances per region.

  • InsufficientInstanceCapacity: AWS currently has no capacity for the request in the chosen Availability Zone.
  • StatusCheckFailed_System: a problem involving the underlying infrastructure. AWS need to fix it.
  • StatusCheckFailed_Instance: a problem that must be fixed by the instance administrator.

  • CloudWatch can monitor EC2 and can be configured to take action automatically if an EC2 status check fails.
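
As one illustration of automatic action on a failed status check, the sketch below builds the arguments for a CloudWatch alarm that triggers the EC2 recover action when StatusCheckFailed_System fails. The function name, alarm name, period, and evaluation count are assumptions chosen for the example.

```python
def recover_alarm_params(instance_id: str, region: str) -> dict:
    """Keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,  # seconds; assumed value
        "EvaluationPeriods": 2,  # assumed value
        "Threshold": 1.0,  # the metric is 0 (pass) or 1 (fail)
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }
```

With a boto3 CloudWatch client this would be passed as cloudwatch.put_metric_alarm(**recover_alarm_params(instance_id, region)).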

If Instance Launch is Failing

  • EBS volume limits.
  • EBS snapshot corruption.
  • Volume is encrypted and there is no key access.
  • An instance-store-backed AMI is missing part of the image.

  • “State Transition Reason” and “State Transition Message” are two columns that can be enabled in the EC2 Management console to show the exact reason for failure.

Common SSH Issue Causes

  • POSIX permissions on the private key are too open (400 required).
  • Username is incorrect - varies per OS.
  • Security Group or Network ACL.
  • High CPU load.
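
The key permission problem is easy to check for locally. A small sketch using the Python standard library on a POSIX system; the function names are hypothetical.

```python
import os
import stat


def key_permissions_too_open(path: str) -> bool:
    """Return True if group or others have any access to the key file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def restrict_key(path: str) -> None:
    """Equivalent of `chmod 400`: read-only for the owner, nothing else."""
    os.chmod(path, stat.S_IRUSR)
```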

Per Second Billing (Vs Per Minute)

EC2 usage is billed in one-second increments, with a minimum of 60 seconds. Similarly, provisioned storage for EBS volumes is billed in per-second increments, with a 60-second minimum. Per-second billing is available for instances launched in:

  • On-Demand, Reserved and Spot forms
  • All regions and Availability Zones
  • Amazon Linux and Ubuntu
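
The 60-second minimum is simple arithmetic. A sketch with hypothetical function names and an illustrative hourly rate, not a real price:

```python
def billed_seconds(runtime_seconds: int) -> int:
    """Per-second billing with a 60-second minimum."""
    return max(60, runtime_seconds)


def on_demand_cost(runtime_seconds: int, hourly_rate: float) -> float:
    """Cost of one run, given an hourly rate in any currency."""
    return billed_seconds(runtime_seconds) * hourly_rate / 3600
```

An instance that runs for 10 seconds is billed for 60, while one that runs for 61 seconds is billed for 61.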

Launch Types

  • On-demand. Per second billing after the first minute for Linux.

  • Reserved (1y or 3y commitment).
    • Reserved Instances: standard 24/7. Cannot change the instance type. Up to 75% discount.
    • Convertible Reserved Instances: can change the instance type. Up to 54% discount.
    • Scheduled Reserved Instances.
    • Can pay nothing upfront, partially upfront, or all upfront.
  • Spot Instances. Up to 90% discount. You lose the instance when the Spot price rises above your configured maximum price.
    • There is a grace period of 2 minutes.
    • Instances get stopped or terminated.
    • When instances from a persistent spot request need to be terminated, the spot request needs to be cancelled first; otherwise the request launches replacement instances.
    • Cancelling a spot request does not terminate instances.
  • Spot Block.
    • Block a spot instance from termination for 1 to 6 hours.
    • Can still be terminated, although this is rare.
  • Dedicated Instances.
    • Hardware is only used by that AWS account.
    • Other instances from the same account may be on the same hardware.
    • No control over instance placement.
  • Dedicated Hosts.
    • Can be paid for On-Demand or reserved for one or three years.
    • Visibility into physical cores and CPU sockets.
    • Licensing and regulation requirements.

Spot Fleet

  • A set of Spot Instances and (optional) On-Demand Instances.
  • Define a target capacity with price constraints.
  • Need to define launch pools (OS, instance type, AZ).
  • Instances stop launching at max cost or target capacity.
  • Possible strategies to choose the launch pool:
    • lowestPrice: lowest price pool
    • diversified: spread across all pools
    • capacityOptimized: pool with the optimal capacity

Savings Plans

Get a discount of up to 72% for a commitment to spend a consistent amount for a 1 or 3 year period.

Use the AWS Cost Explorer for recommendations.

Instance Types

https://www.ec2instances.info

https://aws.amazon.com/ec2/instance-types/

  • General purpose: A, M, and T.
    • A1: ARM processors. Focussed on cost savings.
    • M6g, M5, M5a, and M4: general purpose.
    • M5n: higher bandwidth.
    • T4g, T3, T3a, and T2: burstable instances. Usage below a baseline threshold accumulates CPU credits that allow bursting; one CPU credit allows one vCPU to run at 100% for one minute. Unlimited mode allows bursting beyond the credit balance but makes costs unpredictable. Unlimited mode is enabled by default on T4g, T3, and T3a; T2 defaults to standard mode.
  • CPU optimised: C.
    • C6g, C5, C5a, and C4: general purpose compute optimised.
    • C5n: compute optimised for HPC. Higher bandwidth.
  • Memory optimised: R, X, u, and z.
    • R6g, R5, R5a, and R4: general purpose memory optimised instances.
    • R5n: higher bandwidth.
    • X1e: for high-performance databases, in-memory databases, and other memory intensive applications.
    • X1: large-scale in-memory applications.
    • u: large in-memory databases on bare metal.
    • z1d: both high compute capacity and high memory footprint. NVMe instance storage as well.
  • Accelerated computing: P, Inf, G, and F.
    • P3 and P2: general purpose GPU instances.
    • Inf1: use the AWS Inferentia ASIC for ML applications.
    • G4: ML using a GPU and graphics applications.
    • G3: graphics applications.
    • F1: field programmable gate arrays (FPGAs).
  • Storage optimised: I, D, and H.
    • I3 and I3en: NVMe instance storage.
    • D2: high performance HDDs and consistent high performance at launch time.
    • H1: high performance HDDs.

Suffix Meanings

  • a: AMD processors.
  • g: Arm-based AWS Graviton2 processors.
  • n: higher bandwidth networking.
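
To show how the suffixes combine with the family letter and generation number, here is a small parser sketch. The function name and the regular expression are assumptions for illustration; names that do not follow the simple letters-digits-letters pattern (for example the u family) are rejected.

```python
import re

# Suffix meanings taken from the list above.
SUFFIXES = {
    "a": "AMD processors",
    "g": "Arm-based AWS Graviton2 processors",
    "n": "higher bandwidth networking",
}

# family letters, generation digits, optional suffix letters, then ".size"
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")


def describe(instance_type: str) -> dict:
    """Split a name such as 'm5n.large' into its parts."""
    match = _PATTERN.match(instance_type)
    if match is None:
        raise ValueError(f"unrecognised instance type: {instance_type!r}")
    family, generation, suffixes, size = match.groups()
    return {
        "family": family,
        "generation": int(generation),
        "size": size,
        "suffixes": [SUFFIXES.get(s, "not covered in these notes") for s in suffixes],
    }
```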

Amazon Machine Images (AMIs)

Components:

  • A template for the root volume of the instance.
  • Launch permissions to control AWS account access.
  • A block device mapping.

Additional information:

  • Volumes are either instance-store-backed or EBS-backed.
  • For EBS-backed AMIs, the volumes are created from snapshots stored on S3.
  • For instance-store-backed AMIs, the volumes are created from templates stored on S3.
  • Templates and snapshots are not visible in the S3 console.
  • AMIs are regional (because S3 is regional).
  • AMIs can be copied to other regions using the AWS console, command-line, or API.
  • By default AMIs are private.
  • Custom AMI generation is possible (and advisable in many situations).
  • As AMIs are stored in S3, standard S3 charges apply.

To make an AMI:

  • Make the desired changes to an EC2 instance.
  • Right click the EC2 instance in the EC2 console, go to Image > Create Image.
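
The console step has an API equivalent. A hedged sketch, assuming `ec2` is a boto3 EC2 client; the function name is hypothetical.

```python
def create_ami(ec2, instance_id: str, name: str) -> str:
    """Create an AMI from an instance and return the new image ID.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    response = ec2.create_image(InstanceId=instance_id, Name=name)
    image_id = response["ImageId"]
    # Block until the image has left the pending state.
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])
    return image_id
```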

Sharing AMIs

  • Sharing an AMI does not affect the ownership of the AMI.
  • Copying an AMI makes the resultant AMI owned by the copying AWS account.

  • To copy an AMI, it’s necessary to have read permissions on the associated S3 bucket or EBS snapshot.

  • It’s not possible to copy an AMI with an associated billingProduct code.

IP Addresses

  • Public IPs are not persistent when stopping/starting an instance.
  • Private IPs are persistent.
  • Elastic IPs are persistent public IPs, attached to an ENI.
  • Elastic IPs can be moved between ENIs.
  • There is a charge for an Elastic IP if it is not being used on a running instance.
  • Public hostnames change to match the public IP addresses when an Elastic IP is attached.
  • Addresses resolve to the private IP address of an instance when within the same network.
  • There is a soft limit of 5 Elastic IPs per AWS account.
  • An Elastic IP can be remapped as a form of failover.
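
Remapping for failover can be sketched as a single API call. This assumes `ec2` is a boto3 EC2 client and that a standby instance already exists; the function name is hypothetical.

```python
def fail_over_elastic_ip(ec2, allocation_id: str, standby_instance_id: str) -> str:
    """Move an Elastic IP to a standby instance and return the association ID.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    response = ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_instance_id,
        # Allow the address to be taken from the instance that currently holds it.
        AllowReassociation=True,
    )
    return response["AssociationId"]
```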

CloudWatch and EC2

EC2 Documentation: List of CloudWatch metrics

Default Metrics:

  • Basic Monitoring (free): 5-minute interval.
  • Detailed Monitoring (cost): 1-minute interval.
  • CPU utilisation (+ Credit Usage/Balance for T2/T3).
  • Network upload/download.
  • Disk IO (for instance-store-backed instances only).
  • EC2 Status Check.

Custom Metrics:

  • Basic Resolution: 1-minute.
  • High Resolution: customisable down to 1 second.
  • Requires IAM configuration.

Defaults:

  • For launch configurations created via the console, basic monitoring is on by default.
  • For launch configurations created via the CLI or an SDK, detailed monitoring is on by default.

Configuring Custom Metrics

  • Configure a region inside the instance with aws configure.
  • Attach an IAM role that grants access to CloudWatch.
  • Make a script that uses aws cloudwatch put-metric-data to push metrics to CloudWatch.
  • Put the script in a crontab.
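
The steps above use the AWS CLI; the same idea in Python might look like the sketch below. It assumes `cloudwatch` is a boto3 CloudWatch client with a region and IAM role already in place. The namespace, metric name, and function names are hypothetical.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def push_disk_metric(cloudwatch, instance_id: str, path: str = "/") -> None:
    """Publish one custom metric data point; suitable for running from cron."""
    cloudwatch.put_metric_data(
        Namespace="Custom/EC2",  # hypothetical namespace
        MetricData=[{
            "MetricName": "DiskUsedPercent",  # hypothetical metric name
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Value": disk_used_percent(path),
            "Unit": "Percent",
        }],
    )
```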

CloudWatch Agent

  • The CloudWatch Agent enables gathering of more metrics and also logs.
  • By default, instance logs are not pushed to CloudWatch.
  • This requires an IAM role to be configured to allow CloudWatch access.
  • Can collect metrics from applications or services using StatsD or collectd.
  • It’s also possible to install the CloudWatch Agent on on-prem servers.
  • The CloudWatch Agent can be installed manually, via AWS Systems Manager, or by CloudFormation.
  • How to install the CloudWatch Agent on an EC2 instance.

Scheduled downtime

  • The AWS Personal Health Dashboard allows you to view upcoming scheduled downtime impacting resources in the AWS account.

  • Stopping and starting an instance will typically resolve this by moving it to different hardware.

Auto Scaling groups

  • ASGs integrate with CloudWatch using metrics.

  • If a health check fails, CloudWatch reports the failure to the ASG, which then takes action. The ASG terminates the instance, and launches a replacement.

  • There are two types of health check related to ASGs: EC2 Status Checks and ELB Health Checks.

  • EC2 Status Checks are on by default and cannot be turned off.

  • ELB Health Checks are off by default but are recommended to be on. When on, an ASG works in concert with an ELB. When off, an ELB may determine that an instance is unhealthy and stop using it, but the ASG has no trigger to terminate the unhealthy instance and launch another.

  • ASGs function with ELBs by attaching to target groups, which are logical groupings of instances. Target groups must be created first.

  • CloudWatch Auto Scaling group metrics collection is disabled by default. Enabling this feature, in the monitoring tab of an ASG, allows the tracking of metrics related to scaling.

  • The type of instance that is launched when scaling occurs is determined by a launch template or launch configuration. Every ASG must have one or the other.

  • Desired capacity, minimum capacity, and maximum capacity can be specified within the ASG configuration.

  • Optional scaling policies allow for dynamic resizing of ASGs via CloudWatch metrics.

  • Optional scale-in protection: the ASG will not terminate protected instances when scaling in.

Launch configurations and launch templates

  • Launch Configuration: specify the AMI and instance type.
    • Cannot edit.
    • To change, make a new one and update the ASG.
  • Launch Templates: a newer alternative to Launch Configurations with several advantages.
    • Can have multiple versions.
    • Can edit.
    • Can use Dedicated Hosts. Can use both Spot and On-Demand.
    • Use multiple instance types.
    • Can configure advanced settings such as termination protection, shutdown behaviour and placement groups.

ASG scaling policies

  • Maintain (a fixed number of instances).
  • Manual (via the console or CLI).
  • Scheduled scaling.
  • Dynamic scaling. There are several sub-types (see below).
  • Predictive scaling via AWS Auto Scaling (using ML); this is not EC2 Auto Scaling but a layer on top.

Dynamic scaling policies

  • Target tracking: triggers scaling to keep a metric at, or close to, a specific target value.

  • Simple scaling policy: trigger based on a CloudWatch alarm. Must wait for scaling activity or instance replacement to complete, and the cooldown period to expire, before triggering again.

  • Step scaling policy: trigger based on a CloudWatch alarm. The size of the scaling depends on the size of the breach. Can respond to additional scaling needs during scaling activity and health check replacement events. Recommended over simple scaling policy for most situations.
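
To make "the size of the scaling depends on the size of the breach" concrete, here is a toy model of step adjustments. It is a simplified sketch, not the service's implementation; the step table and function name are made up for illustration.

```python
# Each step is (lower_bound, upper_bound, instances_to_add).
# Bounds are offsets above the alarm threshold; None means unbounded.
# These values are illustrative only.
STEPS = [(0, 10, 1), (10, 20, 2), (20, None, 4)]


def step_adjustment(metric_value: float, threshold: float, steps=STEPS) -> int:
    """Return how many instances to add for a given breach of the threshold."""
    breach = metric_value - threshold
    for lower, upper, adjustment in steps:
        # Lower bound inclusive, upper bound exclusive.
        if breach >= lower and (upper is None or breach < upper):
            return adjustment
    return 0  # metric is below the threshold: no scale-out
```

With a threshold of 70% CPU, a reading of 75% adds one instance while a reading of 95% adds four.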

Target tracking policies using CloudWatch custom metrics can be used to scale the number of instances based on the size of SQS queues. See the AWS documentation: Scaling based on Amazon SQS.
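
A common custom metric for this is the backlog per instance: queue length divided by running instances, compared against how many messages one instance can work through within an acceptable latency. The arithmetic is sketched below with hypothetical function names and example numbers.

```python
import math


def backlog_per_instance(queue_length: int, running_instances: int) -> float:
    """Messages waiting per running instance (the custom metric to publish)."""
    return queue_length / max(1, running_instances)


def acceptable_backlog(max_latency_seconds: float, seconds_per_message: float) -> float:
    """Target value: messages one instance can clear within the latency budget."""
    return max_latency_seconds / seconds_per_message


def instances_needed(queue_length: int, max_latency_seconds: float,
                     seconds_per_message: float) -> int:
    """Instances required to keep the backlog at or below the target."""
    target = acceptable_backlog(max_latency_seconds, seconds_per_message)
    return math.ceil(queue_length / target)
```

For example, with 1500 queued messages, 0.1 seconds of processing per message, and a 10 second latency budget, each instance can hold a backlog of 100 messages, so 15 instances are needed.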

Termination policies

  • Determine the criteria for when an instance is terminated.
  • Custom policies can be created.

Default termination policy

  1. Determine which AZ has the most instances.

  2. Determine which instance to terminate so as to align the remaining instances to the allocation strategy for On-Demand or Spot Instances.

  3. Determine whether any of the instances use the oldest launch template.

  4. Determine whether any of the instances use the oldest launch configuration.

  5. After applying all of the criteria in 2 through 4, if there are multiple unprotected instances to terminate, determine which instances are closest to the next billing hour.
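
The ordering of these criteria can be illustrated with a simplified model. This sketch deliberately skips step 2 (allocation strategy), folds steps 3 and 4 into a single "oldest version" number, and ignores scale-in protection; the data class and its fields are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    az: str
    config_version: int     # lower means older launch template/configuration
    minutes_into_hour: int  # minutes elapsed in the current billing hour


def pick_instance_to_terminate(instances: list) -> Instance:
    """Apply a simplified version of the default termination policy."""
    # Step 1: keep only instances in the AZ(s) with the most instances.
    counts = Counter(i.az for i in instances)
    busiest = max(counts.values())
    candidates = [i for i in instances if counts[i.az] == busiest]
    # Steps 3 and 4: prefer instances launched from the oldest version.
    oldest = min(i.config_version for i in candidates)
    candidates = [i for i in candidates if i.config_version == oldest]
    # Step 5: pick the instance closest to its next billing hour.
    return max(candidates, key=lambda i: i.minutes_into_hour)
```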

Lifecycle Hooks

  • A lifecycle hook delays termination or start of an instance, to allow for actions to take place.
  • Pending:Wait (for new instances) and Terminating:Wait (for instances being terminated).
  • Actions are configurable and include:
    • An EventBridge (CloudWatch events) target to invoke a Lambda function.
    • A notification target (e.g. SNS).
    • Running a script.
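
Whatever performs the action during the wait state eventually has to tell the ASG to proceed. A sketch of that final call, assuming `autoscaling` is a boto3 Auto Scaling client; the function name and parameters are hypothetical.

```python
def finish_lifecycle_hook(autoscaling, group_name: str, hook_name: str,
                          instance_id: str, ok: bool = True) -> None:
    """Release an instance from Pending:Wait or Terminating:Wait.

    `autoscaling` is assumed to be a boto3 Auto Scaling client.
    """
    autoscaling.complete_lifecycle_action(
        AutoScalingGroupName=group_name,
        LifecycleHookName=hook_name,
        InstanceId=instance_id,
        # CONTINUE lets the transition proceed; ABANDON gives up on the instance.
        LifecycleActionResult="CONTINUE" if ok else "ABANDON",
    )
```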

RAID

  • RAID 0 and RAID 1 are ratified by AWS.
  • RAID is implemented at the instance OS-level rather than via AWS.
  • RAID 5 and RAID 6 are not recommended by AWS due to having a large IOPS cost.