Linux has a long and strong reputation for rarely needing a reboot – and it lives up to that reputation very well.
Recently I had to devise a solution for a case where it frequently needs a reboot, but you can’t easily take one.
AWS ASGs are notorious for being quick to terminate a rebooting Linux instance because they deem it unhealthy. Making the health check long enough to accommodate the instance build and reboot will in many cases yield a health check that is too long for daily production operations, which defeats the whole point of the health check.
Yet if you perform comprehensive OS patching during ASG provisioning of a new instance, you will eventually end up with a pending kernel patch due to the age of the AMI the ASG was commissioned with.
AWS Amazon Linux 1 is very stable, so new AMI releases with updated patches can be 6 to 9 months or more apart, which increases the possibility of critical kernel vulnerability patches awaiting a reboot that will never happen.
Let’s look at a simple, effective solution to avoid this problem during ASG instance provisioning that can also be used to perform regular patching of an autoscaling group of instances.
BTW – there is a lot of value to adding this pattern to your Windows instances as well – so you can read this article and the provided CloudFormation template with an eye to that as well!
Regularly Release and Re-Roll a Patched Custom AMI
Many AWS customers will resort to creating a patched AMI that they redeploy, and while this seems like a reasonable approach at first glance, it has higher complexity and a much higher long-term cost of ownership than what is proposed in this post because:
- The human procedures or automation build-out for creation of the AMIs must be invested in if it is to be done regularly and with configuration consistency.
- Over time, the accumulation of AMIs (which must at least be replicated to all regions where they are needed) adds up. If the patched AMIs aren’t at least being shared across accounts, the same accumulation happens on a per-account basis. The only offset is to have some sort of policy to delete old ones. It takes work and expense to devise a policy, to encourage people to move forward and to safely clean up old AMIs when you do not know what stacks might depend on them. There are ways to solve all these questions, but they all take design effort, operationalization and additional AWS services to manage the lifecycle of the created AMIs.
In Place Patching Via ASG Removal and Reinsertion
Another pattern discussed on the AWS site involves putting instances into standby, suspending health checks and, after rebooting, reversing the process to re-insert the rebooted instance.
Being able to simply use the Amazon Linux AMIs (or a static custom AMI) is much simpler and more cost effective in the long run.
In-Place Patching Using AWS Systems Manager
While AWS documentation provides a way to accomplish in-place patching, it does not inherently handle reboots required to finalize patching in ASGs that have tight health checks.
Windows Method Does Not Work for Linux
For Windows, we typically use cfn-init’s waitAfterCompletion, which is specifically designed to wait for a reboot and continue with the next command. However, as documented by Amazon, waitAfterCompletion only works for Windows.
ASG Lifecycle Hooks to the Rescue
I collaborated with my manager as I knew he had previously done an agile iteration to try using ASG Lifecycle hooks to implement Linux reboots during ASG scaling. I knew he thought that it would be a reasonable and convenient approach to the problem. With a little of his help along the way, I was able to create and then improve a working model using an ASG Launching Lifecycle Hook.
I find it very helpful to enumerate architecture heuristics of a pattern as it helps with:
- keeping track of the architecture that emerged from the ‘design by building’ effort.
- my own recollection of the value of a pattern when examining past things I’ve done for a new solution.
- others quickly understanding all the points of value of an offered solution, which helps guide whether they want to invest in learning how it works.
- facilitating customization or refactoring of the code by distinguishing purpose designed elements versus incidental elements.
I specifically like the model of using Constraints, Requirements, Desirements, Applicability, Limitations and Alternatives as it helps indicate the optimization of the result without stating everything as a “requirement”. This model is also more open to emergent architecture elements that come from the build effort itself.
- Requirement: (Satisfied) Allow an ASG to provision instances that are fully patched.
- Constraint: Without resorting to creating AMIs solely for the purpose of patching.
- Requirement: (Satisfied) takes reboots needed for kernel patching and core shared library updates.
- Constraint: but only rebooting when absolutely necessary (by detecting a pending reboot).
- Desirement: (Satisfied) It would be nice if the same, simple solution could perform monthly patching.
- Desirement: (Satisfied) It would be nice if the solution could use metadata to self-document the last forced patching cycle.
- Desirement: (Satisfied) It would be nice if the solution could allow for the build of the entire software stack before in-servicing an instance.
- Desirement: (Satisfied) It would be nice if the solution worked for multiple ASG update types (rolling and replacement).
- Desirement: (Satisfied) It would be nice if the code worked for non-ASG scenarios as well.
- Desirement: (Satisfied) While kernel patching reboots were the main impetus, it would be nice if the solution handled any scenario where a restart is required to finalize patching.
- Desirement: (Satisfied) Although designed around yum based distros, it would be nice if the framework could be reused with other distros’ package managers.
- Desirement: (NOT Satisfied) It would be nice if the solution could work with a fixed patch baseline to allow full DevOps environment promotion methods using a known, version pegged set of patches.
- Limitation: The patch level is dynamic and not a fixed baseline. When scaling occurs the newest instances will have patching up to date with their spin-up date. These newer patches will not have been tested with the application.
- Countermeasure: If you integrate automated QA testing with the provisioning of a new instance, you could catch problems with patching when they happen. Alternatively, you could run a separate nightly build of the server tier against the latest patches.
- Limitation: If you need to design for multiple or many reboots, you would have to do custom code to ensure userdata could pick up in the proper spot after each reboot.
- Countermeasure: This situation is exactly what cfn-init is for. If you have not previously used it, you can read up on how to implement it within the pattern in this post.
- Applicability: If you already release a per-ASG AMI for your own reasons (usually speed of scaling), then simply ensuring that AMI takes into account your desired patching frequency is a better solution. You could shorten your AMI release cycle to something like monthly so that satisfactory patching happens as part of the existing release process. This has the side benefit of version pegging your patching level, making it part of your development and automated QA, and ensuring that production runs on a tested patch level.
- Alternative: If you have an existing long AMI release cycle (greater than 6 months), you could combine it with the dynamic patching solution offered here to keep the cycle long (to keep the cost and logistics of managing old AMIs to a minimum if that is a high priority).
- Alternative: Critical Vulnerability Response. If you have an urgent enough patching scenario, you may wish to temporarily use this pattern to do dynamic patching when you do not normally support it.
- Alternative: Use for Windows as Well. With Windows’ long boot times, implementing a launching lifecycle hook can help immensely with ASG stability. Even if you already use cfn-init’s waitAfterCompletion for reboots, you can add the ASG lifecycle hook using this pattern.
- Limitation: This demo template relies on the default VPC security group being added to the instances and on it having default settings which allow internet access. If you have the default VPC security group nulled out (a great security practice!) or other networking configuration that limits internet access, you will need to update the template so that it has outbound internet access in your environment.
Minimal but Completely Working Template
The CloudFormation template is purposely minimal in order to more clearly demonstrate the concepts of the solution. At the same time it includes everything needed and works. The approach adheres to The Testable Reference Pattern Manifesto.
Tight Health Check (For Demonstration Purposes)
HealthCheckGracePeriod is set to 3 seconds to demonstrate that the health check is not in play during the launching lifecycle hook. This can be observed because the instance takes longer than 3 seconds to get ready but is not terminated.
Tested With Both ASG Updatepolicy Settings
The parameter UpdateType defaults to “RollingThroughInstances” which sets the UpdatePolicy to use AutoScalingRollingUpdate, but it can be changed to “ReplaceEntireASG” to set the UpdatePolicy to use AutoScalingReplacingUpdate. Although not tested with Lambda based updates, they would be expected to work just fine with this template. You could even add a scheduled Lambda to patch monthly by updating CloudFormation with a new PatchRunDate.
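As a rough illustration of the monthly roll, here is a minimal sketch that uses the AWS CLI in place of the Lambda mentioned above. The function name `roll_fleet`, the date format and the `CAPABILITY_IAM` capability are assumptions; any other stack parameters you have overridden would need a `UsePreviousValue=true` entry, otherwise they fall back to their defaults.

```shell
#!/usr/bin/env bash
# Hypothetical helper: update only PatchRunDate on an existing stack.
# Assumes the stack was created from the template and needs CAPABILITY_IAM.
roll_fleet() {
  local stack="$1"
  local patch_date="${2:-$(date -u +%Y-%m-%d)}"   # assumed date format
  aws cloudformation update-stack \
    --stack-name "$stack" \
    --use-previous-template \
    --capabilities CAPABILITY_IAM \
    --parameters "ParameterKey=PatchRunDate,ParameterValue=${patch_date}"
}
```

Calling `roll_fleet my-stack` from a scheduled job would request an update with today's date; `my-stack` is a placeholder name.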
Least Privilege IAM
The IAM Roles and least privilege permissions are included so that it is clear what permissions are needed and so that instances do not have more permissions than needed to interact with their own ASG. Two possible methods for limiting the permissions are provided. Using the ASG name in the Resource specification of the IAM policy is the active method. Using a condition on a tag is provided as a tested, but commented out, alternative.
Maximizing ARN Flexibility for Template Reuse
The ASG ARN in the IAM policy with the SID “ASGSelfAccessPolicy” demonstrates maximizing the use of intrinsic AWS variables by using them for AWS Partition (use in Gov cloud or China without modification), AWS Account ID (use in any account) and AWS Region (use in any region without modification).
Works Without ASG
If the userdata code cannot retrieve its ASG tag, it assumes that it is not in an ASG and all lifecycle hook actions are skipped. This allows the solution to be used in non-ASG scenarios.
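The ASG check can be sketched roughly as follows. This is not the template's code: the function names `get_asg_name` and `in_asg` are hypothetical, and the caller is assumed to already know the instance ID and region (for example from instance metadata).

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the ASG name for an instance, or nothing when
# the tag is missing or cannot be read (not in an ASG, or no IAM permission).
get_asg_name() {
  local instance_id="$1" region="$2" name
  name=$(aws ec2 describe-tags \
    --region "$region" \
    --filters "Name=resource-id,Values=${instance_id}" \
              "Name=key,Values=aws:autoscaling:groupName" \
    --query 'Tags[0].Value' --output text 2>/dev/null) || name=""
  # With text output the CLI prints the literal "None" for an empty result.
  if [ "$name" = "None" ]; then name=""; fi
  printf '%s' "$name"
}

# Succeeds only when an ASG name was found.
in_asg() {
  [ -n "$(get_asg_name "$1" "$2")" ]
}
```

A script could then wrap every lifecycle hook call in `if in_asg "$id" "$region"; then ... fi` so that the same userdata runs unchanged on standalone instances.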
Periodic Patching for the Entire ASG
The CloudFormation template also works for forcing fleet-wide patching updates. If you update the CloudFormation stack with a new PatchRunDate, the entire fleet will be replaced. The date is purposely recorded as an environment variable within userdata so that the ASG UpdatePolicy knows it should replace all instances.
Monitoring and Metrics
Two monitoring and metrics values are recorded as metadata. You can control which log file they are added to (or mute the log file) by altering the function “logit”. Generally you want this to be a log file that is collected by your log aggregation service (sumologic, loggly, etc). If you already collect /var/log/cloud-init-output.log, you can mute the log file write to /var/log/messages.
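For readers who want a feel for what such a function might look like, here is a minimal sketch of a logit-style helper. It is an assumption about the shape of the function, not a copy of the template's version; the `LOGFILE` variable and the timestamp format are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch. LOGFILE defaults to /var/log/messages when unset;
# setting LOGFILE="" mutes the file write and leaves only stdout.
LOGFILE="${LOGFILE-/var/log/messages}"

logit() {
  local line
  line="$(date -u '+%Y-%m-%dT%H:%M:%SZ') USERDATA_SCRIPT: $*"
  echo "$line"                      # stdout is captured by cloud-init
  if [ -n "$LOGFILE" ]; then
    echo "$line" >> "$LOGFILE"
  fi
}
```

With this shape, `logit "LAST_CF_PATCH_RUN: $LAST_CF_PATCH_RUN"` writes one timestamped line to both destinations.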
LAST_CF_PATCH_RUN
The CloudFormation parameter PatchRunDate is:
- saved on the instance as the environment variable LAST_CF_PATCH_RUN in /etc/profile.d/lastpatchingdata.sh
- emitted to /var/log/messages as “LAST_CF_PATCH_RUN: ”
- added as a tag to both the ASG and all Ec2 instances
This date simply indicates the initial setup of the ASG or the last fleetwide forced patch. It also serves to purposely change something in userdata so that the entire fleet is forced to be replaced when you run an update and change this date.
ACTUAL_PATCH_DATE
The date as of spin-up is:
- saved on the instance as the environment variable ACTUAL_PATCH_DATE in /etc/profile.d/lastpatchingdata.sh
- emitted to /var/log/messages as “ACTUAL_PATCH_DATE: ”
Instances that spin up as a result of autoscaling will not have their patches limited to the date expressed in LAST_CF_PATCH_RUN, so ACTUAL_PATCH_DATE tracks the date they were actually patched.
Comparing these two dates can help you understand if you have developed a large variety of patching dates due to autoscaling and might want to roll the fleet to a standard date by updating the CloudFormation stack with a new PatchRunDate.
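The recording and comparison of the two dates could look something like the sketch below. The function names `record_patch_dates` and `patch_drift` are hypothetical, and the example dates only illustrate one possible format.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the file path matches the one named in this post.
PATCH_DATA_FILE="${PATCH_DATA_FILE:-/etc/profile.d/lastpatchingdata.sh}"

# Write both dates so that every login shell exports them.
record_patch_dates() {
  local last_cf_patch_run="$1" actual_patch_date="$2"
  {
    echo "export LAST_CF_PATCH_RUN=\"${last_cf_patch_run}\""
    echo "export ACTUAL_PATCH_DATE=\"${actual_patch_date}\""
  } > "$PATCH_DATA_FILE"
}

# Print "drifted" when this instance was patched on a different date than
# the last fleet-wide run, "aligned" otherwise.
patch_drift() {
  . "$PATCH_DATA_FILE"
  if [ "$ACTUAL_PATCH_DATE" != "$LAST_CF_PATCH_RUN" ]; then
    echo "drifted"
  else
    echo "aligned"
  fi
}
```

Running `patch_drift` across the fleet (for example through your log aggregation or a remote command tool) gives a quick count of instances that were patched after the last forced roll.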
- In the CloudFormation template: the ASG is created with a launching lifecycle hook automatically configured. It is important that the implementation define the lifecycle hook within the Autoscaling group definition (LifecycleHookSpecificationList) rather than as a separate resource, or else some instances can be missed while the hook is being set up.
- Since health checks for a given instance do not commence until the lifecycle hook is closed out, the hook provides the opportunity to interrupt the instance’s availability with a reboot.
- In the Userdata script: We attempt to retrieve the ASG name from the automatically applied tag ‘aws:autoscaling:groupName’. If we can’t find it, either we are not in an ASG or we do not have proper IAM instance role permissions to read our own tags.
- If we find the tag, we list any lifecycle hooks in play for our instance to ensure that they are properly configured and that we have permissions to see hooks as well.
- yum update -y is run to apply all updates.
- needs-restarting -r (from yum-utils) is run to see if a restart is needed by any of the patching that was done. If so, then:
  - ACTUAL_PATCH_DATE and LAST_CF_PATCH_RUN are emitted to logs and set in /etc/profile.d/lastpatchingdata.sh
  - A patchingrebootwasdone flag file is set.
  - The file /var/lib/cloud/instances/*/sem/config_scripts_user is removed so that userdata will process again on restart.
  - reboot is run.
  - sleep 30 is used to prevent further script processing while the reboot completes.
- Upon restart the flag file is used to skip patching.
- The lifecycle hook timeout is refreshed and CodeDeploy is installed, mainly to demonstrate how the rest of your automation stack would be processed.
- A CodeDeploy install is done to emulate your software stack install.
- The ASG hook is sent a Continue call.
- cfn-signal --success is called.
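The steps above can be sketched roughly as follows. This is a simplified illustration, not the template's userdata: the flag file path, the `SEM_DIR` variable and the function names are hypothetical, yum-utils is assumed to be installed, and the logging and CodeDeploy steps are left out.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the patch, reboot and hook release flow.
FLAG_FILE="${FLAG_FILE:-/var/tmp/patchingrebootwasdone}"   # assumed path
SEM_DIR="${SEM_DIR:-/var/lib/cloud/instances}"

patch_and_maybe_reboot() {
  if [ -f "$FLAG_FILE" ]; then
    echo "Flag file found: patching already done, skipping."
    return 0
  fi

  yum update -y

  # needs-restarting -r exits 0 when no reboot is needed, 1 when one is.
  if needs-restarting -r >/dev/null 2>&1; then
    echo "No reboot required."
    return 0
  fi

  touch "$FLAG_FILE"
  # Removing the cloud-init semaphore makes userdata run again after reboot.
  rm -f "$SEM_DIR"/*/sem/config_scripts_user
  reboot
  sleep 30   # stop the script from carrying on while the reboot proceeds
}

# Extend the hook timeout before a long install. Args: asg hook id region
refresh_hook_timeout() {
  aws autoscaling record-lifecycle-action-heartbeat \
    --auto-scaling-group-name "$1" --lifecycle-hook-name "$2" \
    --instance-id "$3" --region "$4"
}

# Tell the ASG the instance may go in service. Args: asg hook id region
release_hook() {
  aws autoscaling complete-lifecycle-action \
    --auto-scaling-group-name "$1" --lifecycle-hook-name "$2" \
    --instance-id "$3" --region "$4" \
    --lifecycle-action-result CONTINUE
}
```

On the first boot `patch_and_maybe_reboot` ends in a reboot if one is pending; on the second boot the flag file short-circuits it, and the script can move on to `refresh_hook_timeout`, the software install and finally `release_hook`.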
Kicking Off The Template
Use the AWS CloudFormation console to launch the template – to see how subsequent updates will work, pick 4 instances and set TroubleShootingMode to true.
Observing Lifecycle Hook in AWS Console
In the EC2 Console open the Autoscaling group, on the “Lifecycle Hook” tab observe the ‘instance-patching-reboot’ hook is configured.
Also, before the instances are in service you can see “Not yet in service” in the “Activity History” tab and “Pending:wait” in the “Lifecycle” column of the “Instances” tab for each instance. These will change to indicate the instances are in service as each instance completes setup procedures.
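The same states can be checked from the command line. The sketch below is an optional convenience, not part of the template; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print "instance-id <TAB> lifecycle-state" for each instance in the ASG.
# Args: asg-name region
asg_lifecycle_states() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" --region "$2" \
    --query 'AutoScalingGroups[0].Instances[].[InstanceId,LifecycleState]' \
    --output text
}

# Succeed only when at least one instance exists and all report InService.
all_in_service() {
  local states
  states="$(asg_lifecycle_states "$1" "$2")" || return 1
  [ -n "$states" ] || return 1
  printf '%s\n' "$states" |
    awk '$2 != "InService" { bad = 1 } END { exit bad + 0 }'
}
```

While the hook is open you would expect lines ending in `Pending:Wait`; once every instance has completed setup, `all_in_service` returns success.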
Observing On Instance Script Actions
All the actions of this template can be observed without logging into the instance by using the AWS console to view the system log for instances (Right Click Instance => Instance Settings => Get System Log) and scanning for the text “USERDATA_SCRIPT:”
The first message will contain “Processing userdata script on instance:”. All the messages include timestamps so that you can observe things like how long a reboot took and the fact that, if you don’t sleep the script, it keeps processing for a while after the reboot command.
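If you prefer the command line to the console, a small wrapper around the AWS CLI can pull the same lines. The function name `userdata_log_lines` is hypothetical; note that the system log can lag behind the instance by a few minutes.

```shell
#!/usr/bin/env bash
# Print only the userdata script's own lines from the instance system log.
# Args: instance-id region
userdata_log_lines() {
  aws ec2 get-console-output \
    --instance-id "$1" --region "$2" \
    --query Output --output text |
    grep 'USERDATA_SCRIPT:'
}
```

For example, `userdata_log_lines i-0123456789abcdef0 us-east-1` (a placeholder instance ID) lists the timestamped messages in order, which makes the reboot gap easy to spot.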
Observing Logs on The Instance
However, if you need or want to log on to the instance for examination or troubleshooting, set the parameter TroubleShootingMode to ’true’. This enables SSM IAM permissions and installs the SSM agent on the instances so that you can log on through AWS Session Manager. The log lines that you see in the system log in the AWS console will also be in the log at: /var/log/cloud-init-output.log
Observing Pseudo Web App
If you set SetupPseudoWebApp to true, the following is done: 1) A port 80 ingress is added to the default VPC security group, 2) Apache is installed, 3) an Apache home page is created which publishes the patching and ASG details of the ASG that the instance is in. In order to see this data from a public front end you must also deploy an ELB. You can find a premade ELB here:
CloudFormationRebootRequiredPatchinginASG.yaml
Create Now in CloudFormation Console