Following up on the Tar-Pit of CSPM, I feel the need to offer something more constructive for cloud security practitioners to do. Cloud Security Posture Management is “here’s a spreadsheet of issues, go fix them”. There are other ways, but none of them is a panacea.
(Adding a link to Part III Ghost of CloudSec Yet to Come covering IAC Scanning, User Awareness, and a roadmap on building a CloudSec Program)
The Pit-falls of Prevention
Let’s look at the pitfalls of prevention. Ignorant executives engaging in inter-departmental posturing might think they can just say, “Deny the creation of anything that’s not encrypted” or “there will be no public buckets”, but these are foolish proclamations and are more security theater than risk-based thinking.
Risk-based thinking must balance security and operational risk while remembering that security practitioners are risk advisors to the business, not tin-pot dictators.
In the purest sense, prevention of cloud misconfigurations can come from two places: 1) users don’t make the API call in the first place, or 2) API calls that create misconfigurations are denied. I’m going to address the first one in the next blog post and focus this part on denying the user’s ability to create a misconfiguration.
In AWS, which is where most of us deploy, there are pretty much two ways to deny the creation of a misconfiguration: SCPs and not granting the user the permission in the first place. Both are built on AWS IAM’s constructs of Actions, Resources, Effects, and Conditions. With AWS IAM, it is reasonably straightforward to deny the ability of a principal to perform an action. What is very difficult is to deny a principal the ability to create a security misconfiguration.
Let’s use Public AMIs and snapshots as examples. Unless you’re shipping software via an EC2 Image, there is no good reason to have an AMI in your environment that any AWS customer can launch. Nor should you ever offer random public access to a server’s hard drive, which is what a public Snapshot is. These are what AWS would call a Security Invariant - something where there should be no exception to the rule.
The AWS IAM Action to make a snapshot public is ec2:ModifySnapshotAttribute, while for an AMI, the Action is ec2:ModifyImageAttribute. However, these actions are the same actions needed to share snapshots across accounts for backup & recovery purposes. Sharing AMIs across accounts is how the concept of a Golden AMI works. Both of these are legitimate business purposes.
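To make the overlap concrete, here is a minimal sketch showing that a legitimate cross-account share and a public share are the same API call and differ only in their parameters. The parameter names follow boto3's `modify_snapshot_attribute` keyword arguments; the helper `classify_snapshot_share` and the account and snapshot IDs are hypothetical.

```python
def classify_snapshot_share(params: dict) -> str:
    """Classify a ModifySnapshotAttribute request.

    `params` is assumed to use the boto3 keyword names for
    ec2.modify_snapshot_attribute. Returns 'public', 'cross-account',
    or 'other'.
    """
    # Permissions can be passed in the CreateVolumePermission structure...
    added = params.get("CreateVolumePermission", {}).get("Add", [])
    groups = [p["Group"] for p in added if "Group" in p]
    users = [p["UserId"] for p in added if "UserId" in p]

    # ...or via the flat GroupNames / UserIds lists with OperationType="add".
    if params.get("OperationType") == "add":
        groups += params.get("GroupNames", [])
        users += params.get("UserIds", [])

    if "all" in groups:
        return "public"
    if users:
        return "cross-account"
    return "other"


# Same action, same resource type; only the target of the share differs.
backup_share = {
    "SnapshotId": "snap-0123456789abcdef0",
    "Attribute": "createVolumePermission",
    "OperationType": "add",
    "UserIds": ["111122223333"],
}
public_share = {
    "SnapshotId": "snap-0123456789abcdef0",
    "Attribute": "createVolumePermission",
    "OperationType": "add",
    "GroupNames": ["all"],
}

print(classify_snapshot_share(backup_share))
print(classify_snapshot_share(public_share))
```

An IAM policy that only matches on the Action cannot tell these two requests apart, which is why the available Condition keys matter so much.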
It’s easy to deny the action but harder to deny the security misconfiguration. I picked the AMI and Snapshot examples because these are very similar operations, using similar AWS constructs (AMIs are built on snapshots). However, the IAM Condition Keys available for these actions are different. The ec2:ModifySnapshotAttribute action has a condition key called “ec2:Add/group” while ec2:ModifyImageAttribute does not. I can write an IAM Policy statement to Deny ec2:ModifySnapshotAttribute where the Condition key “ec2:Add/group” equals the group name “all”.
{
    "Sid": "PreventPublicSnapshot",
    "Effect": "Deny",
    "Action": ["ec2:ModifySnapshotAttribute"],
    "Resource": ["*"],
    "Condition": {
        "StringEquals": {
            "ec2:Add/group": ["all"]
        }
    }
}
That same “ec2:Add/group” key is not available on ec2:ModifyImageAttribute. As a result, I can craft an SCP for the Security Invariant “no public hard drives” but not one for “no public boot images”.
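The statement above is only a fragment; an SCP needs the usual policy envelope around it. As a rough sketch, the following wraps the statement in a complete policy document. The helper name `build_scp` is hypothetical, and the output is the JSON text you would hand to AWS Organizations as the policy content.

```python
import json

# The Deny statement from the text, unchanged.
PREVENT_PUBLIC_SNAPSHOT = {
    "Sid": "PreventPublicSnapshot",
    "Effect": "Deny",
    "Action": ["ec2:ModifySnapshotAttribute"],
    "Resource": ["*"],
    "Condition": {"StringEquals": {"ec2:Add/group": ["all"]}},
}


def build_scp(statements: list) -> str:
    """Wrap one or more statements in a standard policy document."""
    document = {"Version": "2012-10-17", "Statement": list(statements)}
    return json.dumps(document, indent=2)


scp_document = build_scp([PREVENT_PUBLIC_SNAPSHOT])
print(scp_document)
```

Because the Condition only matches requests that add the group “all”, cross-account sharing to specific account IDs is untouched by this Deny.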
[ As an aside, I did testing with the condition ec2:Public. That only applies as an attribute of the AMI. Denying ModifyImageAttribute where ec2:Public is true allows me to make the AMI public, then denies me the ability to remove the public access. Not helpful at all. ]
So SCPs can be useful but in a limited fashion. You need to have actual security invariants. And AWS has to have the condition keys available to distinguish a legitimate action from a security misconfiguration.
The allure of auto-remediation
The other way to prevent security issues in an environment is to fix them as soon as they happen. This is what tools like Cloud Custodian do. They detect that the user did X, check whether X created a misconfiguration, and, if it did, fix it.
The two critical aspects of auto-remediation of security misconfigurations are 1) speed in detection and remediation and 2) making sure there is a feedback loop to the carbon-based lifeform responsible for the misconfiguration.
Speed is critical to reduce both the security & operational risk. The longer a misconfiguration exists, the more time a threat actor has to find and exploit it. The longer a misconfiguration exists, the greater the chance that functionality comes to depend on it. After all, most users don’t consciously say, “I’m going to misconfigure this resource so a hacker can use it”. They’re trying to get something to work, and the misconfiguration might be the difference between operational success and failure.
The feedback loop is just as critical. If a security automation is going to “fix” something a user intentionally or unintentionally did, that user needs to know it. Otherwise, if the misconfiguration is required for expected functionality, the user will keep making the misconfiguration and a ghost-in-the-cloud will keep fixing it.
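One way to close that loop is to build the notification from the same CloudTrail record that triggered the remediation, since it already names the principal. This is a minimal sketch: the field names (`userIdentity.arn`, `eventName`, `awsRegion`, `eventTime`) are standard CloudTrail record fields, while the function `build_feedback_message`, the message wording, and the sample record are assumptions for illustration. How the message is delivered (email, chat, ticket) is left out.

```python
def build_feedback_message(record: dict, finding: str, action_taken: str) -> str:
    """Turn a CloudTrail record into a human-readable notice for its author."""
    identity = record.get("userIdentity", {})
    who = identity.get("arn", "unknown principal")
    return (
        f"{who} called {record.get('eventName', 'an unknown API')} "
        f"in {record.get('awsRegion', 'an unknown region')} "
        f"at {record.get('eventTime', 'an unknown time')}. "
        f"Finding: {finding}. Automated action: {action_taken}. "
        "If you need this configuration, talk to the security team "
        "instead of re-applying it."
    )


# Hypothetical, trimmed-down CloudTrail record.
sample_record = {
    "eventName": "AuthorizeSecurityGroupIngress",
    "awsRegion": "us-east-1",
    "eventTime": "2023-01-01T12:00:00Z",
    "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"},
}

message = build_feedback_message(
    sample_record,
    finding="RDP open to 0.0.0.0/0",
    action_taken="ingress rule narrowed",
)
print(message)
```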
Most importantly, deciding what issues can be auto-remediated is essential. If the bot cannot do the remediation with a simple algorithm, it’s not a good candidate for auto-remediation. Candidates for auto-remediation fall into two camps: high security-risk or low operational-risk.
High Security-Risk remediations are things like exposing RDP to the world. A misconfiguration like this can expose an enterprise to an existential ransomware attack. Replacing the CIDR of 0.0.0.0/0 with 0.0.0.0/32 is a straightforward change and, when done instantly after the misconfiguration is introduced, minimizes the risk to production.
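The CIDR swap can be sketched as a pure planning step that decides what to revoke and what to authorize. The input is assumed to be the `IpPermissions` list of a security group as returned by boto3's `describe_security_groups`; the function name `plan_rdp_fix` is hypothetical, and the sketch only looks at IPv4 ranges.

```python
OPEN_CIDR = "0.0.0.0/0"
CLOSED_CIDR = "0.0.0.0/32"
RDP_PORT = 3389


def plan_rdp_fix(ip_permissions: list) -> tuple:
    """Return (revoke, authorize) IpPermissions lists for world-open RDP rules.

    Only the 0.0.0.0/0 range of a matching rule is touched; other ranges
    on the same rule are left alone. IPv6 (::/0) is not handled here.
    """
    revoke, authorize = [], []
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        # "-1" means all protocols and ports, which includes RDP.
        covers_rdp = proto == "-1" or (
            proto == "tcp"
            and perm.get("FromPort", 0) <= RDP_PORT <= perm.get("ToPort", 65535)
        )
        is_open = any(r.get("CidrIp") == OPEN_CIDR for r in perm.get("IpRanges", []))
        if not (covers_rdp and is_open):
            continue
        base = {k: perm[k] for k in ("IpProtocol", "FromPort", "ToPort") if k in perm}
        revoke.append({**base, "IpRanges": [{"CidrIp": OPEN_CIDR}]})
        authorize.append({**base, "IpRanges": [{"CidrIp": CLOSED_CIDR}]})
    return revoke, authorize


example = [
    {"IpProtocol": "tcp", "FromPort": 3389, "ToPort": 3389,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}, {"CidrIp": "10.0.0.0/8"}]},
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]
to_revoke, to_authorize = plan_rdp_fix(example)
print(to_revoke)
print(to_authorize)
```

The two lists are shaped to be passed as `IpPermissions` to `revoke_security_group_ingress` and `authorize_security_group_ingress` respectively. Keeping the planning separate from the API calls makes the logic easy to test before a bot is allowed anywhere near production.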
Low Operational-Risk remediations are actions like auto-enabling default AES256 encryption on S3 to meet a compliance objective. Again, this is a straightforward change, and default AES256 encryption introduces minimal operational risk.
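As a sketch of how small this remediation is, the following builds the default-encryption configuration and applies it through a client object. The call and parameter shape follow boto3's S3 `put_bucket_encryption`; the function names and bucket name are hypothetical, and the client is passed in so the logic can be exercised without AWS credentials.

```python
def default_encryption_config() -> dict:
    """SSE-S3 (AES256) default encryption, in the shape boto3 expects."""
    return {
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    }


def enable_default_encryption(s3_client, bucket: str) -> None:
    """Apply AES256 default encryption to one bucket.

    Assumption: the caller has already confirmed the bucket has no
    existing encryption configuration (for example a KMS key) that
    this call would overwrite.
    """
    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=default_encryption_config(),
    )
```

With real credentials this would be invoked as `enable_default_encryption(boto3.client("s3"), "example-bucket")`.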
Poor choices for auto-remediation are high operational-risk, low security-risk misconfigurations. Example: terminating a Redshift database because it wasn’t encrypted. Cloud Encryption provides minimal security benefits, and the operational risk of allowing your bot to delete a data warehouse is massive.
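The triage rule from the last few paragraphs can be written down as a tiny decision function. This is only a sketch of the reasoning in this post: the `Risk` enum and the function name are hypothetical, and real risk ratings are rarely this binary.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


def is_auto_remediation_candidate(
    security_risk: Risk, operational_risk: Risk, simple_fix: bool
) -> bool:
    """Apply the triage rule: simple fix AND (high security risk OR low operational risk)."""
    if not simple_fix:
        # If the bot cannot do it with a simple algorithm, a human should.
        return False
    return security_risk is Risk.HIGH or operational_risk is Risk.LOW


# The three examples from the text.
print(is_auto_remediation_candidate(Risk.HIGH, Risk.LOW, True))   # RDP open to the world
print(is_auto_remediation_candidate(Risk.LOW, Risk.LOW, True))    # S3 default encryption
print(is_auto_remediation_candidate(Risk.LOW, Risk.HIGH, True))   # terminate unencrypted Redshift
```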
If a bot is going to fix a problem, it needs to follow a modified version of Asimov’s three laws:
- A bot should not harm production, or at least, it must minimize the risk of harm to production.
- A bot must execute its orders (to secure the environment) except where such orders would conflict with the First Law.
- A bot must announce its own existence and actions whenever it acts on the first or second law.
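The three laws above can be sketched as the control flow of a remediation runner. Everything here is hypothetical scaffolding: the safety check, the fix, and the announcement channel are passed in as callables, and the message wording is an assumption.

```python
from typing import Callable


def run_remediation(
    finding: str,
    is_safe_for_production: Callable[[], bool],
    fix: Callable[[], None],
    announce: Callable[[str], None],
) -> bool:
    """Run one remediation under the three laws. Returns True if the fix ran."""
    # First law: do not harm production. If the safety check fails, stand down.
    if not is_safe_for_production():
        # Third law: announce even when the first law stopped the bot.
        announce(f"Did not auto-remediate '{finding}': possible production impact, needs a human.")
        return False
    # Second law: carry out the order to secure the environment.
    fix()
    # Third law: announce the action that was taken.
    announce(f"Auto-remediated '{finding}'.")
    return True


log = []
ran = run_remediation(
    "RDP open to 0.0.0.0/0",
    is_safe_for_production=lambda: True,
    fix=lambda: log.append("fixed"),
    announce=log.append,
)
print(ran, log)
```

Note that the bot announces itself on both paths, so a skipped fix still reaches a human instead of disappearing silently.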
Summary
Prevention via Service Control Policy must be limited to security invariants, but even then, prevention is limited by the granularity of IAM Actions and the availability of the necessary Condition keys.
Auto-remediation only works for high security-risk or low operational-risk misconfigurations, and only where a simple and operationally safe action can be taken.
All other misconfigurations need to be processed by a human. Better still, the human never creates them, or asks for them to be created on their behalf, in the first place. That will be the next blog post, on the ideal of IaC scanning and the ubiquity of user education.