This post is the final part of my Cloud Security Carol. Part one, my ghost of cloud security past, was about the tar-pit of CSPM and how it was impossible to evolve the program out of just chasing greener spreadsheets.
Part two, my ghosts of cloud security present, covered the pitfalls of prevention and the allure of auto-remediation, outlining the limitations of those tools and defining the Three Laws of a Security Remediation Robot.
Unlike in Dickens, this post is a cheerful ghost of cloud security yet to come. I’ll talk about where CloudSec really needs to focus: on the pipeline and, ultimately, on the cloud developer or engineer. Finally, I’ll close out with a one-year roadmap for how I’d build a third (fourth) program if I’m crazy enough to do this again at my next job.
The Ideal of IaC Scanning
IaC Scanning is a reasonably new addition to the Cloud Security space. The fundamental goal is to “shift left”: have the deployment pipeline flag security issues to the developers before they are deployed, rather than after a CSPM tool finds them in production. Tools like Checkov, KICS, and tflint have been doing this for a while.
I call this the “ideal” of IaC scanning because catching these issues before they go to production is the ideal state. The issues are addressed immediately, not a week after they appear on a spreadsheet. The operational risk that an SCP will deny an action mid-deploy and leave things in a broken state is reduced. You won’t find yourself on an episode of Cloud Battlebots where Terraform keeps re-creating a misconfiguration and Cloud Custodian keeps fixing it.
For IaC Scanning to work, it needs to be a superset of the security invariants that are prevented by SCPs and immediately fixed by auto-remediation. It should also cover as many CSPM findings as possible, so the engineers aren’t surprised by things on their reports.
IaC scanning is the best place to flag low-risk, high-effort-to-fix issues like RDS and EBS encryption at rest. No matter how much I dismiss cloud encryption, if your threat model includes auditors, you need to get that stuff fixed.
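To make this concrete, here’s a minimal pipeline-gate sketch, assuming Checkov is installed in the CI image; the directory path and failure message are illustrative, not a prescription for your pipeline.

```python
# A pipeline-gate sketch: run Checkov against a Terraform directory and
# fail the build if it reports findings. Assumes Checkov is installed
# (pip install checkov); the directory path is illustrative.
import subprocess
import sys

def iac_scan_gate(tf_dir: str = "terraform/") -> None:
    # Checkov exits non-zero when any check fails, so CI can gate on it.
    result = subprocess.run(["checkov", "-d", tf_dir, "--quiet"])
    if result.returncode != 0:
        sys.exit("IaC scan failed: fix the flagged misconfigurations before deploying.")

if __name__ == "__main__":
    iac_scan_gate()
```

The same pattern works with KICS or tflint; the point is that the gate lives in the pipeline, not in a spreadsheet.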
The hardest part of IaC scanning is embedding it in all the pipelines in your organization. I’ve joked that if there are 15 ways to IaC and CI/CD, my company does 22 of them. Unless you’re in a very small organization, the security team doesn’t have the scaling power to make IaC Scanning work in every pipeline. You need high-level engineering management support to make that a requirement in every pipeline.
(about the only pipeline we don’t use is the Keystone XL)
The Ubiquity of User Education
IaC scanning is nothing more than in-the-moment user education. It’s an important way to keep misconfigurations out of the environment, but it’s not enough. Developers, engineers, cloud operators, and their management need a reasonably detailed understanding of cloud security risks. Cloud security practitioners know that The Cloud is Dark and Full of Terrors, but we need to spend more effort demonstrating how cloud misconfigurations can lead to actual business damage.
Outside of a handful of basic examples, one misconfiguration won’t lead to a breach. A substantial incident usually takes a combination of initial credential access, overly permissive IAM Roles, and enumeration of the environment for further misconfigurations. It’s this chain of events that we in security understand, and the “failure of imagination” around it that we need to help our user communities overcome.
It’s also important to articulate that risk doesn’t just come from “hackers”, nation-states, or advanced threat actors. Depending on your organization, more organizational risk can come from auditors than from the Kremlin. Insider threats are real, and they are the hardest to contemplate because we naturally trust our co-workers.
One thing I recommend is making sure you have a documented baseline for your cloud security. CSPM, IaC scanning, and auto-remediation tools ship with their own opinions of what should be fixed or prevented, which may not align with your organization’s risk tolerance.
I used to believe that these baselines were a user-education tool. I’ve come to realize that’s a trap. If you add “user educational” items to a baseline, they become targets for auditors. You’ll fall into the compliance trap and avoid expanding the guidance because it would mess up the metrics.
Where to go from here
If I were to build a cloud security program from scratch a third time, I’d want to focus on the user side. That said, the initial formula would still probably be the same.
These 30-, 60-, 90-, and 180-day plans are rough guides. Depending on the organization, just getting an audit-role capability into all accounts can take five months. These time frames will vary based on your company’s size and general governance practices.
30-day plan
The 30-day plan is about ensuring the basics are in place.
- Is CloudTrail enabled and feeding the SIEM? (A quick spot-check sketch for this and GuardDuty follows this list.)
- Are there dedicated security & logging accounts?
- Does the security account have audit access into the rest of the environment?
- Do you have GuardDuty enabled?
- Do you know all of the accounts you have and who is accountable for them? What is the multi-account strategy?
- Are cloud users leveraging federated identities or IAM Users?
- Is there a security baseline for cloud usage? What security policies, standards, and baselines already exist?
- Who are the core cloud constituencies? How do they see the current security team? Who will be allies, and who do you need to influence?
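For the CloudTrail and GuardDuty items, a minimal spot-check sketch using boto3 in a single account might look like this; it’s a starting point, not a complete audit.

```python
# A spot-check sketch for two of the basics, using boto3 in a single
# account: is CloudTrail actually logging, and is GuardDuty enabled?
import boto3

def cloudtrail_is_logging() -> bool:
    ct = boto3.client("cloudtrail")
    trails = ct.describe_trails()["trailList"]
    # A trail only counts if it is actively logging.
    return any(
        ct.get_trail_status(Name=trail["TrailARN"])["IsLogging"]
        for trail in trails
    )

def guardduty_is_enabled() -> bool:
    gd = boto3.client("guardduty")
    return bool(gd.list_detectors()["DetectorIds"])

if __name__ == "__main__":
    print(f"CloudTrail logging: {cloudtrail_is_logging()}")
    print(f"GuardDuty enabled:  {guardduty_is_enabled()}")
```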
60-day plan
The 60-day plan would focus on situational awareness and introducing the cloud security program to the company’s broader cloud user community.
- Deploy a CSPM solution. Probably something free.
- Deploy IAM Access Analyzer (a minimal enablement sketch follows this list).
- Review the CSPM and Access Analyzer results for the most egregious issues and target them as part of the initial outreach. By egregious, I mean RDP (port 3389) open to the world on a Windows box or an unauthenticated Elasticsearch cluster. Encryption, public buckets, etc., are not on this list.
- Deploy an auto-remediation tool like Cloud Custodian, but do it in notify-only mode. Hold the findings internally for a risk-based targeted outreach.
- If you’re ambitious, price out and consider using Macie to scan the public buckets for PII.
- Start creating your KRIs (key risk indicators). Don’t share them with anyone just yet. Understand how your KRI automation reflects the risk to the business and adjust accordingly.
- Leverage the findings from the CSPM and auto-remediation tools to decide how to proceed.
For the last point: which teams or accounts have the most findings? Is there a pattern of opening security groups because there is no architectural pattern for secure on-prem or on-network connectivity? If so, fix that.
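For the Access Analyzer item above, a minimal enablement sketch with boto3 could look like this; the analyzer name is a made-up placeholder, and organization-wide analyzers need delegated-admin setup first.

```python
# An enablement sketch for IAM Access Analyzer via boto3, scoped to a
# single account. The analyzer name is illustrative; organization-wide
# analyzers (type="ORGANIZATION") need delegated-admin setup first.
import boto3

def enable_access_analyzer(name: str = "baseline-analyzer") -> str:
    client = boto3.client("accessanalyzer")
    response = client.create_analyzer(analyzerName=name, type="ACCOUNT")
    return response["arn"]

if __name__ == "__main__":
    print(f"Analyzer created: {enable_access_analyzer()}")
```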
90-day plan
Begin the user-education phase.
- Write a security baseline or security best practices document for the development community. Be careful here. If this document becomes part of the formal GRC/Policy framework, you’ve fallen into the Tar-Pit of CSPM, and you’ve become part of the compliance police. The goal is a secure environment, so focus on developer education.
- Introduce a core set of SCPs to cover the security invariants that the security and cloud user communities can agree on (a minimal deployment sketch follows this list). Develop a mutual understanding of how exceptions will be handled.
- Work with teams to get the Cloud Custodian notifications delivered to the teams themselves, rather than just the security team. Still in notify-only mode, the teams now see new issues in real time as they are created. This avoids swamping them with eight years of unencrypted snapshots and databases and demoralizing them with thousands of findings.
- Start building out the IaC scanning according to your company’s risk profile.
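As one illustration of the SCP item above, here’s a sketch, using boto3, of deploying a single security-invariant policy: nobody in the targeted OU gets to turn off CloudTrail. The policy content, names, and OU id are examples, not recommendations for your org.

```python
# A sketch of deploying one security-invariant SCP with boto3: deny
# everyone in the targeted OU the ability to turn off CloudTrail. The
# policy content, names, and OU id are illustrative.
import json
import boto3

DENY_CLOUDTRAIL_TAMPERING = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyCloudTrailTampering",
        "Effect": "Deny",
        "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
        "Resource": "*",
    }],
}

def deploy_scp(target_ou_id: str) -> None:
    org = boto3.client("organizations")
    policy = org.create_policy(
        Name="deny-cloudtrail-tampering",
        Description="Security invariant: CloudTrail stays on",
        Type="SERVICE_CONTROL_POLICY",
        Content=json.dumps(DENY_CLOUDTRAIL_TAMPERING),
    )
    org.attach_policy(
        PolicyId=policy["Policy"]["PolicySummary"]["Id"],
        TargetId=target_ou_id,  # e.g. an OU id; hypothetical placeholder
    )
```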
180-day plan
Now you’re ready to enable the full suite of preventative controls.
- Ensure the developers are using IaC scanning. By now they know what is coming via auto-remediation and Service Control Policies, and they know their deployments will fail if certain issues are not resolved.
- Enable the next level of SCPs to prevent things at the IAM phase. This will cause pipelines to break and throw weird errors at users doing ClickOps.
- Enable the auto-remediation capability in Cloud Custodian (first in dev accounts, then in prod). IaC scanning should prevent these policies from firing in automated accounts, so this is a way to catch errors via ClickOps or pipeline operators who ignore the warnings. Remember: actual remediation must be against high security-risk misconfigurations, where the remediation is low operational risk.
- Start monitoring your KRIs (a minimal trending sketch follows this list). Do you see a drop-off in the older CSPM findings? If so, great: keep things moving along.
- If your KRIs indicate a certain class of risk is not going down, consider a targeted effort to resolve those.
- If you deploy a CSPM vulnerability management program with SLAs, ensure that high-effort-to-remediate findings aren’t just added to a spreadsheet. They must be part of a risk-informed architectural review.
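For the KRI monitoring item, a minimal trending sketch might look like the following, assuming Security Hub aggregates your CSPM findings; the filters shown count only open, unworked findings.

```python
# A KRI-trending sketch, assuming Security Hub aggregates your CSPM
# findings: count open, unworked findings by severity so the numbers
# can be charted week over week.
from collections import Counter
import boto3

def open_findings_by_severity() -> Counter:
    hub = boto3.client("securityhub")
    counts: Counter = Counter()
    filters = {
        "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
        "WorkflowStatus": [{"Value": "NEW", "Comparison": "EQUALS"}],
    }
    for page in hub.get_paginator("get_findings").paginate(Filters=filters):
        for finding in page["Findings"]:
            counts[finding["Severity"]["Label"]] += 1
    return counts

if __name__ == "__main__":
    for severity, count in open_findings_by_severity().items():
        print(f"{severity}: {count}")
```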
One year and beyond
There are several evolutionary improvements to be made beyond the above plan to combat cloud security misconfigurations. One to consider is the quest for least privilege. Can you feed a tool like Cloudsplaining into your KRIs and user workflows? You’ll get a heck of a lot of security improvement out of finding all the S3FullAccess policies that have been attached to application roles.
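A minimal sketch of that hunt, assuming the policy in question is the AWS-managed AmazonS3FullAccess, could look like this:

```python
# A least-privilege hunting sketch with boto3, assuming the policy in
# question is the AWS-managed AmazonS3FullAccess: list every role it is
# attached to.
import boto3

S3_FULL_ACCESS_ARN = "arn:aws:iam::aws:policy/AmazonS3FullAccess"

def roles_with_s3_full_access() -> list[str]:
    iam = boto3.client("iam")
    roles: list[str] = []
    paginator = iam.get_paginator("list_entities_for_policy")
    for page in paginator.paginate(
        PolicyArn=S3_FULL_ACCESS_ARN, EntityFilter="Role"
    ):
        roles.extend(role["RoleName"] for role in page["PolicyRoles"])
    return roles

if __name__ == "__main__":
    print("\n".join(roles_with_s3_full_access()))
```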
Because the attack surface of an EC2 instance is so large, consider the older instances in the environment. Do they have IMDSv2 enforced? Are they being scanned by the traditional vulnerability management tooling? Are they orphaned and forgotten, their only function to help Jeff Bezos get a bigger boat?
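Here’s a quick audit sketch for the IMDSv2 question: flag instances whose metadata options don’t require session tokens, i.e., that still allow IMDSv1.

```python
# An audit sketch: flag EC2 instances that still allow IMDSv1,
# i.e. whose metadata options do not require session tokens.
import boto3

def instances_allowing_imdsv1() -> list[str]:
    ec2 = boto3.client("ec2")
    offenders: list[str] = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                options = instance.get("MetadataOptions", {})
                if options.get("HttpTokens") != "required":
                    offenders.append(instance["InstanceId"])
    return offenders

if __name__ == "__main__":
    print("\n".join(instances_allowing_imdsv1()))
```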