The attack vectors of major breaches range from misconfigured systems and improper controls on third-party systems to unencrypted data files. For 2019, Norton wrote about 9 data breaches that resulted in 4 billion records breached, while Risk Based Security calculates some 5,183 breaches from everyone else. Ransomware plagued 2019, with at least 966 reported organizations falling victim to a ransomware incident (not the worst year for ransomware, btw). With these kinds of numbers, you would expect that a Small/Medium Business (SMB) and its small IT shop do not stand a chance against all the threat actors out there, especially remembering that security is not a winnable game, but an eternal posture. But alas, there is hope; I argue that most of the woes are a failure in the fundamentals. Whether you are building a new team or inheriting an existing one, going through and ensuring the fundamentals are solid should be the first step for any IT department. I outline, in order, my considerations for building an environment that will be easier to manage and secure just because of the fundamentals.
This is the first and most important step. Backups have been around as long as computers have been a thing, and have been solved in many different ways, with many different technologies. We are no longer in the old days of tape rotations and physical transfer of backup media to offsite locations. This is a new century where even Iron Mountain has bought into cloud backups. Ransomware is a far simpler threat if you have proper backups, and backups are the reference point you need after a catastrophic failure. As a first step, just ensure your backups are automatically scheduled and completing without errors. Then, just as importantly, validate your backups regularly.
Of course backups become a lot easier as the team becomes more mature and automation allows you to automatically validate your backups, but more on that later.
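As a minimal sketch of that first step, assuming your backups land as files on disk: the check below flags a backup that is missing, empty, older than expected, or that does not match a checksum recorded at backup time. The names (`check_backup`, `max_age_hours`, `expected_sha256`) are mine and purely illustrative. This only tells you the job ran and the file is intact; it is not a substitute for actually restoring the backup.

```python
import hashlib
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class BackupCheck:
    path: Path
    ok: bool
    reason: str


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(
    path: Path,
    max_age_hours: float,
    expected_sha256: Optional[str] = None,
    now: Optional[float] = None,
) -> BackupCheck:
    """Report whether a backup file exists, is non-empty, fresh, and intact.

    Assumption: the file's modification time reflects when the backup finished.
    """
    now = time.time() if now is None else now
    if not path.is_file():
        return BackupCheck(path, False, "missing")
    stat = path.stat()
    if stat.st_size == 0:
        return BackupCheck(path, False, "empty")
    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        return BackupCheck(path, False, f"stale ({age_hours:.1f}h old)")
    if expected_sha256 is not None and sha256_of(path) != expected_sha256:
        return BackupCheck(path, False, "checksum mismatch")
    return BackupCheck(path, True, "ok")
```

For a nightly job, something like `check_backup(Path("/backups/db.dump"), max_age_hours=26)` (a hypothetical path) run from cron and wired to your alerting would cover "scheduled and completing without errors".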
Computers don't do anything they are not told to. There is always some input, some file, some command that made that computer do something, and multiple solutions have been built to give you insight into every command that is run on your system. The goal with your monitoring solution is to ensure that the data you need to solve problems is available to you. You cannot save everything without sacrificing storage space or bandwidth (both network and CPU). As any kind of operations person (DevOps, SRE, IT Ops, etc.), monitoring is where you will be spending the majority of your day checking, tweaking, and referencing for incident responses. The sheer lack of focus monitoring gets is astonishing to me.
What gets measured, gets managed
~Not Peter Drucker
All malware has indicators, and the only surefire way to catch them is through monitoring/logging. This does not mean you are going to catch all malware, but if it runs on the system, then you could be logging something (e.g., processes running, install procedure, process startup in system logs). Ransomware doesn't just randomly happen; it has to spend time encrypting its target. That means processes are running and CPUs are running hotter than normal to encrypt everything as fast as possible, usually overnight. If you experience a sudden spike in user traffic, you have network logs, system load metrics, and more to alert you. If common malware is running on your machine, there are logs that show what was installed and when, when it started running, and how many resources it is taking up. If a user is committing industrial espionage, there are logs you can turn on to catch that as well.
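To make the "CPUs running hotter than normal" idea concrete, here is a rough sketch that compares recent CPU samples against a known-good baseline and flags a sustained run of outliers. The function names and the three-sigma threshold are my assumptions for illustration, not a recommendation from any particular monitoring product, and real metrics would come from whatever collector you already run.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_spikes(
    baseline: Sequence[float],
    recent: Sequence[float],
    threshold_sigmas: float = 3.0,
) -> List[int]:
    """Return indexes of recent samples that sit far above the baseline.

    A sample counts as a spike when it exceeds the baseline mean by more
    than threshold_sigmas sample standard deviations.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    limit = mean(baseline) + threshold_sigmas * stdev(baseline)
    return [i for i, value in enumerate(recent) if value > limit]


def sustained(spike_indexes: Sequence[int], min_run: int = 3) -> bool:
    """True if at least min_run consecutive samples were flagged as spikes."""
    run = 0
    previous = None
    for index in spike_indexes:
        run = run + 1 if previous is not None and index == previous + 1 else 1
        if run >= min_run:
            return True
        previous = index
    return False
```

Requiring a sustained run rather than a single hot sample is a deliberate choice here: one busy minute is noise, while an hour of pegged CPUs at 3 a.m. is worth waking someone up for.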
Now there are thousands of monitoring solutions out there, and all of their marketing will claim to be your silver bullet (spoilers: none of them are), and a good majority of them only exist because vendors craft the story that our survival relies on their special sauce of tools and a small battalion of analysts. But all those vendors rely on the same thing: monitoring. You can have all kinds of debates over antivirus vs. endpoint protection, sandboxes, etc., but they all require the ability to monitor system performance and process log data. So it behooves you to understand what your risks are and define your requirements before listening to marketing presentations, and only then build or buy a solution that fits those requirements.
All this monitoring now becomes a new data source for your automation as well. This data source becomes your trigger to kick off automation like auto-scaling or incident response. At this point you should be putting alarms in place to notify whoever is on-call. If there is one thing in the past 10 years that has advanced IT, it is the improvement in monitoring solutions and their ability to send triggers to other systems, which gives IT Operations better visibility and faster response times. DevOps does not exist without better monitoring.
Now that you are monitoring, you can start developing a baseline. Without a baseline you do not know what is abnormal. Each system, installed clean, should look very, very similar to other nodes of the same system. But this presupposes that you can build the system the same way each time. This leads us directly into...
Infrastructure as Code
It is 2020 now; this is the future. Even as an SMB IT shop, if you are not building your solutions using Ansible, Chef, Salt, Packer, Terraform, CloudFormation, <IaC tool du jour>, then you are just making more work for yourself. Seriously, pick a tool, and redeploy your solutions using some infrastructure-as-code or configuration management tool.
But in order to collaborate on and deploy your infrastructure-as-code, you need someplace to store all this code, some sort of code repository, and that code repository will use git. Git has taken the world by storm, whether you use GitHub, GitLab, or Microsoft's Azure DevOps (Team Foundation Server).
If you cannot replace any box (outside the storage layer) in your environment in 10 minutes or less, then you are struggling at the fundamentals. Whether you are deploying microservices or a whopping monolithic solution, 10 minutes is what it should take to get a new instance up and running using the same infrastructure (subnets, databases, cache and queuing services). We are not talking about restoring from backup, but just a new instance that is clean and able to handle requests. To be clear, an Elasticsearch cluster takes forever to stand up, but adding a new node to that cluster should take you less than 10 minutes. If you are having issues hitting that 10-minute mark, then look at your process. Are you using Ansible scripts to make changes after the fact? Pre-bake some AMIs or use Packer to pre-build VM images. Now your time is down to just spinning up an existing image and applying any extra changes with Ansible afterwards. Now would be a great time to evaluate Docker to replace some (or all) of your images.
Why 10 minutes? Lower is better, but most services start losing users after 5 minutes of retrying. After 5 minutes, users will leave and try to remember to come back. If you are a retail website, you just lost that sale to some other retailer. If you can replace or supplement all your nodes in a website within 10 minutes, you might still be able to recover lost revenue. However long this process takes will determine your first Service Level Objective (SLO). If it takes you 8 minutes to stand a node back up to start handling web traffic again, you better not have an SLO of less than 9 minutes (1 minute to detect a problem, and 8 to stand the box up). Your first SLO for this service will need some padding to begin with, until you start to understand the alerting, the false alarms, and the time and process needed to stand up new nodes. If you are using something like AWS Elastic Load Balancers, you need to add time for draining and health checks of your solution. As the service and your team mature, this SLO starts getting reduced.
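The arithmetic is simple enough to write down. This is only a sketch of the reasoning above; the class name, the field names, and the idea of expressing padding as a fraction are my own assumptions, so swap in whatever your team actually measures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTimings:
    detect_minutes: float           # time for monitoring to notice the problem
    rebuild_minutes: float          # time to stand up a clean replacement node
    load_balancer_minutes: float = 0.0  # draining and health checks, if any
    padding_fraction: float = 0.0   # extra margin while the process is still new

    def floor_minutes(self) -> float:
        """The tightest SLO you could honestly promise with no margin at all."""
        return (
            self.detect_minutes
            + self.rebuild_minutes
            + self.load_balancer_minutes
        )

    def initial_slo_minutes(self) -> float:
        """The floor plus padding for false alarms and process hiccups."""
        return self.floor_minutes() * (1 + self.padding_fraction)
```

With the numbers from the example, `RecoveryTimings(detect_minutes=1, rebuild_minutes=8).floor_minutes()` gives 9, which is why an SLO under 9 minutes would be a promise you cannot keep. Shrinking `padding_fraction` over time is how the SLO "starts getting reduced" as the team matures.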
Now you have built a system that can be replicated exactly, and in short order. Congratulations, you have just earned your first milestone to building your own...
Any changes happening to a system must go through the pipeline. It is how modern operations teams manage change control, ensure processes are done (e.g., QA checks), and ultimately deploy the solution the same way, every time. Whether you call it CI, CD, the other CD, CI/CD, CI/CD/CD, or Mary, pipelines are the beginning of easy street for any IT Ops team.
Modern day change management
Whatever process you want to build around your pipeline to meet your preferred pipeline definition (Continuous Integration, Continuous Delivery, Continuous Deployment), the first pipeline you should work on building is the pipeline to deploy into production. Pipelines are the modern day change management. Most pipeline solutions allow you to set up approvers (if you don't control that from your git branch permissions), and provide the ability to set up tests and checkpoints. Here is the trick: if you can set up production, you can just as easily set up a test environment to validate that your changes are not going to break production (since you are building all your solutions as Packer/Ansible/Chef/Puppet scripts or pre-baked AMI/VM images). Pipelines let you meet all the requirements of any official change management standard.
Modern day patch management
So all your changes are in code now, processed through a git repository, through your new change management process. Hopefully, you have built a test environment that you automatically deploy to, and you are now building QA tests into your pipeline.
With your new pipelines, you can start using cron jobs (or a cron equivalent) to build new weekly versions of your base images. Security releases now get built into those weekly images. Now you can start building regression tests into your pipeline, and when they pass, you can push to production. New major software version? Change the code in your git repo, and away you go. I once built pipelines that used Packer to grab the newest Ubuntu AMI every week, apply my hardening scripts, add the newest security updates, and publish the result to a private repo as my Ubuntu Gold Disk in the dev environment. That repo update then served as the trigger to kick off other repos that built the Jira AMI and the Confluence AMI. Add some Cypress.io regression tests to verify nothing in Jira or Confluence broke, and those images are newly patched and ready for prime time.
Automate Backup Validations
Now that you have this pipeline down and new images being built, you can trigger off your automated backup schedule and start validating your backups. How you validate depends on your applications: you might just verify the database, or you might actually apply the backup to the Testing environment, use Cypress.io scripts to run regression tests, and verify the last write dates of your data. You can even nerd out a bit, depending on your security posture, and start doing statistical validations against the read-only endpoints of a production API. What would that look like? Glad you asked.
A scenario for a Jira solution:
- Stand up Jira instance, and apply backups to Test instance (DB, file share, etc.)
- Grab the date range of backup data. Grab a random sampling across the entire date range (e.g., 10+ records < 1 week old; 10+ records < 1 month old; 10+ records < 6 months old; 10+ records > 6 months old).
- Cycle through those 40+ records and validate against the Production Jira Issue API.
- Use your statistics acumen to calculate % probability that backup was successful.
- If % is less than 90%, throw an error in the chatroom, or email the on-call technician, etc.
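The sampling and comparison steps in that scenario can be sketched roughly as follows. Everything here is hypothetical scaffolding: the record shape (`key`, `updated`, `summary`, `status`), the non-overlapping age bands, and the `fetch_production` callable (which stands in for a read-only call to the production issue API) are my assumptions. A plain match rate also stands in for the "statistics acumen" step; a proper confidence calculation would replace it.

```python
import random
from datetime import datetime, timedelta
from typing import Any, Callable, Dict, List, Optional, Sequence

Record = Dict[str, Any]

# Assumed age bands, loosely following the "1 week / 1 month / 6 months" split.
AGE_BANDS = [
    (timedelta(0), timedelta(weeks=1)),
    (timedelta(weeks=1), timedelta(days=30)),
    (timedelta(days=30), timedelta(days=182)),
    (timedelta(days=182), timedelta.max),
]


def sample_by_age(
    records: Sequence[Record],
    now: datetime,
    per_band: int = 10,
    rng: Optional[random.Random] = None,
) -> List[Record]:
    """Pick up to per_band random restored records from each age band."""
    rng = random.Random() if rng is None else rng
    picked: List[Record] = []
    for min_age, max_age in AGE_BANDS:
        in_band = [r for r in records if min_age <= now - r["updated"] < max_age]
        picked.extend(rng.sample(in_band, min(per_band, len(in_band))))
    return picked


def match_rate(
    sampled: Sequence[Record],
    fetch_production: Callable[[str], Optional[Record]],
    fields: Sequence[str] = ("summary", "status"),
) -> float:
    """Fraction of sampled restored records that agree with production."""
    if not sampled:
        return 0.0
    matches = 0
    for restored in sampled:
        live = fetch_production(restored["key"])
        if live is not None and all(live.get(f) == restored.get(f) for f in fields):
            matches += 1
    return matches / len(sampled)


def backup_looks_valid(rate: float, threshold: float = 0.9) -> bool:
    """Apply the 90% rule; the caller decides how to alert on failure."""
    return rate >= threshold
```

One caveat worth designing for: issues edited in production after the backup was taken will legitimately differ, so compare fields that rarely change, or only compare records whose production update time is older than the backup.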
Engineering for Chaos
There is Chaos Engineering, and while that is a lofty goal, we are not talking about that at this time. At this time, you just need to be engineering for chaos.
You now have a pretty solid pipeline that provides you with exact builds for every deployment. If something breaks, you just have to roll back to the last version in the git repo and redeploy the old working image. You are now able to start playing with Blue/Green or Canary deployment models. Your deployments break a lot less because you are constantly adding new regression tests to all your solutions as you run into new problems. You are adored by management and now you are starting to develop a following. Now we get into the soft squishy side of DevOps / SRE / ITOps.
By now you have had at least one outage, one rollback, some emergencies. The question I pose is, have you defined and trained on your official incident response procedures? You should ensure that there is a predefined process for what classifies as an incident; what steps to take to start an incident; what documents and communication channels are used during the incident; who has what roles; and what steps happen after an incident. Stuff breaks, incidents happen. The time to learn the process is not when it is happening. Practice, train, have templates, and documentation before you need those skills. And always do a postmortem.
Tabletops are incident response training, gamified. Running successful tabletops can be a whole book in itself. But many trainings do not go far enough. Ensure that you cover not just system failures or Holiday Sales scenarios, but tabletop out security incidents all the way to the courtroom. Is the team trained for first response computer forensics? Do you have tools and techniques defined and ready? What LEOs are you going to call, and what information / evidence is needed for your insurance claims and for prosecuting the guilty party? How are you going to preserve the chain of custody in your environment?
Now you are ready. You can drop a system and watch everyone scramble to monitor and respond. You are starting to build auto-healing systems. You have crossed out of "fundamentals" and into actual Chaos Engineering. The future is yours to pave. Go forth and build!