In our last blog we looked at how to prepare for and navigate through a cyberattack. Next, I want to highlight the critical difference between cyber resiliency, which is about recovering from a cyberattack, and more traditional disaster recovery. Too often I talk to customers who believe an existing disaster recovery plan will be sufficient, so let me spell out some important differences.
A disaster recovery plan is designed to recover from an inadvertent disaster – usually a massive outage, a natural disaster or a catastrophic event like 9/11 – where the system has been shut down or even destroyed. A cyberattack is quite different: bad actors have deliberately gained access to bring down the system, whether for the pleasure of disruption or, more likely, to ransom and/or steal critical data. Dealing with this far more malignant scenario requires a very different approach.
The Disaster Recovery Scenario
Traditional disaster recovery products are designed to provide fast recovery of servers, apps and data to get back up and running as quickly as possible with minimal loss of data. Recovering from backups alone, the data would be on average a day old, and it would take a long time to reload the operating system, applications and data just to start bringing a server online. If you had to do this for all of the critical servers in a data center, you would be down for days if not longer. The goal of disaster recovery is to reduce that downtime to minutes and to limit data loss to under an hour or better, depending on the data's criticality and the amount of money a company spends.
The disaster recovery site can be a co-location facility, the cloud or a secondary data center. You don’t need to worry about an isolated environment, because disaster recovery protects against disasters, not attacks. You can use the cloud, but you can also use another company-owned data center (often two data centers serve as backup sites for one another) or a co-location data center.
Features focus on the ability to switch everything over quickly. As part of that, you want all of the key infrastructure (servers and applications, networking, DNS, SSO) to work just as it did minutes before. The network should therefore look very similar, with at most some DNS entries updated. Passwords should continue to work, and the servers and applications are expected to be in a known safe state.
The disaster recovery product can use async replication, CDP or point-in-time images. There are multiple methods for creating disaster recovery images. In some cases there is continuous replication, so that the disaster recovery server is almost always just a few transactions behind the production servers. CDP (continuous data protection) is similar to async replication but also allows you to go back to previous points in time. There are also snapshot-based point-in-time images. Each has its pros and cons, but for disaster recovery the user can make tradeoffs based on cost and other factors.
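The snapshot-based approach can be illustrated with a toy sketch in plain Python (the class and method names here are illustrative, not any vendor's product): the live data set is periodically copied into a point-in-time image, and any retained image can later be restored.

```python
import copy

class SnapshotStore:
    """Toy illustration of snapshot-based point-in-time images:
    the live data set can be rolled back to any retained snapshot."""

    def __init__(self):
        self.live = {}        # current "production" data
        self.snapshots = []   # ordered point-in-time images

    def write(self, key, value):
        self.live[key] = value

    def take_snapshot(self):
        # A real product would use copy-on-write; a deep copy keeps the toy simple.
        self.snapshots.append(copy.deepcopy(self.live))
        return len(self.snapshots) - 1  # snapshot id

    def restore(self, snapshot_id):
        self.live = copy.deepcopy(self.snapshots[snapshot_id])

store = SnapshotStore()
store.write("config", "v1")
snap = store.take_snapshot()
store.write("config", "v1-tampered")   # e.g. an unwanted change after the snapshot
store.restore(snap)                    # roll back to the earlier image
print(store.live["config"])            # prints: v1
```

The same rollback mechanism is what makes point-in-time images useful beyond disaster recovery, as the next section shows.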
The Cyber Resilience Scenario
In contrast, it takes a combination of products, infrastructure, security and processes to create cyber resiliency. There are always attack vectors, and as products get better at protection and recovery, the bad actors evolve their methods.
Point-in-time images – These allow an organization to look back and compare points in time, see what has been modified in a server image, and thereby identify a clean, safe image to move forward with.
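The comparison step can be sketched in a few lines of plain Python (the function names are illustrative, not any product's API): hash every file in two images and diff the results to surface exactly what was added, removed or modified between the two points in time.

```python
import hashlib
from pathlib import Path

def image_digest(root):
    """Map each file's relative path to a SHA-256 digest of its contents."""
    digests = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            rel = str(path.relative_to(root))
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests

def compare_images(old_root, new_root):
    """Report what changed between two point-in-time images on disk."""
    old, new = image_digest(old_root), image_digest(new_root)
    return {
        "added":    sorted(new.keys() - old.keys()),
        "removed":  sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

An unexpected entry in any of these three lists is a starting point for deciding which earlier image is the clean one.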
Cloud-deployment
- The Cloud provides an initial isolation from the data center, making it much more difficult for bad actors to reach the saved images
- Airgaps in the cloud prevent test images from corrupting other images and provide for a secure vault.
- Role-based access to products that access and store data limits what any individual user (including admins) can do to change how storage is managed, delete critical tools or images, or change what an application or tool does. Having different roles or functions that can be performed means it takes multiple people with different credentials to make significant modifications that can compromise the environment.
- Immutable storage so that older versions of point-in-time images can’t be deleted.
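A toy sketch of how the last two ideas combine (the roles, class and retention mechanism here are hypothetical, not a real cloud API): operators can store images but not delete them, deletion requires a separate security role, and even that role cannot delete an image while its retention period is in force.

```python
import time

class ImageVault:
    """Toy vault combining role-based access with retention-based immutability."""

    ROLE_PERMISSIONS = {
        "backup-operator": {"store"},            # can add images, nothing else
        "security-admin": {"store", "delete"},   # separate credentials for deletion
    }

    def __init__(self):
        self.images = {}  # name -> (data, retain_until timestamp)

    def store(self, role, name, data, retention_seconds):
        if "store" not in self.ROLE_PERMISSIONS.get(role, set()):
            raise PermissionError(f"{role} may not store images")
        self.images[name] = (data, time.time() + retention_seconds)

    def delete(self, role, name):
        if "delete" not in self.ROLE_PERMISSIONS.get(role, set()):
            raise PermissionError(f"{role} may not delete images")
        _, retain_until = self.images[name]
        if time.time() < retain_until:
            # Immutability: no role can remove an image under retention.
            raise PermissionError("image is under immutable retention")
        del self.images[name]
```

The point of the sketch is the separation: compromising one set of credentials is not enough to both tamper with production and destroy the clean images needed for recovery.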
The process
A full cyber recovery plan and play book – A written plan and the steps that need to be taken for each activity in the recovery process. Nothing should be done from memory and the process must be followed in order to recover.
Scanning of images for anomalies – Regular scanning of images that are in the cyber resiliency environment for changes or files that are suspect allows earlier detection of a pending attack. It also allows early knowledge of potential data theft. All of this is critical to being able to recover after an attack.
Remediation of compromised images – Before a server is brought back online for production operations, it needs to be evaluated and, if infected, cleaned; otherwise the risk of reinfection is high.
Regular testing in an isolated environment – Regular testing has always been important to validate that the processes are working in order to bring servers up after an attack, and to evaluate servers as to their current state.
Deployment only when the images can be confirmed safe from malware – See remediation, two bullets above.
Add users back to applications based on security rules – After an attack it is likely that some users’ credentials have been compromised. If all users are added back with their current credentials, the new environment is as vulnerable as before. As applications come online, users need to be added carefully and should be given new passwords and other credentials before they regain access.
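That re-onboarding step can be sketched as follows (the function and account names are illustrative): accounts flagged as compromised are held back for manual review, and every other account gets a fresh one-time credential rather than its old password.

```python
import secrets

def reprovision_users(users, compromised):
    """Rebuild the user list for a recovered application.

    Returns (reenabled, held): reenabled maps each clean account to a fresh
    one-time reset password; held lists accounts kept offline for review.
    """
    reenabled, held = {}, []
    for user in users:
        if user in compromised:
            held.append(user)           # do not auto-restore suspect accounts
        else:
            # Fresh random credential forces a password reset at first login.
            reenabled[user] = secrets.token_urlsafe(16)
    return reenabled, held

reenabled, held = reprovision_users(["alice", "bob", "mallory"], {"mallory"})
print(sorted(reenabled), held)         # prints: ['alice', 'bob'] ['mallory']
```

In practice this gating would live in the identity provider, but the principle is the same: no credential from before the attack survives into the recovered environment unchanged.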
Next, we’ll look at the clear advantages of battling cyberattacks in a cloud environment.