By now, the cloud has become a well settled deployment option for business applications. Among it’s many benefits, it provides an excellent ground for basing the business intelligence systems as it provides scalability, business agility and lowers total cost of ownership. So, it is anomalous that although most applications have long moved to the cloud, data warehouses – the core of any business intelligence systems – are still mostly on premise.
Of course, that’s not entirely surprising given that businesses have invested heavily in an on-premise data warehouse technology, with strong concerns about protecting sensitive data. Migration of the data warehouse to the cloud must be not only seamless but must ensure that additional security measures are addressed. Let’s take a look.
Challenges in migration of the Data Warehouse to cloud
The challenge starts in identifying what part of the data should be in the cloud. Security and compliance issues are important concerns that can prohibit data movement to the cloud. Some data that needs to be secured on premise cannot be moved to the cloud as is. To overcome this challenge, organizations need to define a solution: either to encrypt the data in the cloud or build a hybrid data warehouse strategy.
The volume of the data that needs to be moved from the data centers to the cloud can be very high and can run into hundreds of terabytes. Ingesting such voluminous data would need time, bandwidth and cost. To solve this, at times, we need to physically move the data from data centers to the cloud provider.
Another challenge, as always, is cost. Cloud migration may not be cost effective if there are heavy investments done into the existing infrastructure. This can be mitigated by understanding that existing infrastructure will eventually become obsolete, demanding upgrades. Cloud migration could be planned with the infrastructure upgrade cycle so that the costs can be optimized.
Steps in migration of the Data Warehouses
At Persistent Systems, we successfully migrated multiple data warehouse systems from an on-premise implementation to the cloud. In this section, we summarize the steps that we recommend. The below image depicts the high-level approach to the warehouse migration.
Characterize the current data warehouse along multiple dimensions
To start, an in-depth analysis of the current data warehouse system is required. It should cover aspects such as:
- How to handle sensitive data. Some options are:
- Migrate the entire on-premise warehouse data to the cloud while encrypting the sensitive/secret data
- Opt for a hybrid solution where only non-sensitive data is moved to the cloud
- Total volume of the data to be moved considering the growth rate of the data, long-term persistence of the data on the cloud etc.
Analysis should cover the sensitive data aspects very carefully. What options are provided by the cloud vendor and what are the organizational needs for the confidentiality, need to be taken together. Regulatory standards must be analyzed. Some cloud service providers offer security and management programs that meet guidelines of regulatory agencies and standards bodies. Performance criteria to be met also need to be analyzed while coming up with the solution for the sensitive data aspects.
There could be multiple business units and business users who are using the data. Prioritizing and breaking down the entire data warehouse into business process subject areas and sequencing them for migration over a period of time can ease out the migration process.
Choose the right cloud vendor for your warehouse
There are multiple cloud providers e.g. Amazon, Microsoft Azure, Google Cloud in the market. Based on the outcome of the characterization of the existing warehouse, create a matrix of the functionalities needed vs. features available in respective cloud provider’s environment to make the final choice. Of course, the cost plays a very important role in making this decision.
Identify technology choices using proof of concept projects
The most important technology decision is the appropriate database technology for hosting the warehouse. To make this decision, identify the queries characterizing the typical workload as a basis for performance tests. The parameters that will drive this decision include:
- Database technologies available from the cloud provider
- Data types needed by the application
- Database features required
- Number of concurrent users
- Real-time data processing needs
- Performance needs etc..
The compute and the storage requirements will decide the infrastructure needed for hosting the data warehouse. At this point, we also need to decide the high-availability and load balancing needs for the deployment architecture. Tools to be used for onetime pull of the data versus incremental data migration / ETL also need to be decided.
Prepare a migration plan
Remember the old saying, those who fail to plan, plan to fail. Preparing a well thought out project plan is crucial for a successful migration. You simply must involve business users in the decision making and in the project plan, well in advance. They should be completely aware of the migration project plan and in fact participate in testing the new system. Often, there is a lot of resistance in moving away from the well-established systems and processes. Make sure you evangelize the new system – and importantly get the business users to evangelize – to raise user confidence level.
Infrastructure readiness
Most data warehouse projects on the cloud require a VPC to be set up. A VPC houses the entire production warehouse setup. It is important to create parallel sets of environments for development, QA and UAT to ensure smooth functioning of the systems. Set up security and access control on each of the environments. Lay out the deployment architecture and complete the plumbing.
Complete the migration
Migration involves initial movement of the historical data followed by the incremental pull of the data from the various source systems. A detailed specification needs to be in place that maps the source data elements to the target data elements. Test the migration framework using test data or sample data. Once such testing is complete, production data needs to be migrated wherein users can perform UAT, help debug their systemic issues and get the needed confidence. It is advisable to run the pilot data migration project in parallel with the legacy system so that it is easier to test and demonstrate the success of the migration. The extraction and migration needs to be setup in a repeatable way so that in case there are issues with the migration, it can be processed again.
Test the success of migration
Defining the success criteria for the migration and testing in accordance with them is of utmost importance to ensure long term success of the cloud data warehouse. For the initial load of the data migration, we typically test and compare one or several sample data sets and queries. In addition, we recommend comparing row counts of objects, min and max values as well as sample data row comparisons. If the migration includes data that poses significant business or compliance risk, 100% of the migrated data should be verified using an automated testing tool. If the data is encrypted, then the testing becomes more challenging, as you need to make sure encryption works properly and decryption generates the original data. You also need to make sure that incremental data loads work correctly as well.
Best Practices for data warehouse migration
While working on various data warehouse migration projects within Persistent Systems, we have established some best practices for migrating to the cloud. Here, we share a few of them.
- Focus on the end users of the system: involve them in the migration process right from the start. Get their help to evaluate the system so as to generate enough interest and excitement for them to embrace the new system.
- Run the legacy system and the migrated system in parallel during an agreed time period. Users would feel very confident about the entire migration.
- Do not overdo the encryption: make sure you understand the security needs, in particular, the level of encryption that is required. Encryption adds to the latency and hence affects the performance.
- Write generic components for the migration wherever possible. For instance, in cases where there are multiple source systems involved and one-to-one object migrations are needed, a generic extraction framework can be implemented. Once the data is in staging, specific transform and load scripts can be written to move the data to their respective data warehouse tables. With this approach, a one-time development effort is spent for the extraction framework and it can be reused for multiple source system migrations.
- Plan to use the functionalities like servers-less computing that will bring down the TCO.
- It is important to develop a comprehensive test plan to ensure a high-quality migration. Automate the tests wherever possible.
- Partition the data warehouse model into sections used by different business processes; these correspond to logical data marts within the warehouse. Identify those data marts that are non-critical to the business which can be pilots for migration. With the pilot warehouse migration many basic migration issues would be ironed out; this will lay a very good ground for migration of the business-critical data marts.
Conclusion
In the recent past there has been a lot of support for data warehouses in the cloud. As many vendors expand the breadth and depth of services for warehousing and analysis of the data, moving data warehouses to the cloud is yielding real business benefits. However, migration is not exempt of challenges. Make sure you put enough thought in the different aspects of data warehouse migration and come up with a well thought out plan to address those challenges.