I have been following various Data Lake solutions for the past several years and have seen its evolution from being an on-premise Hadoop-based stack to finally arriving at full PaaS services on cloud. In fact, I strongly believe that S3 + Glue + Athena + EMR is becoming more popular than Hadoop-based solutions.
One of the biggest selling points of the ‘Data Lake’ concept is its ability to consume data in a variety of different ways – like SQL, BI, Scripting, Machine Learning and more. But guess how most of the users are consuming it – its SQL. Be it Hive, Impala, Spark (SparkSQL) or Athena, or through BI tools, which again indirectly mean SQL!
So, while all the Data Lake technologies are happy to embrace SQL, modern Data Warehouses like Snowflake are embracing cooler things that Data Lake brought – like cheap storage, storage compute separation, pay as you go, unlimited scalability, semi-structured data support etc. That’s the reason why Snowflake, which is actually a Data Warehouse, is also becoming a popular Data Lake solution.
In this blog, I will explain the strengths and weaknesses of traditional Data Lake solutions and also proceed to compare the Snowflake-based solution with traditional stacks (Hadoop or AWS stack).Here is a quick analysis of Hadoop/AWS based Data Lake solutions:
Good at:
- Flexible Data Ingestion
- Low Cost Storage
- Multi-variate Schema
- Schema on Read
- Multiple Mechanisms of Consumption
- Batch mode analytics
- Scale and Scale-out
- Access Control
Becoming better at:
- Interactive Querying
- BI Tool Integration
- ETL/ELT tool integration
Still Novice:
- Metadata Tagging
- Schema Discovery
- Authorization and Security
- Auditing
- Lineage
- Pure Replication Use Cases
- Data with lot of updates/deletes
Complicated:
- Cluster Setup and Management
- Too many technologies
- Integration left to Developers
- Not ONE story yet
A few quick pointers around why Snowflake is becoming popular as a Data Lake solution
Here is how the two solutions score on various dimensions:
Data Lake |
Snowflake Data Lake + DWH |
Snowflake Advantage |
|
---|---|---|---|
Flexible Data Ingestion | ☆☆☆☆☆ | ☆☆☆☆☆ | Seamless ingestion from variety of sources and good support
for ETL tools |
Low Cost Storage | ☆☆☆☆☆ | ☆☆☆☆☆ | Storage Compute Separation – Pass through cost for cloud
storage |
Multi-variate Schema | ☆☆☆☆☆ | ☆☆☆☆ | First class support for JSON |
Schema on Read | ☆☆☆☆☆ | ☆☆☆☆ | Create Views for JSON Tables |
Multiple Mechanism of Consumption | ☆☆☆☆☆ | ☆☆☆☆☆ | Supports SQL, Spark, Python, R, Java |
Batch Mode Analytics | ☆☆☆☆☆ | ☆☆☆☆☆ | Transformation using series of SQLs |
Unlimited Scaling and Scale Out | ☆☆☆☆☆ | ☆☆☆☆☆ | Instant Scale out for storage and compute |
Access Control | ☆☆☆☆☆ | ☆☆☆☆☆ | Integrates with all popular authentication systems |
Interactive Querying | ☆☆☆ | ☆☆☆☆☆ | High performance SQL |
BI Tool Integration | ☆☆☆ | ☆☆☆☆☆ | Very good support (Tableau, Qlick, MSTR, Cognos…) |
ETL/ELT Tool Integration | ☆☆☆ | ☆☆☆☆☆ | Very good support (Talend, Informatica, Mattilion, ….) |
Metadata Tagging | ☆☆☆ | ☆☆☆☆ | Rich Metadata available through services layer. Can be integrated using glossary tools |
Schema Discovery | ☆☆☆ | ☆☆☆ | No in-built support but can be integrated with external tools |
Authorization and Security | ☆☆ | ☆☆☆☆☆ | Schema, Table, View, Column and Row level security |
Auditing | ☆☆☆ | ☆☆☆ | Pretty similar capabilities |
Lineage | ☆☆☆☆☆ | ☆☆☆ | Cloudera Navigator etc. tools have better lineage tracking |
Pure Replication Use Cases | ☆☆ | ☆☆☆☆☆ | Support updates and deletes. Very good for replication
use cases |
Data with lot of updates | ☆☆ | ☆☆☆☆☆ | Support updates and deletes and time travel |
Setup and Management | ☆☆☆ | ☆☆☆☆☆ | Full PAAS |
My final closing thoughts
1. Snowflake is a very good data lake solution, (it gives value over and above just being a Data Warehouse) if most of your data is structured or JSON. You can still reap the benefits of a lot of features promised by Data Lake solutions while still leveraging the advantages of what a scalable database can offer (e.g. query performance, role-based access control, full transaction semantics, update and delete support etc).
2. Of course, if you have a lot of image, text or similar unstructured data or the volume is going beyond petabytes, or schema-on-read is a must have feature, then Hadoop/AWS based lakes would score better than Snowflake as a data lake. In such cases, Snowflake’s usage would be limited to being a Data Warehouse.