I have been following various Data Lake solutions for the past several years and have seen its evolution from being an on-premise Hadoop-based stack to finally arriving at full PaaS services on cloud. In fact, I strongly believe that S3 + Glue + Athena + EMR is becoming more popular than Hadoop-based solutions.

One of the biggest selling points of the ‘Data Lake’ concept is its ability to consume data in a variety of different ways – like SQL, BI, Scripting, Machine Learning and more. But guess how most of the users are consuming it – its SQL. Be it Hive, Impala, Spark (SparkSQL) or Athena, or through BI tools, which again indirectly mean SQL!

So, while all the Data Lake technologies are happy to embrace SQL, modern Data Warehouses like Snowflake are embracing cooler things that Data Lake brought – like cheap storage, storage compute separation, pay as you go, unlimited scalability, semi-structured data support etc. That’s the reason why Snowflake, which is actually a Data Warehouse, is also becoming a popular Data Lake solution.

In this blog, I will explain the strengths and weaknesses of traditional Data Lake solutions and also proceed to compare the Snowflake-based solution with traditional stacks (Hadoop or AWS stack).Here is a quick analysis of Hadoop/AWS based Data Lake solutions:

Good at:

  • Flexible Data Ingestion
  • Low Cost Storage
  • Multi-variate Schema
  • Schema on Read
  • Multiple Mechanisms of Consumption
  • Batch mode analytics
  • Scale and Scale-out
  • Access Control

Becoming better at:

  • Interactive Querying
  • BI Tool Integration
  • ETL/ELT tool integration

Still Novice:

  • Metadata Tagging
  • Schema Discovery
  • Authorization and Security
  • Auditing
  • Lineage
  • Pure Replication Use Cases
  • Data with lot of updates/deletes

Complicated:

  • Cluster Setup and Management
  • Too many technologies
  • Integration left to Developers
  • Not ONE story yet

A few quick pointers around why Snowflake is becoming popular as a Data Lake solution

A few quick pointers around why Snowflake is becoming popular as a Data Lake solution

Here is how the two solutions score on various dimensions:

Data Lake

Snowflake Data Lake + DWH

Snowflake Advantage

Flexible Data Ingestion ☆☆☆☆☆ ☆☆☆☆☆ Seamless ingestion from variety of sources and good support

for ETL tools

Low Cost Storage ☆☆☆☆☆ ☆☆☆☆☆ Storage Compute Separation – Pass through cost for cloud

storage

Multi-variate Schema ☆☆☆☆☆ ☆☆☆☆ First class support for JSON
Schema on Read ☆☆☆☆☆ ☆☆☆☆ Create Views for JSON Tables
Multiple Mechanism of Consumption ☆☆☆☆☆ ☆☆☆☆☆ Supports SQL, Spark, Python, R, Java
Batch Mode Analytics ☆☆☆☆☆ ☆☆☆☆☆ Transformation using series of SQLs
Unlimited Scaling and Scale Out ☆☆☆☆☆ ☆☆☆☆☆ Instant Scale out for storage and compute
Access Control ☆☆☆☆☆ ☆☆☆☆☆ Integrates with all popular authentication systems
Interactive Querying ☆☆☆ ☆☆☆☆☆ High performance SQL
BI Tool Integration ☆☆☆ ☆☆☆☆☆ Very good support (Tableau, Qlick, MSTR, Cognos…)
ETL/ELT Tool Integration ☆☆☆ ☆☆☆☆☆ Very good support (Talend, Informatica, Mattilion, ….)
Metadata Tagging ☆☆☆ ☆☆☆☆ Rich Metadata available through services layer. Can be
integrated using glossary tools
Schema Discovery ☆☆☆ ☆☆☆ No in-built support but can be integrated with external tools
Authorization and Security ☆☆ ☆☆☆☆☆ Schema, Table, View, Column and Row level security
Auditing ☆☆☆ ☆☆☆ Pretty similar capabilities
Lineage ☆☆☆☆☆ ☆☆☆ Cloudera Navigator etc. tools have better lineage tracking
Pure Replication Use Cases ☆☆ ☆☆☆☆☆ Support updates and deletes. Very good for replication

use cases

Data with lot of updates ☆☆ ☆☆☆☆☆ Support updates and deletes and time travel
Setup and Management ☆☆☆ ☆☆☆☆☆ Full PAAS

My final closing thoughts
1. Snowflake is a very good data lake solution, (it gives value over and above just being a Data Warehouse) if most of your data is structured or JSON. You can still reap the benefits of a lot of features promised by Data Lake solutions while still leveraging the advantages of what a scalable database can offer (e.g. query performance, role-based access control, full transaction semantics, update and delete support etc).

2. Of course, if you have a lot of image, text or similar unstructured data or the volume is going beyond petabytes, or schema-on-read is a must have feature, then Hadoop/AWS based lakes would score better than Snowflake as a data lake. In such cases, Snowflake’s usage would be limited to being a Data Warehouse.