Snowflake revolutionized the data warehousing industry with the launch of its cloud offering in 2014. It introduced, and successfully implemented, the separation of storage and compute in the data warehousing world, and it leveraged the cloud effectively by providing elasticity in both usage and cost. For the most part, though, Snowflake has remained a heavily SQL-based transformation system. That began to change with the introduction of its Snowpark offering last year.

Snowpark – The new data transformation ecosystem

Snowpark allows developers to write transformation and machine learning code in a Spark-like fashion using Python, Java, or Scala and run that code on Snowflake's virtual warehouses, i.e., without the need to provision or maintain any separate compute infrastructure.

It is very similar to Spark, with the fundamental difference that it requires no separate compute infrastructure; that is what Snowflake is banking on. Snowpark code runs inside an Anaconda-powered sandbox within Snowflake's virtual warehouses.

See below for a 30,000-foot comparison of Spark and Snowpark.

Snowpark and Spark – High-Level Architecture

Snowpark provides a client API that developers can use with Python, Scala, or Java to perform transformations. Like Spark, it follows lazy execution principles. You can use Jupyter notebooks to write and debug Snowpark programs, as in the sketch below.
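Here is a minimal sketch of the Snowpark Python API illustrating lazy execution. The connection parameters and the ORDERS table are placeholders for your own account details and data.

```python
# Minimal Snowpark sketch; connection parameters are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# DataFrame operations are lazy: nothing runs in the warehouse yet.
orders = session.table("ORDERS")
big_orders = orders.filter(col("AMOUNT") > 1000).select("ORDER_ID", "AMOUNT")

# Only an action such as show() or collect() pushes the generated SQL
# down to a Snowflake virtual warehouse for execution.
big_orders.show()
```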

For data science workloads, you can now do preprocessing, feature engineering, model training, and model execution entirely through the Snowpark API. The best part is that all execution happens within Snowflake virtual warehouses; no additional compute, configuration, or maintenance is needed.
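To make that concrete, here is a hedged sketch of model training pushed into the warehouse via a Snowpark stored procedure. The FEATURES table and its X1, X2, and LABEL columns are hypothetical, and the snippet assumes the `session` created in the earlier sketch.

```python
# Hypothetical sketch: training a scikit-learn model inside a Snowpark
# stored procedure so the work runs on a Snowflake virtual warehouse.
# Assumes the `session` object created in the earlier sketch.
from snowflake.snowpark import Session

def train_model(sp_session: Session) -> str:
    from sklearn.linear_model import LogisticRegression
    # Pull the (illustrative) FEATURES table into pandas inside the warehouse.
    df = sp_session.table("FEATURES").to_pandas()
    model = LogisticRegression().fit(df[["X1", "X2"]], df["LABEL"])
    score = model.score(df[["X1", "X2"]], df["LABEL"])
    return f"trained, in-sample accuracy={score:.3f}"

# Register the stored procedure; packages resolve via the Anaconda channel.
train_model_sp = session.sproc.register(
    train_model,
    name="train_model_sp",
    packages=["snowflake-snowpark-python", "scikit-learn", "pandas"],
    replace=True,
)
print(train_model_sp())  # executes inside the Snowflake warehouse
```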

Snowpark seems to provide some clear-cut advantages over traditional transformation technologies like Spark, Hadoop, and dbt. Some of the most important ones are listed below:

  • Zero cluster maintenance: Other than Snowflake itself, no additional service or infrastructure is needed. That means near-unlimited scalability and flexibility with near-zero maintenance, and no extra compute technology to manage.
  • Inherently secure: All processing happens within Snowflake. Data does not leave Snowflake.
  • Multi-use-case support: Both SQL-based and non-SQL-based data pipelines are supported through Snowpark. Structured, semi-structured, and unstructured data are all handled by leveraging the Snowflake engine's powerful UDF features.
  • Open-source package manager: Full Python support and access to open-source packages and a package manager via the Anaconda integration, with no additional setup required (see the sketch after this list).
  • Governance friendly: Because everything happens within Snowflake, it's easy to enforce governance policies from the single Snowpark + Snowflake platform. No redundancy, no unwanted network access, no data movement outside Snowflake.
  • Multi-language support: Beyond SQL, it supports popular programming languages, namely Python, Java, and Scala.
  • Free API with a Snowflake subscription: There is no additional charge for using Snowpark.
  • Data scientist friendly: ETL developers and data scientists now work on the same platform, bringing homogeneity and predictability to the process. DevOps becomes simple and worry-free.
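As a small illustration of the Anaconda integration mentioned above, the following hedged sketch registers a Python UDF that uses an open-source package. It assumes the `session` from the earlier sketch; the UDF name and the referenced table are illustrative.

```python
# Illustrative only: pull an open-source package from the Snowflake
# Anaconda channel and use it inside a Python UDF. Assumes an existing
# Snowpark `session` as in the earlier sketch.
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

session.add_packages("numpy")  # resolved from the Anaconda channel, no pip needed

@udf(name="log1p_udf", return_type=FloatType(), input_types=[FloatType()],
     replace=True)
def log1p_amount(x: float) -> float:
    import numpy as np
    return float(np.log1p(x))

# The UDF runs inside Snowflake's sandbox and is also callable from SQL:
#   SELECT log1p_udf(AMOUNT) FROM ORDERS;
```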
SQL Transformations vs. Snowpark – A perspective on when to use one over the other

With all this cool stuff going around, the question to answer is: should enterprises completely migrate to Snowpark? The simplicity of standard SQL transformations (streams, tasks, stored procedures, or dbt-based transformations) is too attractive to ignore entirely. Data engineers should leverage both SQL transformations and Snowpark when building data pipelines.

Here are some rules of thumb on when to use SQL transformations vs. Snowpark:

| Type of Transformation | SQL Transformation | Snowpark |
| --- | --- | --- |
| Parsing | Standard formats like CSV, JSON, and Parquet are easy to parse and process using standard SQL. | Easier for EDI, IoT, and similar formats. |
| Translation and mapping | Source-to-target mapping is easy. | Might get complex even for simple mappings. |
| Filtering, aggregation, and summarization | Good support for filtering, aggregation, and summarization functions. | Good support here too. |
| Enrichment and imputation | Would be complex to achieve. | Python over Snowpark scores much better in this area. |
| Indexing and ordering | SQL has very rich functionality and can be leveraged easily. | Python is at par. |
| Anonymization and encryption | May not be available. | Python third-party libraries can be used. |
| Modeling, typecasting, formatting, and renaming | Easy when converting to standardized data models (e.g., dimensional model, Data Vault). | Might be complex. |
| Model execution/inference | Models need to be implemented as UDFs to execute them through SQL, which can get complicated. | Python/Snowpark scores much better here (see the sketch below the table). |
| Model training | SQL has no such capability. | Snowpark shines here thanks to full Python support. |
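To illustrate the model execution row, here is a hedged sketch of scoring rows with a pre-trained model through a Python UDF. The stage path @ml_models/model.joblib, the FEATURES table, and its columns are hypothetical, and the snippet assumes the `session` from the earlier sketch.

```python
# Hedged sketch of model inference as a Python UDF. Stage path and
# table/column names are hypothetical; assumes the earlier `session`.
from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import FloatType

session.add_packages("scikit-learn", "joblib")
session.add_import("@ml_models/model.joblib")  # ship the model file with the UDF

@udf(name="score_udf", return_type=FloatType(),
     input_types=[FloatType(), FloatType()], replace=True)
def score_udf(x1: float, x2: float) -> float:
    import os
    import sys
    import joblib
    # Files added via add_import are unpacked into this sandbox directory.
    import_dir = sys._xoptions.get("snowflake_import_directory")
    # Production code would cache the loaded model instead of reloading per row.
    model = joblib.load(os.path.join(import_dir, "model.joblib"))
    return float(model.predict_proba([[x1, x2]])[0][1])

scored = session.table("FEATURES").select(
    col("X1"), col("X2"), score_udf(col("X1"), col("X2")).alias("SCORE")
)
scored.show()  # SQL pushdown happens only at this action
```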

To summarize, Snowpark is a new technology, and its adoption by Snowflake customers and the broader technology community has yet to mature; it still needs to prove its mettle at production scale. It has a steep climb ahead to catch up with incumbents like Databricks, Hadoop, and other transformation technologies. However, it is a new-age, compelling solution, and for users already on Snowflake it is an interesting option to try out. Remember that it comes from Snowflake, which has a powerful team behind its cloud platform. So go try it!

In the second part of this blog series, we will cover some important functionalities provided by the Snowpark API and see them in action.
