Google Dataproc Serverless and Dataproc on Google Compute Engine (GCE) are Google’s cloud-native Spark environments that help enterprises capitalize on the value of their data. Dataproc on GCE is a cluster service in which users provision and manage the clusters that run their Spark jobs. Dataproc Serverless, which Google describes as the industry’s first autoscaling serverless Spark offering, abstracts away the management of the underlying clusters.
Although Dataproc Serverless offers benefits such as higher developer productivity, faster innovation, and simpler job execution, there are cases where Dataproc on GCE is the better fit.
We outline four factors that enterprises need to consider when choosing between Google Dataproc Serverless and Dataproc on GCE:
- Type of Workload: Because Dataproc on GCE lets users configure the underlying clusters to match job requirements, it is preferred over Dataproc Serverless for cases such as a Hadoop migration from another platform where cluster settings and libraries are already defined. Such migrations require minimal code rewrites on Dataproc on GCE. Dataproc Serverless is the more suitable option for jobs where teams want to start their data analytics work without managing the underlying infrastructure.
- Infrastructure Costs: Because Dataproc Serverless autoscales compute resources to match the job’s requirements and charges users only for the duration of the job, it tends to be more cost-effective than Dataproc on GCE, where users are charged for the entire time the compute resources are up, including idle time. From a resource management perspective, autoscaling also keeps utilization close to what the job needs, whereas teams on Dataproc on GCE may over- or under-provision their clusters.
- Control over Infrastructure: Spark jobs are typically run by data engineers and data scientists with different goals. Data engineers run core workloads with strict service-level agreements (SLAs), which require greater control over the underlying infrastructure, whereas data scientists want to access Spark directly without managing that infrastructure. Dataproc on GCE is therefore more appropriate for data engineers, while data scientists are likely to find Dataproc Serverless more advantageous.
- Supported Spark Versions: Dataproc Serverless supports Spark 3.2+, which is faster than Spark 2 for specific workloads. Moving from Spark 2 to Spark 3 can also save up to 40% of costs.
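To make the billing difference behind the infrastructure-cost factor concrete, here is a minimal sketch in Python. It assumes a simplified model in which a provisioned cluster is billed for every hour it is up and a serverless batch is billed only for the hours a job runs. The function names and the hourly rates are hypothetical, chosen for illustration; they are not actual Dataproc prices.

```python
def cluster_cost(hourly_rate: float, hours_up: float) -> float:
    """Cost of a provisioned cluster: billed for the whole time it is up."""
    return hourly_rate * hours_up


def serverless_cost(hourly_rate: float, job_hours: float) -> float:
    """Cost of a serverless batch: billed only while the job is running."""
    return hourly_rate * job_hours


def idle_cost(hourly_rate: float, hours_up: float, job_hours: float) -> float:
    """Part of the cluster bill spent on hours in which no job was running."""
    return hourly_rate * (hours_up - job_hours)


# Hypothetical figures: a cluster kept up for 24 hours that runs jobs for 6 of them.
CLUSTER_RATE = 2.0     # assumed cost per hour, not a real price
SERVERLESS_RATE = 2.5  # assumed cost per job hour, not a real price

print(cluster_cost(CLUSTER_RATE, 24))       # whole-day cluster bill
print(idle_cost(CLUSTER_RATE, 24, 6))       # portion of that bill spent idle
print(serverless_cost(SERVERLESS_RATE, 6))  # bill for the job hours only
```

Under these assumed numbers, most of the cluster bill pays for idle hours. The gap narrows as the cluster spends more of its uptime running jobs, so the comparison is worth repeating with real rates and a realistic utilization figure.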
Conclusion
Both Spark offerings from Google cater to distinct needs. Enterprises should weigh the choice between the two against higher-level business objectives such as cost, control over infrastructure, and type of workload.