Spark Dataset APIs: A Gentle Introduction

A gentle introduction to Apache Spark Dataset with a Databricks notebook to play around.


Mageswaran D

2 years ago | 2 min read

RDD was the initial backbone of Apache Spark that enabled distributed computing more accessible for masses with easy to use functional APIs.

Dataframe was later introduced to make the structured data computation more easy, however it didn’t have the type information for the data it was holding.

Dataset brings the best of both worlds with a mix of relational (DataFrame) and functional (RDD) transformations. This API is the most up to date and adds type-safety along with better error handling and far more readable unit tests.

Dataset clubs the features of RDD and DataFrame.

It provides:

  • The convenience of RDD.
  • Performance optimization of DataFrame.
  • Static type-safety of Scala.
  • Structured queries with encoders

The encoder is primary concept in serialization and deserialization (SerDes) framework in Spark SQL. Encoders translate between JVM objects and Spark’s internal binary format. Spark has built-in encoders which are very advanced. They generate bytecode to interact with off-heap data.

An encoder provides on-demand access to individual attributes without having to de-serialize an entire object. To make input-output time and space efficient, Spark SQL uses the SerDe framework. Since encoder knows the schema of record, it can achieve serialization and deserialization.

However, it comes with a tradeoff as map and filter functions perform poorer with this API. Frameless is a promising solution to tackle this limitation.


Optimized Query

Dataset in Spark provides Optimized query using Catalyst Query Optimizer and Tungsten. Catalyst Query Optimizer is an execution-agnostic framework. It represents and manipulates a data-flow graph. Data flow graph is a tree of expressions and relational operators. By optimizing the Spark job Tungsten improves the execution. Tungsten emphasizes the hardware architecture of the platform on which Apache Spark runs.

Analysis at compile time

Using Dataset we can check syntax and analysis at compile time. It is not possible using Dataframe, RDDs or regular SQL queries.

Less Memory Consumption

While caching, it creates a more optimal layout. Spark knows the structure of data in the dataset.

Lightning-fast Serialization with Encoders

Encoders are highly optimized and use run time code generation to build custom bytecode for serialization and deserialization. As a result, they can operate significantly faster than Java or Kryo serialization.

In addition to speed, the resulting serialized size of encoded data can also be significantly smaller (up to 2x), reducing the cost of network transfers. Furthermore, the serialized data is already in the Tungsten binary format, which means that many operations can be done in-place, without needing to materialize an object at all. Spark has built-in support for automatically generating encoders for primitive types (e.g. String, Integer, Long), Scala case classes, and Java Beans.

You can try online version with DataBricks Community version @

Dataset : StarWars.csv

“Anakin Skywalker”;188;84;”blue”;”blond”;”jedi”;”human”
“Padme Amidala”;165;45;”brown”;”brown”;”no_jedi”;”human”
“Luke Skywalker”;172;77;”blue”;”blond”;”jedi”;”human”
“Leia Skywalker”;150;49;”brown”;”brown”;”no_jedi”;”human”
“Qui-Gon Jinn”;193;89;”blue”;”brown”;”jedi”;”human”
“Obi-Wan Kenobi”;182;77;”bluegray”;”auburn”;”jedi”;”human”
“Han Solo”;180;80;”brown”;”brown”;”no_jedi”;”human”
“Sheev Palpatine”;173;75;”blue”;”red”;”no_jedi”;”human”
“Darth Maul”;175;80;”yellow”;”none”;”no_jedi”;”dathomirian”
“Lando Calrissian”;178;79;”brown”;”blank”;”no_jedi”;”human”
“Boba Fett”;183;78;”brown”;”black”;”no_jedi”;”human”
“Jango Fett”;183;79;”brown”;”black”;”no_jedi”;”human”


Created by

Mageswaran D







Related Articles