How to Create an Empty DataFrame in Spark?

A DataFrame in Spark is a distributed collection of data organized into named columns. It resembles a table in a relational database or a spreadsheet. DataFrames provide a high-level API for manipulating structured and semi-structured data, so complex operations can be expressed concisely and run efficiently across a cluster.

Creating an Empty DataFrame in Spark 2.x:
In Spark 2.x, you can create an empty DataFrame using the createDataFrame() method provided by the SparkSession. Here’s the code snippet to create an empty DataFrame in Spark 2.x:

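from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Placeholder schema with one hypothetical column; replace with your own
schema = StructType([StructField("id", IntegerType(), True)])

# Create an empty DataFrame
empty_df = spark.createDataFrame([], schema)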

The createDataFrame() method accepts two arguments: the data and the schema. For an empty DataFrame, we pass an empty list [] as the data. The schema is required in this case, because Spark has no rows from which to infer column names and types.

Example: Creating an Empty DataFrame in Spark 2.x:
Let’s consider an example where we create an empty DataFrame named “empty_data” with columns “name” and “age”:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Define the schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create an empty DataFrame
empty_data = spark.createDataFrame([], schema)

# Display the empty DataFrame
empty_data.show()

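Creating an Empty DataFrame in Spark 3.x:
In Spark 3.x, createDataFrame() behaves the same way for empty data: a schema is still required. Calling spark.createDataFrame([]) without one raises a ValueError, because Spark cannot infer column types from an empty list. To create an empty DataFrame with no columns at all, pass an empty StructType as the schema. Here’s the code snippet for creating an empty DataFrame in Spark 3.x:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create an empty DataFrame with no columns
# (an empty StructType stands in for the schema)
empty_df = spark.createDataFrame([], StructType([]))

Example: Creating an Empty DataFrame in Spark 3.x:
Let’s consider the same example as above and create an empty DataFrame named “empty_data” with columns “name” and “age” using Spark 3.x. Here the schema is written as a DDL-formatted string, which createDataFrame() accepts in place of a StructType:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create an empty DataFrame; the schema is given as a DDL string
empty_data = spark.createDataFrame([], "name STRING, age INT")

# Display the empty DataFrame
empty_data.show()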
