Apache Spark is a widely used distributed computing platform that lets users process large datasets through high-level APIs. PySpark, Apache Spark’s Python interface, is one of the most frequently used of those APIs. In PySpark, a DataFrame is a distributed collection of data organized into named columns.
It is often necessary to rename a DataFrame’s columns: after a join operation, to standardize naming conventions, or simply to improve readability. In this blog post we will first create a sample DataFrame and then look at several methods for renaming its columns in PySpark.
Table of Contents:
- Creating a Sample DataFrame
- Changing column names using the withColumnRenamed() method
- Changing multiple column names using the select() method
- Changing column names using the toDF() method
- Conclusion
1. Creating a Sample DataFrame
Let’s start by creating a sample DataFrame with three columns: Id, Name, and Salary.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Create a SparkSession
spark = SparkSession.builder.appName("ChangeColumnNames").getOrCreate()

# Define the schema for the DataFrame
schema = StructType([
    StructField("Id", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Salary", DoubleType(), True)
])

# Create a sample DataFrame
data = [(1, "John Doe", 5000.00),
        (2, "Jane Smith", 6000.00),
        (3, "Bob Johnson", 5500.00),
        (4, "Alice Williams", 7000.00)]
df = spark.createDataFrame(data, schema)

# Show the DataFrame
df.show()
Output from running the above code:
+---+--------------+------+
| Id| Name|Salary|
+---+--------------+------+
| 1| John Doe|5000.0|
| 2| Jane Smith|6000.0|
| 3| Bob Johnson|5500.0|
| 4|Alice Williams|7000.0|
+---+--------------+------+
As you can see, the DataFrame has three columns: Id, Name, and Salary.
Changing Column Names:
Now that we have our sample dataframe, let’s look at how to change the column names.
2. Using withColumnRenamed()
The easiest way to change a column name in PySpark is the withColumnRenamed() method. It takes two arguments: the current column name and the new column name. Here’s an example of how to use it to rename the “Name” column to “Full Name”.
# Change the name of the "Name" column to "Full Name"
df = df.withColumnRenamed("Name", "Full Name")

# Show the DataFrame with the new column name
df.show()
Output:
+---+--------------+------+
| Id| Full Name|Salary|
+---+--------------+------+
| 1| John Doe|5000.0|
| 2| Jane Smith|6000.0|
| 3| Bob Johnson|5500.0|
| 4|Alice Williams|7000.0|
+---+--------------+------+
3. Changing Multiple Column Names using select():
The select() method allows us to select columns and change their names. Here’s an example of how to use it to change the names of the “Id” and “Salary” columns.
# Change the names of the "Id" and "Salary" columns
df = df.select(df["Id"].alias("Employee Id"), df["Salary"].alias("Annual Salary"))

# Show the DataFrame with the new column names
df.show()
Output:
+-----------+-------------+
|Employee Id|Annual Salary|
+-----------+-------------+
|          1|       5000.0|
|          2|       6000.0|
|          3|       5500.0|
|          4|       7000.0|
+-----------+-------------+
4. Using toDF()
The toDF() method returns a new DataFrame with the specified column names.
Here is an example of renaming columns in PySpark with the toDF() method:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# create SparkSession
spark = SparkSession.builder.appName('Change Column Names').getOrCreate()

# create a list of tuples
data = [(1, "John Doe", 5000.00),
        (2, "Jane Smith", 6000.00),
        (3, "Bob Johnson", 5500.00),
        (4, "Alice Williams", 7000.00)]

# create a PySpark DataFrame with schema
# (Salary must be DoubleType, since the data contains floating-point values)
schema = StructType([
    StructField("Id", IntegerType()),
    StructField("Name", StringType()),
    StructField("Salary", DoubleType())
])
df = spark.createDataFrame(data, schema)

# convert DataFrame to RDD and then back to DataFrame with new column names
new_df = df.rdd.toDF(["EmployeeId", "EmployeeName", "MonthlySalary"])

# show DataFrame with new column names
new_df.show()
Output:
+----------+--------------+-------------+
|EmployeeId|  EmployeeName|MonthlySalary|
+----------+--------------+-------------+
|         1|      John Doe|       5000.0|
|         2|    Jane Smith|       6000.0|
|         3|   Bob Johnson|       5500.0|
|         4|Alice Williams|       7000.0|
+----------+--------------+-------------+
Another way to use toDF() is to build an ordered list of the new column names and unpack it into the toDF() call:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# create SparkSession
spark = SparkSession.builder.appName('Change Column Names').getOrCreate()

# create a list of tuples
data = [(1, "John Doe", 5000.00),
        (2, "Jane Smith", 6000.00),
        (3, "Bob Johnson", 5500.00),
        (4, "Alice Williams", 7000.00)]

# create a PySpark DataFrame with schema
# (Salary must be DoubleType, since the data contains floating-point values)
schema = StructType([
    StructField("Id", IntegerType()),
    StructField("Name", StringType()),
    StructField("Salary", DoubleType())
])
df = spark.createDataFrame(data, schema)

# ordered list of new column names, one per existing column
data_new = ["EmployeeId", "EmployeeName", "MonthlySalary"]

# create a new DataFrame with the new column names
new_df = df.toDF(*data_new)

# show DataFrame with new column names
new_df.show()
Output:
+----------+--------------+-------------+
|EmployeeId|  EmployeeName|MonthlySalary|
+----------+--------------+-------------+
|         1|      John Doe|       5000.0|
|         2|    Jane Smith|       6000.0|
|         3|   Bob Johnson|       5500.0|
|         4|Alice Williams|       7000.0|
+----------+--------------+-------------+
5. Conclusion
In this blog post, we discussed how to change column names in a PySpark DataFrame using the withColumnRenamed(), select(), and toDF() methods. Renaming columns in a DataFrame is a common operation required during data preprocessing.