In Spark and PySpark, DataFrame show() truncates column content longer than 20 characters. Let's look at several ways to display the full column content of a Spark DataFrame.
Understanding the Issue: Truncated Column Content
By default, Spark DataFrames truncate the content of columns when displaying them. This truncation hampers our ability to gain complete insights from the data and may lead to misinterpretation. Fortunately, Spark provides several approaches to address this issue and display the full column content.
Adjusting Display Options
One way to tackle truncated column content is to adjust the display options of the Spark DataFrame. When eager evaluation is enabled (this applies in notebook environments such as Jupyter, where the DataFrame is rendered by simply evaluating it), Spark exposes two properties: spark.sql.repl.eagerEval.maxNumRows, which controls the maximum number of rows rendered, and spark.sql.repl.eagerEval.truncate, which sets the maximum number of characters shown per cell. Consider the following example:
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)   # eager evaluation must be on for the options below to take effect
spark.conf.set("spark.sql.repl.eagerEval.maxNumRows", 100) # maximum number of rows to render
spark.conf.set("spark.sql.repl.eagerEval.truncate", 0)     # max characters per cell; 0 disables truncation
Using the show() Function with Truncation Disabled
Another approach is to use the show() function with the truncate parameter set to False. This overrides the default truncation behavior and displays the full content of each column. Here’s an example:
df.show(truncate=False)
Expanding Column Width with a Custom Truncate Length
Instead of disabling truncation entirely, you can pass an integer to the truncate parameter of show(). The value sets the maximum number of characters displayed per column, letting you widen columns just enough to make their content fully visible while keeping the output compact. Consider the following code snippet:
df.show(truncate=50)
Writing the DataFrame to an Output Format
If you need to export the complete column content for further analysis or sharing, you can write the Spark DataFrame to an output format such as CSV, Parquet, or JSON. By saving the DataFrame in this manner, you preserve the complete content of each column. Here’s an example:
df.write.format("csv").option("header", True).mode("overwrite").save("output.csv")
Note that save() writes a directory (named "output.csv" here) containing one or more part files, not a single CSV file.
Converting DataFrame to Pandas and Displaying
Alternatively, you can convert the Spark DataFrame to a Pandas DataFrame using the toPandas() method. Pandas provides flexible display options, allowing you to see the complete content of each column. However, keep in mind that this approach requires sufficient memory to hold the entire DataFrame. Here’s an example:
pandas_df = df.toPandas()
print(pandas_df)
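Once the data is in pandas, its own display options control truncation. A minimal sketch using the standard pandas options (the DataFrame contents are illustrative):

```python
# Minimal sketch using pandas' standard display options.
# The DataFrame contents are illustrative.
import pandas as pd

pd.set_option("display.max_colwidth", None)  # never truncate cell contents
pd.set_option("display.width", None)         # let output use the full terminal width

pandas_df = pd.DataFrame(
    {"description": ["A string comfortably longer than twenty characters."]}
)
print(pandas_df)
```

With display.max_colwidth set to None, print() shows every cell in full instead of cutting long strings off with an ellipsis.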
Conclusion:
In this blog post, we explored several techniques to show the full column content in a Spark DataFrame. By adjusting display options, disabling truncation, expanding column width, writing to an output format, or converting to Pandas, you can overcome the limitations of truncated column content. These methods enable comprehensive data analysis and interpretation, ensuring you make the most of your Spark DataFrames.