
PySpark col should be Column Error

While coding transformations as part of the Data Engineering process, it is common practice to create new columns based on existing columns in a DataFrame. One of the most commonly used commands for this is withColumn:
     
df = df.select("col1", "col2", "col3").withColumn("col4", "col1")

In the above code, df is an existing DataFrame from which we select three columns (col1, col2, col3) while also creating a new column, col4, from the existing column col1.

On executing the command, we get the following error:
col should be Column

The reason we get this error is that .withColumn requires its second argument to be a Column object. df.select returns a DataFrame, and chaining .withColumn on it is fine; the problem is that we passed the plain string "col1" as the new column's value instead of a Column.
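
A quick way to see the difference, as a sketch (assuming df is the DataFrame from the example above):

print(type(df["col1"]))   # <class 'pyspark.sql.column.Column'> - what withColumn expects
print(type("col1"))       # <class 'str'> - a plain string, which triggers the error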

The updated code that resolves the error:

df = df.select("col1", "col2", "col3").withColumn("col4", df["col1"])
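
For reference, here is a minimal end-to-end sketch of the fix; the SparkSession setup and the sample data are illustrative, not part of the original snippet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("withColumn-demo").getOrCreate()

# Toy data standing in for the original df; the values are illustrative.
df = spark.createDataFrame([(1, "a", 10.0), (2, "b", 20.0)], ["col1", "col2", "col3"])

# Passing df["col1"] (a Column) as the second argument works as expected.
df = df.select("col1", "col2", "col3").withColumn("col4", df["col1"])
df.show()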

Comments

  1. Correct. The second parameter has to be a Column object, such as one returned by col().
    There are many variations for what you're presenting.
    df.withColumn('col4', df.col1)
    or
    from pyspark.sql.functions import col
    df.withColumn('col4', col('col1'))
    # import pyspark.sql.functions as f
    # df.withColumn('col4', f.col('col1'))

    col() is a function that takes a string naming a column of the current DataFrame and returns a Column object.

    I find it interesting that col() is a "function"

    It isn't a type (import pyspark.sql.types).
    DataFrame() is a type. Row() is a type. But col() isn't?
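
    A quick check of this in a PySpark shell, as a sketch: col() is indeed a plain function, and what it returns is an instance of the Column type.

    from pyspark.sql import Column
    from pyspark.sql.functions import col

    c = col("col1")
    print(type(c))                # <class 'pyspark.sql.column.Column'>
    print(isinstance(c, Column))  # True: col() is a factory function that returns a Column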

  2. Absolutely, there are many variations of the code. Wanted to keep it simple.

