PySpark col should be Column Error
While coding transformations as part of the Data Engineering process, it is common practice to create new columns based on existing columns in a DataFrame. One of the most commonly used commands is withColumn:
df = df.select("col1","col2","col3").withColumn("col4", "col1")
In the above code, df is an existing DataFrame from which we are selecting three columns (col1, col2, col3) while at the same time creating a new column col4 from the existing column col1.
On executing the command, we get the following error:
col should be Column
The reason we get this error is that .withColumn requires its second argument to be a Column object; df.select returns a DataFrame, and we then try to add the new column by passing the plain string "col1" instead of a Column.
The updated code to resolve the error:
df = df.select("col1","col2","col3").withColumn("col4", df['col1'])
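As a quick end-to-end check, here is a minimal sketch on a toy DataFrame (the sample rows are made up for illustration; only the col1..col4 names come from the post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy DataFrame with the column names used above (sample values are made up)
df = spark.createDataFrame([(1, "a", True), (2, "b", False)], ["col1", "col2", "col3"])

# df.select("col1", "col2", "col3").withColumn("col4", "col1")   # fails: col should be Column
df = df.select("col1", "col2", "col3").withColumn("col4", df["col1"])  # works: df["col1"] is a Column
df.show()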
Correct. The second parameter has to be a Column object, such as the one returned by col().
There are many variations for what you're presenting.
df.withColumn('col4', df.col1)
or
from pyspark.sql.functions import col
df.withColumn('col4', col('col1'))
#import pyspark.sql.functions as f
#df.withColumn('col4', f.col('col1'))
col() is a function that takes a string naming a column of the current DataFrame and returns the corresponding Column expression.
I find it interesting that col() is a "function"
It isn't a type (import pyspark.sql.types).
DataFrame() is a type. Row() is a type. But col() isn't?
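For what it's worth, the distinction seems to be that col() is a factory function living in pyspark.sql.functions, while Column is the type of the object it returns. A small sketch to check this (assuming a local PySpark installation):

from pyspark.sql import SparkSession, Column
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()  # col() needs an active Spark session behind it

c = col("col1")
print(type(c))                # <class 'pyspark.sql.column.Column'>
print(isinstance(c, Column))  # True: col() is the function, Column is the type it returns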
Absolutely, there are many variations of the code. Wanted to keep it simple.