PySpark col should be Column Error

While coding transformations as part of the data engineering process, it is common practice to create new columns based on existing columns in a data frame. One of the most commonly used commands is withColumn:

    df = df.select("col1", "col2", "col3").withColumn("col4", "col1")

In the code above, df is an existing DataFrame from which we select three columns (col1, col2, col3) while at the same time creating a new column col4 from the existing column col1. On executing the command we get the following error:

    col should be Column

The reason we get this error is that .withColumn requires its second argument to be a Column object, but here we passed the string "col1" instead of a reference to the column itself. The updated code to resolve the error:

    df = df.select("col1", "col2", "col3").withColumn("col4", df["col1"])
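Another common way to get a Column object is the col() helper from pyspark.sql.functions. The short sketch below illustrates the same fix with a small, made-up DataFrame; the session name and sample data are assumptions for the example, not taken from the original post:

    # Minimal sketch: reproduce and fix the "col should be Column" error
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("withColumn-example").getOrCreate()

    # Hypothetical sample data with the column names used in the post
    df = spark.createDataFrame(
        [(1, "a", 10), (2, "b", 20)],
        ["col1", "col2", "col3"],
    )

    # Passing the string "col1" as the second argument would raise
    # "col should be Column"; col("col1") returns a Column object,
    # which is what withColumn expects.
    df = df.select("col1", "col2", "col3").withColumn("col4", col("col1"))

    df.show()

Both df["col1"] and col("col1") evaluate to a Column, so either form resolves the error; col() is convenient when the DataFrame variable is not in scope, for example inside a reusable function.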