
PySpark col should be Column Error

While coding transformations as part of the Data Engineering process, it is common practice to create new columns based on existing columns in a DataFrame. One of the most commonly used commands for this is withColumn:
     
df = df.select("col1", "col2", "col3").withColumn("col4", "col1")

In the above code, df is an existing DataFrame from which we select three columns (col1, col2, col3) while also creating a new column, col4, from the existing column col1.

On executing the command, we get the following error:
col should be Column

The reason we get this error is that .withColumn requires its second argument to be a Column object. df.select returns a DataFrame, and chaining .withColumn on it is fine; the problem is that we passed the plain string "col1" as the new column's value instead of a Column.
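
A quick way to see the difference, as a sketch (assuming df is the DataFrame from the example above):

print(type(df["col1"]))   # <class 'pyspark.sql.column.Column'> - what withColumn expects
print(type("col1"))       # <class 'str'> - a plain string, which triggers the error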

The updated code that resolves the error:

df = df.select("col1", "col2", "col3").withColumn("col4", df["col1"])
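
For reference, here is a minimal end-to-end sketch of the fix; the SparkSession setup and the sample data are illustrative, not part of the original snippet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("withColumn-demo").getOrCreate()

# Toy data standing in for the original df; the values are illustrative.
df = spark.createDataFrame([(1, "a", 10.0), (2, "b", 20.0)], ["col1", "col2", "col3"])

# Passing df["col1"] (a Column) as the second argument works as expected.
df = df.select("col1", "col2", "col3").withColumn("col4", df["col1"])
df.show()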

Comments

  1. Correct. The second parameter has to be a Column object, such as one returned by col().
    There are many variations for what you're presenting.
    df.withColumn('col4', df.col1)
    or
    from pyspark.sql.functions import col
    df.withColumn('col4', col('col1'))
    # import pyspark.sql.functions as f
    # df.withColumn('col4', f.col('col1'))

    col() is a function that takes a string naming a column of the current DataFrame and returns a Column object.

    I find it interesting that col() is a "function"

    It isn't a type (import pyspark.sql.types).
    DataFrame() is a type. Row() is a type. But col() isn't?
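
    A quick check of this in a PySpark shell, as a sketch: col() is indeed a plain function, and what it returns is an instance of the Column type.

    from pyspark.sql import Column
    from pyspark.sql.functions import col

    c = col("col1")
    print(type(c))                # <class 'pyspark.sql.column.Column'>
    print(isinstance(c, Column))  # True: col() is a factory function that returns a Column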

  2. Absolutely, there are many variations of the code. Wanted to keep it simple.

