Can I change the datatype of the Spark dataframe columns that are being loaded into SQL Data Warehouse as a table?

    Question

  • I am trying to read a Parquet file from Azure Data Lake using the following PySpark code.

    df = sqlContext.read.format("parquet") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("adl://xyz/abc.parquet")
    df = df['Id','IsDeleted']

    Now I would like to load this dataframe df as a table in SQL Data Warehouse using the following code:

    df.write \
      .format("com.databricks.spark.sqldw") \
      .mode('overwrite') \
      .option("url", sqlDwUrlSmall) \
      .option("forward_spark_azure_storage_credentials", "true") \
      .option("dbtable", "test111") \
      .option("tempdir", tempDir) \
      .save()

    This creates a table dbo.test111 in SQL Data Warehouse with the following data types:

    • Id(nvarchar(256),null)
    • IsDeleted(bit,null)

    But I need these columns to have different data types, say char(255) or varchar(128), in SQL Data Warehouse. How do I do this while loading the dataframe into SQL Data Warehouse?

    Tuesday, January 15, 2019 6:11 PM

All replies

  • Hi,

    You can achieve this in PySpark using the cast method with a DataType instance. After casting the columns, you can write the dataframe to the table in SQL Data Warehouse.

    There's a similar thread where you can read about casting:

    https://stackoverflow.com/questions/32284620/how-to-change-a-dataframe-column-from-string-type-to-double-type-in-pyspark
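
    For example, a minimal sketch along these lines (assuming the df from your question with the Id and IsDeleted columns; the target Spark types are just placeholders):

    from pyspark.sql.functions import col
    from pyspark.sql.types import StringType, BooleanType

    # Cast each column to the Spark SQL type you need before writing
    df_casted = df \
        .withColumn("Id", col("Id").cast(StringType())) \
        .withColumn("IsDeleted", col("IsDeleted").cast(BooleanType()))

    df_casted.printSchema()  # verify the Spark-side schema before writing to SQL DW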

    Let us know if this helps. Otherwise, we can gladly continue to investigate further.


    MSDN

    Thursday, January 17, 2019 8:19 AM
    Moderator
  • Hello Chirag,

    If I use casting in PySpark, it only changes the data types in the dataframe to types supported by Spark SQL, and Spark SQL does not have varchar, nvarchar, etc. But I need the data types to be converted while copying this dataframe to SQL DW. When I copy this dataframe to SQL DW, the data types in the dataframe are automatically converted into the SQL DW default data types. I want to override this behaviour and specify my own data types instead of the SQL DW defaults.
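
    To illustrate what I mean (a quick sketch): no matter what I cast to, the Spark schema only ever reports Spark SQL types such as string or boolean, never varchar/nvarchar:

    from pyspark.sql.functions import col

    df.withColumn("Id", col("Id").cast("string")).printSchema()
    # output is roughly:
    # root
    #  |-- Id: string (nullable = true)         <- still just "string", no length, no varchar
    #  |-- IsDeleted: boolean (nullable = true)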

    When I used the code mentioned in the question, this is what I can see in my SQL DW (the default data types listed above).

    When I used the above code with .option("createTableColumnTypes") added, i.e.

     .format("com.databricks.spark.sqldw") \
     .option("createTableColumnTypes", "Id varchar(64)") \

    it still takes the default column data types (just like in the case above) instead of Id varchar(64).


    However, I was able to change the datatype of the 'Id' column when I changed the format to "jdbc", i.e.

        .format("jdbc") \
        .option("createTableColumnTypes", "Id varchar(64)") \

    Attached herewith is the output.
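
    For reference, the full jdbc write being described would look roughly like this (a sketch only, assuming sqlDwUrlSmall already carries the credentials and test111 is the same target table as above):

    df.write \
      .format("jdbc") \
      .mode("overwrite") \
      .option("url", sqlDwUrlSmall) \
      .option("dbtable", "test111") \
      .option("createTableColumnTypes", "Id varchar(64)") \
      .save()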

    But the issue here is that I need a column with the nvarchar data type in SQL DW, not varchar, which cannot be done this way. It looks like only a few data types are supported when .format("jdbc") is used. When I tried to use nvarchar I got this error: ''\nDataType nvarchar is not supported.(line 1, pos 3)\n\n== SQL ==\nId nvarchar\n---^^^\n''


    Moreover, when I used .format("jdbc") without the createTableColumnTypes option, it throws the error 'com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'Id' has a data type that cannot participate in a columnstore index.'

    Hope I am clear.

    Regards,

    Sri


    Thursday, January 17, 2019 6:06 PM
  • Hi Sri,

    Sorry for the delayed response. Have you tried applying the cast method with a DataType on the column? That's also one way to do it. There are a couple of approaches discussed in this thread:

    https://stackoverflow.com/questions/29383107/how-to-change-column-types-in-spark-sqls-dataframe

    Have a look at it and let us know if it helps.
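
    For instance, a small sketch of the select/cast pattern from that thread, applied to the columns in your question (the target Spark types here are only placeholders):

    from pyspark.sql.functions import col

    target_types = {"Id": "string", "IsDeleted": "boolean"}   # placeholder mapping
    df_casted = df.select([col(c).cast(t).alias(c) for c, t in target_types.items()])
    df_casted.printSchema()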


    MSDN

    Thursday, January 24, 2019 7:39 AM
    Moderator