Why is an empty file with the name of a folder created inside an Azure Blob storage container?

  • Question

  • Hi All,

    I am running a Hive QL script through an HDInsight on-demand cluster which does the following:

    1) Spool the data from a Hive view

    2) Create a folder named abcd inside a Blob storage container named XYZ

    3) Store the view data in a file inside the abcd folder

    However, when the Hive QL is run, an empty file named abcd is created outside the abcd folder.

    Any idea why this is happening and how we can stop it from happening? Please suggest.

    Thanks,

    Surya


    Thursday, October 18, 2018 11:04 AM

All replies

  • Hi Surya,

    Can you please provide copies of the client code that is connecting to and accessing this storage location? We are confident that something in that client code is causing the folder anomaly.

    Regards,

    Mike

    Tuesday, October 23, 2018 9:07 PM
    Moderator
  • Hi Mike,

    I'm also facing the exact same problem when we try to write Parquet-format data to Azure Blob storage using the Apache API org.apache.parquet.avro.AvroParquetWriter.

    Here is the sample code that we are using.

    org.apache.hadoop.fs.Path outputPathFileAzure = new org.apache.hadoop.fs.Path(
        "wasbs://" + getAzureContainerName() + "@" + getAzureAccountName()
        + ".blob.core.windows.net" + "//" + currentFileName.toString());
    parquetStreamWriter = AvroParquetWriter.builder(outputPathFileAzure)
        .withSchema(schema).withCompressionCodec(compression_codec_name)
        .withConf(parquetConfiguration).withPageSize(pageSize)
        .withRowGroupSize(pageSize).build();

    Tuesday, May 21, 2019 10:38 AM
  • Hi Sandeep,

    Your issue is slightly different. Please see: Can I append Avro serialized data to an existing Azure blob?

    I think you need Microsoft.Hadoop.Avro, but I'm not totally clear on your use case; the Stack Overflow post has some good information, including code samples.

    As for Surya's issue, I think it could be resolved by simply attempting to access the container via https://, converting the wasbs:// URL to its HTTPS form. By going through this exercise you will likely sort out the variables so they reference the file correctly. I think this is the issue.

    The following: wasbs://2013@nytaxiblob.blob.core.windows.net/

    Could be expressed as: https://nytaxiblob.blob.core.windows.net/2013/ and used to test read access, etc., depending on specific permissions and container hierarchy. 
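
    Purely as an illustration of that mapping (a rough sketch, not specific to Hive or the Parquet writer; the URL below is just the example above), the translation from wasbs:// to https:// can be done mechanically:

    // Convert wasbs://<container>@<account>.blob.core.windows.net/<path>
    // to      https://<account>.blob.core.windows.net/<container>/<path>
    def wasbsToHttps(wasbs: String): String = {
      val uri       = new java.net.URI(wasbs)
      val container = uri.getUserInfo               // the part before '@'
      val account   = uri.getHost                   // <account>.blob.core.windows.net
      val path      = uri.getPath.stripPrefix("/")
      s"https://$account/$container/$path"
    }

    // wasbs://2013@nytaxiblob.blob.core.windows.net/  ->  https://nytaxiblob.blob.core.windows.net/2013/
    println(wasbsToHttps("wasbs://2013@nytaxiblob.blob.core.windows.net/"))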

    I hope this helps.

    Tuesday, May 21, 2019 4:35 PM
    Moderator
    We are getting the same problem when we try to save the file using hadoop-azure.jar inside Spark:


     df.coalesce(1).write.option("header", "true").mode("overwrite").format("parquet").save("wasbs://test@myacccount.blob.core.windows.net/myfolder")

    and an empty file named myfolder is created along with myfolder/part-uuid-c00001.parquet.
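
    For what it's worth, one way to confirm that the extra myfolder object really is just a zero-byte blob is to list the container flat with the Azure Storage SDK instead of going through the wasbs driver. A rough sketch (assumes the legacy azure-storage Java SDK; the account name, container name and key below are placeholders taken from the example above):

    import com.microsoft.azure.storage.CloudStorageAccount
    import com.microsoft.azure.storage.blob.CloudBlockBlob
    import scala.collection.JavaConverters._

    // Placeholder connection string for the storage account
    val conn = "DefaultEndpointsProtocol=https;AccountName=myacccount;AccountKey=<key>"
    val container = CloudStorageAccount.parse(conn)
      .createCloudBlobClient()
      .getContainerReference("test")

    // A flat listing shows every blob under the prefix, including the
    // zero-byte "myfolder" blob sitting next to myfolder/part-*.parquet
    container.listBlobs("myfolder", true).asScala.foreach {
      case b: CloudBlockBlob => println(s"${b.getName} (${b.getProperties.getLength} bytes)")
      case other             => println(other.getUri)
    }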

    Friday, September 27, 2019 11:14 PM
    Hello Rui,

    We were just checking to see if the issue is resolved; if not, we will dig further on our side.


    Thanks, Himanshu

    Wednesday, October 2, 2019 12:58 AM
    We are also experiencing the same problem using the Spark DataFrame API:

    df.repartition($"date").write.mode("overwrite").partitionBy("date")
      .csv("wasbs://containername@storagename.blob.core.windows.net/path/to/create")

    The problem is that it is not possible to download the root folder, because the OS cannot have a file and a directory with the same name in the same file system location.

    As you can see, we are also partitioning the output, which results in one folder for each value of the partition column. This means it's impossible to manually delete all the empty files.
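
    If those extra objects turn out to be plain zero-byte blobs whose names collide with the partition directories, a scripted cleanup may be more realistic than a manual one. The following is only a sketch, assuming the legacy azure-storage Java SDK and that removing the marker blobs is acceptable in your setup (the connection string and paths are placeholders based on your example):

    import com.microsoft.azure.storage.CloudStorageAccount
    import com.microsoft.azure.storage.blob.CloudBlockBlob
    import scala.collection.JavaConverters._

    val conn = "DefaultEndpointsProtocol=https;AccountName=storagename;AccountKey=<key>"
    val container = CloudStorageAccount.parse(conn)
      .createCloudBlobClient()
      .getContainerReference("containername")

    // Collect every blob under the output path
    val blobs = container.listBlobs("path/to/create", true).asScala
      .collect { case b: CloudBlockBlob => b }
      .toList

    // A "marker" is a zero-byte blob whose name also prefixes other blobs,
    // i.e. it shadows a directory such as date=2019-10-30/
    val markers = blobs.filter { b =>
      b.getProperties.getLength == 0 &&
        blobs.exists(o => o.getName.startsWith(b.getName + "/"))
    }

    markers.foreach { b =>
      println(s"deleting marker blob ${b.getName}")
      b.deleteIfExists()
    }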

    Wednesday, October 30, 2019 10:14 AM
  • sdecri, does your storage account have hierarchical namespace enabled? Also, when you set up your cluster, did you choose blob storage or Data Lake Gen2?
    Thursday, October 31, 2019 9:51 PM