Azure Storage won't accept a file sent by distCp

  • Question

  • Hello, I'm uploading files from Cloudera HDFS to Azure Blob Storage with distCp. Several files are not being uploaded, so the distCp map task fails with a timeout as a result. I captured the traffic with tcpdump to see what was happening. DistCp sends a file in parts, and when it starts to send the content of one of those problem files it receives no response from the Azure Storage service.
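
    (For reference, the same upload path can be exercised without a full distCp job. Here is a minimal sketch using the Hadoop FileSystem API with the wasb driver from hadoop-azure; the account, container, key, and paths are placeholders, not my real setup:)

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class WasbCopy {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // hadoop-azure reads the storage key from this property (placeholder values).
            conf.set("fs.azure.account.key.myaccount.blob.core.windows.net", "<storage-key>");
            FileSystem src = FileSystem.get(new URI("hdfs:///"), conf);
            FileSystem dst = FileSystem.get(
                    new URI("wasb://mycontainer@myaccount.blob.core.windows.net/"), conf);
            // Copy a single file from HDFS to the blob container; distCp goes through
            // the same wasb upload path, just parallelized across map tasks.
            FileUtil.copy(src, new Path("/user/cloudera/OperationContext.java"),
                    dst, new Path("/backup/OperationContext.java"), false, conf);
        }
    }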

    It doesn't happen to every file. As a test I took several files from the Microsoft Azure SDK, and here are the files that failed to upload:

    microsoft-windowsazure-storage-sdk-0.6.0-sources/com/microsoft/windowsazure/storage/OperationContext.java
    microsoft-windowsazure-storage-sdk-0.6.0-sources/com/microsoft/windowsazure/storage/blob/ContainerRequest.java

    Here is an example from the tcpdump capture of the HTTP requests:

    06:13:44.597446 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [P.], seq 9996:10041, ack 7799, win 270, options [nop,nop,TS val 522135798 ecr 19573703], length 45
    06:13:44.639166 IP blob.am5prdstr05a.store.core.windows.net.http > quickstart.cloudera.40925: Flags [P.], seq 7799:8140, ack 10041, win 513, options [nop,nop,TS val 19573711 ecr 522135797], length 341
    06:13:44.696083 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [P.], seq 10041:10714, ack 8140, win 279, options [nop,nop,TS val 522135896 ecr 19573711], length 673
    06:13:44.696348 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [.], seq 10714:13450, ack 8140, win 279, options [nop,nop,TS val 522135896 ecr 19573711], length 2736
    06:13:44.696578 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [.], seq 13450:16186, ack 8140, win 279, options [nop,nop,TS val 522135896 ecr 19573711], length 2736
    06:13:44.696694 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [.], seq 16186:18922, ack 8140, win 279, options [nop,nop,TS val 522135897 ecr 19573711], length 2736
    06:13:44.696813 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [.], seq 18922:21658, ack 8140, win 279, options [nop,nop,TS val 522135897 ecr 19573711], length 2736
    06:13:44.696898 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [.], seq 21658:23026, ack 8140, win 279, options [nop,nop,TS val 522135897 ecr 19573711], length 1368
    06:13:44.733145 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [P.], seq 23026:25762, ack 8140, win 279, options [nop,nop,TS val 522135933 ecr 19573720], length 2736
    06:13:44.733631 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [.], seq 25762:28498, ack 8140, win 279, options [nop,nop,TS val 522135934 ecr 19573720], length 2736
    06:13:44.733757 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [P.], seq 28498:28695, ack 8140, win 279, options [nop,nop,TS val 522135934 ecr 19573720], length 197
    06:13:45.008607 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [P.], seq 28498:28695, ack 8140, win 279, options [nop,nop,TS val 522136209 ecr 19573724], length 197
    06:13:45.484960 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [P.], seq 28498:28695, ack 8140, win 279, options [nop,nop,TS val 522136685 ecr 19573724], length 197
    06:13:46.436757 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [P.], seq 28498:28695, ack 8140, win 279, options [nop,nop,TS val 522137637 ecr 19573724], length 197
    06:13:48.341386 IP quickstart.cloudera.40925 > blob.am5prdstr05a.store.core.windows.net.http: Flags [P.], seq 28498:28695, ack 8140, win 279, options [nop,nop,TS val 522139541 ecr 19573724], length 197

    This is the log from the moment when the content of the file part is being sent to Azure Storage. Note that the last segment (seq 28498:28695) keeps being retransmitted with no reply.

    Then the job tries another port, and then another, but no reply is ever received.

    At the same time, the destination folder exists in the container and the distCp temp file exists but has zero size, because Azure Storage fails to accept this content and never answers.

    I can upload those files with Azure Storage Explorer, but uploading with distCp fails with a timeout. The file name and extension don't matter; the problem is with the content. I tried splitting a file into parts and sending them with distCp, and that was successful, as sketched below.
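
    (The split itself was nothing clever. A minimal JDK sketch of what I did, reading the input path from the command line and writing two halves, to bisect which part makes the upload hang:)

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;

    public class SplitFile {
        public static void main(String[] args) throws Exception {
            // Split args[0] into two halves; upload each half separately with distCp.
            byte[] data = Files.readAllBytes(Paths.get(args[0]));
            int mid = data.length / 2;
            Files.write(Paths.get(args[0] + ".part1"), Arrays.copyOfRange(data, 0, mid));
            Files.write(Paths.get(args[0] + ".part2"), Arrays.copyOfRange(data, mid, data.length));
        }
    }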

    I see no reason for this issue. Are there any limits related to blob file content?

    Wednesday, December 14, 2016 3:19 PM

All replies

  • Hi,

    Thank you for posting here!

    Please refer to the blog post to learn more about the maximum size of blocks and the limits.

    Also, check the article Understanding Block Blobs, Append Blobs, and Page Blobs.

    When you say it works for some files and not for others, are those files the same type or different types?

    You may also want to check the article Backup Cloudera data to Azure Storage, in case you haven't seen it earlier, for transferring data to Azure Storage.

    Hope this helps.

    Regards,

    Ashok

    Thursday, December 15, 2016 3:46 AM
  • Hello, thank you for your answer. I'll check this out for sure. 

    You see, these files are the same type. I was testing the upload to the storage with Microsoft SDK files, so those were Java files. But I'm not sure the type matters, because renaming, for example, ContainerRequest.java to test.txt gave me the same result. These files are 18-20 KB, so they don't exceed the maximum limit, and distCp passes them part by part.

    Backing up Cloudera data to Azure Storage is exactly what I'm doing. The problem is that I'm not sure what kind of content I'll need to back up in the future, and this issue with uploading blobs bothers me. And I may not be able to just archive everything, because distCp copies files while preserving access rights, and I'd like to keep using this feature.

    Also, I found another problem with the Azure SDK: it was sending the x-ms-version header 2013-08-15, which is not accepted by the storage, while the documentation says this version is supported.

    The problem with those "special" files is this: they can't be passed by distCp to the storage, alone or together with other files. If I send 100 of them, the mapper gets to such a problem file and gets stuck. Then it retries several times and the whole job fails. 20 KB doesn't even exceed the limit for a single block of a blob, so I see no reason why this would happen.

    If I pass such a file alone with distCp, it gets stuck anyway. If I rename the file, it gets stuck. The Azure Storage service doesn't reject the file; it just doesn't reply. Other files of similar size upload fine. These are block blobs, and the files are not even large.

    Let me show you another example:

    -rw-r--r-- 1 cloudera cloudera 19402 Dec 13 09:17 test0.java
    -rw-r--r-- 1 cloudera cloudera 19401 Dec 15 04:37 test1.java
    -rw-r--r-- 1 cloudera cloudera 19402 Dec 15 04:44 test3-length.java

    File test1 is a copy of ContainerRequest.java with the comment on line 442 removed but the * left in place. File test0 is test1 with one character added after the * on line 442, and file test3 is a copy of test1 with one character appended to the end of the file.

    So it looks like this (from line 442):

    test0 (won't pass):

         * P
         */
        private ContainerRequest() {
            // No op
        }
    }

    test1 (passed):

         * 
         */
        private ContainerRequest() {
            // No op
        }
    }

    test3 (passed):

         * 
         */
        private ContainerRequest() {
            // No op
        }
    }
    P

    So file test0 got stuck, while files test1 and test3 were uploaded to the storage. Files test0 and test3 have the same length; the only difference is the location of the letter "P".

    Do you understand my problem? 

    I see no pattern here: a single character, with the length unchanged, can determine whether the file uploads or gets stuck.

    Regards, Olena

    Thursday, December 15, 2016 12:58 PM
  • Hi,

    Thank you for the additional details. We are checking on this query with the respective teams.

    Appreciate your time and patience in this matter.

    Regards,

    Ashok

    Saturday, December 17, 2016 7:23 AM
  • Hi, thank you. I'll be waiting for an update on this issue.

    Regards, Olena

    Monday, December 19, 2016 8:53 AM
  • Hi,

    I see that this issue needs a deeper technical dive. I recommend that you create a technical support ticket. Refer to this link to create one: https://azure.microsoft.com/en-us/support/options/

    The ticket enables you to work closely with the support engineers and get a quick resolution for your issue.

    Apologies for the inconvenience caused, and we appreciate your patience in this matter.

    Regards,

    Ashok

    Wednesday, December 21, 2016 10:46 AM
  • Hi, 

    Thank you for your answer; I will take a look at the link you provided. However, I have one more question.

    Can you please explain why Azure Blob Storage rejects the HTTP header 'x-ms-version' = '2013-08-15' sent by Azure SDK 0.6.0? Is version 2013-08-15 unsupported? According to this link, https://azure.microsoft.com/en-us/blog/azure-storage-service-version-update-2016/, it should be supported. But any request sent by Microsoft Azure SDK 0.6.0 returns an error saying that an HTTP header is not in the correct format, and the body of the response identifies exactly this header as the one being rejected. Everything works fine with 2014-02-14. Why does this happen?

    Regards, Olena

    Friday, December 23, 2016 2:01 PM
  • Regarding the storage service version, you must use at least version 2014-02-14 if you are using a blob storage account or a premium storage account. For a general purpose storage account, previous versions are supported.

    I recommend that you upgrade to the latest version of the Java storage client to get all of the latest features, bug fixes, and perf improvements, including support for blob storage and premium storage accounts.

    Thanks,
    Michael

    • Proposed as answer by vikranth s Saturday, December 31, 2016 12:17 PM
    Friday, December 30, 2016 5:10 PM
  • Unfortunately, CDH 5.8.x+ comes with hadoop-azure 2.6.0 and Microsoft Azure Storage SDK 0.6.0, which uses the older service version. Starting from 1.0.0 the SDK uses 2014-02-14, but it comes with Hadoop 2.7.0, which is not part of Cloudera yet. Isn't that inconvenient? To update this one module I'd have to update the whole CDH because of the dependencies, but even the newest CDH doesn't contain an Azure SDK newer than 0.6.0. Do you see what I mean?

    CDH has support for Microsoft Azure Storage that won't work with blob storage accounts, because it contains an unsupported version. And you won't get a response that says "this version is unsupported"; you'll get a strange "one of the HTTP headers is not in the correct format" error that explains nothing. Even if you make a dump you'll see nothing, because the Azure SDK sends an HTTP HEAD request and the explanation of the error comes in the body of the response, but you won't get a body since this is not a GET request.
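
    (To actually read the explanation, one can replay the request as a plain GET. A minimal JDK sketch, where the account and container names are placeholders and I'm assuming the version check is applied before authentication:)

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class VersionProbe {
        public static void main(String[] args) throws Exception {
            // Placeholder endpoint; any request against a blob storage account will do.
            URL url = new URL("https://myaccount.blob.core.windows.net/mycontainer"
                    + "?restype=container&comp=list");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET"); // GET, not HEAD, so the error body comes back
            conn.setRequestProperty("x-ms-version", "2013-08-15");
            int code = conn.getResponseCode();
            System.out.println("HTTP " + code);
            // For 4xx responses the explanation is on the error stream, not the input stream.
            InputStream body = (code >= 400) ? conn.getErrorStream() : conn.getInputStream();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(body))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }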

    Regards, Olena

    Tuesday, January 3, 2017 1:25 PM
  • Hi Olena, if upgrading to a GA version is not an option then please use a general purpose storage account instead.

    Thanks,
    Michael

    Tuesday, January 3, 2017 3:06 PM
  • Our client wanted to use blob storage, so it's not like I had a choice.

    I fixed this issue by recompiling your SDK (thank God it's open source) with the next version, and it worked. Only 2014-02-14 works; later versions don't work with Azure SDK 0.6.0. No official documentation warned about these issues with the storage account type and versions. It means we can buy a storage account but won't be able to use it, which is not nice. Maybe it's not a "pretty" solution, but not supporting the version used in every available CDH release that supports Azure is not pretty either. The change is sketched below.
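
    (For anyone hitting the same wall: the patch was essentially one constant changed before rebuilding the 0.6.0 jar. From memory it lives in com.microsoft.windowsazure.storage.Constants; verify the exact class and field name against the 0.6.0 sources before patching:)

    // Inside com.microsoft.windowsazure.storage.Constants (SDK 0.6.0), roughly:
    public static class HeaderConstants {
        // was "2013-08-15", which blob storage accounts reject;
        // 2014-02-14 is the newest version that SDK 0.6.0 can actually speak.
        public static final String TARGET_STORAGE_VERSION = "2014-02-14";
    }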


    Tuesday, January 3, 2017 3:42 PM
  • I'm glad you were able to resolve it. You can find the blob storage account documentation here:
    https://docs.microsoft.com/en-us/azure/storage/storage-blob-storage-tiers#blob-storage-accounts

    Tuesday, January 3, 2017 7:50 PM
  • Dear Michael, if you know that Azure Blob Storage works only with version 2014-02-14 and later, which means using Microsoft Azure SDK version 1.0.0+, shouldn't Microsoft say so when someone purchases a blob storage account? Shouldn't Cloudera's documentation say that, while they include the Microsoft Azure SDK in CDH, half of the storage services won't be available? It says nothing about which storage account types we can or cannot use. I would understand if it were some third-party client library that cannot work with your storage, but this is your SDK. What are your policies regarding storage accounts? What do you do if a storage account was purchased but the client is unable to use it? Isn't it easier to upgrade version 0.6.0 so that it sends headers with 2014-02-14, instead of saying not to use your blob storage? Looking forward to your reply.

    Regards, Olena

    Tuesday, January 3, 2017 10:27 PM