Get a list of files and file sizes

    Question

  • I have a simple query that needs to run for validation after a Data Factory copy. I would like to get a list of the files and their file lengths from one directory.

    This errors out on a particular file. Any suggestions would be greatly appreciated.

    DECLARE @in string = "/incoming/p20180316/full/{schema}.{filename}.text";
    DECLARE @out string = "/incoming//reports/20180316/DataTableList.text";
    @rowData =
        EXTRACT
            [datastring] string,
            schema string,
            filename string
        FROM @in
        USING Extractors.Text(silent: true, skipFirstNRows: 1, quoting: false, delimiter: '~');
    @outputRows =
        SELECT
            schema,
            filename,
            filename.Length AS FileLength 
        FROM @rowData;
    OUTPUT @outputRows
    TO @out
    USING Outputters.Text();   

    Friday, March 16, 2018 5:20 PM

All replies

  • Forgot to add the error. 

    {
      "errorSource": "User",
      "errorId": "VertexFailedFast",
      "errorFields": [
        {
          "fieldName": "SEVERITY",
          "fieldValue": "Error"
        },
        {
          "fieldName": "DESCRIPTION",
          "fieldValue": "Vertex failure triggered quick job abort. Vertex failed: SV244_Extract[0][96] with error: Vertex user code error."
        },
        {
          "fieldName": "MESSAGE",
          "fieldValue": "Vertex failed with a fail-fast error"
        },
        {
          "fieldName": "DETAILS",
          "fieldValue": "\nVertex SV244_Extract[0][96].v1 {4E490B8E-0606-4235-B288-638825648DC5} failed \n\nError:\nVertex user code error\n\nexitcode=CsExitCode_StillActive Errorsnippet=An error occurred while processing adl://zelisadls01.azuredatalakestore.net/incoming/payplus-sql10/ppsbackend/data/20180314/full/pps.MessageQueue.text\n\n"
        }
      ],
      "innerError": {
        "errorSource": "User",
        "errorId": "E_RUNTIME_USER_STREAM_SPLIT_UNEXPECTED_EOF",
        "errorFields": [
          {
            "fieldName": "SEVERITY",
            "fieldValue": "Error"
          },
          {
            "fieldName": "COMPONENT",
            "fieldValue": "RUNTIME"
          },
          {
            "fieldName": "DESCRIPTION",
            "fieldValue": "An EOF was reached at an invalid point for the given schema.\nByte count consumed: '1077936132'\nLine count consumed: '16000711'"
          },
          {
            "fieldName": "MESSAGE",
            "fieldValue": "Reading stream split but reached EOF unexpectedly."
          },
          {
            "fieldName": "DETAILS",
            "fieldValue": "Row Delimiter: 0x0\nColumn Delimiter: 0x7E\nHEX: 6F 76 69 64 65 72 50 61 79 6D 65 6E 74 45 78 70 6F 72 74 2E 4C 69 62 72 61 72 79 2E 50 72 6F 63 65 73 73 6F 72 20 4D 65 74 68 6F 64 20 50 72 6F 63 65 73 73 53 63 68 65 64 75 6C 65 ###\nTEXT: oviderPaymentExport.Library.Processor Method ProcessSchedule ###\n"
          },
          {
            "fieldName": "DIAGNOSTICCODE",
            "fieldValue": "195887167"
          },
          {
            "fieldName": "RESOLUTION",
            "fieldValue": "Make sure each row in the file is complete."
          }
        ],
        "innerError": null
      }
    }
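
    The RESOLUTION field points at incomplete rows. A quick way to sanity-check a downloaded copy of the failing file is a script along these lines (a minimal sketch; the local path is a placeholder for a copy of pps.MessageQueue.text, and the NUL-byte check follows from the "Row Delimiter: 0x0" line in the details):

        # Scan a local copy of the file for the problems the error describes:
        # NUL bytes inside rows and a final row with no terminating newline.
        def check_rows(path: str) -> None:
            last_line = b""
            with open(path, "rb") as f:
                for lineno, line in enumerate(f, start=1):
                    last_line = line
                    if b"\x00" in line:
                        print(f"line {lineno}: contains NUL (0x00) bytes")
            if last_line and not last_line.endswith(b"\n"):
                print("last row is not newline-terminated (incomplete final row)")

        check_rows("pps.MessageQueue.text")  # placeholder local path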

    Friday, March 16, 2018 5:23 PM
  • Hi sharonbje2,

    Do you have any requirements beyond listing the files and getting their sizes? In case you haven't already seen it, ADF has a GetMetadata activity whose purpose is to get metadata (e.g. the child file list, file size, lastModified) for a dataset. Please refer to https://docs.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity. Currently it only supports Azure Blob with the properties size, lastModified, and structure; it will soon support more data sources (e.g. ADLS, SQL, ...) and more properties (e.g. childItems, ...). Once that is ready, you may not need to write a complicated query for the validation you want; a single ADF pipeline can handle it all.
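
    For reference, the activity is declared in the pipeline JSON; a rough sketch of its shape, expressed here as a Python dict mirroring that JSON (the activity and dataset names are placeholders; size, lastModified, and structure are the blob properties the linked doc lists):

        import json

        # Rough shape of a GetMetadata activity in an ADF v2 pipeline
        # definition; names are placeholders, not from the original thread.
        get_metadata_activity = {
            "name": "GetFileMetadata",
            "type": "GetMetadata",
            "typeProperties": {
                "dataset": {"referenceName": "MyBlobDataset", "type": "DatasetReference"},
                "fieldList": ["size", "lastModified", "structure"],
            },
        }
        print(json.dumps(get_metadata_activity, indent=2))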

    Hope you can share more information to help us improve the product to meet customers' requirements. Thanks!

    Monday, March 19, 2018 7:55 AM
  • Thanks for the response, but I need it for ADLS files.
    Monday, March 19, 2018 12:39 PM
  • Hi sharonbje2,

    If you want a programmatic way to do this, Data Lake Store provides a WebHDFS-compliant API; its GETCONTENTSUMMARY operation returns several folder attributes (directory count, file count, total length). You can see more details here: https://docs.microsoft.com/en-us/rest/api/datalakestore/webhdfs-filesystem-apis.
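
    For example, a minimal sketch of calling that API with Python (the account name, folder path, and Azure AD bearer token are placeholders; LISTSTATUS, another standard WebHDFS operation, returns the per-file lengths the original question asked for):

        import requests

        ACCOUNT = "myadlsaccount"          # placeholder account name
        TOKEN = "<azure-ad-bearer-token>"  # placeholder OAuth token
        FOLDER = "incoming/p20180316/full"

        base = f"https://{ACCOUNT}.azuredatalakestore.net/webhdfs/v1/{FOLDER}"
        headers = {"Authorization": f"Bearer {TOKEN}"}

        # Aggregate attributes for everything under the folder.
        summary = requests.get(base, params={"op": "GETCONTENTSUMMARY"}, headers=headers).json()
        print(summary["ContentSummary"]["length"], "bytes in total")

        # Per-file listing: name and length of each file in the folder.
        listing = requests.get(base, params={"op": "LISTSTATUS"}, headers=headers).json()
        for status in listing["FileStatuses"]["FileStatus"]:
            print(status["pathSuffix"], status["length"])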

    Hope this helps
    José

    Tuesday, March 27, 2018 1:19 AM