Slow execution of USQL

    Question

  • Hi There,

    I have created a simple script that computes similarity scores between two strings. Please find the U-SQL script and the .NET code-behind below.

    CN_Matcher.usql:

       
    REFERENCE ASSEMBLY master.FuzzyString;
    
    @searchlog =
            EXTRACT ID int,
                    Input_CN string,
                    Output_CN string
            FROM "/CN_Matcher/Input/sample.txt"
            USING Extractors.Tsv();
    
    @CleansCheck =
        SELECT ID,Input_CN, Output_CN, CN_Validator.trial.cleanser(Input_CN) AS Input_CN_Cleansed,
               CN_Validator.trial.cleanser(Output_CN) AS Output_CN_Cleansed
        FROM @searchlog;
    
    @CheckData =
        SELECT ID, Input_CN, Output_CN, Input_CN_Cleansed, Output_CN_Cleansed,
               CN_Validator.trial.Hamming(Input_CN_Cleansed, Output_CN_Cleansed) AS HammingScore,
               CN_Validator.trial.LevinstienDistance(Input_CN_Cleansed, Output_CN_Cleansed) AS LevinstienDistance,
               FuzzyString.ComparisonMetrics.JaroWinklerDistance(Input_CN_Cleansed, Output_CN_Cleansed) AS JaroWinklerDistance
        FROM @CleansCheck;
    
    OUTPUT @CheckData
        TO "/CN_Matcher/CN_Full_Run.txt"
        USING Outputters.Tsv();

    CN_Matcher.usql.cs:

    using Microsoft.Analytics.Interfaces;
    using Microsoft.Analytics.Types.Sql;
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    //using FuzzyString;
    
    namespace CN_Validator
    {
        public static class trial
        {
    
            public static string cleanser(string val)
            {
                List<string> wordsToRemove = "l.p. registered pc bldg pllc lp. l.c. div. national l p l.l.c international r. limited school azioni joint co-op corporation corp., (corp) inc., societa company llp liability l.l.l.p llc bancorporation manufacturing c dst (inc) jv ltd. llc. technology ltd., s.a. mfg rllp incorporated per venture l.l.p c. p.l.l.c l.p.. p. partnership corp co-operative s.p.a tech schl bancorp association lllp n r ltd inc. l.l.p. p.c. co district int intl assn. sa inc l.p co, co. division lc intl. lp professional corp. a l. l.l.c. building r.l.l.p co.,".Split(' ').ToList();
                return string.Join(" ", val.ToLower().Split(' ').Except(wordsToRemove));
            }
    
            public static int Hamming(string source, string target)
            {   
                int distance = 0;
                if (source.Length == target.Length)
                {
                    for (int i = 0; i < source.Length; i++)
                    {
                        if (!source[i].Equals(target[i]))
                        {
                            distance++;
                        }
                    }
                    return distance;
                }
                else { return 99999; }
            }
    
            public static int LevinstienDistance(string source, string target)
            {
                int n = source.Length;
                int m = target.Length;
    
                int[,] d = new int[n + 1, m + 1]; // matrix
                int cost; // cost
                // Step 1
                if (n == 0) return m;
                if (m == 0) return n;
    
                // Step 2
    
                for (int i = 0; i <= n; d[i, 0] = i++) ;
    
                for (int j = 0; j <= m; d[0, j] = j++) ;
    
                // Step 3
    
                for (int i = 1; i <= n; i++)
                {
                    for (int j = 1; j <= m; j++)
                    {
                        cost = (target.Substring(j - 1, 1) == source.Substring(i - 1, 1) ? 0 : 1);
                        d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                  d[i - 1, j - 1] + cost);
                    }
                }
                return d[n, m];
    
            }
    
        }
    }
    


    I ran a sample batch of 100 inputs with parallelism set to 1 and priority set to 1000. The job completed in 1.6 minutes.

    I then tested the same job with 1000 inputs, again with parallelism 1 and priority 1000. Since 100 inputs took 1.6 minutes, I expected 1000 inputs to take around 20 minutes, but the job ran for more than 50 minutes and I did not see any progress.

    So I submitted another 100-input job, and it ran the same as the previous time. I then increased the parallelism to 3 and ran again; the job did not complete even after 28 minutes, and again I did not see any progress.

    JOB_ID=07c0850d-0770-4430-a288-5cddcfc26699

    The main issue is that I am not able to see any progress or status.

    Please let me know if I am doing anything wrong.

    Is there any way to use a constructor in U-SQL? If I could do that, I would not need to repeat the same cleansing steps again and again.




    Thursday, November 10, 2016 12:55 PM

All replies

  • Hi,

    I'm not sure this is related only to your C# code, but have you tried moving the creation of the wordsToRemove variable to the trial class level, as a static member?

    This will avoid instantiating it each time the cleanser method is called (twice per input row).
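
    A minimal sketch of that change, under some assumptions: the word list is abbreviated here for readability (the full list from the question would go in its place), and the HashSet with Where/Distinct is a swapped-in variation rather than the original Except call, since Except builds its own internal set from the second sequence on every call. Distinct is kept so that repeated words are still dropped, as Except does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace CN_Validator
{
    public static class trial
    {
        // Built once, when the class is first used, instead of on every call.
        // ASSUMPTION: abbreviated list for illustration; substitute the full list from the question.
        private static readonly HashSet<string> WordsToRemove = new HashSet<string>(
            "l.p. registered pc bldg pllc llc inc. corp. ltd. co.".Split(' '));

        public static string cleanser(string val)
        {
            // Lower-case, split on spaces, drop listed words, drop repeated words.
            var kept = val.ToLower()
                          .Split(' ')
                          .Where(w => !WordsToRemove.Contains(w))
                          .Distinct();
            return string.Join(" ", kept);
        }
    }
}
```

    The U-SQL script can keep calling CN_Validator.trial.cleanser(Input_CN) unchanged. Whether this alone explains the slow job is not established in this thread.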

    Have you tried profiling the trial class outside of the U-SQL project?
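
    One rough way to do that is to time the methods from a small console project with System.Diagnostics.Stopwatch. The sketch below is only an illustration: ProfileHarness and Measure are hypothetical names, not part of U-SQL or of the code in the question, and a real profiler would give more detail.

```csharp
using System;
using System.Diagnostics;

public static class ProfileHarness
{
    // Hypothetical helper: runs `action` once as a warm-up, then `iterations`
    // more times, and returns the elapsed time of the timed runs only.
    public static TimeSpan Measure(Action action, int iterations)
    {
        action(); // warm-up so JIT compilation is not counted
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();
        return sw.Elapsed;
    }
}

// Possible usage from a console Main, assuming the code-behind class is referenced:
//   var t = ProfileHarness.Measure(
//       () => CN_Validator.trial.LevinstienDistance("acme widgets", "acme widget"), 1000);
//   Console.WriteLine("1000 calls took {0} ms", t.TotalMilliseconds);
```

    Timing cleanser, Hamming and LevinstienDistance separately this way should show whether the C# code or the job itself accounts for most of the run time.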

    I hope this will help.


    Michel CARADEC

    Tuesday, November 22, 2016 2:00 PM