SQL Server Agent scheduled job taking an abnormally long time to run, but the same job finishing on time when run manually

    Question

  • I am having quite a strange problem: a scheduled job in SQL Server Agent is taking an abnormally long time to run. However, when I right-click on the same job and select "Start job at step...", it finishes in the expected time. Normally the job finishes in about 15 to 17 minutes (it works on a pretty big data set with about 19 joins, each table having more than 15 million rows; the source tables all have proper clustered indexes), but when it ran at the scheduled time it took a whopping 13 hours! The job is scheduled to run every hour, so needless to say it didn't run in the subsequent hours. After that abnormal run it took 25 minutes (still about double the normal time) in the next scheduled hour, and then fell into the same situation again the hour after.

    I couldn't find any problem at all. To give it a try I deleted the job and the SSIS package, then recreated the package under a different name and scheduled the job again. I first ran it manually and it finished within 15 minutes, but at the scheduled hour it hit the same abnormal situation again. It started about 50 minutes ago and is still running; most probably I will kill the process and run it again manually. I will search the web for a solution, and in the meantime any good answer is highly appreciated.

    Wednesday, April 17, 2019 6:29 PM

All replies

  • To be honest, I have no idea what is going on. Had you been running the same query from a query window in SSMS, I could have offered some ideas. But to my knowledge, there is no difference between starting a job manually and having it run on a schedule.

    The only advice is that you try to troubleshoot to find out at which query it gets stuck. My beta_lockinfo can be a quick way to find out, since I assume it runs the same query for a long time. If you are on SQL 2016, you can enable trace flag 7412 globally and get more information from the query plan than you would otherwise.

    http://www.sommarskog.se/sqlutil/beta_lockinfo.html
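
    In case it helps, a minimal sketch of enabling that trace flag globally (assuming SQL Server 2016 SP1 or later, where the lightweight execution-statistics profiling is available):

    -- Enable trace flag 7412 globally so live execution statistics are exposed
    -- for running queries; turn it off again when the investigation is done.
    DBCC TRACEON (7412, -1);
    -- DBCC TRACEOFF (7412, -1);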


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Wednesday, April 17, 2019 9:41 PM
    Moderator
  • Hi Erland,

    I am running SQL Server 2014 Standard Edition. The queries run as expected from SSMS query window. Another strange thing is that the job ran and finished within expected time (11 to 13 minutes) in three different intervals (today at 16:38, 17:38 and 18:38) but the next one that started at 19:38 is again having the same problem. It is still running (at this moment 92 minutes passed and still not finished).

    I ran the command "select * from sys.sysprocesses where blocked<>0" but it didn't produce any output. Basically that command produces something only when a new job starts, and within a few seconds those rows are gone.
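
    For reference, a roughly equivalent check using the newer DMVs instead of the deprecated sys.sysprocesses (a sketch, not a full blocking analysis):

    -- Show every request that is currently blocked, which session is blocking it,
    -- what it is waiting on, and the statement it is running.
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time,
           t.text AS current_sql
    FROM   sys.dm_exec_requests r
    CROSS  APPLY sys.dm_exec_sql_text(r.sql_handle) t
    WHERE  r.blocking_session_id <> 0;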

    I copied and ran your stored procedure (I used the one "for SQL 2012 and SQL 2014. (And SQL 2016 lower than SP1 CU2.)"), but it also didn't produce anything. The table guest.beta_lockinfo was not created either. I am now clueless! Thank you very much for your kind intent to help me. I couldn't understand much of the stored procedure, but I copied it and executed it.

    As a last option I will try restarting the machine itself late at night.

    Thursday, April 18, 2019 1:19 AM
  • Hi,

    Can you provide the job's code, including all steps (the script to create the job)?

    SSMS Object explorer > Right click on the job > Script job as -> create to > new query editor windows


    Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]

    Thursday, April 18, 2019 3:41 AM
    Moderator
  • Here it is:

    USE [msdb]
    GO

    /****** Object:  Job [Consolidated_LTE_NOKIA_Raw_Data]    Script Date: 2019-04-17 23:52:55 ******/
    BEGIN TRANSACTION
    DECLARE @ReturnCode INT
    SELECT @ReturnCode = 0
    /****** Object:  JobCategory [[Uncategorized (Local)]]    Script Date: 2019-04-17 23:52:55 ******/
    IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)
    BEGIN
    EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'
    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

    END

    DECLARE @jobId BINARY(16)
    EXEC @ReturnCode =  msdb.dbo.sp_add_job @job_name=N'Consolidated_LTE_NOKIA_Raw_Data',
            @enabled=1,
            @notify_level_eventlog=0,
            @notify_level_email=0,
            @notify_level_netsend=0,
            @notify_level_page=0,
            @delete_level=0,
            @description=N'No description available.',
            @category_name=N'[Uncategorized (Local)]',
            @owner_login_name=N'NQSUPERIOR\Administrator', @job_id = @jobId OUTPUT
    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
    /****** Object:  Step [Consolidated_LTE_NOKIA_Raw_Data]    Script Date: 2019-04-17 23:52:56 ******/
    EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Consolidated_LTE_NOKIA_Raw_Data',
            @step_id=1,
            @cmdexec_success_code=0,
            @on_success_action=1,
            @on_success_step_id=0,
            @on_fail_action=2,
            @on_fail_step_id=0,
            @retry_attempts=0,
            @retry_interval=0,
            @os_run_priority=0, @subsystem=N'SSIS',
            @command=N'/FILE "\"E:\SQLSERVER_INTEGRATION_SERVICE_PACKAGES\Consolidated_LTE_NOKIA_Raw_Data\Consolidated_LTE_NOKIA_Raw_Data\Consolidated_LTE_NOKIA_Raw_Data.dtsx\"" /CHECKPOINTING OFF /REPORTING E',
            @database_name=N'master',
            @flags=0
    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
    EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
    EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'Consolidated_LTE_NOKIA_Raw_Data',
            @enabled=1,
            @freq_type=4,
            @freq_interval=1,
            @freq_subday_type=8,
            @freq_subday_interval=1,
            @freq_relative_interval=0,
            @freq_recurrence_factor=0,
            @active_start_date=20190417,
            @active_end_date=99991231,
            @active_start_time=3800,
            @active_end_time=235959,
            @schedule_uid=N'bab3b6be-9b02-45aa-b493-24fb5952ae15'
    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
    EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
    COMMIT TRANSACTION
    GOTO EndSave
    QuitWithRollback:
        IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
    EndSave:

    GO


    Thursday, April 18, 2019 3:53 AM
  • It could be that every time you run the package, the plans have fallen out of the cache and are compiled anew, and the query wobbles between a good plan and a bad plan. The reason could of course be due to differences in the input.

    beta_lockinfo should definitely return some data when there are queries running. If you see nothing while the job is running, check that you are connected to the right server. If you are, and you still repeatedly see nothing, this would indicate that the problem is inside SSIS.

    The table guest.beta_lockinfo is only created when you run with the @archivemode parameter set. And when you use it, there is no output.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Thursday, April 18, 2019 9:49 AM
    Moderator
  • Please share the contents of the SSIS package. These may also help:

    https://www.mssqltips.com/sqlservertutorial/9070/sql-server-integration-services-ssis-performance-tuning-techniques/

    https://www.mssqltips.com/sqlservertip/3217/improve-ssis-data-flow-buffer-performance/




    • Edited by Av111 Thursday, April 18, 2019 5:59 PM
    Thursday, April 18, 2019 5:57 PM
  • Hi Erland,

    Yes, you are right. Now I get lots of entries when I run the stored procedure while the job is running from SQL Server Agent. I couldn't figure out much from that output. Only from the query plan could I see that one table accounts for 57% of the cost, doing a clustered index scan; the other tables are much less. However, this didn't give much of a clue, because whenever I run the job by right-clicking on it and choosing "Start job at step..." it finishes within 12 to 13 minutes, but whenever it is started by the scheduler it simply stalls.

    Let me give you some more insight into the query. There are 21 tables in total to be joined for the final result. I did not join all of them in one go; rather, I created 5 different temp tables. The first three each have 5 tables joined (all left-joined to the reference table), and the other two have 3 and 4 tables joined, respectively. Then the 5 temp tables are joined to populate the final table (a simplified sketch of the pattern follows below). The query is stalling at the very first temp table, and the reference table for the first group is showing a 57% cost for the clustered index scan.
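
    A simplified sketch of that staging pattern (the table and column names here are hypothetical stand-ins, not the actual production objects):

    -- #Gr1: one reference table left-joined to the rest of its group.
    SELECT A.TOR_DateTime, A.LNBTS, A.LNCEL, A.Metric_A, B.Metric_B
    INTO   #Gr1
    FROM   dbo.SourceA AS A
    LEFT   JOIN dbo.SourceB AS B
           ON  B.TOR_DateTime = A.TOR_DateTime
           AND B.LNBTS        = A.LNBTS
           AND B.LNCEL        = A.LNCEL;

    -- #Gr2 .. #Gr5 are built the same way from their own groups of tables.

    -- Final step: join the temp tables to load the consolidated table.
    INSERT INTO dbo.Consolidated_Raw_Data (TOR_DateTime, LNBTS, LNCEL, Metric_A, Metric_B, Metric_C)
    SELECT g1.TOR_DateTime, g1.LNBTS, g1.LNCEL, g1.Metric_A, g1.Metric_B, g2.Metric_C
    FROM   #Gr1 AS g1
    JOIN   #Gr2 AS g2
           ON  g2.TOR_DateTime = g1.TOR_DateTime
           AND g2.LNBTS        = g1.LNBTS
           AND g2.LNCEL        = g1.LNCEL;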

    But I am unable to understand why it faces this problem only when it is automatically initiated by SQL Server Agent. The exact same job works fine when run manually from Agent!

    I restarted the physical machine and also tried restarting the SQL Server service from SQL Server Configuration Manager, but nothing improved the situation. I also tried deleting the job and the SSIS package, recreating them under a different name and rescheduling, but still no change in behavior. I have never seen anything like this.

    Friday, April 19, 2019 4:21 AM
  • No, I don't understand why starting a job manually would be different from it running on a schedule. We will have to try to research this. Here is a query:

    WITH CTE AS (
       SELECT a.attrlist, qs.*,
              substring(est.text, qs.statement_start_offset / 2,
                        IIF(qs.statement_end_offset > qs.statement_start_offset,
                            (qs.statement_end_offset - qs.statement_start_offset + 1) / 2,
                            len(est.text))) AS query_text
       FROM   sys.dm_exec_query_stats qs
       CROSS  APPLY sys.dm_exec_sql_text(qs.sql_handle) est
       CROSS  APPLY (SELECT epa.attribute + '=' + convert(nvarchar(127), epa.value) + '   '
                     FROM   sys.dm_exec_plan_attributes(qs.plan_handle) epa
                     WHERE  epa.is_cache_key = 1
                     ORDER  BY epa.attribute
                     FOR    XML PATH('')) AS a(attrlist)
       WHERE  a.attrlist LIKE '%dbid=' + ltrim(str(db_id())) + ' %'
         AND  est.text NOT LIKE '%sys.dm_exec_plan_attributes%'
    )
    SELECT execution_count, plan_generation_num, last_execution_time, attrlist, query_text
    FROM   CTE
    WHERE  query_text LIKE '%LAG(EndingonHand%'

    Replace the "LAG(EndingonHand" with a piece of the text of your difficult query that is distinctive from all other queries. Be sure to keep the percent characters.

    It is likely that you will get multiple rows, since you have run the query from SSMS. Try to use last_execution_time to find out which row(s) relate to the Agent executions.

    Some other remarks: that 57% is just an estimate. Estimates are often wrong, not least when you get bad performance, so don't pay too much attention to that number.

    You mention that you have a number of temp tables for intermediate results. This can often help with big queries, since a temp table has statistics, which can help the optimizer compute the next step more accurately. But this strategy is not always the right one, since there is an overhead for materialising the result, and you may break up the query in the wrong place. In any case, when you do this, it is very important that you define primary keys for your temp tables, and by that I mean real primary keys based on the data in the tables. There is no point in an IDENTITY column that you slap on but never use. These indexes help the optimizer a lot.
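
    To illustrate the point, a minimal sketch (column names are hypothetical):

    -- Define the temp table with a real primary key on the columns that actually
    -- identify a row; the optimizer can use this key and its statistics.
    CREATE TABLE #Gr1
    (
        TOR_DateTime datetime NOT NULL,
        LNBTS        int      NOT NULL,
        LNCEL        int      NOT NULL,
        Metric_A     float    NULL,
        CONSTRAINT pk_Gr1 PRIMARY KEY (TOR_DateTime, LNBTS, LNCEL)
    );

    -- By contrast, an IDENTITY column that is never used in joins or lookups
    -- gives the optimizer nothing to work with:
    -- CREATE TABLE #Gr1 (id int IDENTITY PRIMARY KEY, TOR_DateTime datetime, ...);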


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Friday, April 19, 2019 9:03 AM
    Moderator
  • The fact that it intermittently takes longer makes me think, as Erland pointed out, that there is a change of plans. Have you monitored the wait stats when it runs longer? Monitor the resource utilization, see what wait types show up against the job's session, and compare the execution plans from good vs. bad runs. Unfortunately, you are on a version where the advanced methods to force good plans, etc., are not available, so you would have to go with the old-school approach.
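
    For example, something along these lines while the job is running slowly (a sketch; filter on the job's actual session_id if you know it):

    -- Show what each active user session is currently waiting on.
    SELECT r.session_id,
           r.status,
           r.wait_type,
           r.wait_time,
           r.last_wait_type,
           r.cpu_time,
           r.logical_reads
    FROM   sys.dm_exec_requests r
    JOIN   sys.dm_exec_sessions s ON s.session_id = r.session_id
    WHERE  s.is_user_process = 1;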

    Please remember to click "Mark as Answer" if my response answered your question or click "Vote as helpful" if it helped you in any way.

    Friday, April 19, 2019 1:04 PM
    Answerer
  • Hi Erland,

    First let me give you today's behavior for the job.

    Started by Job agent at 2019-04-19 01:38 (Toronto, Canada local time) and finished in 18 minutes 53 seconds

    Started by Job agent at 2019-04-19 02:38, didn't finish and I killed the job after 6 hr 40 min 50 sec.

    Started manually from Job agent at 2019-04-19 09:18:57 and it finished in 20 minutes (it took longer as it had to catch up with those 7 missing hours)

    Started by Job agent at 2019-04-19 10:38 and finished in 14 min 07 sec (good!)

    Started by Job agent at 2019-04-19 11:38 and finished in 12 min 18 sec (good!)

    Started by Job agent at 2019-04-19 12:38 and still running (current time 2019-04-19 13:38) for one hour!

    At this time I executed the script (CTE) provided by you. I put in one specific piece of text that appears only in the problematic query, but I didn't get any output. Then, for testing, I discarded the WHERE condition and got 34 rows of output. None of those except one row (select top 10 * from sys.dm_exec_query_stats) was known to me. Also I found that the latest last_execution_time was 2019-04-19 09:42:33.153, although the job ran successfully twice after that time, and the problematic run is still going on.

    Some comments on resource usage:

    At this moment the CPU utilization is 100%!!! and memory usage is 67%. Only in some moments does the CPU utilization come down to 70 to 80%; almost always during the problem it is running at 100%.

    The server has an Intel Xeon CPU E5-2640 v2 @ 2.60 GHz (16 cores, 32 logical processors); it uses only about 2% to 20% CPU and stays under 1.2 GHz when the job is run manually.
    Memory used is about 280 GB out of 512 GB (55%).

    Is there anything else that I can provide you? This is an extremely weird problem.

    Now I will kill the job again to spare the other processes, as CPU utilization is at 100%.

    Friday, April 19, 2019 5:46 PM
  • Hi,

    This report gives different information from the beginning of the thread. According to this information, the issue is not that scheduled executions are slow while manual executions are fast. Rather, it seems that sometimes the execution gets stuck, while in other cases it works well and finishes fast.

    Based on that, I would focus on understanding what the SSIS project does and what can make it work slowly in some cases. Is it working on the same amount of data in each execution? Does the data include values which lead it to get stuck? And so on.

    At the same time, as Erland has already worked with you, I would search for locks, waits, and issues related to the execution while it seems to be stuck. You need to monitor what happens when it gets stuck.

    * I had a specific idea related to the user that executes the steps (if you use "execute as"), but according to the code you provided when I asked for it, this is not relevant. I do not have a specific idea at this time, but I do know that this is monitoring time ;-)

    ** By the way, you mentioned that you use SQL Server 2014, and I remember a bug in 2014 related to slow execution of jobs that was reported and discussed on the Connect website, but that website is not online anymore and I do not remember exactly what it was. I tried Google but found nothing.


    Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]


    Friday, April 19, 2019 6:06 PM
    Moderator
  • I would say that with the results of today's executions, we can conclude that the purported difference between starting the job manually and by schedule is a red herring. There is something else that causes the job to run with different execution plans on different occasions, and there can be a couple of reasons for this. So let's abandon the attempts with the query I posted.

    Instead, I would like to see the problematic query.

    It would also help to see the query plans for the fast and slow executions. You should at least be able to capture the slow one with beta_lockinfo; the fast one may be more difficult, depending on how fast it is. What we want to see is the XML, not the graphics. Don't paste the XML into the forum itself; upload it somewhere and post the link. You can also upload the plans to www.brentozar.com/pastetheplan.
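
    If it helps, one more way to capture the plan of the currently running (slow) statement directly from the DMVs (a sketch):

    -- Grab the cached plan XML for whatever the other sessions are running right now.
    SELECT r.session_id,
           r.start_time,
           qp.query_plan
    FROM   sys.dm_exec_requests r
    CROSS  APPLY sys.dm_exec_query_plan(r.plan_handle) qp
    WHERE  r.session_id <> @@SPID;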


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Friday, April 19, 2019 9:26 PM
    Moderator
  • Hi Erland,

    Things have now changed a bit, and in a couple of cases the manual execution also faced the same problem! I have also observed that whenever there is a problem, the CPU load reaches 100%. I wanted to watch for some more time to see whether I would get similar behavior for both automatic and manual execution, and today I found that this is happening. So I believe we can discard the primary suspect of SQL Server Agent itself. In any case, in some runs the automatic execution was also working fine.

    As per your instruction I have uploaded the XML query plan from when it was stuck for over 3 hours in a manual execution. But the forum is not allowing me to send the URL, so I am only giving the id of the plan that I pasted. I am sure you can get the actual post by appending the part in double quotes to the website address that you shared with me: "id=ryob4UY9E"

    I am not sure whether it is due to the hardware of the server or not. The same script was running without any issue for over 2 years, but the problem started after I ran a shrink file on the raw database and then rebuilt the indexes.

    Thanks a lot for your time and effort in the attempt to find the root cause of this strange behavior.
    Sunday, April 21, 2019 2:45 AM
  • Hi,

    >> So I believe we can discard the primary suspect of SQL Server Agent itself.

    I already told you that from your previous report it is pretty clear that the issue is not related to automatic vs. manual execution ;-)

    And as I mentioned, it is time for monitoring and finding the real source of the issue.

    * It will help to get the execution plan for a fast execution as well; it will make it simpler and faster to compare.

    I have a meeting in several hours that I need to prepare for, so I will try to remember to check the thread after it...

    In the meantime, for those who want the link to the execution plan, here it is:

    https://www.brentozar.com/pastetheplan/?id=ryob4UY9E

    Good luck


    Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]



    Sunday, April 21, 2019 3:58 AM
    Moderator
  • Things have now changed a bit, and in a couple of cases the manual execution also faced the same problem!

    Yes, that was bound to happen sooner or later.

    Also I observed that whenever there is a problem the CPU load reaches 100%.

    That's your query running!

    As per your instruction I have uploaded the XML query plan from when it was stuck for over 3 hours in a manual execution. But it is not allowing me to send the URL, so I am only giving the id of the plan that I pasted. I am sure you can get the actual post by appending the part in double quotes to the website address that you shared with me. "id=ryob4UY9E"

    You should be able to paste a URL as plain text, I assume. Anyway, to be granted access to post images and URL you need to have your account "verified". You do this here: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/dc4002e4-e3de-4b1e-9a97-3702387886cc/verify-account-42?forum=reportabug

    I am not sure whether it is due to the hardware of the server or not. The same script was running without any issue for over 2 years, but the problem started after I ran a shrink file on the raw database and then rebuilt the indexes.

    Data is not static over time; the profile etc. changes. When you rebuilt the indexes, statistics were updated with fullscan, which should be for the better. However, in terms of estimated cost, the query may be balancing between two plans, one good and one bad. (We have seen this before.)

    Your query is very long, and in the plan it was truncated. Any chance that you could post just the part from the FROM clause and down?

    By the way, you said that you had broken up the join into several queries with intermediate temp tables, and that the problematic query is the first one. So I had expected an INSERT-SELECT statement, but this appears to be a plain SELECT?


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, April 21, 2019 9:49 AM
    Moderator
  • Yes, the query is very long, although not very complicated. It has 905 columns, and per hour about 70000 rows get populated into the table. In some cases it may be way more, if it didn't run in some previous intervals or the source tables receive a lot of new data that was missed previously.

    To make things easier I created another table keeping only one column from each source table and all the objects. Then I ran the query from SSMS to catch the query plan XML using your stored procedure beta_lockinfo. Unfortunately it catches only the part that is executing at that moment. So the query plan that I uploaded at https://www.brentozar.com/pastetheplan/?id=rkjCeG5cN

    has only #Gr3 (only one temp table of many). I was wondering if there is any way that I can give you the full SQL query file? Or do I need to run beta_lockinfo frequently during the execution of the script, catch as many query plans as possible, and then upload all of them? Please advise me.

    This specific new table with the shortened query took 5 minutes 50 seconds to populate over 1 million rows. That is understandable, as this new table has only about 3 million rows of data, whereas the production table has more than 25 million rows and 905 columns.

    Sunday, April 21, 2019 4:15 PM
  • If you are running the script in SSMS, the easiest way is to select Include Actual Execution Plan from the context menu. If you are running the script from Agent, there are some possible means, but they may not be good on live servers with other processes running.
    In any case, I find it sufficient to work with the query and plan you posted now, so let's stay with that. Just collect the actual plan from SSMS and upload it to pastetheplan. I could have use for the actual values.
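
    As a side note, when running the statement as a script you can also get the actual plan as XML with SET STATISTICS XML (a sketch; the SELECT is just a stand-in for the real statement):

    SET STATISTICS XML ON;
    -- run the problematic statement here
    SELECT 1 AS placeholder;
    SET STATISTICS XML OFF;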

    I have some questions about your query.
    First of all, in all these left joins, how many rows do you expect to get back from each left join for a single row in A?

    Next, in the query plan there is a table Lookup_NOKIA_LNCEL_Name_Mapping that occurs no less than five times. Can you explain how it gets into the picture?

    Would it be possible for you to post the full CREATE TABLE + CREATE INDEX scripts for the simplified tables (+ NameMapping) you used in this query? Don't post them here, but upload them to Dropbox, OneDrive or whatever and post a link. (So you can remove the file once this has been sorted out.)


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, April 21, 2019 10:03 PM
    Moderator
  • Hi Erland,

    I have uploaded 10 different files in Dropbox and have given access to your email address esquel@sommarskog.se

    If you haven't already received the link via email, you may use the following one to access the folder.

    https://www.dropbox.com/home/SQLSERVER_Problem

    I have provided files for both the simplified table and the actual production table. However, a few moments ago I changed the production script a little and brought the number of intermediate temp tables down from 5 to 4. The applicable production files are based on this new, modified method.

    As for the name mapping and the number of columns returned from each source table, see below. Here #Gr1 to #Gr4 are the actual temp tables that are created, and the number after the colon (:) is the number of fields returned from each table. In the final table there will be 905 columns.

    #Gr1:
    LTE_NOKIA.dbo.V_LTE_Cell_Load as A: 229
    LTE_NOKIA.dbo.V_LTE_UE_State as B:30
    LTE_NOKIA.dbo.V_LTE_Radio_Bearer as C:4
    LTE_NOKIA.dbo.V_LTE_EPS_Bearer as D:136
    LTE_NOKIA.dbo.V_Lookup_NOKIA_LNCEL_Name_Mapping as E:5

    #Gr2:
    LTE_NOKIA.dbo.V_LTE_Cell_Avail as A:9
    LTE_NOKIA.dbo.V_LTE_Cell_Throughput as E:92
    LTE_NOKIA.dbo.V_LTE_Cell_Resource as F:46
    LTE_NOKIA.dbo.V_LTE_Pwr_and_Qual_UL as N:151
    LTE_NOKIA.dbo.V_LTE_Pwr_and_Qual_DL as P:34

    #Gr3:
    LTE_NOKIA.dbo.V_LTE_UE_Quantity as A:40
    LTE_NOKIA.dbo.V_LTE_Intra_eNB_HO as G:8
    LTE_NOKIA.dbo.V_LTE_Inter_eNB_HO as H:14
    LTE_NOKIA.dbo.V_LTE_Handover as I:15
    LTE_NOKIA.dbo.V_LTE_Inter_Sys_HO as J:13

    #Gr4:
    LTE_NOKIA.dbo.V_LTE_RRC as A:13
    LTE_NOKIA.dbo.V_LTE_MAC as B:31
    LTE_NOKIA.dbo.V_LTE_QoS as C:41
    LTE_NOKIA.dbo.V_LTE_HO_RLF_trigger as D:2
    LTE_NOKIA.dbo.V_LTE_Neighb_Cell_HO as E:9

    The behavior is still the same. Usually in one interval it runs just fine, and in the next one it gets stuck. Then I kill the job, and in the following interval it again runs fine. This has been an extremely frustrating experience, but I hope an expert like you will be able to shed some light on it.

    Monday, April 22, 2019 3:40 AM
  • The normal procedure is that you make the files publicly available, so that anyone can join in. I can understand that you don't want to expose your intellectual property to everyone, but still, that is the way it is.

    The information was in any case of limited use to me. There was only the definition of a single table, and it is not one of those that figure in the query you posted the plan for. I would like to see the definitions for the tables in the query that creates #Gr3. And I would like to understand how that name-mapping table comes into the equation. Oh, I see it now! In the query text you have V_LTE_UE_Quantity, whereas the plan has the names without the V. So these are views! Oh well, then we need the view definitions as well.

    Also, there was a question that you did not answer, so I will repeat it: in all these left joins, how many rows do you expect to get back from each left join for a single row in A? You gave a list of the number of columns, but that does not interest me. The question is about the number of rows.

    I should also say that it is unlikely that we will be able to help you with this fully - it is simply too big for a forum post. So I will focus on the query that produces #Gr3 to see if I can make some improvements that you can then apply to the rest.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Monday, April 22, 2019 12:53 PM
    Moderator
  • Now anybody should be able to see the files from https://www.dropbox.com/sh/884xq1hbe0gj1d4/AADbCODPvCNXVNiizYMMFJOIa?dl=0

    Sorry that I misread the requirement! In normal operation the expected number of rows is about 70000. In some cases it might be significantly more if previous hours were missed: the script automatically goes back 3 days from the current date to check for data missing from the destination table, and brings it in. I have run the script to recover two full days of data (more than 3.3 million rows) and it finished within 40 minutes. Each of the source tables gets the same number of records every 15 minutes. (In the actual query I am using views, as the original tables have only column IDs and not names, and it is very difficult to work with IDs in such a big query; some enhancement is also done on the original tables through the views.) I have uploaded three files for the table LTE_UE_Quantity and the corresponding view.

    CREATE INDEX for LTE_UE_Quantity.txt

    CREATE TABLE for LTE_UE_Quantity.txt

    View Definition for V_LTE_UE_Quantity.txt

    Is there anything that I am still missing?

    Monday, April 22, 2019 1:30 PM
  • I am sad to say so, but I will step off at this point. This is simply far too big to deal with in a forum thread, at least for me.

    I will make some closing remarks, though.

    I asked: "In all these left joins, how many rows do you expect to get back from each left join for a single row in A?" You never really answered this, but I should explain why I asked the question. When you have a query like:

    SELECT ..
    FROM   A LEFT JOIN B ON A.col = B.col
    LEFT JOIN C ON A.col = C.col
    LEFT JOIN D ON A.col = D.col
    LEFT JOIN E ON A.col = E.col

    (A.col can here be the same column or different columns in A.)

    This query will produce at least as many rows as there are in A. Say, for simplicity's sake, that there is only one row in A. Say now that the join to B matches 7 rows in B. This means that the query will return 7 rows. And this is all right - as long as the joins to C, D and E each match at most one row! But what if the join to C matches 5 rows? Now you get 35 rows, and all rows from B are combined with all rows from C. Maybe there are cases where this is what you want, but it is a hidden cartesian join, and in many cases it indicates that something is wrong in the thought process. And to add insult to injury, it is usually bad for performance too.
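
    A tiny self-contained illustration of that row multiplication (hypothetical data, just to make the arithmetic concrete):

    DECLARE @A TABLE (col int);
    DECLARE @B TABLE (col int, b_val int);
    DECLARE @C TABLE (col int, c_val int);

    INSERT INTO @A VALUES (1);
    INSERT INTO @B SELECT 1, n FROM (VALUES (1),(2),(3),(4),(5),(6),(7)) AS v(n);  -- 7 matching rows
    INSERT INTO @C SELECT 1, n FROM (VALUES (1),(2),(3),(4),(5)) AS v(n);          -- 5 matching rows

    SELECT COUNT(*) AS row_count   -- returns 35 = 1 * 7 * 5
    FROM   @A a
    LEFT   JOIN @B b ON a.col = b.col
    LEFT   JOIN @C c ON a.col = c.col;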

    I'm not saying this is the situation in your case. After all, this has been running for a couple of years, so I would expect that it is known to return correct results. Yet I mention it, because I have seen people with little experience of SQL make this mistake more than once.

    The feeling I get when I look at this beast is that it needs a redesign from the ground up, maybe even going as far as transforming many of these columns into rows in a subtable. But I don't know what this is about or what it is used for, so I am not going to suggest that. But I would definitely suggest that you review the indexing. Had the views been plain tables, I would say that all tables should have a clustered index on the join columns, all in the same order. With the views, this is more difficult, not least when the join column TOR_DateTime is defined as

    ,convert(Datetime,dateadd(ss,-(duration+0*60*60),cast(datetime AS Datetimeoffset))) AS TOR_dateTime

    That ruins all chances for the index to be used.
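
    The idea, roughly, is to join and filter on the raw stored column and defer the conversion to the final projection. A sketch only: the column names are those from the view definition, while the second table and the counter column are hypothetical stand-ins.

    SELECT convert(datetime,
                   dateadd(ss, -(A.duration + 0*60*60),
                           cast(A.[datetime] AS datetimeoffset))) AS TOR_DateTime,
           A.LNCEL,
           B.SomeCounter                          -- hypothetical column
    FROM   dbo.LTE_UE_Quantity AS A
    LEFT   JOIN dbo.LTE_Intra_eNB_HO AS B
           ON  B.[datetime] = A.[datetime]        -- raw column: a matching index can be used
           AND B.LNCEL      = A.LNCEL;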

    Yes, I know you said that this has been running for two years without problems and that things went downhill only lately. But I am afraid the situation is that you have simply been lucky. I am not surprised that you get bad performance when I look at the glimpses of code and definitions you have shared. You have been sitting on an accident waiting to happen.

    I think you will need to find someone with good experience of SQL Server tuning to help you, either within your organisation, or get your manager to approve an external consultant. It's quite a bit of work to get this going well. And, as I said, this is bigger than what I am prepared to work with in a forum thread.

    I should add that had you been on SQL 2016 or later, you could have used Query Store to force the good plan. In theory you can do plan freezing on SQL 2014 as well, but it is more difficult to set up.
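
    For completeness, plan freezing on SQL 2014 would typically mean a plan guide with a USE PLAN hint, roughly like the skeleton below. The name, statement text and plan XML are placeholders: you must supply the exact statement text and the XML of a known-good plan.

    EXEC sp_create_plan_guide
         @name            = N'FreezeGoodPlan_Gr3',                        -- hypothetical name
         @stmt            = N'SELECT ... FROM dbo.LTE_UE_Quantity ...',   -- must match the statement text exactly
         @type            = N'SQL',
         @module_or_batch = NULL,
         @params          = NULL,
         @hints           = N'OPTION (USE PLAN N''<ShowPlanXML ...></ShowPlanXML>'')';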


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Monday, April 22, 2019 8:08 PM
    Moderator
  • Hi Erland,

    I understand that this is a very big query. But I would say that I did answer your question about how many rows were expected, though maybe not for the first table of the left join. For each 15-minute interval (that is the granularity of the data, expressed in each table by the "duration" column) there are about 17500 rows of data, and it is the same for all tables. There is no chance of any cartesian product. It is enough to use only datetime, duration, LNBTS and LNCEL in the join condition to get the correct result. The actual CREATE TABLE script that I shared has some more objects, MCC, MNC, PLMN and MRBTS, but they never change for any cell and can simply be discarded; that is why they are not in the final desired table either. They are in the source table because they came from the processing of the source files in the transformation stage, and I put all of them in the clustered index.

    Do you suggest discarding the 4 columns that are not used in joins from the clustered index?

    I have only one more question for you: is your suggestion not to use any conversion in the view on a column that is to be used in a join? If that is the case, I will modify the view to keep the datetime as it is in the table and do the conversion only in the last phase, when populating data into the desired table.

    There might be a question why such a big table was designed in the first place. It was to get reports much faster by avoiding many joins on the fly, and also to retain the most important columns for much longer instead of retaining too many very big tables for too long.

    A few days ago I ran SolarWinds DPA for performance tuning on this server and it didn't show any anomaly. That was before the problem started, though.

    Anyway thank you very much for your time and effort.

    Monday, April 22, 2019 11:10 PM
  • Hi Erland,

    I just want to give you a big THANK YOU. Your comment of

    "convert(Datetime,dateadd(ss,-(duration+0*60*60),cast(datetime AS Datetimeoffset))) AS TOR_dateTime

    That ruins all chances for the index to be used."

    was a good enough hint for me to redesign the query. Now I am querying directly from the tables, and for the last 7 intervals it has been working fine. There has been a great improvement in execution time as well: it now finishes in 2 to 4 minutes, where previously it was taking 12 to 15 minutes.

    Tuesday, April 23, 2019 5:14 PM
  • That's amazingly great news!

    As much as I think that this needs a major review, I think you have more pressing things on your hands, so move on with them.

    It was great that you were able to move the conversion of the datetime value to a later stage in the process. That was difficult for me to see.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Tuesday, April 23, 2019 10:02 PM
    Moderator