Deadlock not being detected [SQL 2017 CU12]

    Question

  • I have two merge statements that are deadlocked, but the deadlock isn't being detected so they're both stuck.  I understand the cause of the deadlock and a fix is in progress in our code, but I'm not sure how/where to report this; the deadlock not being detected sounds like a flaw in SQL Server.

    You can see in the attached output that sessions 38 and 104 are blocking each other.  38 is a service broker activation procedure and 104 is a procedure called from an SSIS package.

    Microsoft SQL Server 2017 (RTM-CU12) (KB4464082) - 14.0.3045.24 (X64)   Oct 18 2018 23:11:05   Copyright (C) 2017 Microsoft Corporation  Developer Edition (64-bit) on Windows Server 2012 R2 Datacenter 6.3 <X64> (Build 9600: ) (Hypervisor)
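
    A minimal sketch of the kind of DMV query that shows the mutual blocking (the session ids are the ones above; the actual output was attached as a screenshot and was not necessarily produced by exactly this query):

        -- Who is waiting on whom
        SELECT r.session_id,
               r.blocking_session_id,
               r.wait_type,
               r.wait_time,
               r.command
        FROM   sys.dm_exec_requests AS r
        WHERE  r.session_id IN (38, 104);

        -- Locks held (GRANT) and requested (WAIT) by the two sessions
        SELECT l.request_session_id,
               l.resource_type,
               l.resource_description,
               l.request_mode,
               l.request_status
        FROM   sys.dm_tran_locks AS l
        WHERE  l.request_session_id IN (38, 104)
        ORDER BY l.request_session_id, l.request_status;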


    Thursday, December 6, 2018 3:19 PM

All replies

  • You mean you expected to see an error like the one below?

    Server: Msg 1205, Level 13, State 50, Line 1
    Transaction (Process ID 38) was deadlocked on lock resources with
    another process and has been chosen as the deadlock victim. Rerun the
    transaction.
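
    When the detector does fire, the deadlock graph is also captured by the built-in system_health Extended Events session; a sketch for pulling the recent reports out of its ring buffer:

        -- Recent deadlock graphs recorded by the default system_health session
        SELECT CAST(xet.target_data AS xml).query(
                   '//RingBufferTarget/event[@name="xml_deadlock_report"]') AS deadlock_reports
        FROM   sys.dm_xe_session_targets AS xet
        JOIN   sys.dm_xe_sessions AS xes
               ON xes.address = xet.event_session_address
        WHERE  xes.name = N'system_health'
          AND  xet.target_name = N'ring_buffer';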


    Best Regards, Uri Dimant, SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/


    Thursday, December 6, 2018 3:23 PM
  • Kinky!

    ---

    Looks like the old "Connect" has been replaced by this:

    https://feedback.azure.com/forums/908035-sql-server

    Maybe they'll be more responsive to reports than Connect ever was?

    I don't suppose you have a repro script they could use to replicate the problem on a clean database?

    Josh

    Thursday, December 6, 2018 5:31 PM
  • Yeah, I would expect a deadlock event and that one of the connections would be killed with that error.
    Thursday, December 6, 2018 9:33 PM
  • I don't have a repro script, but it's on a dev server that I'm not using so it is still stuck.

    It would be laborious to make a repro, and this deadlock has occurred before with the expected behavior (error + one of the transactions rolled back) so even if I repro the deadlock it wouldn't necessarily help.

    Basically I figured that in the best case someone would see this and I'd be able to pull something useful out of the current server state before I kill it.

    Thursday, December 6, 2018 9:36 PM
  • You could use my beta_lockinfo to get a view of the situation, upload the result somewhere, and we can take a look at it.

    Although, I think I know what is going on. It is clear from your output that both processes are running parallel plans. What I cannot see is whether any of the threads are in EXECSYNC waits, that is, waits on other threads. The problem with these waits is that they are not recorded like other waits, including CXPACKET: you cannot see that one thread is waiting for another in sys.dm_os_waiting_tasks. As a consequence, the deadlock is not detected.

    I investigated such a case a few years back, and reported this as a bug, which you can find at https://feedback.azure.com/forums/908035-sql-server/suggestions/32901304-undetected-deadlock-because-of-execsync-waits. I have not heard anything, so I guess it is still open.

    You may of course be seeing something else, but if you find that you have EXECSYNC waits on your server, I am quite sure that you are seeing this issue.
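
    A quick sketch of that check against sys.dm_os_waiting_tasks, using the session ids from the original post:

        -- Per-thread waits for the two sessions; per the explanation above,
        -- any EXECSYNC rows would not show which thread they are waiting on
        SELECT wt.session_id,
               wt.exec_context_id,
               wt.wait_type,
               wt.wait_duration_ms,
               wt.blocking_session_id,
               wt.resource_description
        FROM   sys.dm_os_waiting_tasks AS wt
        WHERE  wt.session_id IN (38, 104)
        ORDER BY wt.session_id, wt.exec_context_id;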


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Thursday, December 6, 2018 10:22 PM
  • Thanks, Erland, that is interesting.  I ran the procedure and can send you the results if you'd like.  But I don't see any EXECSYNC, just CXPACKET, CXCONSUMER, and LCK_M_U.  Maybe CXCONSUMER has the same issue.  I've attached what I assume are the relevant parts of the output (this way I don't have to work on redacting), but let me know if more of the output is needed.

    (https://social.msdn.microsoft.com/Forums/getfile/1368257)

    Thursday, December 6, 2018 10:53 PM
  • I'm noticing that the 104/22 wait time is constantly low (rerunning the lock proc confirms it).  I wonder if that is yielding (a spinlock?) within the blocked process threshold and thus defeating the deadlock detection.
    Thursday, December 6, 2018 10:57 PM
  • >I'm noticing that the 104/22 wait time is constantly low (rerunning the lock proc confirms it).  I wonder if that is yielding (a spinlock?) within the blocked process threshold and thus defeating the deadlock detection.

    That's interesting - so are you 100% certain that it is deadlocked and not just super-slow for some weird reason?

    I mean, that wouldn't be good either, but it might be a different issue.

    How many rows is it processing, how fast (slow) does it run when it does not deadlock?

    Josh

    Friday, December 7, 2018 3:19 AM
  • Well, generally the situation is this.  Consider a price table ("P") with a foreign key to the objects for which there are prices ("O").

    Process 38 is receiving results from service broker, which first updates info about the objects (writes to "O") and then MERGEs the prices (into "P") with HOLDLOCK.  This is done inside a transaction.

    Process 104 is doing a MERGE into "P" (also with HOLDLOCK) of prices from another source via SSIS, and in so doing, "O" is also read by SQL Server to validate the foreign key.  This is the deadlock I've seen: 38 has completed updating "O" (and holds an update lock on "O") and is trying to MERGE into "P", where it is blocked by 104's lock; meanwhile 104 is merging into (and thus holds the lock on) "P", and is waiting on the lock on "O".

    The data being merged is broken up into batches and they're usually reasonably small; usually, each such batch takes a minute or two at the most to merge.  Although "P" itself is gigantic, it is indexed such that the merges are quick.  This current situation, however, has been stuck for about 24 hours.

    That said, based on the output I gave above, the deadlock I just described might not exactly be the case here.    But as per the original post's image (https://social.msdn.microsoft.com/Forums/getfile/1368162), second result set, you can see that each of the two has a lock and is waiting for the opposite lock.  (I just picked two of the locks to prove it is a deadlock, but I can grab more locking detail if it helps)
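
    To make the shape of the two transactions concrete, here is a simplified sketch of the pattern just described (table, column, and staging names are invented; the real procedures are more involved):

        -- Process 38 (service broker activation): update O, then MERGE into P,
        -- all in one transaction
        BEGIN TRANSACTION;

        UPDATE dbo.O                              -- takes and holds update locks on O
        SET    LastUpdated = SYSUTCDATETIME()
        WHERE  ObjectId IN (SELECT ObjectId FROM #IncomingPrices);

        MERGE dbo.P WITH (HOLDLOCK) AS tgt        -- blocks here on P, held by 104
        USING #IncomingPrices AS src
              ON tgt.ObjectId = src.ObjectId AND tgt.PriceDate = src.PriceDate
        WHEN MATCHED THEN
            UPDATE SET Price = src.Price
        WHEN NOT MATCHED THEN
            INSERT (ObjectId, PriceDate, Price)
            VALUES (src.ObjectId, src.PriceDate, src.Price);

        COMMIT;

        -- Process 104 (SSIS): MERGE into P from another staged source; the FK
        -- from P to O makes SQL Server read O, which is where it waits on 38
        MERGE dbo.P WITH (HOLDLOCK) AS tgt
        USING dbo.StagedPrices AS src
              ON tgt.ObjectId = src.ObjectId AND tgt.PriceDate = src.PriceDate
        WHEN MATCHED THEN
            UPDATE SET Price = src.Price
        WHEN NOT MATCHED THEN
            INSERT (ObjectId, PriceDate, Price)
            VALUES (src.ObjectId, src.PriceDate, src.Price);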



    Friday, December 7, 2018 3:46 AM
  • OK, but I was all excited about your idea that it was some kind of spin-lock that only looked like a deadlock.

    24 hours sounds pretty solid.

    I've seen weird behaviors where big parallel plans get hung behind a single task that manages to spit out a row every few seconds - though not recently, and I don't think I ever really diagnosed the cause, just kicked stuff around until it stopped happening.

    Of course, as soon as you start using HOLDLOCK you get range locks between keys and all sorts of additional deadlocks; but if your transactions are big enough to last even a minute or two, that's probably too many row locks anyway, so you're going to end up with lock escalations.

    Sounds like you have it pretty well diagnosed.  You can try smaller transactions, or sort (select) the data so the two don't contend for the same range of values at a given time, or just run the updates sequentially; they might run faster that way than fighting each other anyhow.
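
    For the "run them sequentially" option, one common approach is an application lock around both merge paths; a sketch (the lock resource name is made up):

        BEGIN TRANSACTION;

        -- Both merge paths take the same application lock, so they simply
        -- queue behind each other instead of interleaving
        DECLARE @rc int;
        EXEC @rc = sp_getapplock @Resource    = 'PriceMergeSerializer',
                                 @LockMode    = 'Exclusive',
                                 @LockOwner   = 'Transaction',
                                 @LockTimeout = 60000;   -- give up after 60 s
        IF @rc < 0
        BEGIN
            ROLLBACK;
            RETURN;   -- handle the timeout/deadlock result codes as appropriate
        END;

        -- ... UPDATE O / MERGE P work goes here ...

        COMMIT;   -- the app lock is released with the transaction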

    Josh

    Friday, December 7, 2018 8:16 AM
  • I mean, it's still the case that the deadlock isn't getting detected, so something is up.  I think the lock output conclusively indicates that it is a deadlock and not just "slow".  But that one "thread" waiting on CXPACKET keeps having a small wait time, which is also interesting (if not in fact the cause of the non-detection).  Again, the issue isn't the deadlock itself; the issue is that deadlock detection isn't kicking in to kill one of the connections.





    Friday, December 7, 2018 4:44 PM
  • >I mean, it's still the case that the deadlock isn't getting detected, so something is up.  I think the lock output conclusively indicates that it is a deadlock and not just "slow".  But that one "thread" waiting on CXPACKET keeps having a small wait time, which is also interesting (if not in fact the cause of the non-detection).  Again, the issue isn't the deadlock itself; the issue is that deadlock detection isn't kicking in to kill one of the connections.

    Right.

    Do you have any block alerts?  Those should fire on the other threads, anyway.
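
    For reference, a minimal sketch of that kind of alerting: raise blocked-process events past a threshold and capture them with Extended Events (the session name is arbitrary):

        -- Report any block that lasts longer than 20 seconds
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'blocked process threshold (s)', 20;
        RECONFIGURE;

        -- Capture the reports with an Extended Events session
        CREATE EVENT SESSION BlockedProcessReports ON SERVER
        ADD EVENT sqlserver.blocked_process_report
        ADD TARGET package0.event_file (SET filename = N'BlockedProcessReports');

        ALTER EVENT SESSION BlockedProcessReports ON SERVER STATE = START;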

    Josh

    Friday, December 7, 2018 8:30 PM
  • No, it's not something we're interested in on that server.
    Friday, December 7, 2018 9:55 PM
  • This is definitely different from the issue I had in mind. For one thing, beta_lockinfo does see the deadlock. You see those DD in the blkby columns - that is the deadlock right there. In the EXECSYNC case, you don't see this. (Because beta_lockinfo relies on sys.dm_os_waiting_tasks, where the blocking information is missing in that case.)

    So in this case, SQL Server fails to roll back. Off the top of my head, I can think of two reasons:

    1. It tries, but as you suggest, it gets stuck in a spinlock.
    2. The thread that is supposed to check for deadlocks every five seconds has gone into nirvana.

    Have you checked the SQL Server errorlog for interesting messages?

    To investigate the spinlock theory, you could sample sys.dm_os_spinlock_stats to see if something is constantly growing.
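
    A sketch of that sampling, assuming two snapshots a minute apart are enough to show steady growth:

        -- Snapshot the counters, wait a minute, then diff against a new read
        SELECT name, collisions, spins, backoffs
        INTO   #spins_before
        FROM   sys.dm_os_spinlock_stats;

        WAITFOR DELAY '00:01:00';

        SELECT s2.name,
               s2.spins    - s1.spins    AS spins_delta,
               s2.backoffs - s1.backoffs AS backoffs_delta
        FROM   sys.dm_os_spinlock_stats AS s2
        JOIN   #spins_before AS s1 ON s1.name = s2.name
        ORDER BY spins_delta DESC;

        DROP TABLE #spins_before;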

    To test the other theory - well, you could restart SQL Server and try it again. If it still gets stuck - well, that would be interesting.

    I should add that I have never seen (or at least not paid attention to) this CXCONSUMER wait.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Friday, December 7, 2018 10:43 PM
  • Thanks, Erland.  Well, from sys.dm_os_spinlock_stats, LOCK_HASH is huge (results below). I think that's consistent with my theory, which is that it is backing off (yielding) before the deadlock thread can detect it as a deadlock, instead of blocking indefinitely.

    I created a simple deadlock of my own just now and confirmed that the deadlock thread is still alive and detecting in the general case.
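
    For completeness, a sketch of that kind of sanity check - the classic two-session, cross-table update deadlock (table names are throwaway):

        -- Setup (once)
        CREATE TABLE dbo.DL_A (id int PRIMARY KEY, val int);
        CREATE TABLE dbo.DL_B (id int PRIMARY KEY, val int);
        INSERT dbo.DL_A (id, val) VALUES (1, 0);
        INSERT dbo.DL_B (id, val) VALUES (1, 0);

        -- Session 1
        BEGIN TRAN;
        UPDATE dbo.DL_A SET val = 1 WHERE id = 1;
        -- (run session 2's first UPDATE now, then continue here)
        UPDATE dbo.DL_B SET val = 1 WHERE id = 1;   -- blocks...

        -- Session 2
        BEGIN TRAN;
        UPDATE dbo.DL_B SET val = 2 WHERE id = 1;
        UPDATE dbo.DL_A SET val = 2 WHERE id = 1;   -- ...and within a few seconds
                                                    -- one side gets error 1205 if
                                                    -- the deadlock monitor is healthy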

    CXCONSUMER is an attempt in SQL 2016 SP2+ to segregate CXPACKET into "stuff you can do something about, and stuff you can't" (https://blogs.msdn.microsoft.com/sql_server_team/making-parallelism-waits-actionable/)

    name                  collisions   spins           spins_per_collision   sleep_time   backoffs
    LOCK_HASH             643387184    6247580627081   9710.452              228955       190928188
    X_PACKET_LIST         24544031     26563161353     1082.266              64365        3781069
    FCB_REPLICA_SYNC      216335       12241601926     56586.32              600          56098
    SOS_SUSPEND_QUEUE     3015818      5531886117      1834.29               2180         632734
    CONNECTION_MANAGER    18832103     5302481031      281.566               2012         270058


    Friday, December 7, 2018 10:57 PM
  • >Do you have any block alerts?

    >No, it's not something we're interested in on that server.

    LOL - try it, you might like it!

    Have you looked at the total times on your lock waits?  Of course you may need to clear the totals or run differentials to get good numbers.
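
    A sketch of looking at (and clearing) those totals, using the standard wait-stats DMV:

        -- Total lock wait times since the counters were last cleared
        SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
        FROM   sys.dm_os_wait_stats
        WHERE  wait_type LIKE 'LCK_%'
        ORDER BY wait_time_ms DESC;

        -- Reset the counters to start a clean differential
        DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);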

    I've written my own little block probes to sample and report on even relatively short blocks, and it often turns up interesting things.
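
    In the same spirit, a tiny sketch of such a probe; it just takes one sample, so scheduling and logging are left to you:

        -- One sample of current blocking; run it every few seconds from a job
        -- and insert the rows into a logging table if you want history
        SELECT SYSUTCDATETIME()       AS sampled_at,
               r.session_id,
               r.blocking_session_id,
               r.wait_type,
               r.wait_time,
               DB_NAME(r.database_id) AS database_name,
               t.text                 AS sql_text
        FROM   sys.dm_exec_requests AS r
        CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
        WHERE  r.blocking_session_id <> 0;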

    Josh


    Saturday, December 8, 2018 1:26 AM
  • The deadlock consists of the waits you see in sys.dm_os_waiting_tasks (as presented by beta_lockinfo), and those are not spinlock waits. But the spinlock appears to prevent the deadlock from being resolved.

    As for CXCONSUMER, I very timely learned about it today while at SQL Saturday in Prague.

    I would try rebooting SQL Server, then set up for the deadlock again and see if the situation reappears. In that case, I would consider distilling a repro that can be included in a bug report.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Saturday, December 8, 2018 10:43 PM