locked
ASA SQL Connection Fails, causing ASA to stall for many hours. RRS feed

  • Question

  • Hey,

    one of my ASA jobs has been behaving strangely, was wondering if anyone has any insight into it.

    It seems that it hits some form of database related error when trying to write events, then stops processing events altogether for several hours. At some point the system reboots ASA and it reprocesses all the missed events in a massive spike.

    https://social.msdn.microsoft.com/Forums/getfile/731123

    Please note the blue and orange lines flatlining before a massive spike.

    During the affected time, it does not report any incoming our outgoing events at all, but judging by the count of outgoing messages vs. incoming messages during the spike, it did receive events that occured during the flatline.

    The first major outage started at 10/5/2015 6:36:12 PM PST and correlates to these errors:


    Correlation ID:
    2aa76a9f-7a08-487f-b206-528e32ea23d3

    Error:
    WriteToServer requires an open and available Connection. The connection's current state is closed.


    Message:
    SQL output adapter error.

    Message Time:
    2015-10-06 01:36:12Z

    Microsoft.Resources/EventNameV2:
    streamingNode0.sql-event-order-skus-sql-event-order-skus#0$0

    Microsoft.Resources/Operation:
    Send Events

    Type:
    SqlOutputAdapterFailure

    ---

    Correlation ID: 2aa76a9f-7a08-487f-b206-528e32ea23d3 Error: Output adapter has encountered an error after processing [490] batches. Exception message at level [1], exception number [0], parent exception number [0]: WriteToServer requires an open and available Connection. The connection's current state is closed. Message: Output is currently experiencing transient errors. Retrying... Message Time: 2015-10-06 01:36:12Z Microsoft.Resources/EventNameV2: streamingNode0.sql-event-order-skus-sql-event-order-skus#0$0 Microsoft.Resources/Operation: Send Events Type: OutputProcessorTransientError

    ---

    Correlation ID:
    2aa76a9f-7a08-487f-b206-528e32ea23d3
    
    Error:
    
    
    Message:
    Azure SQL DB x - Write 147 events timed out after 300 seconds. Retrying...
    
    Message Time:
    2015-10-06 01:36:12Z
    
    Microsoft.Resources/EventNameV2:
    streamingNode0.sql-event-order-skus-sql-event-order-skus#0$0
    
    Microsoft.Resources/Operation:
    Send Events
    
    Type:
    SqlOutputAdapterFailure
    

    ----

    Correlation ID: 2aa76a9f-7a08-487f-b206-528e32ea23d3 Error: WriteToServer requires an open and available Connection. The connection's current state is closed. Message: Output adapter has encountered an error after processing [490] batches. Message Time: 2015-10-06 01:36:12Z Microsoft.Resources/EventNameV2: streamingNode0.sql-event-order-skus-sql-event-order-skus#0$0 Microsoft.Resources/Operation: Send Events Type: OutputAdapterError

    ----

    And the last error is

                        
    Correlation ID:
    2aa76a9f-7a08-487f-b206-528e32ea23d3
    
    Error:
    
    
    Message:
    First Occurred: 10/06/2015 01:36:12 | Resource Name: SQL-Event-Order-SKUs | Message: Encountered error trying to write 147 event(s): WriteToServer requires an open and available Connection. The connection's current state is closed.
    
    Message Time:
    2015-10-06 01:36:14Z
    
    Microsoft.Resources/EventNameV2:
    controller
    
    Microsoft.Resources/Operation:
    Diagnostic: Diagnostic Error
    
    Type:
    DiagnosticMessage
    

    Only after the service was (automatically) restarted twice did it start to work again.

    https://social.msdn.microsoft.com/Forums/getfile/731126

    Right now it's suffering another outage. Situation seems to be a repeat from before. Same error messages and pattern.

    Correlation ID:
    2aa76a9f-7a08-487f-b206-528e32ea23d3
    
    Error:
    WriteToServer requires an open and available Connection. The connection's current state is closed.
    
    
    Message:
    SQL output adapter error.
    
    Message Time:
    2015-10-08 09:20:43Z
    
    Microsoft.Resources/EventNameV2:
    streamingNode0.sql-event-order-products-sql-event-order-products#0$0
    
    Microsoft.Resources/Operation:
    Send Events
    
    Type:
    SqlOutputAdapterFailure
    

    Any idea what's going on, and how to make it perform reliably?

    Thank you,

    Felix

    Thursday, October 8, 2015 10:05 PM

Answers

All replies

  • Hi Felix,

    I have no answer, but a couple of questions:) :

    • In your post you wrote that "one" of your ASA jobs had been behaving strangely; at the time of the outage, did you have more than one job running? If so, were the other jobs affected as well?
    • Can you see anything on the SQL side, i.e. were SQL down at that time?
    • In your post you say the processing was down for several hours, however when looking at the error messages, (if I understand you correctly) the first error message was at 2015-10-06 01:36:12Z and last error message: 2015-10-06 01:36:14Z. Or am I reading you incorrectly?

    Thanks!

    Niels


    http://www.nielsberglund.com | @nielsberglund

    Friday, October 9, 2015 3:36 AM
  • Hey Niels,

    • I have 4 ASA jobs running on that Event Hub, all connecting to the same SQL Server. The other 3 jobs are not affected.
    • The SQL side seems fine. The 3 other jobs had no trouble writing data, and we have a system in place that automatically reads out new records which also has not reported any issues.
    • All 4 of the errors I posted seem to be what initiates the outage. The first outage was resolved at 2015-10-07 08:05:43Z. The second outage which started at 2015-10-08 09:20:43Z is still ongoing. Please see https://social.msdn.microsoft.com/Forums/getfile/731568 . Somehow I can not get my account validated to post pictures, so all I can do is post plaintext links... sorry about that.

    Thank you for your help!

    Felix

    Friday, October 9, 2015 7:10 PM
  • Hi Felix,

    Do you see any other errors beside the SQL timeout and connection closed being reported before these two? Without anything else I can only guess that there might be some throttling issues happening on SQL Azure but that depends on how much data you're trying to push and how often. See more information here:

    https://azure.microsoft.com/en-us/documentation/articles/sql-database-resource-limits/

    Can you tell me in what Azure data center/region is this job running so I can take a look at service logs?

    Thanks


    This posting is provided "AS IS" with no warranties, and confers no rights

    Friday, October 9, 2015 8:54 PM
  • Hey Silviu,

    the database is a Azure SQL Data Warehouse @ 100 DWU.

    I'm looking at my ASA jobs, and some seem to have errors, but only the job in question has completely stalled so far. Still, I'd of course feel better if the jobs didn't have errors to begin with.

    The Azure Data Center Region is US-West.

    The errors I get are either SendEvent Status: Failed or Diagnostic: Diagnostic Error Status: Failed.

    I will paste a sample of the errors I've seen:

    Correlation ID:
    a6e9897e-ba06-49cc-90e3-412c58f5c2aa
    
    1)
    
    Message:
    Azure SQL DB x, [db].[table] - Write 683 events timed out after 300 seconds. Retrying...
    
    2)
    
    Error:
    Cannot access destination table 'table'.
    Exception message at level [1], exception number [0], parent exception number [0]:
    Cannot open database 'db' on server 'server' requested by the login. Access to the database is only allowed using a security-enabled connection string.
    Changed database context to 'database'.
    A severe error occurred on the current command.  The results, if any, should be discarded.
    
    
    Message:
    Output adapter has encountered an error after processing [1074] batches.
    
    3) 
    
    Error:
    Output adapter has encountered an error after processing [1074] batches.
    Exception message at level [1], exception number [0], parent exception number [0]:
    Cannot access destination table 'table'.
    Exception message at level [2], exception number [0], parent exception number [0]:
    Cannot open database 'db' on server 'server' requested by the login. Access to the database is only allowed using a security-enabled connection string.
    Changed database context to 'db'.
    A severe error occurred on the current command.  The results, if any, should be discarded.
    
    
    Message:
    Output is currently experiencing transient errors. Retrying...
    
    4)
    Diagnostic: Diagnostic Error
    First Occurred: 10/09/2015 21:00:05 | Resource Name: SQL-Event-ProductSnapshot | Message: Encountered error trying to write 238 event(s): Cannot access destination table 'Event_ProductSnapshot'.
    
    5)
    Correlation ID:
    d538645f-b8f5-44d6-81a6-77391548b301
    
    Error:
    Cannot access destination table 'table'.
    Exception message at level [1], exception number [0], parent exception number [0]:
    The instance of SQL Server you attempted to connect to does not support CTAIP.
    A severe error occurred on the current command.  The results, if any, should be discarded.
    
    
    Message:
    SQL output adapter error.
    
    6)
    Output adapter has encountered an error after processing [155] batches.
    Exception message at level [1], exception number [0], parent exception number [0]:
    Cannot access destination table 'table'.
    Exception message at level [2], exception number [0], parent exception number [0]:
    The instance of SQL Server you attempted to connect to does not support CTAIP.
    A severe error occurred on the current command.  The results, if any, should be discarded.
    
    7)
    Error:
    Cannot access destination table 'table'.
    Exception message at level [1], exception number [0], parent exception number [0]:
    Cannot open database 'db' on server 'table' requested by the login. Access to the database is only allowed using a security-enabled connection string.
    Changed database context to 'db'.
    A severe error occurred on the current command.  The results, if any, should be discarded.
    
    
    Message:
    SQL output adapter error.


    I also investigated things on the Azure SQL Data Warehouse side of things.

    It seems more and more like something broke on the SQL DW side of things, and ASA can not deal with it.

    I ran the diagnostics query from the DW documentation and found the following:

    ASA sent a request to save events to DW, the Request_SubmitTime is 10/8/2015 9:15:38 AM and the Request_Status is Running.

    If I am reading the diagnostics right, it seems that DW internally stalled on a concurrency lock.

    The request to save events requested a UserConcurrencyResourceType lock on the row to insert, which was acquired right away, but it somehow failed to continue after acquiring the lock.

    Please see: https://social.msdn.microsoft.com/Forums/getfile/731599

    And it seems it never cancelled or aborted, so it remains running and is holding up all additional activity due to the lock it's acquired.

    Should I post about this in the SQL Data Warehouse forums?

    And would you recommend that I try to end the stalled query, or should I leave it for now so that the SQL DW Team can take a look at it and perhaps figure out what caused this permanent stall?

    Finally, would you be able to validate my account so I can post images? 

    Thank you,

    Felix

    Edit:

    In the SQL DW Documentation I found more specific steps and queries to debug issues like this.

    The Query stalled on creating a temporary table.

    The request_id is QID1156791, step_index is 1, operation_type is OnOperation, the location_type is Compute and the command is 

    CREATE TABLE [tempdb].[dbo].[15b05bdeab4d46d898c75dd1563ad4c8] ([Id] INT NOT NULL, [Name] NVARCHAR(200) COLLATE SQL_Latin1_General_CP1_CI_AS, [Code] NVARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS, [Price] MONEY, [Quantity] INT, [Type] NVARCHAR(250) COLLATE SQL_Latin1_General_CP1_CI_AS, [Recommended] INT, [EventId] CHAR(36) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [EventProcessedUtcTime] DATETIME2(7) NOT NULL ) WITH(DATA_COMPRESSION=PAGE);

    The documentation then suggests a query to "Find the execution progress of a SQL Step"

    -- Find the distribution run times for a SQL step.
    -- Replace request_id and step_index with values from Step 1 and 3.
    
    SELECT * FROM sys.dm_pdw_sql_requests
    WHERE request_id = 'QID1156791' AND step_index = 1;

    However, this query returns NO RESULTS.

    So that's where the trail ends. I hope that's useful.

    • Edited by FAKastner Friday, October 9, 2015 10:28 PM Added more info
    Friday, October 9, 2015 10:12 PM
  • Hi Felix,

    As you said, this seems like an issue on the SQL Data Warehouse side. Are you able to create a support ticket using the instructions here: https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-get-started-create-support-ticket/?

    If not, please tell us your server name, database name and timestamp of this issue and we investigate the issue.

    Thanks!

    • Marked as answer by FAKastner Tuesday, October 13, 2015 7:44 PM
    Tuesday, October 13, 2015 5:32 PM
  • Hey,

    it seems the error resolved itself over the weekend, as wait / lock seems to have disappeared from the SQL Data Warehouse and the Stream Analytics job has resumed sending events.

    I'll make sure to make a ticket with SQL DW directly should this happen again.

    Thank you very much for the help!

    Felix

    Tuesday, October 13, 2015 7:44 PM
  • Thanks for the update Felix!
    Thursday, October 15, 2015 6:02 PM