none
SQL server 2016 peer to peer transactional replication error: The replication agent has not logged a progress message in 10 minutes

    Question

  • Hi all,

    I am running into this error on the SQL server 2016 developer edition. 

    Set up: 4 nodes in a rectangular peer to peer replication topology, each node acting as a publisher as well as a distributor.

    Error: on one of the nodes it is reporting:The replication agent has not logged a progress message in 10 minutes. This might indicate an unresponsive agent or high system activity. Verify that records are being replicated to the destination and that connections to the Subscriber, Publisher, and Distributor are still active.

    I followed the link below trying to resolve this issue but I believe I can rule out the possibility of "The agent cannot log in to one of the computers in the topology". However, according to the link another possible cause is that the agent is busy, in this case the agent is distribution agent since it is the one that is not responding. I have tried to restart the distribution agent and also added verbose logging. What I am seeing in the log is that distribution agent started initializing but never seems to finish as it is stuck at "agent waiting for Peer-to-Peer commands to apply". 

    Since the node has two subscribers and only one is encountering this issue, I also added the logging on the healthy subscription and log leading up to the agent initializing are identical but the problematic one is just stuck at "agent waiting for peer-to-peer commands to apply". On the replication monitor I do see undistributed commands staying constant which means commands are not being applied. Then what is the distribution agent busy doing then? Any idea on how I can debug this further as I am completely lost and this peer to peer topology technology just seems way too unreliable. 

    https://docs.microsoft.com/en-us/sql/relational-databases/replication/mssql-eng020554?view=sql-server-2016

    Thank you for your help.

    Wednesday, May 8, 2019 3:00 PM

All replies

  • Hi JasperMSSQL,

    From your description, it seems the issue is due to the distribution agent is waiting for commands to apply. So we may suspect there are large transactions on distribution agent or there is performance issue when applying the commands in subscriber.

    Regarding to how to check large transaction:

    How to identify if there are large transactions pending in the distribution agent:
    ========
    Step 1, Get the current transaction(Xact_seqno) which already been replicated to subscriber side. 

    use Subscription_DB
    declare @publisher sysname
    declare @publicationDB sysname
    declare @publication sysname
    set @publisher='Publisher_Server'
    set @publicationDB='Publisher_DB'
    set @publication='Publication_Name'
    
    select transaction_timestamp, *
    From MSreplication_subscriptions
    where publisher=@publisher and publisher_db=@publicationDB and publication=@publication
    --0x00003E9A000085D70006000000000000 This is the transaction that has been replicated to sub already

    Step 2,  Replace the Xact_seqno we get from the above query and go to the distribution database. 

    use distribution 
    select top 100 xx.Xact_seqno, count(*) as Xcount 
    from dbo.msrepl_commands xx with (nolock)
    where xx.publisher_database_id = 4 and  xx.Xact_seqno > 0x00003E9A000085D70006000000000000
    group by xx.Xact_seqno 
    --having count(*)> 100000
    order by xx.xact_seqno

    Notes:
    How to get publisher_database_id. 

    Use Distribution
    select id,* From distribution..MSpublisher_databases where publisher_db = 'Publisher_DB'

    Regarding to how to capture the performance issue 
    How to identify if there is any performance issue when applying the commands in subscriber side:
    =======
    1.Get the session ID (to subscriber) for the distribution agent. 
    2.Capture the profiler trace on subscriber side. 
    Event: RPC completed and SQL:BatchCompleted
                    Filter: SPID = <52>
    3.Capture for few minutes is enough. 

    Step 3, if you have  found out the large transaction,  you can use following commands to show the detailed commands
    sp_browsereplcmds '0x00003E9A000075720007000000000000','0x00003E9A000075720007000000000000'

    Hope this could help you.

    Best regards,
    Cathy Ji


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to  MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Friday, May 10, 2019 11:53 AM
  • Hi Cathy, 

    Thanks for your reply as I will employ your suggestion to debug if the issue arise again. We were able to reproduce the problem. 

    For step 2 where you suggested

    "Capture the profiler trace on subscriber side. 
    Event: RPC completed and SQL:BatchCompleted
                    Filter: SPID = <52>

    "

    I assume the SPID is the session ID found in step 1?

    Also we tear down the node that has its distribution agent stuck at initializing by removing its published database and distribution database and restore it from another node. This seems to "resolve" the issue as distribution database is a fresh one with no pending transactions in them. 

    I am curious on for a peer to peer topology like this, what is the proper way of recovering a node that has got blown away completely? Right now we tried to restore its database from another node, and after rejoining it back to the peer to peer replication topology was when the described issue occurred.

    Friday, May 10, 2019 3:54 PM
  • Hi Cathy,

    We have reproduced the problem again. Same issue when one of the node is rejoined back to the topology. The distribution agent stuck at initializing. This time I took your suggestion and this is what we have:

    Step 1, Get the current transaction(Xact_seqno) which already been replicated to subscriber side

    ran this on the subscriber side and return value is 0x000000000000000000000000000000..

    Step 2,  Replace the Xact_seqno we get from the above query and go to the distribution database

    ran this on the distribution db and got 100 entries (there are more I am pretty sure) and each of them has Xcount above 30.

    As of analyzing performance issue:

    The session id obtained from distribution to subscriber is 126. On subscriber side we ran profile tracer filtering on SPID 126 but did not see anything.

    Please advise.

    Thank you



    Tuesday, May 14, 2019 2:22 AM
  • Do you see any locks on any of the Instances/DBs (Publisher, Distributor, subscriber)?

    How did you obtain spid ID on subscriber? (also check distribution db/instnace as distribution agent connects to distributor and it may stuck there).

    Friday, May 17, 2019 6:46 AM