Mass Scheduled Snapshot Jobs Failure

  • Question

  • We are using merge replication and have set up snapshot jobs to regularly re-create users' snapshots in preparation for re-initialisations etc.  The jobs are all scheduled to execute at the same time, with max_concurrent_dynamic_snapshots set to 10 so that they should simply queue and work through the list.

     

    This seems to work with less than 80 or so snapshot jobs.  All of the jobs are executed, with 10 starting their snapshot build while the rest report that they are waiting as there are too many concurrent jobs.  Once one of the first 10 finishes, the next starts until it has worked through all of the jobs.

     

    However, when we have more than 100 jobs defined, the scheduler attempts to start them all and immediately errors on large numbers of them, i.e. they don't sit waiting for their turn; they report as an error immediately in the Job Activity Monitor with a nondescript message: "The replication agent encountered a failure.  See the previous job step history message or Replication Monitor for more information.  The step failed".  Neither the previous job step nor Replication Monitor shows any further information.

     

    When we manually run the failed jobs individually they work fine without error.  It appears to happen only when large numbers are queued simultaneously, relying on the max concurrent snapshot setting to manage load, and only when there are more than about 100 (we're not sure of the exact threshold and suspect it varies between environments).

     

    Has anyone else encountered this?  Is it something we've set up incorrectly, or is there a bug in the replication agent?

     

    Monday, March 10, 2008 10:34 PM

Answers

All replies

  • Can you run this on the distributor?

    select max_worker_threads from msdb.dbo.syssubsystems where subsystem='snapshot'

    Tuesday, March 11, 2008 12:50 AM
  •  

    You can configure a higher value for worker threads through the registry.

    See the link below for the details you need to solve your problem.

    http://support.microsoft.com/kb/306457/en-us

    Tuesday, March 11, 2008 1:28 AM
  • Thanks Gopal Ashok,

     

    I ran the suggested query on the distributor (which in this case is the same as the publisher) and it reported 800.  On our test system, where we also reproduced this issue, it is set to 200.

     

    Should I try increasing this to a higher number?  If so, how do I do it and what is a safe range to try?

    Tuesday, March 11, 2008 2:30 AM
  • Emanuel Peixoto, thanks for the post about the registry entries.  I've read the linked article but it doesn't sound quite the same, in that we're not seeing the jobs queued; instead they are erroring immediately.  Have you seen this behaviour?

     

    Tuesday, March 11, 2008 2:34 AM
  • Yes, we have verified that running many distribution agents has a bad impact on running snapshot jobs. In our case, higher values for worker threads didn't solve it; our solution was to split the replication jobs between other named instances. To do that we had to configure new distributors in new instances and then reconfigure some subscriptions to use them.

    Tuesday, March 11, 2008 3:33 AM
  •  Emanuel Peixoto wrote:

    Yes, we have verified that running many distribution agents has a bad impact on running snapshot jobs. In our case, higher values for worker threads didn't solve it; our solution was to split the replication jobs between other named instances. To do that we had to configure new distributors in new instances and then reconfigure some subscriptions to use them.

    Tuesday, March 11, 2008 7:06 AM
  • 800 is more than enough. Can you verify whether you are running into the desktop heap limitation? You can check the event log for the following error:

     

    "Event Type: Information
    Event Source: Application Popup
    Event Category: None
    Event ID: 26
    Description:
    Application popup: snapshot.exe - Application Error : The application failed to initialize properly (0xc0000142). Click on OK to terminate the application."

    Tuesday, March 11, 2008 3:24 PM
  • Yes, we do see those events in the event log; it looks like one for each failure.

     

    So if it is the desktop heap limitation, what does that mean and how do we work around it?

     

    Thanks again.

    Tuesday, March 11, 2008 9:25 PM
  • You can do one of three things, I recommend the first one:

    1. Stagger the starting of the jobs so they don't all start simultaneously

    2. Increase desktop heap size - http://msdn2.microsoft.com/en-us/library/ms152544.aspx

    3. Upgrade to a 64-bit OS (I'm not 100% sure of this, but I heard this is a workaround)
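
    As a rough illustration of option 1, the sketch below spreads a list of jobs over start-time slots so that no more than a fixed number share the same start time. The batch size of 10, the 5-minute gap, and the job names are assumptions chosen for the example, not values from this thread, and applying the resulting times to the actual SQL Server Agent schedules is left out.

    ```python
    from datetime import datetime, timedelta


    def staggered_start_times(job_names, first_start, batch_size=10, gap_minutes=5):
        """Assign each job a start time so at most `batch_size` jobs share a slot."""
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        schedule = {}
        for index, name in enumerate(job_names):
            slot = index // batch_size  # 0 for the first batch, 1 for the next, ...
            schedule[name] = first_start + timedelta(minutes=slot * gap_minutes)
        return schedule


    if __name__ == "__main__":
        # Hypothetical job names; real names would come from the Agent job list.
        jobs = [f"snapshot_job_{n:03d}" for n in range(1, 121)]
        plan = staggered_start_times(jobs, datetime(2008, 3, 12, 1, 0))
        for name in (jobs[0], jobs[10], jobs[119]):
            print(name, plan[name].strftime("%H:%M"))
    ```

    Matching the batch size to max_concurrent_dynamic_snapshots keeps the number of agents launched at any one moment close to what the publication will actually run; the gap would need tuning to how long a typical snapshot takes.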

    Wednesday, March 12, 2008 7:51 PM
  •  

    Thanks Greg Y and Gopal Ashok for the assistance and suggestions.  I've tried increasing the heap size and that appears to resolve the issue, although I agree with Greg that staggering the jobs is probably a better solution.

     

    Do you know of any side effects to increasing the heap size?  On my test server it was 512k and I simply increased that to 1024, so it's not like it's a large amount of memory.
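
    For reference, a change like the one described (512 to 1024) can be sketched as a plain string edit. This assumes the heap sizes are stored as the comma-separated SharedSection numbers inside the Windows subsystem registry string, with the third number being the non-interactive desktop heap in KB; the sample string and the function name are illustrative only, and the sketch does not read or write the registry.

    ```python
    def set_noninteractive_heap(windows_value, new_size_kb):
        """Return the string with the third SharedSection number replaced."""
        key = "SharedSection="
        start = windows_value.find(key)
        if start == -1:
            raise ValueError("no SharedSection entry found")
        start += len(key)
        # Walk to the end of the comma-separated run of digits.
        end = start
        while end < len(windows_value) and (
            windows_value[end].isdigit() or windows_value[end] == ","
        ):
            end += 1
        sizes = windows_value[start:end].split(",")
        if len(sizes) < 3:
            raise ValueError("SharedSection has no third (non-interactive) value")
        sizes[2] = str(new_size_kb)  # assumed to be the non-interactive heap, in KB
        return windows_value[:start] + ",".join(sizes) + windows_value[end:]


    if __name__ == "__main__":
        # Illustrative value only; the real string is longer and machine-specific.
        sample = "ObjectDirectory=\\Windows SharedSection=1024,3072,512 Windows=On"
        print(set_noninteractive_heap(sample, 1024))
    ```

    Only the third number changes; the first two values and the rest of the string are passed through untouched.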

    Thursday, March 13, 2008 2:33 AM