I was running some tests on our SQL cluster and noticed a problem that causes the cluster log to report event IDs 1069 and 1205.
My cluster configuration is as follows:
3 HP DL380G7 (cluster-01, cluster-02, cluster-03)
Windows 2008 R2 x64 on each of these servers
I created a SQL Server failover cluster with three instances:
instance 1: IST01, preferred owner node cluster-01, failover on node cluster-02, cluster-03
instance 2: IST02, preferred owner node cluster-02, failover on node cluster-03, cluster-01
instance 3: IST03, preferred owner node cluster-03, failover on node cluster-01, cluster-02
Each server has 72 GB of memory, and each SQL Server instance has its maximum memory limit set to 24 GB (so in the worst case all three instances can run on a single node: 24+24+24 = 72 GB).
Every server uses iSCSI LUNs on our existing SAN.
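To make the worst-case arithmetic above easier to check, here is a rough Python sketch of the owner lists and the memory committed per node. It assumes a deliberately simplified placement rule (each instance lands on the first node in its list that is still up); the real cluster service may decide differently, and the names `place` and `committed_memory` are only illustrative.

```python
# Owner lists as described above: preferred owner first, then failover nodes.
PREFERRED_OWNERS = {
    "IST01": ["cluster-01", "cluster-02", "cluster-03"],
    "IST02": ["cluster-02", "cluster-03", "cluster-01"],
    "IST03": ["cluster-03", "cluster-01", "cluster-02"],
}
MAX_MEMORY_GB = 24   # max memory configured per SQL Server instance
NODE_MEMORY_GB = 72  # physical memory per node


def place(owners_by_instance, down_nodes):
    """Simplified model: pick the first node in each list that is not down."""
    return {
        name: next((n for n in owners if n not in down_nodes), None)
        for name, owners in owners_by_instance.items()
    }


def committed_memory(placement):
    """Sum the configured max memory of the instances hosted on each node."""
    usage = {}
    for node in placement.values():
        if node is not None:
            usage[node] = usage.get(node, 0) + MAX_MEMORY_GB
    return usage


if __name__ == "__main__":
    # Worst case: two nodes down, everything runs on the surviving node.
    placement = place(PREFERRED_OWNERS, {"cluster-01", "cluster-02"})
    for node, used in committed_memory(placement).items():
        print(node, "committed:", used, "GB, headroom:", NODE_MEMORY_GB - used, "GB")
```

In the worst case the headroom comes out at 0 GB, which means the 24 GB limits account for all the physical memory and leave nothing reserved for Windows itself.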
As I said, I was running some tests, so I moved IST01 from node 1 to node 2 to simulate a failover. Everything was OK.
I tried different combinations (IST03 from node 3 to node 1, IST01 from node 1 to node 2, and so on).
The problem arises when I try to move instance IST03 from node cluster-03 to cluster-02. I get 2 events in the cluster event log (event IDs 1069 and 1205), the instance goes down for a couple of seconds, and then it comes back online on node cluster-03.
EventID 1069 reports: "Cluster resource 'SQL Server (IST03)' in clustered service or application 'SQL Server (IST03)' failed."
EventID 1205 reports: "The Cluster service failed to bring clustered service or application 'SQL Server (IST03)' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application."
I searched online, but I still haven't found a good explanation of these errors.
My cluster passes the cluster validation process with 100% success, and every update (both Windows and SQL Server) is installed.
Every server has the same software installed (I double-checked the BIOS and driver revisions of every peripheral).
Does anyone have a good explanation of these 2 event IDs, and possibly an idea of where I can start looking to solve this problem?
Thanks for any help
I tried, from the command line:
cluster log /g
On node cluster-01 I don't have any problems. It writes the cluster.log file.
On node cluster-02 I get an error creating the cluster.log:
C:\Users\sitfly>cluster log /g
Generating the cluster log(s) ...
The cluster log could not be created on node 'cluster-sql-02'...
System error 5 has occurred (0x00000005).
Access is denied.
The cluster log has been successfully generated on node 'cluster-sql-01'...
The cluster log has been successfully generated on node 'cluster-sql-03'...
System error 5 has occurred (0x00000005).
Access is denied.
I'm connected via Remote Desktop using a Domain Admin account, and every SQL service (SQL Server, SQL Browser, etc.) runs under the same Domain Admin account.
I looked at the cluster.log created on cluster-01 and found only these errors/warnings (repeated 24 times every 5 minutes since 9 AM this morning, but I ran the failover test yesterday, so I don't think they are related to my test):
00001bc8.000014d4::2013/04/05-09:53:04.197 ERR [RHS] s_RhsRpcCreateResType: ERROR_NOT_READY(21)' because of 'Startup routine for ResType MSMQTriggers returned 21.'
0000109c.00001748::2013/04/05-09:53:04.197 WARN [RCM] Failed to load restype 'MSMQTriggers': error 21.
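When the same entries repeat many times, it can help to count them per level and component instead of reading the whole cluster.log. Below is a minimal Python sketch; the regular expression is inferred only from the two lines quoted above (process.thread IDs, timestamp, level, bracketed component, message), so treat the pattern and the `parse_line` / `summarize` helpers as assumptions rather than a complete description of the log format.

```python
import re
from collections import Counter

# Pattern inferred from the two sample lines; other line shapes are skipped.
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it doesn't match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines, levels=("ERR", "WARN")):
    """Count matching entries per (level, component) pair."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] in levels:
            counts[(entry["level"], entry["component"])] += 1
    return counts


SAMPLE = [
    "00001bc8.000014d4::2013/04/05-09:53:04.197 ERR [RHS] s_RhsRpcCreateResType: "
    "ERROR_NOT_READY(21)' because of 'Startup routine for ResType MSMQTriggers returned 21.'",
    "0000109c.00001748::2013/04/05-09:53:04.197 WARN [RCM] Failed to load restype "
    "'MSMQTriggers': error 21.",
]

if __name__ == "__main__":
    for (level, component), count in summarize(SAMPLE).items():
        print(level, component, count)
```

To run it against a real file, pass the open file object to `summarize` in place of `SAMPLE`; the path to cluster.log depends on the node.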
Those are not related and can be ignored. If you're not using MSMQ you can remove it: http://blogs.msdn.com/b/clustering/archive/2013/04/05/10408075.aspx
Looks like you're having some permissions issues on cluster-sql-02.
Thanks for your suggestion, Sean.
I will remove MSMQ during the next "maintenance window".
About permissions: I've noticed that on the problematic node, even though I'm using a Domain Admin account, I have to open the command prompt with "Run as administrator" to be able to execute cluster log /g.
On the other two nodes, I don't have to do this...
In the meantime, I noticed that one user was missing the "run as service" ability (it is the user that one of our applications uses). I have already added it and am waiting for the next time our developer can move the application to this node.
Is there anything else I could check while I'm waiting for the next test?
I rebooted after changing UAC (I checked, and I no longer have the problem when running the cluster log /g command).
I tried to move the instance IST03 to node 02.
I noticed that, during the move process:
- the resource "server name" goes offline and immediately online on node 02
- all three resources under "disk drives" go offline and online on node 02
- the resource under "file share resources" goes offline and online on node 02
- the resource SQL Server (IST03) under "other resources" goes offline and, after a second or two, a failed message appears and all the resources go back offline and then online on the old node (node 03)
In the log (Recent Cluster Events) I don't see anything apart from the message I posted at the beginning (event ID 1069).
I checked the configuration of the three nodes again, and I don't notice any difference (now that I have fixed the UAC "problem").
Any other suggestions?
Thanks for your time
I finally found the problem.
On node 02, the third instance of SQL Server was installed with the wrong name: instead of IST03, it was installed as ITS03 and later renamed to IST03 (at least this is what I found digging into the registry: I found keys for both IST03 and ITS03, but neither set of keys was complete).
I will remove node 02 from the cluster, remove the IST03 instance completely, reinstall it with the correct name, and add the node to the cluster again, hoping this fixes the problem (I already tried fixing the registry keys manually but, as I suspected, it didn't work).
I will let you know if it works.
Thanks again for your help.
PS: the event log didn't show anything apart from what I posted in the original post, and it wasn't very helpful...
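For anyone who hits the same symptom, a rough first check is to compare the instance names registered on each node. The Python sketch below reads the `Instance Names\SQL` registry key, where SQL Server lists its installed instances, and reports any node whose names differ from the expected set. The helper names and the sample data are hypothetical, remote reads need the Remote Registry service and suitable rights, and this only looks at the top-level instance list, so it would not catch every incomplete key described above.

```python
# Registry key (under HKLM) where SQL Server lists installed instance names.
INSTANCE_NAMES_KEY = r"SOFTWARE\Microsoft\Microsoft SQL Server\Instance Names\SQL"


def read_instance_names(node=None):
    """Return the set of instance names registered on a node (Windows only).

    node=None reads the local machine; otherwise pass a UNC-style
    computer name to read the registry remotely.
    """
    import winreg  # imported lazily so the comparison below runs on any OS

    names = set()
    with winreg.ConnectRegistry(node, winreg.HKEY_LOCAL_MACHINE) as hive:
        with winreg.OpenKey(hive, INSTANCE_NAMES_KEY) as key:
            value_count = winreg.QueryInfoKey(key)[1]
            for index in range(value_count):
                names.add(winreg.EnumValue(key, index)[0])
    return names


def compare_instances(expected, found_by_node):
    """Report, per node, instance names that are missing or unexpected."""
    report = {}
    for node, found in found_by_node.items():
        missing = sorted(set(expected) - set(found))
        unexpected = sorted(set(found) - set(expected))
        if missing or unexpected:
            report[node] = {"missing": missing, "unexpected": unexpected}
    return report


if __name__ == "__main__":
    expected = {"IST01", "IST02", "IST03"}
    # Hypothetical data shaped like this thread: node 02 carries a stray name.
    found_by_node = {
        "cluster-01": {"IST01", "IST02", "IST03"},
        "cluster-02": {"IST01", "IST02", "IST03", "ITS03"},
        "cluster-03": {"IST01", "IST02", "IST03"},
    }
    print(compare_instances(expected, found_by_node))
```

On a real cluster, `found_by_node` would be built by calling `read_instance_names` once per node instead of using the hard-coded sample.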