SQL 2012 AlwaysOn cluster IP not moving after failover, causing database to be read-only
-
Thursday, February 07, 2013 12:47 AM
SQL Server Cluster Name: SQLDAG01
SQL Server Cluster IP: 10.0.0.50
Cluster Listener IP: 10.0.0.60Node 1 Name: SQL01
Node 1 IP: 10.0.0.51Node 2 Name: SQL02
Node 2 IP: 10.0.0.52Everything is fine when SQL01 is the primary. When failing over to SQL02, everything looks fine in the dashboard but for some reason the cluster IP, 10.0.0.50, is stuck on node 1. The databases are configured to provide secondary read access. When executing a query on SQLDAG01, I get an error that the database is in read-only mode. Connectivity tests verify that SQLDAG01, 10.0.0.50, connects to SQL01 even though SQL02 is now the primary.
I've been Googling this for the better part of the day with no luck. Any suggestions? Is there a Powershell command force the cluster IP to move to the active node or something? Also I'm performing the failover as recommended, from Management Studio connected to the secondary node.
- Edited by tandrews851 Thursday, February 07, 2013 12:48 AM
- Edited by tandrews851 Thursday, February 07, 2013 12:54 AM
All Replies
-
Thursday, February 07, 2013 3:50 PMModeratorDo you see anything obvious from the Cluster.log when the failover is happening?
Satya SKJ, Moderator - SQL Server MVP [Knowledge Sharing Network - www.sqlserver-qa.net]
Author of SQL Server 2008 R2 Administration cookbook.
Follow me @sqlmaster. -
Thursday, February 07, 2013 8:33 PMNo, this is a 2008 server so it doesn't use cluster.log. I generated one manually and there's nothing there besides informational messages. The Cluster Events doesn't mention anything other than the failover was initiated manually. There is no mention of the cluster IP or any other resource failing to move over.
-
Friday, February 08, 2013 11:35 PM
There is no feature called AlwaysOn - that's a marketing term which covers FCIs and AGs. Is an FCI part of your AG configuration? If it is, the Listener will always be where the FCI is and will only move if the FCI fails to another node.
And what do you mean by Cluster IP - again is it an FCI, or just the WSFC?
Allan Hirt Blog: http://www.sqlha.com/blog Coming in 2013: Mission Critical SQL Server 2012 - the followup to Pro SQL Server 2008 Failover Clustering
-
Saturday, February 09, 2013 1:04 AM
No I believe it's just a regular DAG, how can I tell for sure? The secondary node says "Synchronized" which would seem to indicate log shipping rather than shared storage. I'll have to login and double check though. Even with FCI, if you manually fail it over to the secondary, wouldn't the listener move the the secondary as well? That's how MSCS has always worked for me in the past. Whichever node owns the service/application holds the cluster name & IP. That just seems like basic cluster functionality.
This is my first experience with SQL 2012 "AlwaysOn" but I am pretty familiar with Cluster Services. It just doesn't seem to be working the way it should. I'm waiting to get a test environment provisioned so I can setup a non-production copy from scratch (I didn't configure this environment) so I can see what the differences are between one that works properly and this one.
-
Saturday, February 09, 2013 1:42 AM
You're all over the map in terms of terminology. An availability group (AG, not DAG - that's an Exchange thing) does not need to have FCIs. AlwaysOn is not a feature. An AG can be configured between standalone instances on servers which are WSFC (not MSCS - that is old and incorrect) nodes. If that is the case, you do have the ability to move the Listener via FCM or command line. The primary replica will now be wherever you moved the Listener and where it was will now be a secondary replica. This is not the recommended or preferred way to do things. If you do have an FCI, that can't happen. The Listener is always wherever the FCI is running. If it is separated, you're in a bad state.
The WSFC name and IP have nothing to do with this.
Allan Hirt Blog: http://www.sqlha.com/blog Coming in 2013: Mission Critical SQL Server 2012 - the followup to Pro SQL Server 2008 Failover Clustering
- Edited by Allan HirtMVP Saturday, February 09, 2013 1:48 AM
-
Sunday, February 10, 2013 12:44 AM
Hi there,
FCI is automatic. AG is not automatic. Fail over must be manual in 2012. It may change in future versions.
Please check books online.
thanks
kumar
-
Monday, February 11, 2013 6:35 AM
Thats not exactly true Kumar. AGs can be set automatic as well.
@Tandrew: Can you attach us a screenshot of the cluster group that the AG is in? (Screenshot from Failover Cluster Manager)
-
Tuesday, February 12, 2013 8:33 PMI'm just waiting to hear back from the client on whether they want to keep troubleshooting this. I was brought in as a Sharepoint consultant to see why it was read-only and some other issues (all related to this), turns out it was on the 2nd node after a power outage and that's when I discovered this issue, and verified it by failing over manually through SQL Mgmt Studio multiple times, always with the same result. They're highly secure so I can't just login whenever I want, I have to request access. If they get back to me I'll post more information but they're up and running just fine on node 1 right now and may not want to mess with it for a while. I was also advised to stop the cluster service on node 1 and see if that forces the cluster IP to move, rather than initiating it cleanly.
-
Wednesday, February 13, 2013 12:33 PMTO me that says they don't know what they are doing. The WSFC IP/name does not have to be where the AG or any other cluster resource is unless you want or need it to be there. That's the whole point. In a geographically dispersed scenario there are some advantages to having the core cluster group where your resources are, but in a normal scenario? No.
Allan Hirt Blog: http://www.sqlha.com/blog Coming in 2013: Mission Critical SQL Server 2012 - the followup to Pro SQL Server 2008 Failover Clustering
-
Wednesday, February 13, 2013 6:00 PM
I'm sorry but that doesn't make any sense to me. Like I said this is my first experience with SQL 2012 AlwaysOn clustering, but I have configured clustering many times in the past on other platforms (SQL 2008, Exchange 2007) and on those clusters, the active node has a NIC bound to the cluster IP, it's one of the cluster resources, and when a failover occurs the secondary node receives the cluster IP, it is physically bound to the NIC, and that node responds to requests on that IP on the network layer. Has something changed with 2012? I thought maybe it used a method similar to Exchange where it updates the SRV records in DNS and clients are repointed to the secondary's IP, eliminating the need for a shared IP address but that does not seem to be the case.
You say "The WSFC IP/name does not have to be where the AG or any other cluster resource is" but how does the node receive and reply to requests to a IP address that's not physically bound to its NIC? Is there some sort of proxy that I don't know about that handles requests and forwards them to the active node? If so, it's not working. Also I've read the documentation and I didn't see any mention of how this works on a network level.
Like I said, I'm used to the active node having the cluster IP bound to its NIC, please explain how SQL 2012 acts differently than this. Thanks
-
Wednesday, February 13, 2013 9:52 PM
With a WSFC, it gets its own IP address. The underlying nodes have IP addresses. When you install a SQL Server FCI, it gets its own IP address. Let's take the FCI case with a 2-node cluster. The WSFC cluster group is on Node A as well as the FCI. A disk fails on Node A which is currently hosting the FCI. The cluster group is not affected. The FCI fails over to Node B. The cluster group stays on Node A. If the node itself failed, the cluster group and the FCI would move to Node B.
Again, you are using wrong terminology so if you put things in the right context maybe it wold mak. With availability groups (AlwaysOn is NOT a feature), there is no shared storage requirement. So you can have standalone instances of SQL Server installed on what are WSFC nodes. So far, so good. The listener component of an AG is technically optional, but let's assume you configure it. Let's also assume 3 nodes (A, B, C). The cluster group is on A and the primary/listener is attached to the instance on A. Problem happens - server is fine, but causes a blip with the listener and it fails to B. The cluster group can still be on A if the failure didn't affect it.
This isn't an FCI - it's a whole new paradigm you need to wrap your head around.
Allan Hirt Blog: http://www.sqlha.com/blog Coming in 2013: Mission Critical SQL Server 2012 - the followup to Pro SQL Server 2008 Failover Clustering
- Proposed As Answer by PrinceLuciferMVP Wednesday, February 13, 2013 9:55 PM
-
Wednesday, February 13, 2013 9:58 PM
I think there is a lot of confusion around the correct Terms... And unless we find to a Point where we all talk about the same things we will end up giving each other confusing answers. But as far as I understood both Points now Allan summed it up quite correctly:
1) The Windows Cluster itself has an IP. SQL Clients should NOT connect there.
2) A SQL Server Failover Cluster Instance (AlwaysOn FCI) will have an IP. THIS is the IP you Need to connect to
3) A SQL Server AlwaysOn Availability Group CAN have an IP (which is called a "Listener" in SQL), which should be the one you are connecting to when using Availability Groups.
- Marked As Answer by tandrews851 Sunday, March 10, 2013 7:48 PM
-
Thursday, February 14, 2013 6:03 AM
Bingo. This is why it's so hard to have conversations. Even MS uses the terminology wrong at times which doesn't help.
And let's not get started on quorum with AGs if you've only got two nodes :) Another topic for another day. How that is configured AND what OS you're using (say W2K8 R2 vs W2012) may mean different things.
Allan Hirt Blog: http://www.sqlha.com/blog Coming in 2013: Mission Critical SQL Server 2012 - the followup to Pro SQL Server 2008 Failover Clustering
-
Sunday, March 10, 2013 8:10 PM
This was the answer, it had been setup to use the cluster name instead of the application name. Whoever installed Sharepoint connected it to SBTSQLDAG01 instead of SHAREPOINT01. Once we changed Sharepoint to connect to SHAREPOINT01, the failover worked as expected. We did have a secondary issue with the ARP cache and had to install the hotfix from http://support.microsoft.com/kb/2582281 to resolve it. One of the Sharepoint app servers was failing to ping the SQL node after a failover, the ARP entry was stuck pointing to the previous node. This article actually helped a lot resolving that: http://blog.serverfault.com/2011/05/11/windows-2008-and-broken-arp/
One thing I did notice is that the SQL failover wizard does not move cluster groups "Available Storage" and "Cluster Group", I had to move those through the command line after using the wizard. I'm going to provide the client with a Powershell script that moves all cluster groups when they need to do a manual failover. This also happens to be why the Sharepoint issue started, "Cluster Group" is what responds to the cluster name SBTSQLDAG01. Moving that group over to the node that has the active SQL cluster group also made it work properly, but using the application name is the correct method.
Thanks everyone for all your help. Although the nitpicking about terminology really didn't help, that was a pointless argument and we really could have done without it. Yeah I know 2008 called is "Failover Cluster Manager" and MSCS is the "2003 term" but really, they're basically the same thing and we don't really need to derail the conversation because of it. Also, If you look at the screenshot below you can clearly see "AlwaysOn High Availability" in SQL Management Studio. That's what it's called in SQL, that's where you do all the work. Trying to tell me it's "not a feature" is wrong, pointless, and asinine, and doesn't get us anywhere.
Sorry it took so long to get back, I was off the project for a couple weeks while they were resolving some SAN issues that caused the failover to happen in the first place.

