Hello again… I came across something weird recently while trying to understand some strange behaviors we'd seen on our RDS instances in AWS. Here's some background.
We have 18 RDS instances in AWS (various sizes) spread across 3 environments: test, stage and production. All 18 of them are using Multi-AZ and SQL Server 2017 Standard Edition. With this setup, Multi-AZ uses database mirroring (DBM) under the covers to implement the required degree of high availability, which means all databases must be in the FULL recovery model. We had 2 other cases opened with AWS Support for various issues regarding DBM on these instances, where we noticed some strange behaviors we couldn't explain; as a side effect of those behaviors, DBM experienced quite a few role changes per day, in some cases also causing short timeouts.
In theory, these role changes should not last long enough to cause any noticeable harm; I just wanted to be aware of such incidents and be able to correlate them with other possible complaints from users.
So, I wanted to know when these role changes were happening and I started looking for ways to accomplish this.
Option #1 – SQL Server Error log
My first and obvious choice was to check the SQL Server error log. While in theory this should do the trick just nicely, here's the catch: there's a bug in the RDS engine which causes the SQL Server error log to be cycled whenever it feels like it. There's no predictable pattern to this, and there's no way to turn it off and implement your own cycling job (as I am used to). In some cases, I've seen the error log cycled up to 12 times in a 24-hour span, sometimes a few times per minute. So, with only the last 30 error logs available, you can imagine how reliable this method was. This was a no-go. At least for me…
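For completeness, this is roughly what I would have run if the error log had been usable. It's a minimal sketch using the long-standing (if undocumented) sp_readerrorlog procedure, assuming the instance lets you call it; the search string is just my guess at what the role-change messages would contain, so adjust it to whatever your error log actually shows.

-- Sketch: search the current SQL Server error log for mirroring-related messages.
-- Parameters: log index (0 = current log), log type (1 = error log), search string.
EXEC sp_readerrorlog 0, 1, N'mirroring';

-- Older logs can be checked by bumping the index, e.g. the previous one:
EXEC sp_readerrorlog 1, 1, N'mirroring';

With the log being recycled several times a day, though, whatever you're searching for has a good chance of already being gone.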
Option #2 – Use DBM system views
My second option was to use the DBM system views and poll the current principal's server name every 15 minutes, then compare the returned value with the last known principal. If a change was detected, some notification mechanism would have to be designed (because, you see, RDS does not allow Database Mail either) to raise a yellow flag somehow.
Polling the principal name should do the trick because AWS claims that, in our setup, they offer what they call a "single-box experience". In other words, if DBM decides to fail over any of the databases due to some resource pressure, the RDS automation kicks in and fails over *all* databases to the secondary node, effectively promoting it to principal.
Because of all of the above (DBM, single-box experience, etc.), AWS stated that in our case we should have 1 principal and 1 secondary.
The script I used to poll the name of the principal server is this:
;WITH dbm AS
(
    SELECT
        @@SERVERNAME AS [server_name],
        DB_NAME([database_id]) AS [database_name],
        [mirroring_state_desc],
        [mirroring_role_desc],
        [mirroring_safety_level_desc],
        [mirroring_partner_instance],
        [mirroring_partner_name]
    FROM sys.database_mirroring
    WHERE [mirroring_role] IS NOT NULL   -- only databases that take part in DBM
)
SELECT DISTINCT                          -- one row per distinct role reported across all DBs
    GETUTCDATE() AS [UTCDate],
    [server_name],
    [mirroring_partner_instance] AS [partner_name],
    [mirroring_role_desc]
FROM dbm;
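To turn this into an actual role-change detector, the idea was to persist each poll and compare it with the previous one. Here's a rough sketch of what the 15-minute job step could look like; the dbo.DbmRoleHistory table name is made up for illustration, and the "notification" is just a PRINT placeholder since, as mentioned, Database Mail isn't an option on RDS.

-- Hypothetical table that keeps the result of every poll
CREATE TABLE dbo.DbmRoleHistory
(
    [UTCDate]             datetime     NOT NULL,
    [server_name]         sysname      NOT NULL,
    [partner_name]        sysname      NOT NULL,
    [mirroring_role_desc] nvarchar(60) NOT NULL
);
GO

-- Job step: read the last recorded role, capture the current one, log it, compare
DECLARE @current_role  nvarchar(60),
        @previous_role nvarchar(60);

SELECT @previous_role = [mirroring_role_desc]
FROM dbo.DbmRoleHistory
WHERE [UTCDate] = (SELECT MAX([UTCDate]) FROM dbo.DbmRoleHistory);

-- With the single-box experience, every mirrored DB should report the same role,
-- so grabbing any one row is enough for the comparison
SELECT TOP (1) @current_role = [mirroring_role_desc]
FROM sys.database_mirroring
WHERE [mirroring_role] IS NOT NULL;

INSERT INTO dbo.DbmRoleHistory ([UTCDate], [server_name], [partner_name], [mirroring_role_desc])
SELECT DISTINCT GETUTCDATE(), @@SERVERNAME, [mirroring_partner_instance], [mirroring_role_desc]
FROM sys.database_mirroring
WHERE [mirroring_role] IS NOT NULL;

IF @previous_role IS NOT NULL AND @previous_role <> @current_role
    PRINT 'DBM role change detected';   -- placeholder for whatever alerting you can rig up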
But, as can be seen from the screenshots below, for 2 of the instances the script reported 2 secondaries (instance and server names have been hashed, of course… I want to keep my job, you know… 🙂 )
I opened a case with AWS and they were able to confirm that all our replicas were actually on a single secondary. I also posted a question on Stack Overflow, but got no answers. Here's the post.
Some prior actions that may or may not have contributed to this:
- immediately after the instances were online, we started deploying some empty databases, mainly for monitoring purposes (data stores for monitoring tools)
- once these were deployed, we started moving and consolidating some ERP databases on these instances
After all the above actions completed, we noticed those role changes and sometimes some timeouts. After 2 support cases with AWS, we learned that:
- whenever you want to deploy new databases on a similar setup, make sure you deploy them with the FULL recovery model (see the sketch after this list)
- if you create more than a certain number of databases in SIMPLE recovery in a short time frame (for example, deploying empty DBs via scripts, as I did), the RDS automation (or something inside the RDS engine) doesn't cope really well with all this and may decide that it's safer to just rebuild DBM for all the DBs. And if you have tens of DBs, like we do, that's gonna leave a mark
- this happens even when you deploy empty DBs on an instance that already has DBs in active use – we saw it in both scenarios (empty instance and instance already in use)
- RDS engine cycles the bejeezus out of the error log – just look for the usual message “the error log has been reinitialized…” on an RDS instance.
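To make that first point concrete, this is roughly how I'd deploy a new database on one of these instances now. The database name is made up for illustration, and the explicit ALTER is there on the assumption that the instance's model database might still hand out SIMPLE:

-- Hypothetical deployment of a new (empty) monitoring database.
-- Set FULL recovery explicitly, right after creation, so the RDS Multi-AZ
-- automation can add it to mirroring without flipping the recovery model later.
CREATE DATABASE [MonitoringStore];
GO
ALTER DATABASE [MonitoringStore] SET RECOVERY FULL;
GO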
And of course, AWS says that it may be a bug in the SQL Server engine and they have nothing to do with it. While I tend to agree with them up to a certain point, I am also fairly certain that all those role changes, if happening too often, may also have contributed to the state of confusion SQL Server eventually ended up in with regard to the number of secondaries.
The offered solution was to restart the instances, which I didn't want to do yet, just in case I decide to open a case with MS. And funny… erm… weird enough, one of the instances where the DBM system views initially reported 2 secondaries now seems to report just one.
I am curious whether anybody else encountered this and have some more information / details to share.
Thanks for reading.