Nebraska SQL from @DBA_ANDY

Who Left Those Old Components in My New Database???

Recently a client requested some new transactional replication publications on a relatively new pre-production server.  The client has plenty of transactional and snapshot replication in their environment, including two other publications on this instance already, so this was just adding more to the existing pile.

Piece of cake, right?

https://s-media-cache-ak0.pinimg.com/564x/f2/de/6d/f2de6d3610642b866edcf76f7f86129a.jpg
I went through the New Publication Wizard (Shoot me - it is often the easiest way to handle a new "normal" publication) but this time I was greeted with a pair of errors:


Synonym 'dbo.MSreplication_subscriptions' refers to an invalid object.(Microsoft SQL Server, Error: 5313)


Object Reference not set to an instance of an object, (ConfigureDistWizard)

Um...OK...

My first instinct was that something was broken with the tools on the existing server - I RDP'ed to a different client server, and received the same errors when trying to create a publication for this server.

Hmm...

I have had experiences before where the REPL wizards just don't work (such as creating subscriptions to a higher version publication) so I said, "FINE, I'll just use the code!"

...but the sp_addpublication query threw me the same problem.

https://cdn.meme.am/cache/instances/folder61/250x250/71896061.jpg

At this point I figured that something must have been wrong with the replication components themselves - the server was new and not *really* production yet so maybe something had been fouled at setup.
***NOTE*** - I did *not* take the one step I should have at this point - since there were existing publications on a different user database, I should have paused and considered what was wrong with *this particular database* rather than with the instance at large - it would have saved us roughly 24 hours.
I advised the client I would script out the two existing publications and then completely disable and re-enable publishing and distribution on the instance and try again.

No good...

At this point the client suggested blowing away the VM and setting it up again - I could take backups of the existing user databases and restore them after they recreated the VM from scratch.  I told them  I could continue to investigate but they decided to go ahead with the rebuild.

Fast forward to the next day...

http://fandompodcast.com/wp-content/uploads/2016/01/tune-in-tomorrow.jpg

Even after the server rebuild, I found that the problem with the Publication Wizard still existed!  More than a little research found that it was due to the particular database still having some old replication components in it – older than the current version and therefore not cleaned up when I ran sp_removedbreplication across the user databases to restart replication setup from scratch.  Because the problem lived in one of the user databases, it survived the rebuild – we had simply restored the old databases onto the new server!

I found the source of the problem (the old components) via the methodology described here http://willtonic.blogspot.com/2012/03/sql-2005-sp4-and-msreplicationsubscript.html and then mitigated it by renaming the two offending objects:

--

USE <dbname>
GO

EXEC sp_rename 'MSreplication_subscriptions', 'MSreplication_subscriptions_old'
GO

EXEC sp_rename 'MSsubscription_agents', 'MSsubscription_agents_old'
GO

--

This resolved the issue and allowed the New Publication Wizard to move forward. 
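As a footnote, if you suspect a database is carrying this kind of leftover baggage, a quick check along these lines (my own sketch, not the exact methodology from the post linked above) will show whether the old objects exist and whether any synonyms no longer resolve:

--

USE <dbname>
GO

-- Look for the leftover replication objects by name
SELECT name, type_desc, create_date
FROM sys.objects
WHERE name IN (N'MSreplication_subscriptions', N'MSsubscription_agents');

-- OBJECT_ID() returns NULL when a synonym's base object no longer exists
SELECT name, base_object_name, OBJECT_ID(base_object_name) AS resolved_object_id
FROM sys.synonyms;

--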

In the post-mortem, we found that the impacted database had originally been provided by a vendor and that it almost certainly had been migrated from an older version of SQL Server, hence the old components.

--

I want to draw attention to the fact that the blog post I referenced above doesn't have anything to do with errors in the replication New Publication Wizard.  It discusses a "bug" in SQL Server 2005 the author encountered while trying to apply a 2005 service pack.  I found this post simply by broadening my search criteria from my specific error of "Synonym 'dbo.MSreplication_subscriptions' refers to an invalid object" down to simply "dbo.MSreplication_subscriptions error".

When troubleshooting (and searching for information online in general), never discount something simply because it doesn't match your specific situation.  You may find, as in this case, that there is some shared underlying problem that caused a certain set of symptoms for you but manifested as a different yet similar set of symptoms for someone else.  If you are looking for a specific error or message and not finding anything, cast your net a little wider by reducing your criteria - always start with your specific message, but if you don't get anything useful, instead of searching for "Really Long Error Message about this Object and Some Other Stuff" try looking for "<object> <other stuff> error/warning/whatever".

As always, proceed with caution, but this is true even when you *do* find a blog post/newsgroup post/Connect item/etc. that *does* exactly match your situation - just because renaming an index fixed Error 123456 for someone else doesn't necessarily mean it will for you - always consider the potential impact of any course of action *before* deploying it anywhere, even to Dev/Test.

https://imgflip.com/i/1mvtnu

Hope this helps!





T-SQL Tuesday #89 – The O/S It is A-Changing



It's T-SQL Tuesday time again - the monthly blog party was started by Adam Machanic (blog/@AdamMachanic) and each month someone different selects a new topic.  This month's cycle is hosted by Koen Verbeeck (blog/@Ko_Ver) and his chosen topic is "The Times They Are A-Changing" considering changing technologies and how we see them impacting our careers in the future. (His idea came from a great article/webcast by Kendra Little (blog/@Kendra_Little) titled "Will the Cloud Eat My DBA Job?" - check it out too!)


--

I remember my first contact with this topic came a long time ago (relatively) from a very reliable source having a little April Fool's Day fun:
SQL Server on Linux
By Steve Jones, 2015/12/25 (first published: 2005/04/01)
A previous April Fools joke, but one of my favorites. Enjoy it today on this holiday - Steve.
I can't believe it. My sources at Microsoft put out the word late yesterday that there is a project underway to port SQL Server to Linux!!
The exclusive word I got through some close, anonymous sources is that Microsoft realizes that they cannot stamp out Linux. Unlike OS/2 and the expensive, high end Unices, Linux is here to stay. Because of the decentralized work on it, the religious-like fever of its followers, and the high performance that it offers, the big boys, maybe just one big boy, at Microsoft have given in to the fact that it will forever be nipping at the heels of Windows.
And they know that all the work being done on clients for Exchange means that many sites that might want to switch the desktop, may still keep the server on Exchange and have a rich client front end that takes the place of Outlook. But don't rule against Outlook making a run at the Linux platform.
SQL Server, however, is a true platform that doesn't need a client to run against. With SQL Server 2005 and its CLR integration, the platform actually expands into the application server space and can support many web development systems on a single server, within two highly integrated applications: SQL Server and IIS.
And with MySQL nipping away at many smaller installations that might have switched to SQL Server before, it's time to do something. So a top secret effort has been porting the CLR using Mono, along with the core relational engine to the Linux platform. The storage engine requires a complete rewrite as the physical storage model of Linux is radically different from that of Windows.
Integration Services, the next evolution of DTS, has already been ported with some impressive numbers of throughput. Given that many different sources need a high quality ETL engine, and a SQL Server license to handle this is still cheaper than IBM's Information Integration product and most other tools, there's hope that Integration Services will provide the foothold into Linux.
No word on pricing, but from hallway conversation it's looking like there will be another tier of licensing that will be less expensive than the Workgroup edition without CALs. There is still supposed to be a small, free version without Integration Services that competes against many small LAMP installations.
There isn't a stable complete product yet, heck, we don't even have the Windows version, but our best guess is that a SQL Server 2005b for Linux will release sometime in mid to late 2006 and the two platforms will then synch development over the next SQL Server cycle.
Moving onto a free platform is a difficult and challenging task. Especially when you are coming from the proprietary software world. But Microsoft did it once before by moving into the free world of the Internet and taking on Netscape. I'm betting they will do it again.
And I'm sure...
that SQL Server...
is the best...;
April Fools!!!

Over ten years ago Steve Jones (blog/@way0utwest) of SQLServerCentral thought he was kidding but was actually looking forward into the future in which we are about to live.

--

The topic came up again on April Fool's Day a few years later...
How to run SQL Server on Linux
Posted on April 1, 2011 by retracement
UPDATE 2011-04-10: PLEASE NOTE THIS WAS AN APRILS FOOL JOKE 🙂 
Well I have finally cracked it, after years of trying (and a little help from the R2 release) to get SQL Server to install and run happily on the Linux Platform. I could hardly believe it when it all came together, but I’m very pleased that it has. What is even more exciting is that the performance of the SQL Engine is going through the roof.
I am not entirely sure why this is, but I am assuming it is partly related to the capabilities of the EXT4 filesystem and the latest Linux Kernel improvements.
So here’s how I did it :-
  • Install WINE 

  • Install the .NET 3.5 framework into WINE. This is very important, otherwise the SQL Server 2008R2 installer will fail. 

  • Change a couple of WINE settings to allow for 64bit M.U.G extensions. 
  • Install the Application Release 1 for Linux Free Object Orientated Libraries by sudo apt-get install aP-R1l\ f-0Ol
Ensure that you run setup.exe from SQL Server R2 Enterprise Edition – please note that SQL Server 2008 Release 1.5 will not work, and I additionally had problems with the Standard Edition of R2 (but not entirely sure if there is a restriction here with SQL Standard Edition Licensing on  Linux).
SQL running happily in WINE on Linux Mint 10 (x64)
I think that the EXT4 filesystem is key to the success of your installation, since when I attempted this deployment using the EXT2 and EXT3 filesystems, SQL Server appeared to have issues installing.
I hope to provide more instructions and performance feedback to you all over the coming months. Enjoy!
Microsoft Certified Master Mark Broadbent (blog/@retracement) gave it a different spin - instead of a spoofed Microsoft announcement he set it up as "look what I did!" and didn't confirm the April Fool until several days later.

--

Microsoft got in on the act in March of 2016 (notably *not* on April Fool's Day) with a blog post headed with this logo:


Suddenly, it was all real.

--

When I read the announcement and subsequent blog posts and saw the follow-up webcasts, I thought one thing to myself over and over.

WHAT DO I DO NOW?

I am an infrastructure DBA and have been for over seventeen years.  One of my strengths has always been my ability to look beyond SQL Server and dig into the operating system and its components, such as Windows Clustering.  

I don't know anything about Linux (other than recognizing this guy):

https://en.wikipedia.org/wiki/Tux
--

I am moderately ashamed to admit I did what many people do in this situation, and what I know many DBA's are still doing on this topic...

http://www.theecologist.org/siteimage/scale/800/600/388011.png
I quietly ignored it and went about my life and job, putting off the problem until later.

Time passed and Microsoft released a "public preview"/CTP of what they began calling "SQL Server vNext" for Linux, and it became more real.  Then they released another, and another - as of this writing the current CTP is version 1.4 (download it here).

I recently realized I hadn't progressed past my original query:

WHAT DO I DO NOW?

--

I have determined that there are three parts to the answer of this question:

(1) I need to continue to play with the CTP's and be ready to embrace the as-yet-unannounced RTM version.  Everything I have seen so far has greatly impressed me with how similar the look and feel truly is to "normal" SQL Server on Windows.  Obviously there isn't a GUI-style Management Studio for non-GUI Linux, but connecting to a Linux-hosted SQL Server instance from a Windows-client Management Studio is basically seamless.  From a Linux host we can use SQLCMD and it looks and feels just like Windows SQLCMD.

(2) I need to commit myself to learning more about Linux, at least at the base levels.  I have written numerous times before about the never-ending quest for knowledge we are all on, and this is just the next topic to pursue in a long line.  I was happy to see that there are many classes on Pluralsight related to Linux and Linux administration.  This subject has been something that was down there on my priority list below Powershell and a few other things, but it needs to move up and will do so.

(Side note - if you don't have a Pluralsight subscription, I think you should - it is an awesome resource full of recorded classes on an unbelievable number of topics, including many classes on SQL Server.  You can get a ten-day free trial membership here.)

(3) I need to find more excuses to work with colleagues and potentially with clients on this new technology.  None of us exist in a vacuum, and everyone has a different spin on a given topic.  I recently attended a Microsoft workshop on this topic and it was very interesting to see the different insights from not only the variety of MS speakers but also the other workshop attendees.  Interacting with peers and clients will help me learn even more quickly by providing their insights and by requiring me to research and test to answer their questions.

--

As with all technology changes, this definitely *will* change our jobs, and like many other things in the SQL Server world, the answer to how it will affect any given individual is:

https://3.bp.blogspot.com/-TEjSrM9xdHA/Vr6kx1jKSeI/AAAAAAABBkI/B_dxYc8AX7Y/s1600/itdepends.jpg

The impact to you will be determined by your ability to adapt and learn - if you are willing to embrace change (a dirty phrase to many DBA's) and learn new things, this will be an exciting time - not only will you need to learn new things, you will find yourself interacting with a new class of I.T. Professionals you may have never spoken to before - the Linux Administrator.

https://catmacros.files.wordpress.com/2009/08/whoa_thats_epic_cat.jpg
Hope this helps!


It's Just Another 9002...another error like all the rest...


11:30pm on a Saturday night, and the pager went off…

Error: 5901, Severity: 16, State: 1.
One or more recovery units belonging to database 'DB_1' failed to generate a checkpoint. This is typically caused by lack of system resources such as disk or memory, or in some cases due to database corruption. Examine previous entries in the error log for more detailed information on this failure.

That phrase – “due to database corruption” – always makes my stomach turn just a little…

https://img.memesuper.com/9b7a67d76f5de79439a92cc9ad703dda_im-coming-elizabeth-its-the-im-coming-elizabeth-meme_500-384.jpeg

Thankfully, this turned out to *not* be corruption. <Whew!>

There was a related message:

Could not write a checkpoint record in database DB_1 because the log is out of space. Contact the database administrator to truncate the log or allocate more space to the database log files.

…and then the real problem…

Error: 9002, Severity: 17, State: 9.
The transaction log for database DB_1 is full due to 'AVAILABILITY_REPLICA'.

--

Upon cursory examination, DB_1 was suspended (not moving data) in Sales_AG, with the AG being primary on node ServerA.  The other databases in the AG were all transferring data without apparent issue.

I went to check the drive hosting the LDF log file and almost broke out laughing at midnight – it was a mount point that was 367MB.

That’s right, megabytes.

The DB_1 database was just over 1GB, with a 350MB LDF file.

Being an Availability Group, there wasn’t shared storage, so I went to the B node to look and see if something was sitting on the drive on that side – some extra file that was taking up space and preventing the log file from auto-growing, but I didn’t find one.

--

The regular Database Index Maintenance instigated this situation.  The largest object (table) in DB_1 is only 69MB, but the transaction LOG backup job only runs every fifteen minutes.  In a database this small it is quite possible that the index rebuild will cycle through all of the tables in the database between LOG backups, which would drive the LDF file's size out toward the used space of the MDF data file, in this case 1GB.

In this specific situation, the LDF log file grew by 10% repeatedly until it hit 350MB.  At that point, there wasn’t room for another 10%/35MB so it couldn’t grow any more.

On a regular (non-availability group) database this would manifest as the 9002 errors for log full messages with a log_reuse_wait_desc of LOG_BACKUP – the most likely fix would be to take LOG backups, shrink the LDF log file if needed, and be done.  That was not the case here.

(Note that the log_reuse_wait_desc here wasn’t LOG_BACKUP, it was AVAILABILITY_REPLICA.  This shows that something was breaking the situation from the AG side before the lack of LOG_BACKUP could even be raised as a problem, but I didn’t notice that at the time.)

--

I added a second LDF log file (not a great answer but often a useful triage to get things moving) and that allowed DB_1 to function, but it did not allow the availability group data movement to resume.  To try to repair the situation, I removed DB_1 from the Sales_AG availability group.  (Note that this does not prevent applications from accessing the database, but rather it would not allow DB_1 to fail over to the B node if a failover occurred – but in the current state where data movement was suspended and non-resumable, a failover wouldn’t work anyway.)
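For reference, the triage above boils down to something like this (a sketch only - the new file's path and size are illustrative, and the availability group command has to be run on the primary replica):

--

-- Temporary second log file to get transactions flowing again (path/size illustrative)
ALTER DATABASE DB_1
ADD LOG FILE (NAME = N'DB_1_log2', FILENAME = N'F:\LOG\DB_1_log2.ldf', SIZE = 100MB);

-- Remove the database from the availability group (run on the primary)
ALTER AVAILABILITY GROUP Sales_AG REMOVE DATABASE DB_1;

--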

I ran the LOG backup job manually multiple times and then tried to re-add the DB_1 database to the availability group – that action gave me a telling message that there wasn’t room on the B node to restore the LDF log file:

System.Data.SqlClient.SqlError: There is insufficient free space on disk volume 'X:\LOG' to create the database. The database requires 368050176 additional free bytes, while only 356515840 bytes are available.

 What I found surprised me even more than finding a 367MB mount point in the first place on ServerA – the mount point on ServerB was only 343 MB!

This case with the availability group made the situation especially nasty – by growing to 350MB the LDF log file was now past the size of the 343MB mount point on the B node.  This automatically suspended data movement for the DB_1 database (not the whole availability group – just the single database) while it was waiting for sufficient space to grow the LDF on the secondary node.

I have dealt with lots of clustering/mirroring/AG’s over the years, and I have never tripped over this particular issue before.  With traditional failover clustering, you have shared storage, so this isn’t an issue; with a “no shared storage” design, this is a real risk for both mirroring and AG’s, and in this case it was exacerbated by the tiny size of the mount.

I was able to shrink the LDF log file of DB_1 to 310MB, which fit inside the 343MB mount point on the B node.  At that point I was able to re-add DB_1 to Sales_AG, and after it was successfully back in the availability group the database showed in the monitor as successfully moving data from A>>B.

I removed the secondary LDF log file that I had added as it is not recommended to have multiple log files as a regular operating state. #TheMoreYouKnow

As a final step I shrunk the primary LDF log file a little further (shrinking is not great but is sometimes necessary) and then modified the FILEGROWTH for the file to grow in 10MB increments (rather than %).  I then manually capped the MAXSIZE of the file at 335MB so that if the file did fill again, it wouldn’t break the availability group in this same way.  The database would still throw 9002’s, but they would be for LOG_BACKUP, not AVAILABILITY_REPLICA.
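Pulled together, that cleanup looks roughly like this (a sketch - the logical log file names are assumptions since they weren't in the original error messages, and the shrink target is illustrative):

--

USE DB_1
GO

-- Drop the temporary second log file once it is empty
ALTER DATABASE DB_1 REMOVE FILE DB_1_log2;

-- Shrink the primary log file back down (target size in MB)
DBCC SHRINKFILE (N'DB_1_log', 300);

-- Fixed growth increments and a cap below the 343MB mount on the secondary
ALTER DATABASE DB_1
MODIFY FILE (NAME = N'DB_1_log', FILEGROWTH = 10MB, MAXSIZE = 335MB);

--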

I warned the client that they really needed to expand the mount point a little (storage is cheap, right?) to cover the needs of the index maintenance and to match the sizes.  I also noted that if the mount point were expanded, the manual cap would need to be lifted before the DB_1 LDF log file could take advantage of any extra space.

--

The cautionary tale here is about the mismatched storage – as unlikely as it may seem to have 343MB and 367MB mount points, the exact same thing could happen on a system with 343GB and 367GB mounts with a file trying to grow past the smaller of the two sizes.

Configurations on AG partners don’t have to be exactly the same, but they should be darned close…and for storage – just make them the same, and save yourself this particular headache.

http://img01.deviantart.net/668d/i/2004/260/e/c/headache_by_cat_lovers.jpg


Hope this helps!


9003 is a Scary Number


Recently I moved up to the next consecutive error number, a 9003 error:

-- 
Log Name:      Application
Source:        MSSQLSERVER
Date:          6/15/2017 4:49:59 AM
Event ID:      9003
Task Category: Server
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Server1.domain.net
Description:
The log scan number (139943:2168:0) passed to log scan in database 'DBStuff' is not valid.
 
This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). 
If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.

--

<Cue Scary Music>

http://www.scaryforkids.com/pics/scary-music.jpg

--

Like the 9002, this is a seemingly scary error with the word “corruption” in it, but when I checked the server it appeared the database was online and I saw connections to DBStuff once I signed on.

The server had been throwing this series of errors constantly in the SQL Error Log (and Windows Application Log) – every minute or less – for more than an hour by the time it was escalated to me!

The DBStuff database was a relatively small database (13GB) so I went to run CHECKDB against it manually from a query window to look for corruption, but I couldn’t even get that far:

--

Msg 5901, Level 16, State 1, Line 1
One or more recovery units belonging to database 'DBStuff' failed to generate a checkpoint.  This is typically caused by lack of system resources such as disk or memory, or in some cases due to database corruption. Examine previous entries in the error log for more detailed information on this failure.

Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.

Msg 1823, Level 16, State 8, Line 1
A database snapshot cannot be created because it failed to start.

Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.

Msg 5030, Level 16, State 12, Line 1
The database could not be exclusively locked to perform the operation.

Msg 7926, Level 16, State 1, Line 1
Check statement aborted. The database could not be checked as a database snapshot could not be created and the database or table could not be locked. See Books Online for details of when this behavior is expected and what workarounds exist. Also see previous errors for more details.

Msg 9003, Level 20, State 9, Line 1
The log scan number (139943:2168:0) passed to log scan in database 'DBStuff' is not valid.
This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf).
If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.
 --

Note that the last error in the series is again the 9003 error shown in the initial Log entry above.

Could there be corruption?

http://s2.quickmeme.com/img/a6/a646c45633cfab95a2aaf727480883aba7de652148fb3601db61853a62d3b3da.jpg

At that point I went Googling and found this from Nathan Courtine at dbi Services:


At first I dismissed the idea because it is talking about replication and I didn’t see any replication on Server1, but when I went to look at the log_reuse_wait_desc as the article mentions, sure enough I found this:

--
SELECT name, log_reuse_wait_desc 
FROM sys.databases
WHERE name = N'DBStuff'
  
name      log_reuse_wait_desc
DBStuff   REPLICATION

--

As described in the article I tried running sp_removedbreplication against DBStuff since there is no apparent replication (publications or subscriptions) but as the author notes this did not resolve the issue in my case.

The fix described in the article involves setting up a new dummy table in the database in question and creating a publication for that dummy table, which will supposedly goose the system and reset the log_reuse_wait_desc from REPLICATION to some “normal” type such as NOTHING or CHECKPOINT.  After this you can remove the replication and the dummy table and be done.

The catch for us was that the replication components weren’t installed on the Server1 instance during its original install!


Microsoft SQL Server Management Studio is unable to access replication components because replication is not installed on this instance of SQL Server. For information about installing replication, see the topic Installing Replication in SQL Server Books Online. 
Replication components are not installed on this server. Run SQL Server Setup again and select the option to install replication. (Microsoft SQL Server, Error: 21028)

Sigh...

--

After mounting media and installing the Replication components the rest was relatively easy - add a dummy table, create a publication, then drop the publication and dummy table - just as described in the referenced post above.
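In rough T-SQL the dummy-publication trick looks something like this (a sketch along the lines of the referenced post - the table and publication names are my own placeholders):

--

-- Assumes replication components are installed and a Distributor is configured
USE DBStuff
GO

CREATE TABLE dbo.ReplDummy (ID int NOT NULL PRIMARY KEY);
GO

-- Enable the database for publishing and create a throwaway publication/article
EXEC sp_replicationdboption @dbname = N'DBStuff', @optname = N'publish', @value = N'true';
EXEC sp_addpublication @publication = N'DummyPub', @status = N'active';
EXEC sp_addarticle @publication = N'DummyPub', @article = N'ReplDummy',
     @source_owner = N'dbo', @source_object = N'ReplDummy';
GO

-- Once log_reuse_wait_desc resets, tear it all back down
EXEC sp_droppublication @publication = N'DummyPub';
EXEC sp_replicationdboption @dbname = N'DBStuff', @optname = N'publish', @value = N'false';
DROP TABLE dbo.ReplDummy;
GO

--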

Eventually we determined that the DBStuff database was restored from another server where it had been involved in a replication topology, which was the source of the offending bits.

--

Hope this helps!


Toolbox - Where Did All My Space Go?

This is the first in a new series of blogs I am going to create talking about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA.  I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles.  I will always give attribution back to the original source and note when I have made modifications.

How do I find these scripts, you may ask?


Google is still the top tool in my toolbox - it never ceases to amaze me what you can find by Googling an error number, or the snippet of a message, or simply an object name.

(Disclaimer - I use Google out of familiarity, but on the occasions I have used Bing the results haven't been much different.)

--

The first script I want to share comes from MSSQLTips.com and was created by frequent MSSQLTips author Ken Simmons (Blog/@KenSimmons) - it shows the free space in each of the database files on the instance and by default comes back sorted by free space descending:

--
/*
https://www.mssqltips.com/sqlservertip/1510/script-to-determine-free-space-to-support-shrinking-sql-server-database-files/
*/
USE MASTER
GO 
 
CREATE TABLE #TMPFIXEDDRIVES (
DRIVE CHAR(1),
MBFREE INT) 
 
INSERT INTO #TMPFIXEDDRIVES
EXEC xp_FIXEDDRIVES 
 
CREATE TABLE #TMPSPACEUSED (
DBNAME VARCHAR(500),
FILENME VARCHAR(500),
SPACEUSED FLOAT) 
 
INSERT INTO #TMPSPACEUSED
EXEC( 'sp_msforeachdb''use [?]; Select ''''?'''' DBName, Name FileNme,
fileproperty(Name,''''SpaceUsed'''') SpaceUsed from sysfiles''') 
 
SELECT
C.DRIVE,
A.NAME AS DATABASENAME,
B.NAME AS FILENAME,
CASE B.TYPE
WHEN 0 THEN 'DATA'
ELSE TYPE_DESC
END AS FILETYPE,
CASE
WHEN (B.SIZE * 8 / 1024.0) > 1000
THEN CAST(CAST(((B.SIZE * 8 / 1024) / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' GB'
ELSE CAST(CAST((B.SIZE * 8 / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' MB'
END AS FILESIZE,
CASE
WHEN (B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0) > 1000
THEN CAST(CAST((((B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0)) / 1024.0) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' GB'
ELSE CAST(CAST((((B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0))) AS DECIMAL(18,2)) AS VARCHAR(20)) + ' MB'
END AS SPACEFREE,
B.PHYSICAL_NAME
FROM SYS.DATABASES A
JOIN SYS.MASTER_FILES B
ON A.DATABASE_ID = B.DATABASE_ID
JOIN #TMPFIXEDDRIVES C
ON LEFT(B.PHYSICAL_NAME,1) = C.DRIVE
JOIN #TMPSPACEUSED D
ON A.NAME = D.DBNAME
AND B.NAME = D.FILENME
WHERE a.database_id>4
--and DRIVE = 'F'
ORDER BY (B.SIZE * 8 / 1024.0) - (D.SPACEUSED / 128.0) DESC

DROP TABLE #TMPFIXEDDRIVES 
 
DROP TABLE #TMPSPACEUSED
--

The result set displays like this:


--

I use this all of the time to troubleshoot space issues - not only to find out what is taking up space but also to see if there are files with excessive empty space that can be examined for cleanup.

http://s2.quickmeme.com/img/16/167911b7d66e2736356e5680d23bc9d5111120af5aada10bdb714c70874cc3b1.jpg

Do not willy-nilly shrink files - it causes all kinds of problems with file fragmentation and potentially with performance - but there are times where you just don't have any choice, and this is an easy way to find candidates.
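If you do have to reclaim space, the gentlest option is usually a TRUNCATEONLY shrink, which only releases the free space at the end of the file without shuffling pages around (the database and logical file names below are placeholders):

--

USE <dbname>
GO

-- Releases only unused space at the tail of the file; no page movement, no new fragmentation
DBCC SHRINKFILE (N'<logical_file_name>', TRUNCATEONLY);

--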

--

Hope this helps!

T-SQL Tuesday #92 - Trust But Verify


It's T-SQL Tuesday time again, and this month the host is Raul Gonzalez (blog/@SQLDoubleG).  His chosen topic is Lessons Learned the Hard Way. (Everybody should have a story on this one, right?)

When considering this topic the thing that spoke most to me was a broad generalization (and no, it isn't #ItDepends)

https://cdn.meme.am/cache/instances/folder589/500x/64701589/grumpy-cat-5-trust-but-verify.jpg

Yes I am old enough (Full Disclosure) to know many people associate the phrase "Trust But Verify" with a certain Republican President of the United States, but it really is an important part of what we do as DBA's.

When a peer sends you code to review, do you just trust it's OK?

https://cdn.meme.am/cache/instances/folder809/66278809.jpg

When a client says they're backing up their databases, do you trust them?

https://cdn.meme.am/cache/instances/folder686/500x/64677686/willy-wonka-trust-but-verify.jpg

When *SQL SERVER ITSELF* says it's backing up your databases, do you trust it?

https://imgflip.com/i/1sc4nl

(Catching a theme yet?)

At the end of the day, as the DBA we are the Default Blame Acceptors, which means whatever happens, we are Guilty Until Proven Innocent - and even then we're guilty half the time.

This isn't about paranoia, it's about being thorough and doing your job - making sure the systems stay up and the customers stay happy.
  • Verify your backup jobs are running, and then run test restores to make sure the backup files are good (a quick starter check for this appears after this list)
  • Verify your SQL Server services are running (when they're supposed to be)
  • Verify your databases are online (again, when they're supposed to be)
  • Verify your logins and users have the appropriate security access (#SysadminIsNotAnAppropriateDefault)
  • Read through your code and that of your peers - don't just trust the syntax checker!
  • When you find code online in a blog or a Microsoft forum, read through it and check it - don't just blindly execute it because it was written by a Microsoft employee or an MVP - they're all human too! (This almost bit me once on an old forum post where thankfully I did read through the code before I pushed Execute - the post was two years old with hundreds of reads so it would be fine, right?  There was a pretty simple mistake in the WHERE clause and none of the hundreds of readers had seen it or been polite enough to point it out to the author!)
  • Run periodic checks to make sure all of your Perfmon traces (you are running Perfmon traces, right?), Windows Tasks, SQL Agent Jobs, XEvents sessions, etc. are installed properly and configured with the current settings - just because the instance has a job named "Online Status Check" on it doesn't mean the version is anything like your current code!
...and so many, many, many more.
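As a starting point for that first bullet, a quick check against msdb (my own sketch, not a complete verification process) will at least show which databases haven't had a recent full backup - never-backed-up databases float to the top:

--

SELECT d.name AS DatabaseName,
       MAX(bs.backup_finish_date) AS LastFullBackup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS bs
    ON bs.database_name = d.name
   AND bs.type = 'D'   -- D = full database backup
WHERE d.name <> N'tempdb'
GROUP BY d.name
ORDER BY LastFullBackup;

--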

My belief has always been that people are generally good - contrary to popular belief among some of my peers, developers/managers/QA staff/etc. are not inherently bad people - but they are people.  Sometimes they make mistakes (we all do).  Sometimes they don't have a full grasp of our DBA-speak (just like I don't understand a lot of Developer-speak - I definitely don't speak C#) - and that makes it our responsibility as DBAs, the subject matter experts in this area, to make sure these things are covered.

As I said above, this isn't about paranoia - you do need to entrust other team members with shared responsibilities and division of labor, because as Yoda says:

http://blog.idonethis.com/wp-content/uploads/2016/04/yoda.jpg

...but when the buck stops with you (as it so often does) and it is within your power to check something - do it!

Hope this helps!

Toolbox - Which Tables are Using All of My Space?

This is the next in a new series of blogs I am going to create talking about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA.  I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles.  I will always give attribution back to the original source and note when I have made modifications.

--

In the previous post in this series "Toolbox - Where Did All My Space Go?" I shared a script for finding which database files consumed the most space and which of those files had free space in them.  The next step after finding out which databases are using the space is determining which tables in those databases are occupying that space to consider purge/archive opportunities.  In many cases you will find a single large table (often called a "mother table") taking up most of the space in your database.

https://www.newslinq.com/wp-content/uploads/2014/05/table.png
(That's a *big* table!)

I found a script created by a developer from Switzerland in a question/answer on StackOverflow.com and modified it slightly to return the specifics I wanted.  Among other things I added the InstanceName and DatabaseName because in my job I frequently create documentation or reports for external clients who don't necessarily know the result set came from a particular instance and a particular database:

--
/*
Object Sizes
Modified from http://stackoverflow.com/questions/15896564/get-table-and-index-storage-size-in-sql-server
*/
SELECT TOP 50
@@SERVERNAME as InstanceName
, DB_NAME() as DatabaseName
, s.NAME AS SchemaName
, t.NAME  AS TableName
, SUM(p.rows) AS RowCounts
--, SUM(a.total_pages) * 8 AS TotalSpaceKB
, SUM(a.total_pages) * 8/1024.0 AS TotalSpaceMB
, SUM(a.total_pages) * 8/1024.0/1024.0 AS TotalSpaceGB
, SUM(a.used_pages) * 8/1024.0 AS UsedSpaceMB
, (SUM(a.total_pages) - SUM(a.used_pages)) * 8/1024.0 AS UnusedSpaceMB
FROM sys.tables t
INNER JOIN sys.schemas s
ON s.schema_id = t.schema_id
INNER JOIN sys.indexes i
ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p
ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a
ON p.partition_id = a.container_id
WHERE t.NAME NOT LIKE 'dt%'   -- filter out system tables for diagramming
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255
GROUP BY t.Name
, s.Name
ORDER BY TotalSpaceMB DESC
--

The results look like this:

--

InstanceName  DatabaseName  SchemaName  TableName  RowCounts  TotalSpaceMB  TotalSpaceGB  UsedSpaceMB  UnusedSpaceMB
Instance01  Database15  SI_USER  TRANS_DATA  17730150  16263.72  15.88  16234.14  29.58
Instance01  Database15  SI_USER  WORKFLOW_CONTEXT  50785680  7623.27  7.44  7622.52  0.75
Instance01  Database15  PTM  EI_HISTORY_MSG  22704543  3701.59  3.61  3701.19  0.40
Instance01  Database15  SI_USER  WORKFLOW_LINKAGE  72643908  2657.21  2.59  2657.06  0.15
Instance01  Database15  SI_USER  CORRELATION_SET  13762284  2542.87  2.48  2542.59  0.27
Instance01  Database15  SI_USER  DOCUMENT  9833616  1445.55  1.41  1445.32  0.23
Instance01  Database15  PTM  EI_HISTORY_TRANDTL  30673680  1257.23  1.23  1256.99  0.24
Instance01  Database15  SI_USER  ACT_SESSION_GUID  18728114  1246.77  1.22  1246.70  0.07
Instance01  Database15  SI_USER  DATA_FLOW_GUID  13635838  908.08  0.89  907.98  0.10
Instance01  Database15  PTM  EI_HISTORY_TRAN  18560608  699.97  0.68  699.73  0.24
Instance01  Database15  SI_USER  DATA_TABLE  174048  460.91  0.45  460.45  0.46
Instance01  Database15  SI_USER  DOCUMENT_EXTENSION  1630855  363.54  0.36  363.15  0.39
Instance01  Database15  SI_USER  ACT_NON_XFER  1579422  284.48  0.28  284.13  0.35
Instance01  Database15  SI_USER  ACT_XFER  804270  217.98  0.21  217.61  0.38
Instance01  Database15  SI_USER  ACT_SESSION  1008875  209.04  0.20  208.66  0.38
Instance01  Database15  SI_USER  ARCHIVE_INFO  4203976  113.34  0.11  113.10  0.24
Instance01  Database15  SI_USER  WF_INST_S  1061373  101.52  0.10  101.37  0.16
Instance01  Database15  SI_USER  ACT_AUTHORIZE  298908  70.27  0.07  70.11  0.16
Instance01  Database15  SI_USER  DATA_FLOW  420930  56.59  0.06  56.35  0.23
Instance01  Database15  SI_USER  ACT_AUTHENTICATE  269200  45.80  0.04  45.31  0.48
Instance01  Database15  SI_USER  EDIINTDOC  182672  43.83  0.04  43.69  0.14
Instance01  Database15  SI_USER  MSGMDNCORRELATION  74656  27.86  0.03  26.42  1.44
Instance01  Database15  SI_USER  EDI_DOCUMENT_STATE  57498  22.19  0.02  18.21  3.98
Instance01  Database15  SI_USER  ENVELOPE_PARMS  134691  19.50  0.02  19.34  0.16
Instance01  Database15  SI_USER  SAP_TID  81598  14.13  0.01  14.00  0.13
Instance01  Database15  PTM  EI_PARTNER_REPORT  74617  13.39  0.01  13.38  0.02
Instance01  Database15  SI_USER  EDI_ELEMENT_CODES  89583  10.63  0.01  10.55  0.08
Instance01  Database15  SI_USER  EDI_COMPLIANCE_RPT  37500  10.23  0.01  9.52  0.70
Instance01  Database15  SI_USER  DOCUMENT_LIFESPAN  29454  10.10  0.01  8.44  1.66
Instance01  Database15  SI_USER  ACTIVITY_INFO  43269  6.00  0.01  5.70  0.30
Instance01  Database15  SI_USER  YFS_USER_ACT_AUDIT  14025  4.90  0.00  4.18  0.72
Instance01  Database15  SI_USER  BPMV_LS_WRK  219110  4.20  0.00  2.70  1.50
Instance01  Database15  SI_USER  CODELIST_XREF_ITEM  14332  3.50  0.00  2.76  0.74
Instance01  Database15  SI_USER  DOC_STATISTICS  8948  3.43  0.00  3.01  0.42
Instance01  Database15  SI_USER  MAP  4848  2.37  0.00  2.26  0.11
Instance01  Database15  SI_USER  YFS_RESOURCE  5628  1.98  0.00  1.82  0.16
Instance01  Database15  SI_USER  MDLR_PAL_ITEM_DESC  4767  1.38  0.00  1.35  0.03
Instance01  Database15  SI_USER  SERVICE_PARM_LIST  6290  1.28  0.00  1.12  0.16
Instance01  Database15  SI_USER  ADMIN_AUDIT  2100  1.15  0.00  0.76  0.39
Instance01  Database15  SI_USER  DOC_STAT_KEY  2860  1.08  0.00  0.82  0.26
Instance01  Database15  SI_USER  MAPPER_ERL_XREF  4317  1.07  0.00  1.02  0.05
Instance01  Database15  PTM  EI_MSG_PROFILE  4830  0.89  0.00  0.76  0.13
Instance01  Database15  SI_USER  WF_INST_S_WRK  2708  0.86  0.00  0.78  0.08
Instance01  Database15  SI_USER  YFS_ORGANIZATION  909  0.84  0.00  0.38  0.46
Instance01  Database15  SI_USER  YFS_STATISTICS_DETAIL  792  0.83  0.00  0.48  0.35
Instance01  Database15  SI_USER  RESOURCE_CHECKSUM  5827  0.76  0.00  0.73  0.02
Instance01  Database15  SI_USER  YFS_RESOURCE_PERMISSION  1611  0.73  0.00  0.56  0.16
Instance01  Database15  PTM  EI_COMM_PROFILE_AUDIT  1512  0.72  0.00  0.63  0.09
Instance01  Database15  SI_USER  WFD  3406  0.72  0.00  0.63  0.09
Instance01  Database15  SI_USER  XMLSCHEMAS  1678  0.70  0.00  0.69  0.01

--

This allows you to easily find the largest tables (you can modify the ORDER BY to find the tables with the most free space as well to look for inefficient indexing or design).
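For example, to surface the tables carrying the most empty space instead of the largest tables overall, only the final line of the script needs to change:

--

ORDER BY UnusedSpaceMB DESC

--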

Once you have the largest tables in hand you have the starting point for a discussion on potential purges or archives.

Hope this helps!


Sudden Differential Backup Failures after an Availability Group Failover

As so many stories do, this story starts with a failover - in this case an Availability Group (AG) failover.

https://cdn.meme.am/cache/instances/folder984/400x/59300984.jpg

There were several different backup and status check jobs failing on NODE01 and NODE02 because the AG01 availability group was now primary on NODE02 instead of NODE01 after a failover.

The failover occurred the previous Saturday morning at 8:30am local server time because of an application-initiated server reboot of the 01 node:

--
Log Name:      System
Source:        User32
Date:          7/22/2017 8:31:08 AM
Event ID:      1074
Task Category: None
Level:         Information
Keywords:      Classic
User:          SYSTEM
Computer:      NODE01.COMPANY1.com
Description:
The process C:\Program Files\Application12\AppAgent\01\patch\ApplicationService.exe (NODE01) has initiated the restart of computer NODE01 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Legacy API shutdown
Reason Code: 0x80070000
Shutdown Type: restart
Comment: The Update Agent will now reboot this machine
--

Always fun when an application “unexpectedly” restarts a server without advance warning.

https://i.imgflip.com/w6x8g.jpg
--

Even though the Quest/Dell LiteSpeed Fast Compression maintenance plans are configured to pick up “all user databases,” the availability group databases were not currently being backed up and hadn’t been since the failover:

NODE01:



NODE02:



The error messages in the SQL Error Log on the 01 node are telling:

--

Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     Backup
Message    Error: 3041, Severity: 16, State: 1.
 
-- 
Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     Backup
Message    BACKUP failed to complete the command BACKUP DATABASE AG_DATABASE22. Check the backup application log for detailed messages.
 
-- 
Date       7/24/2017 1:09:23 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     spid269
Message    FastCompression Alert: 62301 : SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:  
BACKUP DATABASE is terminating abnormally. This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.
 
-- 
Date       7/24/2017 1:09:24 AM
Log        SQL Server (Current - 7/24/2017 12:59:00 PM)
Source     spid258
Message    Error: Maintenance plan 'Maint Backup User Databases'.LiteSpeed™ for SQL Server® 8.2.1.42
© 2016 Dell Inc.
           Selecting full backup: Maximum interval (7 days) has elapsed since last full
Executing SQLLitespeed.exe: Write new full backup \\adcnas21\SQL\NODE01\AG_DATABASE22\AG_DATABASE22.litespeed.f22.bkp
           Msg 62301, Level 16, State 1, Line 0: SQL Server has returned a failure message to LiteSpeed™ for SQL Server® which has prevented the operation from succeeding.
The following message is not a LiteSpeed™ for SQL Server® message. Please refer to SQL Server books online or Microsoft technical support for a solution:

           BACKUP DATABASE is terminating abnormally.
This BACKUP or RESTORE command is not supported on a database mirror or secondary replica.
--

 ** IMPORTANT **  - while this specific scenario is on a cluster running Dell/Quest LiteSpeed Fast Compression backups (a process that uses an algorithm to determine whether a differential is sufficient or a full backup is required each day), the problem does *not* directly relate to LiteSpeed - the problem is with running differential backups on availability group secondaries or database mirror partners in general.

The situation around running differentials on secondaries is described in this post from Anup Warrier (blog/@AnupWarrier):


--

The particular issue here is that the AG replicas aren’t weighted evenly for backup preference, so even though it is “invalid” for the differentials, the AG still prefers NODE01 because it is more heavily weighted:




--

My recommendation to the client to fix this on LiteSpeed Fast Compression was to change the backup option on the AG from “Any Replica” to “Primary” – this would keep the backup load on the primary, which means the differential backups would work.
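For reference, that preference change is a single statement if you don't want to click through the SSMS Availability Group properties dialog:

--

-- Route backups (and therefore the Fast Compression differentials) to the primary replica
ALTER AVAILABILITY GROUP AG01 SET (AUTOMATED_BACKUP_PREFERENCE = PRIMARY);

--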

The cost of this is that the backup I/O load would always be on the primary, but since the “normal” state for AG01 is to live on NODE01 with backups on NODE01, requiring the backups to run on the primary node would not be different in most situations.

Note this is the state for this particular AG - often part of the value of AG's is being able to offload backups to a secondary, so this is a cost to be weighed before making this change.

If you want to run backups on the secondary node then my personal best suggestion for fixing this would be outside of LiteSpeed/Fast Compression – if you want/need to stay with differentials, we could use a scripted backup solution (like Ola Hallengren's Maintenance Solution) to enable/disable the full/differential backups on the secondary node, probably by adding a check in the backup commands to ignore databases that aren’t the AG primary.  

A script to perform this Availability Group replica check appears in my previous post on determining the primary Availability Group replica:

--
SELECT AG.name AS AvailabilityGroupName,
HAGS.primary_replica AS PrimaryReplicaName, 
HARS.role_desc as LocalReplicaRoleDesc, 
DRCS.database_name AS DatabaseName, 
HDRS.synchronization_state_desc as SynchronizationStateDesc, 
HDRS.is_suspended AS IsSuspended, 
DRCS.is_database_joined AS IsJoined
FROM master.sys.availability_groups AS AG
LEFT OUTER JOIN master.sys.dm_hadr_availability_group_states as HAGS
ON AG.group_id = HAGS.group_id
INNER JOIN master.sys.availability_replicas AS AR
ON AG.group_id = AR.group_id
INNER JOIN master.sys.dm_hadr_availability_replica_states AS HARS
ON AR.replica_id = HARS.replica_id 
AND HARS.is_local = 1
INNER JOIN master.sys.dm_hadr_database_replica_cluster_states AS DRCS
ON HARS.replica_id = DRCS.replica_id
LEFT OUTER JOIN master.sys.dm_hadr_database_replica_states AS HDRS
ON DRCS.replica_id = HDRS.replica_id
AND DRCS.group_database_id = HDRS.group_database_id
ORDER BY AG.name, DRCS.database_name
--
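If you do want to keep backups on a secondary, the job-level guard could look something like this (a sketch, not the client's actual LiteSpeed job - sys.fn_hadr_is_primary_replica requires SQL Server 2014 or later, and the backup path is illustrative):

--

IF sys.fn_hadr_is_primary_replica(N'AG_DATABASE22') = 1
BEGIN
    -- Differentials are only valid on the primary replica
    BACKUP DATABASE AG_DATABASE22
    TO DISK = N'\\adcnas21\SQL\AG_DATABASE22.diff.bkp'
    WITH DIFFERENTIAL, INIT;
END
ELSE
    PRINT 'Not the primary replica for AG_DATABASE22 - skipping differential backup.';

--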

Another option would be to go away from differentials completely and just run FULL’s and LOG’s which would allow for the current 80/50 weight model to continue.

I further advised the client that because of the differentials situation I don’t see a way to force backups to the secondary with LiteSpeed Fast Compression - since Fast Compression requires differentials as part of its process, Fast Compression backups *have* to run on the primary. (Maybe there is some setting in Fast Compression to fix this but I am not familiar with one.) 

--

Hope this helps!




The Transient Database Snapshot Has Been Marked Suspect

Yet another tale from the ticket queue...

The DBCC CheckDB was failing on INSTANCE99 and after some investigation it looked like a space issue, not an actual corruption issue.

http://baddogneedsrottenhome.com/images/emails/55ce060daa58b.jpg
--

The Job Failure error text was this:

--

Executed as user: DOMAIN\svc_acct. Microsoft (R) SQL Server Execute Package Utility  Version 10.50.6000.34 for 64-bit  Copyright (C) Microsoft Corporation 2010. All rights reserved.    Started:  2:00:00 AM  Progress: 2017-08-20 02:00:01.11     Source: {11E1AA7B-A7AC-4043-916B-DC6EABFF772B}      Executing query "DECLARE @Guid UNIQUEIDENTIFIER      EXECUTE msdb..sp...".: 100% complete  End Progress  Progress: 2017-08-20 02:00:01.30     Source: Check Database Integrity Task      Executing query "USE [VLDB01]  ".: 50% complete  End Progress  Error: 2017-08-20 03:38:19.28     Code: 0xC002F210     Source: Check Database Integrity Task Execute SQL Task     Description: Executing the query "DBCC CHECKDB(N'VLDB01')  WITH NO_INFOMSGS  " failed with the following error: "Check terminated.  The transient database snapshot for database 'VLDB01' (database ID 5) has been marked suspect due to an IO operation failure.  Refer to the SQL Server error log for details.  A severe error occurred on the current command.  The results, if any, should be discarded.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.  End Error  Warning: 2017-08-20 03:38:19.28     Code: 0x80019002     Source: VLDB01 Integrity      Description: SSIS Warning Code DTS_W_MAXIMUMERRORCOUNTREACHED.  The Execution method succeeded, but the number of errors raised (1) reached the maximum allowed (1); resulting in failure. This occurs when the number of errors reaches the number specified in MaximumErrorCount. Change the MaximumErrorCount or fix the errors.  End Warning  DTExec: The package execution returned DTSER_FAILURE (1).  Started:  2:00:00 AM  Finished: 3:38:19 AM  Elapsed:  5899.51 seconds.  The package execution failed.  The step failed.

--

Looking in the SQL Error Log there were hundreds of these combinations in the minutes immediately preceding the job failure:

--

The operating system returned error 665(The requested operation could not be completed due to a file system limitation) to SQL Server during a write at offset 0x000048a123e000 in file 'E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

--

Error: 17053, Severity: 16, State: 1.

--

E:\SQL_Data\VLDB01.mdf:MSSQL_DBCC17: Operating system error 665(The requested operation could not be completed due to a file system limitation) encountered.

--

I have seen DBCC snapshot errors in the past and they almost always come back to disk space issues.  If you look at the first listing of the 665 error above you can see it was trying to write to the snapshot file it was creating on the E: drive, which is where the primary DATA/MDF file for VLDB01 was located.

By default, CheckDB and its component commands use a snapshot of the database to perform their work.  As described here by Paul Randal (@PaulRandal/blog) from SQLskills:  http://sqlmag.com/blog/why-can-database-snapshot-run-out-space, snapshot files are “sparse” files that reserve a very small amount of space and then grow as needed to handle the required data.  Because of this mechanism, they do not require the full amount of space up front.

https://technet.microsoft.com/en-us/library/bb457112.f13zs11_big(l=en-us).jpg

A sparse file only uses the physical space required to hold the actual ("meaningful") data.  As seen in this diagram from Technet, in this example a regular/non-sparse file would be 17GB while a comparable sparse file would only be 7GB.

The text of the out of space error has since been updated from the error message seen in Paul’s article to the “transient database snapshot suspect” error we see above as described here http://www.sqlcoffee.com/Troubleshooting177.htm.

--

Looking at the E: drive it was a 900GB drive with 112GB currently free.  The catch is that in the 675GB VLDB01 database there are two tables larger than 112GB and another that is almost 100GB!

Top 10 largest tables out of 1261 total tables in VLDB01:

InstanceName  DatabaseName  TableName  NumberOfRows  SizeinMB   DataSizeinMB  IndexSizeinMB  UnusedSizeinMB
INSTANCE99    VLDB01        BigTable1  1011522       136548.20  136523.80     10.71          13.69
INSTANCE99    VLDB01        BigTable2  9805593       122060.29  114534.34     5709.13        1816.82
INSTANCE99    VLDB01        BigTable3  17747326      91143.74   65405.88      25464.23       273.63
INSTANCE99    VLDB01        BigTable4  137138292     78046.15   39646.33      38305.33       94.49
INSTANCE99    VLDB01        Table01    1650232       46884.70   46422.93      419.40         42.37
INSTANCE99    VLDB01        Table02    76827734      26780.02   9153.05       17566.23       60.75
INSTANCE99    VLDB01        Table03    35370640      26766.98   20936.73      5733.40        96.86
INSTANCE99    VLDB01        Table04    12152300      22973.11   11173.06      11764.65       35.40
INSTANCE99    VLDB01        Table05    12604262      19292.02   7743.06       11511.93       37.03
INSTANCE99    VLDB01        Table06    31649960      14715.57   5350.62       9327.30        37.65

The biggest unit of work in a CheckDB is the individual DBCC CHECKTABLE’s of each table, and trying to run a CHECKTABLE of a 133GB table in a 112GB space was not going to fly.

Note that you don’t need 675GB of free space for the CheckDB snapshot of a 675GB database – just space for the largest object and a little more – 145GB-150GB free should be sufficient to CheckDB this particular database as it currently stands, but we need to be mindful of these large tables if they grow over time as they would then require more CheckDB snapshot space as well.

--

There are a couple of potential fixes here.

First and possibly most straightforward would be to clear more space on E: or to expand the drive – if we could get the drive to 150+GB free we should be good for the present (acknowledging the threat of future growth of the large tables).  The catch was that there were only three files on E: and none of them had much useful free space to reclaim:

DBFileName     Path                            FileSizeMB  SpaceUsedMB  FreeSpaceMB
VLDB01         E:\SQL_Data\VLDB01.mdf          654267.13   649746.81    4520.31
VLDB01_data2   E:\SQL_Data\VLDB01_1.ndf        29001.31    28892.81     108.5
VLDB01_CONFIG  E:\SQL_Data\VLDB01_CONFIG.mdf   16.25       12.06        4.19

This means that going this route would require expanding the E: drive.  I would recommend expanding it by 100GB-150GB – this is more than we immediately need but should prevent us from asking for more space in the short term.

ProTip - consider this method any time you are asking for additional infrastructure resources – asking for just the amount of CPU/RAM/Disk/whatever that you need right now means you will probably need to ask again soon, and most infra admins I have known would rather give you more up front than have you bother them every month!

https://imgflip.com/i/1unt0z

(However, be realistic – don’t ask for an insane amount or you will just get shut down completely!)


--

Another option in this case since INSTANCE99 is SQL Server Enterprise Edition would be to create a manual snapshot somewhere else with more space and then to run CheckDB against that manual snapshot.  This process is described here by Microsoft Certified Master Robert Davis (@SQLSoldier/blog): http://www.sqlsoldier.com/wp/sqlserver/day1of31daysofdisasterrecoverydoesdbccautomaticallyuseexistingsnapshot and is relatively straightforward:

--

1) Create a snapshot of your database on a different drive – something like:

CREATE DATABASE VLDB01_Snapshot ON (NAME = N'VLDB01', FILENAME = N'O:\Snap\VLDB01_Data.snap') AS SNAPSHOT OF VLDB01;
-- NOTE: the NAME must match the source file's logical name, and every data file in the source database (here the VLDB01_data2 .ndf and VLDB01_CONFIG .mdf as well) needs its own (NAME, FILENAME) entry.

2) Run CheckDB against the snapshot directly:

DBCC CHECKDB (VLDB01_Snapshot);

3) Drop the snapshot – because the snapshot is functionally a database, this is just a DROP DATABASE statement:

DROP DATABASE VLDB01_Snapshot

4) Modify the existing job to exclude VLDB01 so that it doesn’t continue to try to run with the default internal process!

--

Luckily, in this case there were several drives with sufficient space!

--

I advised the client that if they preferred to go this second way (the manual snapshot) I strongly recommend removing any existing canned maintenance plans and changing this server to the Ola Hallengren scripted maintenance.  Not only is this my general recommendation anyway (#OlaRocks), but it also makes excluding a database much easier and safer. 

To exclude a database under a regular maintenance plan you have to edit the job and manually check every database except the offending database, but this causes trouble when new databases are added to the instance as they must then be manually added to the maintenance plans.  Under the Hallengren scripts you can say “all databases except this one” which continues to automatically pick up new databases in the future (there is no “all but this one” option in a regular maintenance plan).

Here is what the command would look like under Ola:

EXECUTE dbo.DatabaseIntegrityCheck
@Databases = 'USER_DATABASES, -VLDB01',
@CheckCommands = 'CHECKDB'


--

If you find yourself in this situation, consider carefully which way you prefer to go and document, document, document so that future DBAs know what happened (even if that future DBA is just you in 6/12/24 months!)

https://3.bp.blogspot.com/-YBeSt5-A_fA/Vx2SCCBc1xI/AAAAAAAAtnY/UwvtqaQBacoJ7TY5AM3_1HKSUZTy5CyZACLcB/s1600/wait%2Bhere1b.jpg

Hope this helps!



Toolbox - IO, IO, Why are You So Slow?

This is the next in a series of blog posts about useful tools (mostly scripts) that I use frequently in my day-to-day life as a production DBA.  I work as a Work-From-Home DBA for a Remote DBA company, but 90+% of my job functions are the same as any other company DBA.

Many of these scripts come directly from blogs and articles created and shared by other members of the SQL Server community; some of them I have slightly modified and some I have lifted directly from those articles.  I will always give attribution back to the original source and note when I have made modifications.

--

Troubleshooting performance in SQL Server (or almost any other system) is often an iterative process of discovering a bottleneck, fixing it, and then discovering the next bottleneck.

https://memegenerator.net/instance/65795220/grumpy-cat-5-bottleneck-sighting-optimize-the-system
SQL Server bottlenecks frequently fall into one of three resource categories:
  • CPU
  • Memory
  • I/O (Disk)
For example, you may find that "your instance is slow" (Client-speak for "I'm unhappy - FIX IT!") and upon investigation you see in your Perfmon collector (you have a Perfmon collector running, right???) that CPU is consistently at 90%+.  You consider the evidence and realize a set of queries is poorly written (darn SharePoint) but you also realize that in this case you can fix them (whew!), and this helps the CPU numbers drop significantly.  You're the hero, right?

...until you realize the instance is still "slow."

https://hakanforss.files.wordpress.com/2014/08/lawofbottlenecks.png
You look at the numbers again, and now you find that disk latency, which had previously been fine, is now completely in the tank during the business day, showing that I/O delays are through the roof.

What happened?

This demonstrates the concept of the shifting bottleneck - while CPU use was through the roof, the engine was so bogged down that it couldn't generate that much I/O, but once the CPU issue was resolved, queries started moving through more quickly until the next choke point was hit at the I/O limit.  Odds are that once you resolve the I/O situation, you will find a new bottleneck.

How do you ever defeat a bad guy that constantly moves around and frequently changes form?

https://movietalkexpress.files.wordpress.com/2016/08/3.jpg
The next concept in this series is "good enough" - all computing systems have theoretical top speeds (the speed of electrons through the ether, etc.) but you never get there.  At the end of the day the important question is:

How fast does your system need to be to meet the business SLAs and keep your customers happy (aka QUIET)?

Unfortunately in many cases the answer seems to be this:

https://i.kinja-img.com/gawker-media/image/upload/s--Wtfu6LxV--/c_fill,fl_progressive,g_center,h_450,q_80,w_800/1344998898279631762.jpg
--

Once you determine this "good enough" speed (the whole discussion of "Do you have SLAs?" could fill dozens of blog posts) you can decide on an acceptable end point for your iterative process.

--

I have a small handful of things I check when a system is "slow" and one of them is disk latency.  At its core, disk latency is a calculation of how much delay there is between when SQL Server asks for data and when it gets it back.

The data regarding latency comes from the DMV sys.dm_io_virtual_file_stats, and the relevant metrics are io_stall, io_stall_read_ms, and io_stall_write_ms (table from Technet):

Column name | Data type | Description
database_id | smallint | ID of database.
file_id | smallint | ID of file.
sample_ms | int | Number of milliseconds since the computer was started. This column can be used to compare different outputs from this function.
num_of_reads | bigint | Number of reads issued on the file.
num_of_bytes_read | bigint | Total number of bytes read on this file.
io_stall_read_ms | bigint | Total time, in milliseconds, that the users waited for reads issued on the file.
num_of_writes | bigint | Number of writes made on this file.
num_of_bytes_written | bigint | Total number of bytes written to the file.
io_stall_write_ms | bigint | Total time, in milliseconds, that users waited for writes to be completed on the file.
io_stall | bigint | Total time, in milliseconds, that users waited for I/O to be completed on the file.
size_on_disk_bytes | bigint | Number of bytes used on the disk for this file. For sparse files, this number is the actual number of bytes on the disk that are used for database snapshots.
file_handle | varbinary | Windows file handle for this file.

How do you turn these numbers into something meaningful?  As is almost always the case - someone has already done that work - you just need to find it!

Paul Randal (blog/@PaulRandal) has a great blog post "How to examine IO subsystem latencies from within SQL Server" and in that post he references a query created by Jimmy May (blog/@aspiringgeek).  Jimmy was at Microsoft for many years and is currently at the SanDisk Data Propulsion Lab.

Jimmy's query is so useful that it is also included in Glenn Berry's (blog/@glennalanberry) great Diagnostic Information (DMV) Queries.  Here is my *very slightly* modified version of that query:

--
-- Drive level latency information (Query 19) (Drive Level Latency)
-- Based on code from Jimmy May
SELECT @@ServerName as instanceName, [Drive], volume_mount_point,
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (io_stall_read_ms/num_of_reads)
END AS [Read Latency],
CASE
WHEN num_of_writes = 0 THEN 0
ELSE (io_stall_write_ms/num_of_writes)
END AS [Write Latency],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE (io_stall/(num_of_reads + num_of_writes))
END AS [Overall Latency],
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (num_of_bytes_read/num_of_reads)
END AS [Avg Bytes/Read],
CASE
WHEN num_of_writes = 0 THEN 0
ELSE (num_of_bytes_written/num_of_writes)
END AS [Avg Bytes/Write],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE ((num_of_bytes_read + num_of_bytes_written)/(num_of_reads + num_of_writes))
END AS [Avg Bytes/Transfer]
FROM (SELECT LEFT(UPPER(mf.physical_name), 2) AS Drive, UPPER (volume_mount_point) as volume_mount_point
,SUM(num_of_reads) AS num_of_reads,
        SUM(io_stall_read_ms) AS io_stall_read_ms, SUM(num_of_writes) AS num_of_writes,
        SUM(io_stall_write_ms) AS io_stall_write_ms, SUM(num_of_bytes_read) AS num_of_bytes_read,
        SUM(num_of_bytes_written) AS num_of_bytes_written, SUM(io_stall) AS io_stall
      FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
      INNER JOIN sys.master_files AS mf WITH (NOLOCK)
      ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
 CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.FILE_ID)
      GROUP BY LEFT(UPPER(mf.physical_name), 2),UPPER (volume_mount_point)) AS tab
ORDER BY [Overall Latency] DESC OPTION (RECOMPILE);
-- Shows you the drive-level latency for reads and writes, in milliseconds
-- Latency above 20-25ms is usually a problem
--

The query performs a calculation comparing the stall in milliseconds to the number of disk operations to create a latency in milliseconds.  A rule of thumb (as noted in the query comments) is that anything >=20ms in latency is pretty bad and worthy of investigation.

Here is a sample output of the query:

--

instanceName | Drive | volume_mount_point | Read Latency | Write Latency | Overall Latency | Avg Bytes/Read | Avg Bytes/Write | Avg Bytes/Transfer
INSTANCE35 | F: | F:\SPARE-01\ | 69 | 7 | 64 | 1002982 | 236026 | 940840
INSTANCE35 | F: | F:\DATA01\ | 26 | 1 | 25 | 55005 | 2629 | 54384
INSTANCE35 | F: | F:\280532-02_SECOND\ | 8 | 5 | 8 | 105124 | 8257 | 105114
INSTANCE35 | T: | T:\ | 2 | 25 | 3 | 65483 | 64899 | 65187
INSTANCE35 | E: | E:\LOGS01\ | 7 | 0 | 0 | 2008051 | 327 | 51198

--

On this instance you can see that the SPARE-01 and DATA01 mount points on the F: drive have significant overall latency (64ms and 25ms respectively) - significant enough that the users are almost certainly experiencing I/O-related impact.

If you look at the query you will see that the "overall" latency covers both reads and writes, and as such is functionally a weighted average of the read and write latencies.  For example, you can see that DATA01 does significantly more reads than writes, since the overall latency of 25 is almost equal to the read latency of 26.

One more point from the sample numbers above - look at the T: drive.  The write latency is 25 (bad) but the overall latency is only 3.  Is this a problem?  Usually it is not, because this overall latency shows that this drive does relatively few writes - there are so many reads comparatively that the good read latency heavily outweighs the bad write latency.  In most cases the overall latency is the most important statistic of the three.

(Of course as always this is a big #ItDepends - if you have a known issue with write performance - even for the small number of writes being done on the drive - then the write latency is important!)
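
To make the weighted average concrete, here is the arithmetic with made-up read/write counts for T: (the actual counts are not shown above, so these numbers are purely illustrative):

--

-- overall latency = total stall / total I/Os
-- assume 22,000 reads averaging 2 ms and 1,000 writes averaging 25 ms on T:
--   (22,000 * 2) + (1,000 * 25) = 44,000 + 25,000 = 69,000 ms of total stall
--   69,000 ms / 23,000 I/Os = 3 ms overall latency
-- The few slow writes barely move the overall number because the fast reads dominate.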

--

What causes latency?  There are a wide variety of issues - some inside SQL but many outside the engine as well.  In my experience I usually follow up on both sides, tasking infrastructure teams to investigate numbers on the storage back-end *and* the storage network while the DBA investigates SQL Server-side I/O-related issues such as poor query plans and missing or bad indexes that cause excessive unnecessary I/O.  Sometimes you will read guidance to investigate SQL first (Paul makes that recommendation in his post referenced above) but I like starting both investigations at once so that you can get both sets of results in a more timely fashion.

Disk Latency is one of the easiest things to find - run the single simple query shown above and look at the numbers.  Once you determine that you *do* have a latency issue, you can move forward with internal SQL Server and external infrastructure investigations.
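
If a drive-level number looks bad and you want to see which individual files are driving it, a per-file variant of the same stall/operations math (a sketch, not part of the original query) can help:

--

-- Per-file latency using the same calculation as the drive-level query
SELECT DB_NAME(vfs.database_id) AS DatabaseName,
    mf.name AS LogicalFileName,
    CASE WHEN vfs.num_of_reads = 0 THEN 0
        ELSE (vfs.io_stall_read_ms/vfs.num_of_reads) END AS [Read Latency],
    CASE WHEN vfs.num_of_writes = 0 THEN 0
        ELSE (vfs.io_stall_write_ms/vfs.num_of_writes) END AS [Write Latency],
    CASE WHEN (vfs.num_of_reads = 0 AND vfs.num_of_writes = 0) THEN 0
        ELSE (vfs.io_stall/(vfs.num_of_reads + vfs.num_of_writes)) END AS [Overall Latency]
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
ORDER BY [Overall Latency] DESC;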

--

Remember - at the end of the day, you want your customer to think this:

http://imgur.com/waLAsxg

--

Hope this helps!


Come See Me in Minnesota in October!




The schedule for SQL Saturday #682 Minnesota 2017 is out, and I will be giving my talk on Extended Events:

--
Getting Started with Extended Events
Speaker: Andy Galbraith
Duration: 60 minutes
Track: Administration
Few subjects in Microsoft SQL Server inspire the same amount of Fear, Uncertainty, and Doubt (FUD) as Extended Events.  Many DBAs continue to use Profiler and SQL Trace even though they have been deprecated for years.  Why is this?
Extended Events started out in SQL Server 2008 with no user interface and only a few voices in the community documenting the features as they found them.  Since then it has blossomed into a full feature of SQL Server and an amazingly low-impact replacement for Profiler and Trace.
Come learn how to get started - the basics of sessions, events, actions, targets, packages, and more.  We will look at some base scenarios where Extended Events can be very useful as well as considering a few gotchas along the way.  You may never go back to Profiler again!
--

The event is October 7th at St Paul College, and is *free* (with an optional $15 lunch).

I have spoken at numerous SQL Saturdays over the last few years and can definitely say the crew in Minnesota does a great job assembling a schedule and taking care of the attendees (and speakers!)

There is also a great slate of full day pre-conference sessions on Friday the 6th:

--
The Complete Primer to SQL Server Infrastructure and Cloud
David Klee
Level: Intermediate
--
Powering Up with Power BI
Brian Larson, John Thompson...
Level: Beginner
--
DBA Fundamentals - give yourself a solid SQL foundation!
Kevin Hill
Level: Beginner
--
Azure Data Services Hands On Lab
Matt Stenzel
Level: Intermediate
--

The pre-cons cost $110 which makes them a great value for a full day of training from some great SQL Server professionals!

Hope to see you there!


https://media.tenor.com/images/aa1e9033238489da177267cb6a8af273/tenor.gif

Toolbox - Fix Your FILEGROWTH

One of the items we usually flag in reports is FILEGROWTH by percentage.  It doesn't help that for most versions of SQL Server the default FILEGROWTH increment for one or both files (DATA and LOG) is 10%.

https://i.imgflip.com/jaq5c.jpg

As you probably already know, the key flaw to percentage-based FILEGROWTH is that over time the increment grows larger and larger, causing the actual growth itself to take longer and longer.  This is especially an issue with LOG files because they have to be zero-initialized before they can be used, causing excessive I/O and file contention while the growth is in progress.  Paul Randal (blog/@PaulRandal) describes why this is the case in this blog post.  (If you ever get a chance to see it, Paul also does a fun demo in some of his classes and talks on why zero initialization is important, using a hex editor to read the underlying contents of disk even after the old data is "deleted")

As I mentioned above the catch to percentage-based growth is the ever-growing increment:

    

In the image on the left you can see that after 20 growths at 10% you are now growing at 313MB at a time.  By 30 growths (not pictured) the increment is 812MB - getting close to the 1GB mark.  Depending on the speed of your storage this can cause significant delay.
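
You can see the compounding yourself with a quick loop - this sketch assumes a 512MB starting size, which lines up with the 313MB/812MB increments mentioned above:

--

-- Print the Nth growth increment for a file that starts at 512MB and grows by 10% each time
DECLARE @SizeMB DECIMAL(18,2) = 512, @Growth INT = 1;
WHILE @Growth <= 30
BEGIN
    PRINT 'Growth #' + CAST(@Growth AS VARCHAR(5)) + ' increment = '
        + CAST(CAST(@SizeMB * 0.10 AS DECIMAL(18,2)) AS VARCHAR(20)) + 'MB';
    SET @SizeMB = @SizeMB * 1.10;   -- the file is now 10% larger, so the next increment is larger too
    SET @Growth = @Growth + 1;
END
-- Growth #20 works out to roughly 313MB, and growth #30 to roughly 812MB

--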

Another related issue is Virtual Log Files (VLFs), which I discuss here.  It is important to have a meaningful increment so that the growth isn't *too* small relative to the file size.

How do we fix this?  I found a script (and modified it of course) to generate the ALTER DATABASE statements that set the FILEGROWTH increment to a size appropriate to each file's current size, based on the table shown in the script:

--

/*

FILEGROWTH Reset
Submitted by Andy Galbraith
02/04/2016

Script to reset all FILEGROWTH to a fixed increment based on their current size:
CurrentSize<1GB = 16MB
1GB<=CurrentSize<5GB = 128MB
5GB<=CurrentSize<100GB = 256MB
CurrentSize>=100GB = 512MB

Actual queries are about two-thirds of the way down at 'SET @Query' if you want to modify the size parameters

Modified from a script at http://www.sqlservercentral.com/scripts/Administration/99339/

Tested on MSSQL 2005/2008/2008R2/2012/2014

*/

SET NOCOUNT ON
USE master
GO

/* Create a Table for Database File Info */
IF OBJECT_ID('tempdb..#ConfigAutoGrowth') IS NOT NULL 
DROP TABLE #ConfigAutoGrowth

CREATE TABLE #ConfigAutoGrowth
(
DatabaseID INT,
DBName SYSNAME,
LogicalFileName VARCHAR(max),
FileSizeinGB decimal(10,2),
GrowthOption VARCHAR(12)
)

/* Load the Database File Table */
INSERT INTO #ConfigAutoGrowth
SELECT 
SD.database_id, 
SD.name,
SF.name, 
sf.size*8/1024.0/1024.0 as FileSizeinGB, 
CASE SF.status & 0x100000
WHEN 1048576 THEN 'Percentage'
WHEN 0 THEN 'MB'
END AS 'GROWTH Option'
FROM SYS.SYSALTFILES SF
JOIN SYS.DATABASES SD
ON SD.database_id = SF.dbid

/* Variable and Cursor Declarations */ 
DECLARE @name VARCHAR ( max ) /* Database Name */
DECLARE @DatabaseID INT /* Database ID */
DECLARE @LogicalFileName VARCHAR ( max ) /* Database Logical file name */
DECLARE @FileSizeinGB DECIMAL(10,2) /* Current File Size in GB */
DECLARE @GrowthOption VARCHAR ( max ) /* Current FILEGROWTH Type */
DECLARE @Query VARCHAR(max) /* Dynamic Query */

DECLARE DBCursor CURSOR FOR
SELECT DatabaseID, DBName, LogicalFileName, FileSizeinGB, GrowthOption
FROM #ConfigAutoGrowth

OPEN DBCursor

FETCH NEXT FROM DBCursor 
INTO @DatabaseID,@name,@LogicalFileName,@FileSizeinGB, @GrowthOption

WHILE @@FETCH_STATUS = 0
BEGIN
PRINT 'Changing AutoGrowth option for database ['+ UPPER(@name) +'] - current file size ' + cast (@FileSizeinGB as varchar)+'GB'

IF @FileSizeinGB<1 
SET @Query= 'ALTER DATABASE ['+ @name +'] MODIFY FILE (NAME = ['+@LogicalFileName+'],FILEGROWTH = 16MB)'--,MAXSIZE=UNLIMITED)'
IF @FileSizeinGB>=1 and @FileSizeinGB<5 
SET @Query= 'ALTER DATABASE ['+ @name +'] MODIFY FILE (NAME = ['+@LogicalFileName+'],FILEGROWTH = 128MB)'--,MAXSIZE=UNLIMITED)'
IF @FileSizeinGB>=5 and @FileSizeinGB <100
SET @Query= 'ALTER DATABASE ['+ @name +'] MODIFY FILE (NAME = ['+@LogicalFileName+'],FILEGROWTH = 256MB)'--,MAXSIZE=UNLIMITED)'
IF @FileSizeinGB>=100
SET @Query= 'ALTER DATABASE ['+ @name +'] MODIFY FILE (NAME = ['+@LogicalFileName+'],FILEGROWTH = 512MB)'--,MAXSIZE=UNLIMITED)'

PRINT @Query
--EXECUTE(@Query)

FETCH NEXT FROM DBCursor 
INTO @DatabaseID,@name,@LogicalFileName,@FileSizeinGB,@GrowthOption
END 

CLOSE DBCursor 
DEALLOCATE DBCursor 

DROP TABLE #ConfigAutoGrowth
GO

--

SELECT   
    SD.database_id,   
    SD.name,  
    SF.name,  
    CASE SF.status & 0x100000  
    WHEN 1048576 THEN 'Percentage'  
    WHEN 0 THEN 'MB'  
    END AS 'GROWTH Option' 
,size*8.0/1024 as SizeinMB
,growth*8.0/1024.0 as Growth 
FROM SYS.SYSALTFILES SF  
JOIN   
SYS.DATABASES SD  
ON   
SD.database_id = SF.dbid  
GO 

--

By default the script simply PRINTs out the ALTER DATABASE statements for you to copy-paste to another window and execute, but you can un-comment the "EXECUTE(@Query)" statement and the script will apply the changes automatically.

Hope this helps!


And YOU Get a Deadlock and YOU Get a Deadlock and EVERYBODY GETS A DEADLOCK!

https://memegenerator.net/img/instances/400x/59507872/you-get-a-deadlock-everybody-gets-a-deadlock.jpg
We all get them at one point or another - deadlocks.

I have Object 1 and need Object 2, you have Object 2 and need Object 1:

https://i-technet.sec.s-msft.com/dynimg/IC4289.gif
This isn't the same as simple blocking, where I have Object 1 and you want it (but you can't have it because it's MINE!)  In that case you just need to wait your turn, and eventually you will get sick of waiting and walk away (timeout) or I will get done with it and surrender it to you.

In the deadlock case, there truly is no way out other than terminating one of the two requests - in the example above transaction 1 needs access to the Part table to proceed with its work on the Supplier table, and transaction 2 needs access to the Supplier table to work with the Part table.  This means everyone is stuck.
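
If you want to see one happen on a test system, a minimal two-session sketch (the table and column names are hypothetical, loosely borrowed from the diagram above) looks like this:

--

-- Run these statements in order, alternating between two separate sessions, on a *test* database
-- Session 1:
BEGIN TRAN;
UPDATE dbo.Supplier SET Name = 'S1' WHERE SupplierID = 1;   -- Session 1 now holds an exclusive lock on the Supplier row

-- Session 2:
BEGIN TRAN;
UPDATE dbo.Part SET Name = 'P1' WHERE PartID = 1;           -- Session 2 now holds an exclusive lock on the Part row

-- Session 1 (blocks, waiting on Session 2's lock):
UPDATE dbo.Part SET Name = 'P2' WHERE PartID = 1;

-- Session 2 (closes the cycle - SQL Server picks one of the two as the deadlock victim and raises error 1205):
UPDATE dbo.Supplier SET Name = 'S2' WHERE SupplierID = 1;

--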

A great (yet sad) non-techie way of describing a deadlock is this:

https://cdn.meme.am/instances/27106434/no-job-because-no-experience-no-experience-because-no-job.jpg
--

SQL Server has a process called the Deadlock Monitor that periodically watches for deadlock situations and handles them by terminating one or more processes to clear the logjam.  As described in Technet:
By default, the Database Engine chooses as the deadlock victim the session running the transaction that is least expensive to roll back. Alternatively, a user can specify the priority of sessions in a deadlock situation using the SET DEADLOCK_PRIORITY statement. DEADLOCK_PRIORITY can be set to LOW, NORMAL, or HIGH, or alternatively can be set to any integer value in the range (-10 to 10). The deadlock priority defaults to NORMAL. If two sessions have different deadlock priorities, the session with the lower priority is chosen as the deadlock victim. If both sessions have the same deadlock priority, the session with the transaction that is least expensive to roll back is chosen. If sessions involved in the deadlock cycle have the same deadlock priority and the same cost, a victim is chosen randomly.
"Least Expensive to Roll Back" can sometimes be difficult to determine although a general guideline is that SELECT statements are the least expensive (with no LOG to roll back) and are frequent deadlock victims.

--
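
If you would rather nudge the engine toward a particular victim, DEADLOCK_PRIORITY is easy to set per session - a quick sketch (the workload descriptions are just examples):

--

-- In the session you are willing to sacrifice (e.g. a nightly report):
SET DEADLOCK_PRIORITY LOW;

-- In the session you want to protect (e.g. an order-entry transaction):
SET DEADLOCK_PRIORITY HIGH;

-- Or use the integer range for finer control (-10 to 10):
SET DEADLOCK_PRIORITY -5;

--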

With all of that in hand, how do you gather information about deadlocks?  Classical deadlock troubleshooting relies on trace flags 1204 and 1222 or a SQL Server trace on the Deadlock Chain and Deadlock Graph events (as described in this MSSQLTip).
https://www.toonpool.com/user/589/files/living_in_the_past_333905.jpg

We all know at this point (whether some people want to admit it or not) that Extended Events are the tool of the future for this kind of work.  They are more lightweight than SQL Server Trace and are continuing to be improved (unlike Trace, which has been stagnant since SQL Server 2005!)  Any time you find yourself using Trace or Profiler you should pause and reflect on how to accomplish the same task with XEvents.

In this case, there is deadlock information in the System_Health Session, the basic session deployed by default on all SQL Server instances since SQL Server 2008.  (Think the better, stronger, faster version of the Default Trace.)  Patrick Keisler (blog/@PatrickKeisler) has a great write-up about the System_Health session on his blog here.

By default the deadlock information is one giant character field formatted as XML.  The XML is readable (sort of) if that's your thing:

--
<deadlock>
  <victim-list>
    <victimProcess id="process5776a5498" />
  </victim-list>
  <process-list>
    <process id="process5776a5498" taskpriority="0" logused="0" waitresource="KEY: 19:72057594040745984 (9ba325989765)" waittime="4168" ownerId="327954909" transactionname="SELECT" lasttranstarted="2017-10-17T17:21:04.417" XDES="0x50560f9d0" lockMode="S" schedulerid="1" kpid="3756" status="suspended" spid="64" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="2017-10-17T17:21:04.417" lastbatchcompleted="2017-10-17T17:21:03.293" lastattention="1900-01-01T00:00:00.293" hostpid="4012" loginname="Login1" isolationlevel="read committed (2)" xactid="327954909" currentdb="19" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
      <executionStack>
        <frame procname="adhoc" line="1" sqlhandle="0x0200000096663f042d323a9d19c06335abb6318e18e6f8750000000000000000000000000000000000000000">
SELECT dbo.Fred_Table7.ObjectID FROM dbo.Fred_Table7 WHERE (dbo.Fred_Table7.ObjectID not  in (23) AND dbo.Fred_Table7.ObjectID &lt;= 4294966615 AND (dbo.Fred_Table7.LastModifyTime &gt;CAST('2017 10 17 20 08 18 115' AS VARBINARY(32)) OR dbo.Fred_Table7.LastModifyTime =CAST('2017 10 17 20 08 18 115' AS VARBINARY(32)) AND dbo.Fred_Table7.ObjectID &gt; 2404307) AND (dbo.Fred_Table7.SI_INSTANCE_OBJECT = 0 AND dbo.Fred_Table7.SI_RUNNABLE_OBJECT = 0  OR (dbo.Fred_Table7.SI_INSTANCE_OBJECT = 1 OR dbo.Fred_Table7.SI_RUNNABLE_OBJECT = 1)  AND dbo.Fred_Table7.SI_RUNNABLE_OBJECT = 0 AND dbo.Fred_Table7.ScheduleStatus = 1) AND dbo.Fred_Table7.TypeID in (2, 8, 21, 67, 68, 260, 261, 262, 266, 267, 270, 271, 272, 277, 279, 281, 282, 289, 295, 301, 337, 338, 339, 340, 343, 349, 351, 355, 358, 361, 365, 368, 372, 373, 379, 385, 387, 389, 391, 394, 396, 407, 408, 414, 415, 430, 431) AND (SI_PLUGIN_OBJECT = 0) AND dbo.Fred_Table7.SI_HIDDEN_OBJECT = 0) ORDER BY dbo    </frame>
      </executionStack>
      <inputbuf>
SELECT dbo.Fred_Table7.ObjectID FROM dbo.Fred_Table7 WHERE (dbo.Fred_Table7.ObjectID not  in (23) AND dbo.Fred_Table7.ObjectID &lt;= 4294966615 AND (dbo.Fred_Table7.LastModifyTime &gt;CAST('2017 10 17 20 08 18 115' AS VARBINARY(32)) OR dbo.Fred_Table7.LastModifyTime =CAST('2017 10 17 20 08 18 115' AS VARBINARY(32)) AND dbo.Fred_Table7.ObjectID &gt; 2404307) AND (dbo.Fred_Table7.SI_INSTANCE_OBJECT = 0 AND dbo.Fred_Table7.SI_RUNNABLE_OBJECT = 0  OR (dbo.Fred_Table7.SI_INSTANCE_OBJECT = 1 OR dbo.Fred_Table7.SI_RUNNABLE_OBJECT = 1)  AND dbo.Fred_Table7.SI_RUNNABLE_OBJECT = 0 AND dbo.Fred_Table7.ScheduleStatus = 1) AND dbo.Fred_Table7.TypeID in (2, 8, 21, 67, 68, 260, 261, 262, 266, 267, 270, 271, 272, 277, 279, 281, 282, 289, 295, 301, 337, 338, 339, 340, 343, 349, 351, 355, 358, 361, 365, 368, 372, 373, 379, 385, 387, 389, 391, 394, 396, 407, 408, 414, 415, 430, 431) AND (SI_PLUGIN_OBJECT = 0) AND dbo.Fred_Table7.SI_HIDDEN_OBJECT = 0) ORDER BY db   </inputbuf>
    </process>
    <process id="process1b9313868" taskpriority="0" logused="284" waitresource="KEY: 19:72057594041008128 (1679744d2988)" waittime="4168" ownerId="327954910" transactionname="implicit_transaction" lasttranstarted="2017-10-17T17:21:04.417" XDES="0x57788c3a8" lockMode="X" schedulerid="2" kpid="11128" status="suspended" spid="60" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2017-10-17T17:21:04.417" lastbatchcompleted="2017-10-17T17:21:04.417" lastattention="1900-01-01T00:00:00.417" hostpid="4012" loginname="Login1" isolationlevel="read committed (2)" xactid="327954910" currentdb="19" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128058">
      <executionStack>
        <frame procname="adhoc" line="1" stmtstart="86" sqlhandle="0x0200000011d6ca3a9ba3506e8d74888b6554043825c5204e0000000000000000000000000000000000000000">
UPDATE dbo.Fred_Table7 SET Version = @P1, LastModifyTime = @P2 WHERE ObjectID = @P3 AND Version = @P4    </frame>
        <frame procname="unknown" line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
unknown    </frame>
      </executionStack>
      <inputbuf>
(@P1 int,@P2 varbinary(32),@P3 int,@P4 int)UPDATE dbo.Fred_Table7 SET Version = @P1, LastModifyTime = @P2 WHERE ObjectID = @P3 AND Version = @P4   </inputbuf>
    </process>
  </process-list>
  <resource-list>
    <keylock hobtid="72057594040745984" dbid="19" objectname="Database12.dbo.Fred_Table7" indexname="ObjectID_I7" id="lock52e379a80" mode="X" associatedObjectId="72057594040745984">
      <owner-list>
        <owner id="process1b9313868" mode="X" />
      </owner-list>
      <waiter-list>
        <waiter id="process5776a5498" mode="S" requestType="wait" />
      </waiter-list>
    </keylock>
    <keylock hobtid="72057594041008128" dbid="19" objectname="Database12.dbo.Fred_Table7" indexname="LastModifyTime_I7" id="lock35b41fa80" mode="S" associatedObjectId="72057594041008128">
      <owner-list>
        <owner id="process5776a5498" mode="S" />
      </owner-list>
      <waiter-list>
        <waiter id="process1b9313868" mode="X" requestType="wait" />
      </waiter-list>
    </keylock>
  </resource-list>
</deadlock>
--

You can piece through it, using the <inputbuf> tags to find the queries against Fred_Table7 that are involved and the loginname of Login1, but is this really how you want to consume this data?

http://3.bp.blogspot.com/_KyTS_XQfpkA/TGHAMAdjUjI/AAAAAAAAEx4/xLZ0dYgTt1g/s1600/HaHaNo.jpg
A much easier way to deal with this information is in the Deadlock Graph.  


The Deadlock Graph turns the XML information shown above into, well, a graph.  You can see the two transactions as the circles on the outside and the two objects involved in the rectangles in the center of the diagram.  Sometimes there are more than two transactions involved in a deadlock (beyond the scope of this post) but if there are more than two they will show up in the graph as well.  When you open the graph in SQL Server Management Studio you can hover over the transaction circle and see the underlying query and some other transaction data.

--

As mentioned above there is a trace event to return the Deadlock Graph, and you can pull it from Extended Events as well with a query.

NOTE - there was a change in the way deadlock information is processed in XEvents between SQL Server 2008R2 and SQL Server 2012, so there are different scripts depending on version:

--
/*
SQL Server Deadlock Graph Extraction from Ring Buffer
SQL Server 2008/2008R2
Modified from https://www.sqlservercentral.com/Forums/1399315/System-health-extended-event-session-does-not-capture-latest-deadlocks?PageIndex=1
*/
;WITH SystemHealth
AS (
   SELECT 
      CAST ( target_data AS xml ) AS SessionXML
   FROM 
      sys.dm_xe_session_targets st
   INNER JOIN 
      sys.dm_xe_sessions s 
   ON 
      s.[address] = st.event_session_address
   WHERE 
      name = 'system_health'
)
SELECT 
   Deadlock.value ( '@timestamp', 'datetime' ) AS DeadlockDateTime
,   CAST ( Deadlock.value ( '(data/value)[1]', 'Nvarchar(max)' ) AS XML ) AS DeadlockGraph
FROM 
   SystemHealth s
CROSS APPLY 
   SessionXML.nodes ( '//RingBufferTarget/event' ) AS t (Deadlock)
WHERE 
   Deadlock.value ( '@name', 'nvarchar(128)' ) = N'xml_deadlock_report'
ORDER BY
   Deadlock.value ( '@timestamp', 'datetime' );
--
/*
SQL Server Deadlock Graph Extraction from Ring Buffer
SQL Server 2012+
Modified from https://www.red-gate.com/simple-talk/sql/database-administration/handling-deadlocks-in-sql-server/
*/
SELECT  XEvent.query('(event/data/value/deadlock)[1]') AS DeadlockGraph
FROM    ( SELECT    XEvent.query('.') AS XEvent
          FROM      ( SELECT    CAST(target_data AS XML) AS TargetData
                      FROM      sys.dm_xe_session_targets st
                                JOIN sys.dm_xe_sessions s
                                 ON s.address = st.event_session_address
                      WHERE     s.name = 'system_health'
                                AND st.target_name = 'ring_buffer'
                    ) AS Data
                    CROSS APPLY TargetData.nodes
                  ('RingBufferTarget/event[@name="xml_deadlock_report"]')
                    AS XEventData ( XEvent )
        ) AS src;   
--

Another awesome change in SQL 2012+ is that the System_Health session now writes to a file target as well as the ring buffer memory target.  This is cool because the files persist more data than can be maintained in the ring buffer, giving you more history.  Retrieving data from the XEL files requires a different query, which is set up to dynamically read the XEL files from your default LOG path, wherever it may be:

--
/*
SQL Server Deadlock Graph Extraction from XEL file target
SQL Server 2012+
http://www.sqlservercentral.com/blogs/sql-geek/2017/10/07/extracting-deadlock-information-using-system_health-extended-events/
*/

CREATE TABLE #errorlog (LogDate DATETIME , ProcessInfo VARCHAR(100), [Text] VARCHAR(MAX))
DECLARE @tag VARCHAR (MAX) , @path VARCHAR(MAX);
INSERT INTO #errorlog EXEC sp_readerrorlog;
SELECT @tag = text
FROM #errorlog 
WHERE [Text] LIKE 'Logging%MSSQL\Log%';
DROP TABLE #errorlog;
SET @path = SUBSTRING(@tag, 38, CHARINDEX('MSSQL\Log', @tag) - 29);
SELECT 
CONVERT(xml, event_data).query('/event/data/value/child::*') AS DeadlockReport,
CONVERT(xml, event_data).value('(event[@name="xml_deadlock_report"]/@timestamp)[1]', 'datetime') 
AS Execution_Time
FROM sys.fn_xe_file_target_read_file(@path + '\system_health*.xel', NULL, NULL, NULL)
WHERE OBJECT_NAME like 'xml_deadlock_report';
--

As an example, here is the output of the  ring buffer query versus the file target (XEL) query on a given server:


The file target has six times as much history as the ring buffer (in this case) - you should "always" see the same or more data in the XEL files than the ring buffer.

--

All of these scripts return XML output like the XML shown above - they don't return a graph directly.

https://memegenerator.net/img/instances/71836370/say-whaaat.jpg
Getting the graph is pretty straightforward - once you know how.  Open the XML document, either by clicking the link in the resultset or by copy-pasting the XML into the text editor of your choice:



Once the XML document is open, either in Management Studio or in your text editor, simply save the document with the extension .XDL rather than .XML.  The new file will have an icon like this: 



If you open this file in Management Studio (via double-click or Open>File in Management Studio) you will be presented with the actual Deadlock Graph as seen above, complete with ability to hover to see transaction information, etc. 

...unless you are using SQL Server 2008/2008R2 Management Studio, in which case you will see this:


Argghhhhh!

As described by Extended Events guru Jonathan Kehayias (blog/@sqlpoolboy) in a blog post here, this is an artifact of SQL 2008 Management Studio not being ready to render the deadlock output from XEvents (as opposed to the deadlock output from Trace).  As Jonathan notes there are two ways to handle this - the first way is to use an up-level Management Studio of at least version 2012, either by upgrading the local copy of Management Studio (you can download the current Management Studio that is still compatible back to 2008 from Microsoft here) or by copying the XDL file to another server with a more recent Management Studio.  The other way to open the file under SQL 2008/2008R2 is to use SentryOne's free tool Plan Explorer Pro.  I usually don't pitch company products but this is a really useful 100% free tool.

The most common answer will be the first - copy the XDL to a server with 2012+ Management Studio, and all will be well.

--

Hope this helps!


But I Don't *Want* To "Restore from a backup of the database or repair the database!"

Do you like Log Shipping?

Before my time at Ntirety I had done quite a bit of work with failover clustering and mirroring (trending into some Availability Groups) but I had never really done that much replication or log shipping.  Even at my previous managed services remote DBA job I simply hadn't run across it...

I have said in this blog more than once that education (training and real-world) is one of the most important parts of being a DBA, and that we should always be learning; my time at Ntirety has definitely educated me about replication and log shipping!

https://memegenerator.net/img/instances/500x/42421555/be-careful-what-you-wish-for-punk.jpg
This Log Shipping story (maybe it's a cautionary tale - so many of these stories are) begins with a failed server migration.  We were trying to replace a server with a new VM and due to some issues completely unrelated to log shipping, we ended up reverting to the original server (maybe once we figure that out I will write about it too).  

The problem was that I had already dropped the log shipping from OldServer01.Database99 to LSSecondary.Database99 and created new log shipping from NewServer01.Database99 to LSSecondary.Database99.  Part of the backout plan was dropping *that* log shipping and recreating the original relationship between OldServer01 and LSSecondary.

I had the creation process scripted out, so I ran the scripts, made sure they succeeded, checked that LSSecondary.Database99 reflected the current data (we had it in Standby/Read-Only) and then walked away from log shipping to continue the rest of the revert process.

http://vignette1.wikia.nocookie.net/cardfight/images/e/e1/Funniest_Memes_you-my-friend-just-made-a-big-mistake_10940.jpeg/revision/latest?cb=20150805104725

Monday morning rolled around and one of the business users reported that Database99 on LSSecondary was in Restoring rather than Standby, and wanted me to check it out.  Sure enough, Database99 was in Restoring even though the LSRestore job wasn't running and there was no active RESTORE process running.

OK....

I check the properties of the log shipping secondary in the Management Studio GUI and they looked as expected:




You can see these same settings via T-SQL with the following query:
SELECT secondary_database
, restore_mode
, disconnect_users
, last_restored_file
FROM msdb.dbo.log_shipping_secondary_databases 

Since the database shows restore_mode 1 (Standby) and disconnect_users 1 (true), there is no configuration-based reason for the database to be in RESTORING.

The next stop was the SQL Server Error Log.  This probably should have been my first stop, but since I had just set this log shipping up I thought of the configuration first.  The Error Log was a horror show:

--

Error: 3456, Severity: 16, State: 1.
Could not redo log record (31185729:63924:22), for transaction ID (0:0), on page (1:511238), database 'Database99' (database ID 5). Page: LSN = (31185729:63857:14), type = 16. Log: OpCode = 7, context 24, PrevPageLSN: (31185729:63804:511). Restore from a backup of the database, or repair the database

--

…followed by 100+ lines of Error Stack Dump <cue scary music>:

--

Using 'dbghelp.dll' version '4.0.5'
**Dump thread - spid = 0, EC = 0x00000002E61A20F0
***Stack Dump being sent to F:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG\SQLDump0001.txt
* *******************************************************************************
*
* BEGIN STACK DUMP:
*   10/22/17 03:00:34 spid 55
*
* HandleAndNoteToErrorlog: Exception raised, major=34, minor=56, severity=16
*
* Input Buffer 506 bytes -
*             RESTORE LOG [Database99] FROM DISK = N'\\LSSecondary\
*  Database99\Database99_20171022053000.trn' WITH FILE = 1, STANDBY =
*   N'F:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DA
*  TA\Database99_20171022090018.tuf'
*  
*
*  MODULE                          BASE      END       SIZE
* sqlservr                       0000000000A40000  00000000045EDFFF  03bae000
* ntdll                          0000000077510000  00000000776B8FFF  001a9000
* kernel32                       00000000772F0000  000000007740EFFF  0011f000
* KERNELBASE                     000007FEFD560000  000007FEFD5CBFFF  0006c000
* ADVAPI32                       000007FEFF730000  000007FEFF80AFFF  000db000
* msvcrt                         000007FEFED70000  000007FEFEE0EFFF  0009f000
* sechost                        000007FEFE5B0000  000007FEFE5CEFFF  0001f000
* RPCRT4                         000007FEFE630000  000007FEFE75CFFF  0012d000
* MSVCR80                        0000000074F50000  0000000075018FFF  000c9000
* MSVCP80                        0000000074BD0000  0000000074CD8FFF  00109000
* sqlos                          00000000745D0000  00000000745D6FFF  00007000
* Secur32                        000007FEFCFE0000  000007FEFCFEAFFF  0000b000
* SSPICLI                        000007FEFD260000  000007FEFD284FFF  00025000
* pdh                            000007FEF6F00000  000007FEF6F4DFFF  0004e000
* SHLWAPI                        000007FEFF680000  000007FEFF6F0FFF  00071000
* GDI32                          000007FEFEC20000  000007FEFEC86FFF  00067000
* USER32                         0000000077410000  0000000077509FFF  000fa000
* LPK                            000007FEFF810000  000007FEFF81DFFF  0000e000
* USP10                          000007FEFF190000  000007FEFF258FFF  000c9000
* USERENV                        000007FEFC740000  000007FEFC75DFFF  0001e000
* profapi                        000007FEFD3D0000  000007FEFD3DEFFF  0000f000
* WINMM                          000007FEF6EC0000  000007FEF6EFAFFF  0003b000
* IPHLPAPI                       000007FEFB670000  000007FEFB696FFF  00027000
* NSI                            000007FEFE620000  000007FEFE627FFF  00008000
* WINNSI                         000007FEFB660000  000007FEFB66AFFF  0000b000
* opends60                       00000000745E0000  00000000745E7FFF  00008000
* NETAPI32                       000007FEFAF00000  000007FEFAF15FFF  00016000
* netutils                       000007FEFC970000  000007FEFC97BFFF  0000c000
* srvcli                         000007FEFCF20000  000007FEFCF42FFF  00023000
* wkscli                         000007FEFAEE0000  000007FEFAEF4FFF  00015000
* LOGONCLI                       000007FEFCC80000  000007FEFCCAFFFF  00030000
* SAMCLI                         000007FEFA310000  000007FEFA323FFF  00014000
* BatchParser                    00000000743D0000  00000000743FCFFF  0002d000
* IMM32                          000007FEFF700000  000007FEFF72DFFF  0002e000
* MSCTF                          000007FEFEE10000  000007FEFEF18FFF  00109000
* psapi                          00000000776D0000  00000000776D6FFF  00007000
* instapi10                      0000000074AA0000  0000000074AACFFF  0000d000
* cscapi                         000007FEF6EB0000  000007FEF6EBEFFF  0000f000
* sqlevn70                       0000000071050000  0000000071250FFF  00201000
* ntmarta                        000007FEFC530000  000007FEFC55CFFF  0002d000
* WLDAP32                        000007FEFF130000  000007FEFF181FFF  00052000
* CRYPTSP                        000007FEFCBE0000  000007FEFCBF7FFF  00018000
* rsaenh                         000007FEFC9E0000  000007FEFCA26FFF  00047000
* CRYPTBASE                      000007FEFD300000  000007FEFD30EFFF  0000f000
* BROWCLI                        000007FEF6DE0000  000007FEF6DF1FFF  00012000
* AUTHZ                          000007FEFCE70000  000007FEFCE9EFFF  0002f000
* MSCOREE                        000007FEF9F90000  000007FEF9FFEFFF  0006f000
* mscoreei                       000007FEF9EE0000  000007FEF9F77FFF  00098000
* ole32                          000007FEFEF20000  000007FEFF122FFF  00203000
* credssp                        000007FEFC900000  000007FEFC909FFF  0000a000
* msv1_0                         000007FEFCB80000  000007FEFCBD1FFF  00052000
* cryptdll                       000007FEFCF50000  000007FEFCF63FFF  00014000
* kerberos                       000007FEFCCB0000  000007FEFCD67FFF  000b8000
* MSASN1                         000007FEFD4B0000  000007FEFD4BEFFF  0000f000
* schannel                       000007FEFC980000  000007FEFC9D6FFF  00057000
* CRYPT32                        000007FEFD630000  000007FEFD79CFFF  0016d000
* security                       0000000074210000  0000000074212FFF  00003000
* WS2_32                         000007FEFE5D0000  000007FEFE61CFFF  0004d000
* SHELL32                        000007FEFD820000  000007FEFE5A8FFF  00d89000
* OLEAUT32                       000007FEFEC90000  000007FEFED66FFF  000d7000
* ftimport                       0000000060000000  0000000060024FFF  00025000
* MSFTE                          0000000049980000  0000000049D2DFFF  003ae000
* VERSION                        000007FEFC520000  000007FEFC52BFFF  0000c000
* dbghelp                        000000006F0C0000  000000006F21DFFF  0015e000
* WINTRUST                       000007FEFD7A0000  000007FEFD7DAFFF  0003b000
* ncrypt                         000007FEFCE20000  000007FEFCE6FFFF  00050000
* bcrypt                         000007FEFCDF0000  000007FEFCE11FFF  00022000
* mswsock                        000007FEFCC20000  000007FEFCC74FFF  00055000
* wship6                         000007FEFCDE0000  000007FEFCDE6FFF  00007000
* wshtcpip                       000007FEFC620000  000007FEFC626FFF  00007000
* ntdsapi                        000007FEFB0A0000  000007FEFB0C6FFF  00027000
* DNSAPI                         000007FEFCA30000  000007FEFCA8AFFF  0005b000
* rasadhlp                       000007FEFAA70000  000007FEFAA77FFF  00008000
* fwpuclnt                       000007FEFB560000  000007FEFB5B2FFF  00053000
* bcryptprimitives               000007FEFC7A0000  000007FEFC7EBFFF  0004c000
* CLBCatQ                        000007FEFF560000  000007FEFF5F8FFF  00099000
* sqlncli10                      0000000072690000  0000000072947FFF  002b8000
* COMCTL32                       000007FEF6F50000  000007FEF6FEFFFF  000a0000
* COMDLG32                       000007FEFF4C0000  000007FEFF556FFF  00097000
* SQLNCLIR10                     0000000071260000  0000000071296FFF  00037000
* netbios                        000007FEF4040000  000007FEF4049FFF  0000a000
* xpsqlbot                       0000000074270000  0000000074277FFF  00008000
* xpstar                         0000000065380000  0000000065407FFF  00088000
* SQLSCM                         0000000074590000  000000007459DFFF  0000e000
* ODBC32                         000007FEF2BA0000  000007FEF2C50FFF  000b1000
* ATL80                          0000000074F00000  0000000074F1FFFF  00020000
* odbcint                        00000000733D0000  0000000073407FFF  00038000
* clusapi                        000007FEF6760000  000007FEF67AFFFF  00050000
* resutils                       000007FEF6740000  000007FEF6758FFF  00019000
* xpstar                         0000000074240000  0000000074264FFF  00025000
* xplog70                        0000000074280000  000000007428FFFF  00010000
* xplog70                        0000000074220000  0000000074221FFF  00002000
* dsrole                         000007FEFBCD0000  000007FEFBCDBFFF  0000c000
* RpcRtRemote                    000007FEFD3B0000  000007FEFD3C3FFF  00014000
* ssdebugps                      000007FEEAE20000  000007FEEAE31FFF  00012000
* apphelp                        000007FEFD2A0000  000007FEFD2F6FFF  00057000
* msxmlsql                       00000000632C0000  0000000063410FFF  00151000
* dbghelp                        000000006F8D0000  000000006FA2DFFF  0015e000
*
*     P1Home: 0044004C00510053:  
*     P2Home: 00000000004A87B4:  0122002501210023  0124002900230027  0126002D0025002B  012800310127002F  012A003501290033  002C0039002B0037  
*     P3Home: 00000000004C8310:  000000002A010FF0  000000002A010FF0  0000000017407610  000000001AA86DD0  000000001736E3F0  000000002A027FD0  
*     P4Home: 0000000000000000:  
*     P5Home: 0052004500AE00AC:  
*     P6Home: 000000001742C658:  0050005C003A0046  00720067006F0072  00460020006D0061  00730065006C0069  00630069004D005C  006F0073006F0072  
* ContextFlags: 000000000010000F:  0000000000000000  0000000000000000  0000000000000000  0000000000000000  0000000000000000  0000000000000000  
*      MxCsr: 0000000000001FA0:  
*      SegCs: 0000000000000033:  
*      SegDs: 000000000000002B:  
*      SegEs: 000000000000002B:  
*      SegFs: 0000000000000053:  
*      SegGs: 000000000000002B:  
*      SegSs: 000000000000002B:  
*     EFlags: 0000000000000202:  
*        Rax: 0000000000991AD3:  0000000000000000  0000000000000000  0000000000000000  0000000000000000  0000000000000000  0000000000000000  
*        Rcx: 000000002C568260:  0044004C00510053  00000000004A87B4  00000000004C8310  0000000000000000  0052004500AE00AC  000000001742C658  
*        Rdx: 0000000000000000:  
*        Rbx: 0000000000000000:  
*        Rsp: 000000002C568870:  0000000000000000  00000002E61A20F0  000000002C568AA0  000007FFFFEDC000  00000000000042AC  0000000000000000  
*        Rbp: 0000000000000000:  
*        Rsi: 00000002E61A20F0:  00000002E61A20F0  000000063039CEA0  0000000000000000  0000000000000004  00000002E61A2110  00000002E61A2110  
*        Rdi: 000000002C568AA0:  000000001742C650  FFFFFFFFFFFFFFFE  000007FFFFEDB4A8  0000000000000000  00000006699981A0  0000000000A426C3  
*         R8: 0000000000000000:  
*         R9: 0000000000000000:  
*        R10: 0000000000000000:  
*        R11: 000000002C569080:  0044004C00510053  00000000004A87B4  00000000004C8310  0000000000000000  0052004500AE00AC  000000001742C658  
*        R12: 0000000000000001:  
*        R13: 000000063039CEA0:  000000063039C080  000000063039C3B0  000000063039CAD0  00000002E61A2080  000000063039CF10  000000008089CE80  
*        R14: 000000002C569F60:  0064006E00610048  006E00410065006C  0074006F004E0064  0045006F00540065  0072006F00720072  003A0067006F006C  
*        R15: 000000000000003F:  
*        Rip: 000007FEFD56B3DD:  C3000000C8C48148  003B830874F68548  0001BD1689660376  02D6840FFD3B0000  F3840F02FF830000  870F07FF83000002  
* *******************************************************************************
* -------------------------------------------------------------------------------
* Short Stack Dump
000007FEFD56B3DD Module(KERNELBASE+000000000000B3DD)
0000000002AE981D Module(sqlservr+00000000020A981D)
0000000002AEDD3A Module(sqlservr+00000000020ADD3A)
0000000002B210C3 Module(sqlservr+00000000020E10C3)
0000000002B20F5D Module(sqlservr+00000000020E0F5D)
0000000000E3F5D8 Module(sqlservr+00000000003FF5D8)
0000000000E3F2A6 Module(sqlservr+00000000003FF2A6)
0000000002A47DBB Module(sqlservr+0000000002007DBB)
0000000002B24941 Module(sqlservr+00000000020E4941)
0000000002D14BFB Module(sqlservr+00000000022D4BFB)
0000000002CC0A44 Module(sqlservr+0000000002280A44)
0000000002CC0719 Module(sqlservr+0000000002280719)
0000000002CBFDC5 Module(sqlservr+000000000227FDC5)
0000000002D4C1B4 Module(sqlservr+000000000230C1B4)
0000000002D4B845 Module(sqlservr+000000000230B845)
0000000002CA321E Module(sqlservr+000000000226321E)
0000000002D4DD33 Module(sqlservr+000000000230DD33)
0000000002D52D0F Module(sqlservr+0000000002312D0F)
0000000002CC8CE6 Module(sqlservr+0000000002288CE6)
0000000001F288A1 Module(sqlservr+00000000014E88A1)
0000000000A99D59 Module(sqlservr+0000000000059D59)
0000000000A9A9B8 Module(sqlservr+000000000005A9B8)
0000000000A9B30C Module(sqlservr+000000000005B30C)
0000000000A9C1A6 Module(sqlservr+000000000005C1A6)
0000000000AE5342 Module(sqlservr+00000000000A5342)
0000000000A4BBD8 Module(sqlservr+000000000000BBD8)
0000000000A4B8BA Module(sqlservr+000000000000B8BA)
0000000000A4B6FF Module(sqlservr+000000000000B6FF)
0000000000F68FB6 Module(sqlservr+0000000000528FB6)
0000000000F69175 Module(sqlservr+0000000000529175)
0000000000F69839 Module(sqlservr+0000000000529839)
0000000000F69502 Module(sqlservr+0000000000529502)
0000000074F537D7 Module(MSVCR80+00000000000037D7)
0000000074F53894 Module(MSVCR80+0000000000003894)
00000000773059CD Module(kernel32+00000000000159CD)
000000007753B981 Module(ntdll+000000000002B981)
Stack Signature for the dump is 0x000000012832F01D
External dump process return code 0x20000001.  External dump process returned no errors. 

--

... followed by:

--

Error: 9003, Severity: 20, State: 6.
The log scan number (31185729:63630:107) passed to log scan in database 'Database99' is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup. 
--

https://thecatholicgeeks.files.wordpress.com/2015/05/beafraid.jpg

Pretty scary looking errors…I checked the database on OldServer01, especially since the CheckDB job hadn't run in several days (back to before the migration attempt), and thankfully it was clear:

--

Date and time: 2017-10-23 10:49:35
Server: OldServer01
Version: 10.50.4305.0
Edition: Enterprise Edition (64-bit)
Procedure: [master].[dbo].[DatabaseIntegrityCheck]
Parameters: @Databases = 'Database99', @CheckCommands = 'CHECKDB', @PhysicalOnly = 'N', @NoIndex = 'N', @ExtendedLogicalChecks = 'N', @TabLock = 'N'
, @FileGroups = NULL, @Objects = NULL, @LockTimeout = NULL, @LogToTable = 'N', @Execute = 'Y'
Date and time: 2017-10-23 10:49:35
Database: [Database99]
Status: ONLINE
Standby: No
Updateability: READ_WRITE
User access: MULTI_USER
Is accessible: Yes
Recovery model: FULL
Date and time: 2017-10-23 10:49:35
Command: DBCC CHECKDB ([Database99]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
Outcome: Succeeded
Duration: 00:35:44
Date and time: 2017-10-23 11:25:19


--

After some poking around and several rebuilds of log shipping (deleting the secondary, taking a new FULL on OldServer01, restoring it to LSSecondary, and then recreating the process) I figured out the issue was related to the first part of the first error: 

“Could not redo log record (31185747:40079:1), for transaction ID (0:1129634848), on page (1:10522412), database 'Database99'” 

The redo reader process was looking in the log shipping folder \\LSSecondary\Database99 and trying to restore the oldest TRN log backup files, which were related to the log shipping configuration from before the migration/fail-back over the weekend.  Those older TRN files had log sequence numbers (LSNs) from before the FULL backup used to initialize the re-created log shipping, which caused the restore process to error out.

A colleague suggested that the log shipping restore should be skipping TRN files that are "too old" via LSN (which matched my understanding of how it "should" work) and that maybe this reflected one of the old TRN files having an invalid file header.  After some consideration we realized that probably was *not* the case since the error message doesn't reflect not being able to read the backup file header, and if it couldn't read the header how would it have a log record number (31185747:40079:1) to include in the "could not redo" portion of the error?
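
(If you ever need to check this yourself, RESTORE HEADERONLY will show the LSN range inside a given log backup - a quick sketch against one of the suspect files:)

--

-- Inspect the LSN range inside a suspect log backup file
RESTORE HEADERONLY
FROM DISK = N'\\LSSecondary\Database99\Database99_20171022053000.trn';
-- Compare FirstLSN/LastLSN/DatabaseBackupLSN in the output against the FULL backup used to
-- initialize log shipping - a file whose LSN range predates that FULL backup cannot be
-- restored on top of it.

--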

I deleted the old backups from \\LSSecondary\Database99 and tried to run the process again, and received the same errors.  When I looked in \\LSSecondary\Database99 I saw all of the old TRN files were back!

I had failed to consider (sigh) that the LSCopy job would bring all of the "old" files from the primary folder at \\OldServer01\Database99 back to \\LSSecondary\Database99, putting me back in the same situation.

Take 2 - I deleted all of the log shipping backups from *both* \\OldServer01\Database99 and \\LSSecondary\Database99 and then re-created the log shipping one last time…

https://memegenerator.net/img/instances/500x/27017972/success.jpg
This time the LS_Restore job functioned as expected, Database99 was in Standby/Read-Only on LSSecondary, and instead of my 100+ line stack dump I received some very respectable information messages in the SQL Error Log on LSSecondary:

--
Setting database option SINGLE_USER to ON for database Database99.
Starting up database 'Database99'.
The database 'Database99' is marked RESTORING and is in a state that does not allow recovery to be run.
Recovery is writing a checkpoint in database 'Database99' (5). This is an informational message only. No user action is required.
Recovery completed for database Database99 (database ID 5) in 2 second(s) (analysis 1080 ms, redo 0 ms, undo 466 ms.) This is an informational message only. No user action is required.
Starting up database 'Database99'.
CHECKDB for database 'Database99' finished without errors on 2017-10-23 10:49:44.400 (local time). This is an informational message only; no user action is required.
Log was restored. Database: Database99, creation date(time): 2017/10/22(02:20:41), first LSN: 31185760:34887:1, last LSN: 31185760:44042:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'\\LSSecondary\Database99\Database99_20171023204500.trn'}). This is an informational message. No user action is required.
Setting database option MULTI_USER to ON for database Database99.
 --

The SINGLE_USER and MULTI_USER messages are artifacts of telling the log shipping to disconnect users when restoring backups.

--

Hope this helps!

PASS Summit 2017 - Live and in Living Color!

UPDATE - the feed and schedule are live at http://www.pass.org/summit/2017/PASStv.aspx

http://rochistory.com/blog/wp-content/uploads/2014/09/NBC-logo.png
Not in Seattle for some reason this week?  Like maybe you stayed home because it was Halloween week, or you don't have the funds, or you are preparing for the zombie apocalypse?

http://www.demotivation.us/media/demotivators/demotivation.us_Zombie-Apocalypse-Preparation-level-genius_133338427933.jpg

PASS is ready to help you out - starting today (Wednesday 11/01/2017) at 8:10 PDT (15:10 UTC) PASS will stream the keynotes, some sessions, and certain other round tables and events LIVE online at http://www.pass.org/summit/2017/Home.aspx

The schedule for the next few days isn't up yet but will be soon - tune in!

https://radiosoundsfamiliar.com/resources/Rebrand/kep%20calm%20and%20tune%20in%20logo.jpg.opt197x337o0%2C0s197x337.jpg


Come to SQL Saturday Providence!

Tomorrow 12/09/2017 is SQL Saturday #694 in Providence, RI.

I will be presenting my SQL Server Health Check session first thing in the morning:

--

Does it Hurt When I Do This? Performing a SQL Server Health Check

Speaker: Andy Galbraith
Duration: 60 minutes
Track: Enterprise Database Administration & Deployment
How often do you review your SQL Servers for basic security, maintenance, and performance issues?  Many of the servers I "inherit" as a managed services provider have quite a few gaping holes. It is not unusual to find databases that are never backed up, servers with constant login failures (is it an attack or a bad connection string?), and servers that need more RAM/CPU/etc. (or sometimes that even have too much!) 

Come learn how to use freely available tools from multiple layers of the SQL Server stack to check your servers for basic issues like missing backups and CheckDB as well as for more advanced issues like page life expectancy problems and improper indexing. If you are responsible in any way for a Microsoft SQL Server (DBA, Windows Admin, even a Developer) you will see value in this session!
--

In total there are six tracks with five individual sessions each, so a full day of content!

Registration is still open at https://www.sqlsaturday.com/694/RegisterNow.aspx - come to Bryant University tomorrow and check it out!





T-SQL Tuesday #98 – Take Small Bites!

https://media.makeameme.org/created/mega-bytes-well.jpg
It's T-SQL Tuesday time again - the monthly blog party was started by Adam Machanic (blog/@AdamMachanic) and each month someone different selects a new topic.  This month's cycle is hosted by Arun Sirpal (blog/@blobeater1) and his chosen topic is "Your Technical Challenges Conquered" - considering technical challenges and what we do to resolve them.

This was kind of a hard one for me because a lot of my blog posts are already about technical challenges from work, which means I have already written about many of the things that would have qualified for this category - but I did find one item from a few months ago that I hadn't documented yet...

--

As a services/production DBA I frequently get technical challenges related to issues about disk space, which usually tie back to issues about file growth, which usually tie back to issues about code, which usually tie back to issues about people (not necessarily developers - and yes, I do believe developers are people...usually...)
https://i0.wp.com/www.adamtheautomator.com/wp-content/uploads/2015/04/Worked-Fine-In-Dev-Ops-Problem-Now.jpg
--

This story is yet another tale of woe starting with a page at 1am for a filling drive, in this case the LOG drive.

I went to my default trace query (described here) and found...a few...records for growths:

EventName | DatabaseName | FileName | StartTime | ApplicationName | HostName | LoginName
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:08 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:08 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:08 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:08 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:08 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:07 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:07 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:07 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:07 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:07 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:06 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:05 | SQLCMD | DBSQL | NT SERVICE\SQLSERVERAGENT
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:04 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 1:00 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:59 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:59 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:59 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:59 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:59 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:59 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:58 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:58 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:58 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:58 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:58 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:57 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:57 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:57 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:57 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:57 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:56 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:56 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:56 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:56 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:56 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:56 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:56 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:55 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:55 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:54 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:54 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:53 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:52 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:52 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:51 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:51 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:50 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:49 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:49 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:48 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:48 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:47 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:47 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:46 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:45 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:45 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:44 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:44 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:44 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:43 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:43 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:43 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:43 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:42 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:42 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:42 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:42 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:42 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:41 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:41 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:41 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:41 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:40 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:40 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:40 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:40 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:39 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:39 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:39 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:38 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:38 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:38 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:38 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:38 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:37 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:37 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:37 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:37 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:36 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:36 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:36 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:36 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:36 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:35 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:35 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:35 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:35 | NULL | NULL | sa
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:34 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:33 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:33 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:32 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:32 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:31 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:31 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:31 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:30 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:29 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:29 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:28 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:27 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:26 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:26 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:25 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:25 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:24 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:24 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:23 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:22 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:20 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:19 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:18 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:16 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:15 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:14 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:14 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:13 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:12 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:11 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:10 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:09 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:08 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:07 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:06 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:05 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:04 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:04 | .Net SqlClient Data Provider | HostServer01 | App01admin
Log File Auto Grow | App01 | App01_log | 9/26/2017 0:03 | .Net SqlClient Data Provider | HostServer01 | App01admin

There was a single growth caused by SQLCMD on DBSQL – this was the index maintenance job running directly on the SQL Server.  Almost all of the other growths were triggered by a .NET application on application server HostServer01.  The App01 database's LDF/log file was set to autogrow in 512MB increments, so each row in the table above represents a single growth of that size – as you can see, the workload was generating 512MB of transaction log every 30-60 seconds, which is a significant load.
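For reference, the default trace query is along these lines (a sketch - it reads the currently active default trace file and filters down to the autogrowth events):

--

DECLARE @TracePath NVARCHAR(260);

/* find the default trace file */
SELECT @TracePath = [path]
FROM sys.traces
WHERE is_default = 1;

/* pull the autogrowth events out of it */
SELECT te.name AS EventName
, t.DatabaseName
, t.[FileName]
, t.StartTime
, t.ApplicationName
, t.HostName
, t.LoginName
FROM sys.fn_trace_gettable(@TracePath, DEFAULT) t
INNER JOIN sys.trace_events te
ON t.EventClass = te.trace_event_id
WHERE te.name IN ('Data File Auto Grow', 'Log File Auto Grow')
ORDER BY t.StartTime DESC;

--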

The next step was setting up a basic XEvents session to track what statement(s) might be generating the load - sure enough there was a serious smoking gun:


https://i.pinimg.com/originals/c0/19/57/c01957fd9e59fe4a46a67ad49e049cc0.jpg

Check out the row near the middle with a very large duration, logical_reads count, *and* writes count.

Durations in SQL 2012 Extended Events are in microseconds – a very small unit.

The catch was the duration for this statement was 6,349,645,772 (6 billion) microseconds…105.83 minutes for this one query!
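For reference, the capture session itself was nothing fancy - something along these lines would do it (a sketch; the session name and file path are hypothetical, and the *_completed events carry the duration, logical_reads, and writes columns shown above):

--

CREATE EVENT SESSION [Track_App01_Statements] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
       ACTION (sqlserver.sql_text, sqlserver.client_app_name, sqlserver.client_hostname)
       WHERE (sqlserver.database_name = N'App01')
),
ADD EVENT sqlserver.rpc_completed
(
       ACTION (sqlserver.sql_text, sqlserver.client_app_name, sqlserver.client_hostname)
       WHERE (sqlserver.database_name = N'App01')
)
ADD TARGET package0.event_file (SET filename = N'T:\MSSQL\XEvents\Track_App01_Statements.xel')  /* hypothetical path */
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS);
GO

ALTER EVENT SESSION [Track_App01_Statements] ON SERVER STATE = START;
GO

--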

--

The App01 database was in SIMPLE recovery (when I find this in the wild, I always question it – isn’t point-in-time recovery important? – but that is another discussion).  The relevance here is that in SIMPLE recovery, LOG backups are irrelevant (and actually *can’t* be run) – once a transaction completes and a checkpoint occurs, the LDF/LOG file space is marked available for re-use, meaning that *usually* the LDF file doesn’t grow very large.

A database in SIMPLE recovery growing a large LDF/LOG file almost always means a long-running unit of work or an accidentally open transaction (a BEGIN TRAN with no subsequent COMMIT or ROLLBACK) – looking at the errors in the SQL Error Log from the previous night, the 9002 "log file full" errors stopped at 1:45:51am server time, which means the offending unit of work ended then one way or another (crash or success).
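A quick way to check for this condition is to look at what is holding up log re-use and whether there is an old open transaction (a sketch - ACTIVE_TRANSACTION in log_reuse_wait_desc is the telltale sign of a long-running or abandoned transaction pinning the log):

--

/* what is preventing log re-use right now? */
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'App01';

/* is there an old open transaction in the database? */
DBCC OPENTRAN ('App01');

--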

Sure enough, when I filtered the XEvents event file to statements with a duration > 100 microseconds and then scanned down to 1:45:00, I quickly saw the row shown above.   Note this doesn’t mean the unit of work was excessively large in CPU/RAM/IO/etc. (in fact the query errored out due to lack of LOG space) - but the excessive duration meant all of the tiny units of work over the previous 105 minutes had to persist in the transaction LDF/LOG file until this unit of work completed, preventing all of the LOG from that window from being marked for re-use until this statement ended.
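Reading the event file back is a function call plus a little XML shredding - something like this (a sketch; the file path is hypothetical and the duration filter is whatever threshold makes sense for your situation):

--

SELECT
  x.event_data.value('(event/@timestamp)[1]', 'datetime2') AS EventTime
, x.event_data.value('(event/data[@name="duration"]/value)[1]', 'bigint') AS Duration_microseconds
, x.event_data.value('(event/data[@name="logical_reads"]/value)[1]', 'bigint') AS LogicalReads
, x.event_data.value('(event/data[@name="writes"]/value)[1]', 'bigint') AS Writes
, x.event_data.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS SqlText
FROM
(
       SELECT CAST(event_data AS XML) AS event_data
       FROM sys.fn_xe_file_target_read_file(N'T:\MSSQL\XEvents\Track_App01_Statements*.xel', NULL, NULL, NULL)
) x
WHERE x.event_data.value('(event/data[@name="duration"]/value)[1]', 'bigint') > 100
ORDER BY EventTime;

--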

--

The query in question was this:

--

exec sp_executesql N'DELETE from SecurityAction WHERE ActionDate < @1 AND (ActionCode != 102 AND ActionCode != 103 AND ActionCode != 129 AND ActionCode != 130)',N'@1 datetime',@1='2017-09-14 00:00:00'

--

Stripping off the sp_executesql wrapper and substituting in the parameter value turned it into this:

--

DELETE from SecurityAction
WHERE ActionDate < '2017-09-14 00:00:00'
AND
(
ActionCode != 102
AND ActionCode != 103
AND ActionCode != 129
AND ActionCode != 130
)

--


Checking out App01.dbo.SecurityAction, the table was 13GB with 14 million rows.  Looking at a random TOP 1000 rows I saw rows that would satisfy this query and were at least seven months old at the time of this incident (September 2017), and there may have been even older rows – I didn’t want to spend the resources to run an ORDER BY query to find the absolute oldest row.  Based on this evidence alone I couldn’t tell whether the statement had been failing for a long time or whether the DELETE was a recent addition to the overnight processes that had been failing ever since it was added.
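For reference, the size check itself is nothing exotic - sp_spaceused returns the row count and reserved/used space for a single object:

--

USE App01
GO

EXEC sp_spaceused N'dbo.SecurityAction';
GO

--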

The DELETE looked like it must be a scheduled/regular operation from a .NET application running on HostServer01 since we were seeing it each night in a similar time window.

The immediate-term action I recommended was to run a batched DELETE to purge this table down – once it had been purged down, maybe the nightly operation would run with a short enough duration to not break the universe, although I did advise them that changing the nightly process itself to be batched wouldn't hurt (they were dealing with a vendor app, however).

The code I suggested (and we eventually went with) looked like this:

--

WHILE 1=1
BEGIN
       DELETE TOP (10000) from SecurityAction
       WHERE ActionDate < '2017-09-14 00:00:00'
       AND
       (
       ActionCode != 102
       AND ActionCode != 103
       AND ActionCode != 129
       AND ActionCode != 130
       )
END
       
--

This would delete rows in 10,000 row batches; wrapping it in the WHILE 1=1 allows the batch to run over and over without ending.  Once there are no more rows to satisfy the criteria the query *WILL CONTINUE TO RUN* but will simply delete -0- rows each time.

Once you are ready to stop the batch you simply cancel it and it will end in whatever iteration of the loop it is currently in.

I could have written a WHILE clause that actually checks for the existence of rows that meet the criteria, but there were so many rows that met the criteria right then that running the check itself would have been much more intensive than simply allowing the DELETE to run over and over.
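If you would rather have the loop stop itself, a cheap alternative (a sketch) is to check @@ROWCOUNT from the DELETE itself instead of running a separate existence query - the check costs essentially nothing:

--

DECLARE @BatchSize INT = 10000;

WHILE 1=1
BEGIN
       DELETE TOP (@BatchSize) from SecurityAction
       WHERE ActionDate < '2017-09-14 00:00:00'
       AND
       (
       ActionCode != 102
       AND ActionCode != 103
       AND ActionCode != 129
       AND ActionCode != 130
       )

       /* a short final batch means nothing is left to delete */
       IF @@ROWCOUNT < @BatchSize BREAK;
END

--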

We ran the batched WHILE 1=1 DELETE as originally written above, and after a half day of running it had purged the table down from 14 million rows to 700,000 - and a run the next day of the unmodified "regular" DELETE statement completed in under a minute.

--

This seems like a very simple fix, and it runs contrary to some database teachings that say batches (and their cousins, the cursors) are evil and should be avoided no matter what.  Many blogs, classes, etc. preach the value of set-based operations at all times.

This is where we see the standard DBA trope once again:

https://memegenerator.net/img/instances/58097633/what-if-i-told-you-it-depends.jpg

Always use the right tool for the job - and in a case like this the right tool was a batch-based solution.  You could tune the size of the batches (is 10,000 the best number? maybe a 100,000-row batch - or a 1,000-row batch - would perform better), but regardless, a series of small batches was a better choice than a single 13 million row delete!

At the end of the day this seems very simple - but you would be amazed to see how often something like this comes up.

To minimize LOG space needed, take small bites!

http://www.nicoleconner.com.au/wp-content/uploads/2016/04/66247903.jpg

Hope this helps!

Toolbox - Why is My Database So Big????

I have written previously (here) about how to tell which database files have free space in them when your drive starts to fill.

What if all of your database files are full and you are still running out of space?

DRIVE | DATABASE NAME | FILENAME | FILETYPE | FILESIZE | SPACEFREE | PHYSICAL_NAME
F | DB03 | DB03_MDF | DATA | 35.47 GB | 225.63 MB | F:\MSSQL\Data\DB03_Data.mdf
F | DB02 | DB02Data | DATA | 110.25 MB | 92.38 MB | F:\MSSQL\Data\DB02.mdf
F | DB01 | DB01Data | DATA | 142.06 MB | 72.69 MB | F:\MSSQL\Data\DB01.mdf
F | DB05 | DB05_Data | DATA | 35.72 GB | 71.44 MB | F:\MSSQL\Data\DB05_Data.mdf
F | DB06 | DB06_MDF1 | DATA | 36.47 GB | 58.50 MB | F:\MSSQL\Data\DB06_Data.mdf
F | DB04 | DB04Data | DATA | 36.47 GB | 38.00 MB | F:\MSSQL\Data\DB04_Data.mdf

http://www.joemartinfitness.com/wp-content/uploads/2013/11/Post-holiday-bloat.jpg
The next step is to see what is taking up the space in the individual databases.  Maybe there's an audit table that never gets purged...or a Sales History table that could be archived away, if only there were an archive process...

https://s3.amazonaws.com/lowres.cartoonstock.com/business-commerce-work-workers-employee-employer-staff-ear0117_low.jpg

**ALWAYS CREATE AN ARCHIVE/PURGE PROCESS** #ThisIsNotAnItDepends

--

There is an easy query to return the space used and free in each table of your database - I found it in a StackOverflow answer at https://stackoverflow.com/questions/15896564/get-table-and-index-storage-size-in-sql-server and then modified it somewhat to make the result set cleaner (to me anyway):

--

/*
Object Sizes

Modified from http://stackoverflow.com/questions/15896564/get-table-and-index-storage-size-in-sql-server
*/

SELECT 
@@SERVERNAME as InstanceName
, DB_NAME() as DatabaseName
, ISNULL(s.name+'.'+t.NAME, '**TOTAL**')  AS TableName
, SUM(p.rows) AS RowCounts
--, SUM(a.total_pages) * 8 AS TotalSpaceKB
, SUM(a.total_pages) * 8/1024.0 AS TotalSpaceMB
, SUM(a.total_pages) * 8/1024.0/1024.0 AS TotalSpaceGB
, SUM(a.used_pages) * 8/1024.0 AS UsedSpaceMB
, (SUM(a.total_pages) - SUM(a.used_pages)) * 8/1024.0 AS UnusedSpaceMB
FROM sys.tables t
INNER JOIN sys.schemas s 
ON s.schema_id = t.schema_id
INNER JOIN sys.indexes i 
ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p 
ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a 
ON p.partition_id = a.container_id
WHERE t.NAME NOT LIKE 'dt%'    -- filter out system tables for diagramming
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255 
GROUP BY s.name+'.'+t.Name
WITH ROLLUP
ORDER BY TotalSpaceMB DESC

--

Running the query against an individual database will return all of the tables in the database and their sizes, as well as the total size of the database (from the ROLLUP):

--

InstanceName | DatabaseName | TableName | RowCounts | TotalSpaceMB | UsedSpaceMB | UnusedSpaceMB
Instance01 | Database99 | **TOTAL** | 148409590 | 36,042.063 | 32,890.922 | 3,151.141
Instance01 | Database99 | dbo.BusinessRulesAuditing | 20686280 | 17,877.813 | 17,860.938 | 16.875
Instance01 | Database99 | dbo.BusinessRulesRuleSet | 18840 | 3,895.766 | 877.945 | 3,017.820
Instance01 | Database99 | dbo.ModelWorkbookVersionExt | 4383 | 1,818.453 | 1,806.445 | 12.008
Instance01 | Database99 | dbo.EquityOutputVersions | 35040362 | 1,746.688 | 1,739.281 | 7.406
Instance01 | Database99 | dbo.NominalOutputs | 3592710 | 258.305 | 251.047 | 7.258
Instance01 | Database99 | dbo.Auditing | 173494 | 33.391 | 27.586 | 5.805
Instance01 | Database99 | dbo.EquityOverrides | 13105402 | 515.859 | 511.188 | 4.672
Instance01 | Database99 | dbo.ContingentOutputs | 1371465 | 364.641 | 361.719 | 2.922
Instance01 | Database99 | dbo.ContingentWSStaging | 292907 | 333.320 | 329.359 | 3.961
Instance01 | Database99 | dbo.TransformedAdminContingent | 585848 | 76.070 | 73.000 | 3.070

--

There are two different useful cases I normally find with this result set.

The first (originally highlighted in aqua - the dbo.BusinessRulesAuditing row) is the very large table.  In this sample my 35GB database has one large 17GB table.  This is a situation where you can investigate the table and see *why* it is so large.  In many cases this is just the way it is - sometimes one large "mother" table is a fact of life ("You take the good, you take the bad, you take them both, and there you have...").

https://memegenerator.net/img/instances/11140850/god-im-old.jpg

Often though you will find that this table is an anomaly.  As mentioned above, maybe there is a missing purge or archive process - with a very large table, look at the table definition to see if there are date/datetime columns, and then select the TOP 10 ordered by those columns one by one to see how old the oldest records are.  You may find that you are storing years of data when you only need months and can purge the oldest rows.  You may also find that even though you do need years of data, you may not need it live in production, which allows you to periodically archive it away to another database (maybe even another instance) where it can be accessed only when needed.
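A quick way to eyeball the age of the data looks like this (the table and column names here are hypothetical - substitute your own suspect table and its date/datetime column):

--

SELECT TOP (10) *
FROM dbo.SalesHistory          /* hypothetical large table */
ORDER BY OrderDate ASC;        /* oldest rows first - how old is the oldest data? */

--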

https://blog.parse.ly/wp-content/uploads/2015/05/say_big_data.jpg
--

The second case (originally highlighted in yellow - the dbo.BusinessRulesRuleSet row) is a table with significant free space contained in the table itself.  In this example a 3.7GB table has 3.0GB of free space in it!

How do you account for this?  There are a few ways - when you delete large numbers of rows from a table, the space usually isn't released until the next time the related indexes are rebuilt.  Another possibility is index fill factor - the percentage of each index page that is filled with data when the index is rebuilt, with the remainder left as free space.  I have run into several instances where a client DBA misunderstood the meaning of the fill factor and got it backwards, setting the fill factor to 10 thinking it would leave 10% free space when in fact it left 90% free space, resulting not only in exceedingly large indexes but also in very poor performance, as SQL Server needs to scan across many more pages to retrieve the same data.
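As a reminder that fill factor is "percent full" and not "percent free," leaving roughly 10% free space on the index pages looks like this (the table name is taken from the sample output above and the 90 is purely illustrative):

--

/* FILLFACTOR = 90 fills pages 90% full, leaving ~10% free space */
ALTER INDEX ALL ON dbo.BusinessRulesRuleSet REBUILD WITH (FILLFACTOR = 90);

--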

To help determine which index(es) may contribute to the problem, you can add the indexname to the query to break the data out that one more level:

--


/*
Object Sizes With Indexes

Modified from http://stackoverflow.com/questions/15896564/get-table-and-index-storage-size-in-sql-server
*/

SELECT 
@@SERVERNAME as InstanceName
, DB_NAME() as DatabaseName
, ISNULL(s.name+'.'+t.NAME, '**TOTAL**')  AS TableName
, ISNULL(i.Name, '**TOTAL**')  AS IndexName
, SUM(p.rows) AS RowCounts
--, SUM(a.total_pages) * 8 AS TotalSpaceKB
, SUM(a.total_pages) * 8/1024.0 AS TotalSpaceMB
, SUM(a.total_pages) * 8/1024.0/1024.0 AS TotalSpaceGB
, SUM(a.used_pages) * 8/1024.0 AS UsedSpaceMB
, (SUM(a.total_pages) - SUM(a.used_pages)) * 8/1024.0 AS UnusedSpaceMB
FROM sys.tables t
INNER JOIN sys.schemas s 
ON s.schema_id = t.schema_id
INNER JOIN sys.indexes i 
ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p 
ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a 
ON p.partition_id = a.container_id
WHERE t.NAME NOT LIKE 'dt%'    -- filter out system tables for diagramming
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255 
GROUP BY s.name+'.'+t.Name
, i.Name
WITH ROLLUP

--

InstanceName | DatabaseName | TableName | IndexName | RowCounts | TotalSpaceMB | UsedSpaceMB | UnusedSpaceMB
Instance01 | Database99 | **TOTAL** | **TOTAL** | 148409590 | 36,042.063 | 35,890.922 | 151.141
Instance01 | Database99 | BusinessRulesAuditing | PK__BusinessRules | 20686280 | 17,877.813 | 17,860.938 | 16.875
Instance01 | Database99 | BusinessRulesAuditing | **TOTAL** | 20686280 | 17,877.813 | 17,860.938 | 16.875
Instance01 | Database99 | BusinessRulesRuleSet | IX_RuleSet1 | 18840 | 2,100.000 | 137.000 | 1,963.000
Instance01 | Database99 | BusinessRulesRuleSet | IX_RuleSet2 | 18840 | 190.000 | 180.000 | 10.000
Instance01 | Database99 | BusinessRulesRuleSet | PK_RuleSet | 18840 | 1,600.000 | 560.000 | 1,040.000
Instance01 | Database99 | BusinessRulesRuleSet | **TOTAL** | 56520 | 3,895.766 | 877.945 | 3,017.820
Instance01 | Database99 | EquityOutputVersions | PK_EquityOutputVersions | 35040362 | 1,746.688 | 1,739.281 | 7.406
Instance01 | Database99 | EquityOutputVersions | **TOTAL** | 35040362 | 1,746.688 | 1,739.281 | 7.406
Instance01 | Database99 | EquityOverrides | PK_EquityOverrides | 13105402 | 515.859 | 511.188 | 4.672
Instance01 | Database99 | EquityOverrides | **TOTAL** | 13105402 | 515.859 | 511.188 | 4.672
Instance01 | Database99 | Auditing | PK_Auditing | 173494 | 33.391 | 27.586 | 5.805
Instance01 | Database99 | Auditing | **TOTAL** | 173494 | 33.391 | 27.586 | 5.805

<result set snipped for space>

--

In this case you could look at the fill factor of the IX_RuleSet1 and PK_RuleSet indexes to see why there is so much free space.  Failing that, it is possible that these indexes simply need to be rebuilt to release the space - for example after a large delete from the table emptied it out.
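Checking the configured fill factor is a quick query against sys.indexes (a sketch using the table from the sample output above):

--

SELECT i.name AS IndexName
, i.type_desc
, i.fill_factor          /* 0 means the server default (fill pages 100% full) */
FROM sys.indexes i
WHERE i.object_id = OBJECT_ID(N'dbo.BusinessRulesRuleSet');

--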


--

I find I use this query more often than I ever would have expected when working space issues - I start with the Database Free File query (from the previous post) and then move on to this query.

--

Hope this helps!

Toolbox - Which Clerk Is Busy?

A few years ago I wrote what would turn out to be my most-hit blog post (so far) titled "Error 701 - Insufficient System Memory - Now what?"

In that post I talk about a troubleshooting scenario that finally led to using DBCC MEMORYSTATUS to find one of the memory clerks (MEMORYCLERK_XE) consuming far too much memory on the instance.  (In that case there was a runaway XEvents session burning lots of RAM).

Bottom line - troubleshooting which clerk is the busiest leads you to the source of the overall problem.

https://dominicantoday.com/wp-content/uploads/2017/09/a-4.jpg
The 701 error (Error 701: “There is insufficient system memory to run this query”) came up again just this morning in our team Skype, and my colleague mentioned that the problem was happening at a time when no one wanted to be on the server watching.

I told him about my previous situation and how looking at DBCC MEMORYSTATUS had led to my smoking gun, and it led to thinking about some way to collect and persist the MEMORYSTATUS data without someone watching live on the system.

Google led me to several variations of a #YouCanDoAnythingWithPowershell script, such as this one from Microsoft Premier Field Engineer (PFE) Tim Chapman (blog/@chapmandew) but I really wanted a Transact-SQL script I could play with myself.

More Google finally led me to a blog post and T-SQL script from the SQL Server Photographer Slava Murygin (blog/@slavasql) (follow his blog for scripts and great photos he takes at all types of SQL events)!

Slava's script translates one of the Powershell scripts into T-SQL, storing the output from DBCC MEMORYSTATUS in a temporary table.  The one limitation for my needs is that it is an ad hoc execution - great to run interactively, but the data isn't persisted over time.

It took a few steps to turn Slava's temp table implementation into one that stores the data in a permanent table so that it can be queried after the fact and over time.

A few notes: Slava's script relies on xp_cmdshell, and my modification still does.  My modification stores data in a permanent table, which means it needs to reside in a permanent database.  My script uses a "DBADatabase" database, including code to create it if it doesn't exist, but it is an easy find-and-replace to change that name if you'd like:

--

/*

Track DBCC MEMORYSTATUS data over time

Guts of query to parse DBCC results into a temp table from 
http://slavasql.blogspot.com/2016/08/parsing-dbcc-memorystatus-without-using.html

Modified to store those results in a permanent table over time
Intended for use in a scheduled Agent Job but could be run manually as needed

Also modified to run on SQL 2005 (syntax changes)

Relies on xp_cmdshell

Stores data in the "DBADatabase" including creating the database if it doesn't pre-exist

If you wish to use an existing database or a different database 
name simply Find-Replace for the string DBADatabase

*/

SET NOCOUNT ON
GO

IF DB_ID('DBADatabase') IS NULL  /* Check if DBADatabase database exists - if not, create it */
BEGIN
EXECUTE ('CREATE DATABASE DBADatabase')

    ALTER DATABASE DBADatabase SET RECOVERY SIMPLE;

    ALTER AUTHORIZATION ON DATABASE::DBADatabase TO sa;

/* Read the current SQL Server default backup location */  
DECLARE @BackupDirectory NVARCHAR(100)   
EXEC master..xp_instance_regread @rootkey = 'HKEY_LOCAL_MACHINE',  
@key = 'Software\Microsoft\MSSQLServer\MSSQLServer',  
    @value_name = 'BackupDirectory', @BackupDirectory = @BackupDirectory OUTPUT ;  
EXECUTE ('BACKUP DATABASE DBADatabase to DISK = '''+@BackupDirectory+'\DBADatabase.bak'' WITH INIT')
PRINT 'CREATED DATABASE'
    RAISERROR('Ensure that you add the DBADatabase database to backup / maintenance jobs/plans', 10, 1) WITH NOWAIT;
END;
GO

/* If Holding Table doesn't exist, create it */
IF OBJECT_ID('DBADatabase.dbo.DBCCMemoryStatus') IS NULL
CREATE TABLE [DBADatabase].[dbo].[DBCCMemoryStatus](
[Datestamp] [datetime] NOT NULL,
[DataSet] [varchar](100) NULL,
[Measure] [varchar](20) NULL,
[Counter] [varchar](100) NULL,
[Value] [money] NULL
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

IF OBJECT_ID('tempdb..#tbl_MemoryStatusDump') IS NOT NULL
DROP TABLE #tbl_MemoryStatusDump;
GO
IF OBJECT_ID('tempdb..#tbl_MemoryStatus') IS NOT NULL
DROP TABLE #tbl_MemoryStatus;
GO
CREATE TABLE #tbl_MemoryStatusDump(
 ID INT IDENTITY(1,1) PRIMARY KEY
 , [Dump] VARCHAR(100));
GO
CREATE TABLE #tbl_MemoryStatus(
 ID INT IDENTITY(1,1), 
 [DataSet] VARCHAR(100), 
 [Measure] VARCHAR(20), 
 [Counter] VARCHAR(100), 
 [Value] MONEY);
GO
INSERT INTO #tbl_MemoryStatusDump([Dump])
EXEC ('xp_cmdshell ''sqlcmd -E -S localhost -Q "DBCC MEMORYSTATUS"''');
GO
DECLARE @f BIT
 , @i SMALLINT
 , @m SMALLINT 
 , @CurSet VARCHAR(100)
 , @CurMeasure VARCHAR(20)
 , @Divider TINYINT
 , @CurCounter VARCHAR(100)
 , @CurValue VARCHAR(20);

SET @f=1
SET @m = (SELECT MAX(ID) FROM #tbl_MemoryStatusDump)
set @i = 1

WHILE @i < @m
BEGIN
 SELECT @Divider = PATINDEX('% %',REVERSE(RTRIM([Dump])))
  , @CurCounter = LEFT([Dump], LEN([Dump]) - @Divider)
  , @CurValue = RIGHT(RTRIM([Dump]), @Divider - 1)
 FROM #tbl_MemoryStatusDump WHERE ID = @i;

 IF @f = 1 
  SELECT @CurSet = @CurCounter, @CurMeasure = @CurValue, @f = 0 
  FROM #tbl_MemoryStatusDump WHERE ID = @i;
 ELSE IF LEFT(@CurCounter,1) = '(' SET @f = 1;
 ELSE IF @CurCounter != 'NULL' and LEFT(@CurCounter,1) != '-'
  INSERT INTO #tbl_MemoryStatus([DataSet], [Measure], [Counter], [Value])
  SELECT @CurSet, @CurMeasure, @CurCounter, CAST(@CurValue as MONEY)
  FROM #tbl_MemoryStatusDump WHERE ID = @i;
 SET @i = @i + 1;
END
GO

/*Send data from temp table to permanent table */
INSERT INTO DBADatabase.dbo.DBCCMemoryStatus
SELECT 
GETDATE() as Datestamp
, DataSet
, Measure
, Counter
, Value 
FROM #tbl_MemoryStatus

/* Purge rows older than 96 hours to manage table size */
DELETE FROM DBADatabase.dbo.DBCCMemoryStatus
WHERE DATEDIFF(hh,DateStamp, GETDATE())>96

/*
SELECT *
FROM DBADatabase.dbo.DBCCMemoryStatus
WHERE counter = 'VM Reserved'
ORDER BY DateStamp DESC
*/


--

Running this statement interactively doesn't return any data - it just loads the data into DBADatabase.dbo.DBCCMemoryStatus.  Running the commented-out SELECT at the bottom of the script as written will query that table for all rows for the 'VM Reserved' (virtual memory reserved) counter, but there is much more data available if you modify the SELECT.

This query can be dropped into a SQL Agent job step as is and it will run - just like the interactive run it will create the database and permanent table if they don't exist and then store those nuggets of data into the permanent table for later use - you never know when you may need them!
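Once the job has been collecting for a while, finding the busiest clerk is a simple query against the permanent table - for example (a sketch; DataSet holds the clerk/section name from the DBCC MEMORYSTATUS output):

--

/* top memory consumers at the most recent collection */
SELECT TOP (10)
  d.Datestamp
, d.DataSet
, d.Counter
, d.Value
FROM DBADatabase.dbo.DBCCMemoryStatus d
WHERE d.Datestamp = (SELECT MAX(Datestamp) FROM DBADatabase.dbo.DBCCMemoryStatus)
AND d.Counter = 'VM Reserved'
ORDER BY d.Value DESC;

--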

https://memegenerator.net/img/instances/400x/57081575/winter-is-coming.jpg

--

Hope this helps!


Error 33206 - SQL Server Audit Failed to Create the Audit File

One of the joys of working with SQL Server is the descriptive and meaningful error messages, right?

https://vignette.wikia.nocookie.net/batmantheanimatedseries/images/a/a7/DiD_39_-_Joker_Vision.jpg/revision/latest?cb=20160610021326

Exhibit A - the 33206:
Error: 33206, Severity: 17, State: 1.
SQL Server Audit failed to create the audit file 'T:\MSSQL\Audit Files\PLAND_Objects_DML_Audit_Events_A2B00B57-4B43-4570-93C8-B30EE77CC8C9_0_131645629182310000.sqlaudit'. Make sure that the disk is not full and that the SQL Server service account has the required permissions to create and write to the file.
At first glance, this error tells us one of two things is wrong - the disk is full or the SQL service account doesn't have the required permissions to create and write to a file at the given location.

Simple to troubleshoot, right?

Except when the service account already has Full Control to each folder and sub-folder in the path...


(Note that it is important to check every folder along the path - I have seen situations where the account has permissions to the direct sub-folder (in this example T:\MSSQL\Audit Files) or even the direct sub-folder and the root of the drive (T:) and yet a security check fails due to the intermediate folder not having the correct permissions.  To me it shouldn't work this way, but sometimes it does...)

What about the drive being full?


Maybe not.

Having exhausted the obvious paths (at least obvious from the error text), I went back to my default next step:

https://memegenerator.net/img/instances/52632802/i-see-your-google-fu-is-as-strong-as-mine.jpg

A base search for a "SQL Error 33206" actually brought back a bunch of results about Error 33204 - "SQL Server Audit could not write to the security log" (FYI - the common fix for a 33204 is to grant the SQL service account rights to a related registry key as described here)

A better Google contains a piece of the error message: 
SQL 33206 "SQL Server Audit failed to create the audit file"
The second result in the resultset for this search is a recent blog post from someone I know to be a reliable source, Microsoft Certified Master Jason Brimhall (blog/@sqlrnnr).  

In his post Jason describes the same process I describe above - eliminate the obvious - but then he shows what the real problem was in my situation - the file settings in the SQL Server Audit configuration:


When I checked the T:\MSSQL\Audit Files folder, sure enough, there were fifteen audit files reaching back over thirteen months' worth of service restarts.

To mitigate the problem I deleted the oldest of the fifteen files, and the Audit resumed.

WOOHOO!

The real fix for this situation is to configure the files as "rollover" files - setting the Audit File Maximum Limit to fifteen "Maximum Rollover Files" instead of fifteen "Maximum Files" would allow the audit to overwrite the oldest file once it reaches the configured max, rather than crashing the audit as happened here.

Realize you can only do this if you can handle the oldest files being overwritten; if you have to persist the files you need to create a separate archiving process to handle that.
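The change itself is straightforward T-SQL (a sketch - the audit name here is inferred from the file name in the error message and may differ in your environment; the audit must be stopped before its file options can be changed):

--

ALTER SERVER AUDIT [PLAND_Objects_DML_Audit_Events] WITH (STATE = OFF);

ALTER SERVER AUDIT [PLAND_Objects_DML_Audit_Events]
TO FILE (FILEPATH = N'T:\MSSQL\Audit Files\', MAX_ROLLOVER_FILES = 15);

ALTER SERVER AUDIT [PLAND_Objects_DML_Audit_Events] WITH (STATE = ON);

--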

--

If nothing else, the real silver lining in my particular situation was that the creator of this audit at the client configured the audit to "Continue" rather than "Shut Down Server" (completely shut down Windows - not just stop SQL, but shut down the whole server) or "Fail Operation" (allow SQL to run but fail any operation that would meet the audit specification - not as catastrophic as "Shut Down Server" but still very impactful).

https://www.thesun.co.uk/wp-content/uploads/2016/11/nintchdbpict000273355923.jpg?w=960
There are definitely situations that call for "Shut Down Server" or "Fail Operation" - if your Audit is in place to satisfy a legal/regulatory/moral requirement, then definitely consider these options - but this is often not the case.

--

Hope this helps!


