This server has been running pretty well for the last 18 months and has had all the latest updates as of last week. But all of a sudden it's come down with something.
Last Thursday the SQL Server came to an almost frozen state. I could log onto the server but could not open a (local) session to the SQL services. CPU was OK and the server was responsive, so I was able to open Configuration Manager and restart the SQL services. Mind you, it took 3 minutes to stop the old services, and at one point near the end the CPU spiked for 10 seconds, but it did stop and start by itself.
When it came back I was able to open a session and look around. Two reindexing jobs that should have finished 4 hours earlier had been killed during the service restart. I could also see that the backup ran 1 hour longer than normal and produced a backup about 3x its normal size (probably because the reindex was running way past its usual schedule). But there was nothing eventful in the logs. A job had failed earlier in the morning with "failed with the following error: "Query timeout expired"", and there were a lot of connection timeouts due to the SQL services grinding to a halt.
Since then I have been seeing a lot of "Process ID XXXX was killed by hostname [ServerName], host process ID XXXX."
And this morning (Monday) we almost had a repeat. The backups failed, which alerted me to the issue. The SQL Server was still responsive and I could open a session no problem, but a lot of connections were failing and there were these errors in the agent logs:
- [298] SQLServer Error: 233, Shared Memory Provider: No process is on the other end of the pipe. [SQLSTATE 08S01] (LogToTableWrite)
- [298] SQLServer Error: 233, Communication link failure [SQLSTATE 08S01] (LogToTableWrite)
- [298] SQLServer Error: 10004, Communication link failure [SQLSTATE 08S01] (LogToTableWrite)
- [298] SQLServer Error: 16389, Communication link failure [SQLSTATE 08S01] (LogToTableWrite)
The server logs were just a bunch of "Process ID XXXX was killed by hostname [ServerName], host process ID XXXX."
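To see which agent errors dominate, the lines can be tallied by error code. This is only a rough Python sketch: the regular expression assumes the exact line shape quoted above, and `tally_agent_errors` is a hypothetical helper name, not part of any SQL Server tooling.

```python
import re
from collections import Counter

# Assumes the "[298] SQLServer Error: <code>, <message> [SQLSTATE <state>]" shape
# of the lines quoted above; other agent log formats would need a different pattern.
ERROR_RE = re.compile(r"SQLServer Error: (\d+), (.*?) \[SQLSTATE (\w+)\]")


def tally_agent_errors(lines):
    """Count (error code, SQLSTATE) pairs across agent log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            code, _message, state = match.groups()
            counts[(int(code), state)] += 1
    return counts


# The four agent log lines from this post.
sample = [
    "[298] SQLServer Error: 233, Shared Memory Provider: No process is on the "
    "other end of the pipe. [SQLSTATE 08S01] (LogToTableWrite)",
    "[298] SQLServer Error: 233, Communication link failure [SQLSTATE 08S01] (LogToTableWrite)",
    "[298] SQLServer Error: 10004, Communication link failure [SQLSTATE 08S01] (LogToTableWrite)",
    "[298] SQLServer Error: 16389, Communication link failure [SQLSTATE 08S01] (LogToTableWrite)",
]

for (code, state), n in sorted(tally_agent_errors(sample).items()):
    print(f"error {code} (SQLSTATE {state}): {n}")
```

Run against a full agent log, a count like this would show whether the link failures cluster around the backup window or are spread through the day.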
The error from the automated script was:
[AutoBackup]
Msg 3204, Level 16, State 1, Server NS623822, Line 1
The backup or restore was aborted.
Msg 3013, Level 16, State 1, Server NS623822, Line 1
BACKUP DATABASE is terminating abnormally.
[/AutoBackup]
I tried 2 manual backups. The second worked, but the first failed with this error:
[ManualBackup]
Msg 3204, Level 16, State 1, Line 1
The backup or restore was aborted.
Msg 3013, Level 16, State 1, Line 1
VERIFY DATABASE is terminating abnormally.
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the current command. The results, if any, should be discarded.
[/ManualBackup]
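The two outputs differ mainly in the last message: the manual attempt ends with a Level 20 error, and SQL Server documents severities of 20 and up as fatal errors that end the connection. A small Python sketch for pulling the severity out of pasted output like this follows. The pattern assumes the "Msg ..., Level ..., State ..." header format shown above, and `parse_messages` is a hypothetical helper, not an existing tool.

```python
import re

# Matches headers such as "Msg 3204, Level 16, State 1, Line 1"; the optional
# "Server <name>," part appears in the automated script's output.
MSG_RE = re.compile(
    r"Msg (\d+), Level (\d+), State (\d+),(?: Server (\S+),)? Line (\d+)"
)

FATAL_SEVERITY = 20  # severities of 20 and up are fatal to the connection


def parse_messages(text):
    """Return one dict per 'Msg' header, paired with the text line after it."""
    lines = text.splitlines()
    found = []
    for i, line in enumerate(lines):
        match = MSG_RE.match(line.strip())
        if match:
            body = lines[i + 1].strip() if i + 1 < len(lines) else ""
            found.append(
                {"msg": int(match.group(1)), "level": int(match.group(2)), "text": body}
            )
    return found


# The manual backup output from this post.
manual_backup = """\
Msg 3204, Level 16, State 1, Line 1
The backup or restore was aborted.
Msg 3013, Level 16, State 1, Line 1
VERIFY DATABASE is terminating abnormally.
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the current command. The results, if any, should be discarded.
"""

messages = parse_messages(manual_backup)
worst = max(m["level"] for m in messages)
print(f"{len(messages)} messages, highest severity {worst}")
if worst >= FATAL_SEVERITY:
    print("connection-level failure: the session itself was dropped")
```

Separating the Level 16 backup errors from the Level 20 one helps tell a failed backup command apart from a killed session, which lines up with the "Process ID XXXX was killed" entries in the server log.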
I'm scanning the disks now, but these are a bunch of RAID 5 drives and there is no sign of trouble in the RAID controller logs. I'm going to scan the RAM this evening, but as it is now I cannot tell where this is coming from.