Disk Health Email Alerts
Introduction
Disk health can be reported on drives which support S.M.A.R.T. monitoring. An email alert can be sent to warn of potential problems as they unfold so appropriate action can be taken to avert loss of data. As of version 1.6 a default Proxmox installation does not include this functionality, and additional packages must be installed and configured to make use of it.
Installation & Configuration
1. Install smartmontools:
First we update the package lists and install the package:
aptitude update && aptitude install smartmontools
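To confirm the install put the tools where the shell can find them, a small check like the following can help. This is only a sketch: check_commands is a made-up helper name, and it assumes the package provides the smartctl and smartd commands on root's PATH.

```shell
# Report whether each named command can be found on PATH.
# check_commands is a hypothetical helper, not part of smartmontools.
check_commands() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd: found"
        else
            echo "$cmd: missing"
        fi
    done
}

# Usage (as root): check_commands smartctl smartd
```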
2. Configure the daemon:
The smartmontools package includes a background process (a 'daemon', named smartd) that periodically checks on the disks. We need to tell it which disks to watch, so first we verify how our disks are named by issuing the command fdisk -l.
fdisk -l
This produces output that includes our logical volumes, which cannot be monitored. It looks similar to this:
# fdisk -l

Disk /dev/sda: 32.0 GB, 32000000000 bytes
255 heads, 63 sectors/track, 3890 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          66      524288   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              66        3890    30722105   8e  Linux LVM

WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.

Disk /dev/sdb: 2998.9 GB, 2998960914432 bytes
255 heads, 63 sectors/track, 364602 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1      267350  2147483647+  ee  EFI GPT

Disk /dev/dm-0: 3892 MB, 3892314112 bytes
255 heads, 63 sectors/track, 473 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Disk /dev/dm-0 doesn't contain a valid partition table

Disk /dev/dm-1: 7784 MB, 7784628224 bytes
255 heads, 63 sectors/track, 946 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Disk /dev/dm-1 doesn't contain a valid partition table

Disk /dev/dm-2: 15.8 GB, 15896412160 bytes
255 heads, 63 sectors/track, 1932 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Disk /dev/dm-2 doesn't contain a valid partition table
Our concern is with the disks on which the logical volumes reside, which have names like /dev/sda, /dev/sdb, /dev/hda, or /dev/hdb, not with the logical volumes themselves, which have names like /dev/dm-#. Those are device-mapper volumes that LVM layers on top of the physical disks, so they have no S.M.A.R.T. data of their own.
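Picking the physical disks out of a long fdisk listing by eye is error-prone, so a filter along these lines may help. It is a rough sketch: list_physical_disks is a made-up name, and it assumes the "Disk /dev/...:" header format shown in the output above.

```shell
# Print the disk names found in `fdisk -l` output read from stdin,
# skipping device-mapper (LVM) volumes such as /dev/dm-0.
# list_physical_disks is a hypothetical helper name.
list_physical_disks() {
    sed -n 's|^Disk \(/dev/[^:]*\):.*|\1|p' | grep -v '^/dev/dm-'
}

# Usage (as root): fdisk -l 2>/dev/null | list_physical_disks
```

For the listing above this would leave /dev/sda and /dev/sdb as the candidates to check with smartctl.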
Next we verify that our disks are manufactured with S.M.A.R.T. support, and that the capability is turned on. This is done by issuing the command smartctl -a /dev/sda, where sda is substituted with however your disk is named.
# smartctl -a /dev/sda
This produces output that looks like this:
# smartctl -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SSDSA2SH032G1GN INTEL
Serial Number:    CVEM001300D0032HGN
Firmware Version: 045C8860
User Capacity:    32,000,000,000 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Fri Nov 12 06:35:42 2010 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
These are the lines that answer our inquiry:
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Sometimes the capability is present in the disk but is not enabled. If this is the case, your output will include these lines instead:
SMART support is: Available - device has SMART capability.
SMART support is: Disabled
To change the Disabled to Enabled we issue this command:
# smartctl -s on -a /dev/sda
Again, substitute your own disk name for sda.
Running smartctl -a again, as before, verifies that the functionality has indeed been enabled on the disk.
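With several disks it can be handy to reduce the smartctl report to a one-word answer. The following is a sketch only: smart_status is a made-up helper, and it assumes the "SMART support is:" lines look like the ones shown above.

```shell
# Read smartctl output on stdin and print one of:
#   enabled, disabled, or unavailable (no Enabled/Disabled line was seen).
# smart_status is a hypothetical helper name.
smart_status() {
    awk '
        /^SMART support is:[ \t]*Enabled/  { state = "enabled" }
        /^SMART support is:[ \t]*Disabled/ { state = "disabled" }
        END {
            if (state == "") state = "unavailable"
            print state
        }
    '
}

# Usage (as root): smartctl -i /dev/sda | smart_status
```

The -i option prints only the information section, which is enough for this check.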
Now that we know our disks can answer with the requested information, we have to tell the daemon to query them at regular intervals. We do this by editing the file /etc/default/smartmontools.
# nano /etc/default/smartmontools
These are the lines before and after we change them:
This one defines which disks are to be monitored. We uncomment it and enter the names of our disks, as determined previously:
#enable_smart="/dev/hda /dev/hdb"
enable_smart="/dev/sda"
This one starts the daemon when the host machine boots. Just uncomment it to make that happen:
#start_smartd=yes
start_smartd=yes
This one defines the interval at which the disks are queried; values are in seconds. The default is fine, but I prefer checking less often than every half hour, so I change mine to every 3 hours (10800 seconds).
#smartd_opts="--interval=1800"
smartd_opts="--interval=10800"
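Since the interval is given in seconds, it is easy to slip a digit when converting from hours. A tiny helper like this one can generate the line; smartd_opts_line is a made-up name, and the only thing assumed is the smartd_opts="--interval=N" format shown above.

```shell
# Print a smartd_opts line for a check interval given in whole hours.
# smartd_opts_line is a hypothetical helper name.
smartd_opts_line() {
    printf 'smartd_opts="--interval=%d"\n' "$(( $1 * 3600 ))"
}

# Usage: smartd_opts_line 3
```

The printed line can then be pasted into /etc/default/smartmontools in place of the commented default.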
Now that our disks can be checked, we set up the daemon so that an alarm condition sends an email to let us know something is amiss. By default it keeps tabs on every disk it can find. In our case, with a lot of virtual disks that do not support S.M.A.R.T. monitoring, that causes unnecessary work for the daemon and wastes CPU cycles, so here we also explicitly define which disks to look at. We do this by editing the file /etc/smartd.conf.
# nano /etc/smartd.conf
These are the pertinent lines before and after we change them:
This line says to scan for and check everything. I comment it out to disable it, in favor of the more specific lines later.
DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
#DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
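If you would rather not make this change by hand, a sed filter can comment out the directive. This is a sketch under assumptions: disable_devicescan is a made-up name, and it assumes DEVICESCAN starts at the beginning of its line, as it does above.

```shell
# Read a smartd.conf on stdin and write it to stdout with any active
# DEVICESCAN line commented out. Already-commented lines are left alone.
# disable_devicescan is a hypothetical helper name.
disable_devicescan() {
    sed 's/^DEVICESCAN/#&/'
}

# Usage: disable_devicescan < /etc/smartd.conf > /tmp/smartd.conf.new
# Review /tmp/smartd.conf.new before copying it over the original.
```

Writing to a separate file first keeps the original intact until the result has been checked.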
This one tells the daemon to reach the disk through SCSI-to-ATA Translation (SAT). SATA disks that show up as /dev/sdX are driven through the kernel's SCSI layer, which is most likely why I had to set this; whether you will have to depends on your controller and smartmontools version. Just uncomment to enable.
# /dev/sda -a -d sat
/dev/sda -a -d sat
Uncomment this line that defines which of the SCSI disks get extended testing and when. Mine's a SATA disk, so I changed the identifier from scsi to sat. The day-of-week field in the schedule runs from 1 (Monday) to 7 (Sunday), so the two default lines test on the 3rd day (Wednesday) at 18:00 hours (6pm) and the 7th day (Sunday) at 01:00 hours (1am). I replace them with a single test on the 2nd day (Tuesday) at 21:00 hours (9pm); use 1 in that field for Monday instead. The -m flag is added to specify the email address to which warnings or errors will be sent, which will be root. Proxmox is configured to forward root's email to the address specified during installation.
#/dev/sda -d scsi -s L/../../3/18
#/dev/sdb -d scsi -s L/../../7/01
/dev/sda -d sat -s L/../../2/21 -m root
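The argument to -s has the form T/MM/DD/d/HH: test type, month, day of month, day of week (1 for Monday through 7 for Sunday, per the smartd.conf man page) and hour. To double-check a schedule before saving it, something like the sketch below can spell it out. describe_schedule is a made-up helper; it only understands literal values in the type, day-of-week and hour fields, not arbitrary regular expressions, and only the L and S test types.

```shell
# Describe a smartd -s schedule of the form T/MM/DD/d/HH in plain words.
# describe_schedule is a hypothetical helper; the month and day-of-month
# fields are read but not interpreted.
describe_schedule() {
    IFS=/ read -r type month day dow hour <<EOF
$1
EOF
    case $type in
        L) test_name="long self-test" ;;
        S) test_name="short self-test" ;;
        *) test_name="test type $type" ;;
    esac
    case $dow in
        1) day_name=Monday ;;
        2) day_name=Tuesday ;;
        3) day_name=Wednesday ;;
        4) day_name=Thursday ;;
        5) day_name=Friday ;;
        6) day_name=Saturday ;;
        7) day_name=Sunday ;;
        *) day_name="day pattern $dow" ;;
    esac
    echo "$test_name on $day_name at ${hour}:00"
}

# Usage: describe_schedule 'L/../../2/21'
```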
Then for added peace of mind add a line to have a test email sent at startup.
/dev/sda -m root -M test
Of course there are many more ways to do this, as detailed in both the config file and the man page. This config is done, so we save the file.
Now start the daemon and see if it works. This is done by issuing this command:
# /etc/init.d/smartmontools start
There should be an email waiting to be read, and we can look in the log to see details with this command:
# tail -n50 /var/log/syslog
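On a busy host the last 50 lines of syslog may contain little from the daemon. A filter like this sketch narrows the view; smartd_log_lines is a made-up name, and it assumes the daemon logs under the tag smartd[PID], which is the usual syslog convention.

```shell
# Keep only the syslog lines written by smartd (tagged "smartd[PID]").
# smartd_log_lines is a hypothetical helper name.
smartd_log_lines() {
    grep 'smartd\['
}

# Usage (as root): smartd_log_lines < /var/log/syslog | tail -n 50
```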
External links with regard to S.M.A.R.T.
Smartmontools website on Sourceforge: https://sourceforge.net/apps/trac/smartmontools/
S.M.A.R.T. on Wikipedia: http://wikipedia.org/wiki/S.M.A.R.T.