Disk Health Email Alerts


Introduction

Disk health can be reported on drives which support S.M.A.R.T. monitoring. An email alert can be sent to warn of potential problems as they unfold so appropriate action can be taken to avert loss of data. As of version 1.6 a default Proxmox installation does not include this functionality, and additional packages must be installed and configured to make use of it.

Installation & Configuration

1. Install smartmontools:

First we update the package lists and install the package:

aptitude update && aptitude install smartmontools 

2. Configure the daemon:

The smartmontools package includes a background process (a 'daemon', named smartd) that periodically checks on the disks. We need to tell it which disks to watch, so first we verify how our disks are named. This is done by issuing the command fdisk -l.

 fdisk -l 

This produces output that includes our logical volumes, which cannot be monitored. It looks similar to this:

# fdisk -l

Disk /dev/sda: 32.0 GB, 32000000000 bytes
255 heads, 63 sectors/track, 3890 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          66      524288   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              66        3890    30722105   8e  Linux LVM

WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdb: 2998.9 GB, 2998960914432 bytes
255 heads, 63 sectors/track, 364602 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1      267350  2147483647+  ee  EFI GPT

Disk /dev/dm-0: 3892 MB, 3892314112 bytes
255 heads, 63 sectors/track, 473 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Disk /dev/dm-0 doesn't contain a valid partition table

Disk /dev/dm-1: 7784 MB, 7784628224 bytes
255 heads, 63 sectors/track, 946 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Disk /dev/dm-1 doesn't contain a valid partition table

Disk /dev/dm-2: 15.8 GB, 15896412160 bytes
255 heads, 63 sectors/track, 1932 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Disk /dev/dm-2 doesn't contain a valid partition table


Our concern is with the physical disks on which the logical volumes reside. These have names like /dev/sda, /dev/sdb, /dev/hda, or /dev/hdb. The logical volumes themselves, which have names like /dev/dm-#, are not what we monitor. These logical volumes belong to our VMs, not PVE.
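
As a rough sketch of the selection described above, the following shell function reads fdisk -l output on standard input and prints only the disk names made up of letters, such as /dev/sda, skipping device-mapper entries such as /dev/dm-0. The function name list_physical_disks is a hypothetical helper, not part of Proxmox or smartmontools, and it assumes the "Disk /dev/...:" line format shown in the sample output.

```shell
# Hypothetical helper: list physical disk names found in `fdisk -l` output.
# Assumes header lines of the form "Disk /dev/sda: 32.0 GB, ..." as in the sample.
list_physical_disks() {
    # [a-z]* matches names such as sda or hdb; "dm-0" contains "-0" before
    # the colon, so device-mapper volumes do not match and are skipped.
    sed -n 's|^Disk \(/dev/[a-z]*\):.*|\1|p'
}
```

It could be used as fdisk -l | list_physical_disks. For the sample output above this would print /dev/sda and /dev/sdb.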


Next we verify that our disks were manufactured with S.M.A.R.T. support, and that the capability is turned on. This is done by issuing the command smartctl -a /dev/sda, replacing sda with the name of your disk.

# smartctl -a /dev/sda

This produces output that looks like this:

# smartctl -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SSDSA2SH032G1GN INTEL
Serial Number:    CVEM001300D0032HGN
Firmware Version: 045C8860
User Capacity:    32,000,000,000 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Fri Nov 12 06:35:42 2010 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


These are the lines that answer our inquiry:

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Sometimes the capability is present in the disk but is not enabled. If this is the case, your output will include these lines instead:

SMART support is: Available - device has SMART capability.
SMART support is: Disabled

To switch S.M.A.R.T. support from Disabled to Enabled we issue this command:

# smartctl -s on -a /dev/sda

Again, replace sda with the name of your disk.

Running the smartctl -a command again, as we did previously, verifies that the functionality has indeed been enabled on the disk.
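
The check on those two lines can also be scripted. The sketch below is a hypothetical helper (the name smart_state is an assumption, not a smartmontools command) that reads smartctl output on standard input and reports whether the "SMART support is:" line says Enabled or Disabled. It assumes the line wording shown in the sample output above.

```shell
# Hypothetical helper: summarize the "SMART support is:" lines of smartctl output.
# Prints "enabled", "disabled", or "unknown" if neither line is present.
smart_state() {
    awk '
        /^SMART support is:[ \t]*Enabled/  { state = "enabled" }
        /^SMART support is:[ \t]*Disabled/ { state = "disabled" }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

It could be used as smartctl -a /dev/sda | smart_state, again replacing sda with the name of your disk.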


Now that we know our disks can answer with the requested information, we have to tell the daemon to ask them how they are at regular intervals. We do this by editing the file /etc/default/smartmontools.

# nano /etc/default/smartmontools

These are the lines before and after we change them:

This one defines which disks are to be monitored. We uncomment it and enter the names of our disks, as we determined previously:

#enable_smart="/dev/hda /dev/hdb"
enable_smart="/dev/sda"

This one starts the daemon when the host machine boots. Just uncomment it to make that happen:

#start_smartd=yes
start_smartd=yes

This one defines the interval at which the disks are queried, in seconds. The default is fine, but I prefer less frequent checks than every half hour, so I change mine to every 3 hours (10800 seconds).

#smartd_opts="--interval=1800"
smartd_opts="--interval=10800"
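
The three edits above can be sketched as a single sed filter, for anyone who prefers not to make them by hand. The function name edit_smartmontools_defaults is a hypothetical helper, and it assumes the commented lines in your file start exactly as shown above. It reads the file on standard input and writes the edited copy to standard output, so the original is never modified in place. The interval is computed from a number of hours, since the option expects seconds (3 hours x 3600 = 10800).

```shell
# Hypothetical helper: apply the three edits to /etc/default/smartmontools.
# Arguments: list of disks (e.g. "/dev/sda"), check interval in hours.
# Reads the file on stdin and prints the edited copy on stdout.
edit_smartmontools_defaults() {
    disks=$1
    hours=$2
    interval=$(( hours * 3600 ))   # smartd expects the interval in seconds
    sed -e "s|^#enable_smart=.*|enable_smart=\"$disks\"|" \
        -e "s|^#start_smartd=.*|start_smartd=yes|" \
        -e "s|^#smartd_opts=.*|smartd_opts=\"--interval=$interval\"|"
}
```

For example, edit_smartmontools_defaults "/dev/sda" 3 < /etc/default/smartmontools > /tmp/smartmontools.new writes a candidate file that can be reviewed before it replaces the original.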


Our disks can now be checked up on. Next we set up the daemon so that an alarm condition sends an email to let us know something's amiss. By default it will keep tabs on every disk it can find, which in our case, with a lot of virtual disks that do not support S.M.A.R.T. monitoring, can cause a lot of unnecessary work for the daemon and waste precious CPU cycles. So here we also explicitly define which disks to look at. We do this by editing the file /etc/smartd.conf.

# nano /etc/smartd.conf

These are the pertinent lines before and after we change them:

This line says to scan for and check everything. I comment it out to disable it, in favor of other lines later.

DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
#DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

This one tells the daemon that the disk is an ATA device reached through a SCSI-to-ATA Translation (SAT) layer, which is how SATA disks commonly show up under /dev/sd* names on Linux. I had to set this; you may not have to if your version of smartmontools detects the device type on its own. Just uncomment to enable.

# /dev/sda -a -d sat
/dev/sda -a -d sat

Uncomment this line that defines which of the SCSI disks would get extended testing and when. Mine's a SATA disk, so I changed the identifier. Here I also change the schedule of the disk check from the default 3rd day (Wednesday) at 18:00 hours (6pm) and 7th day (Sunday) at 01:00 hours (1am), to the 2nd day (Tuesday) at 21:00 hours (9pm). Note that smartd numbers the days of the week from 1 (Monday) to 7 (Sunday). The -m flag is added to specify the email address to which warnings or errors will be sent, which will be root. Proxmox is configured to forward root's email to the address specified during installation.

#/dev/sda -d scsi -s L/../../3/18
#/dev/sdb -d scsi -s L/../../7/01
/dev/sda -d sat -s L/../../2/21 -m root
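
The -s argument is a pattern of the form T/MM/DD/d/HH: test type (L for a long, i.e. extended, self-test), month, day of month, day of week (1 = Monday to 7 = Sunday), and hour. Two dots leave a field unconstrained. As a minimal sketch of how such a line is put together, the hypothetical helper below (smartd_long_test_line is an assumed name, and the -d sat and -m root parts simply mirror the line above) prints a weekly schedule line for a given disk, day, and hour.

```shell
# Hypothetical helper: print a smartd.conf line that schedules a weekly long self-test.
# Arguments: device, day of week (1 = Monday ... 7 = Sunday), hour (0-23, no leading zero).
smartd_long_test_line() {
    # ".." in the month and day-of-month fields matches any value,
    # so only the day of week and the hour restrict when the test runs.
    printf '%s -d sat -s L/../../%d/%02d -m root\n' "$1" "$2" "$3"
}
```

Calling smartd_long_test_line /dev/sda 2 21 prints the same line used above, /dev/sda -d sat -s L/../../2/21 -m root.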

Then, for added peace of mind, add a line to have a test email sent at startup.

/dev/sda -m root -M test

Of course there are many more ways to do this, as detailed in both the config file and the man page; for example, all of the directives for a disk can be combined on a single line. With the config done, save the file.

Now start the daemon and see if it works. This is done by issuing this command:

# /etc/init.d/smartmontools start

There should be an email waiting to be read, and we can look in the log to see details with this command:

# tail -n50 /var/log/syslog
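
On a busy host the last 50 lines of the syslog may be mostly unrelated messages. As a small sketch, the hypothetical helper below (smartd_log_lines is an assumed name) keeps only the lines written by the daemon, assuming the usual syslog tag format "smartd[PID]:".

```shell
# Hypothetical helper: keep only syslog lines written by smartd.
# Assumes entries are tagged in the usual "smartd[PID]:" form.
smartd_log_lines() {
    grep 'smartd\[[0-9]*]:'
}
```

It could be used as tail -n50 /var/log/syslog | smartd_log_lines.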

External links with regard to S.M.A.R.T.

Smartmontools website on Sourceforge: https://sourceforge.net/apps/trac/smartmontools/

S.M.A.R.T. on Wikipedia: http://wikipedia.org/wiki/S.M.A.R.T.