[PVE-User] Buffer I/O error on device dm-3 / nfs lost / Cluster crashed

Martin Schuchmann ms at city-pc.de
Wed Aug 14 11:55:00 CEST 2013


Hi there,

After upgrading three hosts to the current stable 3.0-23 with kernel 
2.6.32-22, one host shows a recurring error in kern.log:

Aug 14 07:00:05 promo2 kernel: EXT3-fs: barriers disabled
Aug 14 07:00:05 promo2 kernel: kjournald starting.  Commit interval 5 seconds
Aug 14 07:00:05 promo2 kernel: EXT3-fs (dm-3): using internal journal
Aug 14 07:00:05 promo2 kernel: ext3_orphan_cleanup: deleting unreferenced inode 92192952
Aug 14 07:00:05 promo2 kernel: ext3_orphan_cleanup: deleting unreferenced inode 92037342
Aug 14 07:00:05 promo2 kernel: EXT3-fs (dm-3): 2 orphan inodes deleted
Aug 14 07:00:05 promo2 kernel: EXT3-fs (dm-3): recovery complete
Aug 14 07:00:05 promo2 kernel: EXT3-fs (dm-3): mounted filesystem with ordered data mode
Aug 14 07:03:46 promo2 kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
Aug 14 07:03:47 promo2 kernel: Aborting journal on device dm-3.
Aug 14 07:03:47 promo2 kernel: Buffer I/O error on device dm-3, logical block 346882562
Aug 14 07:03:47 promo2 kernel: EXT3-fs (dm-3): error: ext3_journal_start_sb: Detected aborted journal
Aug 14 07:03:47 promo2 kernel: EXT3-fs (dm-3): error: remounting filesystem read-only
Aug 14 07:03:47 promo2 kernel: lost page write due to I/O error on dm-3
Aug 14 07:03:47 promo2 kernel: JBD: I/O error detected when updating journal superblock for dm-3.
Aug 14 07:03:49 promo2 kernel: Buffer I/O error on device dm-3, logical block 535724034
Aug 14 07:03:49 promo2 kernel: lost page write due to I/O error on dm-3
Aug 14 07:03:49 promo2 kernel: Buffer I/O error on device dm-3, logical block 535724035
Aug 14 07:03:49 promo2 kernel: lost page write due to I/O error on dm-3
Aug 14 07:03:49 promo2 kernel: Buffer I/O error on device dm-3, logical block 536215554
Aug 14 07:03:49 promo2 kernel: lost page write due to I/O error on dm-3
Aug 14 07:03:49 promo2 kernel: Buffer I/O error on device dm-3, logical block 556138498
Aug 14 07:03:49 promo2 kernel: lost page write due to I/O error on dm-3
Aug 14 07:03:49 promo2 kernel: Buffer I/O error on device dm-3, logical block 564822018
Aug 14 07:03:49 promo2 kernel: lost page write due to I/O error on dm-3
Aug 14 07:03:49 promo2 kernel: Buffer I/O error on device dm-3, logical block 564822019
Aug 14 07:03:49 promo2 kernel: lost page write due to I/O error on dm-3
Aug 14 07:03:49 promo2 kernel: Buffer I/O error on device dm-3, logical block 571408386
Aug 14 07:03:49 promo2 kernel: lost page write due to I/O error on dm-3
Aug 14 07:03:49 promo2 kernel: Buffer I/O error on device dm-3, logical block 573800450
Aug 14 07:03:49 promo2 kernel: lost page write due to I/O error on dm-3
Aug 14 07:04:36 promo2 kernel: EXT3-fs (dm-3): error: ext3_put_super: Couldn't clean up the journal
Aug 14 07:07:37 promo2 kernel: ct0 nfs: server 10.1.0.2 not responding, still trying
Aug 14 07:07:37 promo2 kernel: ct0 nfs: server 10.1.0.2 not responding, still trying
Aug 14 07:07:37 promo2 kernel: ct0 nfs: server 10.1.0.2 not responding, still trying
... (continues until hard reboot)


The hardware (HP DL360 G7 with P410 controller) doesn't show any errors 
in its own log (iLO interface).
The problem recurs every 12 hours, at 07:00 and 19:00.
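
To check whether the errors really line up with the 07:00 and 19:00 slots, the messages in kern.log can be tallied per hour. The following is only a minimal sketch: the function name `summarize` and the labels in `PATTERNS` are made up for illustration, and it assumes that each kernel message sits on a single line in the actual log file.

```python
import re
from collections import Counter

# Syslog-style line: "Aug 14 07:03:47 promo2 kernel: <message>"
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}"
    r"\s+\S+\s+kernel:\s+(?P<msg>.*)$"
)

# Hypothetical labels mapped to substrings taken from the log excerpt.
PATTERNS = {
    "snapshot_invalidated": "Invalidating snapshot",
    "journal_aborted": "Aborting journal",
    "buffer_io_error": "Buffer I/O error",
}


def summarize(lines):
    """Count matching kernel messages per (month, day, hour, label)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # skip lines that are not kernel messages
        for label, needle in PATTERNS.items():
            if needle in match.group("msg"):
                key = (
                    match.group("month"),
                    int(match.group("day")),
                    int(match.group("hour")),
                    label,
                )
                counts[key] += 1
    return counts
```

It could be used roughly as `with open("/var/log/kern.log") as f: print(summarize(f))`, where the path is an assumption about where kern.log lives on the node. If the counts cluster in hours 7 and 19 only, that would support the 12-hour pattern.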

Today at 07:00, after the error in kern.log, the NFS daemon also stopped 
working.
All machines on the first and second nodes became unresponsive from that 
point on and were no longer reachable from outside, although they were 
still running. The local sshd on the Proxmox nodes still worked, but we 
could not reboot the nodes because we could not stop the VMs. Only 
"echo b > /proc/sysrq-trigger" helped.

The third node was not affected, even though it is also connected to the 
missing shared storage of node 2.

We share each node's local storage via NFS to all nodes in the cluster, 
but machines run only on local storage.

Are there any hints on what to do?
Should we go back to the older kernel?
Are there any known driver problems with the P410i controller and the new kernel?


Thank you!

Martin.


