Description
We have been running IOR tests on hyperion with TOSS 5 and have seen ldiskfs corruption. Since I know you have access to hyperion, I was hoping you could log on and look around (the console logs are on hyperion577-pub, and SANtricity can be run from there as well).
I have set up a test filesystem, /p/ls1, on a NetApp using large LUNs (22TB per LUN, with 6 LUNs on each RBOD). The MDS is on hyperion-agb25 and the 2 OSS nodes are hyperion-agb27 and hyperion-agb28.
I had 10 clients writing to the filesystem and power cycled an OSS every hour to simulate a node crash. After bringing the OSS back up, I ran a full fsck to check for errors, started Lustre again, and continued the I/O load from the clients. We hit a bug where the fsck reports corruption and Lustre fails to mount. As a side note, I ran the same test in parallel on the same hardware, but with a small 3TB LUN size, and did not hit this issue.
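For anyone trying to reproduce this, one crash/recover cycle as described above can be written down as an ordered command plan. This is only a sketch, not the script that was actually used: the helper name crash_cycle_plan is hypothetical, and the specific commands (powerman's pm -c for the power cycle, e2fsck -f -p for the full fsck, mount -t lustre for the restart) are assumptions about the tooling. The function only builds the list and executes nothing.

```python
def crash_cycle_plan(oss, ost_mounts):
    """Return the ordered commands for one crash/recover cycle (nothing is run).

    oss        -- hostname of the OSS to power cycle
    ost_mounts -- list of (device, mountpoint) pairs for that OSS
    """
    # Assumption: power control goes through powerman ("pm -c" = cycle).
    plan = [f"pm -c {oss}"]
    # The steps below run on the OSS itself once it has booted again.
    # Assumption: "full fsck" means a forced check; -p (preen) is a guess.
    for device, _mountpoint in ost_mounts:
        plan.append(f"e2fsck -f -p {device}")
    # Bring Lustre back only after every target has been checked.
    for device, mountpoint in ost_mounts:
        plan.append(f"mount -t lustre {device} {mountpoint}")
    return plan


if __name__ == "__main__":
    # Device and mountpoint taken from one boot in the console log.
    targets = [("/dev/dm-3", "/mnt/lustre/local/ls1-OST0000")]
    for command in crash_cycle_plan("hyperion-agb27", targets):
        print(command)
```

The client I/O load keeps running throughout, and the cycle repeats hourly. Because the /dev/dm-N names are not stable across reboots in the console log, a real script would have to resolve the device for each OST after every boot.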
zgrep Mounting ../conman.old/console.hyperion-agb27-20120115.gz
2012-01-14 14:24:02 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
2012-01-14 14:24:04 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0001
2012-01-14 14:24:06 Mounting /dev/dm-4 on /mnt/lustre/local/ls1-OST0002
2012-01-14 15:21:57 Mounting local filesystems: [ OK ]
2012-01-14 15:22:03 Mounting other filesystems: [ OK ]
2012-01-14 15:24:13 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0000
2012-01-14 15:24:15 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0001
2012-01-14 15:24:17 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0002
2012-01-14 16:21:49 Mounting local filesystems: [ OK ]
2012-01-14 16:21:55 Mounting other filesystems: [ OK ]
2012-01-14 16:23:51 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
2012-01-14 16:23:53 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0001
2012-01-14 16:23:55 Mounting /dev/dm-5 on /mnt/lustre/local/ls1-OST0002
2012-01-14 17:21:53 Mounting local filesystems: [ OK ]
2012-01-14 17:21:59 Mounting other filesystems: [ OK ]
2012-01-14 18:22:00 Mounting local filesystems: [ OK ]
2012-01-14 18:22:06 Mounting other filesystems: [ OK ]
2012-01-14 19:21:56 Mounting local filesystems: [ OK ]
2012-01-14 19:22:02 Mounting other filesystems: [ OK ]
2012-01-14 20:21:52 Mounting local filesystems: [ OK ]
2012-01-14 20:21:58 Mounting other filesystems: [ OK ]
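The console excerpt above can be grouped by boot to see which reboots brought the OSTs back. The following is a minimal sketch under a few assumptions: each "Mounting local filesystems" line is treated as the start of a new boot, the helper name group_by_boot is hypothetical, and the sample text is copied from the zgrep output above.

```python
import re

# Console lines copied from the zgrep output above.
CONSOLE = """\
2012-01-14 14:24:02 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
2012-01-14 14:24:04 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0001
2012-01-14 14:24:06 Mounting /dev/dm-4 on /mnt/lustre/local/ls1-OST0002
2012-01-14 15:21:57 Mounting local filesystems: [ OK ]
2012-01-14 15:22:03 Mounting other filesystems: [ OK ]
2012-01-14 15:24:13 Mounting /dev/dm-0 on /mnt/lustre/local/ls1-OST0000
2012-01-14 15:24:15 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0001
2012-01-14 15:24:17 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0002
2012-01-14 16:21:49 Mounting local filesystems: [ OK ]
2012-01-14 16:21:55 Mounting other filesystems: [ OK ]
2012-01-14 16:23:51 Mounting /dev/dm-3 on /mnt/lustre/local/ls1-OST0000
2012-01-14 16:23:53 Mounting /dev/dm-1 on /mnt/lustre/local/ls1-OST0001
2012-01-14 16:23:55 Mounting /dev/dm-5 on /mnt/lustre/local/ls1-OST0002
2012-01-14 17:21:53 Mounting local filesystems: [ OK ]
2012-01-14 17:21:59 Mounting other filesystems: [ OK ]
2012-01-14 18:22:00 Mounting local filesystems: [ OK ]
2012-01-14 18:22:06 Mounting other filesystems: [ OK ]
2012-01-14 19:21:56 Mounting local filesystems: [ OK ]
2012-01-14 19:22:02 Mounting other filesystems: [ OK ]
2012-01-14 20:21:52 Mounting local filesystems: [ OK ]
2012-01-14 20:21:58 Mounting other filesystems: [ OK ]
"""

# Assumption: this message marks the start of a boot.
BOOT_MARKER = "Mounting local filesystems"
OST_MOUNT = re.compile(r"Mounting (/dev/\S+) on \S*/(ls1-OST[0-9a-fA-F]{4})$")


def group_by_boot(text):
    """Split console lines into boots and record the OST mounts of each one.

    Returns a list of {"start": timestamp or None, "osts": {ost: device}}.
    The first entry has start=None when the excerpt begins mid-boot.
    """
    boots = [{"start": None, "osts": {}}]
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 3:
            continue
        stamp, message = parts[0] + " " + parts[1], parts[2]
        if message.startswith(BOOT_MARKER):
            boots.append({"start": stamp, "osts": {}})
            continue
        match = OST_MOUNT.search(message)
        if match:
            boots[-1]["osts"][match.group(2)] = match.group(1)
    return boots


if __name__ == "__main__":
    for boot in group_by_boot(CONSOLE):
        status = boot["osts"] if boot["osts"] else "NO OST MOUNTS"
        print(boot["start"] or "(before first boot marker)", status)
```

Boots that list no OST mounts are the ones where Lustre did not come back. The same grouping also makes it easy to see that the dm-N device behind a given OST differs from boot to boot in this log.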
It appears that the corruption occurred on 1/14, some time after 16:20.
I base this on the fact that the OSTs did not come back after the power cycle on 1/14 at 17:20.
Also, the following fsck results only surfaced after that power cycle:
2012-01-14 17:23:48 Group descriptor 0 checksum is invalid. FIXED.
2012-01-14 17:23:48 Group descriptor 1 checksum is invalid. FIXED.
2012-01-14 17:23:48 Group descriptor 2 checksum is invalid. FIXED.