[LU-5822] health_check file not updating properly Created: 28/Oct/14 Updated: 24/Apr/15 Resolved: 04/Dec/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.5.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Matt Ezell | Assignee: | James Nunez (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 16321 | ||||||||
| Description |
|
Over the weekend we had an OST abort and get marked read-only:
[ 726.076561] LDISKFS-fs error (device dm-25): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 111692corrupted: 32768 blocks free in bitmap, 1024 - in gd
[ 726.116663]
[ 726.125085] Aborting journal on device dm-25-8.
[ 726.133359] LustreError: 17032:0:(ofd_obd.c:1095:ofd_destroy()) f1-OST00ff: error destroying object [0x100000000:0x16546e7:0x0]: 0
[ 726.176268] LDISKFS-fs (dm-25):
[ 726.179457] LDISKFS-fs error (device dm-25): ldiskfs_journal_start_sb: Detected aborted journal
[ 726.179459] LDISKFS-fs (dm-25): Remounting filesystem read-only
We rely on the /proc/fs/lustre/health_check file to notify us of these situations. Unfortunately, we never got a notification. I found a bug in the b2_5 implementation of the osd-ldiskfs osd_statfs() function. Code inspection leads me to believe it does not affect master, but I haven't tried it there. I will upload a patch momentarily. |
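For context, the kind of monitoring check that depends on this file can be sketched as follows. This is a minimal sketch, assuming a healthy node reports the single word `healthy` and anything else (for example a `NOT HEALTHY` line) means trouble. The helper names `health_text_is_healthy` and `check_health_file` are hypothetical and are not part of Lustre.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if the text read from health_check
 * reports a healthy node, 0 otherwise. Assumes a healthy node reports
 * exactly the word "healthy", optionally followed by whitespace. */
int health_text_is_healthy(const char *text)
{
	static const char ok[] = "healthy";
	size_t n = sizeof(ok) - 1;

	if (text == NULL || strncmp(text, ok, n) != 0)
		return 0;
	/* Anything other than trailing whitespace counts as extra status. */
	for (text += n; *text != '\0'; text++)
		if (*text != '\n' && *text != '\r' &&
		    *text != ' ' && *text != '\t')
			return 0;
	return 1;
}

/* Hypothetical wrapper: read the file at path and classify it.
 * Returns 1 if healthy, 0 if unhealthy, -1 if the file cannot be read. */
int check_health_file(const char *path)
{
	char buf[256];
	size_t len;
	FILE *fp = fopen(path, "r");

	if (fp == NULL)
		return -1;
	len = fread(buf, 1, sizeof(buf) - 1, fp);
	fclose(fp);
	buf[len] = '\0';
	return health_text_is_healthy(buf);
}
```

A monitor would call `check_health_file("/proc/fs/lustre/health_check")` periodically and alert on any result other than 1. With the bug described here, such a check keeps returning 1 even after the OST has been remounted read-only.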
| Comments |
| Comment by Matt Ezell [ 28/Oct/14 ] |
|
I'd also like to add a test to make sure this doesn't break in the future, but I'm not sure of the best way to make the device go read-only. I started with the read-only infrastructure that the test system uses, but that appears to set the underlying device to read-only, not the ldiskfs file system. I then tried to implement a server_remount_fs() function so you could do 'mount -o remount,ro', but that receives a Lustre superblock, so you would then need to call down into the osd-api to actually make it do anything to the underlying file system. Is there an easier way to cause the underlying file system to abort or go read-only? |
| Comment by Matt Ezell [ 28/Oct/14 ] |
| Comment by James Nunez (Inactive) [ 28/Oct/14 ] |
|
Matt, Thanks for the patch. I'll look into a test to check that health_check is being updated properly. James |
| Comment by Andreas Dilger [ 29/Oct/14 ] |
|
There was some work done in It would be easier to use the existing fault injection method for Lustre and add a new FAIL_LOC for this case, like:
if (sb->s_flags & MS_RDONLY ||
    (OBD_FAIL_CHECK(OBD_FAIL_OSD_READONLY) &&
     osd->od_jndex == libcfs_fail_val))
        osd->od_statfs.os_state = OS_STATE_READONLY;
or similar (this is just off the top of my head, so the syntax might not be quite correct). |
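The fault-injection idea above can be tried outside the kernel with a small userspace model. This is only a sketch under stated assumptions: every `sketch_`/`SKETCH_` name and every constant value below is a hypothetical stand-in, not a real Lustre or kernel definition. It shows only the control flow, in which a test arms a fail location and a target index so that statfs reports read-only without the file system actually being remounted.

```c
#include <stdbool.h>

/* Stand-ins for kernel/Lustre definitions; values are illustrative only. */
#define SKETCH_MS_RDONLY          0x1u
#define SKETCH_OS_STATE_READONLY  0x2u
#define SKETCH_FAIL_OSD_READONLY  0x9001u  /* hypothetical fail_loc id */

/* Globals a test would set to arm the fault (modelled on fail_loc/fail_val). */
unsigned int sketch_fail_loc;
unsigned int sketch_fail_val;

struct sketch_osd {
	unsigned int sb_flags;  /* superblock mount flags */
	unsigned int index;     /* target index compared against fail_val */
	unsigned int os_state;  /* state reported through statfs */
};

/* True when the given fail location is currently armed. */
static bool sketch_fail_check(unsigned int id)
{
	return sketch_fail_loc == id;
}

/* Report read-only if the superblock really is read-only, or if the
 * fault is armed for this particular target index. */
void sketch_statfs_update_state(struct sketch_osd *osd)
{
	if ((osd->sb_flags & SKETCH_MS_RDONLY) ||
	    (sketch_fail_check(SKETCH_FAIL_OSD_READONLY) &&
	     osd->index == sketch_fail_val))
		osd->os_state = SKETCH_OS_STATE_READONLY;
}
```

Keying the fault on the target index lets a test force a single target to report read-only while the others on the same server stay unaffected, so the health_check result can be verified per target.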
| Comment by Andreas Dilger [ 29/Oct/14 ] |
|
PS: I verified that this patch is not needed for master. |
| Comment by Gerrit Updater [ 04/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12463/ |
| Comment by Peter Jones [ 04/Dec/14 ] |
|
Landed for 2.5.4 |