[LU-5822] health_check file not updating properly Created: 28/Oct/14  Updated: 24/Apr/15  Resolved: 04/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.5.4

Type: Bug Priority: Major
Reporter: Matt Ezell Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-137 ioctl passthrough mechanism for Lustr... Resolved
Severity: 3
Rank (Obsolete): 16321

 Description   

Over the weekend we had an OST abort and get marked read-only:

[  726.076561] LDISKFS-fs error (device dm-25): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 111692corrupted: 32768 blocks free in bitmap, 1024 - in gd
[  726.116663] 
[  726.125085] Aborting journal on device dm-25-8.
[  726.133359] LustreError: 17032:0:(ofd_obd.c:1095:ofd_destroy()) f1-OST00ff: error destroying object [0x100000000:0x16546e7:0x0]: 0
[  726.176268] LDISKFS-fs (dm-25): 
[  726.179457] LDISKFS-fs error (device dm-25): ldiskfs_journal_start_sb: Detected aborted journal
[  726.179459] LDISKFS-fs (dm-25): Remounting filesystem read-only

We rely on the /proc/fs/lustre/health_check file to notify us of these situations. Unfortunately, we never got a notification. I found a bug in the b2_5 implementation of the osd-ldiskfs osd_statfs() function. Code inspection leads me to believe it does not affect master, but I haven't tried it there. I will upload a patch momentarily.
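For context, a monitoring check of the kind described above could look like the following minimal sketch. It assumes the file's first line begins with the word "healthy" when all is well and contains something else (for example "NOT HEALTHY") otherwise. The function names are hypothetical and are not part of Lustre.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the text reports a healthy node, 0 otherwise.
 * Assumption: healthy content starts with the word "healthy";
 * anything else (e.g. "NOT HEALTHY") is treated as a fault. */
int health_text_is_healthy(const char *text)
{
        static const char ok[] = "healthy";
        size_t n = sizeof(ok) - 1;

        if (text == NULL || strncmp(text, ok, n) != 0)
                return 0;
        /* Accept only end of string or whitespace after the keyword. */
        return text[n] == '\0' || text[n] == '\n' || text[n] == ' ';
}

/* Reads the first line of `path` and classifies it.
 * Returns 1 (healthy), 0 (not healthy), or -1 if nothing could be read. */
int health_file_is_healthy(const char *path)
{
        char line[256];
        FILE *f = fopen(path, "r");
        int rc = -1;

        if (f == NULL)
                return -1;
        if (fgets(line, sizeof(line), f) != NULL)
                rc = health_text_is_healthy(line);
        fclose(f);
        return rc;
}
```

A monitor would call health_file_is_healthy("/proc/fs/lustre/health_check") periodically and alert on any result other than 1. A check like this is only as good as the state the server reports, which is why the osd_statfs() bug went unnoticed.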



 Comments   
Comment by Matt Ezell [ 28/Oct/14 ]

I'd also like to add a test to make sure this doesn't break in the future, but I'm not sure of the best way to make the device go read-only. I started with the read-only infrastructure that the test system uses, but that appears to set the underlying device to read-only, not the ldiskfs file system.

First, I tried to implement a server_remount_fs() function so you could do 'mount -o remount,ro', but that function only gets the Lustre superblock, so it would still need to call down into the osd-api to actually do anything to the underlying file system.
I then looked at adding an IOCTL that lctl could call, but that also appears to require support from the osd-api.
I have a prototype patch that adds a new osd-api method, dt_abort_device(), that could be called from either remount or lctl, but I'm not sure it makes sense to add a new method just for testing this. Thoughts?

Is there an easier way to cause the underlying filesystem to abort or go read-only?

Comment by Matt Ezell [ 28/Oct/14 ]

http://review.whamcloud.com/12463

Comment by James Nunez (Inactive) [ 28/Oct/14 ]

Matt,

Thanks for the patch. I'll look into a test to check that health_check is being updated properly.

James

Comment by Andreas Dilger [ 29/Oct/14 ]

There was some work done in LU-137 to allow ioctl() pass-through to the underlying filesystem, but this was complicated in the 2.4+ releases by the OSD API and the addition of ZFS. I would be happy to see that work move forward, but it is probably overkill for this.

It would be easier to use the existing fault injection method for Lustre and add a new FAIL_LOC for this case, like:

        if (sb->s_flags & MS_RDONLY ||
            (OBD_FAIL_CHECK(OBD_FAIL_OSD_READONLY) &&
             osd->od_index == libcfs_fail_val))
                osd->od_statfs.os_state = OS_STATE_READONLY;

or similar (this is just off the top of my head, so the syntax might not be quite correct).
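The condition suggested above can be modelled outside the kernel to show the intended behaviour. This is only a sketch under stated assumptions: the constants are stand-ins for MS_RDONLY and OS_STATE_READONLY, the struct models OBD_FAIL_CHECK() and libcfs_fail_val, and all sketch_ names are hypothetical, not Lustre code.

```c
#include <stddef.h>

/* Stand-in values for this sketch; the real definitions live in
 * kernel and Lustre headers. */
#define SKETCH_MS_RDONLY         0x1u
#define SKETCH_OS_STATE_READONLY 0x2u

/* Models the fault-injection state consulted by the check. */
struct sketch_fail_state {
        int armed;  /* models OBD_FAIL_CHECK(...) evaluating true */
        int value;  /* models libcfs_fail_val (target device index) */
};

/* Returns the state flags a statfs reply would carry: read-only if the
 * superblock is really read-only, or if the fault is armed for this
 * particular device index. */
unsigned int sketch_statfs_state(unsigned long sb_flags, int dev_index,
                                 const struct sketch_fail_state *fail)
{
        unsigned int state = 0;

        if ((sb_flags & SKETCH_MS_RDONLY) ||
            (fail != NULL && fail->armed && fail->value == dev_index))
                state |= SKETCH_OS_STATE_READONLY;
        return state;
}
```

Matching on the device index lets a test mark a single target read-only while the other targets on the same server keep reporting as healthy.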

Comment by Andreas Dilger [ 29/Oct/14 ]

PS: I verified that this patch is not needed for master.

Comment by Gerrit Updater [ 04/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12463/
Subject: LU-5822 osd-ldiskfs: Correctly return OS_STATE_READONLY
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 1ff49a78e443f935670daf0c84b5b989c02dca04

Comment by Peter Jones [ 04/Dec/14 ]

Landed for 2.5.4
