[LU-588] IO hangs from MMP - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.1.5, Lustre 1.8.9
Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
Labels: None
Environment: RHEL 5.6 (2.6.18-238.19.1.el5) with one SCSI device handler patch from RHEL 5.7 kernels

Severity: 3
Rank (Obsolete): 7261

Description

My mkfs.lustre commands occasionally hang when I format all of the OSTs on an OSS simultaneously (29-30 OSTs). When this happens, every mke2fs has completed, but mkfs.lustre is stuck in the TASK_UNINTERRUPTIBLE state. The system then starts reporting hung tasks for mkfs.lustre, for the kmmpd kernel threads, and for a few other tasks that are stuck waiting on mutexes because of the MMP issue.

I see the following messages in dmesg/syslog:
Aug 4 10:56:12 s01ns030 kernel: LDISKFS-fs warning (device dm-78): kmmpd: kmmpd being stopped since MMP feature has been disabled.
Aug 4 10:56:16 s01ns030 kernel: LDISKFS-fs warning (device dm-70): kmmpd: kmmpd being stopped since MMP feature has been disabled.
Aug 4 10:56:17 s01ns030 kernel: LDISKFS-fs warning (device dm-66): kmmpd: kmmpd being stopped since MMP feature has been disabled.

After adding some printks to kmmpd and forcing a panic, I found that the buffer_head used by the kmmpd kthread had been zeroed. The problem appears to be in ldiskfs_put_super(): the superblock's buffer_head is released before the kmmpd kthread is stopped.

Moving the release of the superblock buffer_head to after the MMP thread has been stopped appears to have fixed the issue for me:

--- ext4-mmp-rhel5.patch.orig	2011-08-11 12:01:59.000000000 +0000
+++ ext4-mmp-rhel5.patch	2011-08-11 12:06:42.000000000 +0000
@@ -522,12 +522,21 @@
 
  #include "ext4.h"
  #include "ext4_jbd2.h"
-@@ -698,6 +700,8 @@ static void ext4_put_super(struct super_
+@@ -682,7 +682,6 @@
+ 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
+ 	percpu_counter_destroy(&sbi->s_dirs_counter);
+ 	percpu_counter_destroy(&sbi->s_dirtyblocks_counter);
+-	brelse(sbi->s_sbh);
+ #ifdef CONFIG_QUOTA
+ 	for (i = 0; i < MAXQUOTAS; i++)
+ 		kfree(sbi->s_qf_names[i]);
+@@ -698,6 +700,9 @@ static void ext4_put_super(struct super_
  		invalidate_bdev(sbi->journal_bdev, 0);
  		ext4_blkdev_remove(sbi);
  	}
 +	if (sbi->s_mmp_tsk)
 +		kthread_stop(sbi->s_mmp_tsk);
++	brelse(sbi->s_sbh);
  	sb->s_fs_info = NULL;
  	/*
  	 * Now that we are completely done shutting down the

Attachments


ldiskfs_mmp.patch
0.8 kB
11/Aug/11 10:03 AM

Issue Links

Trackbacks

Lustre 1.8.x known issues tracker: While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA


People

Assignee: Jian Yu

Reporter: Jeremy Filizetti

Votes: 0

Watchers: 6

Dates

Created: 11/Aug/11 10:03 AM

Updated: 18/Mar/13 5:47 AM

Resolved: 18/Mar/13 5:47 AM