Lustre / LU-5420

Failure on test suite sanity test_17m: mount MDS failed, Input/output error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.6.0, Lustre 2.7.0
    • Environment: client and server: lustre-b2_6-rc2 RHEL6 ldiskfs DNE mode
    • Severity: 3
    • 15076

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/16302020-14ed-11e4-bb6a-5254006e85c2.

      The sub-test test_17m failed with the following error:

      test failed to respond and timed out

      This bug was hit in many tests; the environment is configured as 1 MDS with 2 MDTs. The error was not hit when the configuration was 2 MDSs with 2 MDTs.

      client console:

      CMD: onyx-46vm7 mkdir -p /mnt/mds1
      CMD: onyx-46vm7 test -b /dev/lvm-Role_MDS/P1
      Starting mds1:   /dev/lvm-Role_MDS/P1 /mnt/mds1
      CMD: onyx-46vm7 mkdir -p /mnt/mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/mds1
      onyx-46vm7: mount.lustre: mount /dev/mapper/lvm--Role_MDS-P1 at /mnt/mds1 failed: Input/output error
      onyx-46vm7: Is the MGS running?
      Start of /dev/lvm-Role_MDS/P1 on mds1 failed 5
      

      Attachments

        Issue Links

          Activity


            scherementsev Sergey Cheremencev added a comment -

            Hello,

            We hit this problem at Xyratex and have another solution: http://review.whamcloud.com/#/c/12515/.
            Hope it is helpful.
            di.wang Di Wang added a comment -

            Just updated the patch.


            adilger Andreas Dilger added a comment -

            The patch is still failing with a hang at unmount time (this failed in four separate conf-sanity runs, in different subtests):

            01:52:25:INFO: task umount:26263 blocked for more than 120 seconds.
            01:52:25:      Tainted: G        W  ---------------    2.6.32-431.23.3.el6_lustre.g9f5284f.x86_64 #1
            01:52:26:umount        D 0000000000000000     0 26263  26262 0x00000080
            01:52:27:Call Trace:
            01:52:27: [<ffffffff8152b6e5>] rwsem_down_failed_common+0x95/0x1d0
            01:52:27: [<ffffffff8152b843>] rwsem_down_write_failed+0x23/0x30
            01:52:28: [<ffffffff8128f7f3>] call_rwsem_down_write_failed+0x13/0x20
            01:52:28: [<ffffffffa0b13cd1>] client_disconnect_export+0x61/0x460 [ptlrpc]
            01:52:28: [<ffffffffa058975a>] lustre_common_put_super+0x28a/0xbf0 [obdclass]
            01:52:28: [<ffffffffa05bc508>] server_put_super+0x198/0xe50 [obdclass]
            01:52:29: [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
            01:52:29: [<ffffffff8118b326>] kill_anon_super+0x16/0x60
            01:52:29: [<ffffffffa0580d06>] lustre_kill_super+0x36/0x60 [obdclass]
            01:52:29: [<ffffffff8118bac7>] deactivate_super+0x57/0x80
            01:52:29: [<ffffffff811ab4cf>] mntput_no_expire+0xbf/0x110
            01:52:29: [<ffffffff811ac01b>] sys_umount+0x7b/0x3a0

            adilger Andreas Dilger added a comment -

            Without this patch I'm also unable to test past sanity.sh test_17m and test_17n: with a shared MGS+MDS, the mount fails with -EIO and testing hangs until I remount the MDS. I'm able to mount it manually after 2 or 3 tries, so there must be some kind of startup race between the MDS and the MGS. Once I applied this patch I made it through all of sanity.sh and sanityn.sh, with multiple MDS remounts, without problems, until I hit a memory allocation deadlock running dbench that looks unrelated.
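The manual workaround described above (retrying the mount by hand until the MGS has come up) can be sketched as a shell retry loop. This is purely illustrative and not part of the Lustre test framework; `try_mount` is a hypothetical stand-in for the real `mount -t lustre` command, rigged here to fail twice as if the MGS were still starting:

```shell
attempts=0
# Hypothetical stand-in for 'mount -t lustre <dev> /mnt/mds1'.  To model
# the startup race, it fails the first two attempts (as if the MGS were
# still coming up) and succeeds on the third.
try_mount() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

for i in 1 2 3 4 5; do
    if try_mount; then
        echo "mounted on attempt $i"
        break
    fi
    echo "attempt $i failed: Input/output error -- is the MGS running?"
    # a real script would sleep a few seconds between attempts
done
```

A retry loop like this only papers over the race; the discussion in this ticket is about fixing the ordering itself.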

            jhammond John Hammond added a comment -

            Test-specific issues aside, we need to fix this, since putting two MDTs from one filesystem on a single node will be a likely failover configuration.

            t:lustre-release# export LUSTRE=$HOME/lustre-release/lustre
            t:lustre-release# export MDSCOUNT=2
            t:lustre-release# llmount.sh
            ...
            Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
            Started lustre-MDT0000
            Starting mds2:   -o loop /tmp/lustre-mdt2 /mnt/mds2
            Started lustre-MDT0001
            Starting ost1:   -o loop /tmp/lustre-ost1 /mnt/ost1
            Started lustre-OST0000
            Starting ost2:   -o loop /tmp/lustre-ost2 /mnt/ost2
            Started lustre-OST0001
            Starting client: t:  -o user_xattr,flock t@tcp:/lustre /mnt/lustre
            Using TIMEOUT=20
            seting jobstats to procname_uid
            Setting lustre.sys.jobid_var from disable to procname_uid
            Waiting 90 secs for update
            Updated after 3s: wanted 'procname_uid' got 'procname_uid'
            disable quota as required
            t:lustre-release# umount /mnt/mds1
            t:lustre-release# mount /tmp/lustre-mdt1 /mnt/mds1 -o loop -t lustre
            mount.lustre: mount /dev/loop0 at /mnt/mds1 failed: Input/output error
            Is the MGS running?
            di.wang Di Wang added a comment - - edited

            Sigh, most of the insanity tests start an MDT or OST before the MGS; that is why this patch causes so many insanity failures. So if "starting the MGS before other targets" is a requirement, then we need to fix insanity.
            di.wang Di Wang added a comment -

            [7/30/14, 12:54:36 PM] wangdi: the insanity failure is because, with the fix of LU-5425, the MDT will insist that the MGS is started, i.e. the MDS setup process will wait until the MGS is set up. But insanity test_1 starts mdt2 first, then mdt1/mgs, which is why the test fails
            [7/30/14, 12:55:32 PM] wangdi: so can we just fix the test case here, because it seems to me the MGS must be set up first
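The behavior described in the chat above (MDT setup blocking until the MGS is up) amounts to a polling wait before starting other targets. A minimal sketch, with the caveat that `mgs_is_up` is a hypothetical stand-in and not a real Lustre utility; here it is rigged to report the MGS up on the third poll:

```shell
polls=0
# Hypothetical probe for MGS availability; a real check might query the
# MGS device or its lctl state.  Here it succeeds on the third poll.
mgs_is_up() {
    polls=$((polls + 1))
    [ "$polls" -ge 3 ]
}

# Block until the MGS is available, as the LU-5425 fix makes MDT setup do.
while ! mgs_is_up; do
    echo "MGS not up yet, waiting..."
    # a real implementation would sleep and enforce a timeout
done
echo "MGS is up after $polls polls; safe to start mdt2"
```

Under this constraint, insanity test_1's ordering (mdt2 before mdt1/mgs) would block at the wait, which matches the failure described.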
            di.wang Di Wang added a comment -

            Hmm, I think the insanity failures are probably related to the fix from LU-5420. I am looking at it now.
            adilger Andreas Dilger added a comment - - edited

            I verified that virtually all of the test failures marked LU-5077 are actually from the three versions of the LU-5420 patches, which fail "insanity" and "conf-sanity" repeatedly. Due to the presence of LU-5425, I'm not 100% positive that all of those are caused by this patch, but the insanity failures definitely are.

            adilger Andreas Dilger added a comment -

            It seems that this patch is repeatedly failing insanity, even when it is running on b2_6. The failures are marked as LU-5077, but I don't think that is the real reason. I suspect there is some other problem with this patch that needs to be investigated.

            adilger Andreas Dilger added a comment -

            The option #2 patch is testing well on my local system (single node: 2x MDT, 3x OST, client), which was having solid test failures in sanity.sh test_17m and test_17o (which I'd incorrectly attributed to the LU-1538 patch http://review.whamcloud.com/10481 that was reverted).

            I've pushed an updated version of the 11240 patch at http://review.whamcloud.com/11258 with improved comments and some noise removed from the console. Since this might be a blocker, I didn't refresh the original 11240 patch so that it could continue testing, but I'd prefer that the 11258 version land if it is ready.

            People

              Assignee: di.wang Di Wang
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 16

              Dates

                Created:
                Updated:
                Resolved: