
[LU-8450] replay-single test 70c: mount MDS hung

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Severity: 3

    Description

      replay-single test 70c hung while mounting MDS:

      Starting mds1:   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
      CMD: onyx-33vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
      

      Console log on MDS:

      Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre                                   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
      LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
      LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.127@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 683 previous similar messages
      Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1469762588/real 1469762588]  req@ffff880051ebaa00 x1541153266605312/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1469762613 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
      Lustre: 29062:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply 
        req@ffff88004a8faa00 x1541153729978528/t0(0) o101->6a772ed4-43ff-dc51-4d04-2c0278989dc2@10.2.4.120@tcp:-1/-1 lens 872/3512 e 24 to 0 dl 1469763017 ref 2 fl Interpret:/0/0 rc 0/0
      Lustre: lustre-MDT0002: Client 6a772ed4-43ff-dc51-4d04-2c0278989dc2 (at 10.2.4.120@tcp) reconnecting
      Lustre: Skipped 1 previous similar message
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: Skipped 1 previous similar message
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: Skipped 3 previous similar messages
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: Skipped 6 previous similar messages
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: Skipped 12 previous similar messages
      LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.127@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 1909 previous similar messages
      Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1469763188/real 1469763188]  req@ffff8800546a1200 x1541153266628608/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1469763213 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
      INFO: task mdt00_002:29063 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdt00_002       D ffffffffa0b1d108     0 29063      2 0x00000080 
       ffff88004f7b3aa0 0000000000000046 ffff88004bc15c00 ffff88004f7b3fd8
       ffff88004f7b3fd8 ffff88004f7b3fd8 ffff88004bc15c00 ffffffffa0b1d100
       ffffffffa0b1d104 ffff88004bc15c00 00000000ffffffff ffffffffa0b1d108
      Call Trace:
       [<ffffffff8163cb09>] schedule_preempt_disabled+0x29/0x70
       [<ffffffff8163a805>] __mutex_lock_slowpath+0xc5/0x1c0
       [<ffffffff81639c6f>] mutex_lock+0x1f/0x2f
       [<ffffffffa0a8e024>] nodemap_add_member+0x34/0x1b0 [ptlrpc]
       [<ffffffffa0dbf161>] mdt_obd_reconnect+0x81/0x1d0 [mdt]
       [<ffffffffa09d1e6f>] target_handle_connect+0x1c4f/0x2e30 [ptlrpc]
       [<ffffffffa0a6f5f2>] tgt_request_handle+0x3f2/0x1320 [ptlrpc]
       [<ffffffffa0a1bccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
       [<ffffffffa0a19888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
       [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
       [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
       [<ffffffffa0a1fd80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
       [<ffffffffa0a1f2e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
       [<ffffffff810a5aef>] kthread+0xcf/0xe0
       [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
       [<ffffffff816469d8>] ret_from_fork+0x58/0x90 
       [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
      

      Maloo reports:
      https://testing.hpdd.intel.com/test_sets/3f6a9a0e-557a-11e6-906c-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/cecb3c06-54af-11e6-a39e-5254006e85c2
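
      The call trace above shows an MDT service thread stuck in mutex_lock() inside nodemap_add_member() on the reconnect path; per the analysis in the comments below, the mutex appears to be held by a thread that never finishes revoking export locks. Below is a minimal user-space model of that blocking pattern; all names are illustrative and this is not Lustre code.

    /* hang_model.c: user-space model of the hang (illustrative only).
     * Thread A holds a mutex standing in for the nodemap config lock while
     * it spins in a revocation loop that never finishes; thread B models
     * mdt_obd_reconnect() -> nodemap_add_member() and blocks on the same
     * mutex, which is what the hung-task trace above shows.
     * Build: cc -pthread hang_model.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t nodemap_config_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *revoke_thread(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&nodemap_config_lock);
            for (;;)                /* "empty the lock hash" never completes */
                    sleep(1);
            return NULL;            /* never reached; mutex never released */
    }

    static void *reconnect_thread(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&nodemap_config_lock);   /* blocks forever */
            puts("reconnect proceeded");                /* never printed */
            pthread_mutex_unlock(&nodemap_config_lock);
            return NULL;
    }

    int main(void)
    {
            pthread_t revoker, reconnector;

            pthread_create(&revoker, NULL, revoke_thread, NULL);
            sleep(1);               /* let the revoker take the mutex first */
            pthread_create(&reconnector, NULL, reconnect_thread, NULL);
            pthread_join(reconnector, NULL);  /* hangs, like mdt00_002 above */
            return 0;
    }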

      Attachments

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23120/
            Subject: LU-8450 nodemap: modify ldlm_revoke_export_locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 449fe54db666a3ad29503a55aa0f048b5f1d6543


            gerrit Gerrit Updater added a comment -

            Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/23120
            Subject: LU-8450 nodemap: modify ldlm_revoke_export_locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a6884dde68e959915dcfac6197641aa3cc27aed6


            kit.westneat Kit Westneat (Inactive) added a comment -

            > The loop is external to cfs_hash_for_each_relax(), which stops on change (as is mentioned in the comment), and the outside code needs to loop (the loop we are talking about).

            Hmm, but cfs_hash_for_each_relax also returns non-zero if it is successful, no? It returns the number of items iterated over (count), so cfs_hash_for_each_empty loops until the hash is empty. As far as I can tell, neither cfs_hash_for_each_nolock nor cfs_hash_for_each_empty actually checks the return value of cfs_hash_for_each_relax for the -EINTR condition described in the comment.

            Using cfs_hash_for_each_nolock in ldlm_revoke_export_locks makes more sense to me; I'll upload a patch with that change. It seems like cfs_hash_for_each_nolock should also be modified to check for -EINTR, though I'm not sure what the proper behavior is.
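
            For reference, the shape of that proposed change might look roughly like the sketch below: a single pass over the export lock hash instead of draining it until empty, with the existing revoke callback and AST dispatch left as they are. This is only a sketch against the current ldlm_revoke_export_locks(), not the patch as landed, and the exact cfs_hash_for_each_nolock() prototype varies between Lustre versions (some trees take a trailing start argument).

    /* Sketch only (not the landed patch): revoke by visiting every lock in
     * the export hash once, then sending the collected ASTs, instead of
     * looping until the hash is empty. */
    void ldlm_revoke_export_locks(struct obd_export *exp)
    {
            struct list_head rpc_list;

            INIT_LIST_HEAD(&rpc_list);
            /* single pass; some trees add a trailing "start" argument here */
            cfs_hash_for_each_nolock(exp->exp_lock_hash,
                                     ldlm_revoke_lock_cb, &rpc_list);
            ldlm_run_ast_work(exp->exp_obd->obd_namespace, &rpc_list,
                              LDLM_WORK_REVOKE_AST);
    }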

            green Oleg Drokin added a comment -

            We are talking about the same loop.
            The loop is external to cfs_hash_for_each_relax(), which stops on change (as is mentioned in the comment), and the outside code needs to loop (the loop we are talking about).

            This code was used by the (now removed) remote client functionality, and you are the only remaining user, btw, so you can mold this function to your taste.

            The conditions are a bit strange, but it was only used on the MDS anyway, so those could only be ibits (or flock) locks.
            If your locks are on the MGS, then you are using plain locks, which is again OK for the conditions.
            The ungranted locks are supposed to become granted soon, so they should not be a big concern.
            A lock for which the AST has been sent is a lock for which there is a conflict, so it should disappear soon.

            If you can figure out which lock is failing those conditions, they could be amended, I imagine.


            kit.westneat Kit Westneat (Inactive) added a comment -

            Hi Oleg,

            I think we may be talking about different things. I believe you are talking about the loops in cfs_hash_for_each_relax, but the loop I'm talking about is in cfs_hash_for_each_empty:

                    while (cfs_hash_for_each_relax(hs, func, data, 0)) {
                            CDEBUG(D_INFO, "Try to empty hash: %s, loop: %u\n",
                                   hs->hs_name, i++);
                    }
            

            cfs_hash_for_each_relax returns the number of iterations it's done in its loops, essentially the number of items remaining in the hash. So as you can see, cfs_hash_for_each_empty loops until the hash is empty.

            The conditions I'm talking about are in the callback for ldlm_revoke_export_locks:

                    if (lock->l_req_mode != lock->l_granted_mode) {
                            unlock_res_and_lock(lock);
                            return 0;
                    }
                    if (lock->l_resource->lr_type != LDLM_IBITS &&
                        lock->l_resource->lr_type != LDLM_PLAIN) {
                            unlock_res_and_lock(lock);
                            return 0;
                    }
                    if (ldlm_is_ast_sent(lock)) {
                            unlock_res_and_lock(lock);
                            return 0;
                    }
            

            So if there is a lock in the exp_lock_hash that meets one of those three conditions, cfs_hash_for_each_empty will loop until some external process modifies that lock. My theory is that in these deadlock cases, nothing modifies the lock and the hash is never emptied, so cfs_hash_for_each_empty loops forever.

            In the original code, it iterated the lock list once and did not require that the list be emptied. I was wondering if there is a reason for using cfs_hash_for_each_empty in this case, or if it's possible to replace it with cfs_hash_for_each_nolock, which seems to be closer to what the original code does.

            Thanks,
            Kit
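
            A tiny self-contained model of the difference described above is sketched below; the table, callback and function names are made up for illustration and are not the libcfs APIs. The drain-until-empty traversal never returns once the callback declines to remove an entry, while the single-pass traversal always terminates.

    /* Hypothetical model of the two traversal styles; compile with c99. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NENTRIES 4

    static bool present[NENTRIES] = { true, true, true, true };

    /* Callback: returns true if it removed the entry.  Entry 2 stands in
     * for a lock that always hits one of the three early-return checks
     * above (ungranted, wrong type, or AST already sent). */
    static bool revoke_cb(int i)
    {
            if (i == 2)
                    return false;           /* skipped, stays in the table */
            present[i] = false;
            return true;
    }

    /* Single pass, in the spirit of cfs_hash_for_each_nolock(): visit each
     * entry once and return. */
    static void traverse_once(void)
    {
            for (int i = 0; i < NENTRIES; i++)
                    if (present[i])
                            revoke_cb(i);
    }

    /* Drain until empty, in the spirit of cfs_hash_for_each_empty(): repeat
     * whole passes until nothing is left.  With entry 2 never removed, this
     * never returns. */
    static void traverse_until_empty(void)
    {
            int remaining;

            do {
                    remaining = 0;
                    for (int i = 0; i < NENTRIES; i++)
                            if (present[i] && !revoke_cb(i))
                                    remaining++;
                    printf("Try to empty hash, %d left\n", remaining);
            } while (remaining);
    }

    int main(void)
    {
            traverse_once();            /* terminates, entry 2 untouched */
            traverse_until_empty();     /* loops forever printing "1 left" */
            return 0;
    }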

            green Oleg Drokin added a comment -

            Well, the loop is there, but for it to keep going around all the time, these conditions need to be met:

            • a. if rehash_key is enabled, an item can be moved from one bucket to another bucket
            • b. user can remove non-zero-ref item from hash-table, so the item can be removed from hash-table; even worse, it's possible that user changed key and insert to another hash bucket.

            So we need a constant stream of activity on the bucket then; where does that come from? If it's expected to be there, then possibly cfs_hash_for_each_empty is the wrong API for you? Or you need to somehow fence the activity in the hash while you are working on it.


            kit.westneat Kit Westneat (Inactive) added a comment -

            Hi Oleg,

            Yes, that's the commit I'm talking about.

            In the older version, the list was only traversed once, right? In the newer version, it loops until the hash is empty, even if there are some locks that are not processed due to the three checks at the beginning of the callback. This is the change I don't understand; I don't know enough about the locking system to know what those checks do with regard to lock eviction.

            When I've run into similar deadlocks with D_INFO enabled, I see "Try to empty hash:" repeated essentially forever. It's possible that this is a different deadlock, but it seems similar.

            Thanks,
            Kit

            green Oleg Drokin added a comment -

            Do you mean commit 8073f0e4bef2db551a4b4bcaeb72a9986571f1bd?

            I do not see a big logic change.
            Before, we traversed the list, collected entries, accumulated them into an AST list and then sent them from ldlm_run_ast_work().

            Now we use cfs_hash_for_each_empty() to collect all the entries in the hash, fill the list in the callback and then send everything from ldlm_run_ast_work() as before.
            I am not sure which loop you mean here. If you mean the one inside cfs_hash_for_each_empty, that one would leave very visible traces in the logs if it were looping, and it would be seen in the stack trace too, but that's not happening?
            Also, the leading comment in cfs_hash_for_each_relax() describes two cases where it needs to loop, and I don't think they are happening in this case?


            kit.westneat Kit Westneat (Inactive) added a comment -

            It looks like ldlm_revoke_export_locks is getting hung trying to empty the lock hash (exp->exp_lock_hash), which causes the nodemap functions to block waiting for that to finish.

            I was looking at the original code for ldlm_revoke_export_locks, and I noticed that that code does not use hash_for_each_empty; it just sends the ASTs and finishes. There was a change in 2008 to switch the lock data structure from a list to a hash. It looks like the change to loop until the hash is empty (hash_for_each_empty) was made then. The commit message doesn't discuss this change at all, so I wonder if it was unintentional.

            Is there anyone who would know what the correct behavior is?

            Thanks,
            Kit


            People

              Assignee: kit.westneat Kit Westneat (Inactive)
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 6
