[LU-8357] sanity-sec LBUG on MDS umount with ASSERTION( exp->u.eu_target_data.ted_nodemap == nodemap ) Created: 30/Jun/16  Updated: 02/Aug/16  Resolved: 26/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Kit Westneat
Resolution: Fixed Votes: 0
Labels: None
Environment:

autotest


Issue Links:
Related
is related to LU-3291 IU UID/GID Mapping Feature Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-sec times out on MDS umount. From the suite_stdout log, after all of the sanity-sec tests have run, we see:

13:18:19:CMD: onyx-36vm7 grep -c /mnt/lustre-mds1' ' /proc/mounts
13:18:19:Stopping /mnt/lustre-mds1 (opts:-f) on onyx-36vm7
13:18:19:CMD: onyx-36vm7 umount -d -f /mnt/lustre-mds1
14:17:19:********** Timeout by autotest system **********

The test_complete log for MDS1 (vm7) doesn't show anything interesting, but the MDS1 console log for test_27 shows the LBUG:

13:20:13:[30936.817772] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
13:20:13:[30942.959187] Lustre: 3708:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1467231499/real 1467231499]  req@ffff88007b2bdb00 x1538472463395376/t0(0) o39->lustre-MDT0000-lwp-MDT0000@0@lo:12/10 lens 224/224 e 0 to 1 dl 1467231505 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
13:20:13:[30942.968919] Lustre: 3708:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 39 previous similar messages
13:20:13:[31024.097193] Lustre: 11462:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1467231575/real 1467231575]  req@ffff88007b2bfc00 x1538472463395360/t0(0) o104->lustre-MDT0000@10.2.4.148@tcp:15/16 lens 296/224 e 0 to 1 dl 1467231586 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
13:20:13:[31024.108553] Lustre: 11462:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
13:20:13:[31046.110142] LustreError: 11462:0:(ldlm_lockd.c:691:ldlm_handle_ast_error()) ### client (nid 10.2.4.148@tcp) failed to reply to blocking AST (req@ffff88007b2bfc00 x1538472463395360 status 0 rc -110), evict it ns: mdt-lustre-MDT0000_UUID lock: ffff88007a8ea200/0x40220881aef8307 lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13 rrc: 1 type: IBT flags: 0x60200400000020 nid: 10.2.4.148@tcp remote: 0x3f8b8faf93471726 expref: 5 pid: 11463 timeout: 4325802406 lvb_type: 0
13:20:13:[31046.122612] LustreError: 138-a: lustre-MDT0000: A client on nid 10.2.4.148@tcp was evicted due to a lock blocking callback time out: rc -110
13:20:13:[31046.135477] LustreError: 13044:0:(nodemap_member.c:50:nm_member_del()) ASSERTION( exp->u.eu_target_data.ted_nodemap == nodemap ) failed: 
13:20:13:[31046.140099] LustreError: 13044:0:(nodemap_member.c:50:nm_member_del()) LBUG
13:20:13:[31046.142126] Pid: 13044, comm: mdt00_003
13:20:13:[31046.143802] 
13:20:13:[31046.143802] Call Trace:
13:20:13:[31046.146737]  [<ffffffffa06a87d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
13:20:13:[31046.148763]  [<ffffffffa06a8d75>] lbug_with_loc+0x45/0xc0 [libcfs]
13:20:13:[31046.150703]  [<ffffffffa0a8c79d>] nm_member_del+0x18d/0x190 [ptlrpc]
13:20:13:[31046.152683]  [<ffffffffa0a8743f>] nodemap_del_member+0x5f/0x170 [ptlrpc]
13:20:13:[31046.154752]  [<ffffffffa0db95a5>] mdt_obd_disconnect+0x155/0x640 [mdt]
13:20:13:[31046.156740]  [<ffffffffa09cd15b>] target_handle_disconnect+0x10b/0x4a0 [ptlrpc]
13:20:13:[31046.158828]  [<ffffffffa0a64e37>] tgt_disconnect+0x37/0x140 [ptlrpc]
13:20:13:[31046.160881]  [<ffffffffa0a69595>] tgt_request_handle+0x915/0x1320 [ptlrpc]
13:20:13:[31046.162920]  [<ffffffffa0a15b1b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
13:20:13:[31046.165045]  [<ffffffffa0a136d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
13:20:13:[31046.166963]  [<ffffffff810b88b2>] ? default_wake_function+0x12/0x20
13:20:13:[31046.168920]  [<ffffffff810af018>] ? __wake_up_common+0x58/0x90
13:20:13:[31046.170841]  [<ffffffffa0a19bd0>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
13:20:13:[31046.172759]  [<ffffffffa0a19130>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
13:20:13:[31046.174784]  [<ffffffff810a5acf>] kthread+0xcf/0xe0
13:20:13:[31046.176658]  [<ffffffff810a5a00>] ? kthread+0x0/0xe0
13:20:13:[31046.178486]  [<ffffffff81646318>] ret_from_fork+0x58/0x90
13:20:13:[31046.180329]  [<ffffffff810a5a00>] ? kthread+0x0/0xe0
13:20:13:[31046.182177] 
13:20:13:[31046.183948] Kernel panic - not syncing: LBUG
13:20:13:[31046.184689] CPU: 0 PID: 13044 Comm: mdt00_003 Tainted: G           OE  ------------   3.10.0-327.18.2.el7_lustre.x86_64 #1
13:20:13:[31046.184689] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
13:20:13:[31046.184689]  ffffffffa06c5def 000000006a616767 ffff88004311bb00 ffffffff81635c14
13:20:13:[31046.184689]  ffff88004311bb80 ffffffff8162f48a ffffffff00000008 ffff88004311bb90
13:20:13:[31046.184689]  ffff88004311bb30 000000006a616767 ffffffffa0aa54eb 0000000000000246

There are several instances of this failure. Logs are at:
2016-06-29 - https://testing.hpdd.intel.com/test_sets/2cbf7ebc-3e44-11e6-80b9-5254006e85c2
2016-06-23 - https://testing.hpdd.intel.com/test_sets/a6a39816-39c2-11e6-acf3-5254006e85c2
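
For context, the assertion fires in nm_member_del() on the disconnect path (mdt_obd_disconnect() -> nodemap_del_member() -> nm_member_del()), and it checks that the export being removed still points back at the nodemap whose member list it is on. A minimal sketch of that invariant follows; the struct layout and any list/field names beyond those visible in the assertion message are assumptions for illustration, not the actual Lustre source.

/* Illustrative sketch only (kernel-style C), simplified from what the stack
 * trace implies.  Names not shown in the assertion message are assumed.
 * LASSERT() is the libcfs assertion macro that produces the LBUG above. */
#include <linux/list.h>

struct nodemap;

struct tg_export_data {
	struct nodemap   *ted_nodemap;        /* nodemap this export is classified into */
	struct list_head  ted_member_linkage; /* linkage on that nodemap's member list (assumed name) */
};

struct obd_export {
	union {
		struct tg_export_data eu_target_data;
	} u;
};

/*
 * Invariant behind the LASSERT: the export must still reference the nodemap
 * whose member list it is being unlinked from.  If another thread reclassifies
 * the export (rewriting ted_nodemap) while the disconnect path runs, the two
 * pointers disagree and the assertion trips, causing the LBUG seen above.
 */
static void nm_member_del_sketch(struct nodemap *nodemap, struct obd_export *exp)
{
	LASSERT(exp->u.eu_target_data.ted_nodemap == nodemap);

	list_del_init(&exp->u.eu_target_data.ted_member_linkage);
	exp->u.eu_target_data.ted_nodemap = NULL;
}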



 Comments   
Comment by Joseph Gmitter (Inactive) [ 01/Jul/16 ]

Hi Kit,

Could you please have a look at this nodemap issue?

Thanks.
Joe

Comment by Kit Westneat [ 05/Jul/16 ]

I found a couple of places where nodemaps were being reclassified outside of the config lock, which could cause this LBUG.

I'll upload a new patch.
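
The approach, per the patch subject, is roughly as sketched below: perform reclassification only while holding the nodemap active-config lock, so a concurrent disconnect can never see ted_nodemap pointing at a nodemap the export is no longer a member of. The lock and helper names here are assumptions based on the patch subject, not the change as merged.

/* Hedged sketch of the fix's idea, not the actual patch
 * (see http://review.whamcloud.com/21159 for the real change). */
#include <linux/rwsem.h>

struct nodemap;
struct obd_export;

/* Assumed: an rwsem guarding the active nodemap configuration. */
extern struct rw_semaphore active_config_lock;

/* Hypothetical helpers standing in for the real member-list operations. */
void nm_member_del(struct nodemap *nodemap, struct obd_export *exp);
void nm_member_add(struct nodemap *nodemap, struct obd_export *exp);
struct nodemap *nodemap_classify_export(struct obd_export *exp);
struct nodemap *export_nodemap(struct obd_export *exp);

static void nodemap_reclassify_export_sketch(struct obd_export *exp)
{
	struct nodemap *old_nodemap, *new_nodemap;

	/* Reclassify only under the config lock, so the member lists and the
	 * export's ted_nodemap pointer stay consistent with respect to a
	 * concurrent nodemap_del_member() on the disconnect path. */
	down_read(&active_config_lock);

	old_nodemap = export_nodemap(exp);
	new_nodemap = nodemap_classify_export(exp);
	if (new_nodemap != old_nodemap) {
		nm_member_del(old_nodemap, exp);
		nm_member_add(new_nodemap, exp);
	}

	up_read(&active_config_lock);
}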

Comment by Gerrit Updater [ 05/Jul/16 ]

Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/21159
Subject: LU-8357 nodemap: reclassify nodemap requires active conf lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 318b220dfd35cffe54e3692eccd7d055f34c710f

Comment by nasf (Inactive) [ 18/Jul/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/802afcac-4aec-11e6-8968-5254006e85c2

Comment by Gerrit Updater [ 20/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21159/
Subject: LU-8357 nodemap: reclassify nodemap requires active conf lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1ce1032a03dd26345e662164b7877079c54468f5

Comment by Joseph Gmitter (Inactive) [ 26/Jul/16 ]

Landed to master for 2.9.0
