Lustre / LU-8357

sanity-sec LBUG on MDS umount with ASSERTION( exp->u.eu_target_data.ted_nodemap == nodemap )

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.9.0
    • Fix Version/s: Lustre 2.9.0
    • Labels:
      None
    • Environment:
      autotest
    • Severity:
      3

      Description

      sanity-sec times out on MDS umount. From the suite_stdout log, after all sanity-sec tests were run, we see:

      13:18:19:CMD: onyx-36vm7 grep -c /mnt/lustre-mds1' ' /proc/mounts
      13:18:19:Stopping /mnt/lustre-mds1 (opts:-f) on onyx-36vm7
      13:18:19:CMD: onyx-36vm7 umount -d -f /mnt/lustre-mds1
      14:17:19:********** Timeout by autotest system **********
      

      The test_complete log for MDS1 (vm7) doesn't show anything interesting. However, if we look at the console log for MDS1 during test_27, we see the LBUG:

      13:20:13:[30936.817772] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
      13:20:13:[30942.959187] Lustre: 3708:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1467231499/real 1467231499]  req@ffff88007b2bdb00 x1538472463395376/t0(0) o39->lustre-MDT0000-lwp-MDT0000@0@lo:12/10 lens 224/224 e 0 to 1 dl 1467231505 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      13:20:13:[30942.968919] Lustre: 3708:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 39 previous similar messages
      13:20:13:[31024.097193] Lustre: 11462:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1467231575/real 1467231575]  req@ffff88007b2bfc00 x1538472463395360/t0(0) o104->lustre-MDT0000@10.2.4.148@tcp:15/16 lens 296/224 e 0 to 1 dl 1467231586 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      13:20:13:[31024.108553] Lustre: 11462:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      13:20:13:[31046.110142] LustreError: 11462:0:(ldlm_lockd.c:691:ldlm_handle_ast_error()) ### client (nid 10.2.4.148@tcp) failed to reply to blocking AST (req@ffff88007b2bfc00 x1538472463395360 status 0 rc -110), evict it ns: mdt-lustre-MDT0000_UUID lock: ffff88007a8ea200/0x40220881aef8307 lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13 rrc: 1 type: IBT flags: 0x60200400000020 nid: 10.2.4.148@tcp remote: 0x3f8b8faf93471726 expref: 5 pid: 11463 timeout: 4325802406 lvb_type: 0
      13:20:13:[31046.122612] LustreError: 138-a: lustre-MDT0000: A client on nid 10.2.4.148@tcp was evicted due to a lock blocking callback time out: rc -110
      13:20:13:[31046.135477] LustreError: 13044:0:(nodemap_member.c:50:nm_member_del()) ASSERTION( exp->u.eu_target_data.ted_nodemap == nodemap ) failed: 
      13:20:13:[31046.140099] LustreError: 13044:0:(nodemap_member.c:50:nm_member_del()) LBUG
      13:20:13:[31046.142126] Pid: 13044, comm: mdt00_003
      13:20:13:[31046.143802] 
      13:20:13:[31046.143802] Call Trace:
      13:20:13:[31046.146737]  [<ffffffffa06a87d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      13:20:13:[31046.148763]  [<ffffffffa06a8d75>] lbug_with_loc+0x45/0xc0 [libcfs]
      13:20:13:[31046.150703]  [<ffffffffa0a8c79d>] nm_member_del+0x18d/0x190 [ptlrpc]
      13:20:13:[31046.152683]  [<ffffffffa0a8743f>] nodemap_del_member+0x5f/0x170 [ptlrpc]
      13:20:13:[31046.154752]  [<ffffffffa0db95a5>] mdt_obd_disconnect+0x155/0x640 [mdt]
      13:20:13:[31046.156740]  [<ffffffffa09cd15b>] target_handle_disconnect+0x10b/0x4a0 [ptlrpc]
      13:20:13:[31046.158828]  [<ffffffffa0a64e37>] tgt_disconnect+0x37/0x140 [ptlrpc]
      13:20:13:[31046.160881]  [<ffffffffa0a69595>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      13:20:13:[31046.162920]  [<ffffffffa0a15b1b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      13:20:13:[31046.165045]  [<ffffffffa0a136d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      13:20:13:[31046.166963]  [<ffffffff810b88b2>] ? default_wake_function+0x12/0x20
      13:20:13:[31046.168920]  [<ffffffff810af018>] ? __wake_up_common+0x58/0x90
      13:20:13:[31046.170841]  [<ffffffffa0a19bd0>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      13:20:13:[31046.172759]  [<ffffffffa0a19130>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
      13:20:13:[31046.174784]  [<ffffffff810a5acf>] kthread+0xcf/0xe0
      13:20:13:[31046.176658]  [<ffffffff810a5a00>] ? kthread+0x0/0xe0
      13:20:13:[31046.178486]  [<ffffffff81646318>] ret_from_fork+0x58/0x90
      13:20:13:[31046.180329]  [<ffffffff810a5a00>] ? kthread+0x0/0xe0
      13:20:13:[31046.182177] 
      13:20:13:[31046.183948] Kernel panic - not syncing: LBUG
      13:20:13:[31046.184689] CPU: 0 PID: 13044 Comm: mdt00_003 Tainted: G           OE  ------------   3.10.0-327.18.2.el7_lustre.x86_64 #1
      13:20:13:[31046.184689] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      13:20:13:[31046.184689]  ffffffffa06c5def 000000006a616767 ffff88004311bb00 ffffffff81635c14
      13:20:13:[31046.184689]  ffff88004311bb80 ffffffff8162f48a ffffffff00000008 ffff88004311bb90
      13:20:13:[31046.184689]  ffff88004311bb30 000000006a616767 ffffffffa0aa54eb 0000000000000246
      

      There are several instances of this failure. Logs are at:
      2016-06-29 - https://testing.hpdd.intel.com/test_sets/2cbf7ebc-3e44-11e6-80b9-5254006e85c2
      2016-06-23 - https://testing.hpdd.intel.com/test_sets/a6a39816-39c2-11e6-acf3-5254006e85c2

    People

    • Assignee: kit.westneat Kit Westneat
    • Reporter: jamesanunez James Nunez
    • Votes: 0
    • Watchers: 5
