
MGS crashes in ldlm_reprocess_queue() when stopping

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Affects Version: Lustre 2.10.2
    • Environment: CentOS 7.4
    • Severity: 3

    Description

      We have never seen this before, but it has already happened twice with the latest 2.10.2 version: the MGS crashes when stopping:

      [77225.855547] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [77225.864304] IP: [<ffffffffc0ba48ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
      [77225.872614] PGD 0 
      [77225.874864] Oops: 0000 [#1] SMP 
      [77225.878480] Modules linked in: mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) vfat fat uas usb_storage mpt2sas mptctl mptbase dell_rbu rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core intel_powerclamp dm_service_time coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper iTCO_wdt ablk_helper iTCO_vendor_support cryptd dm_round_robin pcspkr mxm_wmi dcdbas sg ipmi_si ipmi_devintf ipmi_msghandler mei_me mei lpc_ich shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace dm_multipath sunrpc dm_mod ip_tables ext4 mbcache
      [77225.957975]  jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx4_core drm tg3 ahci libahci crct10dif_pclmul mpt3sas crct10dif_common raid_class ptp crc32c_intel libata megaraid_sas i2c_core devlink scsi_transport_sas pps_core
      [77225.986851] CPU: 23 PID: 27105 Comm: ldlm_bl_14 Tainted: G           OE  ------------   3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1
      [77225.999661] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.3.4 11/08/2016
      [77226.008010] task: ffff88103323cf10 ti: ffff881012ab8000 task.ti: ffff881012ab8000
      [77226.016359] RIP: 0010:[<ffffffffc0ba48ce>]  [<ffffffffc0ba48ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
      [77226.027355] RSP: 0018:ffff881012abbbe0  EFLAGS: 00010287
      [77226.033280] RAX: 0000000000000000 RBX: ffff881011f7d400 RCX: ffff881012abbc7c
      [77226.041240] RDX: 0000000000000002 RSI: ffff881012abbc80 RDI: ffff881011f7d400
      [77226.049201] RBP: ffff881012abbc58 R08: ffff881012abbcd0 R09: ffff88103d0d7880
      [77226.057162] R10: ffff881011f7d400 R11: 7fffffffffffffff R12: ffff880168287540
      [77226.065123] R13: 0000000000000002 R14: ffff881012abbcd0 R15: ffff881011f7d460
      [77226.073085] FS:  0000000000000000(0000) GS:ffff88203c8c0000(0000) knlGS:0000000000000000
      [77226.082111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [77226.088521] CR2: 000000000000001c CR3: 00000000019f2000 CR4: 00000000001407e0
      [77226.096482] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [77226.104443] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [77226.112403] Stack:
      [77226.114642]  ffff881012abbc7c ffff881012abbcd0 ffff881012abbc80 0000000000000000
      [77226.122930]  ffff880168287520 0000001000000001 ffff880100000010 ffff881012abbc18
      [77226.131219]  ffff881012abbc18 00000000bd734aeb 0000000000000002 ffff880168287540
      [77226.139507] Call Trace:
      [77226.142252]  [<ffffffffc0ba4860>] ? ldlm_errno2error+0x60/0x60 [ptlrpc]
      [77226.149649]  [<ffffffffc0b8f9db>] ldlm_reprocess_queue+0x13b/0x2a0 [ptlrpc]
      [77226.157434]  [<ffffffffc0b9057d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
      [77226.165220]  [<ffffffffc0b90b30>] ldlm_reprocess_res+0x20/0x30 [ptlrpc]
      [77226.172611]  [<ffffffffc0866bef>] cfs_hash_for_each_relax+0x21f/0x400 [libcfs]
      [77226.180687]  [<ffffffffc0b90b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
      [77226.188571]  [<ffffffffc0b90b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
      [77226.196441]  [<ffffffffc0869d95>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
      [77226.204518]  [<ffffffffc0b90b7c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
      [77226.212983]  [<ffffffffc0b917bc>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
      [77226.221162]  [<ffffffffc0bbada8>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
      [77226.228836]  [<ffffffff816a8fad>] ? __schedule+0x39d/0x8b0
      [77226.234977]  [<ffffffffc0bba8e0>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
      [77226.243232]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [77226.248672]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      [77226.255472]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      [77226.261494]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      [77226.268292] Code: 89 45 a0 74 0d f6 05 b3 ac cd ff 01 0f 85 34 06 00 00 8b 83 98 00 00 00 39 83 9c 00 00 00 89 45 b8 0f 84 57 09 00 00 48 8b 45 a0 <8b> 40 1c 85 c0 0f 84 7a 09 00 00 48 8b 4d a0 48 89 c8 48 83 c0 
      [77226.289871] RIP  [<ffffffffc0ba48ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
      [77226.298248]  RSP <ffff881012abbbe0>
      [77226.302135] CR2: 000000000000001c
      

      Best,
      Stephane

      Attachments

        Issue Links

          Activity

            [LU-10390] MGS crashes in ldlm_reprocess_queue() when stopping
            emoly.liu Emoly Liu added a comment -

            haaknci,

            Could you please collect some Lustre logs on the MGS using the following steps and upload them here?

            • lctl set_param debug=-1 debug_mb=1024 (or another value in MB)
            • lctl dk > /dev/null
            • reproduce this issue
            • lctl dk > mgs_panic.log
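
            Put together, a minimal sketch of that procedure as a shell session (the debug buffer size and the output file name are only examples, not requirements from this ticket):

            lctl set_param debug=-1 debug_mb=1024   # enable all debug flags with a 1024 MB debug buffer
            lctl dk > /dev/null                     # flush the existing debug buffer so only new messages are kept
            # ... reproduce the issue (e.g. stop/umount the MGS) ...
            lctl dk > mgs_panic.log                 # dump the accumulated debug log for upload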

            Thanks.


            haaknci Malcolm Haak (Inactive) added a comment -

            Lustre 2.10.8 servers.
            Lustre 2.12 clients.

            Currently we see this issue on the MGS when unmounting. We are unmounting the MGS to deal with the fact that it has become unstable and is throwing the following errors:

            Jul 26 11:47:54 g4-mds01 kernel: LNet: 40488:0:(o2iblnd_cb.c:3202:kiblnd_check_conns()) Timed out tx for 10.141.1.87@o2ib4: 18 seconds
            Jul 26 11:48:36 g4-mds01 kernel: LustreError: 166-1: MGC10.141.1.41@o2ib4: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
            Jul 26 11:48:36 g4-mds01 kernel: LustreError: 208265:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1627263816, 300s ago), entering recovery for MGS@MGC10.141.1.41@o2ib4_0 ns: MGC10.141.1.41@o2ib4 lock: ffff9f7c79309c00/0xa65add00b40c1243 lrc: 4/1,0 mode: --/CR res: [0x346174616467:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xa65add00b40c124a expref: -99 pid: 208265 timeout: 0 lvb_type: 0
            Jul 26 11:48:36 g4-mds01 kernel: LustreError: 219839:0:(ldlm_resource.c:1101:ldlm_resource_complain()) MGC10.141.1.41@o2ib4: namespace resource [0x346174616467:0x2:0x0].0x0 (ffff9f745c79e3c0) refcount nonzero (2) after lock cleanup; forcing cleanup.
            Jul 26 11:48:36 g4-mds01 kernel: LustreError: 219839:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x346174616467:0x2:0x0].0x0 (ffff9f745c79e3c0) refcount = 3
            Jul 26 11:48:36 g4-mds01 kernel: LustreError: 219839:0:(ldlm_resource.c:1704:ldlm_resource_dump()) Waiting locks:
            Jul 26 11:48:36 g4-mds01 kernel: LustreError: 219839:0:(ldlm_resource.c:1706:ldlm_resource_dump()) ### ### ns: MGC10.141.1.41@o2ib4 lock: ffff9f7c79309c00/0xa65add00b40c1243 lrc: 4/1,0 mode: --/CR res: [0x346174616467:0x2:0x0].0x0 rrc: 4 type: PLN flags: 0x1106400000000 nid: local remote: 0xa65add00b40c124a expref: -99 pid: 208265 timeout: 0 lvb_type: 0

            The workaround for the above issue is to umount/remount the MGS. Currently, unmounting the MGS leads to a livelocked machine.
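
            A hedged sketch of that workaround as shell commands; the MGS mount point and backing device below are placeholders, not values from this system:

            umount /mnt/mgs                            # stop the MGS target (the step where the livelock/crash occurs)
            mount -t lustre /dev/mgs_device /mnt/mgs   # remount the MGS on its backing device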

            emoly.liu Emoly Liu added a comment -

            haaknci,

            Could you please provide the Lustre version you're using this time, and any information about the testing environment and the operations needed to hit this issue?

            Thanks.


            haaknci Malcolm Haak (Inactive) added a comment -

            Actually, it's not. We're having this issue again.


            javed javed shaikh (Inactive) added a comment -

            Just for info, NCI is no longer following this ticket. OK to close from our perspective.


            lixi Li Xi (Inactive) added a comment -

            I got a similar issue when running test replay-dual:26 on Lustre-2.10.4.

            [ 4853.863850] Lustre: DEBUG MARKER: == replay-dual test 26: dbench and tar with mds failover ============================================= 19:10:39 (1527073839)
            [ 4857.258739] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) lustre-MDT0000: opcode 2: before 509 < left 638, rollback = 2
            [ 4857.265541] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) create: 3/24/0, destroy: 0/0/0
            [ 4857.270705] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 9/638/0
            [ 4857.276600] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) write: 5/136/0, punch: 0/0/0, quota 3/3/0
            [ 4857.285310] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) insert: 11/199/3, delete: 0/0/0
            [ 4857.290653] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) ref_add: 5/5/0, ref_del: 0/0/0
            [ 4858.049927] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) lustre-MDT0000: opcode 2: before 509 < left 638, rollback = 2
            [ 4858.056424] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) Skipped 7 previous similar messages
            [ 4858.061225] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) create: 3/24/0, destroy: 0/0/0
            [ 4858.066306] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 7 previous similar messages
            [ 4858.071319] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 9/638/0
            [ 4858.077706] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 7 previous similar messages
            [ 4858.083901] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) write: 5/136/0, punch: 0/0/0, quota 3/3/0
            [ 4858.090166] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 7 previous similar messages
            [ 4858.096507] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) insert: 11/199/3, delete: 0/0/0
            [ 4858.108592] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 7 previous similar messages
            [ 4858.114122] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) ref_add: 5/5/0, ref_del: 0/0/0
            [ 4858.120097] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 7 previous similar messages
            [ 4859.276388] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) lustre-MDT0000: opcode 2: before 504 < left 638, rollback = 2
            [ 4859.282904] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) Skipped 87 previous similar messages
            [ 4859.297126] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) create: 3/24/4, destroy: 0/0/0
            [ 4859.303500] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 87 previous similar messages
            [ 4859.308834] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 9/638/0
            [ 4859.314310] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 87 previous similar messages
            [ 4859.319758] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) write: 5/136/0, punch: 0/0/0, quota 3/3/0
            [ 4859.322635] Lustre: DEBUG MARKER: sync; sync; sync
            [ 4859.332972] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 87 previous similar messages
            [ 4859.338367] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) insert: 11/199/4, delete: 0/0/0
            [ 4859.343583] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 87 previous similar messages
            [ 4859.349626] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) ref_add: 5/5/0, ref_del: 0/0/0
            [ 4859.357229] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 87 previous similar messages
            [ 4861.591545] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) lustre-MDT0000: opcode 2: before 502 < left 638, rollback = 2
            [ 4861.598048] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) Skipped 55 previous similar messages
            [ 4861.603329] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) create: 3/24/5, destroy: 0/0/0
            [ 4861.608250] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 55 previous similar messages
            [ 4861.613206] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 9/638/0
            [ 4861.618462] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 55 previous similar messages
            [ 4861.623794] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) write: 5/136/0, punch: 0/0/0, quota 3/3/0
            [ 4861.629265] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 55 previous similar messages
            [ 4861.634206] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) insert: 11/199/5, delete: 0/0/0
            [ 4861.638943] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 55 previous similar messages
            [ 4861.643917] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) ref_add: 5/5/0, ref_del: 0/0/0
            [ 4861.648812] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 55 previous similar messages
            [ 4862.383599] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
            [ 4863.153800] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
            [ 4864.185346] Turning device vdb (0xfd00010) read-only
            [ 4864.654871] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
            [ 4865.082331] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
            [ 4865.611961] Lustre: 6652:0:(osd_internal.h:1139:osd_trans_exec_op()) lustre-MDT0000: opcode 2: before 505 < left 638, rollback = 2
            [ 4865.618851] Lustre: 6652:0:(osd_internal.h:1139:osd_trans_exec_op()) Skipped 31 previous similar messages
            [ 4865.624480] Lustre: 6652:0:(osd_handler.c:1673:osd_trans_dump_creds()) create: 3/24/3, destroy: 0/0/0
            [ 4865.629503] Lustre: 6652:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 31 previous similar messages
            [ 4865.634719] Lustre: 6652:0:(osd_handler.c:1680:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 9/638/0
            [ 4865.639975] Lustre: 6652:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 31 previous similar messages
            [ 4865.644902] Lustre: 6652:0:(osd_handler.c:1690:osd_trans_dump_creds()) write: 5/136/0, punch: 0/0/0, quota 3/3/0
            [ 4865.650149] Lustre: 6652:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 31 previous similar messages
            [ 4865.655072] Lustre: 6652:0:(osd_handler.c:1697:osd_trans_dump_creds()) insert: 11/199/4, delete: 0/0/0
            [ 4865.660397] Lustre: 6652:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 31 previous similar messages
            [ 4865.665815] Lustre: 6652:0:(osd_handler.c:1704:osd_trans_dump_creds()) ref_add: 5/5/0, ref_del: 0/0/0
            [ 4865.671025] Lustre: 6652:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 31 previous similar messages
            [ 4867.620546] Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_26 fail mds1 1 times
            [ 4868.162595] Lustre: DEBUG MARKER: test_26 fail mds1 1 times
            [ 4868.587024] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
            [ 4869.337772] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
            [ 4873.832032] Lustre: 3863:0:(osd_internal.h:1139:osd_trans_exec_op()) lustre-MDT0000: opcode 2: before 508 < left 638, rollback = 2
            [ 4873.839128] Lustre: 3863:0:(osd_internal.h:1139:osd_trans_exec_op()) Skipped 151 previous similar messages
            [ 4873.844748] Lustre: 3863:0:(osd_handler.c:1673:osd_trans_dump_creds()) create: 3/24/1, destroy: 0/0/0
            [ 4873.852092] Lustre: 3863:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 151 previous similar messages
            [ 4873.857492] Lustre: 3863:0:(osd_handler.c:1680:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 9/638/0
            [ 4873.880971] Lustre: 3863:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 151 previous similar messages
            [ 4873.888084] Lustre: 3863:0:(osd_handler.c:1690:osd_trans_dump_creds()) write: 5/136/0, punch: 0/0/0, quota 3/3/0
            [ 4873.893552] Lustre: 3863:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 151 previous similar messages
            [ 4873.898667] Lustre: 3863:0:(osd_handler.c:1697:osd_trans_dump_creds()) insert: 11/199/3, delete: 0/0/0
            [ 4873.903637] Lustre: 3863:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 151 previous similar messages
            [ 4873.908995] Lustre: 3863:0:(osd_handler.c:1704:osd_trans_dump_creds()) ref_add: 5/5/0, ref_del: 0/0/0
            [ 4873.914058] Lustre: 3863:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 151 previous similar messages
            [ 4900.065841] Lustre: 4511:0:(osd_internal.h:1139:osd_trans_exec_op()) lustre-MDT0000: opcode 2: before 504 < left 638, rollback = 2
            [ 4900.072969] Lustre: 4511:0:(osd_internal.h:1139:osd_trans_exec_op()) Skipped 279 previous similar messages
            [ 4900.088107] Lustre: 4511:0:(osd_handler.c:1673:osd_trans_dump_creds()) create: 3/24/4, destroy: 0/0/0
            [ 4900.094712] Lustre: 4511:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 279 previous similar messages
            [ 4900.104952] Lustre: 4511:0:(osd_handler.c:1680:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 9/638/0
            [ 4900.124421] Lustre: 4511:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 279 previous similar messages
            [ 4900.138081] Lustre: 4511:0:(osd_handler.c:1690:osd_trans_dump_creds()) write: 5/136/0, punch: 0/0/0, quota 3/3/0
            [ 4900.145762] Lustre: 4511:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 279 previous similar messages
            [ 4900.151332] Lustre: 4511:0:(osd_handler.c:1697:osd_trans_dump_creds()) insert: 11/199/4, delete: 0/0/0
            [ 4900.156757] Lustre: 4511:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 279 previous similar messages
            [ 4900.163104] Lustre: 4511:0:(osd_handler.c:1704:osd_trans_dump_creds()) ref_add: 5/5/0, ref_del: 0/0/0
            [ 4900.168387] Lustre: 4511:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 279 previous similar messages
            [ 4935.827897] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) lustre-MDT0000: opcode 2: before 501 < left 638, rollback = 2
            [ 4935.837586] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) Skipped 239 previous similar messages
            [ 4935.845497] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) create: 3/24/6, destroy: 0/0/0
            [ 4935.852328] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 239 previous similar messages
            [ 4935.859506] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 9/638/0
            [ 4935.866531] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 239 previous similar messages
            [ 4935.872820] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) write: 5/136/0, punch: 0/0/0, quota 3/3/0
            [ 4935.879939] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 239 previous similar messages
            [ 4935.886349] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) insert: 11/199/5, delete: 0/0/0
            [ 4935.892222] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 239 previous similar messages
            [ 4935.898171] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) ref_add: 5/5/0, ref_del: 0/0/0
            [ 4935.903051] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 239 previous similar messages
            [ 5001.767877] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) lustre-MDT0000: opcode 2: before 508 < left 638, rollback = 2
            [ 5001.774742] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) Skipped 695 previous similar messages
            [ 5001.780504] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) create: 3/24/2, destroy: 0/0/0
            [ 5001.785460] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 695 previous similar messages
            [ 5001.790992] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 9/638/0
            [ 5001.796465] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 695 previous similar messages
            [ 5001.802010] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) write: 5/136/0, punch: 0/0/0, quota 3/3/0
            [ 5001.808416] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 695 previous similar messages
            [ 5001.814112] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) insert: 11/199/2, delete: 0/0/0
            [ 5001.819500] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 695 previous similar messages
            [ 5001.825194] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) ref_add: 5/5/0, ref_del: 0/0/0
            [ 5001.830458] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 695 previous similar messages
            [ 5043.531844] INFO: task umount:7051 blocked for more than 120 seconds.
            [ 5043.559145] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [ 5043.571856] umount D ffff965a5b45eeb0 0 7051 7050 0x00000000
            [ 5043.588676] Call Trace:
            [ 5043.590253] [<ffffffff82f13e99>] schedule_preempt_disabled+0x29/0x70
            [ 5043.594280] [<ffffffff82f11c57>] __mutex_lock_slowpath+0xc7/0x1d0
            [ 5043.598409] [<ffffffff82f1103f>] mutex_lock+0x1f/0x2f
            [ 5043.601442] [<ffffffffc07b93a7>] mgc_process_config+0x207/0x13f0 [mgc]
            [ 5043.605151] [<ffffffffc0829d78>] obd_process_config.constprop.13+0x88/0x2e0 [obdclass]
            [ 5043.612032] [<ffffffffc04bdcc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            [ 5043.621123] [<ffffffffc0816cf9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            [ 5043.627421] [<ffffffffc082b0af>] lustre_end_log+0x1ff/0x550 [obdclass]
            [ 5043.630929] [<ffffffffc0853d1e>] server_put_super+0x7de/0xcd0 [obdclass]
            [ 5043.634737] [<ffffffff82a1d69d>] generic_shutdown_super+0x6d/0x100
            [ 5043.638215] [<ffffffff82a1da82>] kill_anon_super+0x12/0x20
            [ 5043.653558] [<ffffffffc0829402>] lustre_kill_super+0x32/0x50 [obdclass]
            [ 5043.657368] [<ffffffff82a1de3e>] deactivate_locked_super+0x4e/0x70
            [ 5043.660755] [<ffffffff82a1e5c6>] deactivate_super+0x46/0x60
            [ 5043.663775] [<ffffffff82a3c57f>] cleanup_mnt+0x3f/0x80
            [ 5043.666571] [<ffffffff82a3c612>] __cleanup_mnt+0x12/0x20
            [ 5043.679438] [<ffffffff828b784b>] task_work_run+0xbb/0xe0
            [ 5043.682215] [<ffffffff8282ac55>] do_notify_resume+0xa5/0xc0
            [ 5043.685702] [<ffffffff82f1fada>] int_signal+0x12/0x17
            [ 5130.557643] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) lustre-MDT0000: opcode 2: before 502 < left 638, rollback = 2
            [ 5130.566402] Lustre: 3864:0:(osd_internal.h:1139:osd_trans_exec_op()) Skipped 1463 previous similar messages
            [ 5130.572039] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) create: 3/24/5, destroy: 0/0/0
            [ 5130.577152] Lustre: 3864:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 1463 previous similar messages
            [ 5130.582341] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 9/638/0
            [ 5130.587512] Lustre: 3864:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 1463 previous similar messages
            [ 5130.592725] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) write: 5/136/0, punch: 0/0/0, quota 3/3/0
            [ 5130.600231] Lustre: 3864:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 1463 previous similar messages
            [ 5130.605633] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) insert: 11/199/5, delete: 0/0/0
            [ 5130.610566] Lustre: 3864:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 1463 previous similar messages
            [ 5130.615660] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) ref_add: 5/5/0, ref_del: 0/0/0
            [ 5130.620572] Lustre: 3864:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 1463 previous similar messages
            [ 5136.751418] LustreError: 3878:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1527073823, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff965a5c79c200/0xc539b8615eb9c76 lrc: 3/0,1 mode: --/EX res: [0x65727473756c:0x2:0x0].0x0 rrc: 14 type: PLN flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 3878 timeout: 0 lvb_type: 0
            [ 5136.786410] LustreError: dumping log to /tmp/lustre-log.1527074123.3878
            [ 5143.791753] Lustre: MGS: Received new LWP connection from 10.0.2.38@tcp, removing former export from same NID
            [ 5143.797868] Lustre: MGS: Connection restored to bc8f32af-75b1-d99f-ee7c-157fd1d1987f (at 10.0.2.38@tcp)
            [ 5143.803193] Lustre: Skipped 113 previous similar messages
            [ 5151.286402] LustreError: 166-1: MGC10.0.2.48@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
            [ 5151.293159] LustreError: 3862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1527073837, 300s ago), entering recovery for MGS@10.0.2.48@tcp ns: MGC10.0.2.48@tcp lock: ffff965a5d519a00/0xc539b8615eb9d25 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xc539b8615eb9d2c expref: -99 pid: 3862 timeout: 0 lvb_type: 0
            [ 5151.293512] Lustre: MGS: Received new LWP connection from 0@lo, removing former export from same NID
            [ 5151.293514] Lustre: Skipped 2 previous similar messages
            [ 5151.294777] LustreError: 9613:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.48@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff965a7ac16840) refcount nonzero (2) after lock cleanup; forcing cleanup.
            [ 5151.294782] LustreError: 9613:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x65727473756c:0x2:0x0].0x0 (ffff965a7ac16840) refcount = 3
            [ 5151.294785] LustreError: 9613:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
            [ 5151.294793] LustreError: 9613:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.48@tcp lock: ffff965a5d519a00/0xc539b8615eb9d25 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 4 type: PLN flags: 0x1106400000000 nid: local remote: 0xc539b8615eb9d2c expref: -99 pid: 3862 timeout: 0 lvb_type: 0
            [ 5151.358516] Lustre: Failing over lustre-MDT0000
            [ 5151.360959] Lustre: Skipped 8 previous similar messages
            [ 5151.410257] Lustre: lustre-MDT0000: Not available for connect from 10.0.2.51@tcp (stopping)
            [ 5151.416502] Lustre: Skipped 1 previous similar message
            [ 5153.740590] LustreError: 726:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff965a5ffdc000 x1601249971853200/t0(0) o13->lustre-OST0001-osc-MDT0000@10.0.2.32@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
            [ 5153.759784] LustreError: 726:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 1 previous similar message
            [ 5154.332546] LustreError: 727:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff965a5ffdc4c0 x1601249971853216/t0(0) o13->lustre-OST0000-osc-MDT0000@10.0.2.39@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
            [ 5157.129226] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
            [ 5157.134863] IP: [<ffffffffc11676ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
            [ 5157.139884] PGD 0
            [ 5157.141267] Oops: 0000 [#1] SMP
            [ 5157.143341] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) mbcache jbd2 zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev pcspkr sg joydev i2c_piix4 virtio_balloon parport_pc i2c_core parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi virtio_scsi virtio_blk virtio_net ata_piix serio_raw virtio_pci libata virtio_ring virtio floppy [last unloaded: libcfs]
            [ 5157.212014] CPU: 0 PID: 9614 Comm: ldlm_bl_03 Kdump: loaded Tainted: P OE ------------ 3.10.0-862.2.3.el7_lustre.2.10.4_33_gd942416.ddn0.x86_64 #1
            [ 5157.226568] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [ 5157.244021] task: ffff965a79d25ee0 ti: ffff965a5d4f4000 task.ti: ffff965a5d4f4000
            [ 5157.248225] RIP: 0010:[<ffffffffc11676ce>] [<ffffffffc11676ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
            [ 5157.253914] RSP: 0018:ffff965a5d4f7be0 EFLAGS: 00010287
            [ 5157.256926] RAX: 0000000000000000 RBX: ffff965a5c79d000 RCX: 0000000000000000
            [ 5157.260710] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff965a7a78e200
            [ 5157.264452] RBP: ffff965a5d4f7c58 R08: 0000000000000001 R09: 0000000000000000
            [ 5157.268412] R10: ffff965a7cc5a628 R11: 000000000000000f R12: ffff965a5888de80
            [ 5157.272403] R13: 0000000000000002 R14: ffff965a5d4f7cd0 R15: ffff965a5c79d060
            [ 5157.286857] FS: 0000000000000000(0000) GS:ffff965a7fc00000(0000) knlGS:0000000000000000
            [ 5157.291044] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
            [ 5157.294048] CR2: 000000000000001c CR3: 000000000a20e000 CR4: 00000000000006f0
            [ 5157.299570] Call Trace:
            [ 5157.301199] [<ffffffffc04bdcc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            [ 5157.305503] [<ffffffffc1167660>] ? ldlm_errno2error+0x60/0x60 [ptlrpc]
            [ 5157.309668] [<ffffffffc11529ed>] ldlm_reprocess_queue+0x13d/0x2a0 [ptlrpc]
            [ 5157.313748] [<ffffffffc115358d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
            [ 5157.336428] [<ffffffffc1153b46>] ldlm_reprocess_res+0x26/0x30 [ptlrpc]
            [ 5157.339825] [<ffffffffc04c3d70>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
            [ 5157.343689] [<ffffffffc1153b20>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
            [ 5157.347679] [<ffffffffc1153b20>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
            [ 5157.351716] [<ffffffffc04c7105>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
            [ 5157.355477] [<ffffffffc1153b8c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
            [ 5157.359889] [<ffffffffc115483c>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
            [ 5157.363807] [<ffffffffc117dda8>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
            [ 5157.367533] [<ffffffff828cee80>] ? wake_up_state+0x20/0x20
            [ 5157.370635] [<ffffffffc117d8e0>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
            [ 5157.375198] [<ffffffff828bae31>] kthread+0xd1/0xe0
            [ 5157.378054] [<ffffffff828bad60>] ? insert_kthread_work+0x40/0x40
            [ 5157.381485] [<ffffffff82f1f5f7>] ret_from_fork_nospec_begin+0x21/0x21
            [ 5157.385004] [<ffffffff828bad60>] ? insert_kthread_work+0x40/0x40
            [ 5157.388502] Code: 89 45 a0 74 0d f6 05 33 5f 37 ff 01 0f 85 34 06 00 00 8b 83 98 00 00 00 39 83 9c 00 00 00 89 45 b8 0f 84 57 09 00 00 48 8b 45 a0 <8b> 40 1c 85 c0 0f 84 7a 09 00 00 48 8b 4d a0 48 89 c8 48 83 c0
            [ 5157.400499] RIP [<ffffffffc11676ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
            [ 5157.404838] RSP <ffff965a5d4f7be0>
            [ 5157.410691] CR2: 000000000000001c


            Hi Emoly,

            We're using ldiskfs, not ZFS. The crash occurs almost every time I stop the MGS, usually after at least a few days of production. But because this is a production system, my actions are limited and I can't test much right now. We're almost constantly redeploying clients for rolling updates, so the MGS has to stay available. Still, I'll see what I can do. Also, what kind of information would help you (collected before unmounting, for example)? Perhaps the output of

            /proc/fs/lustre/ldlm/namespaces/MGS/pool/stats as mentioned by Malcolm?
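
            In case it helps, here is roughly what I could grab on the MGS right before unmounting (just a sketch on my side; the pool/stats path is the one Malcolm mentioned, the other parameters are from memory and may need adjusting):

            # snapshot of the MGS ldlm state before stopping the MGS
            cat /proc/fs/lustre/ldlm/namespaces/MGS/pool/stats
            cat /proc/fs/lustre/ldlm/namespaces/MGS/lock_count
            lctl get_param ldlm.namespaces.MGS.*          # same information via lctl
            dmesg -T > /tmp/mgs-dmesg-before-umount.txt   # kernel log as a baseline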

             

            Also, as I said, I often see these 'lock timed out' errors on the clients (see my previous comment), which are related to the MGS.

            Thanks!!

            Stephane

            emoly.liu Emoly Liu added a comment - - edited

            sthiell,

            I'm checking the patch from LU-8287 that Malcolm mentioned in LU-10635. Are you using ZFS? Is this issue reproducible for you? And what operations did you do before you hit this issue?
            Any help reproducing this issue would be appreciated.

            Thanks,
            Emoly


            Hello,

            This issue just happened in 2.10.3 when we tried to unmount the MGS. In this case, the server didn't crash, but we saw a lot of soft lockup/CPU-stuck stack traces on the MGS, like this one:

            [83347.066299] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [ldlm_bl_11:16013]
            [83347.066322] Modules linked in: mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) vfat fat uas usb_storage mpt2sas mptctl mptbase rpcsec_gss_krb5 dell_rbu nfsv4 dns_resolver nfs fscache ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi dm_service_time kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt dcdbas iTCO_vendor_support ipmi_si ipmi_devintf mxm_wmi dm_round_robin pcspkr sg ipmi_msghandler acpi_power_meter wmi mei_me mei shpchp lpc_ich nfsd auth_rpcgss dm_multipath dm_mod nfs_acl lockd grace sunrpc ip_tables ext4 mbcache
            [83347.066330]  jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 ahci crct10dif_pclmul crct10dif_common mlx4_core mpt3sas drm libahci crc32c_intel ptp raid_class libata megaraid_sas devlink i2c_core scsi_transport_sas pps_core
            [83347.066331] CPU: 5 PID: 16013 Comm: ldlm_bl_11 Tainted: G           OEL ------------   3.10.0-693.2.2.el7_lustre.pl2.x86_64 #1
            [83347.066332] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.6.0 10/26/2017
            [83347.066333] task: ffff88203ad88000 ti: ffff88201c804000 task.ti: ffff88201c804000
            [83347.066335] RIP: 0010:[<ffffffff810fa326>]  [<ffffffff810fa326>] native_queued_spin_lock_slowpath+0x116/0x1e0
            [83347.066335] RSP: 0018:ffff88201c807b70  EFLAGS: 00000246
            [83347.066336] RAX: 0000000000000000 RBX: 000000013c796cc0 RCX: 0000000000290000
            [83347.066337] RDX: ffff88103d017880 RSI: 0000000000810001 RDI: ffff88102137981c
            [83347.066337] RBP: ffff88201c807b70 R08: ffff88203c697880 R09: 0000000000000000
            [83347.066338] R10: ffff88202b216a00 R11: 0000000000000000 R12: ffff88201c807c58
            [83347.066338] R13: 0000000000000001 R14: ffff88201c807b28 R15: ffffffff81322c35
            [83347.066339] FS:  0000000000000000(0000) GS:ffff88203c680000(0000) knlGS:0000000000000000
            [83347.066340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [83347.066340] CR2: 00007fd7ddcd8090 CR3: 00000000019f2000 CR4: 00000000001407e0
            [83347.066341] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [83347.066341] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            [83347.066341] Stack:
            [83347.066343]  ffff88201c807b80 ffffffff8169e61f ffff88201c807b90 ffffffff816abb70
            [83347.066344]  ffff88201c807bd0 ffffffffc0b72198 0000000000000000 ffff88202b216a00
            [83347.066345]  ffff88201c807c18 ffff88202b216a60 ffff88202b216a00 ffff8810175b8000
            [83347.066345] Call Trace:
            [83347.066347]  [<ffffffff8169e61f>] queued_spin_lock_slowpath+0xb/0xf
            [83347.066349]  [<ffffffff816abb70>] _raw_spin_lock+0x20/0x30
            [83347.066367]  [<ffffffffc0b72198>] ldlm_handle_conflict_lock+0xd8/0x330 [ptlrpc]
            [83347.066388]  [<ffffffffc0b86755>] ldlm_process_plain_lock+0x435/0xb30 [ptlrpc]
            [83347.066407]  [<ffffffffc0b86320>] ? ldlm_errno2error+0x60/0x60 [ptlrpc]
            [83347.066425]  [<ffffffffc0b7199b>] ldlm_reprocess_queue+0x13b/0x2a0 [ptlrpc]
            [83347.066443]  [<ffffffffc0b7253d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
            [83347.066460]  [<ffffffffc0b72af0>] ldlm_reprocess_res+0x20/0x30 [ptlrpc]
            [83347.066466]  [<ffffffffc0847bef>] cfs_hash_for_each_relax+0x21f/0x400 [libcfs]
            [83347.066483]  [<ffffffffc0b72ad0>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
            [83347.066501]  [<ffffffffc0b72ad0>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
            [83347.066506]  [<ffffffffc084ad95>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
            [83347.066526]  [<ffffffffc0b72b3c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
            [83347.066544]  [<ffffffffc0b737cc>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
            [83347.066566]  [<ffffffffc0b9c9a8>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
            [83347.066567]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
            [83347.066588]  [<ffffffffc0b9c4e0>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
            [83347.066589]  [<ffffffff810b098f>] kthread+0xcf/0xe0
            [83347.066590]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
            [83347.066592]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
            [83347.066593]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
            

            After a few minutes, I decided to take a crash dump. It is available upon request. I'm attaching the associated vmcore-dmesg.txt as vmcore-dmesg-oak-md1-s1-2018-03-16-06-47-23.txt.
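
            (For reference, when kdump is loaded, a vmcore can be forced by hand roughly like this — a sketch, not necessarily the exact commands I used:)

            # make sure magic SysRq is enabled, then trigger a crash so kdump writes the vmcore
            echo 1 > /proc/sys/kernel/sysrq
            echo c > /proc/sysrq-trigger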

            Prior to unmounting the MGS, I also noticed tons of 'lock timed out' errors on the clients for the MGS (at 10.0.2.51@o2ib5):

            [716108.759689] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1521202950, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2ib5 lock: ffff8803e8b51800/0x509cd0f71fde08cc lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x2e0a3ddb63e79ade expref: -99 pid: 1392 timeout: 0 lvb_type: 0
            [716108.760126] LustreError: 5988:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff880402be2600) refcount nonzero (2) after lock cleanup; forcing cleanup.
            [716108.760128] LustreError: 5988:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message
            [716108.760130] LustreError: 5988:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff880402be2600) refcount = 3
            [716108.760132] LustreError: 5988:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
            [716108.760137] LustreError: 5988:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.51@o2ib5 lock: ffff8803e8b51800/0x509cd0f71fde08cc lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 4 type: PLN flags: 0x1106400000000 nid: local remote: 0x2e0a3ddb63e79ade expref: -99 pid: 1392 timeout: 0 lvb_type: 0
            [716108.760138] LustreError: 5988:0:(ldlm_resource.c:1705:ldlm_resource_dump()) Skipped 1 previous similar message
            [716108.778087] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
            [716415.536505] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
            [716415.539195] LustreError: Skipped 1 previous similar message
            [716415.540823] LustreError: 5999:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8800a2fb46c0) refcount = 2
            [716415.543552] LustreError: 5999:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
            [716415.544914] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
            [716415.546536] Lustre: Skipped 1 previous similar message
            [716723.374375] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1521203565, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2ib5 lock: ffff8802383c6c00/0x509cd0f71fde2a67 lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x2e0a3ddb63e8a325 expref: -99 pid: 1392 timeout: 0 lvb_type: 0
            [716723.380318] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
            [716723.381110] LustreError: 6003:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8803aafc4900) refcount nonzero (2) after lock cleanup; forcing cleanup.
            [716723.381112] LustreError: 6003:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message
            [716723.381115] LustreError: 6003:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8803aafc4900) refcount = 3
            [716723.381117] LustreError: 6003:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
            [716723.381123] LustreError: 6003:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.51@o2ib5 lock: ffff8802383c6c00/0x509cd0f71fde2a67 lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 4 type: PLN flags: 0x1106400000000 nid: local remote: 0x2e0a3ddb63e8a325 expref: -99 pid: 1392 timeout: 0 lvb_type: 0
            [716723.381124] LustreError: 6003:0:(ldlm_resource.c:1705:ldlm_resource_dump()) Skipped 1 previous similar message
            [717033.536366] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
            [717033.539206] LustreError: Skipped 1 previous similar message
            [717033.541034] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
            [717033.542492] Lustre: Skipped 1 previous similar message
            [717340.609290] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1521204182, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2ib5 lock: ffff8803fd41c000/0x509cd0f71fde4c02 lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x2e0a3ddb63e9adcd expref: -99 pid: 1392 timeout: 0 lvb_type: 0
            [717340.615144] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
            [717340.617285] LustreError: 6017:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff88039bd87540) refcount nonzero (1) after lock cleanup; forcing cleanup.
            [717340.619996] LustreError: 6017:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff88039bd87540) refcount = 2
            [717340.621862] LustreError: 6017:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
            [717340.623052] LustreError: 6017:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: ?? lock: ffff8803fd41c000/0x509cd0f71fde4c02 lrc: 4/1,0 mode: --/CR res: ?? rrc=?? type: ??? flags: 0x1106400000000 nid: local remote: 0x2e0a3ddb63e9adcd expref: -99 pid: 1392 timeout: 0 lvb_type: 0
            [717648.431191] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
            [717648.433742] LustreError: Skipped 1 previous similar message
            [717648.434877] LustreError: 6028:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8803be8ce000) refcount = 2
            [717648.436695] LustreError: 6028:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
            [717648.438034] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
            [717648.439395] Lustre: Skipped 1 previous similar message

            Thanks!
            Stephane


            mhaakddn Malcolm Haak - NCI (Inactive) added a comment - - edited

            My testing was limited. I installed 2.10.1, 2.10.2 and 2.10.3. None of them had sane output in /proc/fs/lustre/ldlm/namespaces/MGS/pool/stats.

            At this point I decided something was horridly wrong, since nobody had answered my questions in LU-10635.

            I'm going to start with 2.9.1 and see if it's also whacky in 2.9, and then once I figure out which point release it all went bad in, I'll start doing a git bisect (roughly as sketched below).
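
            Something along these lines against the lustre-release tree (repo URL and tag names below are placeholders, not the exact ones I'll use):

            git clone <lustre-release-repo-url> lustre-release && cd lustre-release
            git bisect start
            git bisect bad  <first-bad-release-tag>     # e.g. the 2.10.x tag that misbehaves
            git bisect good <last-good-release-tag>     # e.g. a 2.9.x tag that behaves
            # build/install each candidate, retest the MGS stop and the pool/stats output, then mark it:
            git bisect good    # or: git bisect bad
            git bisect reset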

            This has all come to a head for NCI because this non-granting or half-granting of locks is starting to really cause issues with clients.

             

            LDLM is broken. It looks like it's still broken in 2.10.3, despite your lack of shutdown crashes post-upgrade.

            I couldn't get my test environment to crash, for the record. But it's tiny, so I don't think it's a valid test.

            I do think having a look at the crash dumps will be enlightening.

            EDIT: The other issue is related to LU-9704.


            People

              emoly.liu Emoly Liu
              sthiell Stephane Thiell
              Votes:
              1 Vote for this issue
              Watchers:
              14
