
[LU-11734] LNet crashes with 2.12.0-rc1: lnet_attach_rsp_tracker() ASSERTION(md->md_rspt_ptr == ((void *)0)) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: lustre 2.12.0-rc1
    • Severity: 3

    Description

      In my testing I see at bring up:

      2018-12-05T14:24:14.274614-05:00 ninja84.ccs.ornl.gov kernel: Lustre: Lustre: Build Version: 2.12.0_RC1_dirty
      2018-12-05T14:24:14.671611-05:00 ninja84.ccs.ornl.gov kernel: LNetError: 3759:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) ASSERTION( md->md_rspt_ptr == ((void *)0) ) failed:
      2018-12-05T14:24:14.671682-05:00 ninja84.ccs.ornl.gov kernel: LNetError: 3759:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) LBUG
      2018-12-05T14:24:14.671716-05:00 ninja84.ccs.ornl.gov kernel: Pid: 3759, comm: monitor_thread 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018
      2018-12-05T14:24:14.681179-05:00 ninja84.ccs.ornl.gov kernel: Call Trace:
      2018-12-05T14:24:14.685051-05:00 ninja84.ccs.ornl.gov kernel: Lustre: Echo OBD driver; http://www.lustre.org/
      2018-12-05T14:24:14.695841-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc0c267cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      2018-12-05T14:24:14.697263-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc0c2687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      2018-12-05T14:24:14.703501-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc102c257>] lnet_attach_rsp_tracker.isra.29+0xd7/0xe0 [lnet]
      2018-12-05T14:24:14.711154-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc1033ba1>] LNetGet+0x5d1/0xa90 [lnet]
      2018-12-05T14:24:14.716870-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc103c0b9>] lnet_check_routers+0x399/0xbf0 [lnet]
      2018-12-05T14:24:14.723556-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc10354bf>] lnet_monitor_thread+0x4ff/0x560 [lnet]
      2018-12-05T14:24:14.730325-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffff8dabb161>] kthread+0xd1/0xe0
      2018-12-05T14:24:14.735265-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffff8e120677>] ret_from_fork_nospec_end+0x0/0x39
      2018-12-05T14:24:14.741596-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffffffffff>] 0xffffffffffffffff
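
      For readers who haven't met the response-tracker (LNet health) code that is new in 2.12: below is a self-contained sketch of what this assertion enforces, using stand-in types and names rather than the actual lib-move.c source. A response tracker may be attached to an MD only while md_rspt_ptr is NULL, so reusing one MD for a second GET before the first tracker is detached trips the LASSERT and LBUGs the node; this is the "multi-md usage" case the eventual fix addresses.

      /* Stand-in sketch, not Lustre source: the invariant behind
       * lib-move.c:4429.  Compile with: cc -o rspt rspt.c */
      #include <assert.h>
      #include <stddef.h>
      #include <stdio.h>

      struct lnet_rsp_tracker { long rspt_deadline; };

      struct lnet_libmd {
              struct lnet_rsp_tracker *md_rspt_ptr;   /* NULL while untracked */
      };

      static void attach_rsp_tracker(struct lnet_libmd *md,
                                     struct lnet_rsp_tracker *rspt)
      {
              /* analogue of the LASSERT that fires in 2.12.0-rc1 */
              assert(md->md_rspt_ptr == NULL);
              md->md_rspt_ptr = rspt;
      }

      int main(void)
      {
              struct lnet_libmd md = { .md_rspt_ptr = NULL };
              struct lnet_rsp_tracker t1 = { 0 }, t2 = { 0 };

              attach_rsp_tracker(&md, &t1);   /* first GET on this MD: fine */
              attach_rsp_tracker(&md, &t2);   /* MD reused, tracker still set: LBUG */
              printf("unreached\n");
              return 0;
      }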


    Activity


            sthiell Stephane Thiell added a comment -

            Sure, will do. I confirm this specific problem is fixed in RC2.
            pjones Peter Jones added a comment -

            Landed for 2.12. sthiell please open a new ticket if there are any problems with RC2


            sthiell Stephane Thiell added a comment -

            Thanks Peter!

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33794/
            Subject: LU-11734 lnet: handle multi-md usage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8c249097e62713baf51aec808489a86acf46748d
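
            The patch subject, "handle multi-md usage", suggests the attach path had to stop assuming a fresh MD per message. A hypothetical sketch of one way such handling could look, with stand-in types and names (the change that actually landed is in the Gerrit link above):

            #include <stdlib.h>

            struct lnet_rsp_tracker { long rspt_deadline; };
            struct lnet_libmd { struct lnet_rsp_tracker *md_rspt_ptr; };

            static void attach_rsp_tracker_multi(struct lnet_libmd *md,
                                                 struct lnet_rsp_tracker *rspt,
                                                 long deadline)
            {
                    if (md->md_rspt_ptr != NULL) {
                            /* MD reused: keep the tracker already attached,
                             * just push its timeout out and drop the new one */
                            md->md_rspt_ptr->rspt_deadline = deadline;
                            free(rspt);
                            return;
                    }
                    rspt->rspt_deadline = deadline;
                    md->md_rspt_ptr = rspt;
            }

            With something along these lines, a second message on the same MD refreshes the existing tracker instead of tripping the LASSERT shown in the Description.
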
            pjones Peter Jones added a comment -

            sthiell just wait for RC2 - which should be available shortly...


            sthiell Stephane Thiell added a comment -

            Perhaps I missed something; do I need to apply the patch from https://review.whamcloud.com/#/c/31313/ too? Please clarify which patches need to be applied on top of 2.12.0_RC1 if you want me to try again. Thanks!

            sthiell Stephane Thiell added a comment -

            Slightly different symptom on MGS/MDS 2.12.0_RC1 + Amir's patch, but definitely unusable:

            fir-md1-s1 login: [  497.904084] LNet: HW NUMA nodes: 4, HW CPU cores: 48, npartitions: 4
            [  497.913010] alg: No test for adler32 (adler32-zlib)
            [  498.713626] Lustre: Lustre: Build Version: 2.12.0_RC1
            [  498.836281] LNet: Using FastReg for registration
            [  498.968174] LNet: Added LNI 10.0.10.51@o2ib7 [8/256/0/180]
            [  499.835260] LNet: 20661:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 498 seconds
            [  500.088147] LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
            [  500.088147] 
            [  542.110957] LDISKFS-fs (dm-1): recovery complete
            [  542.115778] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
            [  542.437699] Lustre: MGS: Connection restored to ebe34e0e-4031-a65f-5f91-0d1e392e8147 (at 0@lo)
            [  543.013105] LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
            [  543.013105] 
            [  543.014468] LDISKFS-fs warning (device dm-4): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
            [  543.014468] 
            [  558.768986] Lustre: MGS: Connection restored to 73368e2a-0b82-5939-f05f-1ad2a79768da (at 10.9.101.59@o2ib4)
            [  561.842373] Lustre: MGS: Connection restored to d9b25a46-8406-5014-40e9-a5748f9ad5c5 (at 10.0.10.105@o2ib7)
            [  561.852127] Lustre: Skipped 1 previous similar message
            [  567.115808] Lustre: MGS: Connection restored to a8053d09-bfcd-3aa9-c515-8645bbb097c5 (at 10.8.0.68@o2ib6)
            [  571.051218] Lustre: MGS: Received new LWP connection from 10.9.101.59@o2ib4, removing former export from same NID
            [  572.706429] Lustre: MGS: Connection restored to 38b4f0c2-db8c-d604-7e00-2c04b790499d (at 10.9.0.63@o2ib4)
            [  572.716088] Lustre: Skipped 6 previous similar messages
            [  573.921768] Lustre: MGS: Received new LWP connection from 10.0.10.105@o2ib7, removing former export from same NID
            [  575.260500] Lustre: MGS: Received new LWP connection from 10.9.0.61@o2ib4, removing former export from same NID
            [  578.088401] Lustre: MGS: Received new LWP connection from 10.9.101.59@o2ib4, removing former export from same NID
            [  580.973963] Lustre: MGS: Connection restored to d9b25a46-8406-5014-40e9-a5748f9ad5c5 (at 10.0.10.105@o2ib7)
            [  580.985864] Lustre: Skipped 6 previous similar messages
            [  582.286596] Lustre: MGS: Received new LWP connection from 10.9.0.61@o2ib4, removing former export from same NID
            [  582.296688] Lustre: Skipped 2 previous similar messages
            [  583.358103] LustreError: 20965:0:(ldlm_lib.c:3248:target_bulk_io()) @@@ timeout on bulk READ after 6+0s  req@ffff8f086f7d0450 x1619041162737216/t0(0) o256->d1fa3f21-b7f3-5a11-e34d-8236124081cd@10.9.0.1@o2ib4:489/0 lens 304/240 e 0 to 0 dl 1544243514 ref 1 fl Interpret:/0/0 rc 0/0
            [  585.015471] LDISKFS-fs (dm-4): file extents enabled, maximum tree depth=5
            [  585.067346] LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
            [  585.315122] LDISKFS-fs (dm-4): recovery complete
            [  585.319941] LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
            [  585.354354] LDISKFS-fs (dm-2): recovery complete
            [  585.358987] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
            [  590.570227] NMI watchdog: Watchdog detected hard LOCKUP on cpu 22
            [  590.576145] Modules linked in: mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_ucm rpcrdma rdma_ucm ib_uverbs ib_iser ib_umad rdma_cm iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm mlx5_ib ib_core mpt2sas mptctl mptbase dell_rbu sunrpc vfat fat dm_round_robin amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_multipath dcdbas ses pcspkr dm_mod ipmi_si enclosure ipmi_devintf i2c_piix4 sg ccp k10temp ipmi_msghandler acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci mlx5_core crct10dif_pclmul crct10dif_common libahci mlxfw devlink drm tg3 crc32c_intel libata ptp megaraid_sas drm_panel_orientation_quirks pps_core mpt3sas(OE) raid_class scsi_transport_sas
            [  590.671805] CPU: 22 PID: 20866 Comm: ll_mgs_0002 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
            [  590.684220] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.3.6 04/20/2018
            [  590.691786] Call Trace:
            [  590.694241]  <NMI>  [<ffffffff87361e41>] dump_stack+0x19/0x1b
            [  590.700025]  [<ffffffff86d494c5>] watchdog_overflow_callback+0x135/0x140
            [  590.706726]  [<ffffffff86da1297>] __perf_event_overflow+0x57/0x100
            [  590.712910]  [<ffffffff86daa904>] perf_event_overflow+0x14/0x20
            [  590.718831]  [<ffffffff86c055f0>] x86_pmu_handle_irq+0x140/0x1a0
            [  590.724838]  [<ffffffff86f77c08>] ? ioremap_page_range+0x2e8/0x480
            [  590.731018]  [<ffffffff86dfb854>] ? vunmap_page_range+0x234/0x470
            [  590.737106]  [<ffffffff86dfbaa1>] ? unmap_kernel_range_noflush+0x11/0x20
            [  590.743811]  [<ffffffff8703d49e>] ? ghes_copy_tofrom_phys+0x10e/0x210
            [  590.750254]  [<ffffffff8703d640>] ? ghes_read_estatus+0xa0/0x190
            [  590.756263]  [<ffffffff8736b031>] perf_event_nmi_handler+0x31/0x50
            [  590.762440]  [<ffffffff8736c8fc>] nmi_handle.isra.0+0x8c/0x150
            [  590.768274]  [<ffffffff8736cb1d>] do_nmi+0x15d/0x460
            [  590.773239]  [<ffffffff8736bd69>] end_repeat_nmi+0x1e/0x81
            [  590.778731]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
            [  590.786127]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
            [  590.793521]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
            [  590.800910]  <EOE>  [<ffffffff8735bfcb>] queued_spin_lock_slowpath+0xb/0xf
            [  590.807812]  [<ffffffff8736a507>] _raw_spin_lock_irqsave+0x37/0x40
            [  590.814001]  [<ffffffffc090abab>] ib_fmr_pool_map_phys+0x4b/0x300 [ib_core]
            [  590.820965]  [<ffffffffc0ec66cc>] kiblnd_fmr_pool_map+0x15c/0x690 [ko2iblnd]
            [  590.828017]  [<ffffffffc0eca965>] kiblnd_map_tx.isra.28+0x155/0x470 [ko2iblnd]
            [  590.835235]  [<ffffffffc0ecc216>] kiblnd_setup_rd_iov+0x2f6/0x400 [ko2iblnd]
            [  590.842282]  [<ffffffffc0ecfa6a>] kiblnd_send+0x5da/0xa20 [ko2iblnd]
            [  590.848645]  [<ffffffffc0c430e4>] lnet_ni_send+0x44/0xd0 [lnet]
            [  590.854569]  [<ffffffffc0c4ad92>] lnet_send+0x82/0x1c0 [lnet]
            [  590.860323]  [<ffffffffc0c4b19c>] LNetPut+0x2cc/0xb50 [lnet]
            [  590.866029]  [<ffffffffc0f5c5d6>] ptl_send_buf+0x146/0x530 [ptlrpc]
            [  590.872302]  [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [  590.879378]  [<ffffffffc0f5f98b>] ptlrpc_send_reply+0x29b/0x840 [ptlrpc]
            [  590.886109]  [<ffffffffc0f1e94e>] target_send_reply_msg+0x8e/0x170 [ptlrpc]
            [  590.893092]  [<ffffffffc0f28e0e>] target_send_reply+0x30e/0x730 [ptlrpc]
            [  590.899823]  [<ffffffffc0f66877>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
            [  590.907607]  [<ffffffffc0fccf07>] tgt_request_handle+0x697/0x1580 [ptlrpc]
            [  590.914514]  [<ffffffffc0fa6a51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [  590.922088]  [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [  590.929161]  [<ffffffffc0f7192b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [  590.936846]  [<ffffffffc0f6e7b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [  590.943633]  [<ffffffff86cd67c2>] ? default_wake_function+0x12/0x20
            [  590.949898]  [<ffffffff86ccba9b>] ? __wake_up_common+0x5b/0x90
            [  590.955757]  [<ffffffffc0f7525c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            [  590.962050]  [<ffffffffc0f74760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [  590.969441]  [<ffffffff86cc1c31>] kthread+0xd1/0xe0
            [  590.974320]  [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40
            [  590.980413]  [<ffffffff87374c24>] ret_from_fork_nospec_begin+0xe/0x21
            [  590.986850]  [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40
            [  590.992945] Kernel panic - not syncing: Hard LOCKUP
            [  590.997823] CPU: 22 PID: 20866 Comm: ll_mgs_0002 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
            [  591.010241] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.3.6 04/20/2018
            [  591.017807] Call Trace:
            [  591.020262]  <NMI>  [<ffffffff87361e41>] dump_stack+0x19/0x1b
            [  591.026035]  [<ffffffff8735b550>] panic+0xe8/0x21f
            [  591.030826]  [<ffffffff87374c24>] ? ret_from_fork_nospec_begin+0xe/0x21
            [  591.037442]  [<ffffffff86c9739f>] nmi_panic+0x3f/0x40
            [  591.042492]  [<ffffffff86d494b1>] watchdog_overflow_callback+0x121/0x140
            [  591.049193]  [<ffffffff86da1297>] __perf_event_overflow+0x57/0x100
            [  591.055370]  [<ffffffff86daa904>] perf_event_overflow+0x14/0x20
            [  591.061292]  [<ffffffff86c055f0>] x86_pmu_handle_irq+0x140/0x1a0
            [  591.067296]  [<ffffffff86f77c08>] ? ioremap_page_range+0x2e8/0x480
            [  591.073477]  [<ffffffff86dfb854>] ? vunmap_page_range+0x234/0x470
            [  591.079570]  [<ffffffff86dfbaa1>] ? unmap_kernel_range_noflush+0x11/0x20
            [  591.086268]  [<ffffffff8703d49e>] ? ghes_copy_tofrom_phys+0x10e/0x210
            [  591.092709]  [<ffffffff8703d640>] ? ghes_read_estatus+0xa0/0x190
            [  591.098714]  [<ffffffff8736b031>] perf_event_nmi_handler+0x31/0x50
            [  591.104893]  [<ffffffff8736c8fc>] nmi_handle.isra.0+0x8c/0x150
            [  591.110726]  [<ffffffff8736cb1d>] do_nmi+0x15d/0x460
            [  591.115694]  [<ffffffff8736bd69>] end_repeat_nmi+0x1e/0x81
            [  591.121181]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
            [  591.128572]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
            [  591.135964]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
            [  591.143355]  <EOE>  [<ffffffff8735bfcb>] queued_spin_lock_slowpath+0xb/0xf
            [  591.150258]  [<ffffffff8736a507>] _raw_spin_lock_irqsave+0x37/0x40
            [  591.156442]  [<ffffffffc090abab>] ib_fmr_pool_map_phys+0x4b/0x300 [ib_core]
            [  591.163407]  [<ffffffffc0ec66cc>] kiblnd_fmr_pool_map+0x15c/0x690 [ko2iblnd]
            [  591.170452]  [<ffffffffc0eca965>] kiblnd_map_tx.isra.28+0x155/0x470 [ko2iblnd]
            [  591.177672]  [<ffffffffc0ecc216>] kiblnd_setup_rd_iov+0x2f6/0x400 [ko2iblnd]
            [  591.184719]  [<ffffffffc0ecfa6a>] kiblnd_send+0x5da/0xa20 [ko2iblnd]
            [  591.191077]  [<ffffffffc0c430e4>] lnet_ni_send+0x44/0xd0 [lnet]
            [  591.197004]  [<ffffffffc0c4ad92>] lnet_send+0x82/0x1c0 [lnet]
            [  591.202759]  [<ffffffffc0c4b19c>] LNetPut+0x2cc/0xb50 [lnet]
            [  591.208448]  [<ffffffffc0f5c5d6>] ptl_send_buf+0x146/0x530 [ptlrpc]
            [  591.214715]  [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [  591.221786]  [<ffffffffc0f5f98b>] ptlrpc_send_reply+0x29b/0x840 [ptlrpc]
            [  591.228509]  [<ffffffffc0f1e94e>] target_send_reply_msg+0x8e/0x170 [ptlrpc]
            [  591.235494]  [<ffffffffc0f28e0e>] target_send_reply+0x30e/0x730 [ptlrpc]
            [  591.242224]  [<ffffffffc0f66877>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
            [  591.249992]  [<ffffffffc0fccf07>] tgt_request_handle+0x697/0x1580 [ptlrpc]
            [  591.256897]  [<ffffffffc0fa6a51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [  591.264463]  [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [  591.271534]  [<ffffffffc0f7192b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [  591.279214]  [<ffffffffc0f6e7b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [  591.285996]  [<ffffffff86cd67c2>] ? default_wake_function+0x12/0x20
            [  591.292263]  [<ffffffff86ccba9b>] ? __wake_up_common+0x5b/0x90
            [  591.298124]  [<ffffffffc0f7525c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            [  591.304415]  [<ffffffffc0f74760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [  591.311807]  [<ffffffff86cc1c31>] kthread+0xd1/0xe0
            [  591.316685]  [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40
            [  591.322780]  [<ffffffff87374c24>] ret_from_fork_nospec_begin+0xe/0x21
            [  591.329219]  [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40
            [    0.000000] Initializing cgroup subsys cpuset
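
            Worth noting: this backtrace (and the OSS oops in the comment below) stops inside ib_fmr_pool_map_phys(), which serializes every FMR mapping for the NI behind a single per-pool spinlock. Below is a minimal userspace stand-in for that funnel, with hypothetical names rather than the kernel source; if the lock word is clobbered or never released, every sender spins in the acquire, which the NMI watchdog reports as a hard LOCKUP:

            #include <pthread.h>
            #include <stdio.h>

            struct fmr_pool {
                    pthread_spinlock_t pool_lock;   /* the single choke point */
                    int mapped;                     /* toy stand-in for pool state */
            };

            /* toy analogue of ib_fmr_pool_map_phys(): every mapping funnels here */
            static void pool_map_phys(struct fmr_pool *pool)
            {
                    pthread_spin_lock(&pool->pool_lock);    /* both traces stop in this acquire */
                    pool->mapped++;                         /* look up / map under the lock */
                    pthread_spin_unlock(&pool->pool_lock);
            }

            int main(void)
            {
                    struct fmr_pool pool = { .mapped = 0 };

                    pthread_spin_init(&pool.pool_lock, PTHREAD_PROCESS_PRIVATE);
                    pool_map_phys(&pool);
                    printf("mapped=%d\n", pool.mapped);
                    pthread_spin_destroy(&pool.pool_lock);
                    return 0;
            }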
            

            sthiell Stephane Thiell added a comment -

            I tried again with 2.12.0_RC1 + Amir's patch at https://review.whamcloud.com/#/c/33794/1; the OSS oopsed:

            [  496.611635] LNet: HW NUMA nodes: 4, HW CPU cores: 32, npartitions: 4
            [  496.619204] alg: No test for adler32 (adler32-zlib)
            [  497.421277] Lustre: Lustre: Build Version: 2.12.0_RC1
            [  497.541113] LNet: Using FastReg for registration
            [  497.674010] LNet: Added LNI 10.0.10.104@o2ib7 [8/256/0/180]
            [  498.540700] LNet: 120701:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 497 seconds
            [  499.695014] md: md5 stopped.
            [  499.728157] async_tx: api initialized (async)
            [  499.734214] xor: automatically using best checksumming function:
            [  499.749692]    avx       :  9660.000 MB/sec
            [  499.781700] raid6: sse2x1   gen()  6117 MB/s
            [  499.802691] raid6: sse2x2   gen() 11347 MB/s
            [  499.823694] raid6: sse2x4   gen() 12980 MB/s
            [  499.844693] raid6: avx2x1   gen() 14261 MB/s
            [  499.865692] raid6: avx2x2   gen() 18882 MB/s
            [  499.886692] raid6: avx2x4   gen() 18851 MB/s
            [  499.890963] raid6: using algorithm avx2x2 gen() (18882 MB/s)
            [  499.896626] raid6: using avx2x2 recovery algorithm
            [  499.917975] md/raid:md5: device dm-55 operational as raid disk 0
            [  499.923990] md/raid:md5: device dm-11 operational as raid disk 9
            [  499.930008] md/raid:md5: device dm-10 operational as raid disk 8
            [  499.936020] md/raid:md5: device dm-1 operational as raid disk 7
            [  499.941966] md/raid:md5: device dm-83 operational as raid disk 6
            [  499.947996] md/raid:md5: device sdau operational as raid disk 5
            [  499.953926] md/raid:md5: device dm-72 operational as raid disk 4
            [  499.959939] md/raid:md5: device dm-61 operational as raid disk 3
            [  499.965954] md/raid:md5: device dm-60 operational as raid disk 2
            [  499.971970] md/raid:md5: device dm-116 operational as raid disk 1
            [  499.979129] md/raid:md5: raid level 6 active with 10 out of 10 devices, algorithm 2
            [  500.021173] md5: detected capacity change from 0 to 64011422924800
            [  500.044705] md: md3 stopped.
            [  500.077660] md/raid:md3: device dm-87 operational as raid disk 0
            [  500.083732] md/raid:md3: device dm-86 operational as raid disk 9
            [  500.089794] md/raid:md3: device dm-7 operational as raid disk 8
            [  500.095771] md/raid:md3: device dm-81 operational as raid disk 7
            [  500.101791] md/raid:md3: device dm-80 operational as raid disk 6
            [  500.107802] md/raid:md3: device dm-69 operational as raid disk 5
            [  500.113821] md/raid:md3: device dm-68 operational as raid disk 4
            [  500.119838] md/raid:md3: device dm-111 operational as raid disk 3
            [  500.125948] md/raid:md3: device dm-110 operational as raid disk 2
            [  500.132043] md/raid:md3: device dm-88 operational as raid disk 1
            [  500.138901] md/raid:md3: raid level 6 active with 10 out of 10 devices, algorithm 2
            [  500.166972] md3: detected capacity change from 0 to 64011422924800
            [  500.181231] md: md11 stopped.
            [  500.205433] md/raid:md11: device dm-91 operational as raid disk 0
            [  500.211589] md/raid:md11: device dm-48 operational as raid disk 9
            [  500.217708] md/raid:md11: device dm-47 operational as raid disk 8
            [  500.223824] md/raid:md11: device dm-35 operational as raid disk 7
            [  500.229925] md/raid:md11: device dm-34 operational as raid disk 6
            [  500.236037] md/raid:md11: device dm-104 operational as raid disk 5
            [  500.242310] md/raid:md11: device dm-25 operational as raid disk 4
            [  500.248439] md/raid:md11: device dm-100 operational as raid disk 3
            [  500.254664] md/raid:md11: device dm-99 operational as raid disk 2
            [  500.260789] md/raid:md11: device dm-118 operational as raid disk 1
            [  500.267842] md/raid:md11: raid level 6 active with 10 out of 10 devices, algorithm 2
            [  500.297329] md11: detected capacity change from 0 to 64011422924800
            [  500.306119] md: md1 stopped.
            [  500.341217] md/raid:md1: device dm-58 operational as raid disk 0
            [  500.347236] md/raid:md1: device dm-6 operational as raid disk 9
            [  500.353163] md/raid:md1: device dm-5 operational as raid disk 8
            [  500.359089] md/raid:md1: device dm-76 operational as raid disk 7
            [  500.365110] md/raid:md1: device dm-75 operational as raid disk 6
            [  500.371134] md/raid:md1: device dm-64 operational as raid disk 5
            [  500.377156] md/raid:md1: device dm-63 operational as raid disk 4
            [  500.383168] md/raid:md1: device dm-109 operational as raid disk 3
            [  500.389267] md/raid:md1: device dm-56 operational as raid disk 2
            [  500.395287] md/raid:md1: device dm-67 operational as raid disk 1
            [  500.401961] md/raid:md1: raid level 6 active with 10 out of 10 devices, algorithm 2
            [  500.430931] md1: detected capacity change from 0 to 64011422924800
            [  500.444576] md: md7 stopped.
            [  500.480725] md/raid:md7: device dm-98 operational as raid disk 0
            [  500.486749] md/raid:md7: device dm-40 operational as raid disk 9
            [  500.492776] md/raid:md7: device dm-39 operational as raid disk 8
            [  500.498794] md/raid:md7: device dm-27 operational as raid disk 7
            [  500.504806] md/raid:md7: device dm-26 operational as raid disk 6
            [  500.510829] md/raid:md7: device dm-19 operational as raid disk 5
            [  500.516846] md/raid:md7: device dm-18 operational as raid disk 4
            [  500.522857] md/raid:md7: device dm-117 operational as raid disk 3
            [  500.528957] md/raid:md7: device dm-94 operational as raid disk 2
            [  500.534972] md/raid:md7: device dm-21 operational as raid disk 1
            [  500.541667] md/raid:md7: raid level 6 active with 10 out of 10 devices, algorithm 2
            [  500.593292] md7: detected capacity change from 0 to 64011422924800
            [  500.602526] md: md9 stopped.
            [  500.645764] md/raid:md9: device dm-49 operational as raid disk 0
            [  500.651782] md/raid:md9: device dm-44 operational as raid disk 9
            [  500.657811] md/raid:md9: device dm-43 operational as raid disk 8
            [  500.663832] md/raid:md9: device dm-107 operational as raid disk 7
            [  500.669945] md/raid:md9: device dm-31 operational as raid disk 6
            [  500.675959] md/raid:md9: device dm-22 operational as raid disk 5
            [  500.681974] md/raid:md9: device dm-103 operational as raid disk 4
            [  500.688074] md/raid:md9: device dm-97 operational as raid disk 3
            [  500.694090] md/raid:md9: device dm-96 operational as raid disk 2
            [  500.700104] md/raid:md9: device dm-50 operational as raid disk 1
            [  500.708843] md/raid:md9: raid level 6 active with 10 out of 10 devices, algorithm 2
            [  500.748835] md9: detected capacity change from 0 to 64011422924800
            [  500.752863] LDISKFS-fs (md5): file extents enabled, maximum tree depth=5
            [  500.841913] LDISKFS-fs (md3): file extents enabled, maximum tree depth=5
            [  501.055692] LDISKFS-fs (md5): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
            [  501.143861] LDISKFS-fs (md3): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
            [  501.167914] LDISKFS-fs (md1): file extents enabled, maximum tree depth=5
            [  501.182682] LDISKFS-fs (md11): file extents enabled, maximum tree depth=5
            [  501.508363] LDISKFS-fs (md11): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
            [  501.510261] LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
            [  501.521916] LDISKFS-fs (md9): file extents enabled, maximum tree depth=5
            [  501.554606] BUG: unable to handle kernel paging request at 000000000001b780
            [  501.561618] IP: [<ffffffff901121d0>] native_queued_spin_lock_slowpath+0x110/0x200
            [  501.569135] PGD 1032e6c067 PUD 1032e6d067 PMD 0 
            [  501.573830] Oops: 0002 [#1] SMP 
            [  501.577114] Modules linked in: ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) raid456 async_raid6_recov async_memcpy async_pq raid6_pq libcrc32c async_xor xor async_tx ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache mpt2sas mptctl mptbase dell_rbu vfat fat rpcrdma sunrpc dm_service_time ib_iser libiscsi scsi_transport_iscsi dcdbas amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr k10temp i2c_piix4 ccp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_multipath ses enclosure dm_mod ipmi_si mlx5_ib ipmi_devintf sg ipmi_msghandler ib_core acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea mlx5_core sysfillrect sysimgblt fb_sys_fops ttm ahci mlxfw devlink libahci crct10dif_pclmul tg3 crct10dif_common ptp drm libata megaraid_sas crc32c_intel drm_panel_orientation_quirks pps_core mpt3sas(OE) raid_class scsi_transport_sas
            [  501.680694] CPU: 30 PID: 121538 Comm: mount.lustre Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
            [  501.693284] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.3.6 04/20/2018
            [  501.700853] task: ffff8b02a92fe180 ti: ffff8b02b2e64000 task.ti: ffff8b02b2e64000
            [  501.708330] RIP: 0010:[<ffffffff901121d0>]  [<ffffffff901121d0>] native_queued_spin_lock_slowpath+0x110/0x200
            [  501.718272] RSP: 0018:ffff8b02b2e674b8  EFLAGS: 00010002
            [  501.723583] RAX: 00000000000016e6 RBX: 0000000000000287 RCX: 0000000000f10000
            [  501.730717] RDX: 000000000001b780 RSI: 00000000b7318240 RDI: ffff8b22b7318fc0
            [  501.737850] RBP: ffff8b02b2e674b8 R08: ffff8b22bf7db780 R09: 0000000000000000
            [  501.744982] R10: 0000000000000002 R11: 00000000000002a0 R12: 0000000000000000
            [  501.752116] R13: ffff8b2256e55800 R14: ffff8b22b7318fc0 R15: 0000000000000002
            [  501.759249] FS:  00007fc3f2cd5840(0000) GS:ffff8b22bf7c0000(0000) knlGS:0000000000000000
            [  501.767335] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [  501.773079] CR2: 000000000001b780 CR3: 0000001034c28000 CR4: 00000000003407e0
            [  501.780213] Call Trace:
            [  501.782671]  [<ffffffff9075bfcb>] queued_spin_lock_slowpath+0xb/0xf
            [  501.788943]  [<ffffffff9076a507>] _raw_spin_lock_irqsave+0x37/0x40
            [  501.795132]  [<ffffffffc06c1bab>] ib_fmr_pool_map_phys+0x4b/0x300 [ib_core]
            [  501.802095]  [<ffffffffc0bfd6cc>] kiblnd_fmr_pool_map+0x15c/0x690 [ko2iblnd]
            [  501.809148]  [<ffffffffc0c01965>] kiblnd_map_tx.isra.28+0x155/0x470 [ko2iblnd]
            [  501.816373]  [<ffffffffc0c03216>] kiblnd_setup_rd_iov+0x2f6/0x400 [ko2iblnd]
            [  501.820899] LDISKFS-fs (md7): file extents enabled, maximum tree depth=5
            [  501.825866] LDISKFS-fs (md9): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
            [  501.840808]  [<ffffffffc0c06a6a>] kiblnd_send+0x5da/0xa20 [ko2iblnd]
            [  501.847176]  [<ffffffffc097a0e4>] lnet_ni_send+0x44/0xd0 [lnet]
            [  501.853102]  [<ffffffffc0981d92>] lnet_send+0x82/0x1c0 [lnet]
            [  501.858855]  [<ffffffffc098219c>] LNetPut+0x2cc/0xb50 [lnet]
            [  501.864561]  [<ffffffffc0c935d6>] ptl_send_buf+0x146/0x530 [ptlrpc]
            [  501.870855]  [<ffffffffc0c95f1d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]
            [  501.877146]  [<ffffffffc0c8ab60>] ptlrpc_send_new_req+0x450/0xa60 [ptlrpc]
            [  501.884045]  [<ffffffffc0c8f788>] ptlrpc_set_wait+0x3f8/0x8d0 [ptlrpc]
            [  501.890600]  [<ffffffffc0a765e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
            [  501.897592]  [<ffffffffc0c9a2e0>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc]
            [  501.904491]  [<ffffffffc0c9bbaa>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]
            [  501.911560]  [<ffffffffc0c8fce3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
            [  501.918173]  [<ffffffffc0f24b44>] mgc_target_register+0x134/0x4c0 [mgc]
            [  501.924785]  [<ffffffffc0f2862b>] mgc_set_info_async+0x37b/0x1610 [mgc]
            [  501.931407]  [<ffffffffc08fef07>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            [  501.938036]  [<ffffffffc0a85b7b>] server_start_targets+0x116b/0x2a20 [obdclass]
            [  501.945362]  [<ffffffffc0a5ab10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
            [  501.952421]  [<ffffffffc0a884fc>] server_fill_super+0x10cc/0x1890 [obdclass]
            [  501.959487]  [<ffffffffc0a5d828>] lustre_fill_super+0x328/0x950 [obdclass]
            [  501.966375]  [<ffffffffc0a5d500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
            [  501.973945]  [<ffffffff902452cf>] mount_nodev+0x4f/0xb0
            [  501.979185]  [<ffffffffc0a55908>] lustre_mount+0x38/0x60 [obdclass]
            [  501.985452]  [<ffffffff90245e4e>] mount_fs+0x3e/0x1b0
            [  501.990507]  [<ffffffff902639e7>] vfs_kern_mount+0x67/0x110
            [  501.996076]  [<ffffffff9026600f>] do_mount+0x1ef/0xce0
            [  502.001218]  [<ffffffff9023e2aa>] ? __check_object_size+0x1ca/0x250
            [  502.007485]  [<ffffffff9021c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
            [  502.013922]  [<ffffffff90266e43>] SyS_mount+0x83/0xd0
            [  502.018977]  [<ffffffff90774ddb>] system_call_fastpath+0x22/0x27
            [  502.024980] Code: 87 47 02 c1 e0 10 45 31 c9 85 c0 74 44 48 89 c2 c1 e8 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 80 b7 01 00 48 03 14 c5 60 b9 d4 90 <4c> 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 41 8b 40 
            [  502.045501] RIP  [<ffffffff901121d0>] native_queued_spin_lock_slowpath+0x110/0x200
            [  502.053096]  RSP <ffff8b02b2e674b8>
            [  502.056589] CR2: 000000000001b780


            Just wanted to say that we hit the same issue with 2.12.0_RC1; it's super unstable. I've used 2.11.56_3_g4e42995 quite extensively but never hit this issue before the upgrade to CentOS 7.6 + 2.12.0_RC1.

            [Fri Dec  7 17:28:33 2018]CentOS Linux 7 (Core)
            [Fri Dec  7 17:28:33 2018]Kernel 3.10.0-957.1.3.el7_lustre.x86_64 on an x86_64
            [Fri Dec  7 17:28:33 2018]
            [Fri Dec  7 17:28:33 2018]fir-io2-s1 login: [ 4678.271368] LNet: HW NUMA nodes: 4, HW CPU cores: 32, npartitions: 4
            [Fri Dec  7 18:44:21 2018][ 4678.279024] alg: No test for adler32 (adler32-zlib)
            [Fri Dec  7 18:44:22 2018][ 4679.080205] Lustre: Lustre: Build Version: 2.12.0_RC1
            [Fri Dec  7 18:44:22 2018][ 4679.196318] LNet: Using FastReg for registration
            [Fri Dec  7 18:44:22 2018][ 4679.328400] LNet: Added LNI 10.0.10.103@o2ib7 [8/256/0/180]
            [Fri Dec  7 18:44:23 2018][ 4680.195333] LNet: 19041:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 4679 seconds
            [Fri Dec  7 18:45:23 2018][ 4740.334804] LNetError: 19070:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) ASSERTION( md->md_rspt_ptr == ((void *)0) ) failed:
            [Fri Dec  7 18:45:23 2018][ 4740.346280] LNetError: 19070:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) LBUG
            [Fri Dec  7 18:45:23 2018][ 4740.353596] Pid: 19070, comm: monitor_thread 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            [Fri Dec  7 18:45:23 2018][ 4740.363800] Call Trace:
            [Fri Dec  7 18:45:23 2018][ 4740.366262]  [<ffffffffc094c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
            [Fri Dec  7 18:45:23 2018][ 4740.372832]  [<ffffffffc094c87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [Fri Dec  7 18:45:23 2018][ 4740.379068]  [<ffffffffc09e8667>] lnet_attach_rsp_tracker.isra.29+0xd7/0xe0 [lnet]
            [Fri Dec  7 18:45:23 2018][ 4740.386685]  [<ffffffffc09effb1>] LNetGet+0x5d1/0xa90 [lnet]
            [Fri Dec  7 18:45:23 2018][ 4740.392395]  [<ffffffffc09f8399>] lnet_check_routers+0x399/0xbf0 [lnet]
            [Fri Dec  7 18:45:23 2018][ 4740.399054]  [<ffffffffc09f18cf>] lnet_monitor_thread+0x4ff/0x560 [lnet]
            [Fri Dec  7 18:45:23 2018][ 4740.405811]  [<ffffffffb76c1c31>] kthread+0xd1/0xe0
            [Fri Dec  7 18:45:23 2018][ 4740.410725]  [<ffffffffb7d74c24>] ret_from_fork_nospec_begin+0xe/0x21
            [Fri Dec  7 18:45:23 2018][ 4740.417215]  [<ffffffffffffffff>] 0xffffffffffffffff
            [Fri Dec  7 18:45:23 2018][ 4740.422237] Kernel panic - not syncing: LBUG
            

            Config is CentOS 7.6 (3.10.0-957.1.3.el7_lustre.x86_64), Lustre 2.12.0 RC1, in-kernel OFED stack, AMD EPYC server (Dell R6415), mlx5 EDR.

            The previous, perfectly working config was CentOS 7.5 (3.10.0-862.14.4.el7_lustre.x86_64), Lustre 2.11.56_3_g4e42995, in-kernel OFED stack, AMD EPYC server (Dell R6415), mlx5 EDR.

            Thanks,
            Stephane

            sthiell Stephane Thiell added a comment

            The patch seems to resolve my bring-up issues. I'm doing further testing to make sure no other regressions exist.

            simmonsja James A Simmons added a comment

            I don't think this particular patch is relevant to that ticket, since LU-10669 predates the health feature. But it could be related in the sense that an MD can be used multiple times, which could lead to a corner case triggering the assert:

            lib-msg.c:346:lnet_msg_detach_md()) ASSERTION( md->md_refcount >= 0
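
            To make the failure mode concrete, here is a minimal standalone C sketch. It is not the Lustre source: the struct layout and helper names are simplified assumptions loosely based on the identifiers in the logs above. It shows how an assertion written for single-use MDs trips as soon as the same MD carries a second message, and the general shape of a multi-use-aware attach that replaces a stale tracker instead of asserting.

            #include <assert.h>
            #include <stdio.h>
            #include <stdlib.h>

            struct rsp_tracker {
                long rspt_deadline;               /* when to time the response out */
            };

            struct md_sketch {
                struct rsp_tracker *md_rspt_ptr;  /* response tracker, if armed */
            };

            /* rc1-style attach: assumes the MD is only ever used once */
            static void attach_rsp_tracker_rc1(struct md_sketch *md,
                                               struct rsp_tracker *rspt)
            {
                assert(md->md_rspt_ptr == NULL);  /* LBUGs when the MD is reused */
                md->md_rspt_ptr = rspt;
            }

            /* multi-use-aware attach: release any stale tracker first */
            static void attach_rsp_tracker_multi(struct md_sketch *md,
                                                 struct rsp_tracker *rspt)
            {
                if (md->md_rspt_ptr != NULL)
                    free(md->md_rspt_ptr);        /* detach the old tracker */
                md->md_rspt_ptr = rspt;
            }

            int main(void)
            {
                struct md_sketch md = { NULL };

                /* first use of the MD: fine either way */
                attach_rsp_tracker_rc1(&md, malloc(sizeof(struct rsp_tracker)));
                /* second use of the same MD: fatal under the rc1 assertion,
                 * handled gracefully by the multi-use-aware variant */
                attach_rsp_tracker_multi(&md, malloc(sizeof(struct rsp_tracker)));
                printf("MD reused without tripping the assertion\n");
                free(md.md_rspt_ptr);
                return 0;
            }

            The real fix of course operates on the actual LNet structures and locking; the sketch only illustrates why a one-shot assertion is incompatible with MD reuse.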

            ashehata Amir Shehata (Inactive) added a comment

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: simmonsja James A Simmons