Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Affects Version: Lustre 2.12.0
- Fix Version: lustre 2.12.0-rc1
- Severity: 3
Description
In my testing I see the following at bring-up:
2018-12-05T14:24:14.274614-05:00 ninja84.ccs.ornl.gov kernel: Lustre: Lustre: Build Version: 2.12.0_RC1_dirty
2018-12-05T14:24:14.671611-05:00 ninja84.ccs.ornl.gov kernel: LNetError: 3759:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) ASSERTION( md->md_rspt_ptr == ((void *)0) ) failed:
2018-12-05T14:24:14.671682-05:00 ninja84.ccs.ornl.gov kernel: LNetError: 3759:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) LBUG
2018-12-05T14:24:14.671716-05:00 ninja84.ccs.ornl.gov kernel: Pid: 3759, comm: monitor_thread 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018
2018-12-05T14:24:14.681179-05:00 ninja84.ccs.ornl.gov kernel: Call Trace:
2018-12-05T14:24:14.685051-05:00 ninja84.ccs.ornl.gov kernel: Lustre: Echo OBD driver; http://www.lustre.org/
2018-12-05T14:24:14.695841-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc0c267cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2018-12-05T14:24:14.697263-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc0c2687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2018-12-05T14:24:14.703501-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc102c257>] lnet_attach_rsp_tracker.isra.29+0xd7/0xe0 [lnet]
2018-12-05T14:24:14.711154-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc1033ba1>] LNetGet+0x5d1/0xa90 [lnet]
2018-12-05T14:24:14.716870-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc103c0b9>] lnet_check_routers+0x399/0xbf0 [lnet]
2018-12-05T14:24:14.723556-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc10354bf>] lnet_monitor_thread+0x4ff/0x560 [lnet]
2018-12-05T14:24:14.730325-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffff8dabb161>] kthread+0xd1/0xe0
2018-12-05T14:24:14.735265-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffff8e120677>] ret_from_fork_nospec_end+0x0/0x39
2018-12-05T14:24:14.741596-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffffffffff>] 0xffffffffffffffff
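The LBUG fires in lnet_attach_rsp_tracker() while the monitor thread's router checker (lnet_check_routers -> LNetGet) attaches a response tracker to an MD whose md_rspt_ptr is still set. Below is a simplified, hypothetical sketch of that failure mode; the struct and function names echo the log, but this is not the actual lib-move.c code, only an illustration of why reusing one MD for more than one GET trips the assertion.

/*
 * Hypothetical sketch of the failure mode above.  Names mirror the log
 * (md_rspt_ptr, attach_rsp_tracker) but this is NOT the Lustre source.
 */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct rsp_tracker {                /* toy response tracker for GET timeouts */
	long deadline;
};

struct libmd {                      /* toy memory descriptor */
	struct rsp_tracker *md_rspt_ptr;
};

static void attach_rsp_tracker(struct libmd *md, long deadline)
{
	struct rsp_tracker *rspt = malloc(sizeof(*rspt));

	rspt->deadline = deadline;
	/* 2.12.0-RC1 assumes no tracker is already attached ... */
	assert(md->md_rspt_ptr == NULL);   /* <-- the LBUG in the log */
	md->md_rspt_ptr = rspt;
}

int main(void)
{
	struct libmd md = { .md_rspt_ptr = NULL };

	attach_rsp_tracker(&md, 100);   /* first GET on this MD: fine */
	attach_rsp_tracker(&md, 200);   /* MD reused before the first tracker
					 * is detached: assertion fires */
	return 0;
}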
Attachments
Issue Links
- is related to LU-10669 Potential race condition when unlinking MD (Open)
Activity
Landed for 2.12. sthiell, please open a new ticket if there are any problems with RC2.
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33794/
Subject: LU-11734 lnet: handle multi-md usage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8c249097e62713baf51aec808489a86acf46748d
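The patch subject, "handle multi-md usage", suggests the attach path was taught to cope with an MD that is already being tracked rather than asserting. The sketch below shows one plausible shape of such handling, refreshing an existing tracker's deadline instead of LBUGing; this is an assumption for illustration only and is not taken from the merged change (same toy types as the sketch under the Description, hypothetical names throughout).

/*
 * Hypothetical illustration of tolerating an already-tracked MD.
 * Not the code from the merged patch.
 */
#include <stdio.h>
#include <stdlib.h>

struct rsp_tracker {                     /* toy response tracker */
	long deadline;
};

struct libmd {                           /* toy memory descriptor */
	struct rsp_tracker *md_rspt_ptr;
};

static void attach_rsp_tracker_tolerant(struct libmd *md, long deadline)
{
	if (md->md_rspt_ptr != NULL) {
		/* MD is already tracked (e.g. reused for another GET):
		 * push the deadline out instead of hitting an LBUG. */
		md->md_rspt_ptr->deadline = deadline;
		return;
	}

	md->md_rspt_ptr = malloc(sizeof(*md->md_rspt_ptr));
	md->md_rspt_ptr->deadline = deadline;
}

int main(void)
{
	struct libmd md = { .md_rspt_ptr = NULL };

	attach_rsp_tracker_tolerant(&md, 100);   /* first GET: allocates */
	attach_rsp_tracker_tolerant(&md, 200);   /* reuse: no assertion  */
	printf("deadline = %ld\n", md.md_rspt_ptr->deadline);
	free(md.md_rspt_ptr);
	return 0;
}

Whether the real fix detaches the old tracker, refreshes it, or allocates a tracker per GET is documented in the Gerrit change itself; the sketch only shows that relaxing the NULL assertion requires some policy for the pre-existing tracker.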
Perhaps I missed something: do I need to apply the patch from https://review.whamcloud.com/#/c/31313/ as well? Please clarify which patches need to be applied on top of 2.12.0_RC1 if you want me to try again. Thanks!
Slightly different symptom on the MGS/MDS with 2.12.0_RC1 + Amir's patch, but definitely unusable:
fir-md1-s1 login: [ 497.904084] LNet: HW NUMA nodes: 4, HW CPU cores: 48, npartitions: 4 [ 497.913010] alg: No test for adler32 (adler32-zlib) [ 498.713626] Lustre: Lustre: Build Version: 2.12.0_RC1 [ 498.836281] LNet: Using FastReg for registration [ 498.968174] LNet: Added LNI 10.0.10.51@o2ib7 [8/256/0/180] [ 499.835260] LNet: 20661:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 498 seconds [ 500.088147] LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait. [ 500.088147] [ 542.110957] LDISKFS-fs (dm-1): recovery complete [ 542.115778] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc [ 542.437699] Lustre: MGS: Connection restored to ebe34e0e-4031-a65f-5f91-0d1e392e8147 (at 0@lo) [ 543.013105] LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait. [ 543.013105] [ 543.014468] LDISKFS-fs warning (device dm-4): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait. [ 543.014468] [ 558.768986] Lustre: MGS: Connection restored to 73368e2a-0b82-5939-f05f-1ad2a79768da (at 10.9.101.59@o2ib4) [ 561.842373] Lustre: MGS: Connection restored to d9b25a46-8406-5014-40e9-a5748f9ad5c5 (at 10.0.10.105@o2ib7) [ 561.852127] Lustre: Skipped 1 previous similar message [ 567.115808] Lustre: MGS: Connection restored to a8053d09-bfcd-3aa9-c515-8645bbb097c5 (at 10.8.0.68@o2ib6) [ 571.051218] Lustre: MGS: Received new LWP connection from 10.9.101.59@o2ib4, removing former export from same NID [ 572.706429] Lustre: MGS: Connection restored to 38b4f0c2-db8c-d604-7e00-2c04b790499d (at 10.9.0.63@o2ib4) [ 572.716088] Lustre: Skipped 6 previous similar messages [ 573.921768] Lustre: MGS: Received new LWP connection from 10.0.10.105@o2ib7, removing former export from same NID [ 575.260500] Lustre: MGS: Received new LWP connection from 10.9.0.61@o2ib4, removing former export from same NID [ 578.088401] Lustre: MGS: Received new LWP connection from 10.9.101.59@o2ib4, removing former export from same NID [ 580.973963] Lustre: MGS: Connection restored to d9b25a46-8406-5014-40e9-a5748f9ad5c5 (at 10.0.10.105@o2ib7) [ 580.985864] Lustre: Skipped 6 previous similar messages [ 582.286596] Lustre: MGS: Received new LWP connection from 10.9.0.61@o2ib4, removing former export from same NID [ 582.296688] Lustre: Skipped 2 previous similar messages [ 583.358103] LustreError: 20965:0:(ldlm_lib.c:3248:target_bulk_io()) @@@ timeout on bulk READ after 6+0s req@ffff8f086f7d0450 x1619041162737216/t0(0) o256->d1fa3f21-b7f3-5a11-e34d-8236124081cd@10.9.0.1@o2ib4:489/0 lens 304/240 e 0 to 0 dl 1544243514 ref 1 fl Interpret:/0/0 rc 0/0 [ 585.015471] LDISKFS-fs (dm-4): file extents enabled, maximum tree depth=5 [ 585.067346] LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5 [ 585.315122] LDISKFS-fs (dm-4): recovery complete [ 585.319941] LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc [ 585.354354] LDISKFS-fs (dm-2): recovery complete [ 585.358987] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. 
Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc [ 590.570227] NMI watchdog: Watchdog detected hard LOCKUP on cpu 22 [ 590.576145] Modules linked in: mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_ucm rpcrdma rdma_ucm ib_uverbs ib_iser ib_umad rdma_cm iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm mlx5_ib ib_core mpt2sas mptctl mptbase dell_rbu sunrpc vfat fat dm_round_robin amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_multipath dcdbas ses pcspkr dm_mod ipmi_si enclosure ipmi_devintf i2c_piix4 sg ccp k10temp ipmi_msghandler acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci mlx5_core crct10dif_pclmul crct10dif_common libahci mlxfw devlink drm tg3 crc32c_intel libata ptp megaraid_sas drm_panel_orientation_quirks pps_core mpt3sas(OE) raid_class scsi_transport_sas [ 590.671805] CPU: 22 PID: 20866 Comm: ll_mgs_0002 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.1.3.el7_lustre.x86_64 #1 [ 590.684220] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.3.6 04/20/2018 [ 590.691786] Call Trace: [ 590.694241] <NMI> [<ffffffff87361e41>] dump_stack+0x19/0x1b [ 590.700025] [<ffffffff86d494c5>] watchdog_overflow_callback+0x135/0x140 [ 590.706726] [<ffffffff86da1297>] __perf_event_overflow+0x57/0x100 [ 590.712910] [<ffffffff86daa904>] perf_event_overflow+0x14/0x20 [ 590.718831] [<ffffffff86c055f0>] x86_pmu_handle_irq+0x140/0x1a0 [ 590.724838] [<ffffffff86f77c08>] ? ioremap_page_range+0x2e8/0x480 [ 590.731018] [<ffffffff86dfb854>] ? vunmap_page_range+0x234/0x470 [ 590.737106] [<ffffffff86dfbaa1>] ? unmap_kernel_range_noflush+0x11/0x20 [ 590.743811] [<ffffffff8703d49e>] ? ghes_copy_tofrom_phys+0x10e/0x210 [ 590.750254] [<ffffffff8703d640>] ? ghes_read_estatus+0xa0/0x190 [ 590.756263] [<ffffffff8736b031>] perf_event_nmi_handler+0x31/0x50 [ 590.762440] [<ffffffff8736c8fc>] nmi_handle.isra.0+0x8c/0x150 [ 590.768274] [<ffffffff8736cb1d>] do_nmi+0x15d/0x460 [ 590.773239] [<ffffffff8736bd69>] end_repeat_nmi+0x1e/0x81 [ 590.778731] [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200 [ 590.786127] [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200 [ 590.793521] [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200 [ 590.800910] <EOE> [<ffffffff8735bfcb>] queued_spin_lock_slowpath+0xb/0xf [ 590.807812] [<ffffffff8736a507>] _raw_spin_lock_irqsave+0x37/0x40 [ 590.814001] [<ffffffffc090abab>] ib_fmr_pool_map_phys+0x4b/0x300 [ib_core] [ 590.820965] [<ffffffffc0ec66cc>] kiblnd_fmr_pool_map+0x15c/0x690 [ko2iblnd] [ 590.828017] [<ffffffffc0eca965>] kiblnd_map_tx.isra.28+0x155/0x470 [ko2iblnd] [ 590.835235] [<ffffffffc0ecc216>] kiblnd_setup_rd_iov+0x2f6/0x400 [ko2iblnd] [ 590.842282] [<ffffffffc0ecfa6a>] kiblnd_send+0x5da/0xa20 [ko2iblnd] [ 590.848645] [<ffffffffc0c430e4>] lnet_ni_send+0x44/0xd0 [lnet] [ 590.854569] [<ffffffffc0c4ad92>] lnet_send+0x82/0x1c0 [lnet] [ 590.860323] [<ffffffffc0c4b19c>] LNetPut+0x2cc/0xb50 [lnet] [ 590.866029] [<ffffffffc0f5c5d6>] ptl_send_buf+0x146/0x530 [ptlrpc] [ 590.872302] [<ffffffffc0bc1bde>] ? 
ktime_get_real_seconds+0xe/0x10 [libcfs] [ 590.879378] [<ffffffffc0f5f98b>] ptlrpc_send_reply+0x29b/0x840 [ptlrpc] [ 590.886109] [<ffffffffc0f1e94e>] target_send_reply_msg+0x8e/0x170 [ptlrpc] [ 590.893092] [<ffffffffc0f28e0e>] target_send_reply+0x30e/0x730 [ptlrpc] [ 590.899823] [<ffffffffc0f66877>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc] [ 590.907607] [<ffffffffc0fccf07>] tgt_request_handle+0x697/0x1580 [ptlrpc] [ 590.914514] [<ffffffffc0fa6a51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [ 590.922088] [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [ 590.929161] [<ffffffffc0f7192b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 590.936846] [<ffffffffc0f6e7b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [ 590.943633] [<ffffffff86cd67c2>] ? default_wake_function+0x12/0x20 [ 590.949898] [<ffffffff86ccba9b>] ? __wake_up_common+0x5b/0x90 [ 590.955757] [<ffffffffc0f7525c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [ 590.962050] [<ffffffffc0f74760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [ 590.969441] [<ffffffff86cc1c31>] kthread+0xd1/0xe0 [ 590.974320] [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40 [ 590.980413] [<ffffffff87374c24>] ret_from_fork_nospec_begin+0xe/0x21 [ 590.986850] [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40 [ 590.992945] Kernel panic - not syncing: Hard LOCKUP [ 590.997823] CPU: 22 PID: 20866 Comm: ll_mgs_0002 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.1.3.el7_lustre.x86_64 #1 [ 591.010241] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.3.6 04/20/2018 [ 591.017807] Call Trace: [ 591.020262] <NMI> [<ffffffff87361e41>] dump_stack+0x19/0x1b [ 591.026035] [<ffffffff8735b550>] panic+0xe8/0x21f [ 591.030826] [<ffffffff87374c24>] ? ret_from_fork_nospec_begin+0xe/0x21 [ 591.037442] [<ffffffff86c9739f>] nmi_panic+0x3f/0x40 [ 591.042492] [<ffffffff86d494b1>] watchdog_overflow_callback+0x121/0x140 [ 591.049193] [<ffffffff86da1297>] __perf_event_overflow+0x57/0x100 [ 591.055370] [<ffffffff86daa904>] perf_event_overflow+0x14/0x20 [ 591.061292] [<ffffffff86c055f0>] x86_pmu_handle_irq+0x140/0x1a0 [ 591.067296] [<ffffffff86f77c08>] ? ioremap_page_range+0x2e8/0x480 [ 591.073477] [<ffffffff86dfb854>] ? vunmap_page_range+0x234/0x470 [ 591.079570] [<ffffffff86dfbaa1>] ? unmap_kernel_range_noflush+0x11/0x20 [ 591.086268] [<ffffffff8703d49e>] ? ghes_copy_tofrom_phys+0x10e/0x210 [ 591.092709] [<ffffffff8703d640>] ? ghes_read_estatus+0xa0/0x190 [ 591.098714] [<ffffffff8736b031>] perf_event_nmi_handler+0x31/0x50 [ 591.104893] [<ffffffff8736c8fc>] nmi_handle.isra.0+0x8c/0x150 [ 591.110726] [<ffffffff8736cb1d>] do_nmi+0x15d/0x460 [ 591.115694] [<ffffffff8736bd69>] end_repeat_nmi+0x1e/0x81 [ 591.121181] [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200 [ 591.128572] [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200 [ 591.135964] [<ffffffff86d121e2>] ? 
native_queued_spin_lock_slowpath+0x122/0x200 [ 591.143355] <EOE> [<ffffffff8735bfcb>] queued_spin_lock_slowpath+0xb/0xf [ 591.150258] [<ffffffff8736a507>] _raw_spin_lock_irqsave+0x37/0x40 [ 591.156442] [<ffffffffc090abab>] ib_fmr_pool_map_phys+0x4b/0x300 [ib_core] [ 591.163407] [<ffffffffc0ec66cc>] kiblnd_fmr_pool_map+0x15c/0x690 [ko2iblnd] [ 591.170452] [<ffffffffc0eca965>] kiblnd_map_tx.isra.28+0x155/0x470 [ko2iblnd] [ 591.177672] [<ffffffffc0ecc216>] kiblnd_setup_rd_iov+0x2f6/0x400 [ko2iblnd] [ 591.184719] [<ffffffffc0ecfa6a>] kiblnd_send+0x5da/0xa20 [ko2iblnd] [ 591.191077] [<ffffffffc0c430e4>] lnet_ni_send+0x44/0xd0 [lnet] [ 591.197004] [<ffffffffc0c4ad92>] lnet_send+0x82/0x1c0 [lnet] [ 591.202759] [<ffffffffc0c4b19c>] LNetPut+0x2cc/0xb50 [lnet] [ 591.208448] [<ffffffffc0f5c5d6>] ptl_send_buf+0x146/0x530 [ptlrpc] [ 591.214715] [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [ 591.221786] [<ffffffffc0f5f98b>] ptlrpc_send_reply+0x29b/0x840 [ptlrpc] [ 591.228509] [<ffffffffc0f1e94e>] target_send_reply_msg+0x8e/0x170 [ptlrpc] [ 591.235494] [<ffffffffc0f28e0e>] target_send_reply+0x30e/0x730 [ptlrpc] [ 591.242224] [<ffffffffc0f66877>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc] [ 591.249992] [<ffffffffc0fccf07>] tgt_request_handle+0x697/0x1580 [ptlrpc] [ 591.256897] [<ffffffffc0fa6a51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [ 591.264463] [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [ 591.271534] [<ffffffffc0f7192b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 591.279214] [<ffffffffc0f6e7b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [ 591.285996] [<ffffffff86cd67c2>] ? default_wake_function+0x12/0x20 [ 591.292263] [<ffffffff86ccba9b>] ? __wake_up_common+0x5b/0x90 [ 591.298124] [<ffffffffc0f7525c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [ 591.304415] [<ffffffffc0f74760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [ 591.311807] [<ffffffff86cc1c31>] kthread+0xd1/0xe0 [ 591.316685] [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40 [ 591.322780] [<ffffffff87374c24>] ret_from_fork_nospec_begin+0xe/0x21 [ 591.329219] [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40 [ 0.000000] Initializing cgroup subsys cpuset
I tried again with 2.12.0_RC1 + Amir's patch at https://review.whamcloud.com/#/c/33794/1, and the OSS oopsed:
[ 496.611635] LNet: HW NUMA nodes: 4, HW CPU cores: 32, npartitions: 4 [ 496.619204] alg: No test for adler32 (adler32-zlib) [ 497.421277] Lustre: Lustre: Build Version: 2.12.0_RC1 [ 497.541113] LNet: Using FastReg for registration [ 497.674010] LNet: Added LNI 10.0.10.104@o2ib7 [8/256/0/180] [ 498.540700] LNet: 120701:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 497 seconds [ 499.695014] md: md5 stopped. [ 499.728157] async_tx: api initialized (async) [ 499.734214] xor: automatically using best checksumming function: [ 499.749692] avx : 9660.000 MB/sec [ 499.781700] raid6: sse2x1 gen() 6117 MB/s [ 499.802691] raid6: sse2x2 gen() 11347 MB/s [ 499.823694] raid6: sse2x4 gen() 12980 MB/s [ 499.844693] raid6: avx2x1 gen() 14261 MB/s [ 499.865692] raid6: avx2x2 gen() 18882 MB/s [ 499.886692] raid6: avx2x4 gen() 18851 MB/s [ 499.890963] raid6: using algorithm avx2x2 gen() (18882 MB/s) [ 499.896626] raid6: using avx2x2 recovery algorithm [ 499.917975] md/raid:md5: device dm-55 operational as raid disk 0 [ 499.923990] md/raid:md5: device dm-11 operational as raid disk 9 [ 499.930008] md/raid:md5: device dm-10 operational as raid disk 8 [ 499.936020] md/raid:md5: device dm-1 operational as raid disk 7 [ 499.941966] md/raid:md5: device dm-83 operational as raid disk 6 [ 499.947996] md/raid:md5: device sdau operational as raid disk 5 [ 499.953926] md/raid:md5: device dm-72 operational as raid disk 4 [ 499.959939] md/raid:md5: device dm-61 operational as raid disk 3 [ 499.965954] md/raid:md5: device dm-60 operational as raid disk 2 [ 499.971970] md/raid:md5: device dm-116 operational as raid disk 1 [ 499.979129] md/raid:md5: raid level 6 active with 10 out of 10 devices, algorithm 2 [ 500.021173] md5: detected capacity change from 0 to 64011422924800 [ 500.044705] md: md3 stopped. [ 500.077660] md/raid:md3: device dm-87 operational as raid disk 0 [ 500.083732] md/raid:md3: device dm-86 operational as raid disk 9 [ 500.089794] md/raid:md3: device dm-7 operational as raid disk 8 [ 500.095771] md/raid:md3: device dm-81 operational as raid disk 7 [ 500.101791] md/raid:md3: device dm-80 operational as raid disk 6 [ 500.107802] md/raid:md3: device dm-69 operational as raid disk 5 [ 500.113821] md/raid:md3: device dm-68 operational as raid disk 4 [ 500.119838] md/raid:md3: device dm-111 operational as raid disk 3 [ 500.125948] md/raid:md3: device dm-110 operational as raid disk 2 [ 500.132043] md/raid:md3: device dm-88 operational as raid disk 1 [ 500.138901] md/raid:md3: raid level 6 active with 10 out of 10 devices, algorithm 2 [ 500.166972] md3: detected capacity change from 0 to 64011422924800 [ 500.181231] md: md11 stopped. 
[ 500.205433] md/raid:md11: device dm-91 operational as raid disk 0 [ 500.211589] md/raid:md11: device dm-48 operational as raid disk 9 [ 500.217708] md/raid:md11: device dm-47 operational as raid disk 8 [ 500.223824] md/raid:md11: device dm-35 operational as raid disk 7 [ 500.229925] md/raid:md11: device dm-34 operational as raid disk 6 [ 500.236037] md/raid:md11: device dm-104 operational as raid disk 5 [ 500.242310] md/raid:md11: device dm-25 operational as raid disk 4 [ 500.248439] md/raid:md11: device dm-100 operational as raid disk 3 [ 500.254664] md/raid:md11: device dm-99 operational as raid disk 2 [ 500.260789] md/raid:md11: device dm-118 operational as raid disk 1 [ 500.267842] md/raid:md11: raid level 6 active with 10 out of 10 devices, algorithm 2 [ 500.297329] md11: detected capacity change from 0 to 64011422924800 [ 500.306119] md: md1 stopped. [ 500.341217] md/raid:md1: device dm-58 operational as raid disk 0 [ 500.347236] md/raid:md1: device dm-6 operational as raid disk 9 [ 500.353163] md/raid:md1: device dm-5 operational as raid disk 8 [ 500.359089] md/raid:md1: device dm-76 operational as raid disk 7 [ 500.365110] md/raid:md1: device dm-75 operational as raid disk 6 [ 500.371134] md/raid:md1: device dm-64 operational as raid disk 5 [ 500.377156] md/raid:md1: device dm-63 operational as raid disk 4 [ 500.383168] md/raid:md1: device dm-109 operational as raid disk 3 [ 500.389267] md/raid:md1: device dm-56 operational as raid disk 2 [ 500.395287] md/raid:md1: device dm-67 operational as raid disk 1 [ 500.401961] md/raid:md1: raid level 6 active with 10 out of 10 devices, algorithm 2 [ 500.430931] md1: detected capacity change from 0 to 64011422924800 [ 500.444576] md: md7 stopped. [ 500.480725] md/raid:md7: device dm-98 operational as raid disk 0 [ 500.486749] md/raid:md7: device dm-40 operational as raid disk 9 [ 500.492776] md/raid:md7: device dm-39 operational as raid disk 8 [ 500.498794] md/raid:md7: device dm-27 operational as raid disk 7 [ 500.504806] md/raid:md7: device dm-26 operational as raid disk 6 [ 500.510829] md/raid:md7: device dm-19 operational as raid disk 5 [ 500.516846] md/raid:md7: device dm-18 operational as raid disk 4 [ 500.522857] md/raid:md7: device dm-117 operational as raid disk 3 [ 500.528957] md/raid:md7: device dm-94 operational as raid disk 2 [ 500.534972] md/raid:md7: device dm-21 operational as raid disk 1 [ 500.541667] md/raid:md7: raid level 6 active with 10 out of 10 devices, algorithm 2 [ 500.593292] md7: detected capacity change from 0 to 64011422924800 [ 500.602526] md: md9 stopped. 
[ 500.645764] md/raid:md9: device dm-49 operational as raid disk 0 [ 500.651782] md/raid:md9: device dm-44 operational as raid disk 9 [ 500.657811] md/raid:md9: device dm-43 operational as raid disk 8 [ 500.663832] md/raid:md9: device dm-107 operational as raid disk 7 [ 500.669945] md/raid:md9: device dm-31 operational as raid disk 6 [ 500.675959] md/raid:md9: device dm-22 operational as raid disk 5 [ 500.681974] md/raid:md9: device dm-103 operational as raid disk 4 [ 500.688074] md/raid:md9: device dm-97 operational as raid disk 3 [ 500.694090] md/raid:md9: device dm-96 operational as raid disk 2 [ 500.700104] md/raid:md9: device dm-50 operational as raid disk 1 [ 500.708843] md/raid:md9: raid level 6 active with 10 out of 10 devices, algorithm 2 [ 500.748835] md9: detected capacity change from 0 to 64011422924800 [ 500.752863] LDISKFS-fs (md5): file extents enabled, maximum tree depth=5 [ 500.841913] LDISKFS-fs (md3): file extents enabled, maximum tree depth=5 [ 501.055692] LDISKFS-fs (md5): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc [ 501.143861] LDISKFS-fs (md3): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc [ 501.167914] LDISKFS-fs (md1): file extents enabled, maximum tree depth=5 [ 501.182682] LDISKFS-fs (md11): file extents enabled, maximum tree depth=5 [ 501.508363] LDISKFS-fs (md11): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc [ 501.510261] LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc [ 501.521916] LDISKFS-fs (md9): file extents enabled, maximum tree depth=5 [ 501.554606] BUG: unable to handle kernel paging request at 000000000001b780 [ 501.561618] IP: [<ffffffff901121d0>] native_queued_spin_lock_slowpath+0x110/0x200 [ 501.569135] PGD 1032e6c067 PUD 1032e6d067 PMD 0 [ 501.573830] Oops: 0002 [#1] SMP [ 501.577114] Modules linked in: ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) raid456 async_raid6_recov async_memcpy async_pq raid6_pq libcrc32c async_xor xor async_tx ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache mpt2sas mptctl mptbase dell_rbu vfat fat rpcrdma sunrpc dm_service_time ib_iser libiscsi scsi_transport_iscsi dcdbas amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr k10temp i2c_piix4 ccp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_multipath ses enclosure dm_mod ipmi_si mlx5_ib ipmi_devintf sg ipmi_msghandler ib_core acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea mlx5_core sysfillrect sysimgblt fb_sys_fops ttm ahci mlxfw devlink libahci crct10dif_pclmul tg3 crct10dif_common ptp drm libata megaraid_sas crc32c_intel drm_panel_orientation_quirks pps_core mpt3sas(OE) raid_class scsi_transport_sas [ 501.680694] CPU: 30 PID: 121538 Comm: mount.lustre Kdump: loaded Tainted: G OE ------------ 3.10.0-957.1.3.el7_lustre.x86_64 #1 [ 501.693284] Hardware name: Dell Inc. 
PowerEdge R6415/065PKD, BIOS 1.3.6 04/20/2018 [ 501.700853] task: ffff8b02a92fe180 ti: ffff8b02b2e64000 task.ti: ffff8b02b2e64000 [ 501.708330] RIP: 0010:[<ffffffff901121d0>] [<ffffffff901121d0>] native_queued_spin_lock_slowpath+0x110/0x200 [ 501.718272] RSP: 0018:ffff8b02b2e674b8 EFLAGS: 00010002 [ 501.723583] RAX: 00000000000016e6 RBX: 0000000000000287 RCX: 0000000000f10000 [ 501.730717] RDX: 000000000001b780 RSI: 00000000b7318240 RDI: ffff8b22b7318fc0 [ 501.737850] RBP: ffff8b02b2e674b8 R08: ffff8b22bf7db780 R09: 0000000000000000 [ 501.744982] R10: 0000000000000002 R11: 00000000000002a0 R12: 0000000000000000 [ 501.752116] R13: ffff8b2256e55800 R14: ffff8b22b7318fc0 R15: 0000000000000002 [ 501.759249] FS: 00007fc3f2cd5840(0000) GS:ffff8b22bf7c0000(0000) knlGS:0000000000000000 [ 501.767335] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 501.773079] CR2: 000000000001b780 CR3: 0000001034c28000 CR4: 00000000003407e0 [ 501.780213] Call Trace: [ 501.782671] [<ffffffff9075bfcb>] queued_spin_lock_slowpath+0xb/0xf [ 501.788943] [<ffffffff9076a507>] _raw_spin_lock_irqsave+0x37/0x40 [ 501.795132] [<ffffffffc06c1bab>] ib_fmr_pool_map_phys+0x4b/0x300 [ib_core] [ 501.802095] [<ffffffffc0bfd6cc>] kiblnd_fmr_pool_map+0x15c/0x690 [ko2iblnd] [ 501.809148] [<ffffffffc0c01965>] kiblnd_map_tx.isra.28+0x155/0x470 [ko2iblnd] [ 501.816373] [<ffffffffc0c03216>] kiblnd_setup_rd_iov+0x2f6/0x400 [ko2iblnd] [ 501.820899] LDISKFS-fs (md7): file extents enabled, maximum tree depth=5 [ 501.825866] LDISKFS-fs (md9): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc [ 501.840808] [<ffffffffc0c06a6a>] kiblnd_send+0x5da/0xa20 [ko2iblnd] [ 501.847176] [<ffffffffc097a0e4>] lnet_ni_send+0x44/0xd0 [lnet] [ 501.853102] [<ffffffffc0981d92>] lnet_send+0x82/0x1c0 [lnet] [ 501.858855] [<ffffffffc098219c>] LNetPut+0x2cc/0xb50 [lnet] [ 501.864561] [<ffffffffc0c935d6>] ptl_send_buf+0x146/0x530 [ptlrpc] [ 501.870855] [<ffffffffc0c95f1d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc] [ 501.877146] [<ffffffffc0c8ab60>] ptlrpc_send_new_req+0x450/0xa60 [ptlrpc] [ 501.884045] [<ffffffffc0c8f788>] ptlrpc_set_wait+0x3f8/0x8d0 [ptlrpc] [ 501.890600] [<ffffffffc0a765e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass] [ 501.897592] [<ffffffffc0c9a2e0>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc] [ 501.904491] [<ffffffffc0c9bbaa>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc] [ 501.911560] [<ffffffffc0c8fce3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc] [ 501.918173] [<ffffffffc0f24b44>] mgc_target_register+0x134/0x4c0 [mgc] [ 501.924785] [<ffffffffc0f2862b>] mgc_set_info_async+0x37b/0x1610 [mgc] [ 501.931407] [<ffffffffc08fef07>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [ 501.938036] [<ffffffffc0a85b7b>] server_start_targets+0x116b/0x2a20 [obdclass] [ 501.945362] [<ffffffffc0a5ab10>] ? lustre_start_mgc+0x260/0x2510 [obdclass] [ 501.952421] [<ffffffffc0a884fc>] server_fill_super+0x10cc/0x1890 [obdclass] [ 501.959487] [<ffffffffc0a5d828>] lustre_fill_super+0x328/0x950 [obdclass] [ 501.966375] [<ffffffffc0a5d500>] ? lustre_common_put_super+0x270/0x270 [obdclass] [ 501.973945] [<ffffffff902452cf>] mount_nodev+0x4f/0xb0 [ 501.979185] [<ffffffffc0a55908>] lustre_mount+0x38/0x60 [obdclass] [ 501.985452] [<ffffffff90245e4e>] mount_fs+0x3e/0x1b0 [ 501.990507] [<ffffffff902639e7>] vfs_kern_mount+0x67/0x110 [ 501.996076] [<ffffffff9026600f>] do_mount+0x1ef/0xce0 [ 502.001218] [<ffffffff9023e2aa>] ? __check_object_size+0x1ca/0x250 [ 502.007485] [<ffffffff9021c74c>] ? 
kmem_cache_alloc_trace+0x3c/0x200 [ 502.013922] [<ffffffff90266e43>] SyS_mount+0x83/0xd0 [ 502.018977] [<ffffffff90774ddb>] system_call_fastpath+0x22/0x27 [ 502.024980] Code: 87 47 02 c1 e0 10 45 31 c9 85 c0 74 44 48 89 c2 c1 e8 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 80 b7 01 00 48 03 14 c5 60 b9 d4 90 <4c> 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 41 8b 40 [ 502.045501] RIP [<ffffffff901121d0>] native_queued_spin_lock_slowpath+0x110/0x200 [ 502.053096] RSP <ffff8b02b2e674b8> [ 502.056589] CR2: 000000000001b780
Just wanted to say that we hit the same issue with 2.12.0_RC1; it's super unstable. I had used 2.11.56_3_g4e42995 quite extensively and never hit this issue before the upgrade to CentOS 7.6 + 2.12.0_RC1.
[Fri Dec 7 17:28:33 2018]CentOS Linux 7 (Core)
[Fri Dec 7 17:28:33 2018]Kernel 3.10.0-957.1.3.el7_lustre.x86_64 on an x86_64
[Fri Dec 7 17:28:33 2018]
[Fri Dec 7 17:28:33 2018]fir-io2-s1 login: [ 4678.271368] LNet: HW NUMA nodes: 4, HW CPU cores: 32, npartitions: 4
[Fri Dec 7 18:44:21 2018][ 4678.279024] alg: No test for adler32 (adler32-zlib)
[Fri Dec 7 18:44:22 2018][ 4679.080205] Lustre: Lustre: Build Version: 2.12.0_RC1
[Fri Dec 7 18:44:22 2018][ 4679.196318] LNet: Using FastReg for registration
[Fri Dec 7 18:44:22 2018][ 4679.328400] LNet: Added LNI 10.0.10.103@o2ib7 [8/256/0/180]
[Fri Dec 7 18:44:23 2018][ 4680.195333] LNet: 19041:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 4679 seconds
[Fri Dec 7 18:45:23 2018][ 4740.334804] LNetError: 19070:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) ASSERTION( md->md_rspt_ptr == ((void *)0) ) failed:
[Fri Dec 7 18:45:23 2018][ 4740.346280] LNetError: 19070:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) LBUG
[Fri Dec 7 18:45:23 2018][ 4740.353596] Pid: 19070, comm: monitor_thread 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[Fri Dec 7 18:45:23 2018][ 4740.363800] Call Trace:
[Fri Dec 7 18:45:23 2018][ 4740.366262] [<ffffffffc094c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[Fri Dec 7 18:45:23 2018][ 4740.372832] [<ffffffffc094c87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[Fri Dec 7 18:45:23 2018][ 4740.379068] [<ffffffffc09e8667>] lnet_attach_rsp_tracker.isra.29+0xd7/0xe0 [lnet]
[Fri Dec 7 18:45:23 2018][ 4740.386685] [<ffffffffc09effb1>] LNetGet+0x5d1/0xa90 [lnet]
[Fri Dec 7 18:45:23 2018][ 4740.392395] [<ffffffffc09f8399>] lnet_check_routers+0x399/0xbf0 [lnet]
[Fri Dec 7 18:45:23 2018][ 4740.399054] [<ffffffffc09f18cf>] lnet_monitor_thread+0x4ff/0x560 [lnet]
[Fri Dec 7 18:45:23 2018][ 4740.405811] [<ffffffffb76c1c31>] kthread+0xd1/0xe0
[Fri Dec 7 18:45:23 2018][ 4740.410725] [<ffffffffb7d74c24>] ret_from_fork_nospec_begin+0xe/0x21
[Fri Dec 7 18:45:23 2018][ 4740.417215] [<ffffffffffffffff>] 0xffffffffffffffff
[Fri Dec 7 18:45:23 2018][ 4740.422237] Kernel panic - not syncing: LBUG
Config is CentOS 7.6 (3.10.0-957.1.3.el7_lustre.x86_64), Lustre 2.12.0 RC1, in-kernel OFED stack, AMD EPYC server (Dell R6415), mlx5 EDR.
The previous, perfectly working config was CentOS 7.5 (3.10.0-862.14.4.el7_lustre.x86_64), Lustre 2.11.56_3_g4e42995, in-kernel OFED stack, AMD EPYC server (Dell R6415), mlx5 EDR.
Thanks,
Stephane
Phew!