[LU-11734] LNet crashes with 2.12.0-rc1: lnet_attach_rsp_tracker() ASSERTION(md->md_rspt_ptr == ((void *)0)) failed Created: 05/Dec/18  Updated: 15/Dec/18  Resolved: 08/Dec/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre 2.12.0-rc1


Issue Links:
Related
is related to LU-10669 Potential race condition when unlinki... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In my testing I see this at bringup:

2018-12-05T14:24:14.274614-05:00 ninja84.ccs.ornl.gov kernel: Lustre: Lustre: Build Version: 2.12.0_RC1_dirty
2018-12-05T14:24:14.671611-05:00 ninja84.ccs.ornl.gov kernel: LNetError: 3759:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) ASSERTION( md->md_rspt_ptr == ((void *)0) ) failed:
2018-12-05T14:24:14.671682-05:00 ninja84.ccs.ornl.gov kernel: LNetError: 3759:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) LBUG
2018-12-05T14:24:14.671716-05:00 ninja84.ccs.ornl.gov kernel: Pid: 3759, comm: monitor_thread 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018
2018-12-05T14:24:14.681179-05:00 ninja84.ccs.ornl.gov kernel: Call Trace:
2018-12-05T14:24:14.685051-05:00 ninja84.ccs.ornl.gov kernel: Lustre: Echo OBD driver; http://www.lustre.org/
2018-12-05T14:24:14.695841-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc0c267cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
2018-12-05T14:24:14.697263-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc0c2687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
2018-12-05T14:24:14.703501-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc102c257>] lnet_attach_rsp_tracker.isra.29+0xd7/0xe0 [lnet]
2018-12-05T14:24:14.711154-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc1033ba1>] LNetGet+0x5d1/0xa90 [lnet]
2018-12-05T14:24:14.716870-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc103c0b9>] lnet_check_routers+0x399/0xbf0 [lnet]
2018-12-05T14:24:14.723556-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffc10354bf>] lnet_monitor_thread+0x4ff/0x560 [lnet]
2018-12-05T14:24:14.730325-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffff8dabb161>] kthread+0xd1/0xe0
2018-12-05T14:24:14.735265-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffff8e120677>] ret_from_fork_nospec_end+0x0/0x39
2018-12-05T14:24:14.741596-05:00 ninja84.ccs.ornl.gov kernel: [<ffffffffffffffff>] 0xffffffffffffffff
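
For reference, the failing LASSERT enforces that an MD carries at most one response tracker at a time. A minimal userspace sketch of that invariant (the types and helpers below are simplified stand-ins, not the actual lnet source):

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct rsp_tracker { int deadline; };

struct memory_descriptor {
    struct rsp_tracker *md_rspt_ptr;    /* at most one tracker per MD */
};

/* Mirrors the failing check: the MD handed in must have no tracker yet. */
static void attach_rsp_tracker(struct memory_descriptor *md, int deadline)
{
    assert(md->md_rspt_ptr == NULL);    /* the ASSERTION that fired */
    md->md_rspt_ptr = malloc(sizeof(*md->md_rspt_ptr));
    md->md_rspt_ptr->deadline = deadline;
}

static void detach_rsp_tracker(struct memory_descriptor *md)
{
    free(md->md_rspt_ptr);
    md->md_rspt_ptr = NULL;
}

int main(void)
{
    struct memory_descriptor md = { .md_rspt_ptr = NULL };

    attach_rsp_tracker(&md, 10);   /* first GET on this MD: fine          */
    /* attach_rsp_tracker(&md, 20);   reusing the MD without detaching    */
    /*                                 first would abort, like the LBUG   */
    detach_rsp_tracker(&md);
    attach_rsp_tracker(&md, 20);   /* detach-then-attach is safe          */
    detach_rsp_tracker(&md);
    return 0;
}

The crash path (LNetGet called from lnet_check_routers) suggests the router checker issued a GET against an MD whose md_rspt_ptr was still populated from an earlier use.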



 Comments   
Comment by Andreas Dilger [ 05/Dec/18 ]

Amir, could you please take a look?


James, could you please provide some basic info about your network config: LND type, routing, module parameters, and OFED version?

Comment by James A Simmons [ 05/Dec/18 ]

Basic IB-to-IB mlx4 using the default OFED stack on x86 servers.

Comment by Amir Shehata (Inactive) [ 05/Dec/18 ]

Working on a patch; should have it up soon.

Comment by Gerrit Updater [ 05/Dec/18 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33794
Subject: LU-11734 lnet: handle multi-md usage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 14bf25aa1c1723ce4a0a88341e6e62299994cb8a

Comment by Andreas Dilger [ 05/Dec/18 ]

Does this look related to the issues seen in LU-10669?

Comment by Amir Shehata (Inactive) [ 06/Dec/18 ]

I don't think this particular patch is relevant to that ticket, since LU-10669 predates the health feature. But it could be related in the sense that an MD can be used multiple times, which could lead to a corner-case scenario triggering this assert:

lib-msg.c:346:lnet_msg_detach_md()) ASSERTION( md->md_refcount >= 0
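
As a rough illustration of the multi-use hazard (simplified types below, not the actual lnet code): the MD's refcount goes up once per attached message, and that assert guards against detaching more messages than were ever attached.

#include <assert.h>

struct md {
    int md_refcount;               /* messages currently using this MD */
};

static void msg_attach(struct md *m) { m->md_refcount++; }

static void msg_detach(struct md *m)
{
    m->md_refcount--;
    assert(m->md_refcount >= 0);   /* the guard from lib-msg.c */
}

int main(void)
{
    struct md m = { 0 };

    msg_attach(&m);                /* message 1                          */
    msg_attach(&m);                /* message 2: same MD used twice      */
    msg_detach(&m);
    msg_detach(&m);                /* balanced: refcount back to 0       */
    /* An extra, unbalanced detach here would trip the assert. */
    return 0;
}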

Comment by James A Simmons [ 06/Dec/18 ]

Patch seems to resolve my bringup issues. I'm doing further testing to make sure no other regressions exist.

Comment by Stephane Thiell [ 08/Dec/18 ]

Just wanted to say that we hit the same issue with 2.12.0_RC1; it's super unstable for us. I had used 2.11.56_3_g4e42995 quite extensively and didn't hit this issue before the upgrade to CentOS 7.6 + 2.12.0_RC1.

[Fri Dec  7 17:28:33 2018]CentOS Linux 7 (Core)
[Fri Dec  7 17:28:33 2018]Kernel 3.10.0-957.1.3.el7_lustre.x86_64 on an x86_64
[Fri Dec  7 17:28:33 2018]
[Fri Dec  7 17:28:33 2018]fir-io2-s1 login: [ 4678.271368] LNet: HW NUMA nodes: 4, HW CPU cores: 32, npartitions: 4
[Fri Dec  7 18:44:21 2018][ 4678.279024] alg: No test for adler32 (adler32-zlib)
[Fri Dec  7 18:44:22 2018][ 4679.080205] Lustre: Lustre: Build Version: 2.12.0_RC1
[Fri Dec  7 18:44:22 2018][ 4679.196318] LNet: Using FastReg for registration
[Fri Dec  7 18:44:22 2018][ 4679.328400] LNet: Added LNI 10.0.10.103@o2ib7 [8/256/0/180]
[Fri Dec  7 18:44:23 2018][ 4680.195333] LNet: 19041:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 4679 seconds
[Fri Dec  7 18:45:23 2018][ 4740.334804] LNetError: 19070:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) ASSERTION( md->md_rspt_ptr == ((void *)0) ) failed:
[Fri Dec  7 18:45:23 2018][ 4740.346280] LNetError: 19070:0:(lib-move.c:4429:lnet_attach_rsp_tracker()) LBUG
[Fri Dec  7 18:45:23 2018][ 4740.353596] Pid: 19070, comm: monitor_thread 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[Fri Dec  7 18:45:23 2018][ 4740.363800] Call Trace:
[Fri Dec  7 18:45:23 2018][ 4740.366262]  [<ffffffffc094c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[Fri Dec  7 18:45:23 2018][ 4740.372832]  [<ffffffffc094c87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[Fri Dec  7 18:45:23 2018][ 4740.379068]  [<ffffffffc09e8667>] lnet_attach_rsp_tracker.isra.29+0xd7/0xe0 [lnet]
[Fri Dec  7 18:45:23 2018][ 4740.386685]  [<ffffffffc09effb1>] LNetGet+0x5d1/0xa90 [lnet]
[Fri Dec  7 18:45:23 2018][ 4740.392395]  [<ffffffffc09f8399>] lnet_check_routers+0x399/0xbf0 [lnet]
[Fri Dec  7 18:45:23 2018][ 4740.399054]  [<ffffffffc09f18cf>] lnet_monitor_thread+0x4ff/0x560 [lnet]
[Fri Dec  7 18:45:23 2018][ 4740.405811]  [<ffffffffb76c1c31>] kthread+0xd1/0xe0
[Fri Dec  7 18:45:23 2018][ 4740.410725]  [<ffffffffb7d74c24>] ret_from_fork_nospec_begin+0xe/0x21
[Fri Dec  7 18:45:23 2018][ 4740.417215]  [<ffffffffffffffff>] 0xffffffffffffffff
[Fri Dec  7 18:45:23 2018][ 4740.422237] Kernel panic - not syncing: LBUG

Config is CentOS 7.6 (3.10.0-957.1.3.el7_lustre.x86_64), Lustre 2.12.0 RC1, in-kernel OFED stack, AMD EPYC server (Dell R6415), mlx5 EDR.

Perfectly working config was CentOS 7.5 (3.10.0-862.14.4.el7_lustre.x86_64), Lustre 2.11.56_3_g4e42995, in-kernel OFED stack, AMD EPYC server (Dell R6415), mlx5 EDR.

Thanks,
Stephane

Comment by Stephane Thiell [ 08/Dec/18 ]

I tried again with 2.12.0_RC1 plus Amir's patch at https://review.whamcloud.com/#/c/33794/1 applied, and the OSS oopsed:

[  496.611635] LNet: HW NUMA nodes: 4, HW CPU cores: 32, npartitions: 4
[  496.619204] alg: No test for adler32 (adler32-zlib)
[  497.421277] Lustre: Lustre: Build Version: 2.12.0_RC1
[  497.541113] LNet: Using FastReg for registration
[  497.674010] LNet: Added LNI 10.0.10.104@o2ib7 [8/256/0/180]
[  498.540700] LNet: 120701:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 497 seconds
[  499.695014] md: md5 stopped.
[  499.728157] async_tx: api initialized (async)
[  499.734214] xor: automatically using best checksumming function:
[  499.749692]    avx       :  9660.000 MB/sec
[  499.781700] raid6: sse2x1   gen()  6117 MB/s
[  499.802691] raid6: sse2x2   gen() 11347 MB/s
[  499.823694] raid6: sse2x4   gen() 12980 MB/s
[  499.844693] raid6: avx2x1   gen() 14261 MB/s
[  499.865692] raid6: avx2x2   gen() 18882 MB/s
[  499.886692] raid6: avx2x4   gen() 18851 MB/s
[  499.890963] raid6: using algorithm avx2x2 gen() (18882 MB/s)
[  499.896626] raid6: using avx2x2 recovery algorithm
[  499.917975] md/raid:md5: device dm-55 operational as raid disk 0
[  499.923990] md/raid:md5: device dm-11 operational as raid disk 9
[  499.930008] md/raid:md5: device dm-10 operational as raid disk 8
[  499.936020] md/raid:md5: device dm-1 operational as raid disk 7
[  499.941966] md/raid:md5: device dm-83 operational as raid disk 6
[  499.947996] md/raid:md5: device sdau operational as raid disk 5
[  499.953926] md/raid:md5: device dm-72 operational as raid disk 4
[  499.959939] md/raid:md5: device dm-61 operational as raid disk 3
[  499.965954] md/raid:md5: device dm-60 operational as raid disk 2
[  499.971970] md/raid:md5: device dm-116 operational as raid disk 1
[  499.979129] md/raid:md5: raid level 6 active with 10 out of 10 devices, algorithm 2
[  500.021173] md5: detected capacity change from 0 to 64011422924800
[  500.044705] md: md3 stopped.
[  500.077660] md/raid:md3: device dm-87 operational as raid disk 0
[  500.083732] md/raid:md3: device dm-86 operational as raid disk 9
[  500.089794] md/raid:md3: device dm-7 operational as raid disk 8
[  500.095771] md/raid:md3: device dm-81 operational as raid disk 7
[  500.101791] md/raid:md3: device dm-80 operational as raid disk 6
[  500.107802] md/raid:md3: device dm-69 operational as raid disk 5
[  500.113821] md/raid:md3: device dm-68 operational as raid disk 4
[  500.119838] md/raid:md3: device dm-111 operational as raid disk 3
[  500.125948] md/raid:md3: device dm-110 operational as raid disk 2
[  500.132043] md/raid:md3: device dm-88 operational as raid disk 1
[  500.138901] md/raid:md3: raid level 6 active with 10 out of 10 devices, algorithm 2
[  500.166972] md3: detected capacity change from 0 to 64011422924800
[  500.181231] md: md11 stopped.
[  500.205433] md/raid:md11: device dm-91 operational as raid disk 0
[  500.211589] md/raid:md11: device dm-48 operational as raid disk 9
[  500.217708] md/raid:md11: device dm-47 operational as raid disk 8
[  500.223824] md/raid:md11: device dm-35 operational as raid disk 7
[  500.229925] md/raid:md11: device dm-34 operational as raid disk 6
[  500.236037] md/raid:md11: device dm-104 operational as raid disk 5
[  500.242310] md/raid:md11: device dm-25 operational as raid disk 4
[  500.248439] md/raid:md11: device dm-100 operational as raid disk 3
[  500.254664] md/raid:md11: device dm-99 operational as raid disk 2
[  500.260789] md/raid:md11: device dm-118 operational as raid disk 1
[  500.267842] md/raid:md11: raid level 6 active with 10 out of 10 devices, algorithm 2
[  500.297329] md11: detected capacity change from 0 to 64011422924800
[  500.306119] md: md1 stopped.
[  500.341217] md/raid:md1: device dm-58 operational as raid disk 0
[  500.347236] md/raid:md1: device dm-6 operational as raid disk 9
[  500.353163] md/raid:md1: device dm-5 operational as raid disk 8
[  500.359089] md/raid:md1: device dm-76 operational as raid disk 7
[  500.365110] md/raid:md1: device dm-75 operational as raid disk 6
[  500.371134] md/raid:md1: device dm-64 operational as raid disk 5
[  500.377156] md/raid:md1: device dm-63 operational as raid disk 4
[  500.383168] md/raid:md1: device dm-109 operational as raid disk 3
[  500.389267] md/raid:md1: device dm-56 operational as raid disk 2
[  500.395287] md/raid:md1: device dm-67 operational as raid disk 1
[  500.401961] md/raid:md1: raid level 6 active with 10 out of 10 devices, algorithm 2
[  500.430931] md1: detected capacity change from 0 to 64011422924800
[  500.444576] md: md7 stopped.
[  500.480725] md/raid:md7: device dm-98 operational as raid disk 0
[  500.486749] md/raid:md7: device dm-40 operational as raid disk 9
[  500.492776] md/raid:md7: device dm-39 operational as raid disk 8
[  500.498794] md/raid:md7: device dm-27 operational as raid disk 7
[  500.504806] md/raid:md7: device dm-26 operational as raid disk 6
[  500.510829] md/raid:md7: device dm-19 operational as raid disk 5
[  500.516846] md/raid:md7: device dm-18 operational as raid disk 4
[  500.522857] md/raid:md7: device dm-117 operational as raid disk 3
[  500.528957] md/raid:md7: device dm-94 operational as raid disk 2
[  500.534972] md/raid:md7: device dm-21 operational as raid disk 1
[  500.541667] md/raid:md7: raid level 6 active with 10 out of 10 devices, algorithm 2
[  500.593292] md7: detected capacity change from 0 to 64011422924800
[  500.602526] md: md9 stopped.
[  500.645764] md/raid:md9: device dm-49 operational as raid disk 0
[  500.651782] md/raid:md9: device dm-44 operational as raid disk 9
[  500.657811] md/raid:md9: device dm-43 operational as raid disk 8
[  500.663832] md/raid:md9: device dm-107 operational as raid disk 7
[  500.669945] md/raid:md9: device dm-31 operational as raid disk 6
[  500.675959] md/raid:md9: device dm-22 operational as raid disk 5
[  500.681974] md/raid:md9: device dm-103 operational as raid disk 4
[  500.688074] md/raid:md9: device dm-97 operational as raid disk 3
[  500.694090] md/raid:md9: device dm-96 operational as raid disk 2
[  500.700104] md/raid:md9: device dm-50 operational as raid disk 1
[  500.708843] md/raid:md9: raid level 6 active with 10 out of 10 devices, algorithm 2
[  500.748835] md9: detected capacity change from 0 to 64011422924800
[  500.752863] LDISKFS-fs (md5): file extents enabled, maximum tree depth=5
[  500.841913] LDISKFS-fs (md3): file extents enabled, maximum tree depth=5
[  501.055692] LDISKFS-fs (md5): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
[  501.143861] LDISKFS-fs (md3): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
[  501.167914] LDISKFS-fs (md1): file extents enabled, maximum tree depth=5
[  501.182682] LDISKFS-fs (md11): file extents enabled, maximum tree depth=5
[  501.508363] LDISKFS-fs (md11): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
[  501.510261] LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
[  501.521916] LDISKFS-fs (md9): file extents enabled, maximum tree depth=5
[  501.554606] BUG: unable to handle kernel paging request at 000000000001b780
[  501.561618] IP: [<ffffffff901121d0>] native_queued_spin_lock_slowpath+0x110/0x200
[  501.569135] PGD 1032e6c067 PUD 1032e6d067 PMD 0 
[  501.573830] Oops: 0002 [#1] SMP 
[  501.577114] Modules linked in: ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) raid456 async_raid6_recov async_memcpy async_pq raid6_pq libcrc32c async_xor xor async_tx ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache mpt2sas mptctl mptbase dell_rbu vfat fat rpcrdma sunrpc dm_service_time ib_iser libiscsi scsi_transport_iscsi dcdbas amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr k10temp i2c_piix4 ccp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_multipath ses enclosure dm_mod ipmi_si mlx5_ib ipmi_devintf sg ipmi_msghandler ib_core acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea mlx5_core sysfillrect sysimgblt fb_sys_fops ttm ahci mlxfw devlink libahci crct10dif_pclmul tg3 crct10dif_common ptp drm libata megaraid_sas crc32c_intel drm_panel_orientation_quirks pps_core mpt3sas(OE) raid_class scsi_transport_sas
[  501.680694] CPU: 30 PID: 121538 Comm: mount.lustre Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
[  501.693284] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.3.6 04/20/2018
[  501.700853] task: ffff8b02a92fe180 ti: ffff8b02b2e64000 task.ti: ffff8b02b2e64000
[  501.708330] RIP: 0010:[<ffffffff901121d0>]  [<ffffffff901121d0>] native_queued_spin_lock_slowpath+0x110/0x200
[  501.718272] RSP: 0018:ffff8b02b2e674b8  EFLAGS: 00010002
[  501.723583] RAX: 00000000000016e6 RBX: 0000000000000287 RCX: 0000000000f10000
[  501.730717] RDX: 000000000001b780 RSI: 00000000b7318240 RDI: ffff8b22b7318fc0
[  501.737850] RBP: ffff8b02b2e674b8 R08: ffff8b22bf7db780 R09: 0000000000000000
[  501.744982] R10: 0000000000000002 R11: 00000000000002a0 R12: 0000000000000000
[  501.752116] R13: ffff8b2256e55800 R14: ffff8b22b7318fc0 R15: 0000000000000002
[  501.759249] FS:  00007fc3f2cd5840(0000) GS:ffff8b22bf7c0000(0000) knlGS:0000000000000000
[  501.767335] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  501.773079] CR2: 000000000001b780 CR3: 0000001034c28000 CR4: 00000000003407e0
[  501.780213] Call Trace:
[  501.782671]  [<ffffffff9075bfcb>] queued_spin_lock_slowpath+0xb/0xf
[  501.788943]  [<ffffffff9076a507>] _raw_spin_lock_irqsave+0x37/0x40
[  501.795132]  [<ffffffffc06c1bab>] ib_fmr_pool_map_phys+0x4b/0x300 [ib_core]
[  501.802095]  [<ffffffffc0bfd6cc>] kiblnd_fmr_pool_map+0x15c/0x690 [ko2iblnd]
[  501.809148]  [<ffffffffc0c01965>] kiblnd_map_tx.isra.28+0x155/0x470 [ko2iblnd]
[  501.816373]  [<ffffffffc0c03216>] kiblnd_setup_rd_iov+0x2f6/0x400 [ko2iblnd]
[  501.820899] LDISKFS-fs (md7): file extents enabled, maximum tree depth=5
[  501.825866] LDISKFS-fs (md9): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
[  501.840808]  [<ffffffffc0c06a6a>] kiblnd_send+0x5da/0xa20 [ko2iblnd]
[  501.847176]  [<ffffffffc097a0e4>] lnet_ni_send+0x44/0xd0 [lnet]
[  501.853102]  [<ffffffffc0981d92>] lnet_send+0x82/0x1c0 [lnet]
[  501.858855]  [<ffffffffc098219c>] LNetPut+0x2cc/0xb50 [lnet]
[  501.864561]  [<ffffffffc0c935d6>] ptl_send_buf+0x146/0x530 [ptlrpc]
[  501.870855]  [<ffffffffc0c95f1d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]
[  501.877146]  [<ffffffffc0c8ab60>] ptlrpc_send_new_req+0x450/0xa60 [ptlrpc]
[  501.884045]  [<ffffffffc0c8f788>] ptlrpc_set_wait+0x3f8/0x8d0 [ptlrpc]
[  501.890600]  [<ffffffffc0a765e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
[  501.897592]  [<ffffffffc0c9a2e0>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc]
[  501.904491]  [<ffffffffc0c9bbaa>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]
[  501.911560]  [<ffffffffc0c8fce3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[  501.918173]  [<ffffffffc0f24b44>] mgc_target_register+0x134/0x4c0 [mgc]
[  501.924785]  [<ffffffffc0f2862b>] mgc_set_info_async+0x37b/0x1610 [mgc]
[  501.931407]  [<ffffffffc08fef07>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[  501.938036]  [<ffffffffc0a85b7b>] server_start_targets+0x116b/0x2a20 [obdclass]
[  501.945362]  [<ffffffffc0a5ab10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
[  501.952421]  [<ffffffffc0a884fc>] server_fill_super+0x10cc/0x1890 [obdclass]
[  501.959487]  [<ffffffffc0a5d828>] lustre_fill_super+0x328/0x950 [obdclass]
[  501.966375]  [<ffffffffc0a5d500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
[  501.973945]  [<ffffffff902452cf>] mount_nodev+0x4f/0xb0
[  501.979185]  [<ffffffffc0a55908>] lustre_mount+0x38/0x60 [obdclass]
[  501.985452]  [<ffffffff90245e4e>] mount_fs+0x3e/0x1b0
[  501.990507]  [<ffffffff902639e7>] vfs_kern_mount+0x67/0x110
[  501.996076]  [<ffffffff9026600f>] do_mount+0x1ef/0xce0
[  502.001218]  [<ffffffff9023e2aa>] ? __check_object_size+0x1ca/0x250
[  502.007485]  [<ffffffff9021c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
[  502.013922]  [<ffffffff90266e43>] SyS_mount+0x83/0xd0
[  502.018977]  [<ffffffff90774ddb>] system_call_fastpath+0x22/0x27
[  502.024980] Code: 87 47 02 c1 e0 10 45 31 c9 85 c0 74 44 48 89 c2 c1 e8 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 80 b7 01 00 48 03 14 c5 60 b9 d4 90 <4c> 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 41 8b 40 
[  502.045501] RIP  [<ffffffff901121d0>] native_queued_spin_lock_slowpath+0x110/0x200
[  502.053096]  RSP <ffff8b02b2e674b8>
[  502.056589] CR2: 000000000001b780

Comment by Stephane Thiell [ 08/Dec/18 ]

Slightly different symptom on the MGS/MDS with 2.12.0_RC1 + Amir's patch, but definitely unusable:

fir-md1-s1 login: [  497.904084] LNet: HW NUMA nodes: 4, HW CPU cores: 48, npartitions: 4
[  497.913010] alg: No test for adler32 (adler32-zlib)
[  498.713626] Lustre: Lustre: Build Version: 2.12.0_RC1
[  498.836281] LNet: Using FastReg for registration
[  498.968174] LNet: Added LNI 10.0.10.51@o2ib7 [8/256/0/180]
[  499.835260] LNet: 20661:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 498 seconds
[  500.088147] LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
[  500.088147] 
[  542.110957] LDISKFS-fs (dm-1): recovery complete
[  542.115778] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[  542.437699] Lustre: MGS: Connection restored to ebe34e0e-4031-a65f-5f91-0d1e392e8147 (at 0@lo)
[  543.013105] LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
[  543.013105] 
[  543.014468] LDISKFS-fs warning (device dm-4): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
[  543.014468] 
[  558.768986] Lustre: MGS: Connection restored to 73368e2a-0b82-5939-f05f-1ad2a79768da (at 10.9.101.59@o2ib4)
[  561.842373] Lustre: MGS: Connection restored to d9b25a46-8406-5014-40e9-a5748f9ad5c5 (at 10.0.10.105@o2ib7)
[  561.852127] Lustre: Skipped 1 previous similar message
[  567.115808] Lustre: MGS: Connection restored to a8053d09-bfcd-3aa9-c515-8645bbb097c5 (at 10.8.0.68@o2ib6)
[  571.051218] Lustre: MGS: Received new LWP connection from 10.9.101.59@o2ib4, removing former export from same NID
[  572.706429] Lustre: MGS: Connection restored to 38b4f0c2-db8c-d604-7e00-2c04b790499d (at 10.9.0.63@o2ib4)
[  572.716088] Lustre: Skipped 6 previous similar messages
[  573.921768] Lustre: MGS: Received new LWP connection from 10.0.10.105@o2ib7, removing former export from same NID
[  575.260500] Lustre: MGS: Received new LWP connection from 10.9.0.61@o2ib4, removing former export from same NID
[  578.088401] Lustre: MGS: Received new LWP connection from 10.9.101.59@o2ib4, removing former export from same NID
[  580.973963] Lustre: MGS: Connection restored to d9b25a46-8406-5014-40e9-a5748f9ad5c5 (at 10.0.10.105@o2ib7)
[  580.985864] Lustre: Skipped 6 previous similar messages
[  582.286596] Lustre: MGS: Received new LWP connection from 10.9.0.61@o2ib4, removing former export from same NID
[  582.296688] Lustre: Skipped 2 previous similar messages
[  583.358103] LustreError: 20965:0:(ldlm_lib.c:3248:target_bulk_io()) @@@ timeout on bulk READ after 6+0s  req@ffff8f086f7d0450 x1619041162737216/t0(0) o256->d1fa3f21-b7f3-5a11-e34d-8236124081cd@10.9.0.1@o2ib4:489/0 lens 304/240 e 0 to 0 dl 1544243514 ref 1 fl Interpret:/0/0 rc 0/0
[  585.015471] LDISKFS-fs (dm-4): file extents enabled, maximum tree depth=5
[  585.067346] LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
[  585.315122] LDISKFS-fs (dm-4): recovery complete
[  585.319941] LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
[  585.354354] LDISKFS-fs (dm-2): recovery complete
[  585.358987] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
[  590.570227] NMI watchdog: Watchdog detected hard LOCKUP on cpu 22
[  590.576145] Modules linked in: mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_ucm rpcrdma rdma_ucm ib_uverbs ib_iser ib_umad rdma_cm iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm mlx5_ib ib_core mpt2sas mptctl mptbase dell_rbu sunrpc vfat fat dm_round_robin amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_multipath dcdbas ses pcspkr dm_mod ipmi_si enclosure ipmi_devintf i2c_piix4 sg ccp k10temp ipmi_msghandler acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci mlx5_core crct10dif_pclmul crct10dif_common libahci mlxfw devlink drm tg3 crc32c_intel libata ptp megaraid_sas drm_panel_orientation_quirks pps_core mpt3sas(OE) raid_class scsi_transport_sas
[  590.671805] CPU: 22 PID: 20866 Comm: ll_mgs_0002 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
[  590.684220] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.3.6 04/20/2018
[  590.691786] Call Trace:
[  590.694241]  <NMI>  [<ffffffff87361e41>] dump_stack+0x19/0x1b
[  590.700025]  [<ffffffff86d494c5>] watchdog_overflow_callback+0x135/0x140
[  590.706726]  [<ffffffff86da1297>] __perf_event_overflow+0x57/0x100
[  590.712910]  [<ffffffff86daa904>] perf_event_overflow+0x14/0x20
[  590.718831]  [<ffffffff86c055f0>] x86_pmu_handle_irq+0x140/0x1a0
[  590.724838]  [<ffffffff86f77c08>] ? ioremap_page_range+0x2e8/0x480
[  590.731018]  [<ffffffff86dfb854>] ? vunmap_page_range+0x234/0x470
[  590.737106]  [<ffffffff86dfbaa1>] ? unmap_kernel_range_noflush+0x11/0x20
[  590.743811]  [<ffffffff8703d49e>] ? ghes_copy_tofrom_phys+0x10e/0x210
[  590.750254]  [<ffffffff8703d640>] ? ghes_read_estatus+0xa0/0x190
[  590.756263]  [<ffffffff8736b031>] perf_event_nmi_handler+0x31/0x50
[  590.762440]  [<ffffffff8736c8fc>] nmi_handle.isra.0+0x8c/0x150
[  590.768274]  [<ffffffff8736cb1d>] do_nmi+0x15d/0x460
[  590.773239]  [<ffffffff8736bd69>] end_repeat_nmi+0x1e/0x81
[  590.778731]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
[  590.786127]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
[  590.793521]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
[  590.800910]  <EOE>  [<ffffffff8735bfcb>] queued_spin_lock_slowpath+0xb/0xf
[  590.807812]  [<ffffffff8736a507>] _raw_spin_lock_irqsave+0x37/0x40
[  590.814001]  [<ffffffffc090abab>] ib_fmr_pool_map_phys+0x4b/0x300 [ib_core]
[  590.820965]  [<ffffffffc0ec66cc>] kiblnd_fmr_pool_map+0x15c/0x690 [ko2iblnd]
[  590.828017]  [<ffffffffc0eca965>] kiblnd_map_tx.isra.28+0x155/0x470 [ko2iblnd]
[  590.835235]  [<ffffffffc0ecc216>] kiblnd_setup_rd_iov+0x2f6/0x400 [ko2iblnd]
[  590.842282]  [<ffffffffc0ecfa6a>] kiblnd_send+0x5da/0xa20 [ko2iblnd]
[  590.848645]  [<ffffffffc0c430e4>] lnet_ni_send+0x44/0xd0 [lnet]
[  590.854569]  [<ffffffffc0c4ad92>] lnet_send+0x82/0x1c0 [lnet]
[  590.860323]  [<ffffffffc0c4b19c>] LNetPut+0x2cc/0xb50 [lnet]
[  590.866029]  [<ffffffffc0f5c5d6>] ptl_send_buf+0x146/0x530 [ptlrpc]
[  590.872302]  [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[  590.879378]  [<ffffffffc0f5f98b>] ptlrpc_send_reply+0x29b/0x840 [ptlrpc]
[  590.886109]  [<ffffffffc0f1e94e>] target_send_reply_msg+0x8e/0x170 [ptlrpc]
[  590.893092]  [<ffffffffc0f28e0e>] target_send_reply+0x30e/0x730 [ptlrpc]
[  590.899823]  [<ffffffffc0f66877>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
[  590.907607]  [<ffffffffc0fccf07>] tgt_request_handle+0x697/0x1580 [ptlrpc]
[  590.914514]  [<ffffffffc0fa6a51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[  590.922088]  [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[  590.929161]  [<ffffffffc0f7192b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[  590.936846]  [<ffffffffc0f6e7b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[  590.943633]  [<ffffffff86cd67c2>] ? default_wake_function+0x12/0x20
[  590.949898]  [<ffffffff86ccba9b>] ? __wake_up_common+0x5b/0x90
[  590.955757]  [<ffffffffc0f7525c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[  590.962050]  [<ffffffffc0f74760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[  590.969441]  [<ffffffff86cc1c31>] kthread+0xd1/0xe0
[  590.974320]  [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40
[  590.980413]  [<ffffffff87374c24>] ret_from_fork_nospec_begin+0xe/0x21
[  590.986850]  [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40
[  590.992945] Kernel panic - not syncing: Hard LOCKUP
[  590.997823] CPU: 22 PID: 20866 Comm: ll_mgs_0002 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
[  591.010241] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.3.6 04/20/2018
[  591.017807] Call Trace:
[  591.020262]  <NMI>  [<ffffffff87361e41>] dump_stack+0x19/0x1b
[  591.026035]  [<ffffffff8735b550>] panic+0xe8/0x21f
[  591.030826]  [<ffffffff87374c24>] ? ret_from_fork_nospec_begin+0xe/0x21
[  591.037442]  [<ffffffff86c9739f>] nmi_panic+0x3f/0x40
[  591.042492]  [<ffffffff86d494b1>] watchdog_overflow_callback+0x121/0x140
[  591.049193]  [<ffffffff86da1297>] __perf_event_overflow+0x57/0x100
[  591.055370]  [<ffffffff86daa904>] perf_event_overflow+0x14/0x20
[  591.061292]  [<ffffffff86c055f0>] x86_pmu_handle_irq+0x140/0x1a0
[  591.067296]  [<ffffffff86f77c08>] ? ioremap_page_range+0x2e8/0x480
[  591.073477]  [<ffffffff86dfb854>] ? vunmap_page_range+0x234/0x470
[  591.079570]  [<ffffffff86dfbaa1>] ? unmap_kernel_range_noflush+0x11/0x20
[  591.086268]  [<ffffffff8703d49e>] ? ghes_copy_tofrom_phys+0x10e/0x210
[  591.092709]  [<ffffffff8703d640>] ? ghes_read_estatus+0xa0/0x190
[  591.098714]  [<ffffffff8736b031>] perf_event_nmi_handler+0x31/0x50
[  591.104893]  [<ffffffff8736c8fc>] nmi_handle.isra.0+0x8c/0x150
[  591.110726]  [<ffffffff8736cb1d>] do_nmi+0x15d/0x460
[  591.115694]  [<ffffffff8736bd69>] end_repeat_nmi+0x1e/0x81
[  591.121181]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
[  591.128572]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
[  591.135964]  [<ffffffff86d121e2>] ? native_queued_spin_lock_slowpath+0x122/0x200
[  591.143355]  <EOE>  [<ffffffff8735bfcb>] queued_spin_lock_slowpath+0xb/0xf
[  591.150258]  [<ffffffff8736a507>] _raw_spin_lock_irqsave+0x37/0x40
[  591.156442]  [<ffffffffc090abab>] ib_fmr_pool_map_phys+0x4b/0x300 [ib_core]
[  591.163407]  [<ffffffffc0ec66cc>] kiblnd_fmr_pool_map+0x15c/0x690 [ko2iblnd]
[  591.170452]  [<ffffffffc0eca965>] kiblnd_map_tx.isra.28+0x155/0x470 [ko2iblnd]
[  591.177672]  [<ffffffffc0ecc216>] kiblnd_setup_rd_iov+0x2f6/0x400 [ko2iblnd]
[  591.184719]  [<ffffffffc0ecfa6a>] kiblnd_send+0x5da/0xa20 [ko2iblnd]
[  591.191077]  [<ffffffffc0c430e4>] lnet_ni_send+0x44/0xd0 [lnet]
[  591.197004]  [<ffffffffc0c4ad92>] lnet_send+0x82/0x1c0 [lnet]
[  591.202759]  [<ffffffffc0c4b19c>] LNetPut+0x2cc/0xb50 [lnet]
[  591.208448]  [<ffffffffc0f5c5d6>] ptl_send_buf+0x146/0x530 [ptlrpc]
[  591.214715]  [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[  591.221786]  [<ffffffffc0f5f98b>] ptlrpc_send_reply+0x29b/0x840 [ptlrpc]
[  591.228509]  [<ffffffffc0f1e94e>] target_send_reply_msg+0x8e/0x170 [ptlrpc]
[  591.235494]  [<ffffffffc0f28e0e>] target_send_reply+0x30e/0x730 [ptlrpc]
[  591.242224]  [<ffffffffc0f66877>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
[  591.249992]  [<ffffffffc0fccf07>] tgt_request_handle+0x697/0x1580 [ptlrpc]
[  591.256897]  [<ffffffffc0fa6a51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[  591.264463]  [<ffffffffc0bc1bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[  591.271534]  [<ffffffffc0f7192b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[  591.279214]  [<ffffffffc0f6e7b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[  591.285996]  [<ffffffff86cd67c2>] ? default_wake_function+0x12/0x20
[  591.292263]  [<ffffffff86ccba9b>] ? __wake_up_common+0x5b/0x90
[  591.298124]  [<ffffffffc0f7525c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[  591.304415]  [<ffffffffc0f74760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[  591.311807]  [<ffffffff86cc1c31>] kthread+0xd1/0xe0
[  591.316685]  [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40
[  591.322780]  [<ffffffff87374c24>] ret_from_fork_nospec_begin+0xe/0x21
[  591.329219]  [<ffffffff86cc1b60>] ? insert_kthread_work+0x40/0x40
[    0.000000] Initializing cgroup subsys cpuset

Comment by Stephane Thiell [ 08/Dec/18 ]

Perhaps I missed something: do I need to apply the patch from https://review.whamcloud.com/#/c/31313/ as well? Please clarify which patches need to be applied on top of 2.12.0_RC1 if you want me to try again. Thanks!

Comment by Peter Jones [ 08/Dec/18 ]

sthiell, just wait for RC2, which should be available shortly...

Comment by Gerrit Updater [ 08/Dec/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33794/
Subject: LU-11734 lnet: handle multi-md usage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8c249097e62713baf51aec808489a86acf46748d

Comment by Stephane Thiell [ 08/Dec/18 ]

Thanks Peter!

Comment by Peter Jones [ 08/Dec/18 ]

Landed for 2.12. sthiell, please open a new ticket if there are any problems with RC2.

Comment by Stephane Thiell [ 09/Dec/18 ]

Sure, will do.

I confirm this specific problem is fixed in RC2.

Comment by Peter Jones [ 09/Dec/18 ]

Phew!
