Details
- Bug
- Resolution: Cannot Reproduce
- Major
- None
- Lustre 2.10.0
- None
- 3
- 9223372036854775807
Description
We have experienced both client and server crashes when running Lustre 2.10.0. I first noticed this after upgrading our servers to 2.10.0, when a client crashed a couple of times during stress tests. At the time, I was still running a 2.9.0 client. I also found this thread, which appears related:
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-August/014698.html
Lately, our servers have started crashing too: we've had about 4 or 5 crashes in the last week on various OSSes. Our clients are almost all updated to 2.10.0 or 2.10.1 now. I've included an excerpt from the dmesg obtained from the latest server crash, and I'll upload the entire vmcore-dmesg.txt shortly. Before this, the server crash dumps weren't working properly, so this is the only server crash dump we have. I do have a couple of client crash dumps. Our current server configuration is:
CentOS 7.3
kernel 3.10.0-514.26.2.el7.x86_64
lustre 2.10.0
zfs-0.7.1-1
Let me know if you need any other info. We are upgrading to Lustre 2.10.1 now in the hope that this has already been found and fixed. I couldn't find a related LU ticket, but my apologies if this is a duplicate of another one.
[91954.508837] BUG: unable to handle kernel NULL pointer dereference at (null)
[91954.510562] IP: [<ffffffff8168e99a>] _raw_spin_unlock+0xa/0x30
[91954.512269] PGD 0
[91954.513878] Oops: 0002 [#1] SMP
[91954.515491] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ko2iblnd(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core iTCO_wdt iTCO_vendor_support zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mei_me mei pcspkr sb_edac i2c_i801 edac_core lpc_ich ioatdma dm_service_time ses enclosure sg ipmi_devintf
[91954.525584] ipmi_si ipmi_msghandler shpchp wmi dm_multipath nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul sysimgblt crct10dif_common fb_sys_fops crc32c_intel ttm isci igb mlx4_core ahci megaraid_sas drm libsas libahci mpt3sas ptp pps_core libata dca raid_class i2c_algo_bit scsi_transport_sas i2c_core devlink fjes dm_mirror dm_region_hash dm_log dm_mod
[91954.534280] CPU: 14 PID: 22556 Comm: socknal_sd01_01 Tainted: P OE ------------ 3.10.0-514.26.2.el7.x86_64 #1
[91954.536043] Hardware name: Supermicro SYS-6027TR-D71FRF/X9DRT, BIOS 3.2a 08/04/2015
[91954.537761] task: ffff882008e8ce70 ti: ffff88203ccbc000 task.ti: ffff88203ccbc000
[91954.539471] RIP: 0010:[<ffffffff8168e99a>] [<ffffffff8168e99a>] _raw_spin_unlock+0xa/0x30
[91954.541242] RSP: 0018:ffff88203ccbfd08 EFLAGS: 00010202
[91954.542917] RAX: ffff8820362f1f30 RBX: ffff8820362f1dc0 RCX: 0000000000000000
[91954.544572] RDX: 000000000000ab0c RSI: 0000000000005587 RDI: 0000000000000000
[91954.546214] RBP: ffff88203ccbfd20 R08: ffff8820377d3e80 R09: 0000000000000001
[91954.547818] R10: 0000000000000400 R11: 0000000000000800 R12: 000000000002ac38
[91954.549427] R13: ffff882037691600 R14: ffff882007f6d074 R15: ffff881dace51810
[91954.550999] FS: 0000000000000000(0000) GS:ffff88207fd80000(0000) knlGS:0000000000000000
[91954.552552] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[91954.554081] CR2: 0000000000000000 CR3: 00000000019be000 CR4: 00000000001407e0
[91954.555593] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[91954.557079] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[91954.558527] Stack:
[91954.560007] ffffffffa098d8b6 ffff881f038c7000 ffff882007f6d000 ffff88203ccbfd60
[91954.561472] ffffffffa0a11841 ffff881dace51800 ffff881f038c7000 0000000000000001
[91954.562870] ffff88203cea2100 0000000000000000 ffff881f038c7010 ffff88203ccbfd90
[91954.564260] Call Trace:
[91954.565615] [<ffffffffa098d8b6>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
[91954.566966] [<ffffffffa0a11841>] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
[91954.568305] [<ffffffffa0a04770>] lnet_msg_decommit+0xd0/0x6c0 [lnet]
[91954.569604] [<ffffffffa0a050f9>] lnet_finalize+0x1e9/0x690 [lnet]
[91954.570876] [<ffffffffa0a90f45>] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
[91954.572149] [<ffffffffa0a95bb4>] ksocknal_scheduler+0x234/0x670 [ksocklnd]
[91954.573381] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[91954.574584] [<ffffffffa0a95980>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[91954.575765] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[91954.576947] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[91954.578095] [<ffffffff81697758>] ret_from_fork+0x58/0x90
[91954.579209] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[91954.580348] Code: 90 8d 8a 00 00 02 00 89 d0 f0 0f b1 0f 39 d0 75 ea b8 01 00 00 00 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f 1f 44 00 00 <66> 83 07 02 c3 90 8b 37 f0 66 83 07 02 f6 47 02 01 74 f1 55 48
[91954.582717] RIP [<ffffffff8168e99a>] _raw_spin_unlock+0xa/0x30
[91954.583886] RSP <ffff88203ccbfd08>
[91954.585020] CR2: 0000000000000000
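For anyone comparing this trace against the other crash dumps, here is a minimal Python sketch that pulls the frames out of a "Call Trace:" section like the one above. It is not part of Lustre or any crash tooling; the names `parse_call_trace`, `FRAME_RE` and `SAMPLE` are illustrative assumptions, and the regex only targets the frame format seen in this dmesg excerpt. Frames printed with a leading "?" are marked as unreliable, following the kernel's convention for addresses found on the stack that may not be part of the actual call chain.

```python
import re

# Frame lines in this dmesg look like:
#   [91954.566966] [<ffffffffa0a11841>] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
#   [91954.565615] [<ffffffffa098d8b6>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
# The regex is an assumption based on that format only.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """\
[91954.564260] Call Trace:
[91954.565615] [<ffffffffa098d8b6>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
[91954.566966] [<ffffffffa0a11841>] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
[91954.573381] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[91954.580348] Code: 90 8d 8a 00 00 02 00
"""


def parse_call_trace(dmesg_text):
    """Return a list of frame dicts from the first 'Call Trace:' section."""
    frames = []
    in_trace = False
    for line in dmesg_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match is None:
            # First non-frame line (for example "Code: ...") ends the trace.
            break
        frames.append({
            "address": int(match.group("addr"), 16),
            "symbol": match.group("symbol"),
            "offset": int(match.group("offset"), 16),
            "size": int(match.group("size"), 16),
            # None means the symbol belongs to the core kernel, not a module.
            "module": match.group("module"),
            # False when the kernel printed the frame with a leading '?'.
            "reliable": match.group("unreliable") is None,
        })
    return frames


if __name__ == "__main__":
    for frame in parse_call_trace(SAMPLE):
        marker = " " if frame["reliable"] else "?"
        module = frame["module"] or "kernel"
        print(f"{marker} {module:10s} {frame['symbol']}+{frame['offset']:#x}")
```

Running the same function over each vmcore-dmesg.txt and comparing the reliable frames would be one way to check whether the client and server crashes share a code path.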
Issue Links
- is related to
- LU-9817 Multi-Rail Crash on message free
- Resolved