Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Description
[602080.689569] ------------[ cut here ]------------
[602080.694306] WARNING: CPU: 23 PID: 3716 at lib/list_debug.c:29 __list_add+0x65/0xc0
[602080.701989] list_add corruption. next->prev should be prev (ffff886bae1f0a98), but was dead000000000200. (next=ffff888f64636000).
[602080.713769] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) vtsspp(OE) sep4_1(OE) socperf2_0(OE) ebtable_filter ebtables ip6table_filter ip6_tables pax(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache bridge 8021q garp mrp stp llc rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) ipt_REJECT nf_reject_ipv4 xt_conntrack iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle skx_edac edac_core vfat fat intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel
[602080.786461] hpilo hpwdt aesni_intel lrw gf128mul glue_helper ablk_helper ses enclosure mei_me cryptd mei ipmi_si pcspkr shpchp joydev ipmi_devintf wmi sg ipmi_msghandler lpc_ich acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc knem(OE) ip_tables xfs sr_mod cdrom sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect mlx5_core(OE) sysimgblt mlxfw(OE) fb_sys_fops ahci ttm libahci mlx_compat(OE) bnx2x(OE) uas tg3(OE) devlink mdio drm smartpqi(OE) crct10dif_pclmul crct10dif_common ptp scsi_transport_sas libata usb_storage libcrc32c crc32c_intel i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[602080.843594] CPU: 23 PID: 3716 Comm: kiblnd_sd_11_01 Tainted: G W OE ------------ 3.10.0-693.el7.x86_64 #1
[602080.854240] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 06/20/2018
[602080.862880] ffff88238d297c48 0000000029e8e2dd ffff88238d297bf8 ffffffff816a3d91
[602080.870425] ffff88238d297c38 ffffffff810879c8 0000001dae25b800 ffff886b81b5ac00
[602080.877971] ffff888f64636000 ffff886bae1f0a98 ffff886b81b5ac00 ffff886ba43f8030
[602080.885517] Call Trace:
[602080.888061] [<ffffffff816a3d91>] dump_stack+0x19/0x1b
[602080.893305] [<ffffffff810879c8>] __warn+0xd8/0x100
[602080.898286] [<ffffffff81087a4f>] warn_slowpath_fmt+0x5f/0x80
[602080.904140] [<ffffffff8133d8a5>] __list_add+0x65/0xc0
[602080.909388] [<ffffffffc0d8b676>] lnet_msg_commit+0x66/0x180 [lnet]
[602080.915767] [<ffffffffc0d981bd>] lnet_parse+0x3ed/0xcf0 [lnet]
[602080.921800] [<ffffffffc0fedd9b>] kiblnd_handle_rx+0x1eb/0x640 [ko2iblnd]
[602080.928701] [<ffffffffc0ff4336>] kiblnd_scheduler+0xe66/0x10a0 [ko2iblnd]
[602080.935687] [<ffffffff810ce54e>] ? dequeue_task_fair+0x41e/0x660
[602080.941889] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
[602080.947568] [<ffffffffc0ff34d0>] ? kiblnd_cq_event+0x80/0x80 [ko2iblnd]
[602080.954379] [<ffffffff810b098f>] kthread+0xcf/0xe0
[602080.959359] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[602080.965562] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[602080.971064] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
After investigating, the problem appears to be in this piece of code:
.....
	if (msg->msg_rx_committed) {
		/* forwarding msg committed for both receiving and sending */
		if (cpt != msg->msg_rx_cpt) {
			lnet_net_unlock(cpt);
			cpt2 = msg->msg_rx_cpt;
			lnet_net_lock(cpt2);
		}
		lnet_msg_decommit_rx(msg, status);
	}
......
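cpt2 is set from msg_rx_cpt and the lock for that CPT is then taken with lnet_net_lock(cpt2), but msg_rx_cpt could change between the read and the lock acquisition. In that case we hold the lock for the wrong CPT and may corrupt the msc_active list in the code below.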
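
A common way to close a window like this is to re-read the field after taking the lock and retry if it changed. The following is a minimal user-space sketch of that pattern, not the patch uploaded for this ticket. All names (demo_msg, cpt_locks, demo_lock_rx_cpt, demo_unlock_cpt, DEMO_NCPT) are hypothetical, pthread mutexes stand in for lnet_net_lock()/lnet_net_unlock(), and it assumes that any writer changes rx_cpt only while holding the lock of the partition the message currently belongs to.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define DEMO_NCPT 4	/* hypothetical number of CPU partitions */

/* Hypothetical per-partition locks, standing in for lnet_net_lock(cpt). */
static pthread_mutex_t cpt_locks[DEMO_NCPT] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

/* Hypothetical message; rx_cpt plays the role of msg->msg_rx_cpt. */
struct demo_msg {
	atomic_int rx_cpt;
};

static void demo_unlock_cpt(int cpt)
{
	pthread_mutex_unlock(&cpt_locks[cpt]);
}

/*
 * Lock the partition that owns msg and return its index.
 * The value read before locking may be stale, so it is checked again
 * once the lock is held; on a mismatch the lock is dropped and the
 * whole sequence is retried.
 */
static int demo_lock_rx_cpt(struct demo_msg *msg)
{
	for (;;) {
		int cpt2 = atomic_load(&msg->rx_cpt);

		assert(cpt2 >= 0 && cpt2 < DEMO_NCPT);
		pthread_mutex_lock(&cpt_locks[cpt2]);

		/* Still the same partition now that we hold its lock? */
		if (atomic_load(&msg->rx_cpt) == cpt2)
			return cpt2;

		/* rx_cpt moved while we were waiting: wrong lock, retry. */
		demo_unlock_cpt(cpt2);
	}
}
```

A caller would use the returned index for everything done under the lock, including the list removal, and pass the same index to demo_unlock_cpt() afterwards. Whether a retry loop of this kind suits the LNet code depends on how msg_rx_cpt is actually updated, which this sketch does not model.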
Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35654
Subject: LU-12618 lnet: list corruption
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 427d45cfde7bea2dd95a1972a621954917ea721d