[LU-12180] Crash when routerstat is running during lustre_rmmod Created: 11/Apr/19  Updated: 25/Nov/19  Resolved: 11/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.1
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Jeremy Filizetti
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

Several multi-rail systems running lnet self test


Epic/Theme: lnet
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I've seen this crash several times when "routerstat 1" is running in one window while I run lustre_rmmod. It could be a coincidence, but I didn't see it when running 2.10.7. I'm running the b2_12 branch with 4e737a6a8a0f75425255c21eb95e43d9a950193b as head.

This is in a multi-rail environment.

[ 1518.289981] BUG: unable to handle kernel paging request at fffffffffffffff0
 [ 1518.298117] IP: [<ffffffffc0c1c5c6>] cfs_percpt_number+0x6/0x10 [libcfs]
 [ 1518.305955] PGD 459c814067 PUD 459c816067 PMD 0 
 [ 1518.311665] Oops: 0000 [#1] SMP 
 [ 1518.315955] Modules linked in: lnet_selftest(OE-) ksocklnd(OE) ko2iblnd(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ucm rpcrdma sunrpc rdma_ucm ib_umad ib_uverbs ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_cm skx_edac nfit libnvdimm intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel ipmi_ssif aesni_intel lrw gf128mul
 [ 1518.394242] glue_helper ablk_helper cryptd ses enclosure pcspkr sg ipmi_si ipmi_devintf ipmi_msghandler hpilo mlx5_ib hpwdt ib_core mei_me lpc_ich mei wmi ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic uas usb_storage mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops serio_raw ttm crct10dif_pclmul crct10dif_common crc32c_intel mlx5_core drm smartpqi scsi_transport_sas mlxfw drm_panel_orientation_quirks devlink tg3 ptp pps_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet_selftest]
 [ 1518.450540] CPU: 19 PID: 50662 Comm: routerstat Kdump: loaded Tainted: G OE ------------ 3.10.0-957.5.1.el7.x86_64 #1
 [ 1518.465721] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 10/02/2018
 [ 1518.475984] task: ffff9bc19df2d140 ti: ffff9bc11bed0000 task.ti: ffff9bc11bed0000
 [ 1518.485188] RIP: 0010:[<ffffffffc0c1c5c6>] [<ffffffffc0c1c5c6>] cfs_percpt_number+0x6/0x10 [libcfs]
 [ 1518.496089] RSP: 0018:ffff9bc11bed3db0 EFLAGS: 00010296
 [ 1518.503102] RAX: 0000000000000004 RBX: ffff9ba85e34eb00 RCX: 0000000000000000
 [ 1518.511930] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: 0000000000000000
 [ 1518.520751] RBP: ffff9bc11bed3dd0 R08: 000000000001f120 R09: ffff9b633fc03700
 [ 1518.529564] R10: ffffffffc0c999cb R11: 0000000000000246 R12: 0000000000000000
 [ 1518.538349] R13: 0000000000000000 R14: 0000000000000300 R15: ffff9ba85e34eb00
 [ 1518.547109] FS: 00007f1f47adc740(0000) GS:ffff9ba99fac0000(0000) knlGS:0000000000000000
 [ 1518.556828] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [ 1518.564166] CR2: fffffffffffffff0 CR3: 00000046a18d0000 CR4: 00000000007607e0
 [ 1518.572891] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [ 1518.581607] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [ 1518.590306] PKRU: 55555554
 [ 1518.594523] Call Trace:
 [ 1518.598455] [<ffffffffc0c6c96b>] ? lnet_counters_get_common+0xeb/0x150 [lnet]
 [ 1518.607204] [<ffffffffc0c6ca3c>] lnet_counters_get+0x6c/0x150 [lnet]
 [ 1518.615162] [<ffffffffc0c99a0b>] __proc_lnet_stats+0xfb/0x810 [lnet]
 [ 1518.623081] [<ffffffffc0c09602>] lprocfs_call_handler+0x22/0x50 [libcfs]
 [ 1518.631332] [<ffffffffc0c98ee5>] proc_lnet_stats+0x25/0x30 [lnet]
 [ 1518.638962] [<ffffffffc0c0965d>] lnet_debugfs_read+0x2d/0x40 [libcfs]
 [ 1518.646929] [<ffffffffa46414bf>] vfs_read+0x9f/0x170
 [ 1518.653382] [<ffffffffa464237f>] SyS_read+0x7f/0xf0
 [ 1518.659717] [<ffffffffa4b74ddb>] system_call_fastpath+0x22/0x27
 [ 1518.667084] Code: 85 03 00 50 4b c5 c0 c7 05 cc 85 03 00 00 00 02 00 e8 ff f8 fe ff 8b 1d 89 67 01 00 e9 e8 f9 ff ff 0f 1f 40 00 0f 1f 44 00 00 55 <8b> 47 f0 48 89 e5 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 
 [ 1518.689553] RIP [<ffffffffc0c1c5c6>] cfs_percpt_number+0x6/0x10 [libcfs]
 [ 1518.697778] RSP <ffff9bc11bed3db0>
 [ 1518.702661] CR2: fffffffffffffff0
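
A note on reading the oops above: the faulting address in CR2 is a small negative offset from the value in RDI, which on x86-64 Linux normally holds the first function argument. The sketch below redoes that arithmetic with the register values copied from the log. The helper name to_signed64 is made up for illustration, and the decoding of the bytes "8b 47 f0" as a 32-bit load from -0x10(%rdi) is my reading of the Code: line, not something stated in the ticket.

```python
def to_signed64(value: int) -> int:
    """Interpret a 64-bit register value as a two's-complement signed integer."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >> 63 else value


# Register values copied from the oops.
CR2 = 0xFFFFFFFFFFFFFFF0  # faulting address
RDI = 0x0000000000000000  # first-argument register (assumed calling convention)

# Last byte of the faulting instruction "8b 47 f0" (marked <8b> in the Code: line),
# read here as a signed 8-bit displacement relative to RDI (assumption).
disp8 = 0xF0
displacement = disp8 - 0x100 if disp8 & 0x80 else disp8

# Offset of the faulting address relative to RDI.
fault_offset = to_signed64(CR2) - to_signed64(RDI)

print(fault_offset, displacement)  # -16 -16
```

Both values come out as -16, which suggests cfs_percpt_number was called with a NULL pointer and tried to read 16 bytes in front of it. That would be consistent with routerstat reading the LNet stats while the counters were being torn down by lustre_rmmod, although the trace alone does not prove it.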


 Comments   
Comment by James A Simmons [ 11/Apr/19 ]

https://review.whamcloud.com/#/c/34634/

Comment by James A Simmons [ 11/Apr/19 ]

Duplicate of LU-11986, which has a fix.

Comment by Peter Jones [ 11/Apr/19 ]

Jeremy

The fix is expected to be in the final 2.12.1 release.

Peter

Comment by Jeremy Filizetti [ 11/Apr/19 ]

Thanks, I probably should have taken a closer look for possible dupes.

Generated at Sat Feb 10 02:50:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.