[LU-11580] Crash after cfs_percpt_alloc when doing modprobe lnet Created: 29/Oct/18  Updated: 04/Jan/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Åke Sandgren Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Ubuntu 18.04 (4.15.0-38)
MOFED 4.4


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With

options libcfs cpu_npartitions=1 cpu_pattern="0[0-27]"

or just

options libcfs cpu_pattern="0[0-27]"

I get a crash when loading lnet.

2018-10-29T09:21:10.475881+01:00 b-an04 kernel: [ 2138.540432] LNet: HW NUMA nodes: 2, HW CPU cores: 28, npartitions: 1
2018-10-29T09:21:35.396884+01:00 b-an04 kernel: [ 2162.956139] BUG: unable to handle kernel paging request at ffffffffaaddd990
2018-10-29T09:21:35.396900+01:00 b-an04 kernel: [ 2162.963917] IP: native_queued_spin_lock_slowpath+0x16f/0x1a0
2018-10-29T09:21:35.396902+01:00 b-an04 kernel: [ 2162.970228] PGD 1c80e0e067 P4D 1c80e0e067 PUD 1c80e0f063 PMD 8000001c80a000e1

 

Without any options for libcfs, loading lnet works as it should.

This worked with MOFED 3.3 and Lustre 2.7.
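
For readers unfamiliar with the option syntax, the sketch below shows how the cpu_pattern value above can be read: a partition number followed by a bracketed list of CPU ids or ranges. It is an illustration in Python, not the libcfs parser; the function name parse_cpu_pattern is hypothetical, and it only handles the "partition[cpu list]" form used in this report.

```python
import re


def parse_cpu_pattern(pattern):
    """Parse a pattern such as '0[0-27]' or '0[0,2,4] 1[1,3,5]'.

    Returns {partition_number: sorted list of CPU ids}.
    Illustration only: this is not the libcfs parser, and it ignores
    any other forms the real parser may accept.
    """
    partitions = {}
    for part, body in re.findall(r"(\d+)\[([^\]]*)\]", pattern):
        cpus = set()
        for item in body.split(","):
            item = item.strip()
            if not item:
                continue
            if "-" in item:
                # Inclusive range, e.g. "0-27" covers CPUs 0 through 27.
                lo, hi = item.split("-", 1)
                cpus.update(range(int(lo), int(hi) + 1))
            else:
                cpus.add(int(item))
        partitions[int(part)] = sorted(cpus)
    return partitions


layout = parse_cpu_pattern("0[0-27]")
print(len(layout), "partition(s);", len(layout[0]), "CPUs in partition 0")
```

Read this way, the pattern asks for a single partition holding CPUs 0 to 27, which matches the "HW CPU cores: 28, npartitions: 1" line in the log.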



 Comments   
Comment by Åke Sandgren [ 29/Oct/18 ]

Crash log is:

 

2018-10-29T09:21:10.475881+01:00 b-an04 kernel: [ 2138.540432] LNet: HW NUMA nodes: 2, HW CPU cores: 28, npartitions: 1
2018-10-29T09:21:35.396884+01:00 b-an04 kernel: [ 2162.956139] BUG: unable to handle kernel paging request at ffffffffaaddd990
2018-10-29T09:21:35.396900+01:00 b-an04 kernel: [ 2162.963917] IP: native_queued_spin_lock_slowpath+0x16f/0x1a0
2018-10-29T09:21:35.396902+01:00 b-an04 kernel: [ 2162.970228] PGD 1c80e0e067 P4D 1c80e0e067 PUD 1c80e0f063 PMD 8000001c80a000e1

2018-10-29T09:21:35.396906+01:00 b-an04 kernel: [ 2162.978287] Oops: 0003 [#1] SMP PTI
2018-10-29T09:21:35.396908+01:00 b-an04 kernel: [ 2162.982177] Modules linked in: lnet(OE+) libcfs(OE) ipmi_poweroff 8021q garp mrp stp llc openafs(POE) nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_meta nft_set_bitmap nft_set_hash nft_set_rbtree rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) snd_hda_codec_hdmi ipmi_ssif intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel pcbc snd_hda_codec snd_hda_core aesni_intel nls_iso8859_1 snd_hwdep aes_x86_64 snd_pcm crypto_simd glue_helper cryptd snd_seq_midi snd_seq_midi_event snd_rawmidi intel_cstate snd_seq snd_seq_device cdc_ether snd_timer usbnet intel_rapl_perf mii snd soundcore mei_me
2018-10-29T09:21:35.396911+01:00 b-an04 kernel: [ 2163.061433] mei shpchp lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel nfsv4 nfs lockd grace fscache knem(OE) psmouse sunrpc nf_tables_inet nf_tables_ipv6 nf_tables_ipv4 nf_tables nfnetlink ip_tables x_tables autofs4 xfs mlx5_ib(OE) ib_core(OE) nouveau video i2c_algo_bit ttm mlx5_core(OE) mlxfw(OE) drm_kms_helper syscopyarea devlink sysfillrect bnx2x sysimgblt mlx_compat(OE) tg3 fb_sys_fops ptp mxm_wmi drm pps_core megaraid_sas mdio libcrc32c wmi [last unloaded: libcfs]
2018-10-29T09:21:35.396913+01:00 b-an04 kernel: [ 2163.112273] CPU: 6 PID: 4354 Comm: modprobe Tainted: P OE 4.15.0-38-generic #41-Ubuntu
2018-10-29T09:21:35.396915+01:00 b-an04 kernel: [ 2163.122368] Hardware name: LENOVO System x3650 M5: [8871AC1]/00YJ380, BIOS [TCE136H-2.70] 06/13/2018
2018-10-29T09:21:35.396917+01:00 b-an04 kernel: [ 2163.132947] RIP: 0010:native_queued_spin_lock_slowpath+0x16f/0x1a0
2018-10-29T09:21:35.396918+01:00 b-an04 kernel: [ 2163.139841] RSP: 0018:ffff9c9c4c8739d0 EFLAGS: 00010086
2018-10-29T09:21:35.396919+01:00 b-an04 kernel: [ 2163.145668] RAX: ffffffffaaddd990 RBX: ffffffffab0ff940 RCX: ffff8d03c01a3440
2018-10-29T09:21:35.396921+01:00 b-an04 kernel: [ 2163.153629] RDX: 0000000000002ac2 RSI: 00000000ab0ff940 RDI: ffffffffab0ff940
2018-10-29T09:21:35.396923+01:00 b-an04 kernel: [ 2163.161589] RBP: ffff9c9c4c8739d0 R08: 00000000001c0000 R09: 0000000100400039
2018-10-29T09:21:35.396925+01:00 b-an04 kernel: [ 2163.169549] R10: ffff9c9c4c873ab0 R11: ffff8d23b285d660 R12: ffff8d03bfc07780
2018-10-29T09:21:35.396926+01:00 b-an04 kernel: [ 2163.177510] R13: ffffe985bfcbb680 R14: ffff8d03c01a70a0 R15: ffff8d03bfc07780
2018-10-29T09:21:35.396928+01:00 b-an04 kernel: [ 2163.185472] FS: 00007f3c5305d540(0000) GS:ffff8d03c0180000(0000) knlGS:0000000000000000
2018-10-29T09:21:35.396929+01:00 b-an04 kernel: [ 2163.194498] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2018-10-29T09:21:35.396930+01:00 b-an04 kernel: [ 2163.200906] CR2: ffffffffaaddd990 CR3: 0000001feda5a002 CR4: 00000000003606e0
2018-10-29T09:21:35.396932+01:00 b-an04 kernel: [ 2163.208867] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2018-10-29T09:21:35.396933+01:00 b-an04 kernel: [ 2163.216827] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2018-10-29T09:21:35.396934+01:00 b-an04 kernel: [ 2163.224787] Call Trace:

2018-10-29T09:21:35.396936+01:00 b-an04 kernel: [ 2163.227518] _raw_spin_lock+0x21/0x30
2018-10-29T09:21:35.396938+01:00 b-an04 kernel: [ 2163.231594] get_partial_node.isra.72+0x5c/0x260
2018-10-29T09:21:35.396955+01:00 b-an04 kernel: [ 2163.236745] ? default_wake_function+0x12/0x20
2018-10-29T09:21:35.396957+01:00 b-an04 kernel: [ 2163.241701] ? __wake_up_common+0x73/0x130
2018-10-29T09:21:35.396959+01:00 b-an04 kernel: [ 2163.246268] ___slab_alloc+0x17d/0x4b0
2018-10-29T09:21:35.396960+01:00 b-an04 kernel: [ 2163.250450] ? ep_poll_callback+0x20e/0x2a0
2018-10-29T09:21:35.396962+01:00 b-an04 kernel: [ 2163.255116] ? ___slab_alloc+0x17d/0x4b0
2018-10-29T09:21:35.396964+01:00 b-an04 kernel: [ 2163.259489] ? idr_alloc_cmn+0x97/0xd0
2018-10-29T09:21:35.396965+01:00 b-an04 kernel: [ 2163.263676] ? cfs_percpt_alloc+0x202/0x430 [libcfs]
2018-10-29T09:21:35.396966+01:00 b-an04 kernel: [ 2163.269213] ? __wake_up_common+0x73/0x130
2018-10-29T09:21:35.396967+01:00 b-an04 kernel: [ 2163.273782] __slab_alloc+0x20/0x40
2018-10-29T09:21:35.396970+01:00 b-an04 kernel: [ 2163.277671] ? __slab_alloc+0x20/0x40
2018-10-29T09:21:35.396971+01:00 b-an04 kernel: [ 2163.281754] __kmalloc_node+0xbe/0x2c0
2018-10-29T09:21:35.396973+01:00 b-an04 kernel: [ 2163.285939] ? cfs_percpt_alloc+0x202/0x430 [libcfs]
2018-10-29T09:21:35.396974+01:00 b-an04 kernel: [ 2163.291481] cfs_percpt_alloc+0x202/0x430 [libcfs]
2018-10-29T09:21:35.396976+01:00 b-an04 kernel: [ 2163.296821] cfs_percpt_lock_create+0xc1/0x2b0 [libcfs]
2018-10-29T09:21:35.396978+01:00 b-an04 kernel: [ 2163.302641] ? 0xffffffffc1092000
2018-10-29T09:21:35.396979+01:00 b-an04 kernel: [ 2163.306345] lnet_lib_init+0xef/0x340 [lnet]
2018-10-29T09:21:35.396981+01:00 b-an04 kernel: [ 2163.311115] lnet_init+0x81/0x1000 [lnet]
2018-10-29T09:21:35.396983+01:00 b-an04 kernel: [ 2163.315587] do_one_initcall+0x52/0x19f
2018-10-29T09:21:35.396985+01:00 b-an04 kernel: [ 2163.319866] ? __vunmap+0x81/0xb0
2018-10-29T09:21:35.396986+01:00 b-an04 kernel: [ 2163.323560] ? _cond_resched+0x19/0x40
2018-10-29T09:21:35.396987+01:00 b-an04 kernel: [ 2163.327740] ? kmem_cache_alloc_trace+0xa6/0x1b0
2018-10-29T09:21:35.396988+01:00 b-an04 kernel: [ 2163.332891] ? do_init_module+0x27/0x209
2018-10-29T09:21:35.396989+01:00 b-an04 kernel: [ 2163.337265] do_init_module+0x5f/0x209
2018-10-29T09:21:35.396991+01:00 b-an04 kernel: [ 2163.341446] load_module+0x191e/0x1f10
2018-10-29T09:21:35.396993+01:00 b-an04 kernel: [ 2163.345627] ? ima_post_read_file+0x96/0xa0
2018-10-29T09:21:35.396994+01:00 b-an04 kernel: [ 2163.350292] SYSC_finit_module+0xfc/0x120
2018-10-29T09:21:35.396996+01:00 b-an04 kernel: [ 2163.354763] ? SYSC_finit_module+0xfc/0x120
2018-10-29T09:21:35.396997+01:00 b-an04 kernel: [ 2163.359430] SyS_finit_module+0xe/0x10
2018-10-29T09:21:35.396998+01:00 b-an04 kernel: [ 2163.363611] do_syscall_64+0x73/0x130
2018-10-29T09:21:35.397000+01:00 b-an04 kernel: [ 2163.367693] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

2018-10-29T09:21:35.397001+01:00 b-an04 kernel: [ 2163.373326] RIP: 0033:0x7f3c52b88839
2018-10-29T09:21:35.397002+01:00 b-an04 kernel: [ 2163.377310] RSP: 002b:00007ffca81283c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
2018-10-29T09:21:35.397003+01:00 b-an04 kernel: [ 2163.385757] RAX: ffffffffffffffda RBX: 000055b1be249ee0 RCX: 00007f3c52b88839
2018-10-29T09:21:35.397005+01:00 b-an04 kernel: [ 2163.393718] RDX: 0000000000000000 RSI: 000055b1be242a90 RDI: 0000000000000003
2018-10-29T09:21:35.397007+01:00 b-an04 kernel: [ 2163.401679] RBP: 000055b1be242a90 R08: 0000000000000000 R09: 0000000000000000
2018-10-29T09:21:35.397008+01:00 b-an04 kernel: [ 2163.409639] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
2018-10-29T09:21:35.397010+01:00 b-an04 kernel: [ 2163.417599] R13: 000055b1be249ff0 R14: 0000000000040000 R15: 0000000000000000
2018-10-29T09:21:35.397012+01:00 b-an04 kernel: [ 2163.425561] Code: c3 f3 90 4c 8b 09 4d 85 c9 74 f6 eb c9 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 34 02 00 48 03 04 d5 c0 96 da aa <48> 89 08 8b 41 08 85 c0 75 09 f3 90 8b 41 08 85 c0 74 f7 4c 8b
2018-10-29T09:21:35.397013+01:00 b-an04 kernel: [ 2163.446638] RIP: native_queued_spin_lock_slowpath+0x16f/0x1a0 RSP: ffff9c9c4c8739d0
2018-10-29T09:21:35.397014+01:00 b-an04 kernel: [ 2163.455181] CR2: ffffffffaaddd990
2018-10-29T09:21:35.397015+01:00 b-an04 kernel: [ 2163.458876] ---[ end trace 762c34a38ab0c36f ]---
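
As a reading aid for the trace above, the following Python sketch decodes two values from it. It is a hedged interpretation, not a diagnosis. The bit meanings are the architectural x86 page-fault error code bits. The shift of 18 is an assumption read off the "c1 ea 12" (shr edx, 0x12) and "83 ea 01" (sub edx, 1) bytes in the Code: line. The helper names decode_pf_error and tail_cpu are hypothetical.

```python
# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation (page present)", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(code):
    """Return a list of human-readable meanings for a page-fault error code."""
    out = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            out.append(if_set)
        elif if_clear:
            out.append(if_clear)
    return out


def tail_cpu(word, cpu_shift=18):
    """Mirror the 'shr edx, 18; sub edx, 1' sequence seen in the Code: bytes."""
    return (word >> cpu_shift) - 1


# "Oops: 0003" from the log.
print(decode_pf_error(0x0003))

# RDX at the fault was 0x2ac2; RSI held 0xab0ff940 (the low 32 bits of RDI).
print(hex(tail_cpu(0xAB0FF940)))
```

The error code decodes to a kernel-mode write to a present but write-protected page. The second value equals RDX (0x2ac2), a CPU index far beyond the 28 cores reported. One possible reading is that the word at the lock address was not an initialized spinlock, but that remains to be confirmed.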

Comment by Peter Jones [ 04/Jan/19 ]

ake_s, does this issue still occur with the GA version of 2.12? Can you also elaborate a little on the distro kernel versions and Lustre versions, on both servers and clients, where this worked, and on what you are using now?
