[LU-9754] multi-rail: test_UT_0025 : BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 Created: 08/Jul/17  Updated: 10/Jul/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Saurabh Tandan (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None
Environment:

2.10 RC1
onyx-23vm7


Issue Links:
Blocker
is blocking LU-9715 Crash in libcfs_init() Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The node rebooted while test_UT_0025 was running:

DEBUG:root:s.run_script()
[171420.208546] LNet: Added LNI 10.2.4.36@tcp1 [8/256/0/180]
[171420.211608] LNet: Added LNI 192.168.122.129@tcp1 [8/256/0/180]
[171420.212895] LNet: Added LNI 192.168.122.84@tcp1 [8/256/0/180]
[171420.214153] LNet: Added LNI 192.168.122.39@tcp1 [8/256/0/180]
[171420.234617] LNet: Removed LNI 10.2.4.36@tcp1
[171420.250192] LNet: Removed LNI 192.168.122.129@tcp1
[171420.251372] LNet: Removed LNI 192.168.122.84@tcp1
[171421.252136] LNet: Removed LNI 192.168.122.39@tcp1
DEBUG:root:s.push_results()
DEBUG:root:logout
DEBUG:root:Exiting handler
[171477.627760] LNetError: 120-3: Refusing connection from 10.2.4.37 for 10.2.4.36@tcp: No matching NI
[171477.631857] LNetError: Skipped 13 previous similar messages
DEBUG:root:Setting termtype to ansi
DEBUG:root:Setting termtype to XTERM
DEBUG:root:from lutf_agent_ctrl import *
DEBUG:root:s = Agent_Ctl_Script('root','10.2.4.35','/root/LUTF/python/tests/multi-rail/mr_test_UT_0025.py:/root/LUTF/python/tests/test_infra/lnet_test_infra_utils.py:/root/LUTF/python/tests/test_infra/selftest_template.py','/root/LUTF/python/tests')
DEBUG:root:s.run_script()
[171562.440523] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[171562.442025] IP: [<ffffffff81321300>] strncmp+0x60/0x60
[171562.442025] PGD 7946e067 PUD 79d21067 PMD 0 
[171562.442025] Oops: 0000 [#1] SMP 
[171562.442025] Modules linked in: libcfs(OE+) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk 8139too drm ata_piix i2c_core serio_raw libata virtio_pci 8139cp virtio_ring mii virtio floppy [last unloaded: libcfs]
[171562.442025] CPU: 0 PID: 19063 Comm: modprobe Tainted: G           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
[171562.442025] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[171562.442025] task: ffff88007bfb0000 ti: ffff8800119d0000 task.ti: ffff8800119d0000
[171562.442025] RIP: 0010:[<ffffffff81321300>]  [<ffffffff81321300>] strncmp+0x60/0x60
[171562.442025] RSP: 0018:ffff8800119d3ce0  EFLAGS: 00010206
[171562.442025] RAX: 0000000000000001 RBX: 0000000000000006 RCX: ffff8800777e6f80
[171562.442025] RDX: 000000000000005b RSI: 000000000000005b RDI: 0000000000000001
[171562.442025] RBP: ffff8800119d3d48 R08: 315b31205d305b30 R09: 33205d325b32205d
[171562.442025] R10: 5b32205d315b3120 R11: 005d335b33205d32 R12: 0000000000000000
[171562.442025] R13: ffff8800777e6f80 R14: 0000000000000000 R15: ffffffffa065d120
[171562.442025] FS:  00007f333caf1740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[171562.442025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[171562.442025] CR2: 0000000000000001 CR3: 000000007b2ea000 CR4: 00000000000006f0
[171562.442025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[171562.442025] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[171562.442025] Stack:
[171562.442025]  ffffffffa0635085 0000000000000246 ffffffff81d3abf0 ffff8800119d3d18
[171562.442025]  ffff8800777e6f80 0000000000000000 ffff8800777e6f80 00000000b23a2366
[171562.442025]  0000000000000000 ffff8800777e6f80 ffffffffa0696000 0000000000000000
[171562.442025] Call Trace:
[171562.442025]  [<ffffffffa0635085>] ? cfs_cpu_init+0x5e5/0x12d0 [libcfs]
[171562.442025]  [<ffffffffa0696000>] ? 0xffffffffa0695fff
[171562.442025]  [<ffffffffa069602f>] libcfs_init+0x2f/0x1000 [libcfs]
[171562.442025]  [<ffffffff810020e8>] do_one_initcall+0xb8/0x230
[171562.442025]  [<ffffffff81100918>] load_module+0x22c8/0x2930
[171562.442025]  [<ffffffff8133df80>] ? ddebug_proc_write+0xf0/0xf0
[171562.442025]  [<ffffffff810fc8e3>] ? copy_module_from_fd.isra.42+0x53/0x150
[171562.442025]  [<ffffffff81101136>] SyS_finit_module+0xa6/0xd0
[171562.442025]  [<ffffffff81697849>] system_call_fastpath+0x16/0x1b
[171562.442025] Code: c1 75 18 48 83 c0 01 84 c9 74 05 48 39 d0 75 e3 31 c0 5d c3 0f 1f 80 00 00 00 00 44 38 c1 19 c0 83 c8 01 5d c3 66 0f 1f 44 00 00 <0f> b6 07 55 48 89 e5 40 38 f0 74 1b 84 c0 89 f2 75 0a eb 1c 0f 
[171562.442025] RIP  [<ffffffff81321300>] strncmp+0x60/0x60
[171562.442025]  RSP <ffff8800119d3ce0>
[171562.442025] CR2: 0000000000000001
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-514.21.1.el7_lustre.x86_64 (jenkins@trevis-310-el7-x8664-3.trevis.hpdd.intel.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Tue Jun 27 19:47:14 UTC 2017

UT_0005 was run before it and passed.
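
One detail in the oops above may help triage: registers R08 to R11 appear to hold ASCII text rather than pointers. The following is a minimal Python sketch, assuming the values are stored as little-endian bytes (as on x86_64). The names `regs` and `decode_le` are illustrative only and are not part of LUTF or Lustre.

```python
import struct

# Register values copied verbatim from the oops in this ticket.
regs = {
    "R08": 0x315b31205d305b30,
    "R09": 0x33205d325b32205d,
    "R10": 0x5b32205d315b3120,
    "R11": 0x005d335b33205d32,
}


def decode_le(value):
    """Interpret a 64-bit register value as 8 little-endian ASCII bytes."""
    raw = struct.pack("<Q", value)
    return raw.decode("ascii", errors="replace")


decoded = {name: decode_le(value) for name, value in regs.items()}

for name, text in decoded.items():
    # repr() makes spaces and the trailing NUL byte visible.
    print(name, repr(text))

# RSI and RDX both hold 0x5b in the same dump.
bracket = chr(0x5b)
print("RSI/RDX as a character:", repr(bracket))
```

The decoded fragments look like pieces of a libcfs cpu_pattern value of the form "0[0] 1[1] 2[2] 3[3]", and 0x5b is the ASCII code for "[". That is consistent with the fault happening while cfs_cpu_init() handles the pattern string, although on its own it does not establish the root cause.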



 Comments   
Comment by Amir Shehata (Inactive) [ 10/Jul/17 ]

I looked at lustre.conf on onyx-23vm7 and I think you're running into:
LU-9715: https://review.whamcloud.com/#/c/27872/

This has landed, but it's not part of 2.10 RC1.

Can you just delete

options libcfs cpu_npartitions=4 cpu_pattern="0[0] 1[1] 2[2] 3[3]"

from

/etc/modprobe.d/lustre.conf

and try the test again? I believe that should fix the issue you're seeing.
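
For anyone scripting this workaround, here is a rough Python sketch of the edit. It works on the file contents as a string so that it can be tried without touching /etc/modprobe.d/lustre.conf. The function name `drop_libcfs_cpt_options` and the first line of `sample` are hypothetical and are not taken from the node's real configuration.

```python
def drop_libcfs_cpt_options(conf_text):
    """Return conf_text without 'options libcfs' lines that set cpu_pattern."""
    kept = []
    for line in conf_text.splitlines(keepends=True):
        fields = line.split()
        is_libcfs_opts = fields[:2] == ["options", "libcfs"]
        if is_libcfs_opts and "cpu_pattern=" in line:
            # Skip the line that triggers the crash described in this ticket.
            continue
        kept.append(line)
    return "".join(kept)


# Hypothetical file contents: the first line is made up for illustration,
# the second is the line quoted in this comment.
sample = (
    'options lnet networks=tcp1(eth0)\n'
    'options libcfs cpu_npartitions=4 cpu_pattern="0[0] 1[1] 2[2] 3[3]"\n'
)

cleaned = drop_libcfs_cpt_options(sample)
print(cleaned, end="")
```

Only lines that both start with "options libcfs" and set cpu_pattern are dropped, so unrelated module options are left alone. Note that the test script may write the line back when it runs.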

Comment by Saurabh Tandan (Inactive) [ 10/Jul/17 ]

I tried the above suggestion, but I am still hitting the issue.

Comment by Amir Shehata (Inactive) [ 10/Jul/17 ]

I still believe that LU-9715 is the problem.

UT-0025 tests:

"""
UT ID: UT-0025
Description:
    - unconfigure LNet
    - unload LNet
    - modify the /etc/modprobe.d/lustre.conf file to contain
        options libcfs cpu_npartitions=4 cpu_pattern="0[0] 1[1] 2[2] 3[3]"
    - Look up the interfaces on the system
    - configure intf1 on CPT 0, 2
    - configure intf2 on CPT 1, 3
    - sanitize configuration
    - cleanup system
"""

and UT-0030:

"""
UT ID: UT-0030
Description:
    - unconfigure LNet
    - unload LNet
    - modify the /etc/modprobe.d/lustre.conf file to contain
        options libcfs cpu_npartitions=4 cpu_pattern="0[0] 1[1] 2[2] 3[3]"
    - Look up the interfaces on the system
    - configure intf1 on CPT 0, 2
    - configure intf2 on CPT 1, 3
    - configure intf3 on all CPTS.
    - sanitize configuration
    - cleanup system
"""

These test scripts explicitly set the cpu_pattern tunable, which causes the crash.

To confirm, I commented out the part of the script that writes the cpu_pattern tunable to lustre.conf, and the test no longer crashed the node. It still failed, because the CPTs the test refers to are not present.
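
To illustrate why the test fails without the tunable, here is a small Python sketch of how the pattern maps CPTs to CPUs. It assumes a simplified reading of the syntax, "CPT[cpu list]" entries separated by spaces; the real libcfs parser accepts more forms. The helper names and the single-CPT fallback are assumptions made for illustration, not observed values from the node.

```python
import re


def parse_cpu_pattern(pattern):
    """Map CPT number -> list of CPU numbers for patterns like '0[0] 1[1]'."""
    table = {}
    for cpt, cpus in re.findall(r"(\d+)\[([^\]]*)\]", pattern):
        members = []
        for item in cpus.split(","):
            item = item.strip()
            if not item:
                continue
            if "-" in item:
                # Expand a simple inclusive range such as "0-3".
                lo, hi = item.split("-", 1)
                members.extend(range(int(lo), int(hi) + 1))
            else:
                members.append(int(item))
        table[int(cpt)] = members
    return table


def missing_cpts(required, available):
    """Return the required CPT numbers that are not in 'available'."""
    return sorted(set(required) - set(available))


# Pattern written to lustre.conf by UT-0025 and UT-0030.
cpts = parse_cpu_pattern("0[0] 1[1] 2[2] 3[3]")

# CPTs the tests bind intf1 (0, 2) and intf2 (1, 3) to.
required = [0, 2, 1, 3]
with_pattern = missing_cpts(required, cpts)

# Assumption: without the tunable, a small VM ends up with only CPT 0.
without_pattern = missing_cpts(required, {0: [0]})

print(cpts)
print("missing with pattern:", with_pattern)
print("missing without pattern:", without_pattern)
```

With the pattern in place all four CPTs exist, so nothing is missing. Under the assumed single-CPT fallback, CPTs 1 to 3 are absent, which would explain a test failure rather than a crash.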

Comment by Saurabh Tandan (Inactive) [ 10/Jul/17 ]

I tried again, and I now see the same behavior described above.

Comment by Amir Shehata (Inactive) [ 10/Jul/17 ]

This is the same issue as LU-9715. The fix for that issue should be included in 2.10 GA, since the bug breaks clearly advertised functionality and the fix is low risk.
