[LU-9754] multi-rail: test_UT_0025 : BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 Created: 08/Jul/17 Updated: 10/Jul/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Saurabh Tandan (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
2.10 RC1 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
When test_UT_0025 was run, the node restarted:
DEBUG:root:s.run_script()
[171420.208546] LNet: Added LNI 10.2.4.36@tcp1 [8/256/0/180]
[171420.211608] LNet: Added LNI 192.168.122.129@tcp1 [8/256/0/180]
[171420.212895] LNet: Added LNI 192.168.122.84@tcp1 [8/256/0/180]
[171420.214153] LNet: Added LNI 192.168.122.39@tcp1 [8/256/0/180]
[171420.234617] LNet: Removed LNI 10.2.4.36@tcp1
[171420.250192] LNet: Removed LNI 192.168.122.129@tcp1
[171420.251372] LNet: Removed LNI 192.168.122.84@tcp1
[171421.252136] LNet: Removed LNI 192.168.122.39@tcp1
DEBUG:root:s.push_results()
DEBUG:root:logout
DEBUG:root:Exiting handler
[171477.627760] LNetError: 120-3: Refusing connection from 10.2.4.37 for 10.2.4.36@tcp: No matching NI
[171477.631857] LNetError: Skipped 13 previous similar messages
DEBUG:root:Setting termtype to ansi
DEBUG:root:Setting termtype to XTERM
DEBUG:root:from lutf_agent_ctrl import *
DEBUG:root:s = Agent_Ctl_Script('root','10.2.4.35','/root/LUTF/python/tests/multi-rail/mr_test_UT_0025.py:/root/LUTF/python/tests/test_infra/lnet_test_infra_utils.py:/root/LUTF/python/tests/test_infra/selftest_template.py','/root/LUTF/python/tests')
DEBUG:root:s.run_script()
[171562.440523] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[171562.442025] IP: [<ffffffff81321300>] strncmp+0x60/0x60
[171562.442025] PGD 7946e067 PUD 79d21067 PMD 0
[171562.442025] Oops: 0000 [#1] SMP
[171562.442025] Modules linked in: libcfs(OE+) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk 8139too drm ata_piix i2c_core serio_raw libata virtio_pci 8139cp virtio_ring mii virtio floppy [last unloaded: libcfs]
[171562.442025] CPU: 0 PID: 19063 Comm: modprobe Tainted: G OE ------------ 3.10.0-514.21.1.el7_lustre.x86_64 #1
[171562.442025] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[171562.442025] task: ffff88007bfb0000 ti: ffff8800119d0000 task.ti: ffff8800119d0000
[171562.442025] RIP: 0010:[<ffffffff81321300>] [<ffffffff81321300>] strncmp+0x60/0x60
[171562.442025] RSP: 0018:ffff8800119d3ce0 EFLAGS: 00010206
[171562.442025] RAX: 0000000000000001 RBX: 0000000000000006 RCX: ffff8800777e6f80
[171562.442025] RDX: 000000000000005b RSI: 000000000000005b RDI: 0000000000000001
[171562.442025] RBP: ffff8800119d3d48 R08: 315b31205d305b30 R09: 33205d325b32205d
[171562.442025] R10: 5b32205d315b3120 R11: 005d335b33205d32 R12: 0000000000000000
[171562.442025] R13: ffff8800777e6f80 R14: 0000000000000000 R15: ffffffffa065d120
[171562.442025] FS: 00007f333caf1740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[171562.442025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[171562.442025] CR2: 0000000000000001 CR3: 000000007b2ea000 CR4: 00000000000006f0
[171562.442025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[171562.442025] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[171562.442025] Stack:
[171562.442025] ffffffffa0635085 0000000000000246 ffffffff81d3abf0 ffff8800119d3d18
[171562.442025] ffff8800777e6f80 0000000000000000 ffff8800777e6f80 00000000b23a2366
[171562.442025] 0000000000000000 ffff8800777e6f80 ffffffffa0696000 0000000000000000
[171562.442025] Call Trace:
[171562.442025] [<ffffffffa0635085>] ? cfs_cpu_init+0x5e5/0x12d0 [libcfs]
[171562.442025] [<ffffffffa0696000>] ? 0xffffffffa0695fff
[171562.442025] [<ffffffffa069602f>] libcfs_init+0x2f/0x1000 [libcfs]
[171562.442025] [<ffffffff810020e8>] do_one_initcall+0xb8/0x230
[171562.442025] [<ffffffff81100918>] load_module+0x22c8/0x2930
[171562.442025] [<ffffffff8133df80>] ? ddebug_proc_write+0xf0/0xf0
[171562.442025] [<ffffffff810fc8e3>] ? copy_module_from_fd.isra.42+0x53/0x150
[171562.442025] [<ffffffff81101136>] SyS_finit_module+0xa6/0xd0
[171562.442025] [<ffffffff81697849>] system_call_fastpath+0x16/0x1b
[171562.442025] Code: c1 75 18 48 83 c0 01 84 c9 74 05 48 39 d0 75 e3 31 c0 5d c3 0f 1f 80 00 00 00 00 44 38 c1 19 c0 83 c8 01 5d c3 66 0f 1f 44 00 00 <0f> b6 07 55 48 89 e5 40 38 f0 74 1b 84 c0 89 f2 75 0a eb 1c 0f
[171562.442025] RIP [<ffffffff81321300>] strncmp+0x60/0x60
[171562.442025] RSP <ffff8800119d3ce0>
[171562.442025] CR2: 0000000000000001
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 3.10.0-514.21.1.el7_lustre.x86_64 (jenkins@trevis-310-el7-x8664-3.trevis.hpdd.intel.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Tue Jun 27 19:47:14 UTC 2017
UT_0005 was run before it and it passed successfully. |
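A note on reading the register dump: on x86_64, registers that recently held string data can be decoded as little-endian ASCII. The sketch below does that for the R08 to R11 values copied from the oops above. It is only a triage aid, not a root-cause analysis; the helper name `decode_register` is made up for illustration.

```python
import struct

# Register values copied verbatim from the oops in this ticket.
REGISTERS = {
    "R08": 0x315B31205D305B30,
    "R09": 0x33205D325B32205D,
    "R10": 0x5B32205D315B3120,
    "R11": 0x005D335B33205D32,
}


def decode_register(value):
    """Interpret a 64-bit register value as 8 little-endian ASCII bytes."""
    raw = struct.pack("<Q", value)
    # Drop a trailing NUL terminator if the string ended inside this register.
    return raw.decode("ascii").rstrip("\x00")


decoded = {name: decode_register(value) for name, value in REGISTERS.items()}
for name, text in decoded.items():
    print(name, repr(text))

# RSI and RDX both hold 0x5b, which is the ASCII code for '['.
separator = chr(0x5B)
```

The decoded fragments are pieces of the string `0[0] 1[1] 2[2] 3[3]`, and 0x5b in RSI/RDX is `[`. That is consistent with the crash happening while libcfs was processing the cpu_pattern value during module init, though the register contents alone do not prove it.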
| Comments |
| Comment by Amir Shehata (Inactive) [ 10/Jul/17 ] |
|
I looked at lustre.conf on onyx23vm7 and I think you're running into: This has landed, but it's not part of 2.10 RC1. Can you just delete
options libcfs cpu_npartitions=4 cpu_pattern="0[0] 1[1] 2[2] 3[3]"
from /etc/modprobe.d/lustre.conf and try the test again? I believe that should fix the issue you're seeing. |
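For anyone scripting the workaround above, here is a minimal sketch of dropping the libcfs CPT options from modprobe configuration text. The function name `strip_cpt_options` and the sample `options lnet` line are assumptions for illustration; the code works on a string and does not touch /etc/modprobe.d/lustre.conf itself.

```python
def strip_cpt_options(conf_text):
    """Return conf_text without 'options libcfs' lines that set CPT tunables.

    Note: the whole line is dropped, so any other libcfs options on the
    same line are removed with it.
    """
    kept = []
    for line in conf_text.splitlines():
        fields = line.split()
        is_libcfs_options = fields[:2] == ["options", "libcfs"]
        sets_cpt = any(
            field.startswith(("cpu_pattern=", "cpu_npartitions="))
            for field in fields[2:]
        )
        if is_libcfs_options and sets_cpt:
            continue
        kept.append(line)
    trailing_newline = "\n" if conf_text.endswith("\n") else ""
    return "\n".join(kept) + trailing_newline


# Hypothetical file content; only the libcfs line comes from this ticket.
sample = (
    "options lnet networks=tcp1(eth0)\n"
    'options libcfs cpu_npartitions=4 cpu_pattern="0[0] 1[1] 2[2] 3[3]"\n'
)
cleaned = strip_cpt_options(sample)
print(cleaned, end="")
```

The modules need to be unloaded and reloaded afterwards for the change to take effect, since these options are read at module load time.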
| Comment by Saurabh Tandan (Inactive) [ 10/Jul/17 ] |
|
I tried the above suggestion. I am still hitting the issue. |
| Comment by Amir Shehata (Inactive) [ 10/Jul/17 ] |
|
I still believe that is the cause. 0025 tests:
"""
UT ID: UT-0025
Description:
    - unconfigure LNet
    - unload LNet
    - modify the /etc/modprobe.d/lustre.conf file to contain
        options libcfs cpu_npartitions=4 cpu_pattern="0[0] 1[1] 2[2] 3[3]"
    - Look up the interfaces on the system
    - configure intf1 on CPT 0, 2
    - configure intf2 on CPT 1, 3
    - sanitize configuration
    - cleanup system
"""
and 0030:
"""
UT ID: UT-0030
Description:
    - unconfigure LNet
    - unload LNet
    - modify the /etc/modprobe.d/lustre.conf file to contain
        options libcfs cpu_npartitions=4 cpu_pattern="0[0] 1[1] 2[2] 3[3]"
    - Look up the interfaces on the system
    - configure intf1 on CPT 0, 2
    - configure intf2 on CPT 1, 3
    - configure intf3 on all CPTS.
    - sanitize configuration
    - cleanup system
"""
These test scripts explicitly use the cpu_pattern tunable, which causes the crash. To confirm, I commented out the part of the script that writes the cpu_pattern tunable into lustre.conf, and the test didn't crash the node. It failed instead, because the CPTs identified in the test are not there. |
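To illustrate why the test fails once the tunable is removed, here is a rough sketch that parses a cpu_pattern string into a CPT table and reports which requested CPTs are missing. This is not the libcfs parser: it is a simplified stand-in that only understands comma-separated CPU numbers and `a-b` ranges, and the names `parse_cpu_pattern` and `missing_cpts` are hypothetical.

```python
import re


def parse_cpu_pattern(pattern):
    """Parse 'cpt[cpus] cpt[cpus] ...' into {cpt: [cpu, ...]}.

    Simplified sketch: handles comma-separated CPU numbers and 'a-b'
    ranges only, not the full syntax libcfs accepts.
    """
    table = {}
    for cpt, cpus in re.findall(r"(\d+)\[([^\]]*)\]", pattern):
        members = []
        for item in cpus.split(","):
            item = item.strip()
            if not item:
                continue
            if "-" in item:
                low, high = item.split("-", 1)
                members.extend(range(int(low), int(high) + 1))
            else:
                members.append(int(item))
        table[int(cpt)] = members
    return table


def missing_cpts(requested, available):
    """Return the requested CPT numbers that are not in the available set."""
    return sorted(set(requested) - set(available))


# Pattern used by UT-0025 and UT-0030.
cpts = parse_cpu_pattern("0[0] 1[1] 2[2] 3[3]")
print(cpts)

# intf1 goes on CPT 0 and 2, intf2 on CPT 1 and 3: all present with the pattern.
print(missing_cpts([0, 2], cpts), missing_cpts([1, 3], cpts))

# Hypothetical node where only CPT 0 exists after the tunable is removed.
print(missing_cpts([1, 3], {0: [0]}))
```

With the pattern applied, every CPT the test asks for exists. On a node that has fewer partitions, the same requests come back as missing, which matches the non-crashing failure described above.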
| Comment by Saurabh Tandan (Inactive) [ 10/Jul/17 ] |
|
I tried again, and now I see the same behavior described above. |
| Comment by Amir Shehata (Inactive) [ 10/Jul/17 ] |
|
This is the same issue as |