[LU-3093] lstcon_dstnodes_prep()) ASSERTION( grp->grp_nnode >= 1 ) Created: 03/Apr/13 Updated: 04/Sep/13 Resolved: 04/Sep/13 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.4 |
| Fix Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ned Bass | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | |||
| Severity: | 3 |
| Rank (Obsolete): | 7513 |
| Description |
|
An lst session node crashed when I tried to run an lnet selftest. Backtrace from crash: crash> bt
PID: 2961 TASK: ffff880bed544ae0 CPU: 8 COMMAND: "lst"
#0 [ffff8808edeff980] machine_kexec at ffffffff81035b6b
#1 [ffff8808edeff9e0] crash_kexec at ffffffff810c08d2
#2 [ffff8808edeffab0] panic at ffffffff8150d3f3
#3 [ffff8808edeffb30] lbug_with_loc at ffffffffa02c1e4b [libcfs]
#4 [ffff8808edeffb50] lstcon_dstnodes_prep at ffffffffa059cc36 [lnet_selftest]
#5 [ffff8808edeffbc0] lstcon_testrpc_prep at ffffffffa059eb9a [lnet_selftest]
#6 [ffff8808edeffc20] lstcon_rpc_trans_ndlist at ffffffffa059f30f [lnet_selftest]
#7 [ffff8808edeffc90] lstcon_test_add at ffffffffa059b2ce [lnet_selftest]
#8 [ffff8808edeffd10] lst_test_add_ioctl at ffffffffa05a03e7 [lnet_selftest]
#9 [ffff8808edeffda0] lstcon_ioctl_entry at ffffffffa05a3f95 [lnet_selftest]
#10 [ffff8808edeffdd0] libcfs_ioctl at ffffffffa02cac84 [libcfs]
#11 [ffff8808edeffe10] libcfs_ioctl at ffffffffa02c5ac4 [libcfs]
#12 [ffff8808edeffe60] vfs_ioctl at ffffffff81194acc
#13 [ffff8808edeffea0] do_vfs_ioctl at ffffffff81194c14
#14 [ffff8808edefff30] sys_ioctl at ffffffff81195191
#15 [ffff8808edefff80] system_call_fastpath at ffffffff8100b072
RIP: 00002aaaaadada47 RSP: 00007fffffffe1b8 RFLAGS: 00010206
RAX: 0000000000000010 RBX: ffffffff8100b072 RCX: 00002aaaaae1dac0
RDX: 00007fffffffe0a0 RSI: 00000000c008653f RDI: 0000000000000003
RBP: 0000000000000000 R8: 0000000000000c26 R9: 0000000000000068
R10: 00007fffffffdf40 R11: 0000000000000246 R12: 00000000c008653f
R13: 00007fffffffe0a0 R14: 000000000061c2e0 R15: 0000000000000003
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
Console panic message: LustreError: 2961:0:(conrpc.c:738:lstcon_dstnodes_prep()) ASSERTION( grp->grp_nnode >= 1 ) failed: LustreError: 2961:0:(conrpc.c:738:lstcon_dstnodes_prep()) LBUG Pid: 2961, comm: lst Call Trace: [<ffffffffa02c17e5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [<ffffffffa02c1df7>] lbug_with_loc+0x47/0xb0 [libcfs] [<ffffffffa059cc36>] lstcon_dstnodes_prep+0x236/0x280 [lnet_selftest] [<ffffffff8116041a>] ? alloc_pages_current+0xaa/0x110 [<ffffffffa059eb9a>] lstcon_testrpc_prep+0xfa/0x310 [lnet_selftest] [<ffffffffa0596a40>] ? lstcon_testrpc_condition+0x0/0x1c0 [lnet_selftest] [<ffffffffa059f30f>] lstcon_rpc_trans_ndlist+0x24f/0x300 [lnet_selftest] [<ffffffffa059b2ce>] lstcon_test_add+0x4ce/0x930 [lnet_selftest] [<ffffffffa05a03e7>] lst_test_add_ioctl+0xaf7/0xc20 [lnet_selftest] [<ffffffffa05a3f95>] lstcon_ioctl_entry+0x505/0x5f0 [lnet_selftest] [<ffffffffa02cac84>] libcfs_ioctl+0x354/0x900 [libcfs] [<ffffffffa02c5ac4>] libcfs_ioctl+0x84/0x180 [libcfs] [<ffffffff81194acc>] vfs_ioctl+0x7c/0xa0 [<ffffffff81194c14>] do_vfs_ioctl+0x84/0x580 [<ffffffff81195191>] sys_ioctl+0x81/0xa0 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b Kernel panic - not syncing: LBUG Pid: 2961, comm: lst Tainted: G W --------------- 2.6.32-358.5chaos.ch5.1.x86_64 #1 Call Trace: [<ffffffff8150d3ec>] ? panic+0xa7/0x16f [<ffffffffa02c1e4b>] ? lbug_with_loc+0x9b/0xb0 [libcfs] [<ffffffffa059cc36>] ? lstcon_dstnodes_prep+0x236/0x280 [lnet_selftest] [<ffffffff8116041a>] ? alloc_pages_current+0xaa/0x110 [<ffffffffa059eb9a>] ? lstcon_testrpc_prep+0xfa/0x310 [lnet_selftest] [<ffffffffa0596a40>] ? lstcon_testrpc_condition+0x0/0x1c0 [lnet_selftest] [<ffffffffa059f30f>] ? lstcon_rpc_trans_ndlist+0x24f/0x300 [lnet_selftest] [<ffffffffa059b2ce>] ? lstcon_test_add+0x4ce/0x930 [lnet_selftest] [<ffffffffa05a03e7>] ? lst_test_add_ioctl+0xaf7/0xc20 [lnet_selftest] [<ffffffffa05a3f95>] ? lstcon_ioctl_entry+0x505/0x5f0 [lnet_selftest] [<ffffffffa02cac84>] ? libcfs_ioctl+0x354/0x900 [libcfs] [<ffffffffa02c5ac4>] ? libcfs_ioctl+0x84/0x180 [libcfs] [<ffffffff81194acc>] ? vfs_ioctl+0x7c/0xa0 [<ffffffff81194c14>] ? do_vfs_ioctl+0x84/0x580 [<ffffffff81195191>] ? sys_ioctl+0x81/0xa0 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b lst script (note the servers NID was mispelled) #!/bin/bash
export LST_SESSION=1234
lst new_session read/write
lst add_group servers 172.19.1.101@owib100
lst add_group readers 172.16.66.53@tcp
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers brw read check=simple size=1M
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session
|
| Comments |
| Comment by Peter Jones [ 03/Apr/13 ] |
|
Amir Could you please look into this one? Thanks Peter |
| Comment by Amir Shehata (Inactive) [ 05/Apr/13 ] |
|
In order to save time, I think the best approach is to agree first on the proposed solution with all interested parties, before going ahead with the implementation. As Ned pointed out in his comment, and I agree, I was aware that the the fix doesn't resolve the vulnerability in the kernel, but I was under the false assumption that I should keep the fix as simple as possible. The approach to properly fix this issue and thus remove the assert completely, since it's not appropriate to assert on user input, is outlined below. Currently the APIs under question are (and their explanation): Add nodes to a group Further more: update the group The problematic cases are: General Design Guidelines: The proposed solution is: The groups remain (possibly empty) and can be updated at a later time. Kernel Selftest Behavior: Justification: This opens up the flexibility of a user space application creating groups that can be reused, by adding and removing nodes, without having to worry about the group implicitly being deleted. LST Utility behavior Other user space utilities can be implemented differently, but the kernel selftest module will ensure the sanity of the passed in parameters before carrying out a test case thus avoiding asserting and bringing down the entire OS. Is this solution acceptable? |
| Comment by Ned Bass [ 05/Apr/13 ] |
|
Amir, thanks for the detailed analysis. This solution looks acceptable to me. |
| Comment by Amir Shehata (Inactive) [ 15/Apr/13 ] |
|
We reached a consensus to make sure that the vulnerability in the Kernel module is addressed, as well as ensuring that lst utility catches this case in its checks as well. The fix will be added to review soon. My hard drive crashed, causing a delay in the resolution to this issue. |
| Comment by Amir Shehata (Inactive) [ 18/Apr/13 ] |
|
Fix is ready for review: |
| Comment by Amir Shehata (Inactive) [ 23/Apr/13 ] |
|
Just a reminder that the fix is ready for review: http://review.whamcloud.com/#change,6092 |
| Comment by Bob Glossman (Inactive) [ 08/Aug/13 ] |
|
back port to b2_4: http://review.whamcloud.com/7277 |