Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
Lustre 2.8.0
-
3
-
9223372036854775807
Description
We have had three clients crash pretty quickly when testing Lustre 2.8 clients with lustre 2.5.5 servers. The clients were using tag 2.8.0_0.0.llnlpreview.18, the servers are using 2.5.5-8chaos (see the lustre-release-fe-llnl repo for each).
They hit the following LBUG:
2016-08-09 08:24:53 [569127.746044] Lustre: lcy-OST0004-osc-ffff882025188000: Connection restored to 10.1.1.175@o2ib9 (at 10.1.1.175@o2ib9) 2016-08-09 08:24:53 [569127.760391] Lustre: Skipped 7 previous similar messages 2016-08-09 08:27:04 [569258.797932] LustreError: 166-1: MGC10.1.1.169@o2ib9: Connection to MGS (at 10.1.1.169@o2ib9) was lost; in progress operations using this service will fail 2016-08-09 08:28:14 [569328.893397] Lustre: lcy-OST0004-osc-ffff882025188000: Connection to lcy-OST0004 (at 10.1.1.175@o2ib9) was lost; in progress operations using this service will wait for rec overy to complete 2016-08-09 08:28:14 [569328.918048] Lustre: lcy-OST0004-osc-ffff882025188000: Connection restored to 10.1.1.175@o2ib9 (at 10.1.1.175@o2ib9) 2016-08-09 08:28:14 [569328.932430] Lustre: Skipped 1 previous similar message 2016-08-09 08:36:14 [569809.168166] Lustre: 130908:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1470756919/real 1470756919] req@ffff880d11ca e600 x1541675108398336/t0(0) o8->lquake-OST0003-osc-ffff882022b7f800@172.19.1.130@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1470756974 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-08-09 08:36:14 [569809.209141] Lustre: 130908:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 9 previous similar messages 2016-08-09 08:36:40 [569835.180706] Lustre: lcy-OST0008-osc-ffff882025188000: Connection to lcy-OST0008 (at 10.1.1.179@o2ib9) was lost; in progress operations using this service will wait for rec overy to complete 2016-08-09 08:36:40 [569835.206029] Lustre: lcy-OST0008-osc-ffff882025188000: Connection restored to 10.1.1.179@o2ib9 (at 10.1.1.179@o2ib9) 2016-08-09 08:46:15 [570411.139216] Lustre: 130984:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1470757409/real 1470757409] req@ffff881f9afbce00 x1541675108443980/t0(0) o10->lcy-OST0001-osc-ffff882025188000@10.1.1.172@o2ib9:6/4 lens 560/432 e 3 to 1 dl 1470757575 ref 1 fl Rpc:X/2/ffffffff rc -11/-1 2016-08-09 08:46:15 [570411.180184] Lustre: 130984:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 22 previous similar messages 2016-08-09 08:46:15 [570411.195473] LustreError: 130984:0:(niobuf.c:732:ptl_send_rpc()) @@@ xid: 1541675108443980, replied: 1541675108452731, list_empty:0 2016-08-09 08:46:15 [570411.195473] req@ffff881f9afbce00 x1541675108443980/t0(0) o10->lcy-OST0001-osc-ffff882025188000@10.1.1.172@o2ib9:6/4 lens 560/432 e 3 to 1 dl 1470757575 ref 1 fl Rpc:XS/2/ffffffff rc -11/-1 2016-08-09 08:46:15 [570411.238718] LustreError: 130984:0:(niobuf.c:733:ptl_send_rpc()) LBUG 2016-08-09 08:46:15 [570411.248645] Pid: 130984, comm: ptlrpcd_07_05 2016-08-09 08:46:15 Aug 9 08:51:14 opal173 kernel: [570411.256196] 2016-08-09 08:46:15 [570411.256196] Call Trace: 2016-08-09 08:46:15 LustreError: 130984:0:(niobuf.c:733:ptl_send_rpc()) LBUG 2016-08-09 08:46:15 Aug 9[570411.266553] [<ffffffffa0aea7e3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs] 2016-08-09 08:46:15 08:46:15 opal173 kernel: [57041[570411.278045] [<ffffffffa0aead85>] lbug_with_loc+0x45/0xc0 [libcfs] 2016-08-09 08:46:15 1.238718] LustreError: 130984:0:[570411.288199] [<ffffffffa0ec853a>] ptl_send_rpc+0xa1a/0xdb0 [ptlrpc] 2016-08-09 08:46:15 (niobuf.c:733:ptl_send_rpc()) LB[570411.298337] [<ffffffffa0ec1209>] ptlrpc_check_set.part.21+0x1789/0x1d80 [ptlrpc] 2016-08-09 08:46:15 UG 2016-08-09 08:46:15 [570411.309879] [<ffffffffa0ec185b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc] 2016-08-09 08:46:16 [570411.319667] [<ffffffffa0eed03b>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc] 2016-08-09 08:46:16 [570411.329345] [<ffffffffa0eed3eb>] ptlrpcd+0x2bb/0x560 [ptlrpc] 2016-08-09 08:46:16 [570411.338362] [<ffffffff810bd4b0>] ? default_wake_function+0x0/0x20 2016-08-09 08:46:16 [570411.347762] [<ffffffffa0eed130>] ? ptlrpcd+0x0/0x560 [ptlrpc] 2016-08-09 08:46:16 [570411.356709] [<ffffffff810a997f>] kthread+0xcf/0xe0 2016-08-09 08:46:16 [570411.364548] [<ffffffff810a98b0>] ? kthread+0x0/0xe0 2016-08-09 08:46:16 [570411.372443] [<ffffffff8165d658>] ret_from_fork+0x58/0x90 2016-08-09 08:46:16 [570411.380778] [<ffffffff810a98b0>] ? kthread+0x0/0xe0 2016-08-09 08:46:16 [570411.388582] 2016-08-09 08:46:16 [570411.392834] Kernel panic - not syncing: LBUG 2016-08-09 08:46:16 [570411.399778] CPU: 33 PID: 130984 Comm: ptlrpcd_07_05 Tainted: P OE ------------ 3.10.0-327.22.2.1chaos.ch6.x86_64 #1 2016-08-09 08:46:16 [570411.416933] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016 2016-08-09 08:46:16 [570411.431470] ffffffffa0b06e0f 000000003006f545 ffff880fb772bb48 ffffffff8164c6b4 2016-08-09 08:46:16 [570411.441962] ffff880fb772bbc8 ffffffff816456af ffffffff00000008 ffff880fb772bbd8 2016-08-09 08:46:16 [570411.452453] ffff880fb772bb78 000000003006f545 ffffffffa0f5a675 0000000000000246 2016-08-09 08:46:16 [570411.462912] Call Trace: 2016-08-09 08:46:16 [570411.467750] [<ffffffff8164c6b4>] dump_stack+0x19/0x1b 2016-08-09 08:46:16 [570411.475583] [<ffffffff816456af>] panic+0xd8/0x1e7 2016-08-09 08:46:16 [570411.482985] [<ffffffffa0aeadeb>] lbug_with_loc+0xab/0xc0 [libcfs] 2016-08-09 08:46:16 [570411.491940] [<ffffffffa0ec853a>] ptl_send_rpc+0xa1a/0xdb0 [ptlrpc] 2016-08-09 08:46:16 [570411.500942] [<ffffffffa0ec1209>] ptlrpc_check_set.part.21+0x1789/0x1d80 [ptlrpc] 2016-08-09 08:46:16 [570411.511280] [<ffffffffa0ec185b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc] 2016-08-09 08:46:16 [570411.520423] [<ffffffffa0eed03b>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc] 2016-08-09 08:46:16 [570411.529433] [<ffffffffa0eed3eb>] ptlrpcd+0x2bb/0x560 [ptlrpc] 2016-08-09 08:46:16 [570411.537780] [<ffffffff810bd4b0>] ? wake_up_state+0x20/0x20 2016-08-09 08:46:16 [570411.545821] [<ffffffffa0eed130>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc] 2016-08-09 08:46:16 [570411.554875] [<ffffffff810a997f>] kthread+0xcf/0xe0 2016-08-09 08:46:16 [570411.562041] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140 2016-08-09 08:46:16 [570411.571021] [<ffffffff8165d658>] ret_from_fork+0x58/0x90 2016-08-09 08:46:16 [570411.578700] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
Patch ported and landed to 2.8 FE branch