[LU-8488] LBUG in ptl_send_rpc(), request->rq_xid <= imp->imp_known_replied_xid fails Created: 09/Aug/16 Updated: 20/Dec/17 Resolved: 18/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Alex Zhuravlev |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | llnl | | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have had three clients crash pretty quickly when testing Lustre 2.8 clients with lustre 2.5.5 servers. The clients were using tag 2.8.0_0.0.llnlpreview.18, the servers are using 2.5.5-8chaos (see the lustre-release-fe-llnl repo for each). They hit the following LBUG:

2016-08-09 08:24:53 [569127.746044] Lustre: lcy-OST0004-osc-ffff882025188000: Connection restored to 10.1.1.175@o2ib9 (at 10.1.1.175@o2ib9)
2016-08-09 08:24:53 [569127.760391] Lustre: Skipped 7 previous similar messages
2016-08-09 08:27:04 [569258.797932] LustreError: 166-1: MGC10.1.1.169@o2ib9: Connection to MGS (at 10.1.1.169@o2ib9) was lost; in progress operations using this service will fail
2016-08-09 08:28:14 [569328.893397] Lustre: lcy-OST0004-osc-ffff882025188000: Connection to lcy-OST0004 (at 10.1.1.175@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2016-08-09 08:28:14 [569328.918048] Lustre: lcy-OST0004-osc-ffff882025188000: Connection restored to 10.1.1.175@o2ib9 (at 10.1.1.175@o2ib9)
2016-08-09 08:28:14 [569328.932430] Lustre: Skipped 1 previous similar message
2016-08-09 08:36:14 [569809.168166] Lustre: 130908:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1470756919/real 1470756919] req@ffff880d11cae600 x1541675108398336/t0(0) o8->lquake-OST0003-osc-ffff882022b7f800@172.19.1.130@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1470756974 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
2016-08-09 08:36:14 [569809.209141] Lustre: 130908:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
2016-08-09 08:36:40 [569835.180706] Lustre: lcy-OST0008-osc-ffff882025188000: Connection to lcy-OST0008 (at 10.1.1.179@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2016-08-09 08:36:40 [569835.206029] Lustre: lcy-OST0008-osc-ffff882025188000: Connection restored to 10.1.1.179@o2ib9 (at 10.1.1.179@o2ib9)
2016-08-09 08:46:15 [570411.139216] Lustre: 130984:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1470757409/real 1470757409] req@ffff881f9afbce00 x1541675108443980/t0(0) o10->lcy-OST0001-osc-ffff882025188000@10.1.1.172@o2ib9:6/4 lens 560/432 e 3 to 1 dl 1470757575 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
2016-08-09 08:46:15 [570411.180184] Lustre: 130984:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
2016-08-09 08:46:15 [570411.195473] LustreError: 130984:0:(niobuf.c:732:ptl_send_rpc()) @@@ xid: 1541675108443980, replied: 1541675108452731, list_empty:0
2016-08-09 08:46:15 [570411.195473] req@ffff881f9afbce00 x1541675108443980/t0(0) o10->lcy-OST0001-osc-ffff882025188000@10.1.1.172@o2ib9:6/4 lens 560/432 e 3 to 1 dl 1470757575 ref 1 fl Rpc:XS/2/ffffffff rc -11/-1
2016-08-09 08:46:15 [570411.238718] LustreError: 130984:0:(niobuf.c:733:ptl_send_rpc()) LBUG
2016-08-09 08:46:15 [570411.248645] Pid: 130984, comm: ptlrpcd_07_05
2016-08-09 08:46:15 [570411.256196] Call Trace:
2016-08-09 08:46:15 [570411.266553] [<ffffffffa0aea7e3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
2016-08-09 08:46:15 [570411.278045] [<ffffffffa0aead85>] lbug_with_loc+0x45/0xc0 [libcfs]
2016-08-09 08:46:15 [570411.288199] [<ffffffffa0ec853a>] ptl_send_rpc+0xa1a/0xdb0 [ptlrpc]
2016-08-09 08:46:15 [570411.298337] [<ffffffffa0ec1209>] ptlrpc_check_set.part.21+0x1789/0x1d80 [ptlrpc]
2016-08-09 08:46:15 [570411.309879] [<ffffffffa0ec185b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
2016-08-09 08:46:16 [570411.319667] [<ffffffffa0eed03b>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
2016-08-09 08:46:16 [570411.329345] [<ffffffffa0eed3eb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
2016-08-09 08:46:16 [570411.338362] [<ffffffff810bd4b0>] ? default_wake_function+0x0/0x20
2016-08-09 08:46:16 [570411.347762] [<ffffffffa0eed130>] ? ptlrpcd+0x0/0x560 [ptlrpc]
2016-08-09 08:46:16 [570411.356709] [<ffffffff810a997f>] kthread+0xcf/0xe0
2016-08-09 08:46:16 [570411.364548] [<ffffffff810a98b0>] ? kthread+0x0/0xe0
2016-08-09 08:46:16 [570411.372443] [<ffffffff8165d658>] ret_from_fork+0x58/0x90
2016-08-09 08:46:16 [570411.380778] [<ffffffff810a98b0>] ? kthread+0x0/0xe0
2016-08-09 08:46:16 [570411.388582]
2016-08-09 08:46:16 [570411.392834] Kernel panic - not syncing: LBUG
2016-08-09 08:46:16 [570411.399778] CPU: 33 PID: 130984 Comm: ptlrpcd_07_05 Tainted: P OE ------------ 3.10.0-327.22.2.1chaos.ch6.x86_64 #1
2016-08-09 08:46:16 [570411.416933] Hardware name: Penguin Computing Relion OCP1930e/S2600KPR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-08-09 08:46:16 [570411.431470] ffffffffa0b06e0f 000000003006f545 ffff880fb772bb48 ffffffff8164c6b4
2016-08-09 08:46:16 [570411.441962] ffff880fb772bbc8 ffffffff816456af ffffffff00000008 ffff880fb772bbd8
2016-08-09 08:46:16 [570411.452453] ffff880fb772bb78 000000003006f545 ffffffffa0f5a675 0000000000000246
2016-08-09 08:46:16 [570411.462912] Call Trace:
2016-08-09 08:46:16 [570411.467750] [<ffffffff8164c6b4>] dump_stack+0x19/0x1b
2016-08-09 08:46:16 [570411.475583] [<ffffffff816456af>] panic+0xd8/0x1e7
2016-08-09 08:46:16 [570411.482985] [<ffffffffa0aeadeb>] lbug_with_loc+0xab/0xc0 [libcfs]
2016-08-09 08:46:16 [570411.491940] [<ffffffffa0ec853a>] ptl_send_rpc+0xa1a/0xdb0 [ptlrpc]
2016-08-09 08:46:16 [570411.500942] [<ffffffffa0ec1209>] ptlrpc_check_set.part.21+0x1789/0x1d80 [ptlrpc]
2016-08-09 08:46:16 [570411.511280] [<ffffffffa0ec185b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
2016-08-09 08:46:16 [570411.520423] [<ffffffffa0eed03b>] ptlrpcd_check+0x4eb/0x5e0 [ptlrpc]
2016-08-09 08:46:16 [570411.529433] [<ffffffffa0eed3eb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
2016-08-09 08:46:16 [570411.537780] [<ffffffff810bd4b0>] ? wake_up_state+0x20/0x20
2016-08-09 08:46:16 [570411.545821] [<ffffffffa0eed130>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc]
2016-08-09 08:46:16 [570411.554875] [<ffffffff810a997f>] kthread+0xcf/0xe0
2016-08-09 08:46:16 [570411.562041] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
2016-08-09 08:46:16 [570411.571021] [<ffffffff8165d658>] ret_from_fork+0x58/0x90
2016-08-09 08:46:16 [570411.578700] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140 |
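For reference, the assertion that fires is the consistency check near the top of ptl_send_rpc() in lustre/ptlrpc/niobuf.c. Below is a minimal sketch of that check, paraphrased from the 2.8-era client code quoted later in this ticket; the exact line numbers (732/733 vs. 750/751 vs. 737/738) differ between the builds in the various logs:

    /* Before a request is (re)sent it must still be linked on the
     * import's unreplied list, and its xid must be newer than the
     * highest xid the client already knows to be replied. */
    if (list_empty(&request->rq_unreplied_list) ||
        request->rq_xid <= imp->imp_known_replied_xid) {
            DEBUG_REQ(D_ERROR, request, "xid: "LPU64", replied: "LPU64", "
                      "list_empty:%d\n", request->rq_xid,
                      imp->imp_known_replied_xid,
                      list_empty(&request->rq_unreplied_list));
            LBUG();
    }

In the dump above, xid 1541675108443980 is below the known-replied xid 1541675108452731 while the request is still on the unreplied list (list_empty:0), so it is the second condition that trips.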
| Comments |
| Comment by Peter Jones [ 10/Aug/16 ] |
|
Alex, could you please look into this one? Thanks, Peter |
| Comment by Kilian Cavalotti [ 17/Aug/16 ] |
|
We (Stanford Research Computing) have seen a very similar LBUG on a client (a single occurrence so far):

Aug 15 20:55:06 sherlock-ln03 kernel: Lustre: regal-OST000a-osc-ffff880332e3b000: Connection restored to regal-OST000a (at 10.210.34.203@o2ib1)
Aug 15 20:55:06 sherlock-ln03 kernel: Lustre: Skipped 1 previous similar message
Aug 15 20:55:06 sherlock-ln03 kernel: LustreError: 20432:0:(niobuf.c:737:ptl_send_rpc()) @@@ xid: 1542242077009292, replied: 1542242077009307, list_empty:0
Aug 15 20:55:06 sherlock-ln03 kernel: req@ffff880ed871c3c0 x1542242077009292/t0(0) o4->regal-OST004f-osc-ffff880332e3b000@10.210.34.216@o2ib1:6/4 lens 488/448 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
Aug 15 20:55:06 sherlock-ln03 kernel: LustreError: 20432:0:(niobuf.c:738:ptl_send_rpc()) LBUG
Aug 15 20:55:06 sherlock-ln03 kernel: Pid: 20432, comm: ptlrpcd_00_06
Aug 15 20:55:06 sherlock-ln03 kernel:
Aug 15 20:55:06 sherlock-ln03 kernel: Call Trace:
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa03d6895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa03d6e97>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07c1498>] ptl_send_rpc+0x408/0xe80 [ptlrpc]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07f80c4>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07c17d5>] ? ptl_send_rpc+0x745/0xe80 [ptlrpc]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07b5de1>] ptlrpc_send_new_req+0x511/0x9b0 [ptlrpc]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07ba280>] ptlrpc_check_set+0xa20/0x1ca0 [ptlrpc]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff8100ba4e>] ? common_interrupt+0xe/0x13
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07e8633>] ptlrpcd_check+0x3d3/0x610 [ptlrpc]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07b0416>] ? ptlrpc_set_next_timeout+0xc6/0x1a0 [ptlrpc]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07e8ae2>] ptlrpcd+0x272/0x4f0 [ptlrpc]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffffa07e8870>] ? ptlrpcd+0x0/0x4f0 [ptlrpc]
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
Aug 15 20:55:06 sherlock-ln03 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

That's using IEEL 3.0.0.0.

Cheers, |
| Comment by James A Simmons [ 23/Sep/16 ] |
|
We just hit it as well. If you need a dump we can provide it. |
| Comment by Jian Yu [ 12/Oct/16 ] |
|
Hi Alex, Could you please advise? Thank you. |
| Comment by parinay v kondekar (Inactive) [ 13/Oct/16 ] |
|
We have noticed a similar crash:

Lustre: 47121:0:(client.c:2093:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1472152273/real 1472152273] req@ffff8813bd6e0cc0 x1543375481248976/t0(0) o3->lustre02-OST0006-osc-ffff8813c2ad6400@10.253.108.30@tcp:6/4 lens 488/432 e 0 to 1 dl 1472152289 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Lustre: lustre02-OST0006-osc-ffff8813c2ad6400: Connection to lustre02-OST0006 (at 10.253.108.30@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: 47110:0:(client.c:2093:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1472152290/real 1472152290] req@ffff8805b92893c0 x1543375481250220/t0(0) o8->lustre02-OST0006-osc-ffff8813c2ad6400@10.253.108.30@tcp:28/4 lens 520/544 e 0 to 1 dl 1472152310 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre02-OST0006-osc-ffff8813c2ad6400: Connection restored to 10.253.108.31@tcp (at 10.253.108.31@tcp)
LustreError: 62105:0:(niobuf.c:750:ptl_send_rpc()) @@@ xid: 1543375481425516, replied: 1543375481435211, list_empty:0
req@ffff8805f42e43c0 x1543375481425516/t0(0) o13->lustre02-OST0006-osc-ffff8813c2ad6400@10.253.108.31@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
LustreError: 62094:0:(niobuf.c:751:ptl_send_rpc()) LBUG
Pid: 62094, comm: ssexec
LustreError: 62099:0:(niobuf.c:751:ptl_send_rpc()) LBUG
Call Trace:
LustreError: 62088:0:(niobuf.c:751:ptl_send_rpc()) LBUG
Pid: 62099, comm: ssexec
Pid: 62088, comm: ssexec
Call Trace:
Call Trace:
[<ffffffffa147f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa147f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa1480007>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa147f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa1480007>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa1480007>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa17a9478>] ptl_send_rpc+0x3f8/0xe70 [ptlrpc]
[<ffffffffa17a9478>] ptl_send_rpc+0x3f8/0xe70 [ptlrpc]
[<ffffffffa17a9478>] ptl_send_rpc+0x3f8/0xe70 [ptlrpc]
[<ffffffffa17e06a4>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
[<ffffffffa17e06a4>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
[<ffffffffa17e06a4>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
LustreError: 62086:0:(niobuf.c:751:ptl_send_rpc()) LBUG
Pid: 62086, comm: ssexec
Call Trace:
[<ffffffffa17a313c>] ptlrpc_check_set+0x17cc/0x1d90 [ptlrpc]
[<ffffffffa17a313c>] ptlrpc_check_set+0x17cc/0x1d90 [ptlrpc]
[<ffffffffa147f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa1480007>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa17a313c>] ptlrpc_check_set+0x17cc/0x1d90 [ptlrpc]
[<ffffffffa17a39ea>] ptlrpc_set_wait+0x2ea/0x9c0 [ptlrpc]
[<ffffffffa17a39ea>] ptlrpc_set_wait+0x2ea/0x9c0 [ptlrpc]
[<ffffffffa17a9478>] ptl_send_rpc+0x3f8/0xe70 [ptlrpc]
[<ffffffffa17a39ea>] ptlrpc_set_wait+0x2ea/0x9c0 [ptlrpc]
[<ffffffffa1798c20>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
[<ffffffffa1798c20>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
[<ffffffff81065df0>] ? default_wake_function+0x0/0x20
[<ffffffff81065df0>] ? default_wake_function+0x0/0x20
[<ffffffffa1798c20>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
[<ffffffffa17e06a4>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
[<ffffffff81065df0>] ? default_wake_function+0x0/0x20
[<ffffffffa1a997d4>] obd_statfs_rqset+0x144/0x560 [lustre]
[<ffffffffa1a997d4>] obd_statfs_rqset+0x144/0x560 [lustre]
[<ffffffffa1a9b51b>] ll_statfs_internal+0x2bb/0x7f0 [lustre]
[<ffffffffa1a9b51b>] ll_statfs_internal+0x2bb/0x7f0 [lustre]
[<ffffffffa17a313c>] ptlrpc_check_set+0x17cc/0x1d90 [ptlrpc]
[<ffffffffa1a997d4>] obd_statfs_rqset+0x144/0x560 [lustre]
[<ffffffffa1a9bae5>] ll_statfs+0x95/0x190 [lustre]
[<ffffffffa1a9bae5>] ll_statfs+0x95/0x190 [lustre]
[<ffffffff811bcdd4>] statfs_by_dentry+0x74/0xa0
[<ffffffff811bcdd4>] statfs_by_dentry+0x74/0xa0
[<ffffffff811bcf0b>] vfs_statfs+0x1b/0xb0
[<ffffffff811bcf0b>] vfs_statfs+0x1b/0xb0
[<ffffffffa1a9b51b>] ll_statfs_internal+0x2bb/0x7f0 [lustre]
[<ffffffff811bd107>] user_statfs+0x47/0xb0
[<ffffffff811bd107>] user_statfs+0x47/0xb0
[<ffffffffa17a39ea>] ptlrpc_set_wait+0x2ea/0x9c0 [ptlrpc]
[<ffffffff811bd20a>] sys_statfs+0x2a/0x50
[<ffffffff811bd20a>] sys_statfs+0x2a/0x50
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

(Several ssexec threads, pids 62086/62088/62094/62099, hit the LBUG at the same time, so their call traces are interleaved above.)

The code at niobuf.c:745-752:

    745     if (list_empty(&request->rq_unreplied_list) ||
    746         request->rq_xid <= imp->imp_known_replied_xid) {
    747             DEBUG_REQ(D_ERROR, request, "xid: "LPU64", replied: "LPU64", "
    748                       "list_empty:%d\n", request->rq_xid,
    749                       imp->imp_known_replied_xid,
    750                       list_empty(&request->rq_unreplied_list));
    751             LBUG();
    752     }

From the Lustre logs:

req@ffff881eaf9e90c0 x1543375481252856/t0(0) o13->lustre02-OST0006-osc-ffff8813c2ad6400@10.253.108.31@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
00000100:00020000:4.0F:1472152509.139389:0:62099:0:(niobuf.c:750:ptl_send_rpc()) @@@ xid: 1543375481269472, replied: 1543375481435211, list_empty:0

From crash:

crash> ptlrpc_request ffff881eaf9e90c0 | grep rq_xid
rq_xid = 1543375481252856,
crash>
crash> ptlrpc_request ffff881eaf9e90c0 | grep rq_import
rq_import = 0xffff881a34ef2000,
crash> obd_import 0xffff881a34ef2000 | grep imp_known_replied_xid
imp_known_replied_xid = 1543375481435211,
crash>
crash> ptlrpc_request ffff881eaf9e90c0 | grep -A 3 cr_unreplied_list
cr_unreplied_list = {
next = 0xffff880e1f320b48,
prev = 0xffff881f38437548
},
The issue was seen during compilebench runs from lustre/tests (1 out of 100 runs failed; the 87th run). JFYI, I re-ran the tests with |
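For anyone digging further: the invariant this LBUG enforces is maintained on the client roughly as sketched below. This is a paraphrase based on the unreplied-list code that came in with the multiple-modify-RPCs work; names such as ptlrpc_add_unreplied() and ptlrpc_known_replied_xid() are taken from that code and may differ slightly in a given branch, so treat this as a sketch rather than the exact source of any build in this ticket.

    /* Sketch: when an xid is assigned, the request is linked onto the
     * import's imp_unreplied_list, which is kept sorted by xid. */
    void ptlrpc_assign_next_xid_nolock(struct ptlrpc_request *req)
    {
            req->rq_xid = ptlrpc_next_xid();
            ptlrpc_add_unreplied(req);      /* insert into imp_unreplied_list */
    }

    /* Sketch: the "known replied" watermark is derived from the oldest
     * request still on the unreplied list; every xid below it is
     * considered already replied. */
    static __u64 ptlrpc_known_replied_xid(struct obd_import *imp)
    {
            struct ptlrpc_request *req;

            if (list_empty(&imp->imp_unreplied_list))
                    return 0;

            req = list_entry(imp->imp_unreplied_list.next,
                             struct ptlrpc_request, rq_unreplied_list);
            return req->rq_xid - 1;
    }

So the check in ptl_send_rpc() asserts that a request being sent is still tracked on that list and is newer than the watermark. The crash data above (rq_xid = 1543375481252856 vs imp_known_replied_xid = 1543375481435211, with the request still linked on a list) shows the watermark has moved past this request's xid, i.e. it is the second condition that trips.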
| Comment by James A Simmons [ 13/Oct/16 ] |
|
No, it does not fix this. We are running 2.8.0 with this patch already merged and still saw this LBUG. |
| Comment by parinay v kondekar (Inactive) [ 13/Oct/16 ] |
|
James, To understand this clearly, Thanks. |
| Comment by Niu Yawei (Inactive) [ 14/Oct/16 ] |
|
All these failures happened on requests to old OST servers, so I think it's highly likely related to
|
| Comment by Peter Jones [ 03/Nov/17 ] |
|
LLNL have flagged this as important. Is the situation still that this LBUG is only seen with 2.5.x servers and 2.8.x clients? And does this happen pretty quickly in such a configuration? |
| Comment by Olaf Faaland [ 06/Nov/17 ] |
|
Recently, we have seen this with 2.5.x servers and 2.8.x clients. Our general file system tests produced it a couple of times per day, but I do not know the volume of file system tests that ran before it appeared. They are a fairly small subset of our total test suite. |
| Comment by Alex Zhuravlev [ 09/Nov/17 ] |
|
Unfortunately I can't access the lustre-release-fe-llnl repo; could someone please make a tarball and attach it? |
| Comment by Peter Jones [ 09/Nov/17 ] |
|
Alex, please connect with Minh to get the necessary permissions. Peter |
| Comment by Alex Zhuravlev [ 10/Nov/17 ] |
|
I've ported |
| Comment by Olaf Faaland [ 16/Nov/17 ] |
|
Alex, it looks like something happened in the autotest system and you need to kick off testing again. There appear to be no test results. |
| Comment by Peter Jones [ 18/Dec/17 ] |
|
Patch ported and landed to 2.8 FE branch |