[LU-7558] niobuf.c:721:ptl_send_rpc() LASSERT(AT_OFF || imp_state != LUSTRE_IMP_FULL || imp_msghdr_flags & MSGHDR_AT_SUPPORT ...) Created: 15/Dec/15 Updated: 08/Apr/19 Resolved: 16/Jan/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.9.0, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
BG/Q I/O nodes |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
/bgsys/logs/BGQ.sn/R04-ID-J00.log (among many others):

LustreError: 28558:0: (niobuf.c:721:ptl_send_rpc()) ASSERTION( (at_max == 0) || request->rq_import->imp_state != LUSTRE_IMP_FULL || (request->rq_import->imp_msghdr_flags & 0x1) || ! (request->rq_import->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed:

This occurred on many tens of I/O nodes, then within the next 24 hours occurred on many tens more, and it is continuing to occur. We have not seen this issue before. The patch that introduced this assert (c389652) was in the patch stack for our tag 2.5.4-1chaos, rolled out in April. We do not know what triggered it now. There are no crash dumps for these nodes, nor much in the console logs. Because several conditions were ASSERTed in a single statement, it is unknown which one failed. |
| Comments |
| Comment by Olaf Faaland [ 15/Dec/15 ] |
|
Patch stack running on those nodes:
* ae35897 (tag: 2.5.4-4chaos) LU-2232 debug: print debug for prolonged lock
* 458c1f9 Revert "LU-2232 debug: lock handle in IO request may be stale"
* f69e69e (tag: 2.5.4-3chaos) LU-6389 llite: restart short read/write for normal IO
* 1e8f823 Revert "LU-6389 llite: restart short read/write for normal IO"
* 3028245 (tag: 2.5.4-2chaos) LU-6152 osd-zfs: ZFS large block compat
* fb6b94c Revert "LU-5053 ptlrpc: Add schedule point to ptlrpc_check_set()"
* 0dc7041 Fix ldiskfs source autodetect for CentOS 6
* 42edddb LU-6536 llapi: lmm_stripe_count used unswabbed
* 3d25e65 LU-2182 llapi: implementation of new llapi_layout API
* fe5c3dd LLNL-0000 llapi: get OST count from proc
* c1c8672 LU-4107 build: fix lustre_user.h to C++ compatible
* f139e17 LU-5042 ldlm: delay filling resource's LVB upon replay
* 35e5a78 (tag: 2.5.4-1chaos) LU-6389 llite: restart short read/write for normal IO
* a25d1d1 LU-1032 build: Honor --disable-modules option in spec file
* 01df569 LU-1032 build: Add Lustre DKMS spec file
* 1f47ba6 LU-6038 osd-zfs: sa_spill_alloc()/sa_spill_free() compat
* 16b79c6 LU-6038 osd-zfs: Avoid redefining KM_SLEEP
* 36eca32 LU-5326 libcfs: remove umode_t typedef
* 685c9ac LU-3353 ptlrpc: Suppress error message when imp_sec is freed
* 9ea4c83 LU-5984 obd: fix lastid file name in compat code
* c389652 LU-5528 ptlrpc: fix race between connect vs resend
* bba812b LU-5579 ldlm: re-sent enqueue vs lock destroy race |
| Comment by Olaf Faaland [ 15/Dec/15 ] |
|
The console logs indicate a pair of OSSs (a failover pair, in fact) had crashed and come back up. This error occurred when they had timed out of recovery and evicted the clients. There's nothing else in the console log. The nodes are unresponsive after they hit this LBUG, so I'm not able to gather more information unless I can reproduce the situation while monitoring the node. It's not clear right now how to do that without affecting the whole machine. It seems wrong to ASSERT anything to do with connection state when not holding a lock on the import. It's also not helpful to ASSERT several conditions collectively instead of individually, so that it would be clearer which state was bad. So I'm looking into changing the code to return failure instead. |
| Comment by Olaf Faaland [ 15/Dec/15 ] |
|
My statement about ASSERTing collectively is mistaken - it is OR, not AND. But this still doesn't produce enough useful information, so the patch will need to change that. I don't understand why they chose to ASSERT here instead of failing. There is discussion of this ASSERT in the code review for the patch, but I don't see a rationale for crashing the node. |
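As a rough illustration only (not the real obd_import code), the change being discussed could look something like the sketch below: each condition of the combined LASSERT is checked separately, the failing one is logged, and an error is returned instead of LBUGging the node. The names fake_import and check_at_consistency() are invented stand-ins for illustration.

```c
/*
 * Minimal standalone sketch of "return failure instead of ASSERT".
 * fake_import and check_at_consistency() are simplified stand-ins;
 * the real fields live in struct obd_import.
 */
#include <stdio.h>
#include <errno.h>

#define MSGHDR_AT_SUPPORT	0x1
#define OBD_CONNECT_AT		0x1000000ULL

enum imp_state { LUSTRE_IMP_CONNECTING, LUSTRE_IMP_FULL };

struct fake_import {
	enum imp_state		imp_state;
	unsigned int		imp_msghdr_flags;
	unsigned long long	ocd_connect_flags;
};

static int at_max = 600;	/* adaptive timeouts enabled when non-zero */

static int check_at_consistency(const struct fake_import *imp)
{
	if (at_max == 0)
		return 0;	/* AT disabled, nothing to verify */
	if (imp->imp_state != LUSTRE_IMP_FULL)
		return 0;	/* import not fully connected yet */
	if (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT)
		return 0;	/* flags consistent with connect data */
	if (!(imp->ocd_connect_flags & OBD_CONNECT_AT))
		return 0;	/* peer never negotiated AT support */

	fprintf(stderr,
		"import is FULL and peer negotiated AT, but MSGHDR_AT_SUPPORT is not set (connect flags %#llx)\n",
		imp->ocd_connect_flags);
	return -EPROTO;		/* fail the request instead of LBUGging */
}

int main(void)
{
	struct fake_import imp = {
		.imp_state = LUSTRE_IMP_FULL,
		.imp_msghdr_flags = 0,			/* inconsistent on purpose */
		.ocd_connect_flags = OBD_CONNECT_AT,
	};

	printf("check_at_consistency() -> %d\n", check_at_consistency(&imp));
	return 0;
}
```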
| Comment by Peter Jones [ 15/Dec/15 ] |
|
Mike, could you please advise? Peter |
| Comment by Christopher Morrone [ 15/Dec/15 ] |
|
This bug is taking out a large number of the I/O Nodes on Sequoia, our largest production supercomputer, so I bumped the severity to 1. |
| Comment by Peter Jones [ 15/Dec/15 ] |
|
Chris Just to acknowledge this - I am discussing with engineering Peter |
| Comment by John Hammond [ 15/Dec/15 ] |
|
Is 2.5.4-15chaos available somewhere? I do not see it on https://github.com/chaos/lustre. |
| Comment by Peter Jones [ 15/Dec/15 ] |
|
John, I can direct you to it. Peter |
| Comment by Peter Jones [ 15/Dec/15 ] |
|
Olaf, when did the last software update get applied to these nodes, and which tag preceded it? Have any changes to the configuration been made in recent weeks? Peter |
| Comment by Christopher Morrone [ 15/Dec/15 ] |
|
The nodes are only running 2.5.4-4chaos; the newer tags haven't made it to the BG/Q systems. |
| Comment by Christopher Morrone [ 15/Dec/15 ] |
|
No software changes have happened on these BG/Q clients in months. No recent change on the servers either. |
| Comment by Olaf Faaland [ 15/Dec/15 ] |
|
Servers are at lustre-2.5.4-13chaos |
| Comment by Hongchao Zhang [ 16/Dec/15 ] |
|
The possible case that could cause this problem (though I'm not sure about it!) is a race around setting the state of the obd_import in "ptlrpc_connect_import", where the "imp_state" is set to "LUSTRE_IMP_CONNECTING". If the check of the "imp_state" races with

rc = ptlrpc_resend(imp);
IMPORT_SET_STATE(imp, LUSTRE_IMP_FULL);

then in thread1 the imp_state is set to LUSTRE_IMP_FULL, and the LBUG is triggered. |
| Comment by Mikhail Pershin [ 16/Dec/15 ] |
|
This assertion was added by the patch for |
| Comment by Olaf Faaland [ 16/Dec/15 ] |
|
What is special about LUSTRE_IMP_FULL? Why would the assertion check for that state, but none of the other import states like _EVICTED or _CLOSED? Also, there are many locations in the code within ptl_send_rpc() where something can cause the function to return an error. Do you know why the person who wrote the patch chose to ASSERT on import state instead of returning an error? Thanks. |
| Comment by Mikhail Pershin [ 16/Dec/15 ] |
|
Olaf, it looks to me like the assertion was added to prevent situations similar to |
| Comment by Olaf Faaland [ 16/Dec/15 ] |
|
Mikhail, yes, we could revert the patch. I need to talk it over with Chris and the BG team. It's trickier than it normally would be, because lots of people are out of the office next week and so it's difficult to handle anything new that comes up. Should we add any debug code? |
| Comment by Mikhail Pershin [ 16/Dec/15 ] |
|
Olaf, let me think a bit; I will provide a patch with debugging instead of the assertion. |
| Comment by Mikhail Pershin [ 16/Dec/15 ] |
|
I agree with Hongchao: this assertion failure is the result of the race he mentioned, and that race itself was introduced by |
| Comment by Mikhail Pershin [ 16/Dec/15 ] |
|
Olaf, I think it is better to revert this commit. Even if we remove that assertion, the import will still have wrong flags, causing all requests to behave wrongly, so that situation is no better than before the commit. Meanwhile I will think about how to solve this properly; I don't see a quick solution so far. The possible solution I have in mind is to add a new import bool value, imp_connected, which would indicate that the import has reconnected (and the imp flags are already updated) even if it is not in the FULL state yet. Then ptlrpc_connect_import() would not try to initiate a new connect, by checking this flag (it currently checks the import state), and would not clear the flags of an import that is connected but not FULL yet. I am not sure this is the only solution; it has to be discussed. |
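A rough, standalone sketch of that imp_connected idea (fake_import and maybe_start_connect() are invented stand-ins for illustration, not the actual ptlrpc_connect_import() change): once a connect reply has been handled and the negotiated flags are valid, the flag prevents a second connect attempt from clearing them even though the import is not FULL yet.

```c
/* Standalone sketch of the proposed imp_connected gate, using simplified
 * stand-ins rather than the real obd_import / ptlrpc_connect_import(). */
#include <stdbool.h>
#include <stdio.h>

#define MSGHDR_AT_SUPPORT	0x1

enum imp_state { LUSTRE_IMP_CONNECTING, LUSTRE_IMP_RECOVER, LUSTRE_IMP_FULL };

struct fake_import {
	enum imp_state	imp_state;
	bool		imp_connected;	/* proposed: connect reply handled, flags valid */
	unsigned int	imp_msghdr_flags;
};

/* Hypothetical gate modelled on the proposal: do not start a new connect
 * (which would wipe the negotiated flags) if the previous connect already
 * succeeded, even though imp_state has not reached LUSTRE_IMP_FULL yet. */
static int maybe_start_connect(struct fake_import *imp)
{
	if (imp->imp_connected) {
		printf("connect already done, keeping msghdr_flags %#x\n",
		       imp->imp_msghdr_flags);
		return 0;
	}

	/* the real code would reset negotiated flags and send a connect RPC */
	imp->imp_msghdr_flags = 0;
	printf("starting new connect, negotiated flags cleared\n");
	return 1;
}

int main(void)
{
	struct fake_import imp = {
		.imp_state = LUSTRE_IMP_RECOVER,	/* not FULL yet */
		.imp_connected = true,			/* reply already interpreted */
		.imp_msghdr_flags = MSGHDR_AT_SUPPORT,
	};

	/* flags survive the second connect attempt because imp_connected is set */
	maybe_start_connect(&imp);
	return 0;
}
```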
| Comment by Olaf Faaland [ 16/Dec/15 ] |
|
Mikhail, OK, we are reverting this patch for now. |
| Comment by Christopher Morrone [ 25/Mar/16 ] |
|
This is still broken on master and b2_8. Since we hope to move to 2.8 in the near future, we would really like to see this fixed as soon as possible. |
| Comment by Gerrit Updater [ 04/Apr/16 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/19312 |
| Comment by Mikhail Pershin [ 04/Apr/16 ] |
|
The patch implements the approach I mentioned in my comment above. It blocks new connect requests while the current connect interpret callback is running, preventing the import connection flags from being reset. |
| Comment by Gerrit Updater [ 02/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19312/ |
| Comment by Joseph Gmitter (Inactive) [ 03/May/16 ] |
|
Landed to master for 2.9.0 |
| Comment by Alexey Lyashkov [ 20/Jul/16 ] |
|
Mikhail, I think it is not a very good approach. It's easy to use a simple change like |
| Comment by Oleg Drokin [ 06/Aug/18 ] |
|
I am still hitting this regularly in my testing.

[16176.517655] LustreError: 2151:0:(client.c:1179:ptlrpc_import_delay_req()) @@@ invalidate in flight req@ffff8802f4d5fc80 x1608074126206960/t0(0) o38->lustre-MDT0000-mdc-ffff8802acac9800@0@lo:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1
[16178.021121] LustreError: 2309:0:(niobuf.c:782:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed:
[16178.032291] LustreError: 2309:0:(niobuf.c:782:ptl_send_rpc()) LBUG
[16178.033617] CPU: 1 PID: 2309 Comm: rm Kdump: loaded Tainted: P OE ------------ 3.10.0-7.5-debug #1
[16178.036021] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[16178.037385] Call Trace:
[16178.038659] [<ffffffff8176fc9a>] dump_stack+0x19/0x1b
[16178.042848] [<ffffffffa01b37c2>] libcfs_call_trace+0x72/0x80 [libcfs]
[16178.044440] [<ffffffffa01b384c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[16178.046617] [<ffffffffa05854e9>] ptl_send_rpc+0xb79/0xe80 [ptlrpc]
[16178.047953] [<ffffffffa01b9f97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[16178.049215] [<ffffffffa0579d40>] ptlrpc_send_new_req+0x460/0xa70 [ptlrpc]
[16178.050257] [<ffffffffa057e819>] ptlrpc_set_wait+0x289/0x790 [ptlrpc]
[16178.051188] [<ffffffff810630a3>] ? kvm_clock_read+0x33/0x40
[16178.052007] [<ffffffff810630b9>] ? kvm_clock_get_cycles+0x9/0x10
[16178.053405] [<ffffffffa058abda>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]
[16178.054486] [<ffffffffa057ed9d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[16178.055636] [<ffffffffa08a3ca7>] mdc_reint+0x57/0x160 [mdc]
[16178.056599] [<ffffffffa08a4df6>] mdc_unlink+0x206/0x450 [mdc]
[16178.057700] [<ffffffffa0500125>] lmv_unlink+0x5a5/0x950 [lmv]
[16178.059013] [<ffffffffa146c764>] ? ll_i2gids+0x24/0xb0 [lustre]
[16178.060357] [<ffffffffa1471b31>] ll_unlink+0x171/0x5e0 [lustre]
[16178.061621] [<ffffffff8121d326>] vfs_unlink+0x106/0x190
[16178.062807] [<ffffffff8121ff6e>] do_unlinkat+0x26e/0x2b0
[16178.063937] [<ffffffff8178387b>] ? system_call_after_swapgs+0xc8/0x160
[16178.065792] [<ffffffff8178386f>] ? system_call_after_swapgs+0xbc/0x160
[16178.068348] [<ffffffff8178387b>] ? system_call_after_swapgs+0xc8/0x160
[16178.069707] [<ffffffff8178386f>] ? system_call_after_swapgs+0xbc/0x160
[16178.070768] [<ffffffff8178387b>] ? system_call_after_swapgs+0xc8/0x160
[16178.071896] [<ffffffff8178386f>] ? system_call_after_swapgs+0xbc/0x160
[16178.075009] [<ffffffff81220eab>] SyS_unlinkat+0x1b/0x40
[16178.076048] [<ffffffff81783929>] system_call_fastpath+0x16/0x1b
[16178.077068] [<ffffffff8178387b>] ? system_call_after_swapgs+0xc8/0x160

It fails in multiple different tests too, so it's not any particular testcase. |
| Comment by Gerrit Updater [ 26/Nov/18 ] |
|
Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/33718 |
| Comment by Gerrit Updater [ 16/Jan/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33718/ |
| Comment by Joseph Gmitter (Inactive) [ 16/Jan/19 ] |
|
Landed for 2.13.0 |
| Comment by Gerrit Updater [ 23/Feb/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34290 |
| Comment by Gerrit Updater [ 25/Feb/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34293 |
| Comment by Gerrit Updater [ 02/Mar/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34290/ |
| Comment by Gerrit Updater [ 08/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34293/ |