[LU-9628] LBUG (niobuf.c:773:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & 0x1) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed: Created: 09/Jun/17  Updated: 01/Aug/19  Resolved: 23/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: soak
Environment:

Soak cluster


Attachments: Text File vmcore-dmesg.txt    
Issue Links:
Duplicate
duplicates LU-7558 niobuf.c:721:ptl_send_rpc() LASSERT(A... Resolved
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Soak client running soak test

Jun  9 12:08:56 soak-16 systemd-logind: Removed session 1396.
Jun  9 12:09:24 soak-16 kernel: Lustre: soaked-MDT0000-mdc-ffff880828cb2000: Connection restored to 192.168.1.108@o2ib10 (at 192.168.1.108@o2ib10)
Jun  9 12:09:24 soak-16 kernel: LustreError: 11-0: soaked-OST0003-osc-ffff880828cb2000: operation ldlm_enqueue to node 192.168.1.104@o2ib10 failed: rc = -19
Jun  9 12:09:24 soak-16 kernel: LustreError: 2947:0:(import.c:671:ptlrpc_connect_import()) already connecting
Jun  9 12:09:25 soak-16 kernel: LustreError: 11-0: soaked-OST0003-osc-ffff880828cb2000: operation ldlm_enqueue to node 192.168.1.104@o2ib10 failed: rc = -107
Jun  9 12:09:25 soak-16 kernel: LustreError: Skipped 12826 previous similar messages
Jun  9 12:09:25 soak-16 kernel: LustreError: 2947:0:(import.c:671:ptlrpc_connect_import()) already connecting
Jun  9 12:09:25 soak-16 kernel: LustreError: 2947:0:(import.c:671:ptlrpc_connect_import()) Skipped 13048 previous similar messages
Jun  9 12:09:26 soak-16 kernel: LustreError: 167-0: soaked-OST0003-osc-ffff880828cb2000: This client was evicted by soaked-OST0003; in progress operations using this service will fail.
Jun  9 12:09:26 soak-16 kernel: LustreError: 2960:0:(client.c:1189:ptlrpc_import_delay_req()) @@@ invalidate in flight  req@ffff8805e1cf3900 x1569656575292352/t0(0) o101->soaked-OST0003-osc-ffff880828cb2000@192.168.1.104@o2ib10:28/4 lens 328/400 e 0 to 1 dl 1497009748 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jun  9 12:09:26 soak-16 kernel: LustreError: 2947:0:(client.c:1176:ptlrpc_import_delay_req()) @@@ invalidate in flight  req@ffff880624b19800 x1569656617848704/t0(0) o8->soaked-OST0003-osc-ffff880828cb2000@192.168.1.104@o2ib10:28/4 lens 520/544 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1
Jun  9 12:09:56 soak-16 kernel: LustreError: 2960:0:(niobuf.c:773:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & 0x1) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed:
Jun  9 12:09:56 soak-16 kernel: LustreError: 2960:0:(niobuf.c:773:ptl_send_rpc()) LBUG
[74364.267656] LustreError: 217188:0:(niobuf.c:773:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & 0x1) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed:
[74364.267659] LustreError: 217188:0:(niobuf.c:773:ptl_send_rpc()) LBUG
[74364.267661] Pid: 217188, comm: df
[74364.267661]
Call Trace:
[74364.267685]  [<ffffffffa08287ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[74364.267695]  [<ffffffffa082887c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[74364.267747]  [<ffffffffa0b61c4f>] ptl_send_rpc+0xb1f/0xe60 [ptlrpc]
[74364.267815]  [<ffffffffa0b95203>] ? sptlrpc_req_refresh_ctx+0x153/0x900 [ptlrpc]
[74364.267856]  [<ffffffffa0b570f0>] ptlrpc_send_new_req+0x460/0xa60 [ptlrpc]
[74364.267894]  [<ffffffffa0b5bcc1>] ptlrpc_set_wait+0x3d1/0x900 [ptlrpc]
[74364.267906]  [<ffffffffa0e1a45d>] ? osc_statfs_async+0xfd/0x1e0 [osc]
[74364.267919]  [<ffffffffa0cd5e67>] ? lov_statfs_async+0xe7/0x730 [lov]
[74364.267928]  [<ffffffff811dd065>] ? kmem_cache_alloc_node_trace+0x125/0x220
[74364.267955]  [<ffffffffa0d7800d>] ll_statfs_internal+0x35d/0xf30 [lustre]
[74364.267959]  [<ffffffff812094ac>] ? lookup_fast+0xcc/0x2e0 
[74364.267963]  [<ffffffff8120bd83>] ? path_lookupat+0x83/0x7a0
[74364.267966]  [<ffffffff8120be16>] ? path_lookupat+0x116/0x7a0
[74364.267979]  [<ffffffffa0ebb798>] ? _nfs4_proc_statfs+0xc8/0xf0 [nfsv4]
[74364.268023]  [<ffffffffa0968519>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[74364.268044]  [<ffffffffa0d78c64>] ll_statfs+0x84/0x180 [lustre]
[74364.268047]  [<ffffffff8120ed4d>] ? putname+0x3d/0x60
[74364.268052]  [<ffffffff812312b1>] statfs_by_dentry+0xa1/0x140
[74364.268054]  [<ffffffff8123136b>] vfs_statfs+0x1b/0xb0
[74364.268056]  [<ffffffff81231455>] user_statfs+0x55/0xa0
[74364.268059]  [<ffffffff812314c7>] SYSC_statfs+0x27/0x60
[74364.268062]  [<ffffffff812316ce>] SyS_statfs+0xe/0x10
[74364.268068]  [<ffffffff81696b09>] system_call_fastpath+0x16/0x1b
[74364.268069]
[74364.268070] Kernel panic - not syncing: LBUG

Crash dump is available on soak at /scratch/dumps/soak-16
vmcore-dmesg attached



 Comments   
Comment by Oleg Drokin [ 24/Aug/17 ]

I jut hit this assertion on my testbed, but the stacktrace is a bit different.

[39956.774374] Lustre: DEBUG MARKER: == recovery-small test 29a: error adding new clients doesn't cause LBUG (bug 22273) ================== 22:11:33 (1503540693)
[39957.040948] Lustre: Failing over lustre-MDT0000
[39957.630050] Lustre: server umount lustre-MDT0000 complete
[39961.483178] LustreError: 166-1: MGC192.168.123.166@tcp: Connection to MGS (at 192.168.123.166@tcp) was lost; in progress operations using this service will fail
[39961.495128] Lustre: Evicted from MGS (at 192.168.123.166@tcp) after server handle changed from 0xaf506984f6812737 to 0xaf506984f6812967
[39961.508960] LustreError: 29451:0:(niobuf.c:776:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & 0x1) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed: 
[39961.511060] LustreError: 29451:0:(niobuf.c:776:ptl_send_rpc()) LBUG
[39961.511718] Pid: 29451, comm: ll_cfg_requeue
[39961.513004] 
Call Trace:
[39961.514239]  [<ffffffffa02187ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[39961.514937]  [<ffffffffa021885c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[39961.515774]  [<ffffffffa05f8a82>] ptl_send_rpc+0xb82/0xec0 [ptlrpc]
[39961.516456]  [<ffffffff81383cf9>] ? snprintf+0x49/0x70
[39961.517160]  [<ffffffffa0223cb7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[39961.517716]  [<ffffffffa05ed030>] ptlrpc_send_new_req+0x480/0xa90 [ptlrpc]
[39961.518300]  [<ffffffffa05f1d41>] ptlrpc_set_wait+0x3d1/0x900 [ptlrpc]
[39961.519663]  [<ffffffffa03a3255>] ? lustre_get_jobid+0x215/0x4d0 [obdclass]
[39961.520556]  [<ffffffffa05fe0b5>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc]
[39961.521426]  [<ffffffffa05f22f7>] ptlrpc_queue_wait+0x87/0x230 [ptlrpc]
[39961.522688]  [<ffffffffa05ce426>] ldlm_cli_enqueue+0x686/0x810 [ptlrpc]
[39961.523845]  [<ffffffffa05c9200>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[39961.524621]  [<ffffffffa0595a10>] ? mgc_blocking_ast+0x0/0x7e0 [mgc]
[39961.525315]  [<ffffffffa059160a>] mgc_enqueue.isra.7.constprop.17+0x19a/0x320 [mgc]
[39961.526531]  [<ffffffffa0595a10>] ? mgc_blocking_ast+0x0/0x7e0 [mgc]
[39961.530457]  [<ffffffffa05c9200>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[39961.531168]  [<ffffffffa05973a3>] mgc_process_log+0x183/0x890 [mgc]
[39961.531866]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
[39961.532519]  [<ffffffffa05999e0>] mgc_requeue_thread+0x2c0/0x870 [mgc]
[39961.533296]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
[39961.534927]  [<ffffffffa0599720>] ? mgc_requeue_thread+0x0/0x870 [mgc]
[39961.535738]  [<ffffffff810a2eba>] kthread+0xea/0xf0
[39961.536395]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[39961.537054]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
[39961.537702]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
[39961.538352] 
[39961.539261] Kernel panic - not syncing: LBUG

this is on current lustre-master, crashdump is available

Comment by James Nunez (Inactive) [ 13/Aug/18 ]

replay-single test_121 crash. Same assertion, stack trace is a bit different; https://testing.whamcloud.com/test_sets/084a0e64-9d95-11e8-8ee3-52540065bddc. Core file can be found in /scratch/dumps/trevis-10vm1.trevis.whamcloud.com/10.9.4.107-2018-08-11-00:59:14 .

From kernel-crash log

[52958.205573] Lustre: DEBUG MARKER: == replay-single test 121: lock replay timed out and race ============================================ 00:55:47 (1533948947)
[52976.888014] LustreError: 4047:0:(import.c:673:ptlrpc_connect_import()) already connecting
[53155.279068] LustreError: 4047:0:(client.c:1179:ptlrpc_import_delay_req()) @@@ invalidate in flight  req@ffff8aa5771b3000 x1608454336137216/t0(0) o38->lustre-MDT0000-mdc-ffff8aa59f230000@10.9.4.110@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1
[53156.576261] LustreError: 18862:0:(niobuf.c:782:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed: 
[53156.578612] LustreError: 18862:0:(niobuf.c:782:ptl_send_rpc()) LBUG
[53156.579251] CPU: 0 PID: 18862 Comm: lfs Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
[53156.580457] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[53156.581117] Call Trace:
[53156.581472]  [<ffffffffb690e84e>] dump_stack+0x19/0x1b
[53156.582040]  [<ffffffffc07657b2>] libcfs_call_trace+0x72/0x80 [libcfs]
[53156.582744]  [<ffffffffc076583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[53156.583491]  [<ffffffffc0ab225b>] ptl_send_rpc+0xb6b/0xe70 [ptlrpc]
[53156.585442]  [<ffffffffc0aa7730>] ptlrpc_send_new_req+0x460/0xa70 [ptlrpc]
[53156.586206]  [<ffffffffc0aac211>] ptlrpc_set_wait+0x291/0x790 [ptlrpc]
[53156.588327]  [<ffffffffc0aac78d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[53156.589081]  [<ffffffffc0a916f2>] ldlm_cli_enqueue+0x3d2/0x920 [ptlrpc]
[53156.592174]  [<ffffffffc0bed71a>] mdc_enqueue_base+0x30a/0x1c10 [mdc]
[53156.592825]  [<ffffffffc0bef77f>] mdc_intent_lock+0x12f/0x560 [mdc]
[53156.595867]  [<ffffffffc0a49bc9>] lmv_intent_lock+0x589/0x1980 [lmv]
[53156.599678]  [<ffffffffc0d46de9>] ll_inode_revalidate+0x169/0x800 [lustre]
[53156.600388]  [<ffffffffc0d474df>] ll_getattr+0x5f/0x700 [lustre]
[53156.602227]  [<ffffffffb6420e09>] vfs_getattr+0x49/0x80
[53156.602835]  [<ffffffffb6420f35>] vfs_fstatat+0x75/0xc0
[53156.603367]  [<ffffffffb64214f1>] SYSC_newlstat+0x31/0x60
[53156.611484]  [<ffffffffb642177e>] SyS_newlstat+0xe/0x10
[53156.612068]  [<ffffffffb6920795>] system_call_fastpath+0x1c/0x21
[53156.613337] Kernel panic - not syncing: LBUG
Generated at Sat Feb 10 02:27:51 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.