Lustre / LU-9628

LBUG (niobuf.c:773:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & 0x1) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed:

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.0, Lustre 2.12.0
    • Environment: Soak cluster
    • Severity: 3

    Description

      Soak client hit the following LBUG while running a soak test:

      Jun  9 12:08:56 soak-16 systemd-logind: Removed session 1396.
      Jun  9 12:09:24 soak-16 kernel: Lustre: soaked-MDT0000-mdc-ffff880828cb2000: Connection restored to 192.168.1.108@o2ib10 (at 192.168.1.108@o2ib10)
      Jun  9 12:09:24 soak-16 kernel: LustreError: 11-0: soaked-OST0003-osc-ffff880828cb2000: operation ldlm_enqueue to node 192.168.1.104@o2ib10 failed: rc = -19
      Jun  9 12:09:24 soak-16 kernel: LustreError: 2947:0:(import.c:671:ptlrpc_connect_import()) already connecting
      Jun  9 12:09:25 soak-16 kernel: LustreError: 11-0: soaked-OST0003-osc-ffff880828cb2000: operation ldlm_enqueue to node 192.168.1.104@o2ib10 failed: rc = -107
      Jun  9 12:09:25 soak-16 kernel: LustreError: Skipped 12826 previous similar messages
      Jun  9 12:09:25 soak-16 kernel: LustreError: 2947:0:(import.c:671:ptlrpc_connect_import()) already connecting
      Jun  9 12:09:25 soak-16 kernel: LustreError: 2947:0:(import.c:671:ptlrpc_connect_import()) Skipped 13048 previous similar messages
      Jun  9 12:09:26 soak-16 kernel: LustreError: 167-0: soaked-OST0003-osc-ffff880828cb2000: This client was evicted by soaked-OST0003; in progress operations using this service will fail.
      Jun  9 12:09:26 soak-16 kernel: LustreError: 2960:0:(client.c:1189:ptlrpc_import_delay_req()) @@@ invalidate in flight  req@ffff8805e1cf3900 x1569656575292352/t0(0) o101->soaked-OST0003-osc-ffff880828cb2000@192.168.1.104@o2ib10:28/4 lens 328/400 e 0 to 1 dl 1497009748 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      Jun  9 12:09:26 soak-16 kernel: LustreError: 2947:0:(client.c:1176:ptlrpc_import_delay_req()) @@@ invalidate in flight  req@ffff880624b19800 x1569656617848704/t0(0) o8->soaked-OST0003-osc-ffff880828cb2000@192.168.1.104@o2ib10:28/4 lens 520/544 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1
      Jun  9 12:09:56 soak-16 kernel: LustreError: 2960:0:(niobuf.c:773:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & 0x1) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed:
      Jun  9 12:09:56 soak-16 kernel: LustreError: 2960:0:(niobuf.c:773:ptl_send_rpc()) LBUG
      
      [74364.267656] LustreError: 217188:0:(niobuf.c:773:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & 0x1) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed:
      [74364.267659] LustreError: 217188:0:(niobuf.c:773:ptl_send_rpc()) LBUG
      [74364.267661] Pid: 217188, comm: df
      [74364.267661]
      Call Trace:
      [74364.267685]  [<ffffffffa08287ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [74364.267695]  [<ffffffffa082887c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [74364.267747]  [<ffffffffa0b61c4f>] ptl_send_rpc+0xb1f/0xe60 [ptlrpc]
      [74364.267815]  [<ffffffffa0b95203>] ? sptlrpc_req_refresh_ctx+0x153/0x900 [ptlrpc]
      [74364.267856]  [<ffffffffa0b570f0>] ptlrpc_send_new_req+0x460/0xa60 [ptlrpc]
      [74364.267894]  [<ffffffffa0b5bcc1>] ptlrpc_set_wait+0x3d1/0x900 [ptlrpc]
      [74364.267906]  [<ffffffffa0e1a45d>] ? osc_statfs_async+0xfd/0x1e0 [osc]
      [74364.267919]  [<ffffffffa0cd5e67>] ? lov_statfs_async+0xe7/0x730 [lov]
      [74364.267928]  [<ffffffff811dd065>] ? kmem_cache_alloc_node_trace+0x125/0x220
      [74364.267955]  [<ffffffffa0d7800d>] ll_statfs_internal+0x35d/0xf30 [lustre]
      [74364.267959]  [<ffffffff812094ac>] ? lookup_fast+0xcc/0x2e0 
      [74364.267963]  [<ffffffff8120bd83>] ? path_lookupat+0x83/0x7a0
      [74364.267966]  [<ffffffff8120be16>] ? path_lookupat+0x116/0x7a0
      [74364.267979]  [<ffffffffa0ebb798>] ? _nfs4_proc_statfs+0xc8/0xf0 [nfsv4]
      [74364.268023]  [<ffffffffa0968519>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [74364.268044]  [<ffffffffa0d78c64>] ll_statfs+0x84/0x180 [lustre]
      [74364.268047]  [<ffffffff8120ed4d>] ? putname+0x3d/0x60
      [74364.268052]  [<ffffffff812312b1>] statfs_by_dentry+0xa1/0x140
      [74364.268054]  [<ffffffff8123136b>] vfs_statfs+0x1b/0xb0
      [74364.268056]  [<ffffffff81231455>] user_statfs+0x55/0xa0
      [74364.268059]  [<ffffffff812314c7>] SYSC_statfs+0x27/0x60
      [74364.268062]  [<ffffffff812316ce>] SyS_statfs+0xe/0x10
      [74364.268068]  [<ffffffff81696b09>] system_call_fastpath+0x16/0x1b
      [74364.268069]
      [74364.268070] Kernel panic - not syncing: LBUG
      

      Crash dump is available on soak at /scratch/dumps/soak-16
      vmcore-dmesg attached
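
      For reference, the failed check at niobuf.c:773 is ptl_send_rpc()'s sanity check on adaptive-timeout (AT) negotiation. A sketch of it with the raw flag values replaced by their symbolic names (0x1 is MSGHDR_AT_SUPPORT, as the replay-single trace further down spells out, and 0x1000000ULL corresponds to OBD_CONNECT_AT):

      LASSERT(at_max == 0 ||                        /* adaptive timeouts disabled */
              imp->imp_state != LUSTRE_IMP_FULL ||  /* import not fully connected */
              (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) ||               /* AT advertised in msg headers */
              !(imp->imp_connect_data.ocd_connect_flags & OBD_CONNECT_AT)); /* server never negotiated AT */

      In other words, the assertion fires when an RPC is sent on an import that is still LUSTRE_IMP_FULL and negotiated AT with the server, but whose message-header flags no longer carry MSGHDR_AT_SUPPORT, which fits the "already connecting" / "invalidate in flight" messages logged immediately before the crash.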

      Attachments

        Issue Links

          Activity

            adilger Andreas Dilger made changes -
            Resolution: Duplicate
            Status: Open → Resolved
            adilger Andreas Dilger made changes -
            Link: This issue duplicates LU-7558
            mdiep Minh Diep made changes -
            Link: This issue is related to EDU-107
            jamesanunez James Nunez (Inactive) added a comment - - edited

            replay-single test_121 crash. Same assertion; the stack trace is a bit different: https://testing.whamcloud.com/test_sets/084a0e64-9d95-11e8-8ee3-52540065bddc. The core file can be found in /scratch/dumps/trevis-10vm1.trevis.whamcloud.com/10.9.4.107-2018-08-11-00:59:14.

            From the kernel crash log:

            [52958.205573] Lustre: DEBUG MARKER: == replay-single test 121: lock replay timed out and race ============================================ 00:55:47 (1533948947)
            [52976.888014] LustreError: 4047:0:(import.c:673:ptlrpc_connect_import()) already connecting
            [53155.279068] LustreError: 4047:0:(client.c:1179:ptlrpc_import_delay_req()) @@@ invalidate in flight  req@ffff8aa5771b3000 x1608454336137216/t0(0) o38->lustre-MDT0000-mdc-ffff8aa59f230000@10.9.4.110@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1
            [53156.576261] LustreError: 18862:0:(niobuf.c:782:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed: 
            [53156.578612] LustreError: 18862:0:(niobuf.c:782:ptl_send_rpc()) LBUG
            [53156.579251] CPU: 0 PID: 18862 Comm: lfs Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
            [53156.580457] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [53156.581117] Call Trace:
            [53156.581472]  [<ffffffffb690e84e>] dump_stack+0x19/0x1b
            [53156.582040]  [<ffffffffc07657b2>] libcfs_call_trace+0x72/0x80 [libcfs]
            [53156.582744]  [<ffffffffc076583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [53156.583491]  [<ffffffffc0ab225b>] ptl_send_rpc+0xb6b/0xe70 [ptlrpc]
            [53156.585442]  [<ffffffffc0aa7730>] ptlrpc_send_new_req+0x460/0xa70 [ptlrpc]
            [53156.586206]  [<ffffffffc0aac211>] ptlrpc_set_wait+0x291/0x790 [ptlrpc]
            [53156.588327]  [<ffffffffc0aac78d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            [53156.589081]  [<ffffffffc0a916f2>] ldlm_cli_enqueue+0x3d2/0x920 [ptlrpc]
            [53156.592174]  [<ffffffffc0bed71a>] mdc_enqueue_base+0x30a/0x1c10 [mdc]
            [53156.592825]  [<ffffffffc0bef77f>] mdc_intent_lock+0x12f/0x560 [mdc]
            [53156.595867]  [<ffffffffc0a49bc9>] lmv_intent_lock+0x589/0x1980 [lmv]
            [53156.599678]  [<ffffffffc0d46de9>] ll_inode_revalidate+0x169/0x800 [lustre]
            [53156.600388]  [<ffffffffc0d474df>] ll_getattr+0x5f/0x700 [lustre]
            [53156.602227]  [<ffffffffb6420e09>] vfs_getattr+0x49/0x80
            [53156.602835]  [<ffffffffb6420f35>] vfs_fstatat+0x75/0xc0
            [53156.603367]  [<ffffffffb64214f1>] SYSC_newlstat+0x31/0x60
            [53156.611484]  [<ffffffffb642177e>] SyS_newlstat+0xe/0x10
            [53156.612068]  [<ffffffffb6920795>] system_call_fastpath+0x1c/0x21
            [53156.613337] Kernel panic - not syncing: LBUG
            
            jamesanunez James Nunez (Inactive) made changes -
            Affects Version/s: added Lustre 2.12.0
            green Oleg Drokin added a comment -

            I just hit this assertion on my testbed, but the stack trace is a bit different.

            [39956.774374] Lustre: DEBUG MARKER: == recovery-small test 29a: error adding new clients doesn't cause LBUG (bug 22273) ================== 22:11:33 (1503540693)
            [39957.040948] Lustre: Failing over lustre-MDT0000
            [39957.630050] Lustre: server umount lustre-MDT0000 complete
            [39961.483178] LustreError: 166-1: MGC192.168.123.166@tcp: Connection to MGS (at 192.168.123.166@tcp) was lost; in progress operations using this service will fail
            [39961.495128] Lustre: Evicted from MGS (at 192.168.123.166@tcp) after server handle changed from 0xaf506984f6812737 to 0xaf506984f6812967
            [39961.508960] LustreError: 29451:0:(niobuf.c:776:ptl_send_rpc()) ASSERTION( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & 0x1) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) failed: 
            [39961.511060] LustreError: 29451:0:(niobuf.c:776:ptl_send_rpc()) LBUG
            [39961.511718] Pid: 29451, comm: ll_cfg_requeue
            [39961.513004] 
            Call Trace:
            [39961.514239]  [<ffffffffa02187ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [39961.514937]  [<ffffffffa021885c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [39961.515774]  [<ffffffffa05f8a82>] ptl_send_rpc+0xb82/0xec0 [ptlrpc]
            [39961.516456]  [<ffffffff81383cf9>] ? snprintf+0x49/0x70
            [39961.517160]  [<ffffffffa0223cb7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            [39961.517716]  [<ffffffffa05ed030>] ptlrpc_send_new_req+0x480/0xa90 [ptlrpc]
            [39961.518300]  [<ffffffffa05f1d41>] ptlrpc_set_wait+0x3d1/0x900 [ptlrpc]
            [39961.519663]  [<ffffffffa03a3255>] ? lustre_get_jobid+0x215/0x4d0 [obdclass]
            [39961.520556]  [<ffffffffa05fe0b5>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc]
            [39961.521426]  [<ffffffffa05f22f7>] ptlrpc_queue_wait+0x87/0x230 [ptlrpc]
            [39961.522688]  [<ffffffffa05ce426>] ldlm_cli_enqueue+0x686/0x810 [ptlrpc]
            [39961.523845]  [<ffffffffa05c9200>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
            [39961.524621]  [<ffffffffa0595a10>] ? mgc_blocking_ast+0x0/0x7e0 [mgc]
            [39961.525315]  [<ffffffffa059160a>] mgc_enqueue.isra.7.constprop.17+0x19a/0x320 [mgc]
            [39961.526531]  [<ffffffffa0595a10>] ? mgc_blocking_ast+0x0/0x7e0 [mgc]
            [39961.530457]  [<ffffffffa05c9200>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
            [39961.531168]  [<ffffffffa05973a3>] mgc_process_log+0x183/0x890 [mgc]
            [39961.531866]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
            [39961.532519]  [<ffffffffa05999e0>] mgc_requeue_thread+0x2c0/0x870 [mgc]
            [39961.533296]  [<ffffffff810b7cc0>] ? default_wake_function+0x0/0x20
            [39961.534927]  [<ffffffffa0599720>] ? mgc_requeue_thread+0x0/0x870 [mgc]
            [39961.535738]  [<ffffffff810a2eba>] kthread+0xea/0xf0
            [39961.536395]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
            [39961.537054]  [<ffffffff8170fb98>] ret_from_fork+0x58/0x90
            [39961.537702]  [<ffffffff810a2dd0>] ? kthread+0x0/0xf0
            [39961.538352] 
            [39961.539261] Kernel panic - not syncing: LBUG
            

            This is on current lustre-master; a crash dump is available.

            cliffw Cliff White (Inactive) created issue -

            People

              Assignee: wc-triage WC Triage
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: