Lustre / LU-19714

LNet: lpni_txcredits overwritten without handling lpni_txq

Details

    • Bug
    • Resolution: Unresolved
    • Medium
    • Lustre 2.18.0
    • Lustre 2.17.0

    Description

      The following sequence of events results in an LBUG:

      • A net, for example tcp999, is about to be added but is not yet configured locally.
      • Before tcp999 is configured, a send is attempted to a peer on @tcp999.
      • lnet_peerni_by_nid_locked() allocates a peer_ni with lpni->lpni_net == NULL and puts it on ln_remote_peer_ni_list. In this case lpni_txcredits starts at 0.
      • The send attempt consumes a peer-tx credit: lpni_txcredits is decremented and goes negative, so the message is put on lpni_txq.
      • Later the net/LNI is added, which triggers lnet_peer_net_added(). lpni_txcredits is set to the tunable value (e.g. 8), but lpni_txq is not drained.
      • Now lpni_txcredits is 8 while lpni_txq is non-empty, which violates the invariant (lpni_txcredits < 0) == !list_empty(&lpni_txq).
      • The next attempt to take a peer-tx credit fails the assertion in lnet_post_send_locked() and triggers the LBUG:
      [16931.136942] LNetError: 898962:0:(lib-move.c:884:lnet_post_send_locked()) ASSERTION( (lp->lpni_txcredits < 0) == !list_empty(&lp->lpni_txq) ) failed: 
      [16931.139496] LNetError: 898962:0:(lib-move.c:884:lnet_post_send_locked()) LBUG
      [16931.140806] CPU: 1 PID: 898962 Comm: mdt00_001 Kdump: loaded Tainted: P           OE     -------- -  - 4.18.0-553.89.1.el8_lustre.x86_64 #1
      [16931.143021] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [16931.144061] Call Trace:
      [16931.144602]  dump_stack+0x41/0x60
      [16931.145299]  lbug_with_loc.cold.6+0x5/0x43 [libcfs]
      [16931.146262]  lnet_post_send_locked+0x6b9/0x780 [lnet]
      [16931.147389]  lnet_handle_send+0x2a2/0x650 [lnet]
      [16931.148287]  lnet_select_pathway+0x629/0x1a90 [lnet]
      [16931.149244]  ? cfs_trace_unlock_tcd+0x20/0x70 [libcfs]
      [16931.150201]  ? libcfs_debug_msg+0x907/0xc00 [libcfs]
      [16931.151126]  lnet_send+0x6d/0x1e0 [lnet]
      [16931.151898]  LNetPut+0x2f8/0x950 [lnet]
      [16931.152662]  ptl_send_buf+0x132/0x540 [ptlrpc]
      [16931.154072]  ? ptlrpc_ni_fini+0x60/0x60 [ptlrpc]
      [16931.155019]  ptlrpc_send_reply+0x2f5/0x8d0 [ptlrpc]
      [16931.155997]  target_send_reply+0x328/0x7a0 [ptlrpc]
      [16931.157017]  tgt_request_handle+0x454/0x1d20 [ptlrpc]
      [16931.158076]  ptlrpc_server_handle_request+0x2ca/0xd70 [ptlrpc]
      [16931.159259]  ? lprocfs_counter_add+0x117/0x180 [obdclass]
      [16931.160557]  ptlrpc_main+0xb98/0x1460 [ptlrpc]
      [16931.161483]  ? __schedule+0x2d9/0x870
      [16931.162196]  ? ptlrpc_wait_event+0x5b0/0x5b0 [ptlrpc]
      [16931.163218]  kthread+0x134/0x150
      [16931.163879]  ? set_kthread_struct+0x50/0x50
      [16931.164670]  ret_from_fork+0x35/0x40
      [16931.165397] Kernel panic - not syncing: LBUG 
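
The sequence described above can be replayed in a small userspace model. This is only a sketch: the struct and the model_* helpers are hypothetical and are not the LNet definitions. The queue is reduced to a counter, and the overwrite step assumes only what the description states, namely that the credits are set to the tunable value and the queue is left alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model. Field names follow the report; the
 * struct and helpers are illustrative, not the LNet definitions. */
struct model_peer_ni {
	int txcredits;	/* models lpni_txcredits */
	int txq_len;	/* number of messages on the modeled lpni_txq */
};

/* The condition from the failed assertion in the log:
 * (lpni_txcredits < 0) == !list_empty(&lpni_txq) */
static bool model_invariant_holds(const struct model_peer_ni *lp)
{
	return (lp->txcredits < 0) == (lp->txq_len != 0);
}

/* Take one peer-tx credit and queue the message if none was available.
 * Returns true if the message was queued. */
static bool model_take_credit(struct model_peer_ni *lp)
{
	lp->txcredits--;
	if (lp->txcredits < 0) {
		lp->txq_len++;
		return true;
	}
	return false;
}

/* Models the overwrite described in the report: credits are set to the
 * tunable value while the queue is left untouched. */
static void model_net_added(struct model_peer_ni *lp, int tunable_credits)
{
	lp->txcredits = tunable_credits;
}

/* Replays the scenario and reports whether the invariant still holds. */
static bool model_replay_scenario(void)
{
	struct model_peer_ni lp = { .txcredits = 0, .txq_len = 0 };

	assert(model_invariant_holds(&lp));	/* 0 credits, empty queue */
	model_take_credit(&lp);			/* -1 credits, one queued */
	assert(model_invariant_holds(&lp));
	model_net_added(&lp, 8);		/* 8 credits, still one queued */
	return model_invariant_holds(&lp);
}
```

In this model, model_replay_scenario() returns false: the invariant holds until the credits are overwritten, and any later credit check would then see positive credits alongside a non-empty queue.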

       

            People

              ssmirnov Serguei Smirnov