LU-755: Lustre had a panic on a network error (stale ZC_REQ entry)


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.0
    • Component/s: None
    • Environment: RHEL6/Lustre 2.1.0
    • Severity: 3
    • Rank: 10310

    Description

While running iozone tests, the system hit a panic:
      @tcp has timed out for slow reply: [sent 1317906741] [real_sent 1317906741] [current 1317906748] [deadline 7s] [delay 0s] req@ffff88010df32400 x1381925181718579/t5108(5108) o-1->testfs-OST0000_UUID@192.168.123.12@tcp:6/4 lens 456/416 e 0 to 1 dl 1317906748 ref 3 fl Bulk:RX/ffffffff/ffffffff rc 0/-1
      [ 4790.534571] Lustre: testfs-OST0000-osc-ffff88010e1a9400: Connection to service testfs-OST0000 via nid 192.168.123.12@tcp was lost; in progress operations using this service will wait for recovery to complete.
      [ 4790.567408] Lustre: testfs-OST0000-osc-ffff88010e1a9400: Connection restored to service testfs-OST0000 using nid 192.168.123.12@tcp.
      [ 4844.512099] LustreError: 31786:0:(socklnd_cb.c:2518:ksocknal_check_peer_timeouts()) Total 1 stale ZC_REQs for peer 192.168.123.12@tcp detected; the oldest(ffff8800daf1d600) timed out 10 secs ago, resid: 0, wmem: 0
      [ 4844.521468] LustreError: 31786:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff8800ada53200
      [ 4844.526766] LustreError: 31789:0:(client.c:1695:ptlrpc_check_set()) @@@ bulk transfer failed req@ffff88010df32400 x1381925181718581/t5108(5108) o-1->testfs-OST0000_UUID@192.168.123.12@tcp:6/4 lens 456/416 e 0 to 0 dl 1317906748 ref 2 fl Bulk:RS/ffffffff/ffffffff rc -11/-1
      [ 4844.536035] LustreError: 31789:0:(client.c:1696:ptlrpc_check_set()) LBUG

The panic was caused by the error in client_bulk_callback(): on a bulk error it wakes up the request and unregisters the bulk transfer, but it does not mark the request as failed.
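
      For reference, the 2.1-era client_bulk_callback() in ptlrpc/events.c looks roughly like the abbreviated sketch below (paraphrased from memory, an illustration rather than a verbatim copy): when ev->status != 0, bd_success stays 0 and nothing on desc->bd_req is flagged, so the waiter is woken up as if the reply were fine while the bulk is dead.

      /* abbreviated sketch of client_bulk_callback(), not verbatim */
      void client_bulk_callback(lnet_event_t *ev)
      {
              struct ptlrpc_cb_id     *cbid = ev->md.user_ptr;
              struct ptlrpc_bulk_desc *desc = cbid->cbid_arg;

              /* this is the "event type 0, status -5, desc ..." message above */
              CDEBUG((ev->status == 0) ? D_NET : D_ERROR,
                     "event type %d, status %d, desc %p\n",
                     ev->type, ev->status, desc);

              spin_lock(&desc->bd_lock);

              desc->bd_network_rw = 0;            /* bulk unregistered */
              if (ev->status == 0) {
                      desc->bd_success = 1;       /* only the success path is recorded */
                      desc->bd_nob_transferred = ev->mlength;
              }
              /* on error (-EIO here) neither rq_err nor rq_net_err is set on
               * desc->bd_req; the request is simply woken up */
              ptlrpc_client_wake_req(desc->bd_req);

              spin_unlock(&desc->bd_lock);
      }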

      crash> struct ptlrpc_request ffff88010df32400
      struct ptlrpc_request {
      ...
      rq_intr = 0,
      rq_replied = 1,
      rq_err = 0,
      rq_timedout = 0,
      rq_resend = 1,
      rq_restart = 0,
      rq_replay = 0,
      rq_no_resend = 0,
      rq_waiting = 0,
      rq_receiving_reply = 0,
      rq_no_delay = 0,
      rq_net_err = 0,
      rq_wait_ctx = 0,
      rq_early = 0,
      rq_must_unlink = 0,
      rq_fake = 0,
      rq_memalloc = 0,
      rq_packed_final = 0,
      rq_hp = 0,
      rq_at_linked = 0,
      rq_reply_truncate = 0,
      rq_committed = 0,
      rq_invalid_rqset = 0,
      rq_phase = 3955285506,
      rq_next_phase = 3955285506,
      rq_refcount = { counter = 2 },
      ...
      crash> p *((struct ptlrpc_bulk_desc *)0xffff8800ada53200)
      $9 = {
      bd_success = 0,
      bd_network_rw = 0,
      bd_type = 0,
      bd_registered = 1,
      bd_lock = {
      raw_lock = { slock = 0 }
      },
      bd_import_generation = 0,
      bd_export = 0x0,
      bd_import = 0xffff88010ed33000,
      bd_portal = 8,
      bd_req = 0xffff88010df32400,

so the panic hit the same bulk descriptor that failed in client_bulk_callback(): bd_req points back to the request ffff88010df32400 from the log above.
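
      The assertion that fires is presumably the bd_success check in ptlrpc_check_set() (client.c:1695 in this build), roughly:

      /* paraphrased from ptlrpc/client.c:ptlrpc_check_set() */
      if (!req->rq_bulk->bd_success) {
              /* the reply said the I/O was fine, but the bulk transfer
               * failed and the request was never marked for resend or
               * error, so the only thing left to do is assert */
              DEBUG_REQ(D_ERROR, req, "bulk transfer failed");
              LBUG();
      }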

    People

      Assignee: WC Triage
      Reporter: Alexey Lyashkov
      Votes: 0
      Watchers: 2
