Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6416

Client evicted on lock cancel

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.8.0
    • Labels:
    • Severity:
      3
    • Rank (Obsolete):
      9223372036854775807

      Description

      Running iorssf (job 33751) on 3 client nodes, one of the clients got evicted. The IB fabric seems to be sane (no downlink, symbolerrors,...) for ports in question. No mds restart or failover operation was ongoing at the time the error occured.
      Client error

      Jan 22 20:12:35 lola-24 kernel: LustreError: 11-0: soaked-OST000f-osc-ffff880819252c00: Communicating with 192.1
      68.1.105@o2ib10, operation obd_ping failed with -107.
      Jan 22 20:12:35 lola-24 kernel: Lustre: soaked-OST000f-osc-ffff880819252c00: Connection to soaked-OST000f (at 19
      2.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      Jan 22 20:12:35 lola-24 kernel: Lustre: Skipped 1 previous similar message
      Jan 22 20:12:35 lola-24 kernel: LustreError: 167-0: soaked-OST000f-osc-ffff880819252c00: This client was evicted
       by soaked-OST000f; in progress operations using this service will fail.
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) lock@ffff880650
      352228[3 3 0 1 1 00000000] W(2):[8702328, 18446744073709551615]@[0x1000f0000:0x1514518:0x0] {
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast())     lovsub@ffff
      8804b7635460: [0 ffff88067225d470 R(1):[8702328, 18446744073709551615]@[0x60002c323:0x96d3:0x0]] 
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast())     osc@ffff880
      4fdc1dce8: ffff8803bf23e680 0x20040000000 0x5ec022e3dfcd440e 3 ffff88073a4476f8 size: 53278838784 mtime: 1421986
      119 atime: 1421985489 ctime: 1421986114 blocks: 102256017
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) } lock@ffff8806
      50352228
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) dlmlock returne
      d -5
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:809:ldlm_resource_complain()) soaked-OST0
      00f-osc-ffff880819252c00: namespace resource [0x1514518:0x0:0x0].0 (ffff8802cef43d00) refcount nonzero (1) after
       lock cleanup; forcing cleanup.
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: 
      [0x1514518:0x0:0x0].0 (ffff8802cef43d00) refcount = 2
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks 
      (in reverse order):
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: so
      aked-OST000f-osc-ffff880819252c00 lock: ffff8803bf23e680/0x5ec022e3dfcd440e lrc: 3/0,0 mode: PW/PW res: [0x15145
      18:0x0:0x0].0 rrc: 2 type: EXT [35644735488->18446744073709551615] (req 35644735488->18446744073709551615) flags
      : 0x126400000000 nid: local remote: 0xe42bc2b426f6efd7 expref: -99 pid: 163262 timeout: 0 lvb_type: 1
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1422:cl_unuse_try()) result = -5, this is unlik
      ely!
      Jan 22 20:12:35 lola-24 kernel: Lustre: soaked-OST000f-osc-ffff880819252c00: Connection restored to soaked-OST00
      0f (at 192.168.1.105@o2ib10)
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) lock@ffff880650352660[2
       0 0 1 0 00000000] R(1):[8702328, 18446744073709551615]@[0x60002c323:0x96d3:0x0] {
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked())     vvp@ffff8804e2282c4
      0: 
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked())     lov@ffff88067225d47
      0: 1
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked())     0 0: ---
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) 
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) } lock@ffff880650352660
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) unuse return -5
      Jan 22 20:12:35 lola-24 slurmd[lola-24][163257]: error: *** STEP 33751.0 KILLED AT 2015-01-22T20:12:35 WITH SIGNAL 9 ***
      Jan 22 20:12:35 lola-24 slurmd[lola-24][163257]: error: *** STEP 33751.0 KILLED AT 2015-01-22T20:12:35 WITH SIGNAL 9 ***
      Jan 22 20:12:35 lola-24 slurmd[lola-24][163257]: done with job
      

      OSS error:

      Jan 22 20:12:31 lola-5 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.1.124@o2ib100  ns: filter-soaked-OST000f_UUID lock: ffff88073a776480/0xe42bc2b426f6ee2c lrc: 3/0,0 mode: PR/PR res: [0x1514518:0x0:0x0].0 rrc: 7 type: EXT [35519225856->35644735487] (req 35519225856->35550605311) flags: 0x60000000000020 nid: 192.168.1.124@o2ib100 remote: 0x5ec022e3dfcd4407 expref: 32 pid: 103043 timeout: 5797977767 lvb_type: 0
      Jan 22 20:14:10 lola-5 kernel: LustreError: 93964:0:(ost_handler.c:1775:ost_blocking_ast()) Error -2 syncing data on lock cancel
      Jan 22 20:14:12 lola-5 kernel: LNet: Service thread pid 102689 completed after 319.51s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Jan 22 20:14:12 lola-5 kernel: LNet: Skipped 3 previous similar messages
      Jan 22 20:14:55 lola-5 kernel: LNet: Service thread pid 102685 completed after 362.60s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Jan 22 20:14:55 lola-5 kernel: LNet: Skipped 1 previous similar message
      Jan 22 20:14:59 lola-5 kernel: LustreError: 93958:0:(ldlm_lib.c:2679:target_bulk_io()) @@@ bulk PUT failed: rc -107  req@ffff8807306bc800 x1490477310356324/t0(0) o3->76d86919-2338-4877-542c-9e7f34e71591@192.168.1.124@o2ib100:0/0 lens 488/432 e 8 to 0 dl 1421986372 ref 1 fl Interpret:/0/0 rc 0/0
      Jan 22 20:14:59 lola-5 kernel: Lustre: soaked-OST000f: Bulk IO read error with 76d86919-2338-4877-542c-9e7f34e71591 (at 192.168.1.124@o2ib100), client will retry: rc -107
      Jan 22 20:14:59 lola-5 kernel: Lustre: Skipped 1 previous similar message
      Jan 22 20:14:59 lola-5 kernel: Lustre: 102688:0:(service.c:2045:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (243:127s); client may timeout.  req@ffff880271b8ac00 x1490477310356328/t0(0) o3->76d86919-2338-4877-542c-9e7f34e71591@192.168.1.124@o2ib100:0/0 lens 488/432 e 8 to 0 dl 1421986372 ref 1 fl Complete:/0/ffffffff rc 0/-1
      Jan 22 20:14:59 lola-5 kernel: Lustre: 102688:0:(service.c:2045:ptlrpc_server_handle_request()) Skipped 1 previous similar message
      Jan 22 20:14:59 lola-5 kernel: LustreError: 93958:0:(ldlm_lib.c:2679:target_bulk_io()) Skipped 2 previous similar messages
      Jan 22 20:18:22 lola-5 kernel: LustreError: 94003:0:(ost_handler.c:1775:ost_blocking_ast()) Error -2 syncing data on lock cancel
      

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              liang Liang Zhen (Inactive)
              Reporter:
              green Oleg Drokin
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: