Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.8.0

    Description

      Running iorssf (job 33751) on 3 client nodes, one of the clients was evicted. The IB fabric appears to be sane (no down links, symbol errors, ...) for the ports in question. No MDS restart or failover operation was in progress at the time the error occurred.
      Client error:

      Jan 22 20:12:35 lola-24 kernel: LustreError: 11-0: soaked-OST000f-osc-ffff880819252c00: Communicating with 192.168.1.105@o2ib10, operation obd_ping failed with -107.
      Jan 22 20:12:35 lola-24 kernel: Lustre: soaked-OST000f-osc-ffff880819252c00: Connection to soaked-OST000f (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      Jan 22 20:12:35 lola-24 kernel: Lustre: Skipped 1 previous similar message
      Jan 22 20:12:35 lola-24 kernel: LustreError: 167-0: soaked-OST000f-osc-ffff880819252c00: This client was evicted by soaked-OST000f; in progress operations using this service will fail.
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) lock@ffff880650352228[3 3 0 1 1 00000000] W(2):[8702328, 18446744073709551615]@[0x1000f0000:0x1514518:0x0] {
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast())     lovsub@ffff8804b7635460: [0 ffff88067225d470 R(1):[8702328, 18446744073709551615]@[0x60002c323:0x96d3:0x0]] 
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast())     osc@ffff8804fdc1dce8: ffff8803bf23e680 0x20040000000 0x5ec022e3dfcd440e 3 ffff88073a4476f8 size: 53278838784 mtime: 1421986119 atime: 1421985489 ctime: 1421986114 blocks: 102256017
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) } lock@ffff880650352228
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(osc_lock.c:832:osc_ldlm_completion_ast()) dlmlock returned -5
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:809:ldlm_resource_complain()) soaked-OST000f-osc-ffff880819252c00: namespace resource [0x1514518:0x0:0x0].0 (ffff8802cef43d00) refcount nonzero (1) after lock cleanup; forcing cleanup.
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x1514518:0x0:0x0].0 (ffff8802cef43d00) refcount = 2
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163903:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: soaked-OST000f-osc-ffff880819252c00 lock: ffff8803bf23e680/0x5ec022e3dfcd440e lrc: 3/0,0 mode: PW/PW res: [0x1514518:0x0:0x0].0 rrc: 2 type: EXT [35644735488->18446744073709551615] (req 35644735488->18446744073709551615) flags: 0x126400000000 nid: local remote: 0xe42bc2b426f6efd7 expref: -99 pid: 163262 timeout: 0 lvb_type: 1
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1422:cl_unuse_try()) result = -5, this is unlikely!
      Jan 22 20:12:35 lola-24 kernel: Lustre: soaked-OST000f-osc-ffff880819252c00: Connection restored to soaked-OST000f (at 192.168.1.105@o2ib10)
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) lock@ffff880650352660[2 0 0 1 0 00000000] R(1):[8702328, 18446744073709551615]@[0x60002c323:0x96d3:0x0] {
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked())     vvp@ffff8804e2282c40: 
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked())     lov@ffff88067225d470: 1
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked())     0 0: ---
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) 
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) } lock@ffff880650352660
      Jan 22 20:12:35 lola-24 kernel: LustreError: 163262:0:(cl_lock.c:1437:cl_unuse_locked()) unuse return -5
      Jan 22 20:12:35 lola-24 slurmd[lola-24][163257]: error: *** STEP 33751.0 KILLED AT 2015-01-22T20:12:35 WITH SIGNAL 9 ***
      Jan 22 20:12:35 lola-24 slurmd[lola-24][163257]: error: *** STEP 33751.0 KILLED AT 2015-01-22T20:12:35 WITH SIGNAL 9 ***
      Jan 22 20:12:35 lola-24 slurmd[lola-24][163257]: done with job
      

      OSS error:

      Jan 22 20:12:31 lola-5 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.1.124@o2ib100  ns: filter-soaked-OST000f_UUID lock: ffff88073a776480/0xe42bc2b426f6ee2c lrc: 3/0,0 mode: PR/PR res: [0x1514518:0x0:0x0].0 rrc: 7 type: EXT [35519225856->35644735487] (req 35519225856->35550605311) flags: 0x60000000000020 nid: 192.168.1.124@o2ib100 remote: 0x5ec022e3dfcd4407 expref: 32 pid: 103043 timeout: 5797977767 lvb_type: 0
      Jan 22 20:14:10 lola-5 kernel: LustreError: 93964:0:(ost_handler.c:1775:ost_blocking_ast()) Error -2 syncing data on lock cancel
      Jan 22 20:14:12 lola-5 kernel: LNet: Service thread pid 102689 completed after 319.51s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Jan 22 20:14:12 lola-5 kernel: LNet: Skipped 3 previous similar messages
      Jan 22 20:14:55 lola-5 kernel: LNet: Service thread pid 102685 completed after 362.60s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Jan 22 20:14:55 lola-5 kernel: LNet: Skipped 1 previous similar message
      Jan 22 20:14:59 lola-5 kernel: LustreError: 93958:0:(ldlm_lib.c:2679:target_bulk_io()) @@@ bulk PUT failed: rc -107  req@ffff8807306bc800 x1490477310356324/t0(0) o3->76d86919-2338-4877-542c-9e7f34e71591@192.168.1.124@o2ib100:0/0 lens 488/432 e 8 to 0 dl 1421986372 ref 1 fl Interpret:/0/0 rc 0/0
      Jan 22 20:14:59 lola-5 kernel: Lustre: soaked-OST000f: Bulk IO read error with 76d86919-2338-4877-542c-9e7f34e71591 (at 192.168.1.124@o2ib100), client will retry: rc -107
      Jan 22 20:14:59 lola-5 kernel: Lustre: Skipped 1 previous similar message
      Jan 22 20:14:59 lola-5 kernel: Lustre: 102688:0:(service.c:2045:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (243:127s); client may timeout.  req@ffff880271b8ac00 x1490477310356328/t0(0) o3->76d86919-2338-4877-542c-9e7f34e71591@192.168.1.124@o2ib100:0/0 lens 488/432 e 8 to 0 dl 1421986372 ref 1 fl Complete:/0/ffffffff rc 0/-1
      Jan 22 20:14:59 lola-5 kernel: Lustre: 102688:0:(service.c:2045:ptlrpc_server_handle_request()) Skipped 1 previous similar message
      Jan 22 20:14:59 lola-5 kernel: LustreError: 93958:0:(ldlm_lib.c:2679:target_bulk_io()) Skipped 2 previous similar messages
      Jan 22 20:18:22 lola-5 kernel: LustreError: 94003:0:(ost_handler.c:1775:ost_blocking_ast()) Error -2 syncing data on lock cancel
      

          Activity

            [LU-6416] Client evicted on lock cancel
            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14085/
            Subject: LU-6416 ldlm: no canceled lock on waiting list
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 657bbc4969be581aca66549ca0427ccec89ea5a2

            pjones Peter Jones added a comment - http://review.whamcloud.com/#/c/14085/
            green Oleg Drokin added a comment -

            Liang's analysis:
            I suspect it's a race case like this:
            If a lock was not granted straight away on the server, but it is granted with LDLM_FL_AST_SENT set before ldlm_handle_enqueue0() sends out the reply, the client side will know it needs to cancel this lock.

            In the meantime, this lock can be added to a long granting list by another server thread.

            When the lock cancel request arrives at the server and the server calls into

                  ldlm_lock_cancel()->
                      ldlm_cancel_callback()->
                          tgt_blocking_ast(...LDLM_CB_CANCELING)->
                              tgt_sync()
            

            The other server thread eventually gets a chance to send the completion AST for this lock with LDLM_FL_AST_SENT set, and adds this lock to the waiting list again.

            However, tgt_sync() may take an arbitrarily long time that has nothing to do with the adaptive timeout (AT) for the lock revocation on the client, so the server could evict the client only because the server itself has slow I/O.
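
            A minimal userspace sketch of the guard idea in the patch ("no canceled lock on waiting list") is below. It is an illustration only, built on the analysis above: the toy_* names are hypothetical stand-ins, not struct ldlm_lock or the real ldlm_add_waiting_lock() code from the merged change.

                #include <stdbool.h>
                #include <stdio.h>

                /* Sketch only: toy_lock / toy_add_waiting_lock() model the intent,
                 * not the actual Lustre data structures. */
                struct toy_lock {
                        bool ast_sent;          /* granted with LDLM_FL_AST_SENT   */
                        bool canceling;         /* cancel already being handled    */
                        bool on_waiting_list;   /* eviction timer armed            */
                };

                /* Refuse to (re)arm the callback timer once the cancel is in
                 * progress, so a slow sync on the server cannot end up as a
                 * client eviction. */
                static bool toy_add_waiting_lock(struct toy_lock *lk)
                {
                        if (lk->canceling)
                                return false;
                        lk->on_waiting_list = true;
                        return true;
                }

                int main(void)
                {
                        struct toy_lock lk = { .ast_sent = true };

                        /* thread 1: the client's cancel arrives, server starts syncing */
                        lk.canceling = true;

                        /* thread 2: the completion-AST path tries to re-add the lock */
                        if (!toy_add_waiting_lock(&lk))
                                printf("lock is being cancelled; timer not re-armed\n");

                        return 0;
                }

            In the real server code any such check would have to be made under the appropriate spinlock, since the cancel path and the completion-AST path run in different service threads.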


            People

              Assignee: liang Liang Zhen (Inactive)
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: