Lustre / LU-859

OST evictions of multiple Lustre-Client nodes running the same job


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.0.0
    • Component/s: None
    • Environment: linux-2.6.32-71.24.1
    • Severity: 3
    • Rank: 6522

    Description

      Hi,

      Our CEA customer is experiencing a bunch of client evictions from the same OST within a short
      period of time, leading to job failures. The problem has now been identified as occurring quite
      frequently and causing job failures, so the customer currently ranks it as one of their main
      Lustre issues.

      The main error messages reported in the server log are the following (a small scanning sketch
      in Python follows the list):
      LustreError: 0:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback timer expired after <x>s: evicting client at ...
      LustreError: 42960:0:(ldlm_request.c:85:ldlm_expired_completion_wait()) ### lock timed out ...
      LustreError: 42823:0:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-107) ...
      Lustre: Service thread pid 42960 was inactive for <xxx.yy>s. The thread might be hung, or it ...
      Dumping the stack trace for debugging purposes: Pid: 42960, comm: ll_ost_io_88 ... followed by the stack trace
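
      For reference, here is a minimal scanning sketch (Python; the regexes are written against the
      message format shown in the log extract below, and the syslog path is an assumption) that
      groups the eviction messages by namespace/resource and lists the evicted client NIDs, which
      makes it easy to confirm that all evictions in one burst target the same resource:

      import re
      from collections import defaultdict

      # Regexes written against the "waiting_locks_callback" message format shown in the
      # server log extract below (one message per syslog line); the '*' characters in the
      # extract appear to be highlighting added by support, not part of the raw messages.
      EVICT_RE = re.compile(r"lock callback timer expired after (\d+)s: "
                            r"evicting client at (\S+)\s+ns:\s*(\S+)")
      RES_RE = re.compile(r"res:\s*\*?(\d+)\*?/")

      def group_evictions(path="/var/log/messages"):    # log path is an assumption
          evictions = defaultdict(list)                  # (ns, res) -> [(timer, client), ...]
          with open(path) as f:
              for line in f:
                  m = EVICT_RE.search(line)
                  if not m:
                      continue
                  timer, client, ns = int(m.group(1)), m.group(2), m.group(3)
                  res = RES_RE.search(line)
                  evictions[(ns, res.group(1) if res else "?")].append((timer, client))
          return evictions

      if __name__ == "__main__":
          for (ns, res), hits in sorted(group_evictions().items()):
              clients = sorted({c for _, c in hits})
              print("%s res %s: %d eviction(s) from %s" % (ns, res, len(hits), clients))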

      I don't know if this could help, but I found a description of a similar problem in bugzilla 23352.
      A patch (attachment 31479) was landed in 1.8.6 and in 2.0.59 (LU-13), but our customer is currently
      running 2.0.0.1, which does not contain this patch.
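
      For what it's worth, a quick way to double-check whether a given Lustre source tree already
      carries that fix is to grep its git history for the bug references. A minimal sketch follows
      (Python; the repository path and the commit-message keywords "23352" / "LU-13" are assumptions
      about how the landed patch is referenced):

      import subprocess

      def tree_mentions_fix(repo, keywords=("23352", "LU-13")):
          """Return True if any commit message in 'repo' mentions one of the keywords."""
          for kw in keywords:
              out = subprocess.run(["git", "-C", repo, "log", "--oneline", "--grep", kw],
                                   capture_output=True, text=True, check=True).stdout
              if out.strip():
                  return True
          return False

      # Example usage (the path is hypothetical):
      # print(tree_mentions_fix("/path/to/lustre-release"))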

      Below is an extract of the server log provided by on-site support:

      #context:

      We quite frequently encounter the following situation:

                 - a bunch of client evictions from the same OST within a short period of time; all
                   evictions are against the same resource, locked in "PW" mode, for which all
                   impacted clients are competing;
                 - all clients are running the same job, and they reconnect very quickly after being
                   evicted;
                 - during the same period of time, the OSS also complains about one particular thread
                   being hung/inactive, always with the same messages/stack trace shown below.

      #consequences:

      Job losses/aborts.

      #details:

      Server log extract:
      ======================================================================
      ...
      LustreError: 0:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback
      timer expired after 158s: evicting client at *JO.BOO.IT.PB*@o2ib10  ns:
      filter-ptmp-OST002f_UUID lock: ffff880e1d7d0b40/0xc2c2144dc3413119 lrc: 3/0,0
      mode: PW/PW res: *3261626*/0 rrc: 77 type: EXT [0->18446744073709551615] (req
      0->18446744073709551615) flags: 0x20 remote: 0xc792259022f8f7d5 expref: 32 pid:
      42832 timeout 4362673415
      Lustre: DEBUG MARKER: Thu Oct 27 10:35:01 2011
      ....
      LustreError: 19:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback
      timer expired after 259s: evicting client at JO.BOO.IT.PW@o2ib10  ns:
      filter-ptmp-OST002f_UUID lock: ffff880a229896c0/0xc2c2144dc3413127 lrc: 3/0,0
      mode: PW/PW res: *3261626*/0 rrc: 75 type: EXT [0->18446744073709551615] (req
      0->18446744073709551615) flags: 0x20 remote: 0xbc822cf3d7daaf53 expref: 11 pid:
      42796 timeout 4362683505
      LustreError: *42960*:0:(ldlm_request.c:85:ldlm_expired_completion_wait()) ###
      lock timed out (enqueued at 1319704407, 200s ago); not entering recovery in
      server code, just going back to sleep ns: filter-ptmp-OST002f_UUID lock:
      ffff8810233b6000/0xc2c2144dc341793b lrc: 3/0,1 mode: --/PW res: *3261626*/0 rrc:
      74 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags:
      0x80004000 remote: 0x0 expref: -99 pid: 42960 timeout 0
      ...
      LustreError: 0:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback
      timer expired after 360s: evicting client at *JO.BOO.IT.IF*@o2ib10  ns:
      filter-ptmp-OST002f_UUID lock: ffff880fb4c42d80/0xc2c2144dc341312e lrc: 3/0,0
      mode: PW/PW res: *3261626*/0 rrc: 74 type: EXT [0->18446744073709551615] (req
      0->18446744073709551615) flags: 0x20 remote: 0xe712bf3eabc648a1 expref: 31 pid:
      42819 timeout 4362693603
      LustreError: 42823:0:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing
      error (-107)  req@ffff8807fb385000 x1381763496185989/t0(0) o400-><?>@<?>:0/0
      lens 192/0 e 0 to 0 dl 1319704723 ref 1 fl Interpret:H/ffffffff/ffffffff rc
      -107/-1
      
      LustreError: 43:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback
      timer expired after 461s: evicting client at *JO.BOO.IT.IT*@o2ib10  ns:
      filter-ptmp-OST002f_UUID lock: ffff8808554dad80/0xc2c2144dc341313c lrc: 3/0,0
      mode: PW/PW res: *3261626*/0 rrc: 72 type: EXT [0->18446744073709551615] (req
      0->18446744073709551615) flags: 0x20 remote: 0x6d2273e9da542e3a expref: 14 pid:
      42785 timeout 4362703703
      Lustre: Service thread pid *42960* was inactive for 400.00s. The thread might be
      hung, or it might only be slow and will resume later. Dumping the stack trace
      for debugging purposes:
      Pid: 42960, comm: ll_ost_io_88
      
      Call Trace:
      [<ffffffff81467eba>] ? schedule_timeout+0x19a/0x2e0
      [<ffffffffa05a0144>] ? ldlm_expired_completion_wait+0x2c4/0x2e0 [ptlrpc]
      [<ffffffffa059fe80>] ? ldlm_expired_completion_wait+0x0/0x2e0 [ptlrpc]
      [<ffffffffa041d75e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa05a3d75>] ldlm_completion_ast+0x575/0x740 [ptlrpc]
      [<ffffffffa0585938>] ? ldlm_lock_enqueue+0x178/0xa50 [ptlrpc]
      [<ffffffff8104da10>] ? default_wake_function+0x0/0x20
      [<ffffffffa05a348f>] ldlm_cli_enqueue_local+0x1cf/0x540 [ptlrpc]
      [<ffffffff81176fc0>] ? mntput_no_expire+0x30/0x110
      [<ffffffffa05a3800>] ? ldlm_completion_ast+0x0/0x740 [ptlrpc]
      [<ffffffffa05a22c0>] ? ldlm_blocking_ast+0x0/0x200 [ptlrpc]
      [<ffffffffa068c540>] filter_destroy+0x2d0/0x1590 [obdfilter]
      [<ffffffffa05a22c0>] ? ldlm_blocking_ast+0x0/0x200 [ptlrpc]
      [<ffffffffa05a3800>] ? ldlm_completion_ast+0x0/0x740 [ptlrpc]
      [<ffffffffa05ca994>] ? lustre_msg_add_version+0x94/0x110 [ptlrpc]
      [<ffffffffa05cac4b>] ? lustre_pack_reply_v2+0x23b/0x310 [ptlrpc]
      [<ffffffffa05f529b>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
      [<ffffffffa05cadf1>] ? lustre_pack_reply_flags+0xd1/0x1f0 [ptlrpc]
      [<ffffffffa0552df3>] ost_destroy+0x2b3/0x890 [ost]
      [<ffffffffa05c96e4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa055e4ba>] ost_handle+0x305a/0x4f90 [ost]
      [<ffffffff8125c510>] ? __bitmap_weight+0x50/0xb0
      [<ffffffff8124ea83>] ? cpumask_next_and+0x23/0x40
      [<ffffffff81052065>] ? find_busiest_group+0x245/0xa90
      [<ffffffffa05c96e4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa05d85f1>] ptlrpc_server_handle_request+0x421/0xef0 [ptlrpc]
      [<ffffffff810408fe>] ? activate_task+0x2e/0x40
      [<ffffffff8104d906>] ? try_to_wake_up+0x276/0x380
      [<ffffffff8104da22>] ? default_wake_function+0x12/0x20
      [<ffffffff810411a9>] ? __wake_up_common+0x59/0x90
      [<ffffffffa041d63e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [<ffffffffa05d9992>] ptlrpc_main+0x8d2/0x1550 [ptlrpc]
      [<ffffffff8104da10>] ? default_wake_function+0x0/0x20
      [<ffffffff8100d1aa>] child_rip+0xa/0x20
      [<ffffffffa05d90c0>] ? ptlrpc_main+0x0/0x1550 [ptlrpc]
      [<ffffffff8100d1a0>] ? child_rip+0x0/0x20
      
      LustreError: dumping log to /tmp/lustre-log.1319704808.42960
      
      ...
      
      Lustre: *42960*:0:(service.c:1703:ptlrpc_server_handle_request()) @@@ Request
      x1381763495912720 took longer than estimated (1164:2161s); client may timeout. 
      req@ffff8807b76a8c00 x1381763495912720/t137044087(0)
      o6->25c5744e-aed5-6dac-8eff-070feefd72ea@:0/0 lens 512/400 e 5 to 0 dl
      1319705571 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
      Lustre: Service thread pid 42960 completed after 3324.32s. This indicates the
      system was overloaded (too many service threads, or there were not enough
      hardware resources).
      Lustre: DEBUG MARKER: Thu Oct 27 11:30:01 2011 
      ...
      
      ======================================================================
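
      As a complement to the extract above, the per-thread stall time can be measured by pairing the
      "Service thread pid N was inactive for X s" message with the matching "Service thread pid N
      completed after Y s" message. A minimal sketch (Python; the syslog path is again an assumption,
      and the regexes match the raw messages, without the '*' highlighting used in the extract):

      import re

      INACTIVE_RE = re.compile(r"Service thread pid (\d+) was inactive for ([\d.]+)s")
      COMPLETE_RE = re.compile(r"Service thread pid (\d+) completed after ([\d.]+)s")

      def thread_stalls(path="/var/log/messages"):       # log path is an assumption
          stalls = {}                                     # pid -> [first inactive, completed]
          with open(path) as f:
              for line in f:
                  m = INACTIVE_RE.search(line)
                  if m and m.group(1) not in stalls:
                      stalls[m.group(1)] = [float(m.group(2)), None]
                  m = COMPLETE_RE.search(line)
                  if m and m.group(1) in stalls:
                      stalls[m.group(1)][1] = float(m.group(2))
          return stalls

      if __name__ == "__main__":
          for pid, (first_inactive, completed) in sorted(thread_stalls().items()):
              print("pid %s: first reported inactive after %.2fs, completed after %s"
                    % (pid, first_inactive, "%.2fs" % completed if completed else "n/a"))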
      

          People

            laisiyao Lai Siyao
            patrick.valentin Patrick Valentin (Inactive)
            Votes: 0
            Watchers: 4
