Lustre / LU-859

OST evictions of multiple Lustre-Client nodes running the same job


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.0.0
    • Component/s: None
    • Environment: linux-2.6.32-71.24.1
    • Severity: 3
    • Rank: 6522

    Description

      Hi,

      Our CEA customer is experiencing a bunch of client evictions from the same OST within a short
      period of time, leading to job failures. The problem has now been identified as occurring quite
      frequently and causing job failures, so the customer currently ranks it as one of their main
      Lustre issues.

      The main error messages reported in the server log are the following (a small scanning sketch
      in Python follows the list):
      LustreError: 0:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback timer expired after <x>s: evicting client at ...
      LustreError: 42960:0:(ldlm_request.c:85:ldlm_expired_completion_wait()) ### lock timed out ...
      LustreError: 42823:0:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-107) ...
      Lustre: Service thread pid 42960 was inactive for <xxx.yy>s. The thread might be hung, or it ...
      Dumping the stack trace for debugging purposes: Pid: 42960, comm: ll_ost_io_88 ... followed by the stack trace
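
      For reference, here is a minimal scanning sketch (Python; the regexes are written against the
      message format shown in the log extract below, and the syslog path is an assumption) that
      groups the eviction messages by namespace/resource and lists the evicted client NIDs, which
      makes it easy to confirm that all evictions in one burst target the same resource:

      import re
      from collections import defaultdict

      # Regexes written against the "waiting_locks_callback" message format shown in the
      # server log extract below (one message per syslog line); the '*' characters in the
      # extract appear to be highlighting added by support, not part of the raw messages.
      EVICT_RE = re.compile(r"lock callback timer expired after (\d+)s: "
                            r"evicting client at (\S+)\s+ns:\s*(\S+)")
      RES_RE = re.compile(r"res:\s*\*?(\d+)\*?/")

      def group_evictions(path="/var/log/messages"):    # log path is an assumption
          evictions = defaultdict(list)                  # (ns, res) -> [(timer, client), ...]
          with open(path) as f:
              for line in f:
                  m = EVICT_RE.search(line)
                  if not m:
                      continue
                  timer, client, ns = int(m.group(1)), m.group(2), m.group(3)
                  res = RES_RE.search(line)
                  evictions[(ns, res.group(1) if res else "?")].append((timer, client))
          return evictions

      if __name__ == "__main__":
          for (ns, res), hits in sorted(group_evictions().items()):
              clients = sorted({c for _, c in hits})
              print("%s res %s: %d eviction(s) from %s" % (ns, res, len(hits), clients))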

      I don't know if this could help, but I found a description of a similar problem in bugzilla 23352.
      A patch (attachment 31479) was landed in 1.8.6 and in 2.0.59 (LU-13), but our customer is currently
      running 2.0.0.1, which does not contain this patch.
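
      For what it's worth, a quick way to double-check whether a given Lustre source tree already
      carries that fix is to grep its git history for the bug references. A minimal sketch follows
      (Python; the repository path and the commit-message keywords "23352" / "LU-13" are assumptions
      about how the landed patch is referenced):

      import subprocess

      def tree_mentions_fix(repo, keywords=("23352", "LU-13")):
          """Return True if any commit message in 'repo' mentions one of the keywords."""
          for kw in keywords:
              out = subprocess.run(["git", "-C", repo, "log", "--oneline", "--grep", kw],
                                   capture_output=True, text=True, check=True).stdout
              if out.strip():
                  return True
          return False

      # Example usage (the path is hypothetical):
      # print(tree_mentions_fix("/path/to/lustre-release"))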

      Below is an extract of the server log provided by on-site support:

      #context:

      We quite frequently encounter the following situation:

                 - a bunch of client evictions from the same OST within a short period of time; all
                   evictions are against the same resource, locked in "PW" mode, for which all
                   impacted clients are competing;
                 - all clients are running the same job, and they reconnect very quickly after being
                   evicted;
                 - during the same period of time, the OSS also complains about one particular thread
                   being hung/inactive, always with the same messages/stack trace shown below.

      #consequences:

      Job losses/aborts.

      #details:

      Server log extract:
      ======================================================================
      ...
      LustreError: 0:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback
      timer expired after 158s: evicting client at *JO.BOO.IT.PB*@o2ib10  ns:
      filter-ptmp-OST002f_UUID lock: ffff880e1d7d0b40/0xc2c2144dc3413119 lrc: 3/0,0
      mode: PW/PW res: *3261626*/0 rrc: 77 type: EXT [0->18446744073709551615] (req
      0->18446744073709551615) flags: 0x20 remote: 0xc792259022f8f7d5 expref: 32 pid:
      42832 timeout 4362673415
      Lustre: DEBUG MARKER: Thu Oct 27 10:35:01 2011
      ....
      LustreError: 19:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback
      timer expired after 259s: evicting client at JO.BOO.IT.PW@o2ib10  ns:
      filter-ptmp-OST002f_UUID lock: ffff880a229896c0/0xc2c2144dc3413127 lrc: 3/0,0
      mode: PW/PW res: *3261626*/0 rrc: 75 type: EXT [0->18446744073709551615] (req
      0->18446744073709551615) flags: 0x20 remote: 0xbc822cf3d7daaf53 expref: 11 pid:
      42796 timeout 4362683505
      LustreError: *42960*:0:(ldlm_request.c:85:ldlm_expired_completion_wait()) ###
      lock timed out (enqueued at 1319704407, 200s ago); not entering recovery in
      server code, just going back to sleep ns: filter-ptmp-OST002f_UUID lock:
      ffff8810233b6000/0xc2c2144dc341793b lrc: 3/0,1 mode: --/PW res: *3261626*/0 rrc:
      74 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags:
      0x80004000 remote: 0x0 expref: -99 pid: 42960 timeout 0
      ...
      LustreError: 0:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback
      timer expired after 360s: evicting client at *JO.BOO.IT.IF*@o2ib10  ns:
      filter-ptmp-OST002f_UUID lock: ffff880fb4c42d80/0xc2c2144dc341312e lrc: 3/0,0
      mode: PW/PW res: *3261626*/0 rrc: 74 type: EXT [0->18446744073709551615] (req
      0->18446744073709551615) flags: 0x20 remote: 0xe712bf3eabc648a1 expref: 31 pid:
      42819 timeout 4362693603
      LustreError: 42823:0:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing
      error (-107)  req@ffff8807fb385000 x1381763496185989/t0(0) o400-><?>@<?>:0/0
      lens 192/0 e 0 to 0 dl 1319704723 ref 1 fl Interpret:H/ffffffff/ffffffff rc
      -107/-1
      
      LustreError: 43:0:(ldlm_lockd.c:345:waiting_locks_callback()) ### lock callback
      timer expired after 461s: evicting client at *JO.BOO.IT.IT*@o2ib10  ns:
      filter-ptmp-OST002f_UUID lock: ffff8808554dad80/0xc2c2144dc341313c lrc: 3/0,0
      mode: PW/PW res: *3261626*/0 rrc: 72 type: EXT [0->18446744073709551615] (req
      0->18446744073709551615) flags: 0x20 remote: 0x6d2273e9da542e3a expref: 14 pid:
      42785 timeout 4362703703
      Lustre: Service thread pid *42960* was inactive for 400.00s. The thread might be
      hung, or it might only be slow and will resume later. Dumping the stack trace
      for debugging purposes:
      Pid: 42960, comm: ll_ost_io_88
      
      Call Trace:
      [<ffffffff81467eba>] ? schedule_timeout+0x19a/0x2e0
      [<ffffffffa05a0144>] ? ldlm_expired_completion_wait+0x2c4/0x2e0 [ptlrpc]
      [<ffffffffa059fe80>] ? ldlm_expired_completion_wait+0x0/0x2e0 [ptlrpc]
      [<ffffffffa041d75e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa05a3d75>] ldlm_completion_ast+0x575/0x740 [ptlrpc]
      [<ffffffffa0585938>] ? ldlm_lock_enqueue+0x178/0xa50 [ptlrpc]
      [<ffffffff8104da10>] ? default_wake_function+0x0/0x20
      [<ffffffffa05a348f>] ldlm_cli_enqueue_local+0x1cf/0x540 [ptlrpc]
      [<ffffffff81176fc0>] ? mntput_no_expire+0x30/0x110
      [<ffffffffa05a3800>] ? ldlm_completion_ast+0x0/0x740 [ptlrpc]
      [<ffffffffa05a22c0>] ? ldlm_blocking_ast+0x0/0x200 [ptlrpc]
      [<ffffffffa068c540>] filter_destroy+0x2d0/0x1590 [obdfilter]
      [<ffffffffa05a22c0>] ? ldlm_blocking_ast+0x0/0x200 [ptlrpc]
      [<ffffffffa05a3800>] ? ldlm_completion_ast+0x0/0x740 [ptlrpc]
      [<ffffffffa05ca994>] ? lustre_msg_add_version+0x94/0x110 [ptlrpc]
      [<ffffffffa05cac4b>] ? lustre_pack_reply_v2+0x23b/0x310 [ptlrpc]
      [<ffffffffa05f529b>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
      [<ffffffffa05cadf1>] ? lustre_pack_reply_flags+0xd1/0x1f0 [ptlrpc]
      [<ffffffffa0552df3>] ost_destroy+0x2b3/0x890 [ost]
      [<ffffffffa05c96e4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa055e4ba>] ost_handle+0x305a/0x4f90 [ost]
      [<ffffffff8125c510>] ? __bitmap_weight+0x50/0xb0
      [<ffffffff8124ea83>] ? cpumask_next_and+0x23/0x40
      [<ffffffff81052065>] ? find_busiest_group+0x245/0xa90
      [<ffffffffa05c96e4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa05d85f1>] ptlrpc_server_handle_request+0x421/0xef0 [ptlrpc]
      [<ffffffff810408fe>] ? activate_task+0x2e/0x40
      [<ffffffff8104d906>] ? try_to_wake_up+0x276/0x380
      [<ffffffff8104da22>] ? default_wake_function+0x12/0x20
      [<ffffffff810411a9>] ? __wake_up_common+0x59/0x90
      [<ffffffffa041d63e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [<ffffffffa05d9992>] ptlrpc_main+0x8d2/0x1550 [ptlrpc]
      [<ffffffff8104da10>] ? default_wake_function+0x0/0x20
      [<ffffffff8100d1aa>] child_rip+0xa/0x20
      [<ffffffffa05d90c0>] ? ptlrpc_main+0x0/0x1550 [ptlrpc]
      [<ffffffff8100d1a0>] ? child_rip+0x0/0x20
      
      LustreError: dumping log to /tmp/lustre-log.1319704808.42960
      
      ...
      
      Lustre: *42960*:0:(service.c:1703:ptlrpc_server_handle_request()) @@@ Request
      x1381763495912720 took longer than estimated (1164:2161s); client may timeout. 
      req@ffff8807b76a8c00 x1381763495912720/t137044087(0)
      o6->25c5744e-aed5-6dac-8eff-070feefd72ea@:0/0 lens 512/400 e 5 to 0 dl
      1319705571 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
      Lustre: Service thread pid 42960 completed after 3324.32s. This indicates the
      system was overloaded (too many service threads, or there were not enough
      hardware resources).
      Lustre: DEBUG MARKER: Thu Oct 27 11:30:01 2011 
      ...
      
      ======================================================================
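
      As a complement to the extract above, the per-thread stall time can be measured by pairing the
      "Service thread pid N was inactive for X s" message with the matching "Service thread pid N
      completed after Y s" message. A minimal sketch (Python; the syslog path is again an assumption,
      and the regexes match the raw messages, without the '*' highlighting used in the extract):

      import re

      INACTIVE_RE = re.compile(r"Service thread pid (\d+) was inactive for ([\d.]+)s")
      COMPLETE_RE = re.compile(r"Service thread pid (\d+) completed after ([\d.]+)s")

      def thread_stalls(path="/var/log/messages"):       # log path is an assumption
          stalls = {}                                     # pid -> [first inactive, completed]
          with open(path) as f:
              for line in f:
                  m = INACTIVE_RE.search(line)
                  if m and m.group(1) not in stalls:
                      stalls[m.group(1)] = [float(m.group(2)), None]
                  m = COMPLETE_RE.search(line)
                  if m and m.group(1) in stalls:
                      stalls[m.group(1)][1] = float(m.group(2))
          return stalls

      if __name__ == "__main__":
          for pid, (first_inactive, completed) in sorted(thread_stalls().items()):
              print("pid %s: first reported inactive after %.2fs, completed after %s"
                    % (pid, first_inactive, "%.2fs" % completed if completed else "n/a"))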
      

          People

            laisiyao Lai Siyao
            patrick.valentin Patrick Valentin (Inactive)
            Votes: 0
            Watchers: 4
