[LU-5913] client stuck in ptlrpc_invalidate_import Created: 12/Nov/14  Updated: 16/Oct/15  Resolved: 16/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None
Environment:

client and server version lustre2.4.3-7nas


Severity: 3
Rank (Obsolete): 16510

 Description   

DUP of LU-10?
1. Client gets an obd_ping failure
2. Client is evicted
3. Client can't reconnect and is stuck in ptlrpc_invalidate_import (see the sketch after the console log below)

Nov 12 04:32:30 pfe22 kernel: [69643.816473] LustreError: 11-0: nbp9-OST0075-osc-ffff880a28404400: Communicating with 10.151.26.11@o2ib, operation obd_ping failed with -107.
Nov 12 04:32:30 pfe22 kernel: [69643.854158] Lustre: nbp9-OST0075-osc-ffff880a28404400: Connection to nbp9-OST0075 (at 10.151.26.11@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Nov 12 04:32:30 pfe22 kernel: [69643.922758] LustreError: 167-0: nbp9-OST0075-osc-ffff880a28404400: This client was evicted by nbp9-OST0075; in progress operations using this service will fail.
Nov 12 04:34:11 pfe22 kernel: [69743.757653] LustreError: 92481:0:(import.c:324:ptlrpc_invalidate_import()) nbp9-OST0075_UUID: rc = -110 waiting for callback (1 != 0)
Nov 12 04:34:11 pfe22 kernel: [69743.793518] LustreError: 92481:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff8802204cec00 x1484496199485132/t0(0) o4->nbp9-OST0075-osc-ffff880a28404400@10.151.26.11@o2ib:6/4 lens 488/448 e 0 to 0 dl 1415726912 ref 2 fl Rpc:RE/0/ffffffff rc -5/-1
Nov 12 04:34:11 pfe22 kernel: [69743.866993] LustreError: 92481:0:(import.c:366:ptlrpc_invalidate_import()) nbp9-OST0075_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
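
The "rc = -110 waiting for callback (1 != 0)" lines are printed from ptlrpc_invalidate_import(), which, as I understand it, keeps re-arming a timed wait while the in-flight RPC count has not dropped to zero. The fragment below is a minimal, self-contained user-space model of that retry pattern, not the actual import.c code; all names here (inflight_rpcs, wait_for_rpcs, the 1-second timeout) are invented for illustration. It only shows why a blocking callback that never completes keeps the count non-zero and therefore keeps the loop, and the client, stuck:

/*
 * Illustration only -- NOT Lustre code.  Models a timed wait that is
 * re-armed until an in-flight counter reaches zero.  If the counter
 * never drops (the stuck blocking AST in this ticket), the loop logs
 * a timeout forever, which is what the console messages above show.
 */
#include <stdio.h>
#include <unistd.h>

#define ETIMEDOUT_RC  (-110)          /* matches the -110 in the log */

static int inflight_rpcs = 1;         /* stays 1 while the callback is stuck */

static int wait_for_rpcs(int timeout_s)
{
    sleep(timeout_s);                 /* stand-in for a timed wait queue */
    return inflight_rpcs == 0 ? 0 : ETIMEDOUT_RC;
}

int main(void)
{
    int rc;

    /* Retry forever, logging each timeout, like the kernel messages. */
    do {
        rc = wait_for_rpcs(1);
        if (rc != 0)
            printf("rc = %d waiting for callback (%d != 0)\n",
                   rc, inflight_rpcs);
    } while (rc != 0);

    return 0;
}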

We found the ldlm_bl_* threads stuck in D state.

Stack traceback for pid 8080
0xffff8802569b2540     8080        2  0    1   D  0xffff8802569b2bb0  ldlm_bl_04
 [<ffffffff8146fb6b>] thread_return+0x0/0x295
 [<ffffffff81470c58>] __mutex_lock_slowpath+0xf8/0x150
 [<ffffffff814706ea>] mutex_lock+0x1a/0x40
 [<ffffffffa08bd28e>] cl_lock_mutex_get+0x6e/0xc0 [obdclass]
 [<ffffffffa0b8988e>] osc_dlm_blocking_ast0+0x5e/0x210 [osc]
 [<ffffffffa0b89a8c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc]
 [<ffffffffa09e30a0>] ldlm_handle_bl_callback+0xc0/0x420 [ptlrpc]
 [<ffffffffa09e3609>] ldlm_bl_thread_main+0x209/0x430 [ptlrpc]
 [<ffffffff8147ade4>] kernel_thread_helper+0x4/0x10
[0]kdb> btp 8090
Stack traceback for pid 8090
0xffff88024c9a6640     8090        2  0    6   D  0xffff88024c9a6cb0  ldlm_bl_12
 [<ffffffff8146fb6b>] thread_return+0x0/0x295
 [<ffffffff81470c58>] __mutex_lock_slowpath+0xf8/0x150
 [<ffffffff814706ea>] mutex_lock+0x1a/0x40
 [<ffffffffa08bd28e>] cl_lock_mutex_get+0x6e/0xc0 [obdclass]
 [<ffffffffa0b8988e>] osc_dlm_blocking_ast0+0x5e/0x210 [osc]
 [<ffffffffa0b89a8c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc]
 [<ffffffffa09e30a0>] ldlm_handle_bl_callback+0xc0/0x420 [ptlrpc]
 [<ffffffffa09e3609>] ldlm_bl_thread_main+0x209/0x430 [ptlrpc]
 [<ffffffff8147ade4>] kernel_thread_helper+0x4/0x10
[0]kdb> go


 Comments   
Comment by Bruno Faccini (Inactive) [ 13/Nov/14 ]

Hello Mahmoud,
Have you had the chance to at least collect the full stack traces and the Lustre debug log? Or maybe a full crash-dump of the client has been taken?

Comment by Peter Jones [ 13/Nov/14 ]

Bobijam

Could you please advise?

Thanks

Peter

Comment by Mahmoud Hanafi [ 13/Nov/14 ]

I have uploaded lustre debug logs and stacktrace for all processes to the ftp site.

ftp.whamcloud.com/uploads/LU-5913/r447i2n10.stacktrace.gz
ftp.whamcloud.com/uploads/LU-5913/r447i2n10.lustredebug.gz

Comment by Jinshan Xiong (Inactive) [ 17/Nov/14 ]

I saw the statahead thread below trying to enqueue a lock, and the enqueue was stuck writing back dirty pages. This is bad because it does so while holding the parent lock, which may cause a deadlock (a rough model of the resulting wait cycle follows the trace).

[1415909369.169600] ll_sa_96389     S ffff880734597460     0 96438      2 0x00000000
[1415909369.169600]  ffff880734597390 0000000000000046 ffff880734596010 0000000000011800
[1415909369.169600]  0000000000011800 0000000000011800 0000000000011800 ffff880734597fd8
[1415909369.169600]  ffff880734597fd8 0000000000011800 ffff880688818500 ffffffff81a11020
[1415909369.169600] Call Trace:
[1415909369.169600]  [<ffffffffa09a4b7b>] osc_extent_wait+0x5eb/0x680 [osc]
[1415909369.169600]  [<ffffffffa09a5146>] osc_cache_wait_range+0x536/0x820 [osc]
[1415909369.169600]  [<ffffffffa09a5f55>] osc_cache_writeback_range+0xb25/0x1140 [osc]
[1415909369.169600]  [<ffffffffa099025d>] osc_lock_flush+0x7d/0x260 [osc]
[1415909369.169600]  [<ffffffffa0990539>] osc_lock_cancel+0xf9/0x1e0 [osc]
[1415909369.169600]  [<ffffffffa06c3225>] cl_lock_cancel0+0x65/0x150 [obdclass]
[1415909369.169600]  [<ffffffffa06c3f7b>] cl_lock_cancel+0x14b/0x150 [obdclass]
[1415909369.169600]  [<ffffffffa099107d>] osc_lock_blocking+0x5d/0xf0 [osc]
[1415909369.169600]  [<ffffffffa0991929>] osc_dlm_blocking_ast0+0xf9/0x210 [osc]
[1415909369.169600]  [<ffffffffa0991a8c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc]
[1415909369.169600]  [<ffffffffa07d4e0f>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc]
[1415909369.169600]  [<ffffffffa07e320f>] ldlm_cli_cancel_local+0x7f/0x480 [ptlrpc]
[1415909369.169600]  [<ffffffffa07e65a2>] ldlm_cli_cancel_list_local+0xf2/0x290 [ptlrpc]
[1415909369.169600]  [<ffffffffa07e839f>] ldlm_prep_elc_req+0x3df/0x490 [ptlrpc]
[1415909369.169600]  [<ffffffffa07e846f>] ldlm_prep_enqueue_req+0x1f/0x30 [ptlrpc]
[1415909369.169600]  [<ffffffffa0977280>] osc_enqueue_base+0x150/0x660 [osc]
[1415909369.169600]  [<ffffffffa09914a0>] osc_lock_enqueue+0x310/0x4c0 [osc]
[1415909369.169600]  [<ffffffffa06c4722>] cl_enqueue_kick+0x62/0x150 [obdclass]
[1415909369.169600]  [<ffffffffa06c7f31>] cl_enqueue_try+0xc1/0x210 [obdclass]
[1415909369.169600]  [<ffffffffa0a27e2d>] lov_lock_enqueue_one+0x4d/0x170 [lov]
[1415909369.169600]  [<ffffffffa0a2b07b>] lov_lock_enqueue+0x16b/0x3e0 [lov]
[1415909369.169600]  [<ffffffffa06c4722>] cl_enqueue_kick+0x62/0x150 [obdclass]
[1415909369.169600]  [<ffffffffa06c7f31>] cl_enqueue_try+0xc1/0x210 [obdclass]
[1415909369.169600]  [<ffffffffa06c8817>] cl_enqueue_locked+0x77/0x1e0 [obdclass]
[1415909369.169600]  [<ffffffffa06c9509>] cl_lock_request+0x79/0x280 [obdclass]
[1415909369.169600]  [<ffffffffa0afcd2d>] cl_glimpse_lock+0x17d/0x490 [lustre]
[1415909369.169600]  [<ffffffffa0afd1ef>] cl_glimpse_size0+0x1af/0x1e0 [lustre]
[1415909369.169600]  [<ffffffffa0af5aae>] ll_agl_trigger+0x2ae/0x4d0 [lustre]
[1415909369.169600]  [<ffffffffa0afbeac>] ll_statahead_thread+0x40c/0xd10 [lustre]
[1415909369.169600]  [<ffffffff8147ae64>] kernel_thread_helper+0x4/0x10
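
To make the dependency explicit, here is a rough, illustrative-only pthread model of the cycle I mean; none of these names exist in Lustre and the roles are simplified. The "statahead" thread stands for ll_sa_* above (holding the lock mutex while waiting for write-back), and the "ldlm_bl" thread stands for the ldlm_bl_* threads in the earlier traces (blocked in cl_lock_mutex_get), so neither side can make progress:

/*
 * Illustration only -- NOT Lustre code.  Two threads wait on each
 * other: one holds the lock mutex and waits for I/O completion, the
 * other must take the same mutex before it could ever complete that
 * I/O path.  Running this deadlocks, which is the shape of the hang.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t cl_lock_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t io_mutex      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  io_done_cv    = PTHREAD_COND_INITIALIZER;
static int io_done;

static void *statahead_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&cl_lock_mutex);       /* "holding parent lock"        */
    printf("statahead: waiting for dirty-page write-back\n");
    pthread_mutex_lock(&io_mutex);
    while (!io_done)                          /* osc_extent_wait() analogue   */
        pthread_cond_wait(&io_done_cv, &io_mutex);
    pthread_mutex_unlock(&io_mutex);
    pthread_mutex_unlock(&cl_lock_mutex);
    return NULL;
}

static void *ldlm_bl_thread(void *arg)
{
    (void)arg;
    printf("ldlm_bl: handling blocking AST, need the lock mutex\n");
    pthread_mutex_lock(&cl_lock_mutex);       /* cl_lock_mutex_get() analogue */
    /* Only after taking the mutex could this thread finish the work    */
    /* that signals io_done -- so this point is never reached.          */
    pthread_mutex_lock(&io_mutex);
    io_done = 1;
    pthread_cond_signal(&io_done_cv);
    pthread_mutex_unlock(&io_mutex);
    pthread_mutex_unlock(&cl_lock_mutex);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, statahead_thread, NULL);
    pthread_create(&t2, NULL, ldlm_bl_thread, NULL);
    pthread_join(t1, NULL);                   /* never returns: deadlock */
    pthread_join(t2, NULL);
    return 0;
}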

I guess this is a reproduction of LU-4300. If we can get the log from the OST side, we can verify this.

Anyway, let's try the patch from LU-4300 (master commit 44cdf59) and see if it fixes the problem.

Comment by Mahmoud Hanafi [ 18/Nov/14 ]

Which logs would you like from the OST side?

Comment by Zhenyu Xu [ 20/Nov/14 ]

Lustre debug log.

Comment by Mahmoud Hanafi [ 04/Dec/14 ]

The LU-4300 patch looks to have fixed the issue. Since applying it we have not hit the bug.

Comment by Mahmoud Hanafi [ 16/Oct/15 ]

Please close this

Comment by Peter Jones [ 16/Oct/15 ]

Thanks Mahmoud
