[LU-5913] client stuck in ptlrpc_invalidate_import Created: 12/Nov/14 Updated: 16/Oct/15 Resolved: 16/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
client and server version lustre2.4.3-7nas |
||
| Severity: | 3 |
| Rank (Obsolete): | 16510 |
| Description |
|
?DUP of Nov 12 04:32:30 pfe22 kernel: [69643.816473] LustreError: 11-0: nbp9-OST0075-osc-ffff880a28404400: Communicating with 10.151.26.11@o2ib, operation obd_ping failed with -107. Nov 12 04:32:30 pfe22 kernel: [69643.854158] Lustre: nbp9-OST0075-osc-ffff880a28404400: Connection to nbp9-OST0075 (at 10.151.26.11@o2ib) was lost; in progress operations using this service will wait for recovery to complete Nov 12 04:32:30 pfe22 kernel: [69643.922758] LustreError: 167-0: nbp9-OST0075-osc-ffff880a28404400: This client was evicted by nbp9-OST0075; in progress operations using this service will fail. Nov 12 04:34:11 pfe22 kernel: [69743.757653] LustreError: 92481:0:(import.c:324:ptlrpc_invalidate_import()) nbp9-OST0075_UUID: rc = -110 waiting for callback (1 != 0) Nov 12 04:34:11 pfe22 kernel: [69743.793518] LustreError: 92481:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff8802204cec00 x1484496199485132/t0(0) o4->nbp9-OST0075-osc-ffff880a28404400@10.151.26.11@o2ib:6/4 lens 488/448 e 0 to 0 dl 1415726912 ref 2 fl Rpc:RE/0/ffffffff rc -5/-1 Nov 12 04:34:11 pfe22 kernel: [69743.866993] LustreError: 92481:0:(import.c:366:ptlrpc_invalidate_import()) nbp9-OST0075_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out. We find ldlm_bl_ threads stuck in D state. Stack traceback for pid 8080 0xffff8802569b2540 8080 2 0 1 D 0xffff8802569b2bb0 ldlm_bl_04 [<ffffffff8146fb6b>] thread_return+0x0/0x295 [<ffffffff81470c58>] __mutex_lock_slowpath+0xf8/0x150 [<ffffffff814706ea>] mutex_lock+0x1a/0x40 [<ffffffffa08bd28e>] cl_lock_mutex_get+0x6e/0xc0 [obdclass] [<ffffffffa0b8988e>] osc_dlm_blocking_ast0+0x5e/0x210 [osc] [<ffffffffa0b89a8c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc] [<ffffffffa09e30a0>] ldlm_handle_bl_callback+0xc0/0x420 [ptlrpc] [<ffffffffa09e3609>] ldlm_bl_thread_main+0x209/0x430 [ptlrpc] [<ffffffff8147ade4>] kernel_thread_helper+0x4/0x10 [0]kdb> btp 8090 Stack traceback for pid 8090 0xffff88024c9a6640 8090 2 0 6 D 0xffff88024c9a6cb0 ldlm_bl_12 [<ffffffff8146fb6b>] thread_return+0x0/0x295 [<ffffffff81470c58>] __mutex_lock_slowpath+0xf8/0x150 [<ffffffff814706ea>] mutex_lock+0x1a/0x40 [<ffffffffa08bd28e>] cl_lock_mutex_get+0x6e/0xc0 [obdclass] [<ffffffffa0b8988e>] osc_dlm_blocking_ast0+0x5e/0x210 [osc] [<ffffffffa0b89a8c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc] [<ffffffffa09e30a0>] ldlm_handle_bl_callback+0xc0/0x420 [ptlrpc] [<ffffffffa09e3609>] ldlm_bl_thread_main+0x209/0x430 [ptlrpc] [<ffffffff8147ade4>] kernel_thread_helper+0x4/0x10 [0]kdb> go |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 13/Nov/14 ] |
|
Hello Mahmoud, |
| Comment by Peter Jones [ 13/Nov/14 ] |
|
Bobijam Could you please advise? Thanks Peter |
| Comment by Mahmoud Hanafi [ 13/Nov/14 ] |
|
I have uploaded lustre debug logs and stacktrace for all processes to the ftp site. ftp.whamcloud.com/uploads/ |
| Comment by Jinshan Xiong (Inactive) [ 17/Nov/14 ] |
|
I saw the statehead thread below was trying to enqueue a lock, then the enqueue process was stuck at writing back dirty pages. This is bad as it's doing this by holding parent lock, which may cause deadlock. [1415909369.169600] ll_sa_96389 S ffff880734597460 0 96438 2 0x00000000 [1415909369.169600] ffff880734597390 0000000000000046 ffff880734596010 0000000000011800 [1415909369.169600] 0000000000011800 0000000000011800 0000000000011800 ffff880734597fd8 [1415909369.169600] ffff880734597fd8 0000000000011800 ffff880688818500 ffffffff81a11020 [1415909369.169600] Call Trace: [1415909369.169600] [<ffffffffa09a4b7b>] osc_extent_wait+0x5eb/0x680 [osc] [1415909369.169600] [<ffffffffa09a5146>] osc_cache_wait_range+0x536/0x820 [osc] [1415909369.169600] [<ffffffffa09a5f55>] osc_cache_writeback_range+0xb25/0x1140 [osc] [1415909369.169600] [<ffffffffa099025d>] osc_lock_flush+0x7d/0x260 [osc] [1415909369.169600] [<ffffffffa0990539>] osc_lock_cancel+0xf9/0x1e0 [osc] [1415909369.169600] [<ffffffffa06c3225>] cl_lock_cancel0+0x65/0x150 [obdclass] [1415909369.169600] [<ffffffffa06c3f7b>] cl_lock_cancel+0x14b/0x150 [obdclass] [1415909369.169600] [<ffffffffa099107d>] osc_lock_blocking+0x5d/0xf0 [osc] [1415909369.169600] [<ffffffffa0991929>] osc_dlm_blocking_ast0+0xf9/0x210 [osc] [1415909369.169600] [<ffffffffa0991a8c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc] [1415909369.169600] [<ffffffffa07d4e0f>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc] [1415909369.169600] [<ffffffffa07e320f>] ldlm_cli_cancel_local+0x7f/0x480 [ptlrpc] [1415909369.169600] [<ffffffffa07e65a2>] ldlm_cli_cancel_list_local+0xf2/0x290 [ptlrpc] [1415909369.169600] [<ffffffffa07e839f>] ldlm_prep_elc_req+0x3df/0x490 [ptlrpc] [1415909369.169600] [<ffffffffa07e846f>] ldlm_prep_enqueue_req+0x1f/0x30 [ptlrpc] [1415909369.169600] [<ffffffffa0977280>] osc_enqueue_base+0x150/0x660 [osc] [1415909369.169600] [<ffffffffa09914a0>] osc_lock_enqueue+0x310/0x4c0 [osc] [1415909369.169600] [<ffffffffa06c4722>] cl_enqueue_kick+0x62/0x150 [obdclass] [1415909369.169600] [<ffffffffa06c7f31>] cl_enqueue_try+0xc1/0x210 [obdclass] [1415909369.169600] [<ffffffffa0a27e2d>] lov_lock_enqueue_one+0x4d/0x170 [lov] [1415909369.169600] [<ffffffffa0a2b07b>] lov_lock_enqueue+0x16b/0x3e0 [lov] [1415909369.169600] [<ffffffffa06c4722>] cl_enqueue_kick+0x62/0x150 [obdclass] [1415909369.169600] [<ffffffffa06c7f31>] cl_enqueue_try+0xc1/0x210 [obdclass] [1415909369.169600] [<ffffffffa06c8817>] cl_enqueue_locked+0x77/0x1e0 [obdclass] [1415909369.169600] [<ffffffffa06c9509>] cl_lock_request+0x79/0x280 [obdclass] [1415909369.169600] [<ffffffffa0afcd2d>] cl_glimpse_lock+0x17d/0x490 [lustre] [1415909369.169600] [<ffffffffa0afd1ef>] cl_glimpse_size0+0x1af/0x1e0 [lustre] [1415909369.169600] [<ffffffffa0af5aae>] ll_agl_trigger+0x2ae/0x4d0 [lustre] [1415909369.169600] [<ffffffffa0afbeac>] ll_statahead_thread+0x40c/0xd10 [lustre] [1415909369.169600] [<ffffffff8147ae64>] kernel_thread_helper+0x4/0x10 I guess this is a reproduction of Anyway, let's try patch at |
| Comment by Mahmoud Hanafi [ 18/Nov/14 ] |
|
Which logs would you like from the OST side? |
| Comment by Zhenyu Xu [ 20/Nov/14 ] |
|
Lustre debug log. |
| Comment by Mahmoud Hanafi [ 04/Dec/14 ] |
|
|
| Comment by Mahmoud Hanafi [ 16/Oct/15 ] |
|
Please close this |
| Comment by Peter Jones [ 16/Oct/15 ] |
|
Thanks Mahmoud |