[LU-2030] LDLM timeout resulting in client eviction and truncate system call to fail with EINTR Created: 25/Sep/12 Updated: 23/Jan/16 Resolved: 23/Jan/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | llnl | ||
| Environment: | Lustre Client: 2.1.2-4chaos |
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4159 |
| Description |
|
I see errors reported from the truncate system call when running SIMUL on 2 nodes with 16 processes each (32 processes total). The reproducer is:

$ salloc -N2 -n32
$ srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635

Specifically, I see truncate return EINTR when running the test under strace:

truncate("/p/lcrater2/surya1/simul-test/simul_truncate.0", 1024) = -1 EINTR (Interrupted system call)
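For reference, a minimal sketch of how the failing call can be isolated under strace; the strace options and output path below are illustrative and not taken from the original run:

$ salloc -N2 -n32
$ srun -- strace -ff -e trace=truncate -o /tmp/simul-strace \
      /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635

With -ff, strace writes one trace file per traced process, so the rank that received EINTR can be identified from its pid.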
The console logs from the clients and servers from a case where it hit this are below.

Test:

hype355@root:srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635 -f14 -l15
Simul is running with 32 process(es)
13:14:38: Set iteration 0
13:14:38: Running test #14(iter 0): creat, shared mode.
13:14:38: Running test #15(iter 0): truncate, shared mode.
13:18:00: Process 3(hype214): FAILED in simul_truncate, truncate failed: Interrupted system call

Client console messages:

# hype355 /root > tail -f /var/log/conman/console.hype21{3,4}
==> /var/log/conman/console.hype213 <==
2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
2012-09-25 13:16:19 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lc2-OST000c-osc-ffff880833b85400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880748182d40 (167630808/0/0/0) (rc: 1)
2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
==> /var/log/conman/console.hype214 <==
2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
2012-09-25 13:18:00 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
Server console messages:

> $ tail /var/log/conman/console.zwicky49
> ==> /var/log/conman/console.zwicky49 <==
> 2012-09-25 13:16:19 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.14@o2ib2 ns: filter-lc2-OST000c_UUID lock: ffff88028aee6d80/0x8691e72961eb7429 lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xf19c08e9d42c3468 expref: 4 pid: 4358 timeout 7757255590
> 2012-09-25 13:16:19 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.14@o2ib2 arrived at 1348604178 with bad export cookie 9696785637627868352
> 2012-09-25 13:16:19 Lustre: 24547:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from 7f3b19c3-f206-6f48-4758-3716d5199ca1@192.168.121.14@o2ib2 t0 exp (null) cur 1348604179 last 0
> 2012-09-25 13:17:59 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.15@o2ib2 ns: filter-lc2-OST000c_UUID lock: ffff8801d0235b40/0x8691e72961eb744c lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4abfe4a expref: 4 pid: 24547 timeout 7757356173
> 2012-09-25 13:18:00 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.15@o2ib2 arrived at 1348604279 with bad export cookie 9696785637609291136
> 2012-09-25 13:18:00 Lustre: 25184:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from e0aa3722-987c-e30c-5c60-102f582dddae@192.168.121.15@o2ib2 t0 exp (null) cur 1348604280 last 0

The client version of lustre on the nodes:

hype355@root:pdsh -w hype[213-214] 'rpm -qa | grep lustre | sort' | dshbak -c
----------------
hype[213-214]
----------------
lustre-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
lustre-modules-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
lustre-tools-llnl-1.4-2.ch5.x86_64

The server version of lustre:

$ pdsh -w zwicky[48-49] 'rpm -qa | grep lustre | sort' | dshbak -c
----------------
zwicky[48-49]
----------------
lustre-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
lustre-modules-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
lustre-tools-llnl-1.4-2.ch5.x86_64 |
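As a side note, the resource in the eviction messages (res 167630808/0 on lc2-OST000c) is the id of the OST object under contention. Assuming the shared test file is still present, a minimal sketch of mapping it to its OST objects:

$ lfs getstripe /p/lcrater2/surya1/simul-test/simul_truncate.0

The obdidx/objid columns in the output should correspond to the OST index and the resource id reported by the server.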
| Comments |
| Comment by Peter Jones [ 25/Sep/12 ] |
|
Hongchao, could you please look into this one? Thanks, Peter |
| Comment by Hongchao Zhang [ 26/Sep/12 ] |
|
Is the debug log available? An extent lock was held by the client for much longer than the server permitted and the client was evicted; the debug log should show why the lock was not cancelled in time. |
| Comment by Prakash Surya (Inactive) [ 26/Sep/12 ] |
|
Here is a Lustre log file with '+dlmtrace' from the server that had the lock expire. |
| Comment by Prakash Surya (Inactive) [ 26/Sep/12 ] |
|
Here are logs from the server and the client with '+dlmtrace'. The client also dumped this to its console:

2012-09-26 10:08:14 LustreError: 11-0: lc2-OST000c-osc-ffff880831ce8800: Communicating with 10.1.1.49@o2ib9, operation ost_punch failed with -107.
2012-09-26 10:08:14 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2012-09-26 10:08:14 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff88043156ddd8[2 3 0 1 1 00000000] W(2):[0, 18446744073709551615]@[0x1000c0000:0x9fdd7d9:0x0] {
2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff88043156e4e0: [0 ffff880773391748 W(2):[0, 18446744073709551615]@[0x20097c3a8:0x16a97:0x0]]
2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88043156fb50: ffff880199befb40 00120002 0xbd300c41b4ac0bfd 3 ffff8803dd5e9068 size: 1024 mtime: 1348679193 atime: 1348679193 ctime: 1348679193 blocks: 8
2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff88043156ddd8
2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
2012-09-26 10:08:14 LustreError: 79024:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lc2-OST000c-osc-ffff880831ce8800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff88082a0540b8[2 4 0 2 0 00000000] W(2):[0, 18446744073709551615]@[0x20097c3a8:0x16a97:0x0] {
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) vvp@ffff880773390420:
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) lov@ffff880773391748: 2
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) 0 0: ---
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) 1 0: ---
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try())
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff88082a0540b8
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -5
2012-09-26 10:08:14 LustreError: 79024:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880820256ac0 (167630809/0/0/0) (rc: 0)
2012-09-26 10:08:14 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
|
| Comment by Prakash Surya (Inactive) [ 26/Sep/12 ] |
|
Looking at the server logs, the RX errors from the LNet routers are a bit troubling. The clients and the servers have to go through a set of routers, 10.1.1.53@o2ib9 being the NID of one of the routers:

00010000:00010000:4.0:1348679193.857904:0:20439:0:(ldlm_lockd.c:1069:ldlm_handle_enqueue0()) ### server-side enqueue handler START
00010000:00010000:4.0:1348679193.857909:0:20439:0:(ldlm_lockd.c:1154:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: filter-lc2-OST000c_UUID lock: ffff880248b78240/0x8691e72961eb80f5 lrc: 2/0,0 mode: --/PW res: 167630809/0 rrc: 2 type: EXT [0->0] (req 0->0) flags: 0x0 remote: 0xf19c08e9d42c4364 expref: -99 pid: 20439 timeout 0
00010000:00010000:4.0:1348679193.857921:0:20439:0:(ldlm_lock.c:626:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 2/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 remote: 0xbd300c41b4ac0bfd expref: 14 pid: 20439 timeout 0
00010000:00010000:4.0:1348679193.857925:0:20439:0:(ldlm_extent.c:295:ldlm_check_contention()) contended locks = 1
00010000:00010000:4.0:1348679193.857926:0:20439:0:(ldlm_extent.c:295:ldlm_check_contention()) contended locks = 1
00010000:00010000:4.0:1348679193.857931:0:20439:0:(ldlm_lockd.c:809:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 3/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4ac0bfd expref: 14 pid: 20439 timeout 0
00010000:00010000:4.0:1348679193.857936:0:20439:0:(ldlm_lockd.c:468:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 4/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4ac0bfd expref: 14 pid: 20439 timeout 7832370900
00010000:00010000:4.0:1348679193.858049:0:20439:0:(ldlm_lockd.c:1288:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: filter-lc2-OST000c_UUID lock: ffff880248b78240/0x8691e72961eb80f5 lrc: 3/0,0 mode: --/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 remote: 0xf19c08e9d42c4364 expref: 18 pid: 20439 timeout 0
00010000:00010000:4.0:1348679193.858054:0:20439:0:(ldlm_lockd.c:1320:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff880248b78240, rc 0)
00000800:00000100:0.0:1348679234.951973:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:1.0:1348679234.951992:0:2591:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:2.0:1348679234.951999:0:2592:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:4.0:1348679234.952008:0:2594:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:0.0:1348679234.952008:0:2590:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:1.0:1348679234.952019:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:2.0:1348679234.952024:0:2587:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:1.0:1348679234.952034:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:1.0:1348679234.952043:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:1.0:1348679234.952052:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:2.0:1348679234.952057:0:2587:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00010000:00020000:4.1F:1348679293.957439:0:0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.15@o2ib2 ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 3/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4ac0bfd expref: 4 pid: 20439 timeou |
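A single lock can be followed through the whole debug log by grepping for its handle; a minimal sketch, with an illustrative log file name:

$ grep 0x8691e72961eb80ee /tmp/oss-dlmtrace.log

For the lock above, this should show the blocking AST being prepared, the lock being added to the waiting list with a 100s timeout, and the callback timer expiring without a cancel from the client.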
| Comment by Prakash Surya (Inactive) [ 26/Sep/12 ] |
|
Also, in case it's useful, I reproduced the issue with both '+dlmtrace' and '+rpctrace'. So: |
| Comment by Hongchao Zhang [ 15/Oct/12 ] |
|
From the debug log, the client had acquired the ldlm lock "ffff8806636c1b40/0xf19c08e9d42c4459" at OST0013 and was trying to get the ldlm lock |
| Comment by Hongchao Zhang [ 18/Oct/12 ] |
|
Status update: a candidate patch is being created and tested. |
| Comment by Prakash Surya (Inactive) [ 18/Oct/12 ] |
|
Good to hear. Thanks! |
| Comment by Hongchao Zhang [ 24/Oct/12 ] |
|
The patch is tracked at http://review.whamcloud.com/#change,4382 |
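For anyone wanting to build and test the change locally, a Gerrit change can usually be fetched as sketched below; the repository URL and the patch-set number are assumptions, not taken from this ticket:

$ git fetch git://git.whamcloud.com/fs/lustre-release.git refs/changes/82/4382/1
$ git checkout FETCH_HEAD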
| Comment by Hongchao Zhang [ 24/Oct/12 ] |
|
Hi Prakash, |
| Comment by Prakash Surya (Inactive) [ 25/Oct/12 ] |
|
Yes, the issue was reproducible. I'll try to allocate some time to get on the machine and gather the logs from all 4 OSS nodes for this filesystem. Any debug flags of special interest? |
| Comment by Hongchao Zhang [ 01/Nov/12 ] |
|
Hi, the same flags as before, +rpctrace and +dlmtrace. Thanks a lot! |
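For reference, a minimal sketch of collecting such a log with lctl (the output path is illustrative):

$ lctl set_param debug=+dlmtrace
$ lctl set_param debug=+rpctrace
$ lctl clear
  ... reproduce the problem ...
$ lctl dk /tmp/lustre-debug.log

lctl clear empties the kernel debug buffer before the run, and lctl dk dumps it to a file afterwards.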
| Comment by Hongchao Zhang [ 22/Jan/16 ] |
|
Hi Prakash, |
| Comment by Christopher Morrone [ 22/Jan/16 ] |
|
Prakash hasn't worked on Lustre in some time. This is probably another "Won't Fix" given its age. |
| Comment by John Fuchs-Chesney (Inactive) [ 23/Jan/16 ] |
|
Thanks Chris. |