Details
-
Bug
-
Resolution: Won't Fix
-
Minor
-
None
-
Lustre 2.1.2
-
Lustre Client: 2.1.2-4chaos
Lustre Server: 2.1.2-3chaos
-
3
-
4159
Description
I see errors reported from the truncate system call when running SIMUL on 2 nodes with 16 processes each (32 processes total).
The reproducer is:
$ salloc -N2 -n32 $ srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635
Specifically, I see truncate return EINTR when running the test under strace:
truncate("/p/lcrater2/surya1/simul-test/simul_truncate.0", 1024) = -1 EINTR (Interrupted system call)
The console logs from the clients and servers from a case where it hit this is below.
Test:
hype355@root:srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635 -f14 -l15 Simul is running with 32 process(es) 13:14:38: Set iteration 0 13:14:38: Running test #14(iter 0): creat, shared mode. 13:14:38: Running test #15(iter 0): truncate, shared mode. 13:18:00: Process 3(hype214): FAILED in simul_truncate, truncate failed: Interrupted system call
Client console messages:
# hype355 /root > tail -f /var/log/conman/console.hype21{3,4} ==> /var/log/conman/console.hype213 <== 2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete 2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway 2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107 2012-09-25 13:16:19 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail. 2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lc2-OST000c-osc-ffff880833b85400 resource refcount nonzero (1) after lock cleanup; forcing cleanup. 2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880748182d40 (167630808/0/0/0) (rc: 1) 2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9) ==> /var/log/conman/console.hype214 <== 2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete 2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway 2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107 2012-09-25 13:18:00 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail. 2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
Server console messages:
> $ tail /var/log/conman/console.zwicky49 > ==> /var/log/conman/console.zwicky49 <== > 2012-09-25 13:16:19 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.14@o2ib2 ns: filter-lc2-OST000c_UUID lock: ffff88028aee6d80/0x8691e72961eb7429 lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xf19c08e9d42c3468 expref: 4 pid: 4358 timeout 7757255590 > 2012-09-25 13:16:19 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.14@o2ib2 arrived at 1348604178 with bad export cookie 9696785637627868352 > 2012-09-25 13:16:19 Lustre: 24547:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from 7f3b19c3-f206-6f48-4758-3716d5199ca1@192.168.121.14@o2ib2 t0 exp (null) cur 1348604179 last 0 > 2012-09-25 13:17:59 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.15@o2ib2 ns: filter-lc2-OST000c_UUID lock: ffff8801d0235b40/0x8691e72961eb744c lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4abfe4a expref: 4 pid: 24547 timeout 7757356173 > 2012-09-25 13:18:00 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.15@o2ib2 arrived at 1348604279 with bad export cookie 9696785637609291136 > 2012-09-25 13:18:00 Lustre: 25184:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from e0aa3722-987c-e30c-5c60-102f582dddae@192.168.121.15@o2ib2 t0 exp (null) cur 1348604280 last 0
The client version of lustre on the nodes:
hype355@root:pdsh -w hype[213-214] 'rpm -qa | grep lustre | sort' | dshbak -c ---------------- hype[213-214] ---------------- lustre-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64 lustre-modules-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64 lustre-tools-llnl-1.4-2.ch5.x86_64
The server version of lustre:
$ pdsh -w zwicky[48-49] 'rpm -qa | grep lustre | sort' | dshbak -c ---------------- zwicky[48-49] ---------------- lustre-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64 lustre-modules-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64 lustre-tools-llnl-1.4-2.ch5.x86_64
Thanks Chris.