[LU-2030] LDLM timeout resulting in client eviction and truncate system call to fail with EINTR Created: 25/Sep/12  Updated: 23/Jan/16  Resolved: 23/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Prakash Surya (Inactive) Assignee: Hongchao Zhang
Resolution: Won't Fix Votes: 0
Labels: llnl
Environment:

Lustre Client: 2.1.2-4chaos
Lustre Server: 2.1.2-3chaos


Attachments: File LU-2030-llogs-1.tar.gz     File LU-2030-llogs.tar.gz     File zwicky49-dlmtrace.llog    
Severity: 3
Rank (Obsolete): 4159

 Description   

I see errors reported from the truncate system call when running SIMUL on 2 nodes with 16 processes each (32 processes total).

The reproducer is:

$ salloc -N2 -n32
$ srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635

Specifically, I see truncate return EINTR when running the test under strace:

truncate("/p/lcrater2/surya1/simul-test/simul_truncate.0", 1024) = -1 EINTR (Interrupted system call)

The console logs from the clients and servers for a case where this was hit are below.

Test:

hype355@root:srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635 -f14 -l15
Simul is running with 32 process(es)
13:14:38: Set iteration 0
13:14:38: Running test #14(iter 0): creat, shared mode.
13:14:38: Running test #15(iter 0): truncate, shared mode.
13:18:00: Process 3(hype214): FAILED in simul_truncate, truncate failed: Interrupted system call

Client console messages:

# hype355 /root > tail -f /var/log/conman/console.hype21{3,4}
==> /var/log/conman/console.hype213 <==
2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
2012-09-25 13:16:19 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lc2-OST000c-osc-ffff880833b85400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880748182d40 (167630808/0/0/0) (rc: 1)
2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)

==> /var/log/conman/console.hype214 <==
2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
2012-09-25 13:18:00 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)

Server console messages:

> $ tail /var/log/conman/console.zwicky49
> ==> /var/log/conman/console.zwicky49 <==
> 2012-09-25 13:16:19 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.14@o2ib2  ns: filter-lc2-OST000c_UUID lock: ffff88028aee6d80/0x8691e72961eb7429 lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xf19c08e9d42c3468 expref: 4 pid: 4358 timeout 7757255590
> 2012-09-25 13:16:19 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.14@o2ib2 arrived at 1348604178 with bad export cookie 9696785637627868352
> 2012-09-25 13:16:19 Lustre: 24547:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from 7f3b19c3-f206-6f48-4758-3716d5199ca1@192.168.121.14@o2ib2 t0 exp (null) cur 1348604179 last 0
> 2012-09-25 13:17:59 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.15@o2ib2  ns: filter-lc2-OST000c_UUID lock: ffff8801d0235b40/0x8691e72961eb744c lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4abfe4a expref: 4 pid: 24547 timeout 7757356173
> 2012-09-25 13:18:00 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.15@o2ib2 arrived at 1348604279 with bad export cookie 9696785637609291136
> 2012-09-25 13:18:00 Lustre: 25184:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from e0aa3722-987c-e30c-5c60-102f582dddae@192.168.121.15@o2ib2 t0 exp (null) cur 1348604280 last 0
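
For reference, the "expired after 100s" in the eviction messages above is driven by the server-side LDLM and adaptive timeout settings. A quick way to check what the OSS is configured with, sketched against the 2.1-era /proc layout (paths may differ on other versions):

# On the OSS (e.g. zwicky49):
$ cat /proc/sys/lustre/ldlm_timeout                     # base LDLM callback timeout
$ cat /proc/sys/lustre/timeout                          # obd_timeout
$ cat /proc/sys/lustre/at_min /proc/sys/lustre/at_max   # adaptive timeout bounds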

The client version of lustre on the nodes:

hype355@root:pdsh -w hype[213-214] 'rpm -qa | grep lustre | sort' | dshbak -c
----------------
hype[213-214]
----------------
lustre-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
lustre-modules-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
lustre-tools-llnl-1.4-2.ch5.x86_64

The server version of lustre:

$ pdsh -w zwicky[48-49] 'rpm -qa | grep lustre | sort' | dshbak -c
----------------
zwicky[48-49]
----------------
lustre-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
lustre-modules-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
lustre-tools-llnl-1.4-2.ch5.x86_64


 Comments   
Comment by Peter Jones [ 25/Sep/12 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 26/Sep/12 ]

Is the debug log available? There was an extent lock held by a client for much longer than the server permitted, and the client was evicted. From the debug log we could find which lock was the culprit, which would help trace where the real problem is.
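
The usual lctl debug-buffer workflow should be enough to capture it; roughly (a sketch, the exact steps are up to you):

# On the affected server and clients:
$ lctl set_param debug=+dlmtrace      # add +rpctrace too if possible
$ lctl clear                          # empty the debug ring buffer
  ... reproduce with simul ...
$ lctl dk /tmp/LU-2030-dlmtrace.llog  # dump the buffer to a file and attach it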

Comment by Prakash Surya (Inactive) [ 26/Sep/12 ]

Here is a Lustre log file with '+dlmtrace' from the server on which the lock expired.

Comment by Prakash Surya (Inactive) [ 26/Sep/12 ]

Here are logs from the server and the client with '+dlmtrace'.

The client also dumped this to its console:

2012-09-26 10:08:14 LustreError: 11-0: lc2-OST000c-osc-ffff880831ce8800: Communicating with 10.1.1.49@o2ib9, operation ost_punch failed with -107.
2012-09-26 10:08:14 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
2012-09-26 10:08:14 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff88043156ddd8[2 3 0 1 1 00000000] W(2):[0, 18446744073709551615]@[0x1000c0000:0x9fdd7d9:0x0] {
2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast())     lovsub@ffff88043156e4e0: [0 ffff880773391748 W(2):[0, 18446744073709551615]@[0x20097c3a8:0x16a97:0x0]] 
2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast())     osc@ffff88043156fb50: ffff880199befb40 00120002 0xbd300c41b4ac0bfd 3 ffff8803dd5e9068 size: 1024 mtime: 1348679193 atime: 1348679193 ctime: 1348679193 blocks: 8
2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff88043156ddd8
2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
2012-09-26 10:08:14 LustreError: 79024:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lc2-OST000c-osc-ffff880831ce8800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff88082a0540b8[2 4 0 2 0 00000000] W(2):[0, 18446744073709551615]@[0x20097c3a8:0x16a97:0x0] {
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try())     vvp@ffff880773390420: 
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try())     lov@ffff880773391748: 2
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try())     0 0: ---
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try())     1 0: ---
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) 
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff88082a0540b8
2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -5
2012-09-26 10:08:14 LustreError: 79024:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880820256ac0 (167630809/0/0/0) (rc: 0)
2012-09-26 10:08:14 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
Comment by Prakash Surya (Inactive) [ 26/Sep/12 ]

Looking at the server logs, the RX errors from the LNet routers are a bit troubling. The clients and the servers have to go through a set of routers; 10.1.1.53@o2ib9 is the NID of one of the routers (a couple of basic connectivity checks are sketched after the log excerpt below):

00010000:00010000:4.0:1348679193.857904:0:20439:0:(ldlm_lockd.c:1069:ldlm_handle_enqueue0()) ### server-side enqueue handler START
00010000:00010000:4.0:1348679193.857909:0:20439:0:(ldlm_lockd.c:1154:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: filter-lc2-OST000c_UUID lock: ffff880248b78240/0x8691e72961eb80f5 lrc: 2/0,0 mode: --/PW res: 167630809/0 rrc: 2 type: EXT [0->0] (req 0->0) flags: 0x0 remote: 0xf19c08e9d42c4364 expref: -99 pid: 20439 timeout 0
00010000:00010000:4.0:1348679193.857921:0:20439:0:(ldlm_lock.c:626:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 2/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 remote: 0xbd300c41b4ac0bfd expref: 14 pid: 20439 timeout 0
00010000:00010000:4.0:1348679193.857925:0:20439:0:(ldlm_extent.c:295:ldlm_check_contention()) contended locks = 1
00010000:00010000:4.0:1348679193.857926:0:20439:0:(ldlm_extent.c:295:ldlm_check_contention()) contended locks = 1
00010000:00010000:4.0:1348679193.857931:0:20439:0:(ldlm_lockd.c:809:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 3/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4ac0bfd expref: 14 pid: 20439 timeout 0
00010000:00010000:4.0:1348679193.857936:0:20439:0:(ldlm_lockd.c:468:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 4/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4ac0bfd expref: 14 pid: 20439 timeout 7832370900
00010000:00010000:4.0:1348679193.858049:0:20439:0:(ldlm_lockd.c:1288:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: filter-lc2-OST000c_UUID lock: ffff880248b78240/0x8691e72961eb80f5 lrc: 3/0,0 mode: --/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 remote: 0xf19c08e9d42c4364 expref: 18 pid: 20439 timeout 0
00010000:00010000:4.0:1348679193.858054:0:20439:0:(ldlm_lockd.c:1320:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff880248b78240, rc 0)
00000800:00000100:0.0:1348679234.951973:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:1.0:1348679234.951992:0:2591:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:2.0:1348679234.951999:0:2592:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:4.0:1348679234.952008:0:2594:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:0.0:1348679234.952008:0:2590:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:1.0:1348679234.952019:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:2.0:1348679234.952024:0:2587:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:1.0:1348679234.952034:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:1.0:1348679234.952043:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:1.0:1348679234.952052:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00000800:00000100:2.0:1348679234.952057:0:2587:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
00010000:00020000:4.1F:1348679293.957439:0:0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.15@o2ib2  ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 3/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4ac0bfd expref: 4 pid: 20439 timeou
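
A couple of basic checks against that router NID from the OSS (a sketch; the proc paths are as on these 2.1-era nodes and may differ elsewhere):

$ lctl ping 10.1.1.53@o2ib9        # basic LNet reachability of the router
$ cat /proc/sys/lnet/routes        # configured routes and their up/down state
$ cat /proc/sys/lnet/peers         # per-peer credits/queue depth, useful for spotting stalls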
Comment by Prakash Surya (Inactive) [ 26/Sep/12 ]

Also, in case it's useful, I reproduced the issue with both '+dlmtrace' and '+rpctrace'. So:

  • LU-2030-llogs.tar.gz : Has '+dlmtrace'
  • LU-2030-llogs-1.tar.gz : Has '+dlmtrace' and '+rpctrace'
Comment by Hongchao Zhang [ 15/Oct/12 ]

In LU-2030-llogs-1 (+dlmtrace, +rpctrace), the "simul_truncate.0" operation (PID=112500) failed:
00000002:00010000:9.0:1348694770.130724:0:112500:0:(mdc_locks.c:918:mdc_intent_lock()) (name: simul_truncate.0,[0x20097c3a8:0x16a98:0x0]) in obj [0x20095bc20:0x114c:0x0], intent: getattr flags 00

It had acquired the LDLM lock "ffff8806636c1b40/0xf19c08e9d42c4459" on OST0013 and was trying to get the LDLM lock
"ffff8807ff4d3d80/0xf19c08e9d42c4460" on OST0010, which took more than 100s, and then the eviction occurred.

Comment by Hongchao Zhang [ 18/Oct/12 ]

Status update: a possible patch is being created and tested.

Comment by Prakash Surya (Inactive) [ 18/Oct/12 ]

Good to hear. Thanks!

Comment by Hongchao Zhang [ 24/Oct/12 ]

the patch is tracked at http://review.whamcloud.com/#change,4382

Comment by Hongchao Zhang [ 24/Oct/12 ]

Hi Prakash,
Is the issue reproducible? There is still one more open question in the logs: why did OST0010 take so much time to enqueue this glimpse lock?
Is the log of the node hosting lc2-OST0010 available?

Comment by Prakash Surya (Inactive) [ 25/Oct/12 ]

Yes, the issue was reproducible. I'll try to allocate some time to get on the machine and gather the logs from all 4 OSS nodes for this filesystem. Any debug flags of special interest?

Comment by Hongchao Zhang [ 01/Nov/12 ]

Hi, the same flags as before: +rpctrace and +dlmtrace. Thanks a lot!

Comment by Hongchao Zhang [ 22/Jan/16 ]

Hi Prakash,
Do you need any more work done on this ticket? Or can we close it?
Thanks

Comment by Christopher Morrone [ 22/Jan/16 ]

Prakash hasn't worked on Lustre in some time. This is probably another "Won't Fix" given its age.

Comment by John Fuchs-Chesney (Inactive) [ 23/Jan/16 ]

Thanks Chris.
