[LU-4381] clio deadlock from truncate Created: 12/Dec/13  Updated: 05/Sep/14  Resolved: 09/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug
Priority: Major
Reporter: John Hammond
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: clio, mn4

Issue Links:
Duplicate
duplicates LU-4495 client evicted on parallel append wri... Closed
Severity: 3
Rank (Obsolete): 12008

Description

If I run the following:

# export OSTCOUNT=6
# export MOUNT_2=y
# ./lustre/tests/llmount.sh
# lfs setstripe -c 6 /mnt/lustre/f0
# 
# (while true; do echo Hi > /mnt/lustre/f0; done) &
# (while true; do echo Bye > /mnt/lustre2/f0; done) &

Then, within a second of starting, both child tasks get stuck in cl_lock_state_wait():

[<ffffffffa045cb75>] cl_lock_state_wait+0x1b5/0x320 [obdclass]
[<ffffffffa045d35b>] cl_enqueue_locked+0x15b/0x1f0 [obdclass]
[<ffffffffa045debe>] cl_lock_request+0x7e/0x270 [obdclass]
[<ffffffffa0462e4c>] cl_io_lock+0x3cc/0x560 [obdclass]
[<ffffffffa0463082>] cl_io_loop+0xa2/0x1b0 [obdclass]
[<ffffffffa0dcabe8>] cl_setattr_ost+0x218/0x2f0 [lustre]
[<ffffffffa0d96145>] ll_setattr_raw+0xa45/0x10c0 [lustre]
[<ffffffffa0d9681d>] ll_setattr+0x5d/0xf0 [lustre]
[<ffffffff811a0048>] notify_change+0x168/0x340
[<ffffffff81180ad4>] do_truncate+0x64/0xa0
[<ffffffff811949e1>] do_filp_open+0x851/0xdc0
[<ffffffff8117f849>] do_sys_open+0x69/0x140
[<ffffffff8117f960>] sys_open+0x20/0x30
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
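
To confirm which tasks are blocked in the same place, one option is to scan the per-task kernel stacks for the function name. The following is a minimal sketch, not part of the original report: the helper name find_stuck_tasks and its optional PROC_ROOT argument are hypothetical, and it assumes a kernel that exposes /proc/<pid>/stack (reading it normally requires root).

```shell
# Hypothetical helper: print the PIDs whose kernel stack mentions FUNCTION.
# Usage: find_stuck_tasks FUNCTION [PROC_ROOT]
# PROC_ROOT defaults to /proc; it is a parameter only so the helper can be
# pointed at a saved copy of the stack files.
find_stuck_tasks() {
    func=$1
    proc_root=${2:-/proc}
    for stack in "$proc_root"/[0-9]*/stack; do
        # Skip unreadable entries and tasks that exited during the scan.
        [ -r "$stack" ] || continue
        if grep -q -- "$func" "$stack" 2>/dev/null; then
            pid=${stack%/stack}
            echo "${pid##*/}"
        fi
    done
}
```

Run as root on the client, `find_stuck_tasks cl_lock_state_wait` would be expected to list both writer subshells while the hang lasts.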

They stay stuck there until one client gets evicted by an OST:

LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 0@lo  ns: filter-lustre-OST0002_UUID lock: ffff880217559100/0xb06606e6f58bd625 lrc: 3/0,0 mode: PW/PW res: [0x2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 0@lo remote: 0xb06606e6f58bd61e expref: 4 pid: 14479 timeout: 4300627190 lvb_type: 0
LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 0@lo  ns: filter-lustre-OST0004_UUID lock: ffff88019996f9c0/0xb06606e6f58bd58b lrc: 3/0,0 mode: PW/PW res: [0x2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 0@lo remote: 0xb06606e6f58bd584 expref: 4 pid: 13781 timeout: 4300627191 lvb_type: 0
LustreError: 11-0: lustre-OST0002-osc-ffff8801a13eb800: Communicating with 0@lo, operation obd_ping failed with -107.
Lustre: lustre-OST0004-osc-ffff88019e033800: Connection to lustre-OST0004 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: lustre-OST0004-osc-ffff88019e033800: This client was evicted by lustre-OST0004; in progress operations using this service will fail.
LustreError: 16413:0:(ldlm_resource.c:815:ldlm_resource_complain()) lustre-OST0004-osc-ffff88019e033800: namespace resource [0x2:0x0:0x0].0 (ffff8801a86f6980) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 16413:0:(ldlm_resource.c:1454:ldlm_resource_dump()) --- Resource: [0x2:0x0:0x0].0 (ffff8801a86f6980) refcount = 2
Lustre: lustre-OST0004-osc-ffff88019e033800: Connection restored to lustre-OST0004 (at 0@lo)
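
The eviction messages above are long, so a one-line summary per eviction can make it easier to see which OST namespaces timed out and for how long. This is a rough sketch that assumes the message layout shown in this log; the function name summarize_evictions is hypothetical, and the pattern may need adjusting for other Lustre versions.

```shell
# Hypothetical helper: reduce "lock callback timer expired" lines to
#   <namespace> client=<nid> waited=<seconds>s mode=<granted/requested>
# Reads the named files, or standard input when no file is given.
summarize_evictions() {
    sed -n 's/.*lock callback timer expired after \([0-9]*\)s: evicting client at \([^ ]*\) .*ns: \([^ ]*\) .*mode: \([^ ]*\) .*/\3 client=\2 waited=\1s mode=\4/p' "$@"
}
```

For example, `dmesg | summarize_evictions` prints one summary line for each matching eviction message and ignores everything else.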


Comments
Comment by John Hammond [ 13/Dec/13 ]

I should have mentioned that I don't see this hang when the file is created with stripe count 1.

Comment by Andreas Dilger [ 18/Dec/13 ]

Jinshan, Bobijam, would this problem be fixed by the cl_lock removal?

Comment by Jinshan Xiong (Inactive) [ 19/Dec/13 ]

Very likely. I'll keep in mind to run this test case with the new cl_lock implementation.

Comment by Patrick Farrell (Inactive) [ 08/Apr/14 ]

With this patch:

http://review.whamcloud.com/#/c/9152/

landed, can this bug be closed?

Also, LU-4495 is a duplicate of this bug. I originally thought there was a difference between the two, but they seem to be exactly the same.

Comment by Jinshan Xiong (Inactive) [ 09/Apr/14 ]

Indeed. Closed.

Comment by Bob Glossman (Inactive) [ 17/Apr/14 ]

Backport to b2_5:
http://review.whamcloud.com/9994

Comment by Bruno Faccini (Inactive) [ 05/Sep/14 ]

backport to b2_4 is at http://review.whamcloud.com/11632.
