[LU-6112] osc_ldlm_completion_ast() wrong error path handler Created: 12/Jan/15  Updated: 17/Jun/15  Resolved: 17/Jun/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1, Lustre 2.5.2, Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 17020

 Description   

A system about to enter acceptance is experiencing unexpected client evictions following the power down of an unrelated client. During cleanup from the eviction, the client reports:

> 2014-11-02T01:36:06.015880-05:00 c1-0c0s2n3 LustreError: 11-0: snx11023-OST0006-osc-ffff880850703800: Communicating with 10.149.4.6@o2ib, operation obd_ping failed with -107.
> 2014-11-02T01:36:06.015921-05:00 c1-0c0s2n3 Lustre: snx11023-OST0006-osc-ffff880850703800: Connection to snx11023-OST0006 (at 10.149.4.6@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> 2014-11-02T01:36:06.103816-05:00 c1-0c0s2n3 LustreError: 167-0: snx11023-OST0006-osc-ffff880850703800: This client was evicted by snx11023-OST0006; in progress operations using this service will fail.
> 2014-11-02T01:36:06.133063-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:806:ldlm_resource_complain()) snx11023-OST0006-osc-ffff880850703800: namespace resource [0xb428979:0x0:0x0].0 (ffff8805c02f0a80) refcount nonzero (1) after lock cleanup; forcing cleanup.
> 2014-11-02T01:36:06.160248-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1445:ldlm_resource_dump()) --- Resource: [0xb428979:0x0:0x0].0 (ffff8805c02f0a80) refcount = 2
> 2014-11-02T01:36:06.160266-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1448:ldlm_resource_dump()) Granted locks (in reverse order):
> 2014-11-02T01:36:06.210835-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1451:ldlm_resource_dump()) ### ### ns: snx11023-OST0006-osc-ffff880850703800 lock: ffff8806e69fd2c0/0x3e4e9540570d857e lrc: 2/0,1 mode: PW/PW res: [0xb428979:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x506400020000 nid: local remote: 0xc6a482c115c30a13 expref: -99 pid: 23818 timeout: 0 lvb_type: 1
> 2014-11-02T01:36:06.250360-05:00 c1-0c0s2n3 Lustre: snx11023-OST0006-osc-ffff880850703800: Connection restored to snx11023-OST0006 (at 10.149.4.6@o2ib)

The server eviction is:

> Nov  2 01:35:50 snx11023n005 kernel: LustreError: 0:0:(ldlm_lockd.c:363:waiting_locks_callback()) ### lock callback timer expired after 180s: evicting client at 203@gni1  ns: filter-snx11023-OST0006_UUID lock: ffff8802f3f5c480/0xc6a482c115c30a13 lrc: 3/0,0 mode: PW/PW res: 188909945/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x4000000000020 remote: 0x3e4e9540570d857e expref: 4 pid: 78080 timeout 4526536428

From a different set of client debug logs:

00000020:00000001:5.0:1418783943.976679:0:8277:0:(cl_lock.c:1242:cl_enqueue_try()) Process leaving (rc=1 : 1 : 1)
00000020:00000001:5.0:1418783943.976680:0:8277:0:(cl_lock.c:962:cl_lock_state_wait()) Process entered
00000020:00010000:5.0:1418783943.976680:0:8277:0:(cl_lock.c:151:cl_lock_trace0()) state wait lock: ffff8807de9e1738@(3 ffff880833626800 1 1 0 1 1 0)(ffff880ff77259c8/0/1) at cl_lock_state_wait():968
00000020:00000001:5.0:1418783943.976682:0:8277:0:(cl_lock.c:151:cl_lock_trace0()) put mutex: ffff8807de9e1738@(3 ffff880833626800 1 1 0 1 1 0)(ffff880ff77259c8/0/1) at cl_lock_mutex_put():755
00000020:00000001:5.0:1418783943.983353:0:8277:0:(cl_lock.c:151:cl_lock_trace0()) got mutex: ffff8807de9e1738@(3 ffff880833626800 1 1 0 1 1 0)(ffff880ff77259c8/0/1) at cl_lock_mutex_tail():663
00000020:00000001:5.0:1418783943.983355:0:8277:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
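
The rc in the last line is simply -ERESTARTSYS printed three ways (unsigned decimal, signed decimal, hex). A quick standalone check confirms it (userspace C, assuming a 64-bit long, with the kernel-internal ERESTARTSYS value of 512 defined by hand since it is not in userspace errno.h):

#include <stdio.h>

#define ERESTARTSYS 512     /* kernel-internal errno, not exposed in userspace <errno.h> */

int main(void)
{
    long rc = -ERESTARTSYS;

    /* Prints: rc=18446744073709551104 : -512 : fffffffffffffe00 */
    printf("rc=%lu : %ld : %lx\n", (unsigned long)rc, rc, (unsigned long)rc);
    return 0;
}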

While Lustre was waiting for the lock grant, the process received a signal. cl_lock_state_wait() failed with -ERESTARTSYS, which caused the osc_lock to be detached. The end result: one leftover waiter, a client eviction, and a nonzero resource refcount.
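
The fix (http://review.whamcloud.com/13346, "LU-6112 ldlm: wakeup ldlm waiters when error happened") wakes the LDLM waiters when an error happens, so they can see the failure and drop their references. As a rough illustration of why that wakeup matters, here is a minimal userspace model with hypothetical names (not the Lustre code): if the error branch of the completion path returns without waking waiters, the waiter's reference is never dropped and cleanup sees a nonzero refcount.

/* Minimal userspace model of the failure pattern (hypothetical names, not
 * the Lustre code): a waiter holds a reference until it is woken and sees
 * the lock's final state.  If the completion path's error branch returns
 * without waking waiters, that reference is never dropped and a later
 * cleanup pass finds refcount != 0 -- the symptom in this ticket. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct model_lock {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    bool granted;   /* completion succeeded */
    bool failed;    /* completion hit an error (e.g. interrupted enqueue) */
    int  refcount;  /* references held by the enqueuer and the waiter */
};

static void *waiter(void *arg)
{
    struct model_lock *lk = arg;

    pthread_mutex_lock(&lk->mtx);
    while (!lk->granted && !lk->failed)
        pthread_cond_wait(&lk->cv, &lk->mtx);
    lk->refcount--;                 /* drop the waiter's reference */
    pthread_mutex_unlock(&lk->mtx);
    return NULL;
}

/* Completion path: the error branch must wake waiters just like the
 * success branch, otherwise the waiter above sleeps forever and its
 * reference leaks. */
static void completion(struct model_lock *lk, int rc)
{
    pthread_mutex_lock(&lk->mtx);
    if (rc == 0)
        lk->granted = true;
    else
        lk->failed = true;
    pthread_cond_broadcast(&lk->cv);    /* waking waiters on error is what the patch's subject describes */
    pthread_mutex_unlock(&lk->mtx);
}

int main(void)
{
    struct model_lock lk = {
        .mtx = PTHREAD_MUTEX_INITIALIZER,
        .cv  = PTHREAD_COND_INITIALIZER,
        .refcount = 2,              /* enqueuer + waiter */
    };
    pthread_t t;

    pthread_create(&t, NULL, waiter, &lk);
    sleep(1);                       /* let the waiter block */
    completion(&lk, -512);          /* model the interrupted (-ERESTARTSYS) case */
    pthread_join(&t, NULL);

    pthread_mutex_lock(&lk.mtx);
    lk.refcount--;                  /* enqueuer drops its own reference */
    printf("refcount after cleanup: %d\n", lk.refcount);
    pthread_mutex_unlock(&lk.mtx);
    return 0;
}

Removing the broadcast from the error branch of this model reproduces the symptom in miniature: the waiter never wakes, and the refcount never reaches zero.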



 Comments   
Comment by Gerrit Updater [ 12/Jan/15 ]

Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: http://review.whamcloud.com/13346
Subject: LU-6112 ldlm: wakeup ldlm waiters when error happened
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 46b01a78d5c532d4cf53ae40d36e9898f6ff3806

Comment by Alexander Boyko [ 12/Jan/15 ]

patch http://review.whamcloud.com/13346

Comment by Cory Spitz [ 16/Jan/15 ]

We don't need a master patch for this bug then, right?

Comment by Alexander Boyko [ 02/Mar/15 ]

yes

Comment by Ann Koehler (Inactive) [ 17/Jun/15 ]

This bug can be closed since the patch is not needed in master. The patch does resolve the issue in 2.5.

Comment by Peter Jones [ 17/Jun/15 ]

Thanks Ann - it's good to reduce clutter where we can
