[LU-6112] osc_ldlm_completion_ast() wrong error path handler Created: 12/Jan/15 Updated: 17/Jun/15 Resolved: 17/Jun/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.1, Lustre 2.5.2, Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | WC Triage |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | patch |
| Severity: | 3 |
| Rank (Obsolete): | 17020 |
| Description |
|
A system about to enter acceptance is experiencing unexpected client evictions following the power down of an unrelated client. During cleanup from the eviction, the client reports:
> 2014-11-02T01:36:06.015880-05:00 c1-0c0s2n3 LustreError: 11-0: snx11023-OST0006-osc-ffff880850703800: Communicating with 10.149.4.6@o2ib, operation obd_ping failed with -107.
> 2014-11-02T01:36:06.015921-05:00 c1-0c0s2n3 Lustre: snx11023-OST0006-osc-ffff880850703800: Connection to snx11023-OST0006 (at 10.149.4.6@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> 2014-11-02T01:36:06.103816-05:00 c1-0c0s2n3 LustreError: 167-0: snx11023-OST0006-osc-ffff880850703800: This client was evicted by snx11023-OST0006; in progress operations using this service will fail.
> 2014-11-02T01:36:06.133063-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:806:ldlm_resource_complain()) snx11023-OST0006-osc-ffff880850703800: namespace resource [0xb428979:0x0:0x0].0 (ffff8805c02f0a80) refcount nonzero (1) after lock cleanup; forcing cleanup.
> 2014-11-02T01:36:06.160248-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1445:ldlm_resource_dump()) --- Resource: [0xb428979:0x0:0x0].0 (ffff8805c02f0a80) refcount = 2
> 2014-11-02T01:36:06.160266-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1448:ldlm_resource_dump()) Granted locks (in reverse order):
> 2014-11-02T01:36:06.210835-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1451:ldlm_resource_dump()) ### ### ns: snx11023-OST0006-osc-ffff880850703800 lock: ffff8806e69fd2c0/0x3e4e9540570d857e lrc: 2/0,1 mode: PW/PW res: [0xb428979:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x506400020000 nid: local remote: 0xc6a482c115c30a13 expref: -99 pid: 23818 timeout: 0 lvb_type: 1
> 2014-11-02T01:36:06.250360-05:00 c1-0c0s2n3 Lustre: snx11023-OST0006-osc-ffff880850703800: Connection restored to snx11023-OST0006 (at 10.149.4.6@o2ib)

The server-side eviction is:
> Nov 2 01:35:50 snx11023n005 kernel: LustreError: 0:0:(ldlm_lockd.c:363:waiting_locks_callback()) ### lock callback timer expired after 180s: evicting client at 203@gni1 ns: filter-snx11023-OST0006_UUID lock: ffff8802f3f5c480/0xc6a482c115c30a13 lrc: 3/0,0 mode: PW/PW res: 188909945/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x4000000000020 remote: 0x3e4e9540570d857e expref: 4 pid: 78080 timeout 4526536428

From a different debug log:
00000020:00000001:5.0:1418783943.976679:0:8277:0:(cl_lock.c:1242:cl_enqueue_try()) Process leaving (rc=1 : 1 : 1)
00000020:00000001:5.0:1418783943.976680:0:8277:0:(cl_lock.c:962:cl_lock_state_wait()) Process entered
00000020:00010000:5.0:1418783943.976680:0:8277:0:(cl_lock.c:151:cl_lock_trace0()) state wait lock: ffff8807de9e1738@(3 ffff880833626800 1 1 0 1 1 0)(ffff880ff77259c8/0/1) at cl_lock_state_wait():968
00000020:00000001:5.0:1418783943.976682:0:8277:0:(cl_lock.c:151:cl_lock_trace0()) put mutex: ffff8807de9e1738@(3 ffff880833626800 1 1 0 1 1 0)(ffff880ff77259c8/0/1) at cl_lock_mutex_put():755
00000020:00000001:5.0:1418783943.983353:0:8277:0:(cl_lock.c:151:cl_lock_trace0()) got mutex: ffff8807de9e1738@(3 ffff880833626800 1 1 0 1 1 0)(ffff880ff77259c8/0/1) at cl_lock_mutex_tail():663
00000020:00000001:5.0:1418783943.983355:0:8277:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)

While Lustre was waiting for the lock grant, the process received a signal. cl_lock_state_wait() failed with -ERESTARTSYS (the rc=-512 above), which caused the osc_lock to be detached. The end result is a lock with one waiter left behind, a client eviction, and a nonzero resource refcount after lock cleanup.
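Below is a minimal, stand-alone C sketch (not Lustre code) of the failure shape described above: a waiter interrupted by a signal bails out with -ERESTARTSYS, and if the error path detaches the waiter without dropping the reference it took before waiting, the refcount stays nonzero at cleanup time. Every name in the sketch (toy_lock, toy_enqueue, toy_state_wait, the buggy flag) is hypothetical; it only models the shape of the problem, not the real osc_ldlm_completion_ast()/cl_lock code paths.

```c
#include <stdio.h>
#include <stdbool.h>

/* ERESTARTSYS is a kernel-internal errno value (512) and is not exposed
 * by the userspace <errno.h>, so define it here for the sketch. Note the
 * rc=-512 in the debug trace above. */
#define ERESTARTSYS 512

struct toy_lock {
	int  refcount;   /* models the DLM resource/lock refcount */
	bool granted;    /* models the lock grant arriving */
};

static void toy_lock_get(struct toy_lock *lk) { lk->refcount++; }
static void toy_lock_put(struct toy_lock *lk) { lk->refcount--; }

/* Models the wait for the grant: either the grant arrives, or the waiter
 * is interrupted by a signal and returns -ERESTARTSYS. */
static int toy_state_wait(struct toy_lock *lk, bool signal_pending)
{
	if (signal_pending)
		return -ERESTARTSYS;
	lk->granted = true;
	return 0;
}

/* Models the enqueue path. With 'buggy' set, the -ERESTARTSYS branch
 * detaches the waiter but never drops the reference taken before the
 * wait, which is the shape of the leak reported in this ticket. */
static int toy_enqueue(struct toy_lock *lk, bool signal_pending, bool buggy)
{
	int rc;

	toy_lock_get(lk);
	rc = toy_state_wait(lk, signal_pending);
	if (rc < 0) {
		if (!buggy)
			toy_lock_put(lk);  /* correct error path */
		return rc;                 /* buggy path leaks the reference */
	}
	toy_lock_put(lk);
	return 0;
}

int main(void)
{
	struct toy_lock lk = { .refcount = 0, .granted = false };

	toy_enqueue(&lk, /* signal_pending = */ true, /* buggy = */ true);
	printf("buggy error path, refcount after cleanup: %d\n", lk.refcount);

	lk.refcount = 0;
	toy_enqueue(&lk, true, /* buggy = */ false);
	printf("fixed error path, refcount after cleanup: %d\n", lk.refcount);
	return 0;
}
```

In this model the only difference between the two runs is whether the interrupted-wait branch releases the reference before returning the error; the buggy run ends with refcount 1, matching the "refcount nonzero (1) after lock cleanup" complaint in the client log.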
| Comments |
| Comment by Gerrit Updater [ 12/Jan/15 ] |
|
Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: http://review.whamcloud.com/13346 |
| Comment by Alexander Boyko [ 12/Jan/15 ] |
| Comment by Cory Spitz [ 16/Jan/15 ] |
|
We don't need a master patch for this bug then, right? |
| Comment by Alexander Boyko [ 02/Mar/15 ] |
|
yes |
| Comment by Ann Koehler (Inactive) [ 17/Jun/15 ] |
|
This bug can be closed since the patch is not needed in master. The patch does resolve the issue in 2.5. |
| Comment by Peter Jones [ 17/Jun/15 ] |
|
Thanks Ann - it's good to reduce clutter where we can |