Details
-
Bug
-
Resolution: Won't Fix
-
Major
-
None
-
Lustre 2.5.1, Lustre 2.5.2, Lustre 2.5.3
-
3
-
17020
Description
A system about to enter acceptance is experiencing unexpected client evictions following the power down of an unrelated client. During cleanup from the eviction, the client reports:
> 2014-11-02T01:36:06.015880-05:00 c1-0c0s2n3 LustreError: 11-0: snx11023-OST0006-osc-ffff880850703800: Communicating with 10.149.4.6@o2ib, operation obd_ping failed with -107. > 2014-11-02T01:36:06.015921-05:00 c1-0c0s2n3 Lustre: snx11023-OST0006-osc-ffff880850703800: Connection to snx11023-OST0006 (at 10.149.4.6@o2ib) was lost; in progress operations using this service will wait for recovery to complete > 2014-11-02T01:36:06.103816-05:00 c1-0c0s2n3 LustreError: 167-0: snx11023-OST0006-osc-ffff880850703800: This client was evicted by snx11023-OST0006; in progress operations using this service will fail. > 2014-11-02T01:36:06.133063-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:806:ldlm_resource_complain()) snx11023-OST0006-osc-ffff880850703800: namespace resource [0xb428979:0x0:0x0].0 (ffff8805c02f0a80) refcount nonzero (1) after lock cleanup; forcing cleanup. > 2014-11-02T01:36:06.160248-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1445:ldlm_resource_dump()) --- Resource: [0xb428979:0x0:0x0].0 (ffff8805c02f0a80) refcount = 2 > 2014-11-02T01:36:06.160266-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1448:ldlm_resource_dump()) Granted locks (in reverse order): > 2014-11-02T01:36:06.210835-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1451:ldlm_resource_dump()) ### ### ns: snx11023-OST0006-osc-ffff880850703800 lock: ffff8806e69fd2c0/0x3e4e9540570d857e lrc: 2/0,1 mode: PW/PW res: [0xb428979:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x506400020000 nid: local remote: 0xc6a482c115c30a13 expref: -99 pid: 23818 timeout: 0 lvb_type: 1 > 2014-11-02T01:36:06.250360-05:00 c1-0c0s2n3 Lustre: snx11023-OST0006-osc-ffff880850703800: Connection restored to snx11023-OST0006 (at 10.149.4.6@o2ib) }> 2014-11-02T01:36:06.015880-05:00 c1-0c0s2n3 LustreError: 11-0: snx11023-OST0006-osc-ffff880850703800: Communicating with 10.149.4.6@o2ib, operation obd_ping failed with -107. > 2014-11-02T01:36:06.015921-05:00 c1-0c0s2n3 Lustre: snx11023-OST0006-osc-ffff880850703800: Connection to snx11023-OST0006 (at 10.149.4.6@o2ib) was lost; in progress operations using this service will wait for recovery to complete > 2014-11-02T01:36:06.103816-05:00 c1-0c0s2n3 LustreError: 167-0: snx11023-OST0006-osc-ffff880850703800: This client was evicted by snx11023-OST0006; in progress operations using this service will fail. > 2014-11-02T01:36:06.133063-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:806:ldlm_resource_complain()) snx11023-OST0006-osc-ffff880850703800: namespace resource [0xb428979:0x0:0x0].0 (ffff8805c02f0a80) refcount nonzero (1) after lock cleanup; forcing cleanup. > 2014-11-02T01:36:06.160248-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1445:ldlm_resource_dump()) --- Resource: [0xb428979:0x0:0x0].0 (ffff8805c02f0a80) refcount = 2 > 2014-11-02T01:36:06.160266-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1448:ldlm_resource_dump()) Granted locks (in reverse order): > 2014-11-02T01:36:06.210835-05:00 c1-0c0s2n3 LustreError: 23831:0:(ldlm_resource.c:1451:ldlm_resource_dump()) ### ### ns: snx11023-OST0006-osc-ffff880850703800 lock: ffff8806e69fd2c0/0x3e4e9540570d857e lrc: 2/0,1 mode: PW/PW res: [0xb428979:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x506400020000 nid: local remote: 0xc6a482c115c30a13 expref: -99 pid: 23818 timeout: 0 lvb_type: 1 > 2014-11-02T01:36:06.250360-05:00 c1-0c0s2n3 Lustre: snx11023-OST0006-osc-ffff880850703800: Connection restored to snx11023-OST0006 (at 10.149.4.6@o2ib)
The server eviction is:
> Nov 2 01:35:50 snx11023n005 kernel: LustreError: 0:0:(ldlm_lockd.c:363:waiting_locks_callback()) ### lock callback timer expired after 180s: evicting client at 203@gni1 ns: filter-snx11023-OST0006_UUID lock: ffff8802f3f5c480/0xc6a482c115c30a13 lrc: 3/0,0 mode: PW/PW res: 188909945/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x4000000000020 remote: 0x3e4e9540570d857e expref: 4 pid: 78080 timeout 4526536428
From different logs
00000020:00000001:5.0:1418783943.976679:0:8277:0:(cl_lock.c:1242:cl_enqueue_try()) Process leaving (rc=1 : 1 : 1) 00000020:00000001:5.0:1418783943.976680:0:8277:0:(cl_lock.c:962:cl_lock_state_wait()) Process entered 00000020:00010000:5.0:1418783943.976680:0:8277:0:(cl_lock.c:151:cl_lock_trace0()) state wait lock: ffff8807de9e1738@(3 ffff880833626800 1 1 0 1 1 0)(ffff880ff77259c8/0/1) at cl_lock_state_wait():968 00000020:00000001:5.0:1418783943.976682:0:8277:0:(cl_lock.c:151:cl_lock_trace0()) put mutex: ffff8807de9e1738@(3 ffff880833626800 1 1 0 1 1 0)(ffff880ff77259c8/0/1) at cl_lock_mutex_put():755 00000020:00000001:5.0:1418783943.983353:0:8277:0:(cl_lock.c:151:cl_lock_trace0()) got mutex: ffff8807de9e1738@(3 ffff880833626800 1 1 0 1 1 0)(ffff880ff77259c8/0/1) at cl_lock_mutex_tail():663 00000020:00000001:5.0:1418783943.983355:0:8277:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
While Lustre is waiting lock grant, process got signal. cl_lock_state_wait() failed with ERESTARTSYS, and this cause detaches osc_lock. And after that result is - one waiters, one eviction, and refcount nonzero.