[LU-2271] recovery-small test 10 does not properly reconnect Created: 03/Nov/12 Updated: 20/Nov/12 Resolved: 20/Nov/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Nathaniel Clark |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | NFBlocker | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 5430 | ||||||||
| Description |
|
It appears that recovery-small test 10 can cause this client to be evicted not only from the MDS but also from all OSTs:

[15398.819135] Lustre: DEBUG MARKER: == recovery-small test 10: finish request on server after client eviction (bug 1521) == 00:21:04 (1351916464)
[15398.902152] Lustre: *** cfs_fail_loc=305, val=0***
[15398.904516] Lustre: *** cfs_fail_loc=305, val=0***
[15399.574089] Lustre: *** cfs_fail_loc=305, val=0***
[15399.575254] Lustre: Skipped 2 previous similar messages
[15406.572138] Lustre: 21155:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1351916465/real 1351916465] req@ffff8801acc1cbf0 x1417586946343301/t0(0) o104->lustre-OST0000@0@lo:15/16 lens 296/224 e 0 to 1 dl 1351916472 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[15406.572324] LustreError: 138-a: lustre-OST0000: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[15406.578774] Lustre: 21155:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[15406.580254] LustreError: 21155:0:(ldlm_lockd.c:684:ldlm_handle_ast_error()) ### client (nid 0@lo) returned 0 from blocking AST ns: filter-ffff88011e018000 lock: ffff880203b29db8/0xb21738563fa1aa1f lrc: 1/0,0 mode: --/PW res: 4/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->2097151) flags: 0x200100a0 nid: 0@lo remote: 0xb21738563fa1aa18 expref: 2 pid: 21155 timeout 4298779689
[15409.904300] LustreError: 138-a: lustre-MDT0000: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[15409.910598] LustreError: Skipped 1 previous similar message
[15410.061266] LustreError: 21616:0:(mdt_handler.c:3031:mdt_recovery()) operation 101 on unconnected MDS from 12345-0@lo
[15410.063011] LustreError: 11-0: an error occurred while communicating with 0@lo. The ldlm_enqueue operation failed with -107
[15410.066075] LustreError: Skipped 1 previous similar message
[15410.067186] Lustre: lustre-MDT0000-mdc-ffff8801c5fefbf0: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[15410.069999] Lustre: Skipped 4 previous similar messages
[15410.073860] LustreError: 167-0: lustre-MDT0000-mdc-ffff8801c5fefbf0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[15410.077472] LustreError: 22383:0:(mdc_locks.c:773:mdc_enqueue()) ldlm_cli_enqueue: -5
[15410.078167] Lustre: lustre-MDT0000-mdc-ffff8801c5fefbf0: Connection restored to lustre-MDT0000 (at 0@lo)
[15410.078169] Lustre: Skipped 4 previous similar messages
[15410.088663] LustreError: 167-0: lustre-OST0001-osc-ffff8801c5fefbf0: This client was evicted by lustre-OST0001; in progress operations using this service will fail. |
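To check quickly which targets evicted the client in a console log like the one in the description, a small script along these lines may help. It is only a sketch and not part of the Lustre test suite: the function name `evicting_targets` and the two regular expressions are assumptions, based solely on the two eviction message shapes that appear in this log (the server-side "A client on nid ... was evicted" and the client-side "This client was evicted by ...").

```python
import re

# Message shapes taken from the log in this ticket (assumed, not exhaustive):
#   server side: "<target>: A client on nid <nid> was evicted due to ..."
#   client side: "<import>: This client was evicted by <target>; ..."
SERVER_EVICT = re.compile(r"(\S+): A client on nid (\S+) was evicted")
CLIENT_EVICT = re.compile(r"This client was evicted by ([^;\s]+)")


def evicting_targets(log_text):
    """Return the sorted names of targets that report evicting a client."""
    targets = set()
    for line in log_text.splitlines():
        server = SERVER_EVICT.search(line)
        if server:
            targets.add(server.group(1))
        client = CLIENT_EVICT.search(line)
        if client:
            targets.add(client.group(1))
    return sorted(targets)


# Four eviction lines copied from the description.
SAMPLE = """\
[15406.572324] LustreError: 138-a: lustre-OST0000: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[15409.904300] LustreError: 138-a: lustre-MDT0000: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[15410.073860] LustreError: 167-0: lustre-MDT0000-mdc-ffff8801c5fefbf0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[15410.088663] LustreError: 167-0: lustre-OST0001-osc-ffff8801c5fefbf0: This client was evicted by lustre-OST0001; in progress operations using this service will fail.
"""

if __name__ == "__main__":
    for target in evicting_targets(SAMPLE):
        print(target)
```

Run against the excerpt, this lists lustre-MDT0000, lustre-OST0000 and lustre-OST0001, which matches the observation that the evictions are not limited to the MDS.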
| Comments |
| Comment by Oleg Drokin [ 03/Nov/12 ] |
|
patch in http://review.whamcloud.com/4453 |
| Comment by Oleg Drokin [ 06/Nov/12 ] |
|
Note that the patch does not fully work, since the eviction can happen after the check has already run. |
| Comment by Nathaniel Clark [ 20/Nov/12 ] |
|
I ran into this too and opened bug |
| Comment by Nathaniel Clark [ 20/Nov/12 ] |
|
Duplicate of |