[LU-2271] recovery-small test 10 does not properly reconnect
Created: 03/Nov/12  Updated: 20/Nov/12  Resolved: 20/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Oleg Drokin
Assignee: Nathaniel Clark
Resolution: Duplicate
Votes: 0
Labels: NFBlocker

Issue Links:
Duplicate
duplicates LU-2356 Test failure recovery-small/11,12,13 Resolved
Severity: 3
Rank (Obsolete): 5430

 Description   

It appears that recovery-small test 10 could cause this client to be evicted not only from the MDS, but also from all OSTs.
As a result, the subsequent test also fails when it tries to touch one of the disconnected OSTs.
This often manifests as test 11 failing; if test 11 is skipped, test 12 fails, and if tests 11 and 12 are both skipped, test 13 fails.

[15398.819135] Lustre: DEBUG MARKER: == recovery-small test 10: finish request on server after client eviction (bug 1521) == 00:21:04 (1351916464)
[15398.902152] Lustre: *** cfs_fail_loc=305, val=0***
[15398.904516] Lustre: *** cfs_fail_loc=305, val=0***
[15399.574089] Lustre: *** cfs_fail_loc=305, val=0***
[15399.575254] Lustre: Skipped 2 previous similar messages
[15406.572138] Lustre: 21155:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1351916465/real 1351916465]  req@ffff8801acc1cbf0 x1417586946343301/t0(0) o104->lustre-OST0000@0@lo:15/16 lens 296/224 e 0 to 1 dl 1351916472 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[15406.572324] LustreError: 138-a: lustre-OST0000: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[15406.578774] Lustre: 21155:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[15406.580254] LustreError: 21155:0:(ldlm_lockd.c:684:ldlm_handle_ast_error()) ### client (nid 0@lo) returned 0 from blocking AST ns: filter-ffff88011e018000 lock: ffff880203b29db8/0xb21738563fa1aa1f lrc: 1/0,0 mode: --/PW res: 4/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->2097151) flags: 0x200100a0 nid: 0@lo remote: 0xb21738563fa1aa18 expref: 2 pid: 21155 timeout 4298779689
[15409.904300] LustreError: 138-a: lustre-MDT0000: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[15409.910598] LustreError: Skipped 1 previous similar message
[15410.061266] LustreError: 21616:0:(mdt_handler.c:3031:mdt_recovery()) operation 101 on unconnected MDS from 12345-0@lo
[15410.063011] LustreError: 11-0: an error occurred while communicating with 0@lo. The ldlm_enqueue operation failed with -107
[15410.066075] LustreError: Skipped 1 previous similar message
[15410.067186] Lustre: lustre-MDT0000-mdc-ffff8801c5fefbf0: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[15410.069999] Lustre: Skipped 4 previous similar messages
[15410.073860] LustreError: 167-0: lustre-MDT0000-mdc-ffff8801c5fefbf0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[15410.077472] LustreError: 22383:0:(mdc_locks.c:773:mdc_enqueue()) ldlm_cli_enqueue: -5
[15410.078167] Lustre: lustre-MDT0000-mdc-ffff8801c5fefbf0: Connection restored to lustre-MDT0000 (at 0@lo)
[15410.078169] Lustre: Skipped 4 previous similar messages
[15410.088663] LustreError: 167-0: lustre-OST0001-osc-ffff8801c5fefbf0: This client was evicted by lustre-OST0001; in progress operations using this service will fail.
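
To see at a glance which targets were involved in a log like the one above, the eviction messages can be pulled out with a short script. The following is only a sketch: the function name `summarize_evictions` and the two regular expressions are illustrative assumptions based on the message wording in this particular log, not part of the Lustre test suite, and other Lustre versions may word these messages differently.

```python
import re

# Server-side message, e.g.
# "lustre-OST0000: A client on nid 0@lo was evicted due to ..."
SERVER_EVICT_RE = re.compile(r"(\S+): A client on nid (\S+) was evicted")

# Client-side message, e.g.
# "...: This client was evicted by lustre-OST0001; in progress operations ..."
CLIENT_EVICT_RE = re.compile(r"This client was evicted by (\S+?);")


def summarize_evictions(log_text):
    """Return (server_side, client_side) lists of target names, first-seen order.

    server_side: targets that logged evicting a client.
    client_side: targets the client logged being evicted by.
    """
    server_side, client_side = [], []
    for line in log_text.splitlines():
        m = SERVER_EVICT_RE.search(line)
        if m and m.group(1) not in server_side:
            server_side.append(m.group(1))
        m = CLIENT_EVICT_RE.search(line)
        if m and m.group(1) not in client_side:
            client_side.append(m.group(1))
    return server_side, client_side


# Four lines copied from the console log in this ticket.
SAMPLE = """\
[15406.572324] LustreError: 138-a: lustre-OST0000: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[15409.904300] LustreError: 138-a: lustre-MDT0000: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[15410.073860] LustreError: 167-0: lustre-MDT0000-mdc-ffff8801c5fefbf0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[15410.088663] LustreError: 167-0: lustre-OST0001-osc-ffff8801c5fefbf0: This client was evicted by lustre-OST0001; in progress operations using this service will fail.
"""

if __name__ == "__main__":
    servers, clients = summarize_evictions(SAMPLE)
    print("evictions logged by servers:", servers)
    print("evictions seen by client:  ", clients)
```

On the excerpt above this reports both an OST and the MDT on the server side, which matches the observation that test 10 evicts the client from more than just the MDS.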


 Comments   
Comment by Oleg Drokin [ 03/Nov/12 ]

Patch at http://review.whamcloud.com/4453

Comment by Oleg Drokin [ 06/Nov/12 ]

Note that the patch does not fully work, since the eviction can happen after the check has already run.
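
The race described here (a one-shot check passes, then the eviction lands) is a general pattern. One common way to narrow it, shown purely as an illustration, is to require the check to stay true for a settle window instead of trusting a single pass. This is not the patch in review 4453 nor the fix in LU-2356; `wait_until_stable`, its parameters, and the `check` callable are all hypothetical names.

```python
import time


def wait_until_stable(check, settle=5.0, timeout=60.0, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Return True once check() has held true for `settle` seconds in a row.

    Any failing check resets the window, so a late eviction that flips the
    state after an early pass is still caught. Returns False if the timeout
    expires first. `clock` and `sleep` are injectable to ease testing.
    """
    deadline = clock() + timeout
    ok_since = None  # time of the first pass in the current unbroken run
    while True:
        now = clock()
        if check():
            if ok_since is None:
                ok_since = now
            if now - ok_since >= settle:
                return True
        else:
            ok_since = None  # state regressed; start the window over
        if now >= deadline:
            return False
        sleep(interval)
```

A settle window only shrinks the race; it cannot close it entirely, because an eviction could still arrive after the window ends.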

Comment by Nathaniel Clark [ 20/Nov/12 ]

I ran into this too, opened LU-2356, and have a patch for tests 10 through 12.

Comment by Nathaniel Clark [ 20/Nov/12 ]

Duplicate of LU-2356

Generated at Sat Feb 10 01:23:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.