[LU-7097] conf-sanity test_84 (check recovery_time_hard) fails on DNE setup Created: 03/Sep/15  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Sergey Cheremencev Assignee: Hongchao Zhang
Resolution: Incomplete Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7222 conf-sanity test_84: invalid llog tai... Resolved
is related to LU-7428 conf-sanity test_84, replay-dual 0a: ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

conf-sanity test_84 fails on a DNE setup.
Simple reproducer:
MDSCOUNT=2 ONLY=84 sh ./conf-sanity.sh



 Comments   
Comment by Gerrit Updater [ 03/Sep/15 ]

Sergey Cheremencev (sergey_cheremencev@xyratex.com) uploaded a new patch: http://review.whamcloud.com/16217
Subject: LU-7097 tests: add DNE support to conf-sanity_84
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 836e0145ed22bba1b2107523c74f322f630b9588

Comment by James Nunez (Inactive) [ 03/Sep/15 ]

Sergey - What failure message do you get when this test fails? Could you upload any relevant logs?

Comment by Sergey Cheremencev [ 03/Sep/15 ]

The test failed due to a timeout; there is no DNE support in the test. Note the repeated mds_connect failures with -19 (ENODEV) against MDT0001 in the log below:

Lustre: DEBUG MARKER: == conf-sanity test 84: check recovery_time_hard == 12:21:27 (1417263687)
LustreError: 11-0: lustre-MDT0000-mdc-ffff88005beb5000: Communicating with 192.168.112.5@tcp, operation mds_connect failed with -11.
LustreError: 11-0: lustre-MDT0001-mdc-ffff88005beb5000: Communicating with 192.168.112.5@tcp, operation mds_connect failed with -19.
LustreError: Skipped 1 previous similar message
Lustre: Mounted lustre-client
LustreError: 11-0: lustre-MDT0001-mdc-ffff88005aa65400: Communicating with 192.168.112.5@tcp, operation mds_connect failed with -19.
LustreError: Skipped 1 previous similar message
LustreError: 11-0: lustre-MDT0001-mdc-ffff88005beb5000: Communicating with 192.168.112.5@tcp, operation mds_connect failed with -19.
LustreError: Skipped 1 previous similar message
LustreError: 11-0: lustre-MDT0001-mdc-ffff88005beb5000: Communicating with 192.168.112.5@tcp, operation mds_connect failed with -19.
LustreError: Skipped 1 previous similar message
LustreError: 11-0: lustre-MDT0001-mdc-ffff88005beb5000: Communicating with 192.168.112.5@tcp, operation mds_connect failed with -19.
LustreError: Skipped 3 previous similar messages
LustreError: 11-0: lustre-MDT0001-mdc-ffff88005beb5000: Communicating with 192.168.112.5@tcp, operation mds_connect failed with -19.
LustreError: Skipped 7 previous similar messages
LustreError: 11-0: lustre-MDT0001-mdc-ffff88005beb5000: Communicating with 192.168.112.5@tcp, operation mds_connect failed with -19.
LustreError: Skipped 13 previous similar messages
LustreError: 11-0: lustre-MDT0001-mdc-ffff88005beb5000: Communicating with 192.168.112.5@tcp, operation mds_connect failed with -19.
LustreError: Skipped 25 previous similar messages
LustreError: 11-0: lustre-MDT0001-mdc-ffff88005beb5000: Communicating with 192.168.112.5@tcp, operation mds_connect failed with -19.
LustreError: Skipped 51 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff88005aa65400: Connection to lustre-MDT0000 (at 192.168.112.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 167-0: lustre-MDT0000-mdc-ffff88005aa65400: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
LustreError: 5312:0:(ldlm_resource.c:781:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff88005aa65400: namespace resource [0x240000401:0x1:0x0].0 (ffff88003fd94e40) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 5312:0:(ldlm_resource.c:1421:ldlm_resource_dump()) --- Resource: [0x240000401:0x1:0x0].0 (ffff88003fd94e40) refcount = 2
LustreError: 5312:0:(ldlm_resource.c:1424:ldlm_resource_dump()) Granted locks (in reverse order):
LustreError: 5312:0:(ldlm_resource.c:1427:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-mdc-ffff88005aa65400 lock: ffff88005c258b00/0xf5fcc28fef022372 lrc: 3/0,0 mode: PR/PR res: [0x240000401:0x1:0x0].0 bits 0x1b rrc: 2 type: IBT flags: 0x52f400000000 nid: local remote: 0x73481ecc0160356c expref: -99 pid: 4521 timeout: 0 lvb_type: 0
Lustre: lustre-MDT0000-mdc-ffff88005aa65400: Connection restored to lustre-MDT0000 (at 192.168.112.5@tcp)
Comment by Sergey Cheremencev [ 27/Jan/16 ]

It seems the issue is already fixed by "LU-7222 tests: add Mulitple MDTs to test_84".
On the other hand, my patch introduces mdsfailover_HOST support for test_84.
It also starts all possible clients and waits for exactly one of them to be evicted; a sketch of that check follows below.
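For reference, a minimal sketch of such an eviction check, assuming the usual conf-sanity.sh/test-framework.sh helpers (do_facet, error, $LCTL, $FSNAME, $MDSCOUNT); the function name and the exact accounting here are illustrative, not the actual patch:

# Sketch only: sum per-MDT recovery accounting across a DNE setup and
# verify that exactly one client failed to complete recovery (i.e. was
# evicted). Helper names follow test-framework.sh conventions.
check_one_client_evicted() {
	local num completed total_done=0 total_known=0

	for num in $(seq $MDSCOUNT); do
		# MDT indices are 4-digit hex: MDT0000, MDT0001, ...
		local svc=$FSNAME-MDT$(printf %04x $((num - 1)))

		# recovery_status reports completed_clients as "M/N":
		# M clients finished recovery out of N known clients
		completed=$(do_facet mds$num "$LCTL get_param -n \
				mdt.$svc.recovery_status" |
				awk '/completed_clients/ { print $2 }')
		echo "$svc completed_clients: $completed"
		total_done=$((total_done + ${completed%/*}))
		total_known=$((total_known + ${completed#*/}))
	done

	[ $((total_known - total_done)) -eq 1 ] ||
		error "expected exactly 1 evicted client," \
			"got $((total_known - total_done))"
}

With MDSCOUNT=2 this aggregates MDT0000 and MDT0001, so an eviction on either MDT would satisfy the check.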

Comment by Andreas Dilger [ 13/Oct/21 ]

This ticket can be reopened if a new patch is submitted.
