[LU-17229] replay-dual test_33: import is not in REPLAY_WAIT state Created: 26/Oct/23  Updated: 04/Feb/24  Resolved: 04/Feb/24

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Etienne Aujames
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16408 replay-dual test_33: unable to mount ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Sergey Cheremencev <scherementsev@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5937b834-7be4-4f39-b44f-9be432e94b5f

test_33 failed with the following error:

lctl dl | grep ' ST ' || true
error: read_param: '/sys/fs/lustre/mdc/lustre-MDT0000-mdc-ffff9a37051e0000/ping': Transport endpoint is not connected
error: read_param: '/sys/fs/lustre/mdc/lustre-MDT0000-mdc-ffff9a372060e800/ping': Transport endpoint is not connected
...
 rpc test_33: @@@@@@ FAIL: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into REPLAY_WAIT state after 1475 sec, have FULL 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6686:error()
  = /usr/lib64/lustre/tests/test-framework.sh:8086:_wait_import_state()
  = /usr/lib64/lustre/tests/test-framework.sh:8108:wait_import_state()
  = /usr/lib64/lustre/tests/test-framework.sh:8118:wait_import_state_mount()
  = rpc.sh:20:main()
CMD: onyx-82vm10,onyx-82vm1.onyx.whamcloud.com,onyx-82vm2,onyx-82vm5,onyx-82vm9 /usr/sbin/lctl dk > /autotest/autotest-2/2023-10-19/lustre-reviews_review-dne-part-2_99525_4_c51e2a6f-6696-46b6-b871-c03b2779b9df//rpc.test_33.debug_log.\$(hostname -s).1697711480.log;
		dmesg > /autotest/autotest-2/2023-10-19/lustre-reviews_review-dne-part-2_99525_4_c51e2a6f-6696-46b6-b871-c03b2779b9df//rpc.test_33.dmesg.\$(hostname -s).1697711480.log
onyx-82vm1.onyx.whamcloud.com: Dumping lctl log to /autotest/autotest-2/2023-10-19/lustre-reviews_review-dne-part-2_99525_4_c51e2a6f-6696-46b6-b871-c03b2779b9df//rpc.test_33.*.1697711480.log
 replay-dual test_33: @@@@@@ FAIL: import is not in REPLAY_WAIT state 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6686:error()
  = /usr/lib64/lustre/tests/test-framework.sh:8352:wait_clients_import_state()
  = /usr/lib64/lustre/tests/replay-dual.sh:1303:test_33()
  = /usr/lib64/lustre/tests/test-framework.sh:7026:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:7082:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:6912:run_test()
  = /usr/lib64/lustre/tests/replay-dual.sh:1323:main()
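
Note on the first failure above: it has the shape of a poll-until-timeout loop that gave up after 1475 sec while the import was still FULL. A minimal sketch of that pattern follows, for illustration only. `wait_for_state` and its arguments are hypothetical names, and this is not the actual `_wait_import_state()` code from test-framework.sh.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not test-framework.sh code).
# Usage: wait_for_state EXPECTED TIMEOUT_SEC INTERVAL_SEC CMD [ARGS...]
# Runs CMD repeatedly until it prints EXPECTED or TIMEOUT_SEC has elapsed.
wait_for_state() {
    local expected=$1 timeout=$2 interval=$3
    shift 3
    local waited=0 current

    while :; do
        # CMD is expected to print the current state on stdout.
        current=$("$@")
        if [ "$current" = "$expected" ]; then
            echo "reached $expected after $waited sec"
            return 0
        fi
        # Give up once the time budget is spent, reporting the last state seen.
        if [ "$waited" -ge "$timeout" ]; then
            echo "can't reach $expected state after $waited sec, have $current" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

A call could look like `wait_for_state REPLAY_WAIT 1475 1 read_state`, where `read_state` is a hypothetical wrapper that prints the state of one import. With a loop of this kind, an import that never leaves FULL always runs to the full timeout, which matches the "after 1475 sec, have FULL" message in the log.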

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/99525 - 4.18.0-477.21.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/99525 - 4.18.0-477.21.1.el8_lustre.x86_64


VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-dual test_33 - import is not in REPLAY_WAIT state



 Comments   
Comment by Andreas Dilger [ 27/Oct/23 ]

Hi Etienne, could you please look at this failure? It was first hit three times with your patch https://review.whamcloud.com/50434 "LU-16408 tests: fix replay-dual test 33" before it landed, and it has been hit regularly since the patch landed on 2023-09-23.

https://testing.whamcloud.com/search?status%5B%5D=FAIL&test_set_script_id=7b616032-3db2-11e0-80c0-52540025f9af&sub_test_script_id=e37b0f8e-e1e8-49a7-ac2e-9820110a84d0&start_date=2023-09-01&end_date=2023-10-26&source=sub_tests#redirect

Comment by Nikitas Angelinas [ 03/Nov/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/a4c57330-8b26-48df-ab88-29d50600dd6c

Comment by Andreas Dilger [ 27/Nov/23 ]

Still being hit regularly, also in replay-single test_135

Comment by Gerrit Updater [ 28/Nov/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53267
Subject: LU-17229 tests: rely on IR for replay-dual 33
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: abe96582e2bf0b73d1738d895800e929db85cf03

Comment by Etienne Aujames [ 28/Nov/23 ]

I have tried something, but I am working blind here: I cannot reproduce this test failure on my VMs.
The issue seems to happen when IR (Imperative Recovery) is not enabled. So instead of trying to make the test work without IR, I keep the MGS up and fail another MDT.

Comment by Aurelien Degremont [ 08/Jan/24 ]

+2 on master:

 

Comment by Arshad Hussain [ 25/Jan/24 ]

+1 on Master

https://testing.whamcloud.com/sub_tests/697b09ea-deeb-4e84-a108-c15cb8808e99

Comment by Alex Zhuravlev [ 28/Jan/24 ]

hitting this one quite often:

Failure Rate: 17.33% of most recent 75 runs, 25 skipped (all branches)

Comment by Gerrit Updater [ 04/Feb/24 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53267/
Subject: LU-17229 tests: rely on IR for replay-dual 33
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8fdef2f2f658c9bdd7568db31473fdd97da8a14e

Comment by Peter Jones [ 04/Feb/24 ]

Merged for 2.16

Generated at Sat Feb 10 03:33:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.