[LU-10729] replay-dual test_23d: FAIL: Remote creation failed 1 : mkdir: cannot create directory': File exists
Created: 27/Feb/18  Updated: 29/Nov/23  Resolved: 11/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.14.0, Lustre 2.12.6
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Elena Gryaznova Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Attachments: Zip Archive 5a946d4ff72e620d571a71ff.zip    
Issue Links:
Related
is related to LU-6006 replay-dual test_22a: Remote creation... Resolved
is related to LU-17102 replay-single test_80c: remote creati... Open
Severity: 3

 Description   

The test fails on a failover configuration (hosts != failoverhosts):

== replay-dual test 23d: c1 rmdir d1, M0 drop update reply and fail M0/M1, c2 mkdir d1 =============== 19:53:39 (1519674819)
fail_loc=0x1701
fail_loc=0
Failing mds1 on fre909
Stopping /mnt/lustre-mds1 (opts:) on fre909
pdsh@fre913: fre909: ssh exited with exit code 1
Failing mds2 on fre910
Stopping /mnt/lustre-mds2 (opts:) on fre910
pdsh@fre913: fre910: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to fre910
...
fre915: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      1168028        6688     1062924   1% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID      1168028        6272     1063340   1% /mnt/lustre[MDT:1]
lustre-OST0000_UUID      4607356       34096     4306876   1% /mnt/lustre[OST:0]
lustre-OST0001_UUID      4607356       34096     4306876   1% /mnt/lustre[OST:1]
lustre-OST0002_UUID      4607356       34100     4306872   1% /mnt/lustre[OST:2]
lustre-OST0003_UUID      4607356       34100     4306872   1% /mnt/lustre[OST:3]

filesystem_summary:     18429424      136392    17227496   1% /mnt/lustre

fre914: mkdir: cannot create directory '/mnt/lustre2/d23d.replay-dual/remote_dir': File exists
pdsh@fre913: fre914: ssh exited with exit code 1
 replay-dual test_23d: @@@@@@ FAIL: Remote creation failed 1 


 Comments   
Comment by Minh Diep [ 23/Mar/18 ]

+1 on b2_10

https://testing.hpdd.intel.com/test_sets/9a12c54c-2eab-11e8-b3c6-52540065bddc

Comment by James Nunez (Inactive) [ 06/Aug/18 ]

When this test fails, several test suites that run after replay-dual also fail because they cannot clean up this subtest directory: either their tests never start, or the suite fails during cleanup after all tests have run.

From the end of the replay-vbr test_log at https://testing.whamcloud.com/test_sets/95dc7668-98ef-11e8-a9f7-52540065bddc, we see

rm: cannot remove '/mnt/lustre/d23d.replay-dual': Directory not empty
  Trace dump:
  = /usr/lib64/lustre/tests/replay-vbr.sh:31:main()
replay-vbr: FAIL: test-framework exiting on error

I'm going to attribute these failures to this ticket.

Comment by Alex Zhuravlev [ 19/Nov/21 ]

The llog cancel for a previous distributed transaction races with the update from rmdir ../remote_dir. If that llog write wins, the rmdir's update is put on hold until the reply for the llog write arrives.
I think the easiest fix is just to wait a bit before the rmdir to let that cancel write get processed.
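A minimal sketch of the "wait before rmdir" idea as a generic bash helper. This is not the code from patch 45623 (its diff is not reproduced in this ticket); the function name `rmdir_after_settle` and its 1-second default delay are hypothetical. Note that a fixed sleep only narrows the race window described above; it does not eliminate it.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of Lustre's test-framework.sh.
# Pause briefly so that a pending llog cancel write from an earlier
# distributed transaction has a chance to be processed, then remove
# the directory. Returns rmdir's exit status.
rmdir_after_settle() {
	local dir=$1
	local delay=${2:-1}	# seconds to wait; illustrative default only

	sleep "$delay"
	rmdir "$dir"
}
```

In a test body this would be called as, for example, `rmdir_after_settle "$DIR/$tdir/remote_dir" || error "rmdir failed"`, in place of a bare `rmdir`.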

Comment by Gerrit Updater [ 19/Nov/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45623
Subject: LU-10729 tests: replay-dual/23d to wait
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f06970ed6d6e839b82b9623c1ae007164e3d5d50

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45623/
Subject: LU-10729 tests: replay-dual/23d to wait
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 63a19f6f666b9d18fede66ce8bcd2d799b5e0fa7

Comment by Peter Jones [ 11/Apr/23 ]

Landed for 2.16

Comment by Andreas Dilger [ 09/Sep/23 ]

I also see this fail intermittently on replay-dual test_22d.

Comment by Gerrit Updater [ 12/Sep/23 ]

"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52343
Subject: LU-10729 tests: replay-dual/22d to wait
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bfa32d30813269f74c721eef5aca8930f40230e8

Comment by Gerrit Updater [ 29/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52343/
Subject: LU-10729 tests: replay-dual/22d to wait
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c0973b9fd64036adb3991615e76a97c6aa0b384e

Generated at Sat Feb 10 02:37:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.