[LU-16733] recovery-small: cannot remove '/mnt/lustre/d110h.recovery-small' Created: 12/Apr/23  Updated: 20/Jun/23  Resolved: 20/Jun/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Feng Lei
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16737 recovery-small: cleanup error after a... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d38d511e-7fdf-4ab8-bca4-a3f9d540464f

 

The test session reports "No sub tests failed in this test set."

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/93288 - 4.18.0-348.7.1.el8_5.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/93288 - 4.18.0-348.23.1.el8_lustre.x86_64

This failure has been seen on a few different patches; the test suite is unable to clean up at the end:

== recovery-small test complete, duration 7270 sec ======= 11:52:10 (1679917930)
rm: cannot remove '/mnt/lustre/d110h.recovery-small': Input/output error
 recovery-small : @@@@@@ FAIL: remove sub-test dirs failed 

I also saw it with d110i.recovery-small



 Comments   
Comment by Andreas Dilger [ 14/Apr/23 ]

flei can you please check if there is some patch that landed recently that is causing this to be hit (or hit more frequently)?

It looks like the first (recent) hit was 2023-03-27 (ver 2.15.54.114), but on a patch that hasn't landed yet. There was also a single hit on 2023-01-19 (ver 2.15.53.56 full testing, so no patch), but that one complained about d110j. The problem has definitely been hit much more frequently since 2023-04-04. This Maloo search shows all of the failures tagged with LU-16733, since it isn't otherwise possible to search for "no failure", at least until patch https://review.whamcloud.com/49582 lands.

The patches landed after 2023-03-26 and before 2023-03-30 are:

# git log --after 2023-03-25 --before 2023-03-30 --oneline
7c52cbf65218 LU-16515 tests: disable sanity test_118c/118d
a7222127c7a6 LU-16642 tests: improve sanity-sec test_61
8f40a3d7110d LU-16639 misc: cleanup console messages
e998d21caf99 LU-16589 tests: add sanity/31l to test ln command
17bbf5bdd6f9 LU-930 docs: fix whatis output
36cbba150bce LU-16632 tests: more margin of error for sanity/56xh
91a3726f313d LU-16633 obdclass: fix rpc slot leakage
12c34651994b LU-14291 batch: don't include lustre_update.h for client only builds
d5b26443a3d3 LU-16615 utils: add messages in l_getidentity
b30f825232cb LU-16601 kernel: update SLES15 SP4 [5.14.21-150400.24.46.1]
8f004bc53b1a LU-16599 obdclass: job_stats can parse escaped jobid string
fc7a0d6013b4 LU-14668 lnet: add 'lock_prim_nid' lnet module parameter
f5293fb66e79 LU-16598 osp: cleanup comment in osp_sync.c
5e24b374f7bd LU-16595 test: save one second in wait_destroy_complete()
da230373bd14 LU-16563 lnet: use discovered ni status to set initial health
0366422cfd1e LU-16221 kernel: update RHEL 9.1 [5.14.0-162.18.1.el9_1]
2d40d96b4ec8 LU-15053 tests: reset quota if ENABLE_QUOTA=1
7e893c70955d LU-16382 build: udev files in /usr/lib
b33808d3aebb LU-16338 readahead: clip readahead with kms
ccee6b92ec4d LU-13107 utils: remove duplicate lctl erase/fork_lcfg
2471d35c0e0e LU-16217 iokit: Add lst.sh wrapper and lst-survey
bdbc7f9f42b9 LU-12805 tests: disable replay-single/36
73ee638813a8 LU-16604 kfilnd: kfilnd_peer ref leak on send
6fab1fe4a5c5 LU-9680 lnet: handle multi-rail setups
0ecb2a167c56 LU-11912 ofd: reduce LUSTRE_DATA_SEQ_MAX_WIDTH
c97d4cdf4dc7 LU-16629 osd: refill the existing env

I think there are a few approaches that could be used to debug this:

  • check MDS, OSS, client, test logs around test_110h/i/j to see if something unusual is happening vs. non-failing runs. This might be difficult since there will already be errors due to the test itself
  • review debug logs from the test failure to see why the directory could not be removed
  • submit "bisect" patches at different points in the above patch list with Test-Parameters: lines to run recovery-small enough times to be confident whether the bug is hit or not. It failed 11/256 runs in the past week (roughly one failure per 23 runs), so each bisect point would need about twice that many sessions, around 46, to be reasonably confident in the result (see the sketch after this list). Each session takes about 2h to finish, so they should be run in parallel (one "Test-Parameters: testlist=recovery-small mdscount=2 mdtcount=4" line per session). Since this will consume about 46 test nodes per patch, it is better to do one patch at a time, maybe more over the weekend.
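
For reference, a back-of-the-envelope check of those run counts (a sketch only; it assumes the per-run failure probability stays at the observed 11/256, and the loop is just for illustration):

# Chance that n clean runs are a false negative, given the observed failure rate.
for n in 23 46 92; do
    awk -v n="$n" 'BEGIN {
        p = 11 / 256                   # observed per-run failure probability
        miss = (1 - p) ^ n             # all n runs pass even though the bug is present
        printf "n=%3d runs: P(no failure | bug present) = %.2f\n", n, miss
    }'
done

With 46 runs there is still roughly a 13% chance of seeing no failure even though the bug is present, so a clean bisect point is suggestive rather than conclusive.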
Comment by Andreas Dilger [ 25/Apr/23 ]

"Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50683
Subject: LU-16733 tests: wait for recovery done
Project: fs/lustre-release
Branch: master
Current Patch Set: 2
Commit: 6b5c19493cbc8a186035f687d618003d06da0ef2
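
The patch subject suggests the final cleanup was racing with recovery still in progress after the failover subtests, which would be consistent with the intermittent EIO from rm. As a rough illustration only (not the patch's actual code), a wait of this kind in a Lustre test script typically polls the recovery_status parameter on the MDS until no MDT reports RECOVERING; the function name and timeout below are made up for the sketch:

# Sketch: wait on the MDS until MDT recovery has finished before removing
# the test directories. Illustration only; test-framework.sh already has
# similar helpers (e.g. wait_recovery_complete).
wait_recovery_done() {
    local mds_node=$1 timeout=${2:-300} elapsed=0 status

    while (( elapsed < timeout )); do
        # recovery_status shows e.g. "status: RECOVERING" or "status: COMPLETE"
        status=$(ssh "$mds_node" "lctl get_param -n 'mdt.*.recovery_status'" 2>/dev/null)
        if [ -n "$status" ] && ! echo "$status" | grep -q RECOVERING; then
            return 0
        fi
        sleep 5
        (( elapsed += 5 ))
    done
    echo "recovery did not complete within ${timeout}s" >&2
    return 1
}
# e.g. wait_recovery_done mds1 600 before the final rm of the dNNN.recovery-small dirs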

Comment by Gerrit Updater [ 20/Jun/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50683/
Subject: LU-16733 tests: wait for recovery done
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1512b6572e78442760e0caff50957061b2ca6617

Comment by Peter Jones [ 20/Jun/23 ]

Landed for 2.16
