[LU-9827] sanityn fail “remove sub-test dirs failed d80b.sanityn/migrate_dir” Created: 03/Aug/17  Updated: 24/Aug/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.10.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10553 d23b.replay-dual: Directory not empty... Open
is related to LU-10789 lustre-rsync-test: @@@@@@ FAIL: remov... Open
is related to LU-10690 sanity-hsm: remove sub-test dirs failed Resolved
is related to LU-9927 sanityn fails on clean up with 'FAIL:... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The sanityn test suite fails even though no subtest fails. At the end of the suite_stdout log, we see that cleanup cannot remove a file in one of the directories that test_80b created:

03:30:00:== sanityn test complete, duration 3688 sec ========================================================== 03:29:56 (1501730996)
03:30:11:rm: cannot remove '/mnt/lustre/d80b.sanityn/migrate_dir/link_file': Stale file handle
03:30:11: sanityn : @@@@@@ FAIL: remove sub-test dirs failed 
03:30:11:  Trace dump:
03:30:11:  = /usr/lib64/lustre/tests/test-framework.sh:4980:error()
03:30:11:  = /usr/lib64/lustre/tests/test-framework.sh:4499:check_and_cleanup_lustre()
03:30:11:  = /usr/lib64/lustre/tests/sanityn.sh:4012:main()
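
To check whether another suite_stdout log shows the same signature, it may help to look for both messages together. The following is only a minimal sketch: the function name has_lu9827_signature and the log path argument are hypothetical, and it uses nothing beyond standard grep.

```shell
#!/bin/bash
# Hypothetical helper (not part of test-framework.sh): returns 0 if the
# given log contains both the cleanup failure and a "Stale file handle"
# error on a path under d80b.sanityn.
has_lu9827_signature() {
	local log="$1"

	grep -q 'FAIL: remove sub-test dirs failed' "$log" &&
		grep -q 'd80b\.sanityn/.*Stale file handle' "$log"
}
```

For example, has_lu9827_signature suite_stdout.log && echo "matches LU-9827", where suite_stdout.log stands for a locally saved copy of the log.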

Test 80b itself does not fail, but the following appears in its test_log:

16:02:42:== sanityn test 80b: Accessing directory during migration ============================================ 16:02:39 (1501516959)
16:02:42:start migration thread 11920
16:02:42:accessing the migrating directory for 5 minutes...
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:02:53:...10 seconds
16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
…
16:03:41:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:03:41:...60 seconds
16:03:41:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:03:45:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:03:45:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:03:46:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:03:46:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:03:46:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:03:46:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
16:03:46:diff: /mnt/lustre2/d80b.sanityn/migrate_dir/file1: No such file or directory
16:03:46:access file1 fails
16:03:46:Resetting fail_loc on all nodes.../usr/lib64/lustre/tests/test-framework.sh: line 3146: 11920 Killed                  ( while true; do
16:03:46:    mdt_idx=$((RANDOM % MDSCOUNT)); $LFS migrate -m $mdt_idx $migrate_dir1 &>/dev/null || rc=$?; [ $rc -ne 0 -o $rc -ne 16 ] || break;
16:03:46:done )
16:03:46:CMD: trevis-3vm1.trevis.hpdd.intel.com,trevis-3vm2,trevis-3vm3,trevis-3vm4,trevis-3vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
16:03:46:done.
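
One detail in the killed migration loop printed above may be worth a closer look. As written, the condition [ $rc -ne 0 -o $rc -ne 16 ] is true for every value of rc, because no value equals both 0 and 16, so the "|| break" can never run and the loop ends only when it is killed. Whether that is intended is not clear from the log. The sketch below just evaluates the condition for a few values; the helper names are hypothetical, and the -a variant is only a guess at one possible intent, not something taken from sanityn.sh.

```shell
#!/bin/bash
# Condition exactly as printed in the log: true for every rc.
cond_as_written() {
	local rc=$1
	[ "$rc" -ne 0 -o "$rc" -ne 16 ]
}

# Hypothetical variant (an assumption, not the test's actual logic):
# true only when rc is neither 0 nor 16.
cond_with_and() {
	local rc=$1
	[ "$rc" -ne 0 -a "$rc" -ne 16 ]
}

# Print how each condition evaluates for a few sample return codes.
for rc in 0 1 16; do
	if cond_as_written "$rc"; then a=true; else a=false; fi
	if cond_with_and "$rc"; then b=true; else b=false; fi
	echo "rc=$rc as_written=$a with_and=$b"
done
```

With the condition as written, all three sample values evaluate to true, which is consistent with the loop in the log running until it was killed.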

We’ve seen this test suite failure with a Lustre setup with DNE:
https://testing.hpdd.intel.com/test_sets/e98fd2a4-7826-11e7-9a7b-5254006e85c2
https://testing.hpdd.intel.com/test_sets/e5ffaca8-763d-11e7-bbe0-5254006e85c2
https://testing.hpdd.intel.com/test_sets/42bd9ada-7289-11e7-bb95-5254006e85c2



 Comments   
Comment by Peter Jones [ 04/Aug/17 ]

Lai

Could you please advise on this one?

Thanks

Peter

Comment by Bob Glossman (Inactive) [ 15/Aug/17 ]

another on b2_10:
https://testing.hpdd.intel.com/test_sets/51f590c6-816f-11e7-bbc6-5254006e85c2

Comment by Jian Yu [ 01/Oct/17 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/a409c466-a6c4-11e7-b786-5254006e85c2

Comment by Bob Glossman (Inactive) [ 13/Oct/17 ]

more on master:
https://testing.hpdd.intel.com/test_sets/887cb8aa-afbd-11e7-8d8d-5254006e85c2
https://testing.hpdd.intel.com/test_sets/eb4b2f2c-afe3-11e7-a26c-5254006e85c2

Comment by Mikhail Pershin [ 15/Nov/17 ]

in master

https://testing.hpdd.intel.com/test_sessions/1b74ef7e-dd18-4ec3-8e97-1cd65318eecc

Comment by Jian Yu [ 30/Dec/17 ]

+1 on master branch:
https://testing.hpdd.intel.com/test_sets/153c9566-ec96-11e7-8c43-52540065bddc

Comment by Minh Diep [ 19/Jan/18 ]

+1 on b2_10

https://testing.hpdd.intel.com/test_sets/ac5d6fec-fc31-11e7-a6ad-52540065bddc

Comment by Jian Yu [ 28/Jan/18 ]

+1 on master branch:
https://testing.hpdd.intel.com/test_sessions/80f7c1cf-216e-4836-b786-d875a3850a12

Comment by Bob Glossman (Inactive) [ 13/Feb/18 ]

another on master:
https://testing.hpdd.intel.com/test_sessions/915d9dad-45ed-4a0e-8390-2670ae2df9ed

Comment by James Nunez (Inactive) [ 28/Feb/18 ]

We also see sanity-hsm fail because it cannot clean up the test directory left by sanityn test 80b; rm reports "Directory not empty".

== sanity-hsm test complete, duration 1607 sec ======================================================= 03:57:13 (1518695833)
rm: cannot remove '/mnt/lustre/d80b.sanityn/migrate_dir': Directory not empty
 sanity-hsm : @@@@@@ FAIL: remove sub-test dirs failed 

Logs at
https://testing.hpdd.intel.com/test_sets/295276d4-1102-11e8-a10a-52540065bddc
https://testing.hpdd.intel.com/test_sets/4d54710e-12d3-11e8-a6ad-52540065bddc

Comment by Bob Glossman (Inactive) [ 05/Mar/18 ]

another on master:
https://testing.hpdd.intel.com/test_sets/8d4ec074-20ad-11e8-a4b1-52540065bddc

Comment by Lai Siyao [ 06/Mar/18 ]

This is related to directory migration, so it may go away after LU-4684.

Comment by nasf (Inactive) [ 06/Mar/18 ]

+1 on b2_10:
https://testing.hpdd.intel.com/test_logs/ea214066-20bb-11e8-a4b1-52540065bddc/show_text
https://testing.hpdd.intel.com/test_sessions/e8a606ee-e124-4434-897a-32c5d70eea1b

Comment by Bob Glossman (Inactive) [ 12/Mar/18 ]

another on master:

https://testing.hpdd.intel.com/test_sets/0d8187b0-2642-11e8-b74b-52540065bddc

Comment by Andreas Dilger [ 17/Apr/18 ]

+1 on master: https://testing.hpdd.intel.com/test_sets/ee3a3fb8-4023-11e8-95c0-52540065bddc

Comment by Bruno Faccini (Inactive) [ 23/May/18 ]

+1 on master : https://testing.hpdd.intel.com/test_sets/c089db5c-5df9-11e8-abc3-52540065bddc

Comment by Nathaniel Clark [ 20/Jul/18 ]

master review-dne-part-1

https://testing.whamcloud.com/test_sets/83c65a84-8bcf-11e8-b0aa-52540065bddc

Comment by Andreas Dilger [ 30/Jul/18 ]

+1 on master review-dne-part-1:

https://testing.whamcloud.com/test_sets/e48cc2f2-93b0-11e8-8ee3-52540065bddc

Comment by Mikhail Pershin [ 17/Aug/18 ]

+1 on master:

https://testing.whamcloud.com/test_sets/6e2aa034-9fbd-11e8-b0aa-52540065bddc

It seems that this fails quite often.

Comment by Andreas Dilger [ 17/Dec/18 ]

+1 on master https://testing.whamcloud.com/test_sets/bc4ce80e-ff8e-11e8-a97c-52540065bddc

Generated at Sat Feb 10 02:29:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.