[LU-12747] sanity: test 811 fail with "MDD orphan cleanup thread not quit" Created: 11/Sep/19  Updated: 13/Jan/21  Resolved: 14/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11418 hung threads on MDT and MDT won't umount Resolved
is related to LU-14330 Interop: recovery-small test 143 fail... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Lai Siyao <lai.siyao@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/83dd0b3a-d3ea-11e9-9fc9-52540065bddc

onyx-33vm4: == rpc test complete, duration -o sec ================================================================ 16:37:21 (1568133441)
onyx-33vm4: onyx-33vm4.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
CMD: onyx-33vm4 e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: onyx-33vm4 e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: onyx-33vm4 e2label /dev/mapper/mds1_flakey 2>/dev/null
Started lustre-MDT0000
CMD: onyx-33vm4 pgrep orph_.*-MDD
 sanity test_811: @@@@@@ FAIL: MDD orphan cleanup thread not quit 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6115:error()
  = /usr/lib64/lustre/tests/sanity.sh:21633:test_811()
  = /usr/lib64/lustre/tests/test-framework.sh:6417:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:6456:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:6302:run_test()
  = /usr/lib64/lustre/tests/sanity.sh:21635:main()


 Comments   
Comment by Andreas Dilger [ 31/Jan/20 ]

+1 on master https://testing.whamcloud.com/test_sets/64398a3e-4243-11ea-b083-52540065bddc

Comment by Andreas Dilger [ 01/Feb/20 ]

This test seems to fail only intermittently, but it could be made more robust.

Comment by Andreas Dilger [ 01/Feb/20 ]

[ 8541.797642] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
[ 8542.067572] Lustre: DEBUG MARKER: pgrep orph_.*-MDD
[ 8542.196104] Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
[ 8542.253891] LustreError: 27822:0:(osd_handler.c:278:osd_idc_find_or_init()) can't lookup: rc = -2
[ 8542.255510] Lustre: 27822:0:(mdd_orphans.c:340:mdd_orphan_destroy()) lustre-MDD0000: orphan 0x200006991:0xd:0x0 [0x200006991:0xd:0x0] doesn't exist
[ 8542.706917] Lustre: DEBUG MARKER: sanity test_811: @@@@@@ FAIL: MDD orphan cleanup thread not quit

The pgrep runs shortly before mdd_orphan_destroy() finishes, so the orphan cleanup thread is still alive when the test checks for it. A slightly longer wait would fix this.
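
As an illustration of the "longer wait" idea, here is a minimal sketch of a polling loop that retries the pgrep check until the thread is gone or a timeout expires. The helper name wait_for_no_process and the 20-second default are hypothetical; this is not the code from the landed patch. It also assumes pgrep runs on the local node, whereas the test runs it on the MDS.

```shell
#!/bin/bash
# Hypothetical helper, not part of test-framework.sh: poll until no process
# name matches the given pattern, or give up after max_wait seconds.
# Returns 0 once no process matches, 1 if a match is still present at timeout.
wait_for_no_process() {
	local pattern=$1
	local max_wait=${2:-20}	# assumed default timeout in seconds
	local waited=0

	# pgrep exits 0 while at least one process matches the pattern
	while pgrep "$pattern" > /dev/null; do
		if (( waited >= max_wait )); then
			return 1
		fi
		sleep 1
		waited=$((waited + 1))
	done
	return 0
}

# Example use with the pattern from the failing check:
#   wait_for_no_process "orph_.*-MDD" 20 ||
#       echo "MDD orphan cleanup thread not quit"
```

Polling with a bounded timeout lets the check pass as soon as the thread exits, while still failing if the thread is genuinely stuck.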

Comment by Gerrit Updater [ 01/Feb/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37395
Subject: LU-12747 tests: wait properly for orhpan thread stop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c5073903984d040bc5006a75e49d13dc0d7e54a1

Comment by Gerrit Updater [ 14/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37395/
Subject: LU-12747 tests: wait properly for orhpan thread stop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e5346a494fcb54b7f9fbc7ed4fb93003a8489231

Comment by Peter Jones [ 14/Feb/20 ]

Landed for 2.14
