[LU-12865] sanity test 160f fails with ‘mds1: User cl6 not registered’ Created: 15/Oct/19  Updated: 15/Jan/24  Resolved: 18/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10680 MDT becoming unresponsive in 2.10.3 Resolved
is related to LU-12871 enable changelog garbage collection b... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_160f fails with ‘mds1: User cl6 not registered’. So far this year, there have been 52 sanity test 160f failures with this error; 36 of those failures are for ARM clients.

Looking at the suite_log for a recent failure, https://testing.whamcloud.com/test_sets/8bedad40-ebd5-11e9-b62b-52540065bddc, we see that user cl6 is registered and that we are able to manipulate the changelog register prior to the error

== sanity test 160f: changelog garbage collect (timestamped users) =================================== 20:27:47 (1570566467)
CMD: trevis-49vm2 /usr/sbin/lctl get_param mdd.lustre-MDT0000.changelog_mask -n
CMD: trevis-49vm2 /usr/sbin/lctl set_param mdd.lustre-MDT0000.changelog_mask=+hsm
mdd.lustre-MDT0000.changelog_mask=+hsm
CMD: trevis-49vm2 /usr/sbin/lctl --device lustre-MDT0000 changelog_register -n
CMD: trevis-49vm3 /usr/sbin/lctl get_param mdd.lustre-MDT0001.changelog_mask -n
CMD: trevis-49vm3 /usr/sbin/lctl set_param mdd.lustre-MDT0001.changelog_mask=+hsm
mdd.lustre-MDT0001.changelog_mask=+hsm
CMD: trevis-49vm3 /usr/sbin/lctl --device lustre-MDT0001 changelog_register -n
CMD: trevis-49vm2 /usr/sbin/lctl get_param mdd.lustre-MDT0002.changelog_mask -n
CMD: trevis-49vm2 /usr/sbin/lctl set_param mdd.lustre-MDT0002.changelog_mask=+hsm
mdd.lustre-MDT0002.changelog_mask=+hsm
CMD: trevis-49vm2 /usr/sbin/lctl --device lustre-MDT0002 changelog_register -n
CMD: trevis-49vm3 /usr/sbin/lctl get_param mdd.lustre-MDT0003.changelog_mask -n
CMD: trevis-49vm3 /usr/sbin/lctl set_param mdd.lustre-MDT0003.changelog_mask=+hsm
mdd.lustre-MDT0003.changelog_mask=+hsm
CMD: trevis-49vm3 /usr/sbin/lctl --device lustre-MDT0003 changelog_register -n
Registered 4 changelog users: 'cl6 cl6 cl6 cl6'
…
mds1: verifying user cl6 clear:  19 + 2 == 21
CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users
CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users
CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users
lustre-MDT0001: clear the changelog for cl6 to record #10
CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users
mds2: verifying user cl6 clear:  8 + 2 == 10
CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users
CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users
CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users
lustre-MDT0002: clear the changelog for cl6 to record #2
CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users
mds3: verifying user cl6 clear:  0 + 2 == 2
CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users
CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
lustre-MDT0003: clear the changelog for cl6 to record #2
CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
mds4: verifying user cl6 clear:  0 + 2 == 2
CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
total: 8 create in 0.02 seconds: 453.39 ops/second
CMD: trevis-49vm2 ps -e -o comm= | grep chlg_gc_thread
pdsh@trevis-79vm17: trevis-49vm2: ssh exited with exit code 1
CMD: trevis-49vm2 ps -e -o comm= | grep chlg_gc_thread
pdsh@trevis-79vm17: trevis-49vm2: ssh exited with exit code 1
CMD: trevis-49vm3 ps -e -o comm= | grep chlg_gc_thread
pdsh@trevis-79vm17: trevis-49vm3: ssh exited with exit code 1
CMD: trevis-49vm3 ps -e -o comm= | grep chlg_gc_thread
pdsh@trevis-79vm17: trevis-49vm3: ssh exited with exit code 1
CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users
 sanity test_160f: @@@@@@ FAIL: mds1: User cl6 not registered 

There is no indication of a problem in any of the console logs.

Logs for other failures are at
https://testing.whamcloud.com/test_sets/ecc72250-ccfd-11e9-a25b-52540065bddc
https://testing.whamcloud.com/test_sets/8da05bea-cff3-11e9-9fc9-52540065bddc



 Comments   
Comment by Andreas Dilger [ 17/Oct/19 ]

It looks like this test failure might be a problem in the test script itself? Looking at the MDT0000 debug log shows that the cl6 user is deregistered by the GC thread because it is idle for too long:

1570566291.903442:0:27506:0:(mdd_trans.c:192:mdd_chlg_garbage_collect()) lustre-MDD0000: Force deregister of ChangeLog user cl7 idle since more than 35s
1570566291.903509:0:27506:0:(mdd_trans.c:192:mdd_chlg_garbage_collect()) lustre-MDD0000: Force deregister of ChangeLog user cl6 idle since more than 11s

This is because the test sets changelog_max_idle_time=10, but sleeps for 9s and then does a number of operations that could be slow. In rare cases the test ran too long and the MDS evicted the "good" along with the bad one.

Patch to fix the test forthcoming.

Comment by Gerrit Updater [ 17/Oct/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36468
Subject: LU-12865 tests: fix sanity 160f to be more robust
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7e2a15299f1180d8b7026f46886c85dbf4dacd7b

Comment by Gerrit Updater [ 18/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36468/
Subject: LU-12865 tests: fix sanity 160f to be more robust
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4b0f0164c6ed761897409186376e9edc989323c9

Comment by Gerrit Updater [ 04/Jun/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38833
Subject: LU-12865 tests: fix sanity 160f to be more robust
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 011b8e3eac372032dd8f09576239e7a05d496251

Comment by Gerrit Updater [ 11/Jul/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38833/
Subject: LU-12865 tests: fix sanity 160f to be more robust
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: fdc1320ea1a817182291ec67394a903b2e0a4911

Generated at Sat Feb 10 02:56:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.