[LU-12865] sanity test 160f fails with ‘mds1: User cl6 not registered’ Created: 15/Oct/19 Updated: 15/Jan/24 Resolved: 18/Jan/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.6 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
sanity test_160f fails with ‘mds1: User cl6 not registered’. So far this year, there have been 52 sanity test 160f failures with this error; 36 of those failures are for ARM clients. Looking at the suite_log for a recent failure, https://testing.whamcloud.com/test_sets/8bedad40-ebd5-11e9-b62b-52540065bddc, we see that user cl6 is registered and that we are able to manipulate the changelog register prior to the error == sanity test 160f: changelog garbage collect (timestamped users) =================================== 20:27:47 (1570566467) CMD: trevis-49vm2 /usr/sbin/lctl get_param mdd.lustre-MDT0000.changelog_mask -n CMD: trevis-49vm2 /usr/sbin/lctl set_param mdd.lustre-MDT0000.changelog_mask=+hsm mdd.lustre-MDT0000.changelog_mask=+hsm CMD: trevis-49vm2 /usr/sbin/lctl --device lustre-MDT0000 changelog_register -n CMD: trevis-49vm3 /usr/sbin/lctl get_param mdd.lustre-MDT0001.changelog_mask -n CMD: trevis-49vm3 /usr/sbin/lctl set_param mdd.lustre-MDT0001.changelog_mask=+hsm mdd.lustre-MDT0001.changelog_mask=+hsm CMD: trevis-49vm3 /usr/sbin/lctl --device lustre-MDT0001 changelog_register -n CMD: trevis-49vm2 /usr/sbin/lctl get_param mdd.lustre-MDT0002.changelog_mask -n CMD: trevis-49vm2 /usr/sbin/lctl set_param mdd.lustre-MDT0002.changelog_mask=+hsm mdd.lustre-MDT0002.changelog_mask=+hsm CMD: trevis-49vm2 /usr/sbin/lctl --device lustre-MDT0002 changelog_register -n CMD: trevis-49vm3 /usr/sbin/lctl get_param mdd.lustre-MDT0003.changelog_mask -n CMD: trevis-49vm3 /usr/sbin/lctl set_param mdd.lustre-MDT0003.changelog_mask=+hsm mdd.lustre-MDT0003.changelog_mask=+hsm CMD: trevis-49vm3 /usr/sbin/lctl --device lustre-MDT0003 changelog_register -n Registered 4 changelog users: 'cl6 cl6 cl6 cl6' … mds1: verifying user cl6 clear: 19 + 2 == 21 CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users lustre-MDT0001: clear the changelog for cl6 to record #10 CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users mds2: verifying user cl6 clear: 8 + 2 == 10 CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users lustre-MDT0002: clear the changelog for cl6 to record #2 CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users mds3: verifying user cl6 clear: 0 + 2 == 2 CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users lustre-MDT0003: clear the changelog for cl6 to record #2 CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users mds4: verifying user cl6 clear: 0 + 2 == 2 CMD: trevis-49vm3 /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users total: 8 create in 0.02 seconds: 453.39 ops/second CMD: trevis-49vm2 ps -e -o comm= | grep chlg_gc_thread pdsh@trevis-79vm17: trevis-49vm2: ssh exited with exit code 1 CMD: trevis-49vm2 ps -e -o comm= | grep chlg_gc_thread pdsh@trevis-79vm17: trevis-49vm2: ssh exited with exit code 1 CMD: trevis-49vm3 ps -e -o comm= | grep chlg_gc_thread pdsh@trevis-79vm17: trevis-49vm3: ssh exited with exit code 1 CMD: trevis-49vm3 ps -e -o comm= | grep chlg_gc_thread pdsh@trevis-79vm17: trevis-49vm3: ssh exited with exit code 1 CMD: trevis-49vm2 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users sanity test_160f: @@@@@@ FAIL: mds1: User cl6 not registered There is no indication of a problem in any of the console logs. Logs for other failures are at |
| Comments |
| Comment by Andreas Dilger [ 17/Oct/19 ] |
|
It looks like this test failure might be a problem in the test script itself? Looking at the MDT0000 debug log shows that the cl6 user is deregistered by the GC thread because it is idle for too long: 1570566291.903442:0:27506:0:(mdd_trans.c:192:mdd_chlg_garbage_collect()) lustre-MDD0000: Force deregister of ChangeLog user cl7 idle since more than 35s 1570566291.903509:0:27506:0:(mdd_trans.c:192:mdd_chlg_garbage_collect()) lustre-MDD0000: Force deregister of ChangeLog user cl6 idle since more than 11s This is because the test sets changelog_max_idle_time=10, but sleeps for 9s and then does a number of operations that could be slow. In rare cases the test ran too long and the MDS evicted the "good" along with the bad one. Patch to fix the test forthcoming. |
| Comment by Gerrit Updater [ 17/Oct/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36468 |
| Comment by Gerrit Updater [ 18/Jan/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36468/ |
| Comment by Gerrit Updater [ 04/Jun/20 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38833 |
| Comment by Gerrit Updater [ 11/Jul/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38833/ |