[LU-11161] sanity test_160g fails for DNE with 'mds2: User cl9 still registered' Created: 19/Jul/18 Updated: 27/Jul/22 Resolved: 30/Jan/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | James Nunez (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | DNE | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
test_160g failed with the following error:

'mds2: User cl9 still registered'

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d2eee862-8ad3-11e8-9e83-52540065bddc

This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

The patch for

Looking at the MDS console logs for the test session mentioned above, we see the following. On the first MDS with MDT0 and MDT2 we can clearly see that the changelog user cl9 is deregistered:

[ 6260.597934] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users
[ 6260.922657] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users
[ 6264.392672] Lustre: 8849:0:(mdd_trans.c:187:mdd_chlg_garbage_collect()) lustre-MDD0000: Force deregister of ChangeLog user cl9 idle with more than 4 unprocessed records
[ 6264.600644] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread
[ 6264.928866] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread
[ 6265.932195] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users
[ 6266.259457] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users
[ 6266.586909] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users

On the other MDS with MDT1 and MDT3, we don't see the same user deregistered. In fact we don't see any users deregistered:

[ 6261.587345] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
[ 6261.913956] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
[ 6262.237009] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
[ 6265.271455] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread
[ 6265.602819] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread
[ 6266.927027] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users
[ 6267.246822] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users
[ 6267.685450] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_160g: @@@@@@ FAIL: mds2: User cl9 still registered |
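The comparison of the two console logs can be automated. The following is a minimal sketch, not part of the Lustre test framework: the function name deregistered_users and the regular expression are hypothetical, and the pattern is modeled only on the single "Force deregister" line quoted in the description, so other Lustre versions may word the message differently.

```python
import re
from typing import Set, Tuple

# Pattern modeled on the one "Force deregister" console line quoted in this
# ticket; the exact message format in other releases is an assumption.
FORCE_DEREG = re.compile(
    r"(?P<device>\S+-MDD[0-9a-fA-F]{4}): "
    r"Force deregister of ChangeLog user (?P<user>cl\d+)"
)


def deregistered_users(console_log: str) -> Set[Tuple[str, str]]:
    """Return the (device, user) pairs the log reports as force-deregistered."""
    return {
        (m.group("device"), m.group("user"))
        for m in FORCE_DEREG.finditer(console_log)
    }


# Excerpts copied from the description of this ticket.
MDS1_CONSOLE = """\
[ 6260.922657] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users
[ 6264.392672] Lustre: 8849:0:(mdd_trans.c:187:mdd_chlg_garbage_collect()) lustre-MDD0000: Force deregister of ChangeLog user cl9 idle with more than 4 unprocessed records
[ 6264.600644] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread
"""

MDS2_CONSOLE = """\
[ 6262.237009] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
[ 6265.271455] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread
[ 6267.246822] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users
"""

print(deregistered_users(MDS1_CONSOLE))  # {('lustre-MDD0000', 'cl9')}
print(deregistered_users(MDS2_CONSOLE))  # set()
```

On these excerpts the first MDS reports cl9 deregistered on lustre-MDD0000 only, and the second MDS reports nothing, which matches the reading of the logs above.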
| Comments |
| Comment by Gerrit Updater [ 19/Jul/18 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32844 |
| Comment by Gerrit Updater [ 20/Jul/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32844/ |
| Comment by Andreas Dilger [ 24/Jul/18 ] |
|
I think this is a problem with the test script - it is not causing the user to be evicted on the other MDTs when the test expects this. What should be happening is that there are some logs created on all of the MDTs (create some files or whatever generates a change log entry), the gc timeout is reduced to some short interval, one user consumes the pending log records, the test sleeps longer than the interval, some new logs are created, and then the idle user that did not consume the log records is evicted. This is happening correctly on one MDT, but not on the others. It may be that there are no new records created, or we didn't wait long enough on those MDTs. |
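The two hypotheses in the comment above (no new records on some MDTs, or too short a wait) can be illustrated with a toy model. This is a sketch under stated assumptions and not Lustre code: ToyMdt, idle_user_survivors, and the 10-second timeout are hypothetical, time is passed in explicitly instead of sleeping, and the model assumes a garbage-collection pass runs only when a record is added on that MDT.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class ToyMdt:
    """Toy stand-in for one MDT's changelog state; not Lustre code."""

    max_idle_time: float  # hypothetical gc timeout, in seconds
    last_active: Dict[str, float] = field(default_factory=dict)

    def register(self, user: str, now: float) -> None:
        self.last_active[user] = now

    def add_record(self, now: float) -> None:
        """Assumption: garbage collection runs only when a record is added."""
        for user, seen in list(self.last_active.items()):
            if now - seen > self.max_idle_time:
                del self.last_active[user]


def idle_user_survivors(
    n_mdts: int, wait: float, mdts_with_new_records: Iterable[int]
) -> List[bool]:
    """Return, per MDT, whether the idle user is still registered at the end."""
    mdts = [ToyMdt(max_idle_time=10.0) for _ in range(n_mdts)]
    for mdt in mdts:
        mdt.register("idle", now=0.0)
        mdt.add_record(now=1.0)  # initial records on every MDT
    for i in mdts_with_new_records:  # after the wait, records only where listed
        mdts[i].add_record(now=1.0 + wait)
    return ["idle" in mdt.last_active for mdt in mdts]


# New records on every MDT after a long wait: the idle user is evicted everywhere.
print(idle_user_survivors(4, 20.0, [0, 1, 2, 3]))  # [False, False, False, False]
# New records only on MDT0: the idle user survives on the other MDTs.
print(idle_user_survivors(4, 20.0, [0]))  # [False, True, True, True]
# Wait shorter than the timeout: nobody is evicted.
print(idle_user_survivors(4, 5.0, [0, 1, 2, 3]))  # [True, True, True, True]
```

The second case reproduces the symptom in this ticket, where the user is evicted on one MDT but is still registered on the others.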
| Comment by Gerrit Updater [ 08/Jan/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33994 |
| Comment by James Nunez (Inactive) [ 10/Jan/19 ] |
|
I think the issue with this test is that we are not exceeding changelog_max_idle_indexes on each MDT; we were only meeting the threshold. In my testing, if I reduce changelog_max_idle_indexes by 1 or write one more file per MDT, the test passes in a DNE environment. I'll update patch https://review.whamcloud.com/33994 to reflect this. |
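The difference between meeting and exceeding the threshold can be shown with a small sketch. It assumes a strict greater-than comparison, inferred from the "idle with more than 4 unprocessed records" console message and from the behavior described in the comment above; the function exceeds_idle_indexes is hypothetical, not the check in mdd_chlg_garbage_collect(), and one record per file is a simplification.

```python
def exceeds_idle_indexes(last_index: int, user_index: int, max_idle_indexes: int) -> bool:
    """True when the user's unprocessed backlog is strictly above the limit.

    Assumption: the comparison is strict ("more than"), as the console
    message in this ticket suggests.
    """
    backlog = last_index - user_index
    return backlog > max_idle_indexes


# Backlog equal to the limit: the user only meets the threshold and stays.
print(exceeds_idle_indexes(last_index=4, user_index=0, max_idle_indexes=4))  # False
# One more record per MDT: the backlog exceeds the limit.
print(exceeds_idle_indexes(last_index=5, user_index=0, max_idle_indexes=4))  # True
# Or the limit reduced by 1 with the same backlog.
print(exceeds_idle_indexes(last_index=4, user_index=0, max_idle_indexes=3))  # True
```

Either adjustment moves the backlog from equal to the limit to above it, which is consistent with both workarounds making the test pass.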
| Comment by Gerrit Updater [ 30/Jan/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33994/ |
| Comment by James Nunez (Inactive) [ 30/Jan/19 ] |
|
Landed for 2.13.0 |
| Comment by Gerrit Updater [ 25/Feb/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34297 |
| Comment by Gerrit Updater [ 19/Mar/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34297/ |