[LU-16639] job_stat_exit() should not have any items Created: 13/Mar/23 Updated: 10/Nov/23 Resolved: 29/Mar/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
It looks like a patch has somehow resulted in a leak of job_stats on the server that result in a message being printed at unmount time: [ 1095.059674] Lustre: Failing over lustre-MDT0000 [ 1095.090215] LustreError: 31568:0:(lprocfs_jobstats.c:137:job_stat_exit()) should not have any items [ 1095.093184] LustreError: 31568:0:(lprocfs_jobstats.c:137:job_stat_exit()) Skipped 20 previous similar messages [ 1095.146954] Lustre: server umount lustre-MDT0000 complete This started around 2022-03-03 but I haven't isolated it to a specific patch yet. |
| Comments |
| Comment by Andreas Dilger [ 13/Mar/23 ] |
|
According to this Kibana search, it looks like the first reports of this job_stat_exit() message are actually back on 2022-03-03. That was when 2.15.0 was released, and most patches have parent commit v2_15_0-RC2-2-g94f4e1f517 " |
| Comment by Gerrit Updater [ 13/Mar/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50283 |
| Comment by Andreas Dilger [ 13/Mar/23 ] |
|
I've pushed the above patch which improves the job_stats_exit() message to print the jn_name field. In theory, the lprocfs_job_cleanup() function should have freed all of the job stats before the cfs_hash_putref() I'm not sure if it is related, but there was an autotest patch landed just before the problem appeared that could potentially be related: commit dbf96c013a17af99b9968f7c79e49f949c33e881
Author: Charlie Olmstead <charlie@whamcloud.com>
CommitDate: Tue Mar 1 17:58:17 2022 +0000
ATM-2317 - Monitor test framework process
Added logic to monitor the status of the test framework by the presence of
<shared_log_dir/run_test.sh.pid. The presence of that file signals to AT
that the test framework is running. This allows AT to monitor the status of
the framework without logging into the test controller every X seconds. The
pid file is deleted when TestStep.runtestscript exits. because of the newly
created wrapper. The wrapper has 2 functions:
- Execute TestStep.runtestscript in a process group so it and all
sub-processes can be killed at once by AT
- Remove <shared_log_dir>/run_test.sh.pid file when TestStep.runtestscript
exits
In addition to the monitoring I've ported over the dynamic execute/rexec
logic from loadjenkinsbuild. It allows callers to dynamically build
execution commands by changing the method name called: e.g. rexec_no_retry,
rexec_no_retry_data, rexec_status_no_retry!, etc.
Also:
- Removed the unnecessary reboot_without_install? check in the provisioner.
This is a relic of old-school provisioning
- Improved selinux/firewall enabling/disabling logic
- Paritioner: removed logic that runs 'ls /dev' if pvcreate fails. I looked
through all of the logs and it has not happened nor have we had any
sessions that have failed with 'pvcreate failed, unable to continue'
since I can remember. The rest of the changes to this file are shortening
of long lines.
Change-Id: I5f960bf5210e90a37aa66cdac7a3d260632a5279
Reviewed-on: https://review.whamcloud.com/46528
|
| Comment by Alex Zhuravlev [ 14/Mar/23 ] |
|
I did bisection: COMMIT TESTED PASSED FAILED COMMIT DESCRIPTION 9cf4dddd52 1 0 1 BAD LU-14831 osd-ldiskfs: uninited osd_inode_id is used 3a9176d5bb 1 0 1 BAD LU-14174 lfs: llapi_mirror_find() signed return 3a83078628 1 0 1 BAD LU-14409 ldiskfs: Add support for SUSE 5.3.18-24.46.1 c83304607a 8 0 8 BAD LU-12391 tests: mdsrate tests improvements ea2cd3af7b 8 0 8 BAD LU-11407 obdclass: add start time to stats files 9a5bace55a 10 10 0 GOOD LU-15115 ptlrpc: recalc timer on EINPROGRESS reply 3038917f12 10 10 0 GOOD LU-2084 lnet: don't retry allocating router buffers |
| Comment by Andreas Dilger [ 14/Mar/23 ] |
|
Thanks Alex. It looks like I messed up the final job stats cleanup in lprocfs_job_cleanup() in that patch. |
| Comment by Gerrit Updater [ 28/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50283/ |
| Comment by Peter Jones [ 29/Mar/23 ] |
|
Landed for 2.16 |