[LU-16639] job_stat_exit() should not have any items Created: 13/Mar/23  Updated: 10/Nov/23  Resolved: 29/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16205 fid2path for encrypted files Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It looks like a patch has somehow resulted in a leak of job_stats on the server that result in a message being printed at unmount time:

[ 1095.059674] Lustre: Failing over lustre-MDT0000
[ 1095.090215] LustreError: 31568:0:(lprocfs_jobstats.c:137:job_stat_exit()) should not have any items
[ 1095.093184] LustreError: 31568:0:(lprocfs_jobstats.c:137:job_stat_exit()) Skipped 20 previous similar messages
[ 1095.146954] Lustre: server umount lustre-MDT0000 complete

This started around 2022-03-03 but I haven't isolated it to a specific patch yet.



 Comments   
Comment by Andreas Dilger [ 13/Mar/23 ]

According to this Kibana search, it looks like the first reports of this job_stat_exit() message are actually back on 2022-03-03.

That was when 2.15.0 was released, and most patches have parent commit v2_15_0-RC2-2-g94f4e1f517 "LU-15512 lnet: Stop discovery on deleted peer NI", which was the only patch that had been landed in a couple of weeks. However, there are a few patches in an LNet series https://review.whamcloud.com/#/c/fs/lustre-release/+/46653/5 that were run on 2022-03-03 that hit the problem, BUT were based on an earlier parent commit, so this might indicate that the issue is at least partly related to the test environment and not the patches themselves.

Comment by Gerrit Updater [ 13/Mar/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50283
Subject: LU-16639 misc: improve console error messages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c50c6467958f4287fbcc3abec84e345b9cbf4c24

Comment by Andreas Dilger [ 13/Mar/23 ]

I've pushed the above patch which improves the job_stats_exit() message to print the jn_name field.

In theory, the lprocfs_job_cleanup() function should have freed all of the job stats before the cfs_hash_putref()>cfs_hash_destroy()>job_stat_exit() path cleans up the remaining stats. It looks like the stats are freed by calling cfs_hash_put_locked()->job_free().

I'm not sure if it is related, but there was an autotest patch landed just before the problem appeared that could potentially be related:

commit dbf96c013a17af99b9968f7c79e49f949c33e881
Author:     Charlie Olmstead <charlie@whamcloud.com>
CommitDate: Tue Mar 1 17:58:17 2022 +0000

    ATM-2317 - Monitor test framework process
    
    Added logic to monitor the status of the test framework by the presence of
    <shared_log_dir/run_test.sh.pid. The presence of that file signals to AT
    that the test framework is running. This allows AT to monitor the status of
    the framework without logging into the test controller every X seconds. The
    pid file is deleted when TestStep.runtestscript exits. because of the newly
    created wrapper. The wrapper has 2 functions:
    - Execute TestStep.runtestscript in a process group so it and all
      sub-processes can be killed at once by AT
    - Remove <shared_log_dir>/run_test.sh.pid file when TestStep.runtestscript
      exits
    
    In addition to the monitoring I've ported over the dynamic execute/rexec
    logic from loadjenkinsbuild. It allows callers to dynamically build
    execution commands by changing the method name called: e.g. rexec_no_retry,
    rexec_no_retry_data, rexec_status_no_retry!, etc.
    
    Also:
    - Removed the unnecessary reboot_without_install? check in the provisioner.
      This is a relic of old-school provisioning
    - Improved selinux/firewall enabling/disabling logic
    - Paritioner: removed logic that runs 'ls /dev' if pvcreate fails. I looked
      through all of the logs and it has not happened nor have we had any
      sessions that have failed with 'pvcreate failed, unable to continue'
      since I can remember. The rest of the changes to this file are shortening
      of long lines.
    
    Change-Id: I5f960bf5210e90a37aa66cdac7a3d260632a5279
    Reviewed-on: https://review.whamcloud.com/46528
Comment by Alex Zhuravlev [ 14/Mar/23 ]

I did bisection:

COMMIT          TESTED  PASSED  FAILED          COMMIT DESCRIPTION
9cf4dddd52      1       0       1       BAD     LU-14831 osd-ldiskfs: uninited osd_inode_id is used
3a9176d5bb      1       0       1       BAD     LU-14174 lfs: llapi_mirror_find() signed return
3a83078628      1       0       1       BAD     LU-14409 ldiskfs: Add support for SUSE 5.3.18-24.46.1
c83304607a      8       0       8       BAD     LU-12391 tests: mdsrate tests improvements
ea2cd3af7b      8       0       8       BAD     LU-11407 obdclass: add start time to stats files
9a5bace55a      10      10      0       GOOD    LU-15115 ptlrpc: recalc timer on EINPROGRESS reply
3038917f12      10      10      0       GOOD    LU-2084 lnet: don't retry allocating router buffers
Comment by Andreas Dilger [ 14/Mar/23 ]

Thanks Alex. It looks like I messed up the final job stats cleanup in lprocfs_job_cleanup() in that patch.

Comment by Gerrit Updater [ 28/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50283/
Subject: LU-16639 misc: cleanup concole messages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8f40a3d7110da1af8e310a4b7f40b86f13080938

Comment by Peter Jones [ 29/Mar/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:28:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.