Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-16639

job_stat_exit() should not have any items

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0, Lustre 2.15.6
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      It looks like a patch has somehow resulted in a leak of job_stats on the server that result in a message being printed at unmount time:

      [ 1095.059674] Lustre: Failing over lustre-MDT0000
      [ 1095.090215] LustreError: 31568:0:(lprocfs_jobstats.c:137:job_stat_exit()) should not have any items
      [ 1095.093184] LustreError: 31568:0:(lprocfs_jobstats.c:137:job_stat_exit()) Skipped 20 previous similar messages
      [ 1095.146954] Lustre: server umount lustre-MDT0000 complete
      

      This started around 2022-03-03 but I haven't isolated it to a specific patch yet.

      Attachments

        Issue Links

          Activity

            [LU-16639] job_stat_exit() should not have any items

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56727/
            Subject: LU-16639 misc: cleanup concole messages
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 3b63e032cfd3b98adae97be4d262845eaa7d2bb9

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56727/ Subject: LU-16639 misc: cleanup concole messages Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 3b63e032cfd3b98adae97be4d262845eaa7d2bb9

            "Eric Carbonneau <carbonneau1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56727
            Subject: LU-16639 misc: cleanup concole messages
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: a3c7f67367003176f4627d918b9d6f81ef977fea

            gerrit Gerrit Updater added a comment - "Eric Carbonneau <carbonneau1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56727 Subject: LU-16639 misc: cleanup concole messages Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: a3c7f67367003176f4627d918b9d6f81ef977fea
            pjones Peter Jones added a comment -

            Landed for 2.16

            pjones Peter Jones added a comment - Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50283/
            Subject: LU-16639 misc: cleanup concole messages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8f40a3d7110da1af8e310a4b7f40b86f13080938

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50283/ Subject: LU-16639 misc: cleanup concole messages Project: fs/lustre-release Branch: master Current Patch Set: Commit: 8f40a3d7110da1af8e310a4b7f40b86f13080938

            Thanks Alex. It looks like I messed up the final job stats cleanup in lprocfs_job_cleanup() in that patch.

            adilger Andreas Dilger added a comment - Thanks Alex. It looks like I messed up the final job stats cleanup in lprocfs_job_cleanup() in that patch.

            I did bisection:

            COMMIT          TESTED  PASSED  FAILED          COMMIT DESCRIPTION
            9cf4dddd52      1       0       1       BAD     LU-14831 osd-ldiskfs: uninited osd_inode_id is used
            3a9176d5bb      1       0       1       BAD     LU-14174 lfs: llapi_mirror_find() signed return
            3a83078628      1       0       1       BAD     LU-14409 ldiskfs: Add support for SUSE 5.3.18-24.46.1
            c83304607a      8       0       8       BAD     LU-12391 tests: mdsrate tests improvements
            ea2cd3af7b      8       0       8       BAD     LU-11407 obdclass: add start time to stats files
            9a5bace55a      10      10      0       GOOD    LU-15115 ptlrpc: recalc timer on EINPROGRESS reply
            3038917f12      10      10      0       GOOD    LU-2084 lnet: don't retry allocating router buffers
            
            bzzz Alex Zhuravlev added a comment - I did bisection: COMMIT TESTED PASSED FAILED COMMIT DESCRIPTION 9cf4dddd52 1 0 1 BAD LU-14831 osd-ldiskfs: uninited osd_inode_id is used 3a9176d5bb 1 0 1 BAD LU-14174 lfs: llapi_mirror_find() signed return 3a83078628 1 0 1 BAD LU-14409 ldiskfs: Add support for SUSE 5.3.18-24.46.1 c83304607a 8 0 8 BAD LU-12391 tests: mdsrate tests improvements ea2cd3af7b 8 0 8 BAD LU-11407 obdclass: add start time to stats files 9a5bace55a 10 10 0 GOOD LU-15115 ptlrpc: recalc timer on EINPROGRESS reply 3038917f12 10 10 0 GOOD LU-2084 lnet: don't retry allocating router buffers
            adilger Andreas Dilger added a comment - - edited

            I've pushed the above patch which improves the job_stats_exit() message to print the jn_name field.

            In theory, the lprocfs_job_cleanup() function should have freed all of the job stats before the cfs_hash_putref()>cfs_hash_destroy()>job_stat_exit() path cleans up the remaining stats. It looks like the stats are freed by calling cfs_hash_put_locked()->job_free().

            I'm not sure if it is related, but there was an autotest patch landed just before the problem appeared that could potentially be related:

            commit dbf96c013a17af99b9968f7c79e49f949c33e881
            Author:     Charlie Olmstead <charlie@whamcloud.com>
            CommitDate: Tue Mar 1 17:58:17 2022 +0000
            
                ATM-2317 - Monitor test framework process
                
                Added logic to monitor the status of the test framework by the presence of
                <shared_log_dir/run_test.sh.pid. The presence of that file signals to AT
                that the test framework is running. This allows AT to monitor the status of
                the framework without logging into the test controller every X seconds. The
                pid file is deleted when TestStep.runtestscript exits. because of the newly
                created wrapper. The wrapper has 2 functions:
                - Execute TestStep.runtestscript in a process group so it and all
                  sub-processes can be killed at once by AT
                - Remove <shared_log_dir>/run_test.sh.pid file when TestStep.runtestscript
                  exits
                
                In addition to the monitoring I've ported over the dynamic execute/rexec
                logic from loadjenkinsbuild. It allows callers to dynamically build
                execution commands by changing the method name called: e.g. rexec_no_retry,
                rexec_no_retry_data, rexec_status_no_retry!, etc.
                
                Also:
                - Removed the unnecessary reboot_without_install? check in the provisioner.
                  This is a relic of old-school provisioning
                - Improved selinux/firewall enabling/disabling logic
                - Paritioner: removed logic that runs 'ls /dev' if pvcreate fails. I looked
                  through all of the logs and it has not happened nor have we had any
                  sessions that have failed with 'pvcreate failed, unable to continue'
                  since I can remember. The rest of the changes to this file are shortening
                  of long lines.
                
                Change-Id: I5f960bf5210e90a37aa66cdac7a3d260632a5279
                Reviewed-on: https://review.whamcloud.com/46528
            
            adilger Andreas Dilger added a comment - - edited I've pushed the above patch which improves the job_stats_exit() message to print the jn_name field. In theory, the lprocfs_job_cleanup() function should have freed all of the job stats before the cfs_hash_putref() >cfs_hash_destroy() >job_stat_exit() path cleans up the remaining stats. It looks like the stats are freed by calling cfs_hash_put_locked()->job_free() . I'm not sure if it is related, but there was an autotest patch landed just before the problem appeared that could potentially be related: commit dbf96c013a17af99b9968f7c79e49f949c33e881 Author: Charlie Olmstead <charlie@whamcloud.com> CommitDate: Tue Mar 1 17:58:17 2022 +0000 ATM-2317 - Monitor test framework process Added logic to monitor the status of the test framework by the presence of <shared_log_dir/run_test.sh.pid. The presence of that file signals to AT that the test framework is running. This allows AT to monitor the status of the framework without logging into the test controller every X seconds. The pid file is deleted when TestStep.runtestscript exits. because of the newly created wrapper. The wrapper has 2 functions: - Execute TestStep.runtestscript in a process group so it and all sub-processes can be killed at once by AT - Remove <shared_log_dir>/run_test.sh.pid file when TestStep.runtestscript exits In addition to the monitoring I've ported over the dynamic execute/rexec logic from loadjenkinsbuild. It allows callers to dynamically build execution commands by changing the method name called: e.g. rexec_no_retry, rexec_no_retry_data, rexec_status_no_retry!, etc. Also: - Removed the unnecessary reboot_without_install? check in the provisioner. This is a relic of old-school provisioning - Improved selinux/firewall enabling/disabling logic - Paritioner: removed logic that runs 'ls /dev' if pvcreate fails. I looked through all of the logs and it has not happened nor have we had any sessions that have failed with 'pvcreate failed, unable to continue' since I can remember. The rest of the changes to this file are shortening of long lines. Change-Id: I5f960bf5210e90a37aa66cdac7a3d260632a5279 Reviewed-on: https://review.whamcloud.com/46528

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50283
            Subject: LU-16639 misc: improve console error messages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c50c6467958f4287fbcc3abec84e345b9cbf4c24

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50283 Subject: LU-16639 misc: improve console error messages Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: c50c6467958f4287fbcc3abec84e345b9cbf4c24

            According to this Kibana search, it looks like the first reports of this job_stat_exit() message are actually back on 2022-03-03.

            That was when 2.15.0 was released, and most patches have parent commit v2_15_0-RC2-2-g94f4e1f517 "LU-15512 lnet: Stop discovery on deleted peer NI", which was the only patch that had been landed in a couple of weeks. However, there are a few patches in an LNet series https://review.whamcloud.com/#/c/fs/lustre-release/+/46653/5 that were run on 2022-03-03 that hit the problem, BUT were based on an earlier parent commit, so this might indicate that the issue is at least partly related to the test environment and not the patches themselves.

            adilger Andreas Dilger added a comment - According to this Kibana search , it looks like the first reports of this job_stat_exit() message are actually back on 2022-03-03. That was when 2.15.0 was released, and most patches have parent commit v2_15_0-RC2-2-g94f4e1f517 " LU-15512 lnet: Stop discovery on deleted peer NI ", which was the only patch that had been landed in a couple of weeks. However, there are a few patches in an LNet series https://review.whamcloud.com/#/c/fs/lustre-release/+/46653/5 that were run on 2022-03-03 that hit the problem, BUT were based on an earlier parent commit, so this might indicate that the issue is at least partly related to the test environment and not the patches themselves.

            People

              adilger Andreas Dilger
              adilger Andreas Dilger
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: