Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.17.0

    Description

      When the number of jobs becomes very large (> 100k), walking the list of jobs (ojs_list) gets slow. Restarts are frequent due to reallocation, and the pos requested on restart may not be the last_job, in which case the list has to be walked again to reach it.

      Save every n'th pos (here n is 512) so that the maximum walk on restart is 512 entries.
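
      The idea above can be sketched in userspace C. This is only an illustration under stated assumptions, not the code from the merged patch: struct job, struct job_list, job_list_append() and job_list_seek() are hypothetical names, the list is append-only, and locking, refcounting and job removal are ignored. It shows how saving a pointer at every 512th position bounds the walk needed to reach an arbitrary pos.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CKPT_INTERVAL 512 /* save every 512th position, as in the description */

/* Hypothetical stand-in for one job entry on ojs_list. */
struct job {
	long id;
	struct job *next;
};

/* Hypothetical container: the list plus one saved node per CKPT_INTERVAL. */
struct job_list {
	struct job *head;
	struct job *tail;
	size_t count;
	struct job **ckpt;	/* ckpt[i] is the node at position i * CKPT_INTERVAL */
	size_t ckpt_cap;
};

/* Append a job; record a checkpoint when its position is a multiple of 512. */
static int job_list_append(struct job_list *jl, long id)
{
	struct job *job = calloc(1, sizeof(*job));

	if (!job)
		return -1;
	job->id = id;

	if (jl->count % CKPT_INTERVAL == 0) {
		size_t idx = jl->count / CKPT_INTERVAL;

		if (idx >= jl->ckpt_cap) {
			size_t cap = jl->ckpt_cap ? jl->ckpt_cap * 2 : 16;
			struct job **tmp = realloc(jl->ckpt, cap * sizeof(*tmp));

			if (!tmp) {
				free(job);
				return -1;
			}
			jl->ckpt = tmp;
			jl->ckpt_cap = cap;
		}
		jl->ckpt[idx] = job;
	}

	if (jl->tail)
		jl->tail->next = job;
	else
		jl->head = job;
	jl->tail = job;
	jl->count++;
	return 0;
}

/*
 * Return the job at position pos, or NULL if pos is out of range.
 * The walk starts at the nearest checkpoint at or before pos, so fewer
 * than CKPT_INTERVAL links are followed. The number of links followed
 * is reported through *steps (illustration only).
 */
static struct job *job_list_seek(const struct job_list *jl, size_t pos,
				 size_t *steps)
{
	struct job *job;
	size_t walked = 0;

	if (steps)
		*steps = 0;
	if (pos >= jl->count)
		return NULL;

	job = jl->ckpt[pos / CKPT_INTERVAL];
	for (size_t i = pos % CKPT_INTERVAL; i > 0; i--) {
		job = job->next;
		walked++;
	}
	if (steps)
		*steps = walked;
	return job;
}

/* Release every node and the checkpoint array. */
static void job_list_free(struct job_list *jl)
{
	struct job *job = jl->head;

	while (job) {
		struct job *next = job->next;

		free(job);
		job = next;
	}
	free(jl->ckpt);
	jl->head = jl->tail = NULL;
	jl->ckpt = NULL;
	jl->count = jl->ckpt_cap = 0;
}
```

      With 100k entries, a seek from the head would follow up to 99999 links, while the checkpointed seek follows at most 511. A real implementation would also have to keep the saved positions valid when jobs are removed or the list is modified concurrently, which this sketch does not attempt.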

          Activity

            [LU-18351] Job stats scaling

            just hit this locally:

            [ 5405.394587] Lustre: DEBUG MARKER: == sanity test 205g: stress test for job_stats procfile == 06:16:13 (1740550573)
            [ 5496.225192] LustreError: 310746:0:(lprocfs_jobstats.c:116:job_putref()) ASSERTION( kref_read(&job->js_refcount) > 0 ) failed: 
            [ 5496.225504] LustreError: 310746:0:(lprocfs_jobstats.c:116:job_putref()) LBUG
            [ 5496.225576] CPU: 1 PID: 310746 Comm: lctl Tainted: G        W  O     --------- -  - 4.18.0 #12
            [ 5496.225657] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
            [ 5496.226072] Call Trace:
            [ 5496.226115]  dump_stack+0x6e/0xa0
            [ 5496.226158]  lbug_with_loc.cold.4+0x5/0x4e [libcfs]
            [ 5496.226221]  job_putref+0xa6/0xe0 [obdclass]
            [ 5496.226317]  lprocfs_jobstats_seq_show+0x2d1/0x520 [obdclass]
            [ 5496.226421]  seq_read+0x14e/0x3e0
            [ 5496.226460]  proc_reg_read+0x31/0x50
            [ 5496.226497]  vfs_read+0xa1/0x150
            [ 5496.226533]  ksys_read+0x3d/0xa0
            [ 5496.226568]  do_syscall_64+0x4b/0x1b0
            [ 5496.226605]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
            [ 5496.226651] RIP: 0033:0x7f456bdc69b2
            [ 5496.226692] Code: 96 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 96 da 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
            [ 5496.226842] RSP: 002b:00007ffcfa51c918 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
            [ 5496.226909] RAX: ffffffffffffffda RBX: 00000000006ae5c0 RCX: 00007f456bdc69b2
            [ 5496.226975] RDX: 0000000000001000 RSI: 00000000006ae5c0 RDI: 0000000000000003
            [ 5496.227041] RBP: 0000000000001000 R08: 0000000000000000 R09: 00007f456c1c3220
            [ 5496.227108] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000003
            [ 5496.227178] R13: 00007f456c1f7d40 R14: 00007ffcfa51cab0 R15: 0000000000000e31
            
            Comment above added by bzzz Alex Zhuravlev.

            just hit this on clean master:

            [ 5302.557213] Lustre: DEBUG MARKER: == sanity test 205g: stress test for job_stats procfile == 00:21:32 (1735777292)
            [ 5393.581798] LustreError: 303135:0:(lprocfs_jobstats.c:133:job_putref()) ASSERTION( kref_read(&job->js_refcount) > 0 ) failed: 
            [ 5393.581997] LustreError: 303135:0:(lprocfs_jobstats.c:133:job_putref()) LBUG
            [ 5393.582044] CPU: 1 PID: 303135 Comm: lctl Tainted: G        W  O     --------- -  - 4.18.0 #11
            [ 5393.582084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
            [ 5393.582124] Call Trace:
            [ 5393.582161]  dump_stack+0x6e/0xa0
            [ 5393.582189]  lbug_with_loc.cold.4+0x5/0x63 [libcfs]
            [ 5393.582221]  job_putref+0xa6/0xe0 [obdclass]
            [ 5393.582297]  lprocfs_jobstats_seq_show+0x2d1/0x520 [obdclass]
            [ 5393.582374]  seq_read+0x2c8/0x3e0
            [ 5393.582398]  proc_reg_read+0x31/0x50
            [ 5393.582421]  vfs_read+0xa1/0x150
            [ 5393.582441]  ksys_read+0x3d/0xa0
            [ 5393.582462]  do_syscall_64+0x4b/0x1b0
            [ 5393.582483]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
            [ 5393.582511] RIP: 0033:0x7fe718b459b2
            
            Comment above added by bzzz Alex Zhuravlev.

            pjones Peter Jones added a comment -

            Merged for 2.17

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56607/
            Subject: LU-18351 obdclass: jobstat scaling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cad59b9b72fc72182574f4ce782e12d1e98f84fd

            Comment above added by gerrit Gerrit Updater.

            gerrit Gerrit Updater added a comment - edited

            Abandoned:
            "Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57042
            Subject: LU-18351 obdclass: debug jobstat scaling race
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 69e3dbe6e19809ffc2383f0690ca077953cc8bde

            Since the current work replaces the cfs_hash job stats implementation.

            Comment above added by simmonsja James A Simmons.

            "Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56607
            Subject: LU-18351 obdclass: jobstat scaling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5a3137c566b1561518553e674b0ec7f26fe850f8

            Comment above added by gerrit Gerrit Updater.

            People

              stancheff Shaun Tancheff