Details
- Improvement
- Resolution: Fixed
- Minor
Description
When the number of jobs grows very large (more than 100k), walking the list of jobs (ojs_list) restarts frequently due to reallocation, and the pos requested on restart may not be the last_job.
Save every n'th pos (here n is 512) so that the maximum walk on restart is 512 entries.
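The idea can be sketched in userspace C as follows. This is only an illustration, not the code from the patch: `struct job`, `struct job_walk_cache`, `cache_build()` and `cache_seek()` are hypothetical names, the list is a plain singly linked list rather than ojs_list, and the sketch ignores locking, reference counting, and invalidating saved pointers when a job is freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Every n'th pos is saved; 512 matches the value in the description. */
#define CKPT_INTERVAL 512

/* Hypothetical stand-in for one job stats entry. */
struct job {
	struct job *next;
	long id;
};

/* Hypothetical cache: ckpt[i] points at the node at pos i * CKPT_INTERVAL. */
struct job_walk_cache {
	struct job **ckpt;
	size_t nr;
};

/*
 * Walk the list once and remember every CKPT_INTERVAL'th node.
 * Returns 0 on success, -1 if the checkpoint array cannot be allocated.
 */
int cache_build(struct job_walk_cache *c, struct job *head, size_t count)
{
	size_t nr = (count + CKPT_INTERVAL - 1) / CKPT_INTERVAL;
	size_t pos = 0;
	struct job *j;

	c->nr = 0;
	c->ckpt = nr ? calloc(nr, sizeof(*c->ckpt)) : NULL;
	if (nr && !c->ckpt)
		return -1;

	for (j = head; j && pos < count; j = j->next, pos++) {
		if (pos % CKPT_INTERVAL == 0) {
			assert(c->nr < nr);
			c->ckpt[c->nr++] = j;
		}
	}
	return 0;
}

/*
 * Return the node at pos, or NULL if pos is past the end of the list.
 * Starts from the nearest saved checkpoint at or before pos, so at most
 * CKPT_INTERVAL - 1 links are followed. *steps (if non-NULL) receives
 * the number of links followed.
 */
struct job *cache_seek(const struct job_walk_cache *c, size_t pos,
		       size_t *steps)
{
	size_t slot = pos / CKPT_INTERVAL;
	size_t rem = pos % CKPT_INTERVAL;
	size_t n = 0;
	struct job *j = NULL;

	if (slot < c->nr) {
		j = c->ckpt[slot];
		while (j && rem > 0) {
			j = j->next;
			rem--;
			n++;
		}
	}
	if (steps)
		*steps = n;
	return j;
}
```

With such a cache, a restart at an arbitrary pos costs one array lookup plus a short walk, instead of a walk from the head of the list.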
Issue Links
- is related to
  - LU-18610 Avoid double putting job ref (Open)
  - LU-8130 Migrate from libcfs hash to rhashtable (Open)
just hit this locally:
[ 5405.394587] Lustre: DEBUG MARKER: == sanity test 205g: stress test for job_stats procfile == 06:16:13 (1740550573)
[ 5496.225192] LustreError: 310746:0:(lprocfs_jobstats.c:116:job_putref()) ASSERTION( kref_read(&job->js_refcount) > 0 ) failed:
[ 5496.225504] LustreError: 310746:0:(lprocfs_jobstats.c:116:job_putref()) LBUG
[ 5496.225576] CPU: 1 PID: 310746 Comm: lctl Tainted: G W O --------- - - 4.18.0 #12
[ 5496.225657] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
[ 5496.226072] Call Trace:
[ 5496.226115] dump_stack+0x6e/0xa0
[ 5496.226158] lbug_with_loc.cold.4+0x5/0x4e [libcfs]
[ 5496.226221] job_putref+0xa6/0xe0 [obdclass]
[ 5496.226317] lprocfs_jobstats_seq_show+0x2d1/0x520 [obdclass]
[ 5496.226421] seq_read+0x14e/0x3e0
[ 5496.226460] proc_reg_read+0x31/0x50
[ 5496.226497] vfs_read+0xa1/0x150
[ 5496.226533] ksys_read+0x3d/0xa0
[ 5496.226568] do_syscall_64+0x4b/0x1b0
[ 5496.226605] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 5496.226651] RIP: 0033:0x7f456bdc69b2
[ 5496.226692] Code: 96 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 96 da 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
[ 5496.226842] RSP: 002b:00007ffcfa51c918 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 5496.226909] RAX: ffffffffffffffda RBX: 00000000006ae5c0 RCX: 00007f456bdc69b2
[ 5496.226975] RDX: 0000000000001000 RSI: 00000000006ae5c0 RDI: 0000000000000003
[ 5496.227041] RBP: 0000000000001000 R08: 0000000000000000 R09: 00007f456c1c3220
[ 5496.227108] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000003
[ 5496.227178] R13: 00007f456c1f7d40 R14: 00007ffcfa51cab0 R15: 0000000000000e31