[LU-10680] MDT becoming unresponsive in 2.10.3 Created: 17/Feb/18  Updated: 23/Jan/24  Resolved: 19/Jun/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

3.10.0-693.2.2.el7_lustre.pl1.x86_64


Attachments: Text File foreach_bt-oak-md1-s2-2018-02-16-11-02-21.txt    
Issue Links:
Related
is related to LU-7340 ChangeLogs catalog full condition sho... Resolved
is related to LU-10734 sanity test_160g: User cl8 still foun... Resolved
is related to LU-12865 sanity test 160f fails with ‘mds1: Us... Resolved
is related to LU-12871 enable changelog garbage collection b... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

2.10 LTS continues to cause us some trouble, and this time it's the MDT that has been quite unstable lately. A typical problem seems to start with this:

[52968.460453] Lustre: 5897:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[53955.181780] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[53955.182270] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
[54269.697741] Lustre: 5911:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[54362.202025] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[54362.202511] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 1 previous similar message
[54808.564380] perf: interrupt took too long (2579 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
[56283.046677] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[56283.047150] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 14 previous similar messages
[56867.725405] Lustre: 5397:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60578.888437] Lustre: 8014:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60579.796894] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60579.797374] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
[62145.590529] LNet: Service thread pid 157540 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[62145.591250] Pid: 157540, comm: mdt_rdpg01_050
[62145.591486] 
               Call Trace:
[62145.591949]  [<ffffffff816a94e9>] schedule+0x29/0x70
[62145.592200]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
[62145.592444]  [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
[62145.592708]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
[62145.592971]  [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
[62145.593223]  [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
[62145.593472]  [<ffffffffc12f0ac0>] ? mdd_chlg_garbage_collect+0x0/0x640 [mdd]
[62145.593760]  [<ffffffffc091346a>] ? llog_add+0x7a/0x1a0 [obdclass]
[62145.594025]  [<ffffffff810ea8ca>] ? __getnstimeofday64+0x3a/0xd0
[62145.594272]  [<ffffffffc12f7ac4>] mdd_changelog_store+0x194/0x540 [mdd]
[62145.594537]  [<ffffffffc130532d>] mdd_changelog_data_store_by_fid+0xed/0x1a0 [mdd]
[62145.595020]  [<ffffffffc130644b>] mdd_changelog_data_store+0x15b/0x210 [mdd]
[62145.595267]  [<ffffffffc130779b>] mdd_close+0x25b/0xcf0 [mdd]
[62145.595540]  [<ffffffffc11c9073>] mdt_mfd_close+0x353/0x610 [mdt]
[62145.595790]  [<ffffffffc11ce651>] mdt_close_internal+0x121/0x220 [mdt]
[62145.596051]  [<ffffffffc11ce970>] mdt_close+0x220/0x780 [mdt]
[62145.596352]  [<ffffffffc0bebda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[62145.596633]  [<ffffffffc0b94b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
[62145.597138]  [<ffffffffc0b91148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[62145.597385]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
[62145.597651]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[62145.597927]  [<ffffffffc0b98252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[62145.598176]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[62145.598420]  [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
[62145.598695]  [<ffffffffc0b977c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[62145.598955]  [<ffffffff810b098f>] kthread+0xcf/0xe0
[62145.604690]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[62145.604932]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[62145.605175]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0

... leading to a loss of connectivity, but the server doesn't actually crash. So I took a crash dump while the MDS was unresponsive; it is available at the following link:

 

vmcore-oak-md1-s2-2018-02-16-11:02:21.gz:

https://stanford.box.com/s/n20wj34xhkmmltiav427l3btdz9qjwvn

vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64.gz:

https://stanford.box.com/s/r6u6xjzmsgys2kzcq26562bqo6p45xcp

I'm also attaching the foreach bt and memory usage output.

Thanks!

Stephane



 Comments   
Comment by Peter Jones [ 17/Feb/18 ]

Thanks Stephane

Comment by Bruno Faccini (Inactive) [ 18/Feb/18 ]

Stephane,
I wonder if, according to the stack trace in the info you have provided, a thread is in the process of being started to garbage-collect (as introduced by my patch for LU-7340) the space consumed by changelogs, due to a user/consumer (cl1?) having been idle for quite a long time.

But for some unknown reason kthreadd is not starting it, or has not yet reported that it did... On the other hand, the mdd_changelog_clear() error messages seem to indicate that changelog garbage collection may have already started!

So I would like to analyze the available crash dump, but to do so I also need the lustre-debuginfo RPM for your Lustre version. Can you make it available as well?

Comment by Stephane Thiell [ 18/Feb/18 ]

Hey Bruno,

Thanks for checking. I noticed the function `mdd_chlg_garbage_collect()` in the "inactive thread" stack trace too. But this is weird, because it is very unlikely that cl1 was actually idle: it's a robinhood consumer, and if I check the robinhood log from just before the issue / crash dump, it appears to be running:

 2018/02/16 10:29:18 [7022/2] STATS | ==================== Dumping stats at 2018/02/16 10:29:18 =====================
2018/02/16 10:29:18 [7022/2] STATS | ======== General statistics =========
2018/02/16 10:29:18 [7022/2] STATS | Daemon start time: 2018/01/29 11:23:38
2018/02/16 10:29:18 [7022/2] STATS | Started modules: scan,log_reader,policy_run(modeguard)
2018/02/16 10:29:18 [7022/2] STATS | ======== FS scan statistics =========
2018/02/16 10:29:18 [7022/2] STATS | last scan  = 2018/01/30 00:24:56
2018/02/16 10:29:18 [7022/2] STATS | duration    = 13h 01min 18s (46878 s)
2018/02/16 10:29:18 [7022/2] STATS | status      = complete
2018/02/16 10:29:18 [7022/2] STATS | current scan interval = 22.0d
2018/02/16 10:29:18 [7022/2] STATS | ChangeLog reader #0:
2018/02/16 10:29:18 [7022/2] STATS |    fs_name    =   oak
2018/02/16 10:29:18 [7022/2] STATS |    mdt_name   =   MDT0000
2018/02/16 10:29:18 [7022/2] STATS |    reader_id  =   cl1
2018/02/16 10:29:18 [7022/2] STATS |    records read        = 176082077
2018/02/16 10:29:18 [7022/2] STATS |    interesting records = 119680156
2018/02/16 10:29:18 [7022/2] STATS |    suppressed records  = 56401921
2018/02/16 10:29:18 [7022/2] STATS |    records pending     = 999
2018/02/16 10:29:18 [7022/2] STATS |    last received            = 2018/02/16 10:29:17
2018/02/16 10:29:18 [7022/2] STATS |    last read record time    = 2018/02/16 10:29:17.671089
2018/02/16 10:29:18 [7022/2] STATS |    last read record id      = 3459537671
2018/02/16 10:29:18 [7022/2] STATS |    last pushed record id    = 3459535680
2018/02/16 10:29:18 [7022/2] STATS |    last committed record id = 3459535680
2018/02/16 10:29:18 [7022/2] STATS |    last cleared record id   = 3459535680
2018/02/16 10:29:18 [7022/2] STATS |    read speed               = 76977.50 record/sec (513.18 incl. idle time)
2018/02/16 10:29:18 [7022/2] STATS |    processing speed ratio   = 1.00
2018/02/16 10:29:18 [7022/2] STATS |    ChangeLog stats:
2018/02/16 10:29:18 [7022/2] STATS |    MARK: 18414, CREAT: 16458408, MKDIR: 3318418, HLINK: 35763559, SLINK: 120840, MKNOD: 726
2018/02/16 10:29:18 [7022/2] STATS |    UNLNK: 43112348, RMDIR: 3143745, RENME: 6424858, RNMTO: 0, OPEN: 0, CLOSE: 32387229
2018/02/16 10:29:18 [7022/2] STATS |    LYOUT: 118087, TRUNC: 3063071, SATTR: 29921326, XATTR: 0, HSM: 0, MTIME: 695102, CTIME: 1535946
2018/02/16 10:29:18 [7022/2] STATS |    ATIME: 0, MIGRT: 0


I also just checked that we only have one changelog consumer, cl1:

[root@oak-md1-s2 ~]# cat /proc/fs/lustre/mdd/oak-MDT0000/changelog_users 
current index: 3475865674
ID    index (idle seconds)
cl1   3475865473 (1)
[root@oak-md1-s2 ~]#

You should be able to download the debuginfo RPMs at the following links:

Thanks!

Comment by Bruno Faccini (Inactive) [ 18/Feb/18 ]

Well, after doing some analysis on the crash dump, I am puzzled by what I seem to have discovered.
It looks like the assembly code generated for the mdd_changelog_store() routine, including my patch for LU-7340, is wrong:

Dump of assembler code for function mdd_changelog_store:
931     int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd,
   0xffffffffc12f7972 <+66>:    mov    0x28(%rsi),%r14

932                             struct llog_changelog_rec *rec, struct thandle *th)
933     {
   0xffffffffc12f7930 <+0>:     nopl   0x0(%rax,%rax,1)
   0xffffffffc12f7935 <+5>:     push   %rbp
   0xffffffffc12f7936 <+6>:     mov    %rsp,%rbp
   0xffffffffc12f7939 <+9>:     push   %r15
   0xffffffffc12f793b <+11>:    push   %r14
   0xffffffffc12f793d <+13>:    push   %r13
   0xffffffffc12f793f <+15>:    mov    %rdx,%r13
   0xffffffffc12f7942 <+18>:    push   %r12
   0xffffffffc12f7944 <+20>:    push   %rbx
   0xffffffffc12f7945 <+21>:    mov    %rsi,%rbx
   0xffffffffc12f794f <+31>:    sub    $0x30,%rsp
   0xffffffffc12f7957 <+39>:    mov    %rcx,-0x58(%rbp)
   0xffffffffc12f795b <+43>:    mov    %gs:0x28,%rax
   0xffffffffc12f7964 <+52>:    mov    %rax,-0x30(%rbp)
   0xffffffffc12f7968 <+56>:    xor    %eax,%eax
   0xffffffffc12f796a <+58>:    mov    %rdi,-0x48(%rbp)

934             struct obd_device       *obd = mdd2obd_dev(mdd);
935             struct llog_ctxt        *ctxt;
936             struct thandle          *llog_th;
937             int                      rc;
938             bool                     run_gc_task;
939     
940             rec->cr_hdr.lrh_len = llog_data_len(sizeof(*rec) +
   0xffffffffc12f79b2 <+130>:   mov    %eax,0x0(%r13)
........................
964     
965             /* time to recover some space ?? */
966             spin_lock(&mdd->mdd_cl.mc_lock);
967             if (unlikely(mdd->mdd_changelog_gc && (ktime_get_real_seconds() -
   0xffffffffc12f7a8f <+351>:   mov    0xf8(%rbx),%eax
   0xffffffffc12f7a95 <+357>:   test   %eax,%eax
   0xffffffffc12f7a97 <+359>:   jne    0xffffffffc12f7dc0 <mdd_changelog_store+1168>
   0xffffffffc12f7dc0 <+1168>:  callq  0xffffffffc07fec90 <ktime_get_real_seconds>
   0xffffffffc12f7dc5 <+1173>:  mov    0x108(%rbx),%edx
   0xffffffffc12f7dcb <+1179>:  sub    0xf0(%rbx),%rax
   0xffffffffc12f7dd2 <+1186>:  cmp    %rdx,%rax
   0xffffffffc12f7dd5 <+1189>:  jle    0xffffffffc12f7a9d <mdd_changelog_store+365>
   0xffffffffc12f7ddb <+1195>:  cmpq   $0x0,0xe8(%rbx)
   0xffffffffc12f7de3 <+1203>:  jne    0xffffffffc12f7a9d <mdd_changelog_store+365>
   0xffffffffc12f7de9 <+1209>:  mov    0x30(%r15),%rdi
   0xffffffffc12f7ded <+1213>:  callq  0xffffffffc0919a20 <llog_cat_free_space>
   0xffffffffc12f7df2 <+1218>:  cmp    0x10c(%rbx),%eax
   0xffffffffc12f7df8 <+1224>:  ja     0xffffffffc12f7a9d <mdd_changelog_store+365>

968                 mdd->mdd_cl.mc_gc_time > mdd->mdd_changelog_min_gc_interval) &&
969                 mdd->mdd_cl.mc_gc_task == NULL &&
970                 llog_cat_free_space(ctxt->loc_handle) <=
971                                     mdd->mdd_changelog_min_free_cat_entries)) {
972                     CWARN("%s: low on changelog_catalog free entries, starting "
   0xffffffffc12f7dfe <+1230>:  lea    0x40(%r14),%rdx
   0xffffffffc12f7e02 <+1234>:  mov    $0xffffffffc1316ec8,%rsi
   0xffffffffc12f7e09 <+1241>:  mov    $0xffffffffc132ba80,%rdi
   0xffffffffc12f7e10 <+1248>:  xor    %eax,%eax
   0xffffffffc12f7e12 <+1250>:  movl   $0x4,0x33c74(%rip)        # 0xffffffffc132ba90 <msgdata.69570+16>
   0xffffffffc12f7e1c <+1260>:  movq   $0xffffffffc1316b28,0x33c59(%rip)        # 0xffffffffc132ba80 <msgdata.69570>
   0xffffffffc12f7e27 <+1271>:  movq   $0xffffffffc1312b70,0x33c56(%rip)        # 0xffffffffc132ba88 <msgdata.69570+8>
   0xffffffffc12f7e32 <+1282>:  movl   $0x3cd,0x33c58(%rip)        # 0xffffffffc132ba94 <msgdata.69570+20>
   0xffffffffc12f7e3c <+1292>:  movq   $0xffffffffc132ba70,0x33c59(%rip)        # 0xffffffffc132baa0 <msgdata.69570+32>
   0xffffffffc12f7e47 <+1303>:  movl   $0x400,0x33c47(%rip)        # 0xffffffffc132ba98 <msgdata.69570+24>
   0xffffffffc12f7e51 <+1313>:  callq  0xffffffffc0809b70 <libcfs_debug_msg>

973                           "ChangeLog garbage collection thread\n", obd->obd_name);
974     
975                     /* indicate further kthread run will occur outside right after
976                      * critical section
977                      */
978                     mdd->mdd_cl.mc_gc_task = (struct task_struct *)(-1);
   0xffffffffc12f7e56 <+1318>:  movq   $0xffffffffffffffff,0xe8(%rbx)
   0xffffffffc12f7e61 <+1329>:  jmpq   0xffffffffc12f7a9d <mdd_changelog_store+365>
   0xffffffffc12f7e66:  nopw   %cs:0x0(%rax,%rax,1)

979                     run_gc_task = true;
980             }
981             spin_unlock(&mdd->mdd_cl.mc_lock);
982             if (run_gc_task) {
983                     struct task_struct *gc_task;
984     
985                     gc_task = kthread_run(mdd_chlg_garbage_collect, mdd,
   0xffffffffc12f7aa7 <+375>:   xor    %eax,%eax
   0xffffffffc12f7aa9 <+377>:   mov    $0xffffffffc131502e,%rcx
   0xffffffffc12f7ab0 <+384>:   mov    $0xffffffff,%edx
   0xffffffffc12f7ab5 <+389>:   mov    %rbx,%rsi
   0xffffffffc12f7ab8 <+392>:   mov    $0xffffffffc12f0ac0,%rdi
   0xffffffffc12f7abf <+399>:   callq  0xffffffff810b0740 <kthread_create_on_node>
   0xffffffffc12f7ac4 <+404>:   cmp    $0xfffffffffffff000,%rax
   0xffffffffc12f7aca <+410>:   mov    %rax,%r12
   0xffffffffc12f7acd <+413>:   ja     0xffffffffc12f7d55 <mdd_changelog_store+1061>
   0xffffffffc12f7ad3 <+419>:   mov    %rax,%rdi
   0xffffffffc12f7ad6 <+422>:   callq  0xffffffff810c4750 <wake_up_process>
............................

where it appears that the compiler has completely dropped the setting and testing of the run_gc_task variable.
This may have been caused by the fact that run_gc_task was mistakenly left uninitialized, but then it is unclear why no warning/error was generated at build/compile time (even though we use gcc's -Wall option; it seems the -Wuninitialized check may be silently omitted when no optimization is requested, as some forum threads report), and also why the generated assembly does not strictly correspond to the source code.
The side effect of all this is that an attempt to create a kthread occurs during every ChangeLog record creation, instead of only when the garbage-collection conditions have been triggered. And the consequences seem to be aggravated by the fact that kthreads are not started in the context of the requestor: requests to do so are queued for "kthreadd" handling, which can lead to much trouble if "kthreadd" suffers some issue causing its operations to stop or slow down.

And this seems to be the case in your crash dump:

crash> bt 2
PID: 2      TASK: ffff880169448fd0  CPU: 6   COMMAND: "kthreadd"
 #0 [ffff88016946b500] __schedule at ffffffff816a8f65
 #1 [ffff88016946b568] schedule at ffffffff816a94e9
 #2 [ffff88016946b578] wait_transaction_locked at ffffffffc0177085 [jbd2]
 #3 [ffff88016946b5d0] add_transaction_credits at ffffffffc0177368 [jbd2]
 #4 [ffff88016946b630] start_this_handle at ffffffffc01775e1 [jbd2]
 #5 [ffff88016946b6c8] jbd2__journal_start at ffffffffc0177a93 [jbd2]
 #6 [ffff88016946b710] __ldiskfs_journal_start_sb at ffffffffc0f55069 [ldiskfs]
 #7 [ffff88016946b750] ldiskfs_acquire_dquot at ffffffffc0f8b753 [ldiskfs]
 #8 [ffff88016946b770] dqget at ffffffff81267814
 #9 [ffff88016946b7d0] dquot_get_dqblk at ffffffff812685f4
#10 [ffff88016946b7f0] osd_acct_index_lookup at ffffffffc10a58df [osd_ldiskfs]
#11 [ffff88016946b828] lquota_disk_read at ffffffffc1014214 [lquota]
#12 [ffff88016946b858] qsd_refresh_usage at ffffffffc101bbfa [lquota]
#13 [ffff88016946b890] qsd_op_adjust at ffffffffc102a881 [lquota]
#14 [ffff88016946b8d0] osd_object_delete at ffffffffc106ec50 [osd_ldiskfs]
#15 [ffff88016946b910] lu_object_free at ffffffffc095ce3d [obdclass]
#16 [ffff88016946b968] lu_site_purge_objects at ffffffffc095da9e [obdclass]
#17 [ffff88016946ba10] lu_cache_shrink at ffffffffc095e8e9 [obdclass]
#18 [ffff88016946ba60] shrink_slab at ffffffff81195413
#19 [ffff88016946bb00] zone_reclaim at ffffffff81198091
#20 [ffff88016946bba8] get_page_from_freelist at ffffffff8118c264
#21 [ffff88016946bcb8] __alloc_pages_nodemask at ffffffff8118caf6
#22 [ffff88016946bd68] copy_process at ffffffff8108510d
#23 [ffff88016946bdf8] do_fork at ffffffff81086a51
#24 [ffff88016946be70] kernel_thread at ffffffff81086d06
#25 [ffff88016946be80] kthreadd at ffffffff810b1341
#26 [ffff88016946bf50] ret_from_fork at ffffffff816b4f58
crash> 

and even so it seems to be making some forward progress, as its last activity timestamp is much more recent (-1120s) than that of thread 157540, which is still stuck with the same stack.

So the first thing I have to do ASAP is push a fix to actually initialize the run_gc_task boolean to false; secondly, I will continue the crash-dump analysis to understand the reason for the "kthreadd" pseudo-hang/slow operations.

Comment by Gerrit Updater [ 18/Feb/18 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31347
Subject: LU-10680 mdd: fix run_gc_task uninitialized
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: df5102382a42a1d366fe38d9da47a1c20ac40dd8

Comment by Stephane Thiell [ 19/Feb/18 ]

Bruno,

Wow, what an amazing piece of troubleshooting! It is indeed strange that the uninitialized variable wasn't detected by -Wall or by QA/Maloo... I just built new 2.10.3 RPMs with your patch added, which I will deploy ASAP.

Thanks much!

Comment by Bruno Faccini (Inactive) [ 20/Feb/18 ]

Hmm, I think I have finally understood what caused the kthreadd thread to hang, and then the whole MDT's operations to become blocked/slow.

We triggered the following deadlock, where a lot of MDT service threads (mdt[0-9], mdt_rdpg) were stuck in uninterruptible state, waiting for kthreadd to spawn threads, with the following kinds of stacks:

PID: 13741  TASK: ffff88203a31dee0  CPU: 23  COMMAND: "mdt_rdpg01_031"
 #0 [ffff880fe83efb20] __schedule at ffffffff816a8f65
 #1 [ffff880fe83efb88] schedule at ffffffff816a94e9
 #2 [ffff880fe83efb98] schedule_timeout at ffffffff816a6ff9
 #3 [ffff880fe83efc40] wait_for_completion at ffffffff816a989d
 #4 [ffff880fe83efca0] kthread_create_on_node at ffffffff810b07ea
 #5 [ffff880fe83efd58] ptlrpc_start_thread at ffffffffc0b962d5 [ptlrpc]
 #6 [ffff880fe83efde8] ptlrpc_main at ffffffffc0b9803c [ptlrpc]
 #7 [ffff880fe83efec8] kthread at ffffffff810b098f
 #8 [ffff880fe83eff50] ret_from_fork at ffffffff816b4f58
.......................
PID: 157540  TASK: ffff88103a13bf40  CPU: 11  COMMAND: "mdt_rdpg01_050"
 #0 [ffff881cb4b578d0] __schedule at ffffffff816a8f65
 #1 [ffff881cb4b57938] schedule at ffffffff816a94e9
 #2 [ffff881cb4b57948] schedule_timeout at ffffffff816a6ff9
 #3 [ffff881cb4b579f0] wait_for_completion at ffffffff816a989d
 #4 [ffff881cb4b57a50] kthread_create_on_node at ffffffff810b07ea
 #5 [ffff881cb4b57b08] mdd_changelog_store at ffffffffc12f7ac4 [mdd]
 #6 [ffff881cb4b57b70] mdd_changelog_data_store_by_fid at ffffffffc130532d [mdd]
 #7 [ffff881cb4b57bc8] mdd_changelog_data_store at ffffffffc130644b [mdd]
 #8 [ffff881cb4b57be0] mdd_close at ffffffffc130779b [mdd]
 #9 [ffff881cb4b57c40] mdt_mfd_close at ffffffffc11c9073 [mdt]
#10 [ffff881cb4b57c90] mdt_close_internal at ffffffffc11ce651 [mdt]
#11 [ffff881cb4b57cc0] mdt_close at ffffffffc11ce970 [mdt]
#12 [ffff881cb4b57d00] tgt_request_handle at ffffffffc0bebda5 [ptlrpc]
#13 [ffff881cb4b57d48] ptlrpc_server_handle_request at ffffffffc0b94b16 [ptlrpc]
#14 [ffff881cb4b57de8] ptlrpc_main at ffffffffc0b98252 [ptlrpc]
#15 [ffff881cb4b57ec8] kthread at ffffffff810b098f
#16 [ffff881cb4b57f50] ret_from_fork at ffffffff816b4f58
.....................
PID: 193852  TASK: ffff88202234bf40  CPU: 3   COMMAND: "mdt01_098"
 #0 [ffff881aa61fb5e0] __schedule at ffffffff816a8f65
 #1 [ffff881aa61fb648] schedule at ffffffff816a94e9
 #2 [ffff881aa61fb658] schedule_timeout at ffffffff816a6ff9
 #3 [ffff881aa61fb700] wait_for_completion at ffffffff816a989d
 #4 [ffff881aa61fb760] kthread_create_on_node at ffffffff810b07ea
 #5 [ffff881aa61fb818] mdd_changelog_store at ffffffffc12f7ac4 [mdd]
 #6 [ffff881aa61fb880] mdd_changelog_ns_store at ffffffffc12f8096 [mdd]
 #7 [ffff881aa61fb918] mdd_create at ffffffffc12fb40a [mdd]
 #8 [ffff881aa61fba10] mdt_reint_open at ffffffffc11cc9bc [mdt]
 #9 [ffff881aa61fbb00] mdt_reint_rec at ffffffffc11c18a0 [mdt]
#10 [ffff881aa61fbb28] mdt_reint_internal at ffffffffc11a330b [mdt]
#11 [ffff881aa61fbb60] mdt_intent_reint at ffffffffc11a3832 [mdt]
#12 [ffff881aa61fbba0] mdt_intent_policy at ffffffffc11ae59e [mdt]
#13 [ffff881aa61fbbf8] ldlm_lock_enqueue at ffffffffc0b39277 [ptlrpc]
#14 [ffff881aa61fbc50] ldlm_handle_enqueue0 at ffffffffc0b62903 [ptlrpc]
#15 [ffff881aa61fbce0] tgt_enqueue at ffffffffc0be7ea2 [ptlrpc]
#16 [ffff881aa61fbd00] tgt_request_handle at ffffffffc0bebda5 [ptlrpc]
#17 [ffff881aa61fbd48] ptlrpc_server_handle_request at ffffffffc0b94b16 [ptlrpc]
#18 [ffff881aa61fbde8] ptlrpc_main at ffffffffc0b98252 [ptlrpc]
#19 [ffff881aa61fbec8] kthread at ffffffff810b098f
#20 [ffff881aa61fbf50] ret_from_fork at ffffffff816b4f58
...................

all of them either trying to spawn new service threads, or hitting the previously described problem of one changelog garbage-collection thread being spawned for each new changelog record.

But as I have already indicated, kthreadd is stuck with the following stack:

PID: 2      TASK: ffff880169448fd0  CPU: 6   COMMAND: "kthreadd"
 #0 [ffff88016946b500] __schedule at ffffffff816a8f65
 #1 [ffff88016946b568] schedule at ffffffff816a94e9
 #2 [ffff88016946b578] wait_transaction_locked at ffffffffc0177085 [jbd2]
 #3 [ffff88016946b5d0] add_transaction_credits at ffffffffc0177368 [jbd2]
 #4 [ffff88016946b630] start_this_handle at ffffffffc01775e1 [jbd2]
 #5 [ffff88016946b6c8] jbd2__journal_start at ffffffffc0177a93 [jbd2]
 #6 [ffff88016946b710] __ldiskfs_journal_start_sb at ffffffffc0f55069 [ldiskfs]
 #7 [ffff88016946b750] ldiskfs_acquire_dquot at ffffffffc0f8b753 [ldiskfs]
 #8 [ffff88016946b770] dqget at ffffffff81267814
 #9 [ffff88016946b7d0] dquot_get_dqblk at ffffffff812685f4
#10 [ffff88016946b7f0] osd_acct_index_lookup at ffffffffc10a58df [osd_ldiskfs]
#11 [ffff88016946b828] lquota_disk_read at ffffffffc1014214 [lquota]
#12 [ffff88016946b858] qsd_refresh_usage at ffffffffc101bbfa [lquota]
#13 [ffff88016946b890] qsd_op_adjust at ffffffffc102a881 [lquota]
#14 [ffff88016946b8d0] osd_object_delete at ffffffffc106ec50 [osd_ldiskfs]
#15 [ffff88016946b910] lu_object_free at ffffffffc095ce3d [obdclass]
#16 [ffff88016946b968] lu_site_purge_objects at ffffffffc095da9e [obdclass]
#17 [ffff88016946ba10] lu_cache_shrink at ffffffffc095e8e9 [obdclass]
#18 [ffff88016946ba60] shrink_slab at ffffffff81195413
#19 [ffff88016946bb00] zone_reclaim at ffffffff81198091
#20 [ffff88016946bba8] get_page_from_freelist at ffffffff8118c264
#21 [ffff88016946bcb8] __alloc_pages_nodemask at ffffffff8118caf6
#22 [ffff88016946bd68] copy_process at ffffffff8108510d
#23 [ffff88016946bdf8] do_fork at ffffffff81086a51
#24 [ffff88016946be70] kernel_thread at ffffffff81086d06
#25 [ffff88016946be80] kthreadd at ffffffff810b1341
#26 [ffff88016946bf50] ret_from_fork at ffffffff816b4f58

Upon trying to launch a new thread, it unfortunately triggered a memory-reclaim condition, with the consequence that the shrinkers were called, including the one for the lu_cache (to drain unreferenced objects). In doing so it tried to update quotas in the quota slave file, but it is blocked waiting for journal credits to be granted because the current/running transaction has still not committed. And that is because the transaction is waiting for many of its attached/started handles to complete (jbd2_journal_stop!), which will never happen, since those handles almost all belong to the threads waiting for kthreadd. Hence the deadlock.

I think that to fix this and prevent such an unexpected scenario from recurring, I need to find a way to start the changelog garbage-collection thread outside the section of code where a journal transaction is being filled.

I will push a second patch soon to implement this.

Comment by Bruno Faccini (Inactive) [ 21/Feb/18 ]

Thinking more about it, it seems to me that kthreadd should be excluded from running the memory-reclaim mechanism in its own context. In that case it would be better either to do nothing (and let the next thread that tries to allocate memory hit the same conditions/threshold and reclaim in its own context), or to simply wake up kswapd to do the job asynchronously.

But in any case, we need to protect against such possible cases and only request creation of the changelog garbage-collection thread outside of any current journal transaction/handle.

Comment by Stephane Thiell [ 21/Feb/18 ]

Thanks Bruno! Hope you'll find a way to fix this race condition.

Comment by Gerrit Updater [ 22/Feb/18 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31376
Subject: LU-10680 mdd: create gc thread when no current transaction
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2985625ebbb404c01f88ada84427d5b4386fbfa4

Comment by Gerrit Updater [ 27/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31347/
Subject: LU-10680 mdd: fix run_gc_task uninitialized
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 25cdbbb475b8a42e3a1f594898d2ac1406b466f6

Comment by Bruno Faccini (Inactive) [ 02/Mar/18 ]

Hello Stephane,
Just one more comment I have in mind and to explain why you have been able to trigger the 2nd problem/deadlock.
As part of LU-10709 I understand that you are also running with vm.zone_reclaim_mode=0 on the MDS, so, according to the mm code in this area, this seems to be definitely why memory reclaim, with the shrinkers run, was triggered during kthreadd's creation of a new kthread, instead of new memory/pages being grabbed from the preferred/current node+zone free list.

Hopefully this specific case will be handled and fixed by my second patch for this ticket.

Comment by Stephane Thiell [ 02/Mar/18 ]

Hey Bruno,

Thanks for the update! And yes, this happened with vm.zone_reclaim_mode=0 on MDS.

Comment by Gerrit Updater [ 06/Mar/18 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/31552
Subject: LU-10680 mdd: disable changelog garbage collection by default
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 62e4519de3dcb588b88311950e8c6b7fefcca6a8

Comment by Gerrit Updater [ 08/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31552/
Subject: LU-10680 mdd: disable changelog garbage collection by default
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 57719d0c8833b8a8a8b6516d3bd4f2ad2b226285

Comment by Andreas Dilger [ 08/Mar/18 ]

Moving over to 2.12 now that the GC feature is no longer enabled by default, and the failing tests have been disabled. The remaining patch "LU-10680 mdd: create gc thread when no current transaction" for this ticket can land when it is ready, but no longer blocks the 2.11 release.

Comment by Gerrit Updater [ 14/Jun/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31376/
Subject: LU-10680 mdd: create gc thread when no current transaction
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 31fef6845e8be39ae8dfd242714a84001925b7ad

Comment by Peter Jones [ 19/Jun/18 ]

Landed for 2.12

Comment by Andreas Dilger [ 23/Jan/24 ]

"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53773
Subject: LU-10680 tests: fix interop for sanity test_160h
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 32f7dd91985cd43e6ad2ed9854f9bed114a0e640

Generated at Sat Feb 10 02:37:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.