Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6 - 3.10.0-957.1.3.el7_lustre.x86_64
    • Severity: 3
    • 9223372036854775807

    Description

      We just got some kind of deadlock on fir-md1-s2, which serves MDT0001 and MDT0003.

      I took a crash dump because the MDS was not usable and the filesystem was hanging.
      Attaching the output of foreach bt as bt.all, along with vmcore-dmesg.txt.

      Also, this is from the crash:

      crash> foreach bt >bt.all
      crash> kmem -i
                       PAGES        TOTAL      PERCENTAGE
          TOTAL MEM  65891316     251.4 GB         ----
               FREE  13879541      52.9 GB   21% of TOTAL MEM
               USED  52011775     198.4 GB   78% of TOTAL MEM
             SHARED  32692923     124.7 GB   49% of TOTAL MEM
            BUFFERS  32164243     122.7 GB   48% of TOTAL MEM
             CACHED   781150         3 GB    1% of TOTAL MEM
               SLAB  13801776      52.6 GB   20% of TOTAL MEM
      
         TOTAL HUGE        0            0         ----
          HUGE FREE        0            0    0% of TOTAL HUGE
      
         TOTAL SWAP  1048575         4 GB         ----
          SWAP USED        0            0    0% of TOTAL SWAP
          SWAP FREE  1048575         4 GB  100% of TOTAL SWAP
      
       COMMIT LIMIT  33994233     129.7 GB         ----
          COMMITTED   228689     893.3 MB    0% of TOTAL LIMIT
      crash> ps | grep ">"
      >     0      0   0  ffffffffb4218480  RU   0.0       0      0  [swapper/0]
      >     0      0   1  ffff8ec129410000  RU   0.0       0      0  [swapper/1]
      >     0      0   2  ffff8ed129f10000  RU   0.0       0      0  [swapper/2]
      >     0      0   3  ffff8ee129ebe180  RU   0.0       0      0  [swapper/3]
      >     0      0   4  ffff8eb1a9ba6180  RU   0.0       0      0  [swapper/4]
      >     0      0   5  ffff8ec129416180  RU   0.0       0      0  [swapper/5]
      >     0      0   6  ffff8ed129f16180  RU   0.0       0      0  [swapper/6]
      >     0      0   7  ffff8ee129eba080  RU   0.0       0      0  [swapper/7]
      >     0      0   8  ffff8eb1a9ba30c0  RU   0.0       0      0  [swapper/8]
      >     0      0   9  ffff8ec129411040  RU   0.0       0      0  [swapper/9]
      >     0      0  10  ffff8ed129f11040  RU   0.0       0      0  [swapper/10]
      >     0      0  11  ffff8ee129ebd140  RU   0.0       0      0  [swapper/11]
      >     0      0  12  ffff8eb1a9ba5140  RU   0.0       0      0  [swapper/12]
      >     0      0  13  ffff8ec129415140  RU   0.0       0      0  [swapper/13]
      >     0      0  14  ffff8ed129f15140  RU   0.0       0      0  [swapper/14]
      >     0      0  15  ffff8ee129ebb0c0  RU   0.0       0      0  [swapper/15]
      >     0      0  16  ffff8eb1a9ba4100  RU   0.0       0      0  [swapper/16]
      >     0      0  17  ffff8ec129412080  RU   0.0       0      0  [swapper/17]
      >     0      0  19  ffff8ee129ebc100  RU   0.0       0      0  [swapper/19]
      >     0      0  20  ffff8eb1a9408000  RU   0.0       0      0  [swapper/20]
      >     0      0  21  ffff8ec129414100  RU   0.0       0      0  [swapper/21]
      >     0      0  22  ffff8ed129f14100  RU   0.0       0      0  [swapper/22]
      >     0      0  23  ffff8ee129f38000  RU   0.0       0      0  [swapper/23]
      >     0      0  24  ffff8eb1a940e180  RU   0.0       0      0  [swapper/24]
      >     0      0  25  ffff8ec1294130c0  RU   0.0       0      0  [swapper/25]
      >     0      0  26  ffff8ed129f130c0  RU   0.0       0      0  [swapper/26]
      >     0      0  27  ffff8ee129f3e180  RU   0.0       0      0  [swapper/27]
      >     0      0  28  ffff8eb1a9409040  RU   0.0       0      0  [swapper/28]
      >     0      0  29  ffff8ec129430000  RU   0.0       0      0  [swapper/29]
      >     0      0  30  ffff8ed129f50000  RU   0.0       0      0  [swapper/30]
      >     0      0  31  ffff8ee129f39040  RU   0.0       0      0  [swapper/31]
      >     0      0  32  ffff8eb1a940d140  RU   0.0       0      0  [swapper/32]
      >     0      0  33  ffff8ec129436180  RU   0.0       0      0  [swapper/33]
      >     0      0  34  ffff8ed129f56180  RU   0.0       0      0  [swapper/34]
      >     0      0  35  ffff8ee129f3d140  RU   0.0       0      0  [swapper/35]
      >     0      0  36  ffff8eb1a940a080  RU   0.0       0      0  [swapper/36]
      >     0      0  37  ffff8ec129431040  RU   0.0       0      0  [swapper/37]
      >     0      0  38  ffff8ed129f51040  RU   0.0       0      0  [swapper/38]
      >     0      0  39  ffff8ee129f3a080  RU   0.0       0      0  [swapper/39]
      >     0      0  40  ffff8eb1a940c100  RU   0.0       0      0  [swapper/40]
      >     0      0  41  ffff8ec129435140  RU   0.0       0      0  [swapper/41]
      >     0      0  42  ffff8ed129f55140  RU   0.0       0      0  [swapper/42]
      >     0      0  43  ffff8ee129f3c100  RU   0.0       0      0  [swapper/43]
      >     0      0  44  ffff8eb1a940b0c0  RU   0.0       0      0  [swapper/44]
      >     0      0  45  ffff8ec129432080  RU   0.0       0      0  [swapper/45]
      >     0      0  46  ffff8ed129f52080  RU   0.0       0      0  [swapper/46]
      >     0      0  47  ffff8ee129f3b0c0  RU   0.0       0      0  [swapper/47]
      > 109549  109543  18  ffff8ee05406c100  RU   0.0  115440   2132  bash
       

      I noticed a lot of threads blocked on quota commands.

      Attachments

        1. bt.all
          1.68 MB
        2. fir-md1-s2-20190424-ldiskfs-event.log
          118 kB
        3. vmcore-dmesg.txt
          1.01 MB

        Activity

          [LU-12178] MDS deadlock with 2.12.0 (quotas?)

          So the issue is that quota acquisition is supposed to allocate on-disk space for the given ID, and the existing infrastructure can specify that the space is already allocated so there is no need to start a transaction.
          I think it's safe to just remove the quota adjustment from the osd_object_delete() path, but that wouldn't be optimal in some cases, as we want to keep the local vs. global quota grant balanced.
          This doesn't need to be done in the context of a process releasing memory; rather, it adds extra pressure on the VM subsystem, as quota rebalance may need additional memory (locks, RPCs, etc.).
          I'd say it would be nice to schedule the rebalance and do it periodically in a separate context.
          Not sure what exact structure to use. The simplest would be some sort of ID list, probably batched on a per-CPU basis to save a bit of memory on list_heads, though if millions of objects sharing a few IDs are being removed then we'd waste quite an amount of memory.
          A smarter structure like a radix-tree-based bitmap would do better, at the cost of development.
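
          A minimal sketch of the deferral described above, with hypothetical names (qsd_pending_id, qsd_defer_adjust) that are not from any actual patch: the delete path only records the affected quota ID under a spinlock, and a periodic worker performs the rebalance later, outside the memory-reclaim context. The per-CPU batching and radix-tree bitmap mentioned above would be refinements of the same idea.

                  #include <linux/list.h>
                  #include <linux/slab.h>
                  #include <linux/spinlock.h>
                  #include <linux/types.h>

                  /* one quota ID whose local/global grant should be rebalanced later */
                  struct qsd_pending_id {
                          struct list_head qpi_link;
                          int              qpi_type;  /* USRQUOTA / GRPQUOTA / PRJQUOTA */
                          __u64            qpi_id;
                  };

                  static LIST_HEAD(qsd_pending_ids);
                  static DEFINE_SPINLOCK(qsd_pending_lock);

                  /* called from the object-delete path instead of adjusting quota inline */
                  static void qsd_defer_adjust(int type, __u64 id)
                  {
                          struct qsd_pending_id *qpi;

                          /* GFP_NOFS: this still runs with OSD locks held */
                          qpi = kmalloc(sizeof(*qpi), GFP_NOFS);
                          if (!qpi)
                                  return;  /* best effort; the rebalance is only advisory */
                          qpi->qpi_type = type;
                          qpi->qpi_id = id;

                          spin_lock(&qsd_pending_lock);
                          list_add_tail(&qpi->qpi_link, &qsd_pending_ids);
                          spin_unlock(&qsd_pending_lock);
                          /* a periodic worker drains qsd_pending_ids and releases
                           * the unused grant back to the quota master */
                  }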

          bzzz Alex Zhuravlev added a comment

          Sorry for the late response. I tend to think patching ldiskfs would fix just this specific case, while there might be more similar cases.
          It is not required to release quota to the master right away; this can be done later in a more appropriate context (or at the master's request). I think it's similar to LU-12018. Will make a patch shortly.

          bzzz Alex Zhuravlev added a comment

          bzzz, what do you think? I fear this issue can happen again... thanks!

          sthiell Stephane Thiell added a comment

          Alex,

          I can look at adding a new ldiskfs patch for this, unless you think we should try to unpick the deadlock itself rather than just switch to GFP_NOFS there? ext4 is able to use GFP_KERNEL, but since we're taking additional locks at the OSD layer, we either have to use GFP_NOFS or figure out how to change our locking...
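
          As a sketch of the "change our locking" direction, one option is to scope the entire region where the OSD holds oo_ext_idx_sem as NOFS, so ldiskfs can keep GFP_KERNEL and any allocation made under the semaphore implicitly avoids filesystem reclaim. This assumes memalloc_nofs_save()/memalloc_nofs_restore(), which newer kernels provide but which the 3.10-based RHEL 7 kernel here may not; the wrapper name and signature below are hypothetical.

                  #include <linux/sched/mm.h>

                  /* hypothetical wrapper around the readdir fill that takes
                   * oo_ext_idx_sem inside osd_ldiskfs_it_fill(): while the
                   * NOFS scope is active, GFP_KERNEL allocations behave as
                   * GFP_NOFS, so reclaim cannot recurse into the filesystem */
                  static int osd_it_fill_scoped_nofs(const struct lu_env *env,
                                                     const struct dt_it *di)
                  {
                          unsigned int flags;
                          int rc;

                          flags = memalloc_nofs_save();
                          rc = osd_ldiskfs_it_fill(env, di);
                          memalloc_nofs_restore(flags);

                          return rc;
                  }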

          pfarrell Patrick Farrell (Inactive) added a comment

          Ah, yes, we do need GFP_NOFS in ldiskfs_htree_create_dir_info as well; it's called from readdir, which is called from osd_ldiskfs_it_fill, with the semaphore held.

          pfarrell Patrick Farrell (Inactive) added a comment

          Alex,

          So the issue is that we're going into reclaim; doesn't that just mean we need to use GFP_NOFS here in ldiskfs_htree_store_dirent:

                  len = sizeof(struct fname) + dirent->name_len + extra_data + 1;
                  new_fn = kzalloc(len, GFP_KERNEL);
                  if (!new_fn)
                          return -ENOMEM;
                  new_fn->hash = hash;
          

          The semaphore oo_ext_idx_sem is taken in osd_ldiskfs_it_fill, and it is also needed in truncate.

          So we're holding that semaphore, the allocation pushes us into filesystem reclaim, and when that reclaim re-enters the file system, we deadlock on it.

          I haven't worked through everything, but there's at least one thread stuck in truncate:

          [2794948.051149]  [<ffffffffb3d67c49>] schedule+0x29/0x70
          [2794948.056299]  [<ffffffffc0406085>] wait_transaction_locked+0x85/0xd0 [jbd2]
          [2794948.063366]  [<ffffffffb36c2d00>] ? wake_up_atomic_t+0x30/0x30
          [2794948.069377]  [<ffffffffc0406368>] add_transaction_credits+0x268/0x2f0 [jbd2]
          [2794948.076612]  [<ffffffffc13cb2e0>] ? osd_trunc_lock+0xe0/0x210 [osd_ldiskfs]
          [2794948.083756]  [<ffffffffc04065e1>] start_this_handle+0x1a1/0x430 [jbd2]
          [2794948.090474]  [<ffffffffc1615121>] ? lod_sub_declare_write+0xe1/0x2a0 [lod]
          [2794948.097541]  [<ffffffffb381c0b2>] ? kmem_cache_alloc+0x1c2/0x1f0
          [2794948.103732]  [<ffffffffc0406a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
          [2794948.110531]  [<ffffffffc13ab22e>] ? osd_trans_start+0x20e/0x4e0 [osd_ldiskfs]
          [2794948.117871]  [<ffffffffc12b82c9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
          [2794948.125446]  [<ffffffffc13ab22e>] osd_trans_start+0x20e/0x4e0 [osd_ldiskfs]
          [2794948.132616]  [<ffffffffc1043612>] top_trans_start+0x702/0x940 [ptlrpc]
          [2794948.139346]  [<ffffffffc15d6494>] lod_trans_start+0x34/0x40 [lod]
          [2794948.145640]  [<ffffffffc1689bca>] mdd_trans_start+0x1a/0x20 [mdd]
          [2794948.151914]  [<ffffffffc167de9f>] mdd_attr_set+0x56f/0xca0 [mdd]
          [2794948.158120]  [<ffffffffc150d2e0>] mdt_mfd_close+0x750/0x850 [mdt]
          [2794948.164398]  [<ffffffffc15127d1>] mdt_close_internal+0x121/0x220 [mdt]
          [2794948.171117]  [<ffffffffc1512af0>] mdt_close+0x220/0x780 [mdt]
          [2794948.177090]  [<ffffffffc103035a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
          [2794948.184172]  [<ffffffffc1009a51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
          [2794948.191918]  [<ffffffffc0aefbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
          [2794948.199189]  [<ffffffffc0fd492b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
          [2794948.207047]  [<ffffffffc0fd17b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
          [2794948.214015]  [<ffffffffb36d67c2>] ? default_wake_function+0x12/0x20
          [2794948.220462]  [<ffffffffb36cba9b>] ? __wake_up_common+0x5b/0x90
          [2794948.226506]  [<ffffffffc0fd825c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
          [2794948.232998]  [<ffffffffc0fd7760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
          [2794948.240570]  [<ffffffffb36c1c31>] kthread+0xd1/0xe0
          [2794948.245634]  [<ffffffffb36c1b60>] ? insert_kthread_work+0x40/0x40
          [2794948.251920]  [<ffffffffb3d74c24>] ret_from_fork_nospec_begin+0xe/0x21
          [2794948.258536]  [<ffffffffb36c1b60>] ? insert_kthread_work+0x40/0x40 

          So I'm pretty sure the solution is just GFP_NOFS in that call. Also checking whether we need it in ldiskfs_htree_create_dir_info, which is in the same file and also uses plain GFP_KERNEL...
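
          For reference, a sketch of the change being discussed, written as a diff over the kzalloc quoted above (context lines approximate; this is an illustration, not necessarily the patch that eventually landed). The same GFP_KERNEL to GFP_NOFS swap would apply to the struct dir_private_info allocation in ldiskfs_htree_create_dir_info.

                  --- a/fs/ldiskfs/namei.c
                  +++ b/fs/ldiskfs/namei.c
                  @@ int ldiskfs_htree_store_dirent(...)
                           len = sizeof(struct fname) + dirent->name_len + extra_data + 1;
                  -        new_fn = kzalloc(len, GFP_KERNEL);
                  +        new_fn = kzalloc(len, GFP_NOFS);
                           if (!new_fn)
                                   return -ENOMEM;
                           new_fn->hash = hash;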

          pfarrell Patrick Farrell (Inactive) added a comment - edited

          Alex

          Could you please advise?

          Thanks

          Peter

          pjones Peter Jones added a comment

          People

            Assignee: bzzz Alex Zhuravlev
            Reporter: sthiell Stephane Thiell
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: