Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6 - 3.10.0-957.1.3.el7_lustre.x86_64
    • Severity: 3

    Description

      We just got some kind of deadlock on fir-md1-s2 that serves MDT0001 and MDT0003.

      I took a crash dump because the MDS was not usable and the filesystem was hanging.
      Attaching the output of foreach bt (as bt.all) and vmcore-dmesg.txt.

      Also, this is from the crash:

      crash> foreach bt >bt.all
      crash> kmem -i
                       PAGES        TOTAL      PERCENTAGE
          TOTAL MEM  65891316     251.4 GB         ----
               FREE  13879541      52.9 GB   21% of TOTAL MEM
               USED  52011775     198.4 GB   78% of TOTAL MEM
             SHARED  32692923     124.7 GB   49% of TOTAL MEM
            BUFFERS  32164243     122.7 GB   48% of TOTAL MEM
             CACHED   781150         3 GB    1% of TOTAL MEM
               SLAB  13801776      52.6 GB   20% of TOTAL MEM
      
         TOTAL HUGE        0            0         ----
          HUGE FREE        0            0    0% of TOTAL HUGE
      
         TOTAL SWAP  1048575         4 GB         ----
          SWAP USED        0            0    0% of TOTAL SWAP
          SWAP FREE  1048575         4 GB  100% of TOTAL SWAP
      
       COMMIT LIMIT  33994233     129.7 GB         ----
          COMMITTED   228689     893.3 MB    0% of TOTAL LIMIT
      crash> ps | grep ">"
      >     0      0   0  ffffffffb4218480  RU   0.0       0      0  [swapper/0]
      >     0      0   1  ffff8ec129410000  RU   0.0       0      0  [swapper/1]
      >     0      0   2  ffff8ed129f10000  RU   0.0       0      0  [swapper/2]
      >     0      0   3  ffff8ee129ebe180  RU   0.0       0      0  [swapper/3]
      >     0      0   4  ffff8eb1a9ba6180  RU   0.0       0      0  [swapper/4]
      >     0      0   5  ffff8ec129416180  RU   0.0       0      0  [swapper/5]
      >     0      0   6  ffff8ed129f16180  RU   0.0       0      0  [swapper/6]
      >     0      0   7  ffff8ee129eba080  RU   0.0       0      0  [swapper/7]
      >     0      0   8  ffff8eb1a9ba30c0  RU   0.0       0      0  [swapper/8]
      >     0      0   9  ffff8ec129411040  RU   0.0       0      0  [swapper/9]
      >     0      0  10  ffff8ed129f11040  RU   0.0       0      0  [swapper/10]
      >     0      0  11  ffff8ee129ebd140  RU   0.0       0      0  [swapper/11]
      >     0      0  12  ffff8eb1a9ba5140  RU   0.0       0      0  [swapper/12]
      >     0      0  13  ffff8ec129415140  RU   0.0       0      0  [swapper/13]
      >     0      0  14  ffff8ed129f15140  RU   0.0       0      0  [swapper/14]
      >     0      0  15  ffff8ee129ebb0c0  RU   0.0       0      0  [swapper/15]
      >     0      0  16  ffff8eb1a9ba4100  RU   0.0       0      0  [swapper/16]
      >     0      0  17  ffff8ec129412080  RU   0.0       0      0  [swapper/17]
      >     0      0  19  ffff8ee129ebc100  RU   0.0       0      0  [swapper/19]
      >     0      0  20  ffff8eb1a9408000  RU   0.0       0      0  [swapper/20]
      >     0      0  21  ffff8ec129414100  RU   0.0       0      0  [swapper/21]
      >     0      0  22  ffff8ed129f14100  RU   0.0       0      0  [swapper/22]
      >     0      0  23  ffff8ee129f38000  RU   0.0       0      0  [swapper/23]
      >     0      0  24  ffff8eb1a940e180  RU   0.0       0      0  [swapper/24]
      >     0      0  25  ffff8ec1294130c0  RU   0.0       0      0  [swapper/25]
      >     0      0  26  ffff8ed129f130c0  RU   0.0       0      0  [swapper/26]
      >     0      0  27  ffff8ee129f3e180  RU   0.0       0      0  [swapper/27]
      >     0      0  28  ffff8eb1a9409040  RU   0.0       0      0  [swapper/28]
      >     0      0  29  ffff8ec129430000  RU   0.0       0      0  [swapper/29]
      >     0      0  30  ffff8ed129f50000  RU   0.0       0      0  [swapper/30]
      >     0      0  31  ffff8ee129f39040  RU   0.0       0      0  [swapper/31]
      >     0      0  32  ffff8eb1a940d140  RU   0.0       0      0  [swapper/32]
      >     0      0  33  ffff8ec129436180  RU   0.0       0      0  [swapper/33]
      >     0      0  34  ffff8ed129f56180  RU   0.0       0      0  [swapper/34]
      >     0      0  35  ffff8ee129f3d140  RU   0.0       0      0  [swapper/35]
      >     0      0  36  ffff8eb1a940a080  RU   0.0       0      0  [swapper/36]
      >     0      0  37  ffff8ec129431040  RU   0.0       0      0  [swapper/37]
      >     0      0  38  ffff8ed129f51040  RU   0.0       0      0  [swapper/38]
      >     0      0  39  ffff8ee129f3a080  RU   0.0       0      0  [swapper/39]
      >     0      0  40  ffff8eb1a940c100  RU   0.0       0      0  [swapper/40]
      >     0      0  41  ffff8ec129435140  RU   0.0       0      0  [swapper/41]
      >     0      0  42  ffff8ed129f55140  RU   0.0       0      0  [swapper/42]
      >     0      0  43  ffff8ee129f3c100  RU   0.0       0      0  [swapper/43]
      >     0      0  44  ffff8eb1a940b0c0  RU   0.0       0      0  [swapper/44]
      >     0      0  45  ffff8ec129432080  RU   0.0       0      0  [swapper/45]
      >     0      0  46  ffff8ed129f52080  RU   0.0       0      0  [swapper/46]
      >     0      0  47  ffff8ee129f3b0c0  RU   0.0       0      0  [swapper/47]
      > 109549  109543  18  ffff8ee05406c100  RU   0.0  115440   2132  bash
       

      I noticed a lot of threads blocked on quota commands.

      Attachments

        1. bt.all
          1.68 MB
        2. fir-md1-s2-20190424-ldiskfs-event.log
          118 kB
        3. vmcore-dmesg.txt
          1.01 MB

        Activity

          [LU-12178] MDS deadlock with 2.12.0 (quotas?)

           gerrit Gerrit Updater added a comment -

           Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34926/
           Subject: LU-12178 osd: do not rebalance quota under memory pressure
           Project: fs/lustre-release
           Branch: b2_12
           Current Patch Set:
           Commit: 8c0b1c9af812140bde14180a318ace834d077d4b

           gerrit Gerrit Updater added a comment -

           Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34926
           Subject: LU-12178 osd: do not rebalance quota under memory pressure
           Project: fs/lustre-release
           Branch: b2_12
           Current Patch Set: 1
           Commit: 380447f63b8a9e6a232e1ea81b1e68c39bc28cf2

          pjones Peter Jones added a comment -

          Landed for 2.13


           gerrit Gerrit Updater added a comment -

           Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34741/
           Subject: LU-12178 osd: do not rebalance quota under memory pressure
           Project: fs/lustre-release
           Branch: master
           Current Patch Set:
           Commit: c5e5b7cd872eb2fa0028cef8b1a5e5c51b085b44

           sthiell Stephane Thiell added a comment -

           We do have 4x18TB MDTs for DoM, so in case you want to see the formatting options, please see below (we do have the extent flag):

           

          [root@fir-md1-s1 ~]# dumpe2fs -h /dev/mapper/md1-rbod1-mdt0 
          dumpe2fs 1.44.3.wc1 (23-July-2018)
          Filesystem volume name:   fir-MDT0000
          Last mounted on:          /
          Filesystem UUID:          d929671c-a108-4120-86aa-783d4601057a
          Filesystem magic number:  0xEF53
          Filesystem revision #:    1 (dynamic)
          Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
          Filesystem flags:         signed_directory_hash 
          Default mount options:    user_xattr acl
          Filesystem state:         clean
          Errors behavior:          Continue
          Filesystem OS type:       Linux
          Inode count:              288005760
          Block count:              4681213440
          Reserved block count:     234060672
          Free blocks:              4219981031
          Free inodes:              250082482
          First block:              0
          Block size:               4096
          Fragment size:            4096
          Group descriptor size:    64
          Blocks per group:         32768
          Fragments per group:      32768
          Inodes per group:         2016
          Inode blocks per group:   504
          Flex block group size:    16
          Filesystem created:       Thu Jan 24 14:00:46 2019
          Last mount time:          Fri Apr 26 06:56:02 2019
          Last write time:          Fri Apr 26 06:56:02 2019
          Mount count:              57
          Maximum mount count:      -1
          Last checked:             Thu Jan 24 14:00:46 2019
          Check interval:           0 (<none>)
          Lifetime writes:          23 TB
          Reserved blocks uid:      0 (user root)
          Reserved blocks gid:      0 (group root)
          First inode:              11
          Inode size:	          1024
          Required extra isize:     32
          Desired extra isize:      32
          Journal inode:            8
          Default directory hash:   half_md4
          Directory Hash Seed:      d9ae92da-e0cd-43f5-a26b-e6a4e9c64832
          Journal backup:           inode blocks
          MMP block number:         10335
          MMP update interval:      5
          User quota inode:         3
          Group quota inode:        4
          Journal features:         journal_incompat_revoke journal_64bit
          Journal size:             4096M
          Journal length:           1048576
          Journal sequence:         0x010f2bf0
          Journal start:            1
          MMP_block:
              mmp_magic: 0x4d4d50
              mmp_check_interval: 10
              mmp_sequence: 0x00030d
              mmp_update_date: Fri Apr 26 08:01:02 2019
              mmp_update_time: 1556290862
              mmp_node_name: fir-md1-s1
              mmp_device_name: dm-4
          

           sthiell Stephane Thiell added a comment -

           We haven't applied the patch yet and the problem has not happened again, but while checking the server logs I noticed an ldiskfs-related event which looks like ldiskfs blocked in list_sort. The server did recover and we had no report of a slowdown, but just in case I attached fir-md1-s2-20190424-ldiskfs-event.log.


           sthiell Stephane Thiell added a comment -

           Thanks bzzz, that sounds great! We'll likely wait until the patch has landed in master, unless the issue happens again before that.


           gerrit Gerrit Updater added a comment -

           Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34741
           Subject: LU-12178 osd: do not rebalance quota under memory pressure
           Project: fs/lustre-release
           Branch: master
           Current Patch Set: 1
           Commit: 895a4dfb30ca4b449a35aff61bc23954ad43d643

           bzzz Alex Zhuravlev added a comment -

           So the issue is that quota acquisition is supposed to allocate on-disk space for the given ID, and the existing infrastructure can specify that the space is already allocated and there is no need to start a transaction.
           I think it's safe to just remove the quota adjustment from the osd_object_delete() path, but that wouldn't be optimal in some cases, as we want to keep the local vs. global quota grant balanced.
           This doesn't need to be done in the context of a process releasing memory; rather, it adds extra pressure on the VM subsystem, since a quota rebalance may need additional memory (locks, RPCs, etc.).
           I'd say it would be nice to schedule the rebalance and do it periodically in a separate context.
           I'm not sure what exact structure to use. The simplest one would be some sort of ID list, probably batched per CPU to save a bit of memory on list_heads, though if millions of objects sharing a few IDs are being removed we'd waste quite an amount of memory.
           A smarter structure like a radix-tree-based bitmap would do better, at the cost of development effort.
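
           As a purely illustrative aside (not the actual Lustre patch, and with hypothetical names): the deferred-rebalance idea above could look roughly like the minimal userspace C sketch below, which uses a single mutex-protected list instead of the per-CPU batching or radix-tree bitmap discussed, and does not deduplicate IDs.

           /*
            * Hypothetical sketch: defer quota rebalancing out of the delete path.
            * The deleting thread only records the quota ID; a background worker
            * drains the pending IDs periodically, outside of memory reclaim.
            */
           #include <pthread.h>
           #include <stdio.h>
           #include <stdlib.h>
           #include <unistd.h>

           struct pending_id {
                   unsigned long long qid;          /* quota ID to rebalance later */
                   struct pending_id *next;
           };

           static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;
           static struct pending_id *pending_list;  /* IDs queued for rebalance */

           /* Hot path (e.g. object delete): just record the ID and return. */
           static void defer_quota_rebalance(unsigned long long qid)
           {
                   struct pending_id *p = malloc(sizeof(*p));

                   if (!p)
                           return;  /* best effort: drop the hint under memory pressure */
                   p->qid = qid;
                   pthread_mutex_lock(&pending_lock);
                   p->next = pending_list;
                   pending_list = p;
                   pthread_mutex_unlock(&pending_lock);
           }

           /* Stand-in for the real work: adjust the grant with the quota master. */
           static void rebalance_one(unsigned long long qid)
           {
                   printf("rebalancing quota grant for id %llu\n", qid);
           }

           /* Periodic worker: drains the pending list in its own context. */
           static void *rebalance_worker(void *arg)
           {
                   (void)arg;
                   for (;;) {
                           pthread_mutex_lock(&pending_lock);
                           struct pending_id *batch = pending_list;
                           pending_list = NULL;
                           pthread_mutex_unlock(&pending_lock);

                           while (batch) {
                                   struct pending_id *next = batch->next;

                                   rebalance_one(batch->qid);
                                   free(batch);
                                   batch = next;
                           }
                           sleep(1);  /* rebalance period */
                   }
                   return NULL;
           }

           int main(void)
           {
                   pthread_t tid;

                   pthread_create(&tid, NULL, rebalance_worker, NULL);
                   /* Simulate a burst of deletions sharing a few quota IDs. */
                   for (int i = 0; i < 1000; i++)
                           defer_quota_rebalance(i % 4);
                   sleep(2);
                   return 0;
           }

           A real implementation would deduplicate IDs (so millions of unlinks sharing a few IDs queue only a handful of entries) and bound the memory used by the pending set, which is where the per-CPU batching or radix-tree-based bitmap mentioned above comes in.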


           bzzz Alex Zhuravlev added a comment -

           Sorry for the late response. I tend to think patching ldiskfs would fix just this specific case, while there might be more similar cases.
           It is not required to release quota to the master right away; this can be done later in a more appropriate context (or at the master's request). I think it's similar to LU-12018. I will make a patch quickly.


          People

            Assignee: bzzz Alex Zhuravlev
            Reporter: sthiell Stephane Thiell
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: