[LU-12178] MDS deadlock with 2.12.0 (quotas?) Created: 11/Apr/19 Updated: 08/Aug/19 Resolved: 30/Apr/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Stephane Thiell | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
CentOS 7.6 - 3.10.0-957.1.3.el7_lustre.x86_64 |
||
| Attachments: | fir-md1-s2-20190424-ldiskfs-event.log |
||
| Issue Links: |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We just got some kind of deadlock on fir-md1-s2, which serves MDT0001 and MDT0003. I took a crash dump because the MDS was not usable and the filesystem was hanging. Also, this is from the crash:
crash> foreach bt >bt.all
crash> kmem -i
PAGES TOTAL PERCENTAGE
TOTAL MEM 65891316 251.4 GB ----
FREE 13879541 52.9 GB 21% of TOTAL MEM
USED 52011775 198.4 GB 78% of TOTAL MEM
SHARED 32692923 124.7 GB 49% of TOTAL MEM
BUFFERS 32164243 122.7 GB 48% of TOTAL MEM
CACHED 781150 3 GB 1% of TOTAL MEM
SLAB 13801776 52.6 GB 20% of TOTAL MEM
TOTAL HUGE 0 0 ----
HUGE FREE 0 0 0% of TOTAL HUGE
TOTAL SWAP 1048575 4 GB ----
SWAP USED 0 0 0% of TOTAL SWAP
SWAP FREE 1048575 4 GB 100% of TOTAL SWAP
COMMIT LIMIT 33994233 129.7 GB ----
COMMITTED 228689 893.3 MB 0% of TOTAL LIMIT
crash> ps | grep ">"
> 0 0 0 ffffffffb4218480 RU 0.0 0 0 [swapper/0]
> 0 0 1 ffff8ec129410000 RU 0.0 0 0 [swapper/1]
> 0 0 2 ffff8ed129f10000 RU 0.0 0 0 [swapper/2]
> 0 0 3 ffff8ee129ebe180 RU 0.0 0 0 [swapper/3]
> 0 0 4 ffff8eb1a9ba6180 RU 0.0 0 0 [swapper/4]
> 0 0 5 ffff8ec129416180 RU 0.0 0 0 [swapper/5]
> 0 0 6 ffff8ed129f16180 RU 0.0 0 0 [swapper/6]
> 0 0 7 ffff8ee129eba080 RU 0.0 0 0 [swapper/7]
> 0 0 8 ffff8eb1a9ba30c0 RU 0.0 0 0 [swapper/8]
> 0 0 9 ffff8ec129411040 RU 0.0 0 0 [swapper/9]
> 0 0 10 ffff8ed129f11040 RU 0.0 0 0 [swapper/10]
> 0 0 11 ffff8ee129ebd140 RU 0.0 0 0 [swapper/11]
> 0 0 12 ffff8eb1a9ba5140 RU 0.0 0 0 [swapper/12]
> 0 0 13 ffff8ec129415140 RU 0.0 0 0 [swapper/13]
> 0 0 14 ffff8ed129f15140 RU 0.0 0 0 [swapper/14]
> 0 0 15 ffff8ee129ebb0c0 RU 0.0 0 0 [swapper/15]
> 0 0 16 ffff8eb1a9ba4100 RU 0.0 0 0 [swapper/16]
> 0 0 17 ffff8ec129412080 RU 0.0 0 0 [swapper/17]
> 0 0 19 ffff8ee129ebc100 RU 0.0 0 0 [swapper/19]
> 0 0 20 ffff8eb1a9408000 RU 0.0 0 0 [swapper/20]
> 0 0 21 ffff8ec129414100 RU 0.0 0 0 [swapper/21]
> 0 0 22 ffff8ed129f14100 RU 0.0 0 0 [swapper/22]
> 0 0 23 ffff8ee129f38000 RU 0.0 0 0 [swapper/23]
> 0 0 24 ffff8eb1a940e180 RU 0.0 0 0 [swapper/24]
> 0 0 25 ffff8ec1294130c0 RU 0.0 0 0 [swapper/25]
> 0 0 26 ffff8ed129f130c0 RU 0.0 0 0 [swapper/26]
> 0 0 27 ffff8ee129f3e180 RU 0.0 0 0 [swapper/27]
> 0 0 28 ffff8eb1a9409040 RU 0.0 0 0 [swapper/28]
> 0 0 29 ffff8ec129430000 RU 0.0 0 0 [swapper/29]
> 0 0 30 ffff8ed129f50000 RU 0.0 0 0 [swapper/30]
> 0 0 31 ffff8ee129f39040 RU 0.0 0 0 [swapper/31]
> 0 0 32 ffff8eb1a940d140 RU 0.0 0 0 [swapper/32]
> 0 0 33 ffff8ec129436180 RU 0.0 0 0 [swapper/33]
> 0 0 34 ffff8ed129f56180 RU 0.0 0 0 [swapper/34]
> 0 0 35 ffff8ee129f3d140 RU 0.0 0 0 [swapper/35]
> 0 0 36 ffff8eb1a940a080 RU 0.0 0 0 [swapper/36]
> 0 0 37 ffff8ec129431040 RU 0.0 0 0 [swapper/37]
> 0 0 38 ffff8ed129f51040 RU 0.0 0 0 [swapper/38]
> 0 0 39 ffff8ee129f3a080 RU 0.0 0 0 [swapper/39]
> 0 0 40 ffff8eb1a940c100 RU 0.0 0 0 [swapper/40]
> 0 0 41 ffff8ec129435140 RU 0.0 0 0 [swapper/41]
> 0 0 42 ffff8ed129f55140 RU 0.0 0 0 [swapper/42]
> 0 0 43 ffff8ee129f3c100 RU 0.0 0 0 [swapper/43]
> 0 0 44 ffff8eb1a940b0c0 RU 0.0 0 0 [swapper/44]
> 0 0 45 ffff8ec129432080 RU 0.0 0 0 [swapper/45]
> 0 0 46 ffff8ed129f52080 RU 0.0 0 0 [swapper/46]
> 0 0 47 ffff8ee129f3b0c0 RU 0.0 0 0 [swapper/47]
> 109549 109543 18 ffff8ee05406c100 RU 0.0 115440 2132 bash
I noticed a lot of threads blocked on quota commands. |
| Comments |
| Comment by Peter Jones [ 11/Apr/19 ] |
|
Alex, could you please advise? Thanks, Peter |
| Comment by Patrick Farrell (Inactive) [ 11/Apr/19 ] |
|
Alex, so the issue is that we're going into reclaim. Doesn't that just mean we need to use GFP_NOFS here in ldiskfs_htree_store_dirent?
len = sizeof(struct fname) + dirent->name_len + extra_data + 1;
new_fn = kzalloc(len, GFP_KERNEL);
if (!new_fn)
        return -ENOMEM;
new_fn->hash = hash;
The semaphore oo_ext_idx_sem is taken in osd_ldiskfs_it_fill, and it is also needed in truncate. So we're holding that semaphore, this allocation drops into memory reclaim, and when reclaim spills over into the filesystem we deadlock on it. I haven't worked through everything, but there's at least one thread stuck in truncate:
[2794948.051149] [<ffffffffb3d67c49>] schedule+0x29/0x70
[2794948.056299] [<ffffffffc0406085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[2794948.063366] [<ffffffffb36c2d00>] ? wake_up_atomic_t+0x30/0x30
[2794948.069377] [<ffffffffc0406368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[2794948.076612] [<ffffffffc13cb2e0>] ? osd_trunc_lock+0xe0/0x210 [osd_ldiskfs]
[2794948.083756] [<ffffffffc04065e1>] start_this_handle+0x1a1/0x430 [jbd2]
[2794948.090474] [<ffffffffc1615121>] ? lod_sub_declare_write+0xe1/0x2a0 [lod]
[2794948.097541] [<ffffffffb381c0b2>] ? kmem_cache_alloc+0x1c2/0x1f0
[2794948.103732] [<ffffffffc0406a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[2794948.110531] [<ffffffffc13ab22e>] ? osd_trans_start+0x20e/0x4e0 [osd_ldiskfs]
[2794948.117871] [<ffffffffc12b82c9>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[2794948.125446] [<ffffffffc13ab22e>] osd_trans_start+0x20e/0x4e0 [osd_ldiskfs]
[2794948.132616] [<ffffffffc1043612>] top_trans_start+0x702/0x940 [ptlrpc]
[2794948.139346] [<ffffffffc15d6494>] lod_trans_start+0x34/0x40 [lod]
[2794948.145640] [<ffffffffc1689bca>] mdd_trans_start+0x1a/0x20 [mdd]
[2794948.151914] [<ffffffffc167de9f>] mdd_attr_set+0x56f/0xca0 [mdd]
[2794948.158120] [<ffffffffc150d2e0>] mdt_mfd_close+0x750/0x850 [mdt]
[2794948.164398] [<ffffffffc15127d1>] mdt_close_internal+0x121/0x220 [mdt]
[2794948.171117] [<ffffffffc1512af0>] mdt_close+0x220/0x780 [mdt]
[2794948.177090] [<ffffffffc103035a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[2794948.184172] [<ffffffffc1009a51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[2794948.191918] [<ffffffffc0aefbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[2794948.199189] [<ffffffffc0fd492b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2794948.207047] [<ffffffffc0fd17b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[2794948.214015] [<ffffffffb36d67c2>] ? default_wake_function+0x12/0x20
[2794948.220462] [<ffffffffb36cba9b>] ? __wake_up_common+0x5b/0x90
[2794948.226506] [<ffffffffc0fd825c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[2794948.232998] [<ffffffffc0fd7760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[2794948.240570] [<ffffffffb36c1c31>] kthread+0xd1/0xe0
[2794948.245634] [<ffffffffb36c1b60>] ? insert_kthread_work+0x40/0x40
[2794948.251920] [<ffffffffb3d74c24>] ret_from_fork_nospec_begin+0xe/0x21
[2794948.258536] [<ffffffffb36c1b60>] ? insert_kthread_work+0x40/0x40
So I'm pretty sure the solution is just GFP_NOFS in that call. Also checking to see if we need it in ldiskfs_htree_create_dir_info, which is in the same file and also uses just GFP_KERNEL...
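For illustration, a minimal sketch of the proposed change, using only the lines quoted above (the rest of the function stays as-is):
len = sizeof(struct fname) + dirent->name_len + extra_data + 1;
/* GFP_NOFS keeps this allocation out of filesystem reclaim, so a thread
 * already holding oo_ext_idx_sem cannot re-enter the filesystem and
 * deadlock while satisfying it */
new_fn = kzalloc(len, GFP_NOFS); /* was GFP_KERNEL */
if (!new_fn)
        return -ENOMEM;
new_fn->hash = hash; |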
| Comment by Patrick Farrell (Inactive) [ 11/Apr/19 ] |
|
Ah, yes, we do need GFP_NOFS in ldiskfs_htree_create_dir_info as well; it's called from readdir, which is called from osd_ldiskfs_it_fill with the semaphore held.
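Sketched the same way (hedged: the exact allocation shape here is assumed from the ext4 equivalent, ext4_htree_create_dir_info, not quoted from the ldiskfs source):
/* readdir -> osd_ldiskfs_it_fill() holds oo_ext_idx_sem around this
 * call, so the dir-info allocation must avoid filesystem reclaim too */
p = kzalloc(sizeof(*p), GFP_NOFS); /* was GFP_KERNEL */
if (!p)
        return NULL; |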
| Comment by Patrick Farrell (Inactive) [ 11/Apr/19 ] |
|
Alex, I can look at adding a new ldiskfs patch for this, unless you think we should try to unpick the deadlock itself rather than just go to GFP_NOFS there? EXT4 is able to use GFP_KERNEL, but since we're taking additional locks at the OSD layer, we either have to use GFP_NOFS or figure out how to change our locking... |
| Comment by Stephane Thiell [ 17/Apr/19 ] |
|
bzzz, what do you think? I fear this issue can happen again... thanks! |
| Comment by Alex Zhuravlev [ 18/Apr/19 ] |
|
Sorry for the late response... I tend to think patching ldiskfs would fix just this specific case, while there might be more similar cases. |
| Comment by Alex Zhuravlev [ 22/Apr/19 ] |
|
So the issue is that quota acquisition is supposed to allocate on-disk space for the given id, and the existing infrastructure can indicate that the space is already allocated, so there is no need to start a transaction.
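As a purely illustrative sketch of that idea (the type and helper names below are hypothetical, not the actual code in change 34741): once the on-disk slot for an id is known to exist, the acquire path returns on a transaction-free fast path and never enters the allocation/reclaim path that fed this deadlock.
#include <stdbool.h>

struct qid_slot {
        bool on_disk;   /* set once on-disk space for this id exists */
};

/* slow path: in the real code this would start a journal transaction
 * and allocate the on-disk quota entry for the id */
static int alloc_slot_in_txn(struct qid_slot *s)
{
        s->on_disk = true;
        return 0;
}

static int quota_acquire(struct qid_slot *s)
{
        if (s->on_disk)
                return 0;       /* already allocated: no transaction needed */
        return alloc_slot_in_txn(s);
} |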
| Comment by Gerrit Updater [ 23/Apr/19 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34741 |
| Comment by Stephane Thiell [ 24/Apr/19 ] |
|
Thanks bzzz, that sounds great! We'll likely wait until the patch has landed into master unless the issue happens again before that. |
| Comment by Stephane Thiell [ 26/Apr/19 ] |
|
We haven't applied the patch yet and the problem has not happened again, but while checking the server logs of fir-md1-s2 I noticed an ldiskfs-related event which looks like ldiskfs blocked in list_sort. The server did recover and we had no report of a slowdown, but... just in case, I attached fir-md1-s2-20190424-ldiskfs-event.log |
| Comment by Stephane Thiell [ 26/Apr/19 ] |
|
We do have 4x 18TB MDTs for DoM, so just in case you want to see the formatting options, please see below (we do have the extent flag):
[root@fir-md1-s1 ~]# dumpe2fs -h /dev/mapper/md1-rbod1-mdt0
dumpe2fs 1.44.3.wc1 (23-July-2018)
Filesystem volume name: fir-MDT0000
Last mounted on: /
Filesystem UUID: d929671c-a108-4120-86aa-783d4601057a
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 288005760
Block count: 4681213440
Reserved block count: 234060672
Free blocks: 4219981031
Free inodes: 250082482
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 2016
Inode blocks per group: 504
Flex block group size: 16
Filesystem created: Thu Jan 24 14:00:46 2019
Last mount time: Fri Apr 26 06:56:02 2019
Last write time: Fri Apr 26 06:56:02 2019
Mount count: 57
Maximum mount count: -1
Last checked: Thu Jan 24 14:00:46 2019
Check interval: 0 (<none>)
Lifetime writes: 23 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 1024
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: d9ae92da-e0cd-43f5-a26b-e6a4e9c64832
Journal backup: inode blocks
MMP block number: 10335
MMP update interval: 5
User quota inode: 3
Group quota inode: 4
Journal features: journal_incompat_revoke journal_64bit
Journal size: 4096M
Journal length: 1048576
Journal sequence: 0x010f2bf0
Journal start: 1
MMP_block:
mmp_magic: 0x4d4d50
mmp_check_interval: 10
mmp_sequence: 0x00030d
mmp_update_date: Fri Apr 26 08:01:02 2019
mmp_update_time: 1556290862
mmp_node_name: fir-md1-s1
mmp_device_name: dm-4
|
| Comment by Gerrit Updater [ 30/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34741/ |
| Comment by Peter Jones [ 30/Apr/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 21/May/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34926 |
| Comment by Gerrit Updater [ 08/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34926/ |