[LU-12935] MDT deadlock on 2.12.3 with DoM; is it missing async_discard feature? - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.12.4
Affects Version/s: Lustre 2.12.3
Labels:
None
Environment:
CentOS 7.6

Severity:
1
Rank (Obsolete):
9223372036854775807

Description

Currently we cannot make MDT0 work again on Fir (2.12.3) due to these backtraces and lock timeout:

Nov 04 18:30:17 fir-md1-s1 kernel: Pid: 32408, comm: mdt01_024 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
Nov 04 18:30:17 fir-md1-s1 kernel: Call Trace:
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10ccac0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10cd5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15d850b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15d8b90>] mdt_object_lock_internal+0x70/0x360 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15d8ea0>] mdt_object_lock+0x20/0x30 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc1617c4b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15c64bf>] mdt_intent_brw+0x1f/0x30 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15debb5>] mdt_intent_policy+0x435/0xd80 [mdt]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10b3d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10dc336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc1164a12>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc116936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc111024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc1113bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffbe8c2e81>] kthread+0xd1/0xe0
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffbef77c24>] ret_from_fork_nospec_begin+0xe/0x21
Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Nov 04 18:30:17 fir-md1-s1 kernel: LNet: Service thread pid 32415 was inactive for 201.19s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Nov 04 18:30:17 fir-md1-s1 kernel: LNet: Skipped 1 previous similar message
Nov 04 18:31:56 fir-md1-s1 kernel: LustreError: 32601:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572920815, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffffa10e0b407bc0/0x675682d854098c0 lrc: 3/0,1 mode: --/PW res: [0x200034eb7:0x1:0x0].0x0 bits 0x40/0x0 rrc: 912 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 32601 timeout: 0 lvb_type: 0
Nov 04 18:31:56 fir-md1-s1 kernel: LustreError: 32601:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 224 previous similar messages
Nov 04 18:34:25 fir-md1-s1 kernel: LustreError: 32404:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572920965, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffffa12da9254ec0/0x675682d8540d5dd lrc: 3/0,1 mode: --/PW res: [0x200034eb7:0x1:0x0].0x0 bits 0x40/0x0 rrc: 913 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 32404 timeout: 0 lvb_type: 0
Nov 04 18:34:25 fir-md1-s1 kernel: LustreError: 32404:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 161 previous similar messages

This looks similar to LU-11358.

I also just noticed another thing, on our new 2.12.3 clients, I can't find the async_discard import flag:

[root@sh-117-13 ~]# lctl get_param mdc.fir-*.import | grep connect_flags
    connect_flags: [ write_grant, server_lock, version, acl, xattr, create_on_write, truncate_lock, inode_bit_locks, getattr_by_fid, no_oh_for_devices, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, pools, large_ea, full20, layout_lock, 64bithash, jobstats, umask, einprogress, grant_param, lvb_type, short_io, flock_deadlock, disp_stripe, open_by_fid, lfsck, multi_mod_rpcs, dir_stripe, subtree, bulk_mbits, second_flags, file_secctx, dir_migrate, unknown, flr, lock_convert, archive_id_array, selinux_policy, lsom, unknown2_0x4000 ]

Is it unknown2_0x4000 ? Or is it missing??

Please confirm that the async_discard patch is missing from 2.12.3? If it's the case, we'll need to perform a full downgrade or emergency patching of the cluster and Lustre servers.

Attaching logs from MDS fir-md1-s1 which serves MDT0 and that we cannot make operational again at the moment. We even tried with abort_recovery with no luck.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

crash-sysrq-fir-md1-s1-foreach-bt.log
1.35 MB
05/Nov/19 4:13 AM
fir-md1-s1_20191105.log
758 kB
06/Nov/19 5:57 AM
fir-md1-s1_lfsck-results.log
3 kB
05/Nov/19 4:42 AM
fir-md1-s1-dk.log.gz
3.30 MB
05/Nov/19 3:32 AM
fir-md1-s1-dk2.log.gz
395 kB
05/Nov/19 5:11 AM
fir-md1-s1-MDT0.log
433 kB
05/Nov/19 2:42 AM
fir-rbh01.dk.log
651 kB
05/Nov/19 4:23 AM

Issue Links

is related to

LU-11421 DoM: manual migration OST-MDT, MDT-MDT

Resolved

LU-10664 DoM: make DoM lock enqueue non-blocking

Resolved

Activity

[LU-12935] MDT deadlock on 2.12.3 with DoM; is it missing async_discard feature?

Andreas Dilger added a comment - 26/Nov/19 11:59 PM

Stephane, it would make sense to get an strace (or equivalent Lustre "{{lctl set_param debug=+vfstrace +dlmtrace) from these jobs to see just how many times they write to the same file.

Mike, since it is possible to migrate DoM components to OSTs (either with full-file copy in 2.12 or via FLR mirror in 2.13 patch https://review.whamcloud.com/35359 "LU-11421 dom: manual OST-to-DOM migration via mirroring"), have you thought about automatically migrating files with high write lock contention from DoM to a regular OST object? Since the amount of data to be moved is very small (under 150KB in this case), the migration should be very fast, and it would allow extent locks to be used on the file.

That said, I have no idea how hard this would be, and only makes sense if there are multiple writers repeatedly contending on the same DoM file component (which I suspect is rare in most cases). Even here, it may be that if the clients are only writing to the same file a handful of times that the extra migration step would make the performance worse rather than better. If they write to the same file hundreds of times then it might be worthwhile to implement.

Even in IO-500 ior-hard-write the chunk size is 47008 bytes, so at most 2-3 ranks would be contending on a 64KB or 128KB DoM component, and we never had problems with this in our testing.

Andreas Dilger added a comment - 26/Nov/19 11:59 PM Stephane, it would make sense to get an strace (or equivalent Lustre "{{lctl set_param debug=+vfstrace +dlmtrace) from these jobs to see just how many times they write to the same file. Mike, since it is possible to migrate DoM components to OSTs (either with full-file copy in 2.12 or via FLR mirror in 2.13 patch https://review.whamcloud.com/35359 " LU-11421 dom: manual OST-to-DOM migration via mirroring "), have you thought about automatically migrating files with high write lock contention from DoM to a regular OST object? Since the amount of data to be moved is very small (under 150KB in this case), the migration should be very fast, and it would allow extent locks to be used on the file. That said, I have no idea how hard this would be, and only makes sense if there are multiple writers repeatedly contending on the same DoM file component (which I suspect is rare in most cases). Even here, it may be that if the clients are only writing to the same file a handful of times that the extra migration step would make the performance worse rather than better. If they write to the same file hundreds of times then it might be worthwhile to implement. Even in IO-500 ior-hard-write the chunk size is 47008 bytes, so at most 2-3 ranks would be contending on a 64KB or 128KB DoM component, and we never had problems with this in our testing.

Stephane Thiell added a comment - 08/Nov/19 7:52 PM

To avoid further issues for now, we have removed the default DoM stripping from all directories on this filesystem (only kept a PFL stripping). New files won't use DoM anymore. We'll see if that helps.

Stephane Thiell added a comment - 08/Nov/19 7:52 PM To avoid further issues for now, we have removed the default DoM stripping from all directories on this filesystem (only kept a PFL stripping). New files won't use DoM anymore. We'll see if that helps.

Stephane Thiell added a comment - 06/Nov/19 6:06 PM - edited

Thanks Mike. We're currently discussing about changing our default stripping to avoid further issues like these and perhaps only use DoM on specific cases.

FYI, yesterday, after I killed the suspected jobs, I still had to do a stop/start of MDT0 to resume filesystem operations (it didn't recover by itself).

Stephane Thiell added a comment - 06/Nov/19 6:06 PM - edited Thanks Mike. We're currently discussing about changing our default stripping to avoid further issues like these and perhaps only use DoM on specific cases. FYI, yesterday, after I killed the suspected jobs, I still had to do a stop/start of MDT0 to resume filesystem operations (it didn't recover by itself).

Mikhail Pershin added a comment - 06/Nov/19 6:44 AM

Stephane, there is one ticket for DoM improvements which can improve such access patterns, ~~LU-10664~~, but it has no patch at the moment.

Mikhail Pershin added a comment - 06/Nov/19 6:44 AM Stephane, there is one ticket for DoM improvements which can improve such access patterns, LU-10664 , but it has no patch at the moment.

Stephane Thiell added a comment - 06/Nov/19 6:29 AM - edited

Ah yes, I see, thanks. And I just found 250+ files like these that may be used by these aero-f jobs:

-rw-r--r-- 1 jbho cfarhat  145736 Nov  5 21:27 /scratch/users/jbho/aerof_simulations/maewing/DEFAULT.D2W115
-rw-r--r-- 1 jbho cfarhat  149248 Nov  5 21:27 /scratch/users/jbho/aerof_simulations/maewing/DEFAULT.D2W116

So they are small files but use the full 128KB DoM region but they are all on a single MDT, and accessed by 4 x 240 tasks (potentially).

I'll make changes so that these won't be using DoM anymore.

Stephane Thiell added a comment - 06/Nov/19 6:29 AM - edited Ah yes, I see, thanks. And I just found 250+ files like these that may be used by these aero-f jobs: -rw-r--r-- 1 jbho cfarhat 145736 Nov 5 21:27 /scratch/users/jbho/aerof_simulations/maewing/DEFAULT.D2W115 -rw-r--r-- 1 jbho cfarhat 149248 Nov 5 21:27 /scratch/users/jbho/aerof_simulations/maewing/DEFAULT.D2W116 So they are small files but use the full 128KB DoM region but they are all on a single MDT, and accessed by 4 x 240 tasks (potentially). I'll make changes so that these won't be using DoM anymore.

Mikhail Pershin added a comment - 06/Nov/19 6:14 AM

Stephane, yes, I was checking exactly that sysrq file but found no good candidates. As for DoM file efficiency, it is not about its size but how many processes are accessing it at the same time. I think with file DEFAULT.PKG many processes are trying to write to the file beginning so DOM region becomes bottkeneck and OST stripes are not so useful in that case, each process need access to DOM region first each time?

Mikhail Pershin added a comment - 06/Nov/19 6:14 AM Stephane, yes, I was checking exactly that sysrq file but found no good candidates. As for DoM file efficiency, it is not about its size but how many processes are accessing it at the same time. I think with file DEFAULT.PKG many processes are trying to write to the file beginning so DOM region becomes bottkeneck and OST stripes are not so useful in that case, each process need access to DOM region first each time?

Stephane Thiell added a comment - 06/Nov/19 5:57 AM

A similar situation happened again after the user relaunched his jobs, but the traces on the MDS are a bit different I think, at least the first ones. Looks like the filesystem is blocked again.

One of these traces:

Nov 05 20:58:18 fir-md1-s1 kernel: NMI watchdog: BUG: soft lockup - CPU#38 stuck for 22s! [mdt_io02_034:41734]
...
Nov 05 20:58:18 fir-md1-s1 kernel: CPU: 38 PID: 41734 Comm: mdt_io02_034 Kdump: loaded Tainted: G           OEL ------------   3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1
Nov 05 20:58:18 fir-md1-s1 kernel: Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019
Nov 05 20:58:18 fir-md1-s1 kernel: task: ffffa11e2a1e4100 ti: ffffa13cd3a70000 task.ti: ffffa13cd3a70000
Nov 05 20:58:18 fir-md1-s1 kernel: RIP: 0010:[<ffffffffbe913536>]  [<ffffffffbe913536>] native_queued_spin_lock_slowpath+0x126/0x200
Nov 05 20:58:18 fir-md1-s1 kernel: RSP: 0018:ffffa13cd3a73800  EFLAGS: 00000246
Nov 05 20:58:18 fir-md1-s1 kernel: RAX: 0000000000000000 RBX: ffffa130a9500be0 RCX: 0000000001310000
Nov 05 20:58:18 fir-md1-s1 kernel: RDX: ffffa12e3f8db780 RSI: 0000000001710101 RDI: ffffa13e3710f480
Nov 05 20:58:18 fir-md1-s1 kernel: RBP: ffffa13cd3a73800 R08: ffffa12e3f85b780 R09: 0000000000000000
Nov 05 20:58:18 fir-md1-s1 kernel: R10: ffffa12e3f85f140 R11: ffffda91d59da200 R12: 0000000000000000
Nov 05 20:58:18 fir-md1-s1 kernel: R13: ffffa13cd3a737a0 R14: ffffa130a9500948 R15: 0000000000000000
Nov 05 20:58:18 fir-md1-s1 kernel: FS:  00007f38ccb1d700(0000) GS:ffffa12e3f840000(0000) knlGS:0000000000000000
Nov 05 20:58:18 fir-md1-s1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 05 20:58:18 fir-md1-s1 kernel: CR2: 000000000124f178 CR3: 000000364fa10000 CR4: 00000000003407e0
Nov 05 20:58:18 fir-md1-s1 kernel: Call Trace:
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbef5f2cb>] queued_spin_lock_slowpath+0xb/0xf
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbef6d7a0>] _raw_spin_lock+0x20/0x30
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc13e2c07>] ldiskfs_es_lru_add+0x57/0x90 [ldiskfs]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc13ad6a5>] ldiskfs_ext_map_blocks+0x7b5/0xf60 [ldiskfs]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe902372>] ? ktime_get_ts64+0x52/0xf0
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe903612>] ? ktime_get+0x52/0xe0
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc0bab14b>] ? kiblnd_post_tx_locked+0x7bb/0xa50 [ko2iblnd]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc13e9728>] ldiskfs_map_blocks+0x98/0x700 [ldiskfs]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc0b40203>] ? cfs_hash_bd_lookup_intent+0x63/0x170 [libcfs]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe902372>] ? ktime_get_ts64+0x52/0xf0
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc14bab63>] osd_ldiskfs_map_inode_pages+0x143/0x420 [osd_ldiskfs]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc14bc996>] osd_write_prep+0x2b6/0x360 [osd_ldiskfs]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc1614c3b>] mdt_obd_preprw+0x65b/0x10a0 [mdt]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc116d1bc>] tgt_brw_write+0xc7c/0x1cf0 [ptlrpc]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe8e59c8>] ? load_balance+0x178/0x9a0
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe8e143c>] ? update_curr+0x14c/0x1e0
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe8dca58>] ? __enqueue_entity+0x78/0x80
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe8e367f>] ? enqueue_entity+0x2ef/0xbe0
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc1159a7d>] ? tgt_lookup_reply+0x2d/0x190 [ptlrpc]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc116936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc1144da1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc0b34bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc111024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc110b805>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe8cfeb4>] ? __wake_up+0x44/0x50
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc1113bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffc1113080>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe8c2e81>] kthread+0xd1/0xe0
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe8c2db0>] ? insert_kthread_work+0x40/0x40
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbef77c24>] ret_from_fork_nospec_begin+0xe/0x21
Nov 05 20:58:18 fir-md1-s1 kernel:  [<ffffffffbe8c2db0>] ? insert_kthread_work+0x40/0x40

Attaching MDS logs as fir-md1-s1_20191105.log

Stephane Thiell added a comment - 06/Nov/19 5:57 AM A similar situation happened again after the user relaunched his jobs, but the traces on the MDS are a bit different I think, at least the first ones. Looks like the filesystem is blocked again. One of these traces: Nov 05 20:58:18 fir-md1-s1 kernel: NMI watchdog: BUG: soft lockup - CPU#38 stuck for 22s! [mdt_io02_034:41734] ... Nov 05 20:58:18 fir-md1-s1 kernel: CPU: 38 PID: 41734 Comm: mdt_io02_034 Kdump: loaded Tainted: G OEL ------------ 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 Nov 05 20:58:18 fir-md1-s1 kernel: Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019 Nov 05 20:58:18 fir-md1-s1 kernel: task: ffffa11e2a1e4100 ti: ffffa13cd3a70000 task.ti: ffffa13cd3a70000 Nov 05 20:58:18 fir-md1-s1 kernel: RIP: 0010:[<ffffffffbe913536>] [<ffffffffbe913536>] native_queued_spin_lock_slowpath+0x126/0x200 Nov 05 20:58:18 fir-md1-s1 kernel: RSP: 0018:ffffa13cd3a73800 EFLAGS: 00000246 Nov 05 20:58:18 fir-md1-s1 kernel: RAX: 0000000000000000 RBX: ffffa130a9500be0 RCX: 0000000001310000 Nov 05 20:58:18 fir-md1-s1 kernel: RDX: ffffa12e3f8db780 RSI: 0000000001710101 RDI: ffffa13e3710f480 Nov 05 20:58:18 fir-md1-s1 kernel: RBP: ffffa13cd3a73800 R08: ffffa12e3f85b780 R09: 0000000000000000 Nov 05 20:58:18 fir-md1-s1 kernel: R10: ffffa12e3f85f140 R11: ffffda91d59da200 R12: 0000000000000000 Nov 05 20:58:18 fir-md1-s1 kernel: R13: ffffa13cd3a737a0 R14: ffffa130a9500948 R15: 0000000000000000 Nov 05 20:58:18 fir-md1-s1 kernel: FS: 00007f38ccb1d700(0000) GS:ffffa12e3f840000(0000) knlGS:0000000000000000 Nov 05 20:58:18 fir-md1-s1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Nov 05 20:58:18 fir-md1-s1 kernel: CR2: 000000000124f178 CR3: 000000364fa10000 CR4: 00000000003407e0 Nov 05 20:58:18 fir-md1-s1 kernel: Call Trace: Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbef5f2cb>] queued_spin_lock_slowpath+0xb/0xf Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbef6d7a0>] _raw_spin_lock+0x20/0x30 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc13e2c07>] ldiskfs_es_lru_add+0x57/0x90 [ldiskfs] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc13ad6a5>] ldiskfs_ext_map_blocks+0x7b5/0xf60 [ldiskfs] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe902372>] ? ktime_get_ts64+0x52/0xf0 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe903612>] ? ktime_get+0x52/0xe0 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc0bab14b>] ? kiblnd_post_tx_locked+0x7bb/0xa50 [ko2iblnd] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc13e9728>] ldiskfs_map_blocks+0x98/0x700 [ldiskfs] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc0b40203>] ? cfs_hash_bd_lookup_intent+0x63/0x170 [libcfs] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe902372>] ? ktime_get_ts64+0x52/0xf0 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc14bab63>] osd_ldiskfs_map_inode_pages+0x143/0x420 [osd_ldiskfs] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc14bc996>] osd_write_prep+0x2b6/0x360 [osd_ldiskfs] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc1614c3b>] mdt_obd_preprw+0x65b/0x10a0 [mdt] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc116d1bc>] tgt_brw_write+0xc7c/0x1cf0 [ptlrpc] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe8e59c8>] ? load_balance+0x178/0x9a0 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe8e143c>] ? update_curr+0x14c/0x1e0 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe8dca58>] ? __enqueue_entity+0x78/0x80 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe8e367f>] ? enqueue_entity+0x2ef/0xbe0 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc1159a7d>] ? tgt_lookup_reply+0x2d/0x190 [ptlrpc] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc116936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc1144da1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc0b34bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc111024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc110b805>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe8cfeb4>] ? __wake_up+0x44/0x50 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc1113bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffc1113080>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe8c2e81>] kthread+0xd1/0xe0 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe8c2db0>] ? insert_kthread_work+0x40/0x40 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbef77c24>] ret_from_fork_nospec_begin+0xe/0x21 Nov 05 20:58:18 fir-md1-s1 kernel: [<ffffffffbe8c2db0>] ? insert_kthread_work+0x40/0x40 Attaching MDS logs as fir-md1-s1_20191105.log

Stephane Thiell added a comment - 05/Nov/19 9:59 PM

Hi Mike,

Thanks for looking at this!

You say the whole DoM region, but in our case, the DoM size is only 128 KB. It's less than the size of an OST stripe (4MB), so I'm not sure to understand why DoM wouldn't be as efficient. For larger DoM sizes, I understand.

And we do have arrays of SSDs for the MDT storage. So disk I/Os for DoM should not be a problem.

Our main problem here is that it blocked MDT0 and thus the filesystem/namespace. I was hoping you could find a blocking thread maybe from crash-sysrq-fir-md1-s1-foreach-bt.log . Do you want the associated crash dump? Let me know, I can upload it to your FTP.

Stephane Thiell added a comment - 05/Nov/19 9:59 PM Hi Mike, Thanks for looking at this! You say the whole DoM region, but in our case, the DoM size is only 128 KB. It's less than the size of an OST stripe (4MB), so I'm not sure to understand why DoM wouldn't be as efficient. For larger DoM sizes, I understand. And we do have arrays of SSDs for the MDT storage. So disk I/Os for DoM should not be a problem. Our main problem here is that it blocked MDT0 and thus the filesystem/namespace. I was hoping you could find a blocking thread maybe from crash-sysrq-fir-md1-s1-foreach-bt.log . Do you want the associated crash dump? Let me know, I can upload it to your FTP.

Mikhail Pershin added a comment - 05/Nov/19 9:42 PM

Stephane, similar stack trace was seen several times in couple bugs like LU-11358 you've mentioned but in that cases there was also other thread which was blocker, but I don't see any other thread in logs which would block all these mdt_brw_intent() threads.
In fact DoM is vulnerable to such access patterns like we have here - multiple writers to single file, because DoM lock covers whole DoM region but not stripes, so only one writer at the time is allowed. That can become bottleneck and may cause lock timeouts on high load. So general recommendation in that case - use ordinary file with OST stripes instead.
Meanwhile that shouldn't cause such server hung as you've experienced, so I will investigate that more.

Mikhail Pershin added a comment - 05/Nov/19 9:42 PM Stephane, similar stack trace was seen several times in couple bugs like LU-11358 you've mentioned but in that cases there was also other thread which was blocker, but I don't see any other thread in logs which would block all these mdt_brw_intent() threads. In fact DoM is vulnerable to such access patterns like we have here - multiple writers to single file, because DoM lock covers whole DoM region but not stripes, so only one writer at the time is allowed. That can become bottleneck and may cause lock timeouts on high load. So general recommendation in that case - use ordinary file with OST stripes instead. Meanwhile that shouldn't cause such server hung as you've experienced, so I will investigate that more.

Stephane Thiell added a comment - 05/Nov/19 5:18 AM - edited

The file in question, DEFAULT.PKG, is actually used by the AERO-F software to restart a simulation (https://frg.bitbucket.io/aero-f/index.html#Restart)

Stephane Thiell added a comment - 05/Nov/19 5:18 AM - edited The file in question, DEFAULT.PKG, is actually used by the AERO-F software to restart a simulation ( https://frg.bitbucket.io/aero-f/index.html#Restart )

Stephane Thiell added a comment - 05/Nov/19 5:12 AM

Updated MDS dk logs attached as fir-md1-s1-dk2.log.gz and filesystem is still running.

Stephane Thiell added a comment - 05/Nov/19 5:12 AM Updated MDS dk logs attached as fir-md1-s1-dk2.log.gz and filesystem is still running.

People

Assignee:: Mikhail Pershin

Reporter:: Stephane Thiell

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 05/Nov/19 2:43 AM

Updated:: 24/Aug/20 5:45 PM

Resolved:: 12/Dec/19 11:35 PM