Lustre / LU-12935

MDT deadlock on 2.12.3 with DoM; is it missing async_discard feature?

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.3
    • Labels: None
    • Environment: CentOS 7.6
    • Severity: 1

    Description

      Currently we cannot make MDT0 work again on Fir (2.12.3) due to these backtraces and lock timeout:

      Nov 04 18:30:17 fir-md1-s1 kernel: Pid: 32408, comm: mdt01_024 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
      Nov 04 18:30:17 fir-md1-s1 kernel: Call Trace:
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10ccac0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10cd5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15d850b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15d8b90>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15d8ea0>] mdt_object_lock+0x20/0x30 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc1617c4b>] mdt_brw_enqueue+0x44b/0x760 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15c64bf>] mdt_intent_brw+0x1f/0x30 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc15debb5>] mdt_intent_policy+0x435/0xd80 [mdt]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10b3d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc10dc336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc1164a12>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc116936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc111024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffc1113bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffbe8c2e81>] kthread+0xd1/0xe0
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffbef77c24>] ret_from_fork_nospec_begin+0xe/0x21
      Nov 04 18:30:17 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      Nov 04 18:30:17 fir-md1-s1 kernel: LNet: Service thread pid 32415 was inactive for 201.19s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      Nov 04 18:30:17 fir-md1-s1 kernel: LNet: Skipped 1 previous similar message
      Nov 04 18:31:56 fir-md1-s1 kernel: LustreError: 32601:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572920815, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffffa10e0b407bc0/0x675682d854098c0 lrc: 3/0,1 mode: --/PW res: [0x200034eb7:0x1:0x0].0x0 bits 0x40/0x0 rrc: 912 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 32601 timeout: 0 lvb_type: 0
      Nov 04 18:31:56 fir-md1-s1 kernel: LustreError: 32601:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 224 previous similar messages
      Nov 04 18:34:25 fir-md1-s1 kernel: LustreError: 32404:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1572920965, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffffa12da9254ec0/0x675682d8540d5dd lrc: 3/0,1 mode: --/PW res: [0x200034eb7:0x1:0x0].0x0 bits 0x40/0x0 rrc: 913 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 32404 timeout: 0 lvb_type: 0
      Nov 04 18:34:25 fir-md1-s1 kernel: LustreError: 32404:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 161 previous similar messages
      

      This looks similar to LU-11358.

      I also just noticed another thing: on our new 2.12.3 clients, I can't find the async_discard import flag:

      [root@sh-117-13 ~]# lctl get_param mdc.fir-*.import | grep connect_flags
          connect_flags: [ write_grant, server_lock, version, acl, xattr, create_on_write, truncate_lock, inode_bit_locks, getattr_by_fid, no_oh_for_devices, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, pools, large_ea, full20, layout_lock, 64bithash, jobstats, umask, einprogress, grant_param, lvb_type, short_io, flock_deadlock, disp_stripe, open_by_fid, lfsck, multi_mod_rpcs, dir_stripe, subtree, bulk_mbits, second_flags, file_secctx, dir_migrate, unknown, flr, lock_convert, archive_id_array, selinux_policy, lsom, unknown2_0x4000 ]
      

      Is it unknown2_0x4000? Or is it missing?

      Could you please confirm whether the async_discard patch is missing from 2.12.3? If that's the case, we'll need to perform a full downgrade or emergency patching of the cluster and Lustre servers.
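      (A rough way to double-check this on a client, added here as a sketch and not part of the original report: grep the import for the flag by name and for any unnamed flags2 bits. An unknown2_0xNNNN entry only means the client code has no name string for that bit, not necessarily that the feature was not negotiated. The mdc pattern matches the command above.)

      # Sketch only: look for async_discard by name, and list any flags2 bits
      # that this client cannot name (those are printed as unknown2_0x...).
      lctl get_param -n mdc.fir-*.import \
          | grep connect_flags \
          | grep -oE 'async_discard|unknown2_0x[0-9a-f]+'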

      Attaching logs from MDS fir-md1-s1, which serves MDT0 and which we cannot make operational again at the moment. We even tried abort_recovery, with no luck.
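      (For reference, abort_recovery is normally issued on the MDS along these lines; this is an illustrative invocation, with the target name taken from the logs above, not the exact command used here.)

      # Illustrative only: tell the target to stop waiting for unrecovered
      # clients and evict them so it can finish recovery.
      lctl --device fir-MDT0000 abort_recovery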

      Attachments

        1. crash-sysrq-fir-md1-s1-foreach-bt.log
          1.35 MB
        2. fir-md1-s1_20191105.log
          758 kB
        3. fir-md1-s1_lfsck-results.log
          3 kB
        4. fir-md1-s1-dk.log.gz
          3.30 MB
        5. fir-md1-s1-dk2.log.gz
          395 kB
        6. fir-md1-s1-MDT0.log
          433 kB
        7. fir-rbh01.dk.log
          651 kB

        Issue Links

          Activity

            [LU-12935] MDT deadlock on 2.12.3 with DoM; is it missing async_discard feature?
            pjones Peter Jones added a comment -

            Landed for 2.12.4 - does not affect master


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36881/
            Subject: LU-12935 obdclass: fix import connect flag printing
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: b66efa7d500f94d755f318d43804f0f3eb883835

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36881
            Subject: LU-12935 obdclass: fix import connect flag printing
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: b2afbb33175af02608d9d4127370dc874542148b

            adilger Andreas Dilger added a comment -

            Stephane, it would make sense to get an strace (or equivalent Lustre debug log via lctl set_param debug="+vfstrace +dlmtrace") from these jobs to see just how many times they write to the same file.
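            (A capture along those lines might look like the following on one client; this is a sketch only, and the buffer size and output path are arbitrary.)

            lctl set_param debug="+vfstrace +dlmtrace"   # add VFS and DLM tracing to the debug mask
            lctl set_param debug_mb=1024                 # enlarge the debug buffer (arbitrary size)
            lctl clear                                   # start from an empty debug log
            # ... reproduce or run the suspected job here ...
            lctl dk /tmp/$(hostname)-dk.log              # dump the debug log for analysis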

            Mike, since it is possible to migrate DoM components to OSTs (either with full-file copy in 2.12 or via FLR mirror in 2.13 patch https://review.whamcloud.com/35359 "LU-11421 dom: manual OST-to-DOM migration via mirroring"), have you thought about automatically migrating files with high write lock contention from DoM to a regular OST object? Since the amount of data to be moved is very small (under 150KB in this case), the migration should be very fast, and it would allow extent locks to be used on the file.

            That said, I have no idea how hard this would be, and it only makes sense if there are multiple writers repeatedly contending on the same DoM file component (which I suspect is rare in most cases). Even here, if the clients only write to the same file a handful of times, the extra migration step may make the performance worse rather than better. If they write to the same file hundreds of times, then it might be worthwhile to implement.

            Even in IO-500 ior-hard-write the chunk size is 47008 bytes, so at most 2-3 ranks would be contending on a 64KB or 128KB DoM component, and we never had problems with this in our testing.
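            (For reference, a mirror-based move of the kind described above would roughly follow the generic FLR commands below. This is a sketch, not the exact LU-11421 workflow; the file name, layout options and mirror ID are illustrative.)

            lfs mirror extend -N -E -1 -c 1 /fir/path/file    # add an OST-only mirror next to the DoM one
            lfs mirror resync /fir/path/file                  # copy the (small) DoM data to the new mirror
            lfs getstripe -v /fir/path/file                   # find the mirror ID holding the DoM component
            lfs mirror split --mirror-id 1 -d /fir/path/file  # drop that mirror so only the OST copy remains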


            sthiell Stephane Thiell added a comment -

            To avoid further issues for now, we have removed the default DoM striping from all directories on this filesystem (only kept a PFL striping). New files won't use DoM anymore. We'll see if that helps.
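            (As an illustration of such a default, not the exact layout used on Fir: a PFL-only directory default with no DoM component can be set as below, whereas a DoM+PFL default would start with something like -E 128K -L mdt. Sizes, stripe counts and the path are made up.)

            lfs setstripe -E 4M -c 1 -E 256M -c 4 -E -1 -c 16 /fir/some/dir   # PFL-only default, no DoM component
            lfs getstripe -d /fir/some/dir                                    # show the directory's default layout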

            sthiell Stephane Thiell added a comment - - edited

            Thanks Mike. We're currently discussing changing our default striping to avoid further issues like these, and perhaps only using DoM in specific cases.

            FYI, yesterday, after I killed the suspected jobs, I still had to do a stop/start of MDT0 to resume filesystem operations (it didn't recover by itself).


            tappro Mikhail Pershin added a comment -

            Stephane, there is one ticket for DoM improvements which can improve such access patterns, LU-10664, but it has no patch at the moment.

            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: