Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Affects Version/s: Lustre 2.12.2
    • Environment: RHEL7.6, lustre-2.12.2, ZFS-0.8.1-1
    • Severity: 2

    Description

      We are hitting an issue after upgrading to lustre-2.12.2 last weekend where the MDS servers start to hang, and the lustre_health check reports NOT HEALTHY because of long waiting requests. We think this is related to LU-10250, but wanted to create a separate ticket to track it and push the priority up. It is taking an MDS down about once a day, causing production outages.
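      For reference, the health state mentioned above can be polled directly on the MDS; the snippet below is a minimal sketch of such a check, assuming the state is readable at /proc/fs/lustre/health_check (on newer releases it may live under /sys/fs/lustre/health_check; "lctl get_param health_check" reads the same value either way, so adjust HEALTH_PATH for the local installation). The kernel messages from one of the incidents follow.

      /*
       * Hedged sketch of a poller like the site's "lustre_health" check.
       * HEALTH_PATH is an assumption; the file normally reads "healthy"
       * and contains "NOT HEALTHY" when requests are stuck.
       */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define HEALTH_PATH "/proc/fs/lustre/health_check"

      int main(void)
      {
              char buf[256] = "";
              FILE *fp = fopen(HEALTH_PATH, "r");

              if (fp == NULL) {
                      perror(HEALTH_PATH);
                      return EXIT_FAILURE;
              }
              if (fgets(buf, sizeof(buf), fp) == NULL)
                      buf[0] = '\0';
              fclose(fp);

              if (strstr(buf, "NOT HEALTHY") != NULL) {
                      fprintf(stderr, "MDS reports NOT HEALTHY: %s", buf);
                      return EXIT_FAILURE;
              }
              printf("health_check: %s", buf);
              return EXIT_SUCCESS;
      }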

      2019-07-05T02:16:36.434782-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection to f2-OST0035 (at 10.10.33.50@o2ib2) was lost; in progress operations using this service will wait for recovery to complete

      2019-07-05T02:17:26.605024-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection restored to 10.10.33.50@o2ib2 (at 10.10.33.50@o2ib2)

      2019-07-05T02:28:56.360456-04:00 f2-mds2.ncrc.gov kernel: INFO: task txg_quiesce:40218 blocked for more than 120 seconds.

      2019-07-05T02:28:56.360500-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

      2019-07-05T02:28:56.371558-04:00 f2-mds2.ncrc.gov kernel: txg_quiesce     D ffff99aec932a080     0 40218      2 0x00000000

      2019-07-05T02:28:56.371580-04:00 f2-mds2.ncrc.gov kernel: Call Trace:

      2019-07-05T02:28:56.371599-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83967c49>] schedule+0x29/0x70

      2019-07-05T02:28:56.384261-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]

      2019-07-05T02:28:56.384280-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30

      2019-07-05T02:28:56.397218-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995365>] __cv_wait+0x15/0x20 [spl]

      2019-07-05T02:28:56.397238-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0bc863b>] txg_quiesce_thread+0x2cb/0x3c0 [zfs]

      2019-07-05T02:28:56.411080-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0bc8370>] ? txg_init+0x2b0/0x2b0 [zfs]

      2019-07-05T02:28:56.411101-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc099cb93>] thread_generic_wrapper+0x73/0x80 [spl]

      2019-07-05T02:28:56.425331-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc099cb20>] ? __thread_exit+0x20/0x20 [spl]

      2019-07-05T02:28:56.425350-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1c71>] kthread+0xd1/0xe0

      2019-07-05T02:28:56.430937-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1ba0>] ? insert_kthread_work+0x40/0x40

      2019-07-05T02:28:56.437770-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83974c1d>] ret_from_fork_nospec_begin+0x7/0x21

      2019-07-05T02:28:56.444951-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1ba0>] ? insert_kthread_work+0x40/0x40

      2019-07-05T02:28:56.459337-04:00 f2-mds2.ncrc.gov kernel: INFO: task mdt04_000:42947 blocked for more than 120 seconds.

      2019-07-05T02:28:56.459357-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

      2019-07-05T02:28:56.478994-04:00 f2-mds2.ncrc.gov kernel: mdt04_000       D ffff99aecc7ae180     0 42947      2 0x00000000

      2019-07-05T02:28:56.479015-04:00 f2-mds2.ncrc.gov kernel: Call Trace:

      2019-07-05T02:28:56.479037-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83967c49>] schedule+0x29/0x70

      2019-07-05T02:28:56.491665-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]

      2019-07-05T02:28:56.491685-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30

      2019-07-05T02:28:56.504560-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995365>] __cv_wait+0x15/0x20 [spl]

      2019-07-05T02:28:56.504580-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0b675fb>] dmu_tx_wait+0x20b/0x3b0 [zfs]

      2019-07-05T02:28:56.517986-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0b67831>] dmu_tx_assign+0x91/0x490 [zfs]

      2019-07-05T02:28:56.518007-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc176d049>] osd_trans_start+0x199/0x440 [osd_zfs]

      2019-07-05T02:28:56.532397-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18ce837>] mdt_empty_transno+0xf7/0x850 [mdt]

      2019-07-05T02:28:56.532416-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18d1eee>] mdt_mfd_open+0x8de/0xe70 [mdt]

      2019-07-05T02:28:56.546398-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18a4ba2>] ? mdt_pack_acl2body+0x1c2/0x9f0 [mdt]

      2019-07-05T02:28:56.546419-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18d2acb>] mdt_finish_open+0x64b/0x760 [mdt]
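
      The mdt04_000 stack above is parked in cv_wait() via dmu_tx_wait()/dmu_tx_assign() from osd_trans_start(), i.e. the MDT service thread is waiting for ZFS to let it join a transaction group while txg_quiesce itself is blocked. The fragment below is a hedged sketch of the generic DMU transaction-assignment pattern that produces this wait, written against the public ZFS 0.8 DMU API; the ZAP object and hold are illustrative only and this is not the actual osd_trans_start() code.

      /*
       * Sketch of the DMU transaction pattern behind the trace above
       * (kernel-side code, compiled against the ZFS headers).
       */
      #include <sys/dmu.h>

      static int
      example_declare_and_assign(objset_t *os, uint64_t zap_obj, const char *name)
      {
              dmu_tx_t *tx = dmu_tx_create(os);
              int rc;

              /* Declare what the transaction will touch. */
              dmu_tx_hold_zap(tx, zap_obj, B_TRUE, name);

              /*
               * TXG_WAIT: if the open txg cannot accept the change, this
               * blocks (dmu_tx_wait -> cv_wait), which is where the hung
               * mdt threads in the stacks above are sitting.
               */
              rc = dmu_tx_assign(tx, TXG_WAIT);
              if (rc != 0) {
                      dmu_tx_abort(tx);
                      return rc;
              }

              /* ... modify the declared objects here ... */

              dmu_tx_commit(tx);
              return 0;
      }

      With TXG_WAIT the assign call sleeps for as long as the sync/quiesce threads are stalled, which is why every incoming MDT request eventually piles up behind the same condition variable.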


          Activity

            [LU-12510] MDS server hangs in cv_wait_common

            bzzz Alex Zhuravlev added a comment -

            I'll be working on a patch that re-implements this using the new DMU API, but that can wait a bit, IMO.
            The landed patch is fully functional; it just keeps using a less-than-ideal approach to access DMU functionality.
            It was done this way as part of earlier performance improvements, because of missing bits in the DMU API.
            pjones Peter Jones added a comment -

            James

            Do you mean that further Lustre changes will be needed to adjust to the ZFS changes in 0.8.2?

            Peter


            simmonsja James A Simmons added a comment -

            The above patch is only a workaround. A more complete solution will be needed with ZFS 0.8.2; we should either open another ticket or reopen this one.
            pjones Peter Jones added a comment -

            Landed for 2.13


            gerrit Gerrit Updater added a comment -

            James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35600
            Subject: LU-12510 osd: osd-zfs to release zrlock quickly
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: d844b66bf8d58708237547598d016b356a07a01d

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35524/
            Subject: LU-12510 osd: osd-zfs to release zrlock quickly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 88b329ac2ab568a25251f3f7c3a7e0c7367cb36f

            bzzz Alex Zhuravlev added a comment -

            pfarrell, hard to say, but I'd say it's not very likely; a few of the reported processes weren't running but were stuck in IO:

            [16250.115937] Pid: 2829, comm: ll_ost00_029 3.10.0-957.21.3.el7_lustre.x86_64 #1 SMP Tue Jul 16 17:30:03 UTC 2019
            [16250.117636] Call Trace:
            [16250.118155]  [<ffffffffc02ee262>] cv_wait_common+0xb2/0x150 [spl]
            [16250.119447]  [<ffffffffc02ee338>] __cv_wait_io+0x18/0x20 [spl]
            [16250.120491]  [<ffffffffc04aca3b>] zio_wait+0x11b/0x1c0 [zfs]
            [16250.121621]  [<ffffffffc03ebfce>] dbuf_read+0x67e/0x9e0 [zfs]
            [16250.122689]  [<ffffffffc03f8060>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs]
            [16250.123910]  [<ffffffffc046b6c2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs]
            [16250.125193]  [<ffffffffc046b978>] zap_deref_leaf+0xe8/0x100 [zfs]
            [16250.126336]  [<ffffffffc046c9b2>] fzap_lookup+0x62/0x180 [zfs]
            [16250.127455]  [<ffffffffc04712b1>] zap_lookup_impl+0xd1/0x200 [zfs]
            [16250.128629]  [<ffffffffc0471c2e>] zap_lookup_norm+0x8e/0xd0 [zfs]
            [16250.129773]  [<ffffffffc0471ca3>] zap_lookup+0x33/0x40 [zfs]
            [16250.130823]  [<ffffffffc113d4c1>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
            [16250.132040]  [<ffffffffc1135669>] osd_object_init+0x389/0xca0 [osd_zfs]
            [16250.133221]  [<ffffffffc0c17b25>] lu_object_alloc+0xe5/0x320 [obdclass]
            

            pfarrell Patrick Farrell (Inactive) added a comment -

            bzzz,

            I'm wondering whether you think the failure here could be related to this issue:
            https://testing.whamcloud.com/test_sets/612acaa2-ac97-11e9-a0be-52540065bddc

            It is due to the client being evicted from OST0, which happens because OST0 is stuck like this:

            [ 9758.094925] Lustre: ll_ost00_016: service thread pid 2789 was inactive for 40.093 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            [ 9767.054545] Lustre: ll_ost00_073: service thread pid 3361 was inactive for 40.069 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [ 9767.054551] Pid: 3233, comm: ll_ost00_055 3.10.0-957.21.3.el7_lustre.x86_64 #1 SMP Tue Jul 16 17:30:03 UTC 2019
            [ 9767.054551] Call Trace:
            [ 9767.054645]  [<ffffffffc02ee2d5>] cv_wait_common+0x125/0x150 [spl]
            [ 9767.054650]  [<ffffffffc02ee315>] __cv_wait+0x15/0x20 [spl]
            [ 9767.054685]  [<ffffffffc04542bf>] txg_wait_synced+0xef/0x140 [zfs]
            [ 9767.054736]  [<ffffffffc112c9ab>] osd_trans_stop+0x53b/0x5e0 [osd_zfs]
            [ 9767.054771]  [<ffffffffc1270cd5>] ofd_trans_stop+0x25/0x60 [ofd]
            [ 9767.054776]  [<ffffffffc12752c5>] ofd_destroy+0x2c5/0x960 [ofd]
            [ 9767.054780]  [<ffffffffc126d534>] ofd_destroy_by_fid+0x1f4/0x4a0 [ofd]
            [ 9767.054784]  [<ffffffffc1262057>] ofd_destroy_hdl+0x267/0x970 [ofd]
            [ 9767.055141]  [<ffffffffc0f5e51a>] tgt_request_handle+0x91a/0x15c0 [ptlrpc]
            [ 9767.055179]  [<ffffffffc0f03646>] ptlrpc_server_handle_request+0x256/0xb00 [ptlrpc]
            [ 9767.055212]  [<ffffffffc0f0717c>] ptlrpc_main+0xbac/0x1560 [ptlrpc]
            [ 9767.055252]  [<ffffffffa9cc1da1>] kthread+0xd1/0xe0
            [ 9767.055280]  [<ffffffffaa375c37>] ret_from_fork_nospec_end+0x0/0x39
            [ 9767.055288]  [<ffffffffffffffff>] 0xffffffffffffffff 

            We've seen similar failures in a few tests recently, where the force_sync command apparently takes forever.  I'm guessing they have the same detailed signature as this.

            simmonsja James A Simmons added a comment - - edited

            So far patch 35524 is holding up well on our production file system. The current patch is a workaround until a proper solution is in place, so don't close this ticket once the patch lands.


            behlendorf Brian Behlendorf added a comment -

            James, I'd expect that a large parallel create and unlink workload on an aged MDT would have the best chance of hitting this.
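
            For anyone trying to trigger this in a test environment, the following is a hedged sketch of the kind of parallel create/unlink load Brian describes; the mount point, worker count and file count are hypothetical and not taken from this ticket.

            /*
             * Parallel create/unlink stress sketch. Point TESTDIR at a
             * directory on the Lustre file system under test; all values
             * below are illustrative.
             */
            #include <stdio.h>
            #include <stdlib.h>
            #include <unistd.h>
            #include <fcntl.h>
            #include <sys/wait.h>

            #define TESTDIR  "/lustre/f2/stress"   /* hypothetical path */
            #define NPROC    32                    /* parallel workers  */
            #define NFILES   100000                /* files per worker  */

            static void worker(int id)
            {
                    char path[256];

                    for (long i = 0; i < NFILES; i++) {
                            snprintf(path, sizeof(path), TESTDIR "/w%d.%ld", id, i);
                            int fd = open(path, O_CREAT | O_WRONLY, 0644);
                            if (fd < 0) {
                                    perror("open");
                                    _exit(1);
                            }
                            close(fd);
                            unlink(path);   /* immediate unlink keeps the MDT busy */
                    }
                    _exit(0);
            }

            int main(void)
            {
                    for (int i = 0; i < NPROC; i++)
                            if (fork() == 0)
                                    worker(i);
                    for (int i = 0; i < NPROC; i++)
                            wait(NULL);
                    return 0;
            }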
            bzzz Alex Zhuravlev added a comment - Please try https://review.whamcloud.com/#/c/35524/

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: dustb100 Dustin Leverman
              Votes: 0
              Watchers: 17

              Dates

                Created:
                Updated:
                Resolved: