  1. Lustre
  2. LU-12510

MDS server hangs in cv_wait_common

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.13.0, Lustre 2.12.3
    • Lustre 2.12.2
    • RHEL7.6, lustre-2.12.2, ZFS-0.8.1-1
    • 2

    Description

      Since the upgrade to lustre-2.12.2 last weekend, our MDS servers have started to hang, and the lustre_health check reports NOT HEALTHY because of long-waiting requests. We think this is related to LU-10250, but wanted to create a separate ticket to track it and raise the priority. It is taking an MDS down about once a day, causing production outages.

      2019-07-05T02:16:36.434782-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection to f2-OST0035 (at 10.10.33.50@o2ib2) was lost; in progress operations using this service will wait for recovery to complete

      2019-07-05T02:17:26.605024-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection restored to 10.10.33.50@o2ib2 (at 10.10.33.50@o2ib2)

      2019-07-05T02:28:56.360456-04:00 f2-mds2.ncrc.gov kernel: INFO: task txg_quiesce:40218 blocked for more than 120 seconds.

      2019-07-05T02:28:56.360500-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

      2019-07-05T02:28:56.371558-04:00 f2-mds2.ncrc.gov kernel: txg_quiesce     D ffff99aec932a080     0 40218      2 0x00000000

      2019-07-05T02:28:56.371580-04:00 f2-mds2.ncrc.gov kernel: Call Trace:

      2019-07-05T02:28:56.371599-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83967c49>] schedule+0x29/0x70

      2019-07-05T02:28:56.384261-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]

      2019-07-05T02:28:56.384280-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30

      2019-07-05T02:28:56.397218-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995365>] __cv_wait+0x15/0x20 [spl]

      2019-07-05T02:28:56.397238-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0bc863b>] txg_quiesce_thread+0x2cb/0x3c0 [zfs]

      2019-07-05T02:28:56.411080-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0bc8370>] ? txg_init+0x2b0/0x2b0 [zfs]

      2019-07-05T02:28:56.411101-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc099cb93>] thread_generic_wrapper+0x73/0x80 [spl]

      2019-07-05T02:28:56.425331-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc099cb20>] ? __thread_exit+0x20/0x20 [spl]

      2019-07-05T02:28:56.425350-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1c71>] kthread+0xd1/0xe0

      2019-07-05T02:28:56.430937-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1ba0>] ? insert_kthread_work+0x40/0x40

      2019-07-05T02:28:56.437770-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83974c1d>] ret_from_fork_nospec_begin+0x7/0x21

      2019-07-05T02:28:56.444951-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1ba0>] ? insert_kthread_work+0x40/0x40

      2019-07-05T02:28:56.459337-04:00 f2-mds2.ncrc.gov kernel: INFO: task mdt04_000:42947 blocked for more than 120 seconds.

      2019-07-05T02:28:56.459357-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

      2019-07-05T02:28:56.478994-04:00 f2-mds2.ncrc.gov kernel: mdt04_000       D ffff99aecc7ae180     0 42947      2 0x00000000

      2019-07-05T02:28:56.479015-04:00 f2-mds2.ncrc.gov kernel: Call Trace:

      2019-07-05T02:28:56.479037-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83967c49>] schedule+0x29/0x70

      2019-07-05T02:28:56.491665-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]

      2019-07-05T02:28:56.491685-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30

      2019-07-05T02:28:56.504560-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995365>] __cv_wait+0x15/0x20 [spl]

      2019-07-05T02:28:56.504580-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0b675fb>] dmu_tx_wait+0x20b/0x3b0 [zfs]

      2019-07-05T02:28:56.517986-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0b67831>] dmu_tx_assign+0x91/0x490 [zfs]

      2019-07-05T02:28:56.518007-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc176d049>] osd_trans_start+0x199/0x440 [osd_zfs]

      2019-07-05T02:28:56.532397-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18ce837>] mdt_empty_transno+0xf7/0x850 [mdt]

      2019-07-05T02:28:56.532416-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18d1eee>] mdt_mfd_open+0x8de/0xe70 [mdt]

      2019-07-05T02:28:56.546398-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18a4ba2>] ? mdt_pack_acl2body+0x1c2/0x9f0 [mdt]

      2019-07-05T02:28:56.546419-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18d2acb>] mdt_finish_open+0x64b/0x760 [mdt]
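
For triage, hung-task reports like the ones above can be summarized per task. The following is a minimal sketch and not part of the original report: `parse_hung_tasks`, the two regexes, and the shortened `SAMPLE` excerpt are illustrative names, and the frame pattern assumes the `[<addr>] func+0xoff/0xlen [module]` format seen in this log. Frames prefixed with `? ` are skipped because the kernel marks them as unreliable.

```python
import re

# Hung-task header, e.g.
# "INFO: task txg_quiesce:40218 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# One stack frame, e.g.
# "[<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]"
# An optional "? " before the symbol marks an unreliable frame.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<unreliable>\? )?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def parse_hung_tasks(lines):
    """Group reliable stack frames under the hung task that owns them."""
    tasks = []
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = {
                "name": header["name"],
                "pid": int(header["pid"]),
                "secs": int(header["secs"]),
                "frames": [],
            }
            tasks.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and not frame["unreliable"]:
            # Frames without a "[module]" suffix belong to the core kernel.
            current["frames"].append((frame["func"], frame["module"] or "kernel"))
    return tasks


# Shortened excerpt of the log above (timestamps and hostname dropped).
SAMPLE = """\
kernel: INFO: task mdt04_000:42947 blocked for more than 120 seconds.
kernel: Call Trace:
kernel: [<ffffffff83967c49>] schedule+0x29/0x70
kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]
kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30
kernel: [<ffffffffc0b675fb>] dmu_tx_wait+0x20b/0x3b0 [zfs]
""".splitlines()

if __name__ == "__main__":
    for task in parse_hung_tasks(SAMPLE):
        chain = " <- ".join(func for func, _ in task["frames"])
        print(f'{task["name"]} (pid {task["pid"]}): {chain}')
```

Run against the full log, a summary like this makes it easier to see that both `txg_quiesce` and the `mdt04_000` service thread are parked in `cv_wait_common`, the latter via `dmu_tx_wait` while starting an OSD transaction.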

      Attachments

        1. f2-mds4_20190709.tgz
          69.09 MB
          Philip B Curtis
        2. f2-mds4_lustre_unhealthy_20190707.tar.gz
          2.34 MB
          Dustin Leverman
        3. LU12510.tar.gz
          70.23 MB
          Dustin Leverman

            People

              bzzz Alex Zhuravlev
              dustb100 Dustin Leverman