Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.12.2
-
RHEL7.6, lustre-2.12.2, ZFS-0.8.1-1
-
2
-
9223372036854775807
Description
We are hitting an issue after the upgrade to lustre-2.12.2 last weekend where the MDS servers will start to hang, and the lustre_health check will report NOT HEALTHY because of long waiting requests. We think this is related to LU-10250, but wanted to create a seperate ticket to track this to push the priority up. It is taking an MDS down about once a day causing production outages.
2019-07-05T02:16:36.434782-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection to f2-OST0035 (at 10.10.33.50@o2ib2) was l
ost; in progress operations using this service will wait for recovery to complete
2019-07-05T02:17:26.605024-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection restored to 10.10.33.50@o2ib2 (at 10.10.33.50@o2ib2)
2019-07-05T02:28:56.360456-04:00 f2-mds2.ncrc.gov kernel: INFO: task txg_quiesce:40218 blocked for more than 120 seconds.
2019-07-05T02:28:56.360500-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-07-05T02:28:56.371558-04:00 f2-mds2.ncrc.gov kernel: txg_quiesce D ffff99aec932a080 0 40218 2 0x00000000
2019-07-05T02:28:56.371580-04:00 f2-mds2.ncrc.gov kernel: Call Trace:
2019-07-05T02:28:56.371599-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83967c49>] schedule+0x29/0x70
2019-07-05T02:28:56.384261-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]
2019-07-05T02:28:56.384280-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30
2019-07-05T02:28:56.397218-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995365>] __cv_wait+0x15/0x20 [spl]
2019-07-05T02:28:56.397238-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0bc863b>] txg_quiesce_thread+0x2cb/0x3c0 [zfs]
2019-07-05T02:28:56.411080-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0bc8370>] ? txg_init+0x2b0/0x2b0 [zfs]
2019-07-05T02:28:56.411101-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc099cb93>] thread_generic_wrapper+0x73/0x80 [spl]
2019-07-05T02:28:56.425331-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc099cb20>] ? __thread_exit+0x20/0x20 [spl]
2019-07-05T02:28:56.425350-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1c71>] kthread+0xd1/0xe0
2019-07-05T02:28:56.430937-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1ba0>] ? insert_kthread_work+0x40/0x40
2019-07-05T02:28:56.437770-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83974c1d>] ret_from_fork_nospec_begin+0x7/0x21
2019-07-05T02:28:56.444951-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1ba0>] ? insert_kthread_work+0x40/0x40
2019-07-05T02:28:56.459337-04:00 f2-mds2.ncrc.gov kernel: INFO: task mdt04_000:42947 blocked for more than 120 seconds.
2019-07-05T02:28:56.459357-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-07-05T02:28:56.478994-04:00 f2-mds2.ncrc.gov kernel: mdt04_000 D ffff99aecc7ae180 0 42947 2 0x00000000
2019-07-05T02:28:56.479015-04:00 f2-mds2.ncrc.gov kernel: Call Trace:
2019-07-05T02:28:56.479037-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83967c49>] schedule+0x29/0x70
2019-07-05T02:28:56.491665-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]
2019-07-05T02:28:56.491685-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30
2019-07-05T02:28:56.504560-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995365>] __cv_wait+0x15/0x20 [spl]
2019-07-05T02:28:56.504580-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0b675fb>] dmu_tx_wait+0x20b/0x3b0 [zfs]
2019-07-05T02:28:56.517986-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0b67831>] dmu_tx_assign+0x91/0x490 [zfs]
2019-07-05T02:28:56.518007-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc176d049>] osd_trans_start+0x199/0x440 [osd_zfs]
2019-07-05T02:28:56.532397-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18ce837>] mdt_empty_transno+0xf7/0x850 [mdt]
2019-07-05T02:28:56.532416-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18d1eee>] mdt_mfd_open+0x8de/0xe70 [mdt]
2019-07-05T02:28:56.546398-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18a4ba2>] ? mdt_pack_acl2body+0x1c2/0x9f0 [mdt]
2019-07-05T02:28:56.546419-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18d2acb>] mdt_finish_open+0x64b/0x760 [mdt]