LU-12946: Multipath path flapping issue



    Description

      The symptom of this issue is high I/O latency seen on an external server from the multipath devices, without corresponding latency on the underlying physical disks. The following messages are also associated with the issue:

      Oct 28 15:12:23 nvme1 kernel: device-mapper: multipath: Reinstating path 8:160.
      Oct 28 15:12:23 nvme1 multipathd: 8:160: reinstated
      Oct 28 15:12:23 nvme1 multipathd: 360001ff0b05e90000000002e8964000a: remaining active paths: 5
      Oct 28 15:12:25 nvme1 multipathd: 360001ff0b05e90000000002e8964000a: switch to path group #1
      Oct 28 15:12:25 nvme1 kernel: device-mapper: multipath: Failing path 8:160.
      Oct 28 15:12:26 nvme1 multipathd: sdk: mark as failed
      Oct 28 15:12:26 nvme1 multipathd: 360001ff0b05e90000000002e8964000a: remaining active paths: 4
      Oct 28 15:12:27 nvme1 multipathd: 360001ff0b05e90000000002e8964000a: sdi - tur checker reports path is up
      Oct 28 15:12:27 nvme1 multipathd: 8:128: reinstated
      Oct 28 15:12:27 nvme1 multipathd: 360001ff0b05e90000000002e8964000a: remaining active paths: 5
      Oct 28 15:12:27 nvme1 kernel: device-mapper: multipath: Reinstating path 8:128.
      Oct 28 15:12:28 nvme1 multipathd: 360001ff0b05e90000000002e8964000a: sdk - tur checker reports path is up
      Oct 28 15:12:28 nvme1 kernel: device-mapper: multipath: Reinstating path 8:160.

      Specifically, paths are failed by the device-mapper module itself, without any other associated error, as in these messages:

      Oct 28 15:12:25 nvme1 kernel: device-mapper: multipath: Failing path 8:160.
      Oct 28 15:12:26 nvme1 multipathd: sdk: mark as failed

      and are then reinstated shortly thereafter by the tur checker:

      Oct 28 15:12:28 nvme1 multipathd: 360001ff0b05e90000000002e8964000a: sdk - tur checker reports path is up
      Oct 28 15:12:28 nvme1 kernel: device-mapper: multipath: Reinstating path 8:160.

      For a more definitive diagnosis, one can run the attached systemtap script; the output should look like the following (the key indicators are the return/error fields set to 3, and the tail of the backtrace showing fail_path called by multipath_end_io):

      blk_insert_cloned_request() return=3
      dm_multipath:multipath_end_io() error=3
      0xffffffffc03a46f0 : fail_path+0x0/0xe0 [dm_multipath]
      0xffffffffc03a646a : multipath_end_io+0x4a/0x1a0 [dm_multipath]
      0xffffffffc0019403 : dm_softirq_done+0xc3/0x2d0 [dm_mod]
      0xffffffffb154fe5f : __blk_mq_complete_request+0xdf/0x1e0 [kernel]
      0xffffffffb154ff87 : blk_mq_complete_request+0x27/0x30 [kernel]
      0xffffffffc0018614 : dm_complete_request+0x34/0x60 [dm_mod]
      0xffffffffc0018d40 : map_request+0x170/0x2a0 [dm_mod]
      0xffffffffc001928e : dm_mq_queue_rq+0x8e/0x140 [dm_mod]
      0xffffffffb1550d18 : blk_mq_dispatch_rq_list+0x268/0x620 [kernel]
      0xffffffffb1555dee : blk_mq_do_dispatch_sched+0x7e/0x130 [kernel]
      0xffffffffb15567ae : blk_mq_sched_dispatch_requests+0x11e/0x1c0 [kernel]
      0xffffffffb154ec10 : __blk_mq_run_hw_queue+0x50/0xc0 [kernel]
      0xffffffffb154ed38 : __blk_mq_delay_run_hw_queue+0xb8/0x110 [kernel]
      0xffffffffb154ee57 : blk_mq_run_hw_queue+0x57/0x110 [kernel]
      0xffffffffb1556c6e : blk_mq_sched_insert_requests+0x6e/0xc0 [kernel]
      0xffffffffb1551d1c : blk_mq_flush_plug_list+0x19c/0x200 [kernel]
      0xffffffffb154703e : blk_flush_plug_list+0xce/0x230 [kernel]
      0xffffffffb15475b4 : blk_finish_plug+0x14/0x40 [kernel]
      0xffffffffc0a15f3b [mmfslinux]
      0xffffffffc0e5b174 [mmfs26]
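
      For reference, the return/error value of 3 can be decoded against the blk-mq dispatch codes. The enum below is a reconstruction, not copied from the RHEL sources: the OK/BUSY/ERROR values match the 3.10-era include/linux/blk-mq.h, while the DEV_BUSY value of 3 is an assumption inferred from the return=3/error=3 fields above (it corresponds to upstream BLK_STS_DEV_RESOURCE from the commit linked below):

      /*
       * Hypothetical reconstruction of the blk-mq dispatch return codes
       * involved. OK/BUSY/ERROR match the pre-blk_status_t enum upstream;
       * DEV_BUSY = 3 is inferred from the return=3/error=3 output above.
       */
      enum {
              BLK_MQ_RQ_QUEUE_OK       = 0, /* request dispatched successfully */
              BLK_MQ_RQ_QUEUE_BUSY     = 1, /* transient: requeue, do not fail */
              BLK_MQ_RQ_QUEUE_ERROR    = 2, /* genuine I/O error */
              BLK_MQ_RQ_QUEUE_DEV_BUSY = 3, /* backported: device busy, should also requeue */
      };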

      The cause is a broken Red Hat backport of this upstream patch (https://github.com/torvalds/linux/commit/86ff7c2a80cd357f6156a53b354f6a0b357dc0c9). Here is what I believe to be the fix; it prevents the BLK_MQ_RQ_QUEUE_DEV_BUSY state from causing the request to be prematurely completed in an error state:

      --- kernel-3.10.0-957.21.3.el7/linux-3.10.0-957.21.3.el7.x86_64/drivers/md/dm-rq.c  2019-06-14 06:29:35.000000000 +0000
      +++ kernel-3.10.0-957.21.3.el7.patched/linux-3.10.0-957.21.3.el7.x86_64/drivers/md/dm-rq.c  2019-10-28 00:16:55.949220284 +0000
      @@ -477,7 +477,7 @@
       
       	clone->start_time = jiffies;
       	r = blk_insert_cloned_request(clone->q, clone);
      -	if (r != BLK_MQ_RQ_QUEUE_OK && r != BLK_MQ_RQ_QUEUE_BUSY)
      +	if (r != BLK_MQ_RQ_QUEUE_OK && r != BLK_MQ_RQ_QUEUE_BUSY && r != BLK_MQ_RQ_QUEUE_DEV_BUSY)
       		/* must complete clone in terms of original request */
       		dm_complete_request(rq, r);
       	return r;
      @@ -661,7 +661,7 @@
       	trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
       			     blk_rq_pos(rq));
       	ret = dm_dispatch_clone_request(clone, rq);
      -	if (ret == BLK_MQ_RQ_QUEUE_BUSY) {
      +	if (ret == BLK_MQ_RQ_QUEUE_BUSY || ret == BLK_MQ_RQ_QUEUE_DEV_BUSY) {
       		blk_rq_unprep_clone(clone);
       		tio->ti->type->release_clone_rq(clone);
       		tio->clone = NULL;
      
      
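      To make the failure mode concrete, here is a small, self-contained userspace model of the unpatched versus patched check. This is only a sketch: the constants are the assumed values noted above, and the two functions merely model the condition in dm_dispatch_clone_request(), not the kernel code itself:

      #include <stdio.h>

      /* Assumed dispatch codes (DEV_BUSY = 3 is inferred from the
       * systemtap output above, not taken from the RHEL headers). */
      enum {
              BLK_MQ_RQ_QUEUE_OK       = 0,
              BLK_MQ_RQ_QUEUE_BUSY     = 1,
              BLK_MQ_RQ_QUEUE_ERROR    = 2,
              BLK_MQ_RQ_QUEUE_DEV_BUSY = 3,
      };

      /* Unpatched check: anything that is not OK or BUSY is completed
       * as an error via dm_complete_request(). */
      static int completes_as_error_unpatched(int r)
      {
              return r != BLK_MQ_RQ_QUEUE_OK && r != BLK_MQ_RQ_QUEUE_BUSY;
      }

      /* Patched check: DEV_BUSY is treated like BUSY and requeued. */
      static int completes_as_error_patched(int r)
      {
              return r != BLK_MQ_RQ_QUEUE_OK && r != BLK_MQ_RQ_QUEUE_BUSY &&
                     r != BLK_MQ_RQ_QUEUE_DEV_BUSY;
      }

      int main(void)
      {
              int r = BLK_MQ_RQ_QUEUE_DEV_BUSY; /* blk_insert_cloned_request() return=3 */

              /* Unpatched: dm_complete_request(rq, 3) runs, error=3 reaches
               * multipath_end_io(), and fail_path() drops a healthy path. */
              printf("unpatched: %s\n", completes_as_error_unpatched(r) ?
                     "complete with error -> fail_path()" : "requeue");

              /* Patched: the request is requeued instead, so the path is not
               * failed and the tur checker has nothing to reinstate. */
              printf("patched:   %s\n", completes_as_error_patched(r) ?
                     "complete with error -> fail_path()" : "requeue");
              return 0;
      }

      Compiled and run, this prints the fail_path() outcome for the unpatched check and "requeue" for the patched one, matching the flapping seen in the logs above.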

          People

            Assignee: Wang Shilong (Inactive)
            Reporter: Wang Shilong (Inactive)
