
LU-6780: bulk recovery is not stable when 2 MDTs fail at the same time

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.8.0

    Description

      I saw a few bulk timeout errors while testing patch
      http://review.whamcloud.com/#/c/13786/:

      11:44:13:Lustre: DEBUG MARKER: == replay-single test 110f: DNE: create striped dir, fail MDT1/MDT2 == 11:32:16 (1435663936)
      11:44:13:Lustre: DEBUG MARKER: sync; sync; sync
      11:44:13:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
      11:44:13:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
      11:44:13:Turning device dm-0 (0xfd00000) read-only
      11:44:13:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      11:44:13:Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
      11:44:13:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
      11:44:13:Lustre: DEBUG MARKER: umount -d /mnt/mds1
      11:44:13:Removing read-only on unknown block (0xfd00000)
      11:44:13:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      11:44:13:Lustre: DEBUG MARKER: hostname
      11:44:13:Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
      11:44:13:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre                                 /dev/lvm-Role_MDS/P1 /mnt/mds1
      11:44:13:LDISKFS-fs (dm-0): recovery complete
      11:44:13:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
      11:44:13:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      11:44:13:Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      11:44:13:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
      11:44:13:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      11:44:13:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      11:44:13:Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      11:44:13:Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      11:44:13:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1435664041/real 1435664041]  req@ffff88006043d9c0 x1505397080987136/t0(0) o400->lustre-MDT0001-osp-MDT0000@10.1.4.127@tcp:24/4 lens 224/224 e 1 to 1 dl 1435664046 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
      11:44:13:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 35 previous similar messages
      11:44:13:LustreError: 4290:0:(ldlm_lib.c:3030:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff88006a3c3050 x1505397094695176/t0(0) o1000->lustre-MDT0001-mdtlov_UUID@10.1.4.127@tcp:638/0 lens 248/16608 e 4 to 0 dl 1435664093 ref 1 fl Interpret:/0/0 rc 0/0
      11:44:13:LNet: Service thread pid 4290 completed after 100.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      11:44:13:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1435664645/real 1435664645]  req@ffff88006043d9c0 x1505397080993520/t0(0) o400->lustre-MDT0001-osp-MDT0000@10.1.4.127@tcp:24/4 lens 224/224 e 1 to 1 dl 1435664647 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
      12:24:19:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 73 previous similar messages
      12:24:19:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1435665245/real 1435665245]  req@ffff88006043d9c0 x1505397081000080/t0(0) o400->lustre-MDT0001-osp-MDT0000@10.1.4.127@tcp:24/4 lens 224/224 e 1 to 1 dl 1435665247 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
      12:24:19:Lustre: 2930:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 99 previous similar messages
      12:24:19:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  rpc : @@@@@@ FAIL: can\'t put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY
      12:24:19:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  rpc : @@@@@@ FAIL: can\'t put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY_WAIT
      12:24:19:Lustre: DEBUG MARKER: rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY_WAIT
      12:24:19:Lustre: DEBUG MARKER: rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY
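
      The bulk timeout above comes out of target_bulk_io(): a service thread waits a
      bounded time for a bulk transfer to complete, and here the peer
      (lustre-MDT0001, itself stuck in recovery) never completes it. Below is a
      minimal standalone sketch of such a bounded wait; the types and names
      (bulk_desc, bd_done, bulk_wait) are illustrative, not the real ptlrpc
      structures.

      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>
      #include <unistd.h>

      /* Hypothetical stand-in for a bulk transfer descriptor. */
      struct bulk_desc {
              volatile bool bd_done;  /* set by the network layer on completion */
      };

      /*
       * Wait up to 'timeout' seconds for the bulk transfer to complete.
       * Returns 0 on success, -1 on timeout (the "timeout on bulk WRITE
       * after 100+0s" case in the log above).
       */
      static int bulk_wait(struct bulk_desc *desc, int timeout)
      {
              time_t deadline = time(NULL) + timeout;

              while (!desc->bd_done) {
                      if (time(NULL) >= deadline)
                              return -1;
                      sleep(1);  /* the kernel uses an event wait, not polling */
              }
              return 0;
      }

      int main(void)
      {
              struct bulk_desc desc = { .bd_done = false };

              /* With no peer ever completing the transfer, this times out. */
              if (bulk_wait(&desc, 2) < 0)
                      fprintf(stderr, "timeout on bulk WRITE after 2+0s\n");
              return 0;
      }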
      


        Activity

          Peter Jones added a comment -

          Landed for 2.8

          Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15458/
          Subject: LU-6780 ptlrpc: Do not resend req with allow_replay
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 0ee3487737bd876e233213ccec4e6fca4690093e
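
          For context on the subject line: a request with allow_replay set is one
          permitted to proceed while its import is still recovering rather than
          waiting for the FULL state, and the patch subject says such a request
          should not also go through the normal resend path. The following is a
          minimal standalone sketch of that guard with simplified types; rq_resend
          and rq_allow_replay echo fields of struct ptlrpc_request, but this is an
          illustration, not the actual fs/lustre-release change.

          #include <stdbool.h>
          #include <stdio.h>

          /* Hypothetical, trimmed-down request descriptor. */
          struct req {
                  bool rq_resend;        /* marked for resend after a failure  */
                  bool rq_allow_replay;  /* allowed while import is recovering */
          };

          static bool should_resend(const struct req *r)
          {
                  /* Do not resend req with allow_replay (per the patch subject):
                   * the recovery machinery already handles such requests, so a
                   * resend could hand the target a second copy of the operation. */
                  if (r->rq_allow_replay)
                          return false;
                  return r->rq_resend;
          }

          int main(void)
          {
                  struct req normal = { .rq_resend = true, .rq_allow_replay = false };
                  struct req replay = { .rq_resend = true, .rq_allow_replay = true };

                  printf("normal request: resend=%d\n", should_resend(&normal));
                  printf("allow_replay request: resend=%d\n", should_resend(&replay));
                  return 0;
          }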

          Gerrit Updater added a comment -

          wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15458
          Subject: LU-6780 ptlrpc: Do not resend req with allow_replay
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: e2e25cb1b74cc1616c2df3d7dce9d4f5e78437d6

          People

            Assignee: Di Wang
            Reporter: Di Wang
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: