[LU-5115] replay-single test_73b: @@@@@@ FAIL: post-failover df: 1

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Severity: 3
    • Rank: 14102

    Activity
          niu Niu Yawei (Inactive) added a comment - Dup of LU-6670

          niu Niu Yawei (Inactive) added a comment - Indeed, I think you are right.

          in "fail()", the import state of all clients will be checked and waited to become the specified state, and if some client MDC state is in FULL, it should not affect
          the check of other client's MDC import, Will "pdsh" or other similar tools wait all the nodes to finish the execution of the command?

          hongchao.zhang Hongchao Zhang added a comment - in "fail()", the import state of all clients will be checked and waited to become the specified state, and if some client MDC state is in FULL, it should not affect the check of other client's MDC import, Will "pdsh" or other similar tools wait all the nodes to finish the execution of the command?
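
          (As an aside on that question, an illustration rather than anything from the ticket: pdsh runs the command on all target nodes in parallel and does not exit until every node has finished, or until its -t/-u timeouts expire if given. The host names below are taken from the logs in this ticket.)

          pdsh -w shadow-40vm[5-6] \
              'lctl get_param -n mdc.lustre-MDT0000-mdc-*.mds_server_uuid'
          # Output comes back per node, prefixed with the node name, only
          # once each remote command has completed.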

          niu Niu Yawei (Inactive) added a comment - yes, it's quite possible they are related.

          standan Saurabh Tandan (Inactive) added a comment - Is it a duplicate of LU-6670?
          yujian Jian Yu added a comment - More failure instances on master branch:
          https://testing.hpdd.intel.com/test_sets/f29243fe-627f-11e6-b5b1-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/38583f22-63ea-11e6-b5b1-5254006e85c2

          niu Niu Yawei (Inactive) added a comment -

              In the future, please use "Raise bug" from the Maloo page so that it will automatically triage this failure.

          Ok, thanks.

          This looks like a test script problem:

          shadow-40vm5: CMD: shadow-40vm5.shadow.whamcloud.com lctl get_param -n at_max
          shadow-40vm6: CMD: shadow-40vm6.shadow.whamcloud.com lctl get_param -n at_max
          shadow-40vm5: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
          shadow-40vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
          shadow-40vm5: stat: cannot read file system information for `/mnt/lustre': Input/output error
           replay-single test_73b: @@@@@@ FAIL: post-failover df: 1
          

          wait_clients_import_state() reported that the shadow-40vm5 client recovered in 0 seconds? That looks suspicious to me. Look at the log for this client:

          00000001:02000400:1.0:1401256333.271532:0:23799:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
          00000001:02000400:0.0:1401256333.271547:0:23800:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
          00000001:02000400:1.0:1401256333.422776:0:23883:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
          00000001:02000400:0.0:1401256333.422803:0:23882:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
          

          The script checked the vm5 MDC state and found it already FULL.
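
          (For context, that check is essentially a polling loop over "lctl get_param". A simplified sketch follows, paraphrasing what test-framework.sh does; the function name and details here are illustrative, not the exact source.)

          # Illustrative sketch of the import-state wait, paraphrased from
          # test-framework.sh; not the exact source.
          wait_import_state_sketch() {
                  local expected=$1  # e.g. FULL
                  local param=$2     # e.g. mdc.lustre-MDT0000-mdc-*.mds_server_uuid
                  local i=0
                  local state

                  # mds_server_uuid prints "<server uuid> <import state>",
                  # so the second field is the current state.
                  state=$(lctl get_param -n "$param" 2>/dev/null | cut -f2)
                  while [ "$state" != "$expected" ]; do
                          sleep 1
                          i=$((i + 1))
                          state=$(lctl get_param -n "$param" 2>/dev/null | cut -f2)
                  done
                  echo "$param in $state state after $i sec"
          }

          If the import is still FULL from before the failover, the loop exits on its very first check, which would produce exactly the "after 0 sec" marker above.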

          00000001:02000400:1.0:1401256409.934172:0:23908:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
          00000001:02000400:1.0:1401256410.078821:0:23932:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
          00000080:00200000:0.0:1401256411.194194:0:23955:0:(llite_lib.c:1657:ll_statfs()) VFS Op: at 4299377545 jiffies
          00000080:00000004:0.0:1401256411.194198:0:23955:0:(obd_class.h:1294:obd_statfs()) osfs 4299275701, max_age 4299376545
          00800000:00000004:0.0:1401256411.194203:0:23955:0:(obd_class.h:1294:obd_statfs()) osfs 4299275699, max_age 4299376545
          00000100:00080000:0.0:1401256411.194224:0:23955:0:(client.c:1403:ptlrpc_send_new_req()) @@@ req from PID 23955 waiting for recovery: (FULL != REPLAY)  req@ffff88006e57f400 x1469319256038592/t0(0) o41->lustre-MDT0000-mdc-ffff88007d721000@10.1.5.230@tcp:12/10 lens 224/368 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
          

          vm6's MDC state became FULL and the statfs was issued; however, vm5 was actually still in the REPLAY state, which is why the statfs on vm5 failed.

          00000100:00080000:1.0:1401256419.512383:0:23956:0:(import.c:1283:ptlrpc_invalidate_import_thread()) ffff88007dac8000 lustre-MDT0000_UUID: changing import state from EVICTED to RECOVER
          00000100:00080000:1.0:1401256419.512384:0:23956:0:(import.c:1394:ptlrpc_import_recovery_state_machine()) reconnected to lustre-MDT0000_UUID@10.1.5.230@tcp
          00000100:00080000:1.0:1401256419.512386:0:23956:0:(import.c:1399:ptlrpc_import_recovery_state_machine()) ffff88007dac8000 lustre-MDT0000_UUID: changing import state from RECOVER to FULL
          00000100:02000000:1.0:1401256419.512389:0:23956:0:(import.c:1407:ptlrpc_import_recovery_state_machine()) lustre-MDT0000-mdc-ffff88007d721000: Connection restored to lustre-MDT0000 (at 10.1.5.230@tcp)
          00000080:00020000:0.0:1401256419.513378:0:23955:0:(llite_lib.c:1611:ll_statfs_internal()) md_statfs fails: rc = -5
          

          Finally, vm5 recovered; by then, however, the statfs had already failed.
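
          (As an aside, useful when reproducing this by hand: the import state can also be read directly on the client. The output below is abbreviated, and the exact fields vary by Lustre version.)

          lctl get_param mdc.lustre-MDT0000-mdc-*.import
          # mdc.lustre-MDT0000-mdc-ffff88007d721000.import=
          # import:
          #     name: lustre-MDT0000-mdc-ffff88007d721000
          #     state: FULL
          #     ...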

          Look at the following script:

          fail() {
                  local facets=$1
                  local clients=${CLIENTS:-$HOSTNAME}
          
                  facet_failover $* || error "failover: $?"
                  wait_clients_import_state "$clients" "$facets" FULL
                  clients_up || error "post-failover df: $?"
          }
          

          It's quite possible that when wait_clients_import_state() is called, some clients' MDC state is still FULL from before the failover (the client has not yet noticed the disconnect), so the FULL check passes immediately even though recovery has not actually happened. I'm not sure what the best way to improve the script is; maybe wait longer before calling wait_clients_import_state()?
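
          (One possible direction, sketched only: require each client's import to first leave FULL, proving it has actually seen the failover, before waiting for it to return to FULL. Whether wait_clients_import_state() accepts an alternation of states as written below is an assumption, not confirmed against test-framework.sh.)

          # Hypothetical sketch, not an actual fix: wait for the imports to
          # leave FULL before waiting for recovery to bring them back.
          fail() {
                  local facets=$1
                  local clients=${CLIENTS:-$HOSTNAME}

                  facet_failover $* || error "failover: $?"
                  # Assumption: the expected-state argument may be a
                  # \(A\|B\) alternation matched against the current state.
                  wait_clients_import_state "$clients" "$facets" \
                          "\(DISCONN\|CONNECTING\|REPLAY\)"
                  wait_clients_import_state "$clients" "$facets" FULL
                  clients_up || error "post-failover df: $?"
          }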


          adilger Andreas Dilger added a comment - In the future, please use "Raise bug" from the Maloo page so that it will automatically triage this failure.

          People

            Assignee: wc-triage WC Triage
            Reporter: niu Niu Yawei (Inactive)