
recovery-small test_134: FAIL: rm failed

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.16.0, Lustre 2.15.6
    • 3

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/819093e4-599f-4cdd-a07b-beb8ba8b3c62

      test_134 failed with the following error:

      rm: cannot remove '/mnt/lustre/d134.recovery-small/1/f134.recovery-small': Input/output error
      pdsh@onyx-81vm4: onyx-81vm4: ssh exited with exit code 5
      onyx-81vm6: mv: failed to access '/mnt/lustre/d134.recovery-small/2/f134.recovery-small_2': Cannot send after transport endpoint shutdown
      pdsh@onyx-81vm4: onyx-81vm6: ssh exited with exit code 1
      pdsh@onyx-81vm4: onyx-81vm6: ssh exited with exit code 5
      CMD: onyx-81vm4.onyx.whamcloud.com,onyx-81vm6,onyx-81vm7 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/opt/iozone/bin:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config 		TESTLOG_PREFIX=/autotest/autotest-2/2024-09-30/lustre-master_failover-part-1_4581_150_55f5e009-b5b6-4b5e-89f5-a0d648cecea4//recovery-small TESTNAME=test_134 		CONFIG=/usr/lib64/lustre/tests/cfg/autotest_config.sh bash rpc.sh wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid 
      onyx-81vm7: onyx-81vm7.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      onyx-81vm4: onyx-81vm4.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      onyx-81vm6: onyx-81vm6.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      onyx-81vm4: CMD: onyx-81vm4.onyx.whamcloud.com lctl get_param -n at_max
      onyx-81vm7: CMD: onyx-81vm7.onyx.whamcloud.com lctl get_param -n at_max
      onyx-81vm6: CMD: onyx-81vm6.onyx.whamcloud.com lctl get_param -n at_max
      onyx-81vm4: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      onyx-81vm7: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      onyx-81vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
       recovery-small test_134: @@@@@@ FAIL: rm failed 
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4581 - 4.18.0-513.24.1.el8_9.x86_64
      servers: https://build.whamcloud.com/job/lustre-master/4581 - 4.18.0-513.24.1.el8_lustre.x86_64


      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-small test_134 - rm failed

      Attachments

        Issue Links

          Activity

            [LU-18302] recovery-small test_134: FAIL: rm failed
            laisiyao Lai Siyao added a comment -
            00000001:00080000:1.0:1727765474.036128:0:3575:0:(tgt_lastrcvd.c:1724:tgt_clients_data_init()) RCVRNG CLIENT uuid: eabd0dd9-14ce-4ee1-a2e9-6edf5b48757f idx: 1 lr: 932007903247 srv lr: 944892805333 lx: 0 gen 5
            00000001:00080000:1.0:1727765474.036144:0:3575:0:(tgt_lastrcvd.c:1724:tgt_clients_data_init()) RCVRNG CLIENT uuid: 7ebc758c-f217-4d7e-bd3a-03a41e195093 idx: 2 lr: 932007903248 srv lr: 944892805333 lx: 0 gen 3
            00000001:00080000:1.0:1727765474.036163:0:3575:0:(tgt_lastrcvd.c:1724:tgt_clients_data_init()) RCVRNG CLIENT uuid: 4d35a620-fa6c-4bfd-881f-8cd6c5f84bec idx: 3 lr: 0 srv lr: 944892805333 lx: 0 gen 14
            

            This shows that the data for the three clients was initialized, but during recovery the server denied their connections because the UUIDs did not match.
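
            To compare the recorded client UUIDs against the ones the clients actually present at reconnect, the RCVRNG CLIENT lines can be pulled out of a debug log mechanically. The following is only a rough sketch: the regular expression is inferred from the three lines quoted above (with the wrapped UUIDs joined back together), and the helper name parse_rcvrng is hypothetical, not part of any Lustre tool.

```python
import re

# Pattern inferred from the tgt_clients_data_init() lines quoted in this
# ticket; it is an assumption, not an official description of the format.
RCVRNG_RE = re.compile(
    r"RCVRNG CLIENT uuid: (?P<uuid>[0-9a-f-]+) idx: (?P<idx>\d+) "
    r"lr: (?P<lr>\d+) srv lr: (?P<srv_lr>\d+) lx: (?P<lx>\d+) gen (?P<gen>\d+)"
)


def parse_rcvrng(line):
    """Return the fields of one RCVRNG CLIENT line as a dict, or None."""
    match = RCVRNG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Keep the UUID as a string; every other field is a decimal integer.
    return {k: (v if k == "uuid" else int(v)) for k, v in fields.items()}


# First log line from the comment above, with the UUID unwrapped.
SAMPLE = (
    "00000001:00080000:1.0:1727765474.036128:0:3575:0:"
    "(tgt_lastrcvd.c:1724:tgt_clients_data_init()) RCVRNG CLIENT uuid: "
    "eabd0dd9-14ce-4ee1-a2e9-6edf5b48757f idx: 1 lr: 932007903247 "
    "srv lr: 944892805333 lx: 0 gen 5"
)

if __name__ == "__main__":
    print(parse_rcvrng(SAMPLE))
```

            Collecting the uuid values this way from the MDS log, and the client UUIDs from the client side, would make a mismatch easy to spot with a simple set comparison.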

            laisiyao Lai Siyao added a comment -
            00000100:00080000:0.0:1727765542.183015:0:707970:0:(import.c:1311:ptlrpc_connect_interpret()) @@@ lustre-MDT0000-mdc-ffff9003ff03b800: evicting (reconnect/recover flags not set: 4)  req@ffff90040bc44000 x1811685733501568/t0(0) o38->lustre-MDT0000-mdc-ffff9003ff03b800@10.240.26.208@tcp:12/10 lens 520/416 e 0 to 0 dl 1727765562 ref 1 fl Interpret:RNQU/200/0 rc 0/0 job:'kworker.0' uid:0 gid:0
            
            00010000:02000000:0.0:1727765540.493306:0:3609:0:(ldlm_lib.c:1751:target_finish_recovery()) lustre-MDT0000: Recovery over after 1:00, of 3 clients 0 recovered and 3 were evicted.
            

            These log messages show that recovery was aborted. The possible reason is that client_data is corrupt, so the 3 client connections were denied and the clients were evicted. I don't know what led to this.
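
            Since this subtest fails across many sessions, it may help to flag the "all clients evicted" case automatically when scanning MDS logs. The sketch below parses the target_finish_recovery() summary line quoted above; the pattern is inferred from that single line, and the function names parse_recovery_summary and all_evicted are hypothetical helpers, not existing test-framework functions.

```python
import re

# Pattern inferred from the one "Recovery over" line in this ticket;
# treat it as an assumption about the message format.
SUMMARY_RE = re.compile(
    r"(?P<target>\S+): Recovery over after (?P<duration>\d+:\d+), "
    r"of (?P<total>\d+) clients (?P<recovered>\d+) recovered "
    r"and (?P<evicted>\d+) (?:was|were) evicted"
)


def parse_recovery_summary(line):
    """Extract target name and client counts from a recovery summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return {
        "target": match.group("target"),
        "duration": match.group("duration"),  # kept as the raw string
        "total": int(match.group("total")),
        "recovered": int(match.group("recovered")),
        "evicted": int(match.group("evicted")),
    }


def all_evicted(summary):
    """True when there were clients to recover and every one was evicted."""
    return (
        summary["total"] > 0
        and summary["recovered"] == 0
        and summary["evicted"] == summary["total"]
    )


# Summary line from the comment above.
SAMPLE = (
    "00010000:02000000:0.0:1727765540.493306:0:3609:0:"
    "(ldlm_lib.c:1751:target_finish_recovery()) lustre-MDT0000: "
    "Recovery over after 1:00, of 3 clients 0 recovered and 3 were evicted."
)

if __name__ == "__main__":
    summary = parse_recovery_summary(SAMPLE)
    print(summary, all_evicted(summary))
```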

            yujian Jian Yu added a comment - +1 on Lustre b2_15 branch: https://testing.whamcloud.com/test_sets/fdaf4eb1-0b01-446e-9ac6-46a18599796c
            gerrit Gerrit Updater added a comment -

            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56641
            Subject: LU-18302 test: fix recovery-small test_134
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1b661220783ad565b0afd24250cfbfd3faa92b5d

            yujian Jian Yu added a comment -

            The subtest has been failing for years in the failover-part-1 and failover-zfs-part-1 test sessions.

            lixi_wc Li Xi added a comment -

            laisiyao, would you please take a look?


            People

              laisiyao Lai Siyao
              maloo Maloo