Lustre / LU-13498

sanity test 56w fails with '/usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1 failed '


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.14.0, Lustre 2.12.7
    • Fix Version/s: Lustre 2.14.0
    • Environment: shared key (SSK) enabled
    • Severity: 3

    Description

      In the past week, sanity test 56w has failed about 83% of the time (39 out of 47 runs) for review-dne-ssk and about 84% of the time (36 out of 43 runs) for review-dne-selinux-ssk. We don’t see this test fail for non-SSK testing.

      It looks like this test started failing at a high rate on April 23, 2020. There were 16 patches that landed on master on April 23.

      Here are a few links to test 56w failures:

      https://testing.whamcloud.com/test_sets/5a2a02d8-ec20-4cb1-905a-eb91a8ff4c88

      https://testing.whamcloud.com/test_sets/ef56cf99-fc91-47c3-bea4-5791fc068f18

      https://testing.whamcloud.com/test_sets/e4b730e7-af11-4db6-ba4d-61f2fe5a2bcb

      In the suite_log, we see:

      == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 12:41:59 (1588164119)
      striped dir -i0 -c1 -H crush /mnt/lustre/d56w.sanity
      striped dir -i0 -c1 -H crush /mnt/lustre/d56w.sanity/dir1
      striped dir -i0 -c1 -H crush /mnt/lustre/d56w.sanity/dir2
      striped dir -i0 -c1 -H all_char /mnt/lustre/d56w.sanity/dir3
      total: 200 link in 0.20 seconds: 1024.77 ops/second
      /usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1
      /mnt/lustre/d56w.sanity/file1: lfs migrate: cannot get group lock: Input/output error (5)
      error: lfs migrate: /mnt/lustre/d56w.sanity/file1: cannot get group lock: Input/output error
      falling back to rsync: rsync: ERROR: cannot stat destination "/mnt/lustre/d56w.sanity/.file1.62QDaz": Cannot send after transport endpoint shutdown (108)
      rsync error: errors selecting input/output files, dirs (code 3) at main.c(635) [Receiver=3.1.2]
      /mnt/lustre/d56w.sanity/file1: copy error, exiting
      sanity test_56w: @@@@@@ FAIL: /usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1 failed

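      For reference, the failing step can be reproduced by hand with something like the sketch below. This is a minimal sketch, not the exact sanity.sh test script; it assumes a client mounted at /mnt/lustre with at least 7 OSTs, and the file name and size are illustrative:

      # create a 1-stripe file, then ask lfs_migrate to restripe it across 7 OSTs
      mkdir -p /mnt/lustre/d56w.sanity
      lfs setstripe -c 1 /mnt/lustre/d56w.sanity/file1
      dd if=/dev/zero of=/mnt/lustre/d56w.sanity/file1 bs=1M count=8 conv=notrunc
      /usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1
      lfs getstripe -c /mnt/lustre/d56w.sanity/file1   # expect 7 on success

      In the failing runs, the group lock request returns EIO (5), apparently because the OSC import has already been cut off (see the eviction in the journal below), and the rsync fallback on the same mount then fails with ESHUTDOWN (108).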
      In the client 1 journal, we see Lustre errors:

      /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  mds1_FSTYPE=ldiskfs ost1_FSTYPE=ldiskfs MGSFSTYPE=ldiskfs MDSFSTYPE=ldiskfs OSTFSTYPE=ldiskfs VERBOSE=true FSTYPE=ldiskfs NETTYPE=tcp sh -c "/usr/sbin/lctl mark == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 12:41:59 \(1588164119\)");echo XXRETCODE:$?'
      Apr 29 12:42:00 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: DEBUG MARKER: == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 12:41:59 (1588164119)
      Apr 29 12:42:00 trevis-3vm6.trevis.whamcloud.com mrshd[20366]: pam_unix(mrsh:session): session closed for user root
      Apr 29 12:42:00 trevis-3vm6.trevis.whamcloud.com systemd-logind[556]: Removed session c1853.
      Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) seq 306 (in main window) is a replay: max 356, winsize 2048
      Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) Skipped 1 previous similar message
      Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2118:gss_svc_verify_request()) phase 0: discard replayed req: seq 306
      Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2118:gss_svc_verify_request()) Skipped 1 previous similar message
      Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2288:gss_svc_handle_data()) svc 2 failed: major 0x00000002: req xid 1665307003134464 ctx ffffa01ab94c9040 idx 0xeccffee1dc90815c(0->10.9.3.8@tcp)
      Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2288:gss_svc_handle_data()) Skipped 1 previous similar message
      Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) seq 305 (in main window) is a replay: max 350, winsize 2048
      Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) phase 0: discard replayed req: seq 296
      Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) Skipped 8 previous similar messages
      Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) svc 2 failed: major 0x00000002: req xid 1665307003134720 ctx ffffa01abbd8e440 idx 0xeccffee1dc90815a(0->10.9.3.8@tcp)
      Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) Skipped 8 previous similar messages
      Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) Skipped 10 previous similar messages
      Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) seq 315 (in main window) is a replay: max 356, winsize 2048
      Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) phase 0: discard replayed req: seq 302
      Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) Skipped 17 previous similar messages
      Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) svc 2 failed: major 0x00000002: req xid 1665307003134720 ctx ffffa01abbd8e440 idx 0xeccffee1dc90815a(0->10.9.3.8@tcp)
      Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) Skipped 17 previous similar messages
      Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) Skipped 17 previous similar messages
      Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 11-0: lustre-OST0003-osc-ffffa01abb58c800: operation ldlm_enqueue to node 10.9.3.8@tcp failed: rc = -107
      Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: lustre-OST0003-osc-ffffa01abb58c800: Connection to lustre-OST0003 (at 10.9.3.8@tcp) was lost; in progress operations using this service will wait for recovery to complete
      Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 167-0: lustre-OST0003-osc-ffffa01abb58c800: This client was evicted by lustre-OST0003; in progress operations using this service will fail.
      Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: lustre-OST0002-osc-ffffa01abb58c800: Connection restored to 10.9.3.8@tcp (at 10.9.3.8@tcp)
      Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: 25173:0:(gss_cli_upcall.c:398:gss_do_ctx_fini_rpc()) client finishing forward ctx ffffa01abbf20b00 idx 0x256375f206ea7d19 (0->lustre-OST0002_UUID)
      Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: 11878:0:(sec_gss.c:1228:gss_cli_ctx_fini_common()) gss.keyring@ffffa01aa77c4000: destroy ctx ffffa01abbf20b00(0->lustre-OST0002_UUID)
      Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com in.mrshd[20590]: connect from 10.9.3.6 (10.9.3.6)
      
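      The "seq ... is a replay" and "discard replayed req" lines come from the GSS sequence-number replay check in sec_gss.c, which suggests live requests from the client are being discarded as replays until the ldlm_enqueue fails with rc = -107 (ENOTCONN) and the OST evicts the client. A quick way to confirm the eviction from the logs and params is something like the following sketch (the OST name is taken from the journal above; the grep patterns are illustrative):

      # on the client: import state of the OSC that reported the eviction
      lctl get_param osc.lustre-OST0003-osc-*.import | grep -E 'state|connect'
      # on the OSS: how many requests the GSS layer discarded as replays
      journalctl -k | grep -c 'discard replayed req'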
      

            People

              Assignee: Sebastien Buisson
              Reporter: James Nunez (Inactive)