Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Fix Version/s: Lustre 2.14.0
- Environment: shared key (SSK) enabled
- Severity: 3
Description
In the past week, sanity test 56w has been failing about 83% of the time (39 out of 47 runs) for review-dne-ssk and about 84% of the time (36 out of 43 runs) for review-dne-selinux-ssk. We don't see this test fail for non-SSK testing.
It looks like this test started failing at a high rate on April 23, 2020. There were 16 patches that landed to master on April 23.
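One way to narrow down which of those patches introduced the problem is to list the commits in that window. This is only a sketch; it assumes a local checkout of the lustre-release repository with a remote named origin, and commit dates only approximate when a patch actually landed through Gerrit.

  # List commits on master around April 23, 2020 (dates are approximate;
  # the Gerrit landing time may differ from the recorded commit date).
  git log --oneline --since=2020-04-22 --until=2020-04-24 origin/master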
Here are links to some of the test 56w failures:
https://testing.whamcloud.com/test_sets/5a2a02d8-ec20-4cb1-905a-eb91a8ff4c88
https://testing.whamcloud.com/test_sets/ef56cf99-fc91-47c3-bea4-5791fc068f18
https://testing.whamcloud.com/test_sets/e4b730e7-af11-4db6-ba4d-61f2fe5a2bcb
In the suite_log, we see
== sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 12:41:59 (1588164119)
striped dir -i0 -c1 -H crush /mnt/lustre/d56w.sanity
striped dir -i0 -c1 -H crush /mnt/lustre/d56w.sanity/dir1
striped dir -i0 -c1 -H crush /mnt/lustre/d56w.sanity/dir2
striped dir -i0 -c1 -H all_char /mnt/lustre/d56w.sanity/dir3
total: 200 link in 0.20 seconds: 1024.77 ops/second
/usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1
/mnt/lustre/d56w.sanity/file1: lfs migrate: cannot get group lock: Input/output error (5)
error: lfs migrate: /mnt/lustre/d56w.sanity/file1: cannot get group lock: Input/output error
falling back to rsync:
rsync: ERROR: cannot stat destination "/mnt/lustre/d56w.sanity/.file1.62QDaz": Cannot send after transport endpoint shutdown (108)
rsync error: errors selecting input/output files, dirs (code 3) at main.c(635) [Receiver=3.1.2]
/mnt/lustre/d56w.sanity/file1: copy error, exiting
 sanity test_56w: @@@@@@ FAIL: /usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1 failed
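For reference, the failing step can be run by hand on a client to check whether the group lock error reproduces outside the test script. This is a minimal sketch; the mount point and file name mirror the test output above and would need to match the local setup.

  # Recreate a small test file and retry the same migration the test runs.
  mkdir -p /mnt/lustre/d56w.sanity
  dd if=/dev/zero of=/mnt/lustre/d56w.sanity/file1 bs=1M count=4
  /usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1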
In the client 1 journal, we see Lustre errors
/usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" mds1_FSTYPE=ldiskfs ost1_FSTYPE=ldiskfs MGSFSTYPE=ldiskfs MDSFSTYPE=ldiskfs OSTFSTYPE=ldiskfs VERBOSE=true FSTYPE=ldiskfs NETTYPE=tcp sh -c "/usr/sbin/lctl mark == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 12:41:59 \(1588164119\)");echo XXRETCODE:$?'
Apr 29 12:42:00 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: DEBUG MARKER: == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 12:41:59 (1588164119)
Apr 29 12:42:00 trevis-3vm6.trevis.whamcloud.com mrshd[20366]: pam_unix(mrsh:session): session closed for user root
Apr 29 12:42:00 trevis-3vm6.trevis.whamcloud.com systemd-logind[556]: Removed session c1853.
Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) seq 306 (in main window) is a replay: max 356, winsize 2048
Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) Skipped 1 previous similar message
Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2118:gss_svc_verify_request()) phase 0: discard replayed req: seq 306
Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2118:gss_svc_verify_request()) Skipped 1 previous similar message
Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2288:gss_svc_handle_data()) svc 2 failed: major 0x00000002: req xid 1665307003134464 ctx ffffa01ab94c9040 idx 0xeccffee1dc90815c(0->10.9.3.8@tcp)
Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2288:gss_svc_handle_data()) Skipped 1 previous similar message
Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) seq 305 (in main window) is a replay: max 350, winsize 2048
Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) phase 0: discard replayed req: seq 296
Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) Skipped 8 previous similar messages
Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) svc 2 failed: major 0x00000002: req xid 1665307003134720 ctx ffffa01abbd8e440 idx 0xeccffee1dc90815a(0->10.9.3.8@tcp)
Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) Skipped 8 previous similar messages
Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) Skipped 10 previous similar messages
Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) seq 315 (in main window) is a replay: max 356, winsize 2048
Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) phase 0: discard replayed req: seq 302
Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) Skipped 17 previous similar messages
Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) svc 2 failed: major 0x00000002: req xid 1665307003134720 ctx ffffa01abbd8e440 idx 0xeccffee1dc90815a(0->10.9.3.8@tcp)
Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) Skipped 17 previous similar messages
Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) Skipped 17 previous similar messages
Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 11-0: lustre-OST0003-osc-ffffa01abb58c800: operation ldlm_enqueue to node 10.9.3.8@tcp failed: rc = -107
Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: lustre-OST0003-osc-ffffa01abb58c800: Connection to lustre-OST0003 (at 10.9.3.8@tcp) was lost; in progress operations using this service will wait for recovery to complete
Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 167-0: lustre-OST0003-osc-ffffa01abb58c800: This client was evicted by lustre-OST0003; in progress operations using this service will fail.
Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: lustre-OST0002-osc-ffffa01abb58c800: Connection restored to 10.9.3.8@tcp (at 10.9.3.8@tcp)
Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: 25173:0:(gss_cli_upcall.c:398:gss_do_ctx_fini_rpc()) client finishing forward ctx ffffa01abbf20b00 idx 0x256375f206ea7d19 (0->lustre-OST0002_UUID)
Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: 11878:0:(sec_gss.c:1228:gss_cli_ctx_fini_common()) gss.keyring@ffffa01aa77c4000: destroy ctx ffffa01abbf20b00(0->lustre-OST0002_UUID)
Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com in.mrshd[20590]: connect from 10.9.3.6 (10.9.3.6)
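When triaging similar runs, the GSS replay and eviction sequence can be pulled out of a client's kernel journal directly. This is just a sketch; it assumes systemd journald is in use on the node, and the time window should be adjusted to the run being examined (the times here match the run above).

  # Extract the GSS replay / eviction messages from the kernel log for the
  # window covering the test.
  journalctl -k --since "2020-04-29 12:41:00" --until "2020-04-29 12:45:00" \
      | grep -E 'gss_do_check_seq|gss_svc_verify_request|evicted'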
Attachments
Issue Links
- is related to
  - LU-9795 SSK test failures in many suites when SHARED_KEY is enabled (Reopened)
  - LU-12896 recovery-small test_110k: (gss_keyring.c:152:ctx_upcall_timeout_kr()) ASSERTION( key ) failed (Resolved)
  - LU-11269 ptlrpc_set_add_req()) ASSERTION( req->rq_import->imp_state != LUSTRE_IMP_IDLE ) failed (Resolved)