[LU-13498] sanity test 56w fails with '/usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1 failed ' Created: 30/Apr/20  Updated: 07/Jul/21  Resolved: 17/Nov/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0, Lustre 2.12.7

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Sebastien Buisson
Resolution: Fixed Votes: 0
Labels: SSK
Environment:

shared key (SSK) enabled


Issue Links:
Related
is related to LU-11269 ptlrpc_set_add_req()) ASSERTION( req-... Resolved
is related to LU-9795 SSK test failures in many suites when... Reopened
is related to LU-12896 recovery-small test_110k: (gss_keyrin... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In the past week, sanity test 56w has been failing about 83% of the time (39 out of 47 runs) for review-dne-ssk and about 84% of the time (36 out of 43 runs) for review-dne-selinux-ssk. We don't see this test fail in non-SSK testing.

 

It looks like this test started failing at a high rate on April 23, 2020. There were 16 patches landed to master on April 23.

 

Here are a few links to test 56w failures:

https://testing.whamcloud.com/test_sets/5a2a02d8-ec20-4cb1-905a-eb91a8ff4c88

https://testing.whamcloud.com/test_sets/ef56cf99-fc91-47c3-bea4-5791fc068f18

https://testing.whamcloud.com/test_sets/e4b730e7-af11-4db6-ba4d-61f2fe5a2bcb

 

In the suite_log, we see

== sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 12:41:59 (1588164119)

striped dir -i0 -c1 -H crush /mnt/lustre/d56w.sanity

striped dir -i0 -c1 -H crush /mnt/lustre/d56w.sanity/dir1

striped dir -i0 -c1 -H crush /mnt/lustre/d56w.sanity/dir2

striped dir -i0 -c1 -H all_char /mnt/lustre/d56w.sanity/dir3

total: 200 link in 0.20 seconds: 1024.77 ops/second

/usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1

/mnt/lustre/d56w.sanity/file1: lfs migrate: cannot get group lock: Input/output error (5)

error: lfs migrate: /mnt/lustre/d56w.sanity/file1: cannot get group lock: Input/output error

falling back to rsync: rsync: ERROR: cannot stat destination "/mnt/lustre/d56w.sanity/.file1.62QDaz": Cannot send after transport endpoint shutdown (108)

rsync error: errors selecting input/output files, dirs (code 3) at main.c(635) [Receiver=3.1.2]

 

/mnt/lustre/d56w.sanity/file1: copy error, exiting

 sanity test_56w: @@@@@@ FAIL: /usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1 failed
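
For reference, a minimal sketch of the kind of operation test_56w exercises (not the literal test script; the path is just the one from the failure above): migrate the file to a 7-stripe layout, then check the resulting stripe count.

/usr/bin/lfs_migrate -y -c 7 /mnt/lustre/d56w.sanity/file1   # restripe in place; -y answers yes to the usage warning without prompting
lfs getstripe -c /mnt/lustre/d56w.sanity/file1               # expect 7 once the migration succeeds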

 

 

In the client 1 journal, we see Lustre errors

/usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  mds1_FSTYPE=ldiskfs ost1_FSTYPE=ldiskfs MGSFSTYPE=ldiskfs MDSFSTYPE=ldiskfs OSTFSTYPE=ldiskfs VERBOSE=true FSTYPE=ldiskfs NETTYPE=tcp sh -c "/usr/sbin/lctl mark == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 12:41:59 \(1588164119\)");echo XXRETCODE:$?'

Apr 29 12:42:00 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: DEBUG MARKER: == sanity test 56w: check lfs_migrate -c stripe_count works ========================================== 12:41:59 (1588164119)

Apr 29 12:42:00 trevis-3vm6.trevis.whamcloud.com mrshd[20366]: pam_unix(mrsh:session): session closed for user root

Apr 29 12:42:00 trevis-3vm6.trevis.whamcloud.com systemd-logind[556]: Removed session c1853.

Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) seq 306 (in main window) is a replay: max 356, winsize 2048

Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) Skipped 1 previous similar message

Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2118:gss_svc_verify_request()) phase 0: discard replayed req: seq 306

Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2118:gss_svc_verify_request()) Skipped 1 previous similar message

Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2288:gss_svc_handle_data()) svc 2 failed: major 0x00000002: req xid 1665307003134464 ctx ffffa01ab94c9040 idx 0xeccffee1dc90815c(0->10.9.3.8@tcp)

Apr 29 12:42:04 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:2288:gss_svc_handle_data()) Skipped 1 previous similar message

Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) seq 305 (in main window) is a replay: max 350, winsize 2048

Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) phase 0: discard replayed req: seq 296

Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) Skipped 8 previous similar messages

Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) svc 2 failed: major 0x00000002: req xid 1665307003134720 ctx ffffa01abbd8e440 idx 0xeccffee1dc90815a(0->10.9.3.8@tcp)

Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) Skipped 8 previous similar messages

Apr 29 12:42:37 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) Skipped 10 previous similar messages

Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) seq 315 (in main window) is a replay: max 356, winsize 2048

Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) phase 0: discard replayed req: seq 302

Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2118:gss_svc_verify_request()) Skipped 17 previous similar messages

Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) svc 2 failed: major 0x00000002: req xid 1665307003134720 ctx ffffa01abbd8e440 idx 0xeccffee1dc90815a(0->10.9.3.8@tcp)

Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 5075:0:(sec_gss.c:2288:gss_svc_handle_data()) Skipped 17 previous similar messages

Apr 29 12:43:43 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 19221:0:(sec_gss.c:504:gss_do_check_seq()) Skipped 17 previous similar messages

Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 11-0: lustre-OST0003-osc-ffffa01abb58c800: operation ldlm_enqueue to node 10.9.3.8@tcp failed: rc = -107

Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: lustre-OST0003-osc-ffffa01abb58c800: Connection to lustre-OST0003 (at 10.9.3.8@tcp) was lost; in progress operations using this service will wait for recovery to complete

Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: LustreError: 167-0: lustre-OST0003-osc-ffffa01abb58c800: This client was evicted by lustre-OST0003; in progress operations using this service will fail.

Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: lustre-OST0002-osc-ffffa01abb58c800: Connection restored to 10.9.3.8@tcp (at 10.9.3.8@tcp)

Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: 25173:0:(gss_cli_upcall.c:398:gss_do_ctx_fini_rpc()) client finishing forward ctx ffffa01abbf20b00 idx 0x256375f206ea7d19 (0->lustre-OST0002_UUID)

Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com kernel: Lustre: 11878:0:(sec_gss.c:1228:gss_cli_ctx_fini_common()) gss.keyring@ffffa01aa77c4000: destroy ctx ffffa01abbf20b00(0->lustre-OST0002_UUID)

Apr 29 12:43:54 trevis-3vm6.trevis.whamcloud.com in.mrshd[20590]: connect from 10.9.3.6 (10.9.3.6)



 Comments   
Comment by Gerrit Updater [ 30/Apr/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38435
Subject: LU-13498 tests: reproduce sanity 56w failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e64b9a8b4fee1fdb0e97a7844589b54037ed6ed9

Comment by Alex Zhuravlev [ 01/May/20 ]

jamesanunez are you able to reproduce the problem with ONLY=56w?

Comment by James Nunez (Inactive) [ 01/May/20 ]

Alex - No, I have not been able to reproduce with ONLY=56w. I'm running all of sanity, not using ONLY, right now to see if I can reproduce this failure. 
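
For illustration, a rough sketch of the two reproduction approaches being compared here (assuming the standard lustre/tests framework with SSK configured; exact session parameters omitted):

cd /usr/lib64/lustre/tests
ONLY=56w bash sanity.sh    # run only test_56w  -  has not reproduced the failure
bash sanity.sh             # run the full sanity suite  -  matches the failing sessions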

Comment by Gerrit Updater [ 04/May/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38467
Subject: LU-13498 tests: attempt to reproduce sanity 56w failure - 1
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 526f25a91002f9eb912ec45c24a296dcf62fa11d

Comment by Gerrit Updater [ 04/May/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38468
Subject: LU-13498 tests: attempt to reproduce sanity 56w failure - 2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2148ccbe08dec4159d62fd2c0c7aa5d22fd43736

Comment by Gerrit Updater [ 04/May/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38469
Subject: LU-13498 tests: attempt to reproduce sanity 56w failure - 3
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3de1e1f576fa58432122ff573f55b3a78b700c73

Comment by Gerrit Updater [ 04/May/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38470
Subject: LU-13498 tests: attempt to reproduce sanity 56w failure - 4
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 015eea490fa7daeddd614fcadc29939904dde2a9

Comment by Gerrit Updater [ 04/May/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38471
Subject: LU-13498 tests: attempt to reproduce sanity 56w failure - 5
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ff15058c93c0aa6eae2f399642dffc162b13b9b4

Comment by Gerrit Updater [ 04/May/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38473
Subject: LU-13498 tests: attempt to reproduce sanity 56w failure - 6
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 927668b63dacbbad75409454fc867e1aa943df39

Comment by Gerrit Updater [ 04/May/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38475
Subject: LU-13498 tests: attempt to reproduce sanity 56w failure - 7
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ff08911ca69ded11523f8634afc6b8f2d84f9289

Comment by Sebastien Buisson [ 05/May/20 ]

With the various test patches above, we get the following results regarding sanity test_56w with SSK enabled:

[...]
7e8efb339b LU-12043 llite: fix to submit complete read block with ra disabled     <-- fail
596420d2e7 LU-6142 fid: Fix style issues for fid_handler.c
a4f8f2840f LU-13454 lnet: Handle health stats for reply to optimized GET
0716f5a9d9 LU-12506 mdc: clean up code style for mdc_locks.c
b09afdf576 LU-11269 ptlrpc: request's counter in import                                        <-- fail
a486ca181d LU-10391 lnet: change lnet_sock_connect to take sockaddr              <-- pass
3f608461b3 LU-11025 dne: refactor dir migration                                                    <-- pass
[...]

The list above is an extract of the patches that were merged on 04/23, with the latest at the top.
So it seems the regression started to appear with patch b09afdf576 "LU-11269 ptlrpc: request's counter in import".
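
As a side note, a hedged sketch of how one could list the candidate commits locally (the actual narrowing above was done by pushing one test patch per candidate and letting autotest run sanity with SSK against each):

git log --no-merges --oneline --since=2020-04-23 --until=2020-04-24 origin/master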

Comment by Alex Zhuravlev [ 05/May/20 ]

looking at this..

Comment by Alex Zhuravlev [ 25/May/20 ]

actually I can't reproduce this with https://review.whamcloud.com/#/c/38445/

Comment by Sebastien Buisson [ 25/May/20 ]

I can see 2 differences between your testing for patch https://review.whamcloud.com/38445, and the testing for patch https://review.whamcloud.com/38471:

  • your patch uses only one MDS and one OSS, whereas the other patch specifies mdscount=2 mdtcount=4 osscount=2 ostcount=8
  • your patch only runs test_56w (many times), whereas the other patch runs all the tests in sanity before this one. So maybe there is a side effect of another test, like restarting targets or something else?

Comment by Alex Zhuravlev [ 25/May/20 ]

okay.. let me try more..

 

Comment by Alexander Boyko [ 25/May/20 ]

I see a wrong error handler in gss: it decrements the counter twice.
Could this help?

diff --git a/lustre/ptlrpc/gss/gss_cli_upcall.c b/lustre/ptlrpc/gss/gss_cli_upcall.c
index 70d4711..68f6ca1 100644
--- a/lustre/ptlrpc/gss/gss_cli_upcall.c
+++ b/lustre/ptlrpc/gss/gss_cli_upcall.c
@@ -408,9 +408,10 @@ int gss_do_ctx_fini_rpc(struct gss_cli_ctx *gctx)

         rc = ptlrpc_request_bufs_pack(req, LUSTRE_OBD_VERSION, SEC_CTX_FINI,
                                       NULL, ctx);
-       if (rc)
-               GOTO(out_ref, rc);
-
+       if (rc) {
+               ptlrpc_request_free(req);
+               GOTO(out, rc);
+       }
         /* fix the user desc */
         if (req->rq_pack_udesc) {
                 /* we rely the fact that this request is in AUTH mode,

Comment by Alex Zhuravlev [ 25/May/20 ]

aboyko can you please clarity why do you think it's done twice?

ptlrpc_req_finished() -> __ptlrpc_req_finished()

if rq_refcount becomes zero, then imp_reqs is decremented

 

Comment by Alexander Boyko [ 25/May/20 ]

First: gss_do_ctx_fini_rpc() -> ptlrpc_request_bufs_pack() -> atomic_dec(&imp->imp_reqs)
Second: gss_do_ctx_fini_rpc() -> ptlrpc_req_finished() -> __ptlrpc_req_finished() -> __ptlrpc_free_req() -> atomic_dec(&request->rq_import->imp_reqs)

Am I wrong?

Comment by Gerrit Updater [ 11/Jun/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38898
Subject: LU-13498 tests: try to reproduce sanity 56w failure - master
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f78294786bc75a01c653acab95731fc2f9f05e39

Comment by Gerrit Updater [ 11/Jun/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/38899
Subject: LU-13498 tests: try to fix sanity 56w failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8934f74e1f45c74d182619dd908e9bcbd287b89c

Comment by Gerrit Updater [ 30/Jul/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/39541
Subject: LU-13498 tests: except sanity/recovery-small tests with SSK
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c16b5f9c5847cb3b57a65a2d47ffcad5c1440d26

Comment by Gerrit Updater [ 02/Oct/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40122
Subject: LU-13498 gss: update sequence in case of target disconnect
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1e47c6420e1259b22ec441696adf7cbfb194cf17

Comment by Gerrit Updater [ 05/Oct/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40140
Subject: LU-13498 sec: fix credentials with nodemap and SSK
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4db1dbeedff72b1bbe08968d52ce41b9ca05a60

Comment by Gerrit Updater [ 07/Oct/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40161
Subject: LU-13498 tests: remove tests from ALWAYS_EXCEPT with SSK
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a1b64b894b960795c5de06c9458ec615931b6834

Comment by Gerrit Updater [ 19/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40140/
Subject: LU-13498 sec: fix credentials with nodemap and SSK
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2bf6442d7d9bd452153e6b1ea08ddaae3dfb3716

Comment by Gerrit Updater [ 22/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40122/
Subject: LU-13498 gss: update sequence in case of target disconnect
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1275857c178fdf6e301345c7588499451c8ffd37

Comment by Gerrit Updater [ 29/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40161/
Subject: LU-13498 tests: remove tests from ALWAYS_EXCEPT with SSK
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8f50015022da5f5e2f55bdf843c09f44f7beb10c

Comment by Peter Jones [ 07/Nov/20 ]

Is there still work remaining here or can this ticket be marked as resolved?

Comment by Sebastien Buisson [ 17/Nov/20 ]

From my standpoint, there is no remaining work here: sanity test_56w is fixed with SHARED_KEY, and the review-dne-ssk and review-dne-selinux-ssk test sessions are no longer optional but enforced.

Comment by Peter Jones [ 17/Nov/20 ]

Great - thanks

Comment by Gerrit Updater [ 14/Dec/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40960
Subject: LU-13498 sec: fix credentials with nodemap and SSK
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 410638b119dda475aff41c65e15c150f1be75c90

Comment by Gerrit Updater [ 16/Dec/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40993
Subject: LU-13498 gss: update sequence in case of target disconnect
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: dca08c3f9cde937dcae4f3085cd1ff4af2c34de5

Comment by Gerrit Updater [ 23/Jan/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40960/
Subject: LU-13498 sec: fix credentials with nodemap and SSK
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 233656e5dbffec99421326644f64db3d4e46dbc7

Comment by Gerrit Updater [ 17/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40993/
Subject: LU-13498 gss: update sequence in case of target disconnect
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: be1fd5f1be6386e7c49153b8cd7ba2c722fb077d

Comment by Gerrit Updater [ 07/Jul/21 ]

Diego Moreno (morenod@ethz.ch) uploaded a new patch: https://review.whamcloud.com/44163
Subject: LU-13498 sec: fix credentials with nodemap and SSK
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: b1ccfd0e0353249a06e39524844c71efff73122c
