[LU-17147] sanity-lsnapshot test_1b: Fail to create lss_1b_0 Created: 26/Sep/23  Updated: 18/Oct/23  Resolved: 18/Oct/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17142 MGC long time connection Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Alexander Boyko <c17825@cray.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/6df69fab-a0cd-49b7-a6dd-426a9c267d8a

test_1b failed with the following error:

Create lss_1b_0
CMD: onyx-60vm8 /usr/sbin/lctl snapshot_create -F lustre -n lss_1b_0
onyx-60vm8: Fail to freeze barrier for lustre: Invalid argument
onyx-60vm8: Can't create the snapshot lss_1b_0
pdsh@onyx-60vm1: onyx-60vm8: ssh exited with exit code 22
CMD: onyx-60vm8 cat /var/log/lsnapshot.log
Mon Sep 25 18:02:19 2023 (784346:jt_snapshot_create:1532:lustre:ssh): Create snapshot lss_0_0 successfully with comment <(null)>, barrier , timeout <30>
Mon Sep 25 18:02:21 2023 (784875:jt_snapshot_create:1532:lustre:ssh): Create snapshot lss_0_1 successfully with comment <(null)>, barrier , timeout <-1>
Mon Sep 25 18:02:23 2023 (785404:jt_snapshot_create:1532:lustre:ssh): Create snapshot lss_0_2 successfully with comment , barrier , timeout <30>
Mon Sep 25 18:02:25 2023 (785933:jt_snapshot_create:1532:lustre:ssh): Create snapshot lss_0_3 successfully with comment , barrier , timeout <30>
Mon Sep 25 18:02:26 2023 (786462:jt_snapshot_create:1525:lustre:ssh): Can't create snapshot lss_0_0 with comment <(null)> barrier , timeout <30>: -17
Mon Sep 25 18:02:32 2023 (787140:jt_snapshot_create:1532:lustre:ssh): Create snapshot lss_1a_0 successfully with comment <(null)>, barrier , timeout <30>
Mon Sep 25 18:02:35 2023 (787677:jt_snapshot_mount:2520:lustre:ssh): The snapshot lss_1a_0 is mounted
Mon Sep 25 18:02:38 2023 (788595:jt_snapshot_umount:2675:lustre:ssh): the snapshot lss_1a_0 have been umounted
Mon Sep 25 18:02:44 2023 (789655:snapshot_create:1419:lustre:ssh): Can't set barrier within 30 seconds on lustre: rc = -22
Mon Sep 25 18:02:44 2023 (789655:jt_snapshot_create:1525:lustre:ssh): Can't create snapshot lss_1b_0 with comment <(null)> barrier , timeout <30>: -22
 sanity-lsnapshot test_1b: @@@@@@ FAIL: (1) Fail to create lss_1b_0 

[14771.317798] Lustre: DEBUG MARKER: == sanity-lsnapshot test 1b: mount snapshot without original filesystem mounted ========================================================== 18:02:43 (1695664963)
[14771.487370] Lustre: DEBUG MARKER: /usr/sbin/lctl snapshot_create -F lustre -n lss_1b_0
[14772.106064] Lustre: DEBUG MARKER: cat /var/log/lsnapshot.log
[14772.479236] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-lsnapshot test_1b: @@@@@@ FAIL: \(1\) Fail to create lss_1b_0 
[14772.681271] Lustre: DEBUG MARKER: sanity-lsnapshot test_1b: @@@@@@ FAIL: (1) Fail to create lss_1b_0

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/98866 - 4.18.0-477.21.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/98866 - 4.18.0-477.21.1.el8_lustre.x86_64

<<Please provide additional information about the failure here>>

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lsnapshot test_1b - (1) Fail to create lss_1b_0



 Comments   
Comment by Andreas Dilger [ 28/Sep/23 ]

This has only failed recently (and twice) with patch https://review.whamcloud.com/52498 "LU-17142 mgc: reconnection without pinger" so it looks like the problem is being caused by that patch somehow.

Comment by Alexander Boyko [ 17/Oct/23 ]

I analyzed this issue. And It is really side effect of patch mgc: reconnection without pinger.
The situation is next
MGC reconnect to MGS and cancels all local locks. And when MGS sends glimpse ast, client replies with -22, EINVAL because client don't have a lock. And later MGS treat this reply wrong, so mgs_barrier_freeze ends with -22.

client
00010000:00010000:4.0:1693550583.414234:0:2655926:0:(ldlm_lockd.c:2456:ldlm_callback_handler()) callback on lock 0x28f467194556273 - lock disappeared
00010000:00010000:4.0:1693550583.416981:0:2655926:0:(ldlm_lockd.c:2361:ldlm_callback_errmsg()) @@@ Operate with invalid parameter, NID=12345-0@lo lock=0x28f46719455636f: rc = 0  req@0000000079795088 x1775816258196096/t0(0) o106->b6ac8162-74c3-4f9e-abd8-ddbd41354f83@0@lo:279/0 lens 400/224 e 0 to 0 dl 1693550594 ref 1 fl Interpret:/0/0 rc -22/-22 job:'' uid:4294967295 gid:4294967295
server
20000000:00000001:5.0:1693550583.411102:0:2670065:0:(mgs_barrier.c:323:mgs_barrier_freeze()) Process entered
20000000:00000001:5.0:1693550583.411149:0:2670065:0:(mgs_barrier.c:128:mgs_barrier_glimpse_lock()) Process entered
20000000:00010000:5.0:1693550583.411348:0:2670065:0:(mgs_barrier.c:194:mgs_barrier_glimpse_lock()) ###  glimpse
20000000:00000001:5.0:1693550583.418575:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered
20000000:00000001:5.0:1693550583.418607:0:2670065:0:(mgs_barrier.c:71:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
20000000:00000001:5.0:1693550583.419278:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered
20000000:00000001:5.0:1693550583.419291:0:2670065:0:(mgs_barrier.c:71:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
20000000:00000001:4.0:1693550583.423784:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered
20000000:00000001:4.0:1693550583.423808:0:2670065:0:(mgs_barrier.c:88:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=0 : 0 : 0x0)
20000000:00000001:4.0:1693550583.465200:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered
20000000:00000001:4.0:1693550583.465212:0:2670065:0:(mgs_barrier.c:88:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=0 : 0 : 0x0)
20000000:00000001:4.0:1693550583.466645:0:2670065:0:(mgs_barrier.c:206:mgs_barrier_glimpse_lock()) Process leaving via out (rc=0 : 0 : 0x0)
20000000:00000001:4.0:1693550583.466677:0:2670065:0:(mgs_barrier.c:456:mgs_barrier_freeze()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)

I'll fix it at main patch LU-17142 .

Generated at Sat Feb 10 03:33:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.