Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17147

sanity-lsnapshot test_1b: Fail to create lss_1b_0

Details

    • Bug
    • Resolution: Not a Bug
    • Minor
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Alexander Boyko <c17825@cray.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/6df69fab-a0cd-49b7-a6dd-426a9c267d8a

      test_1b failed with the following error:

      Create lss_1b_0
      CMD: onyx-60vm8 /usr/sbin/lctl snapshot_create -F lustre -n lss_1b_0
      onyx-60vm8: Fail to freeze barrier for lustre: Invalid argument
      onyx-60vm8: Can't create the snapshot lss_1b_0
      pdsh@onyx-60vm1: onyx-60vm8: ssh exited with exit code 22
      CMD: onyx-60vm8 cat /var/log/lsnapshot.log
      Mon Sep 25 18:02:19 2023 (784346:jt_snapshot_create:1532:lustre:ssh): Create snapshot lss_0_0 successfully with comment <(null)>, barrier , timeout <30>
      Mon Sep 25 18:02:21 2023 (784875:jt_snapshot_create:1532:lustre:ssh): Create snapshot lss_0_1 successfully with comment <(null)>, barrier , timeout <-1>
      Mon Sep 25 18:02:23 2023 (785404:jt_snapshot_create:1532:lustre:ssh): Create snapshot lss_0_2 successfully with comment , barrier , timeout <30>
      Mon Sep 25 18:02:25 2023 (785933:jt_snapshot_create:1532:lustre:ssh): Create snapshot lss_0_3 successfully with comment , barrier , timeout <30>
      Mon Sep 25 18:02:26 2023 (786462:jt_snapshot_create:1525:lustre:ssh): Can't create snapshot lss_0_0 with comment <(null)> barrier , timeout <30>: -17
      Mon Sep 25 18:02:32 2023 (787140:jt_snapshot_create:1532:lustre:ssh): Create snapshot lss_1a_0 successfully with comment <(null)>, barrier , timeout <30>
      Mon Sep 25 18:02:35 2023 (787677:jt_snapshot_mount:2520:lustre:ssh): The snapshot lss_1a_0 is mounted
      Mon Sep 25 18:02:38 2023 (788595:jt_snapshot_umount:2675:lustre:ssh): the snapshot lss_1a_0 have been umounted
      Mon Sep 25 18:02:44 2023 (789655:snapshot_create:1419:lustre:ssh): Can't set barrier within 30 seconds on lustre: rc = -22
      Mon Sep 25 18:02:44 2023 (789655:jt_snapshot_create:1525:lustre:ssh): Can't create snapshot lss_1b_0 with comment <(null)> barrier , timeout <30>: -22
       sanity-lsnapshot test_1b: @@@@@@ FAIL: (1) Fail to create lss_1b_0 
      
      [14771.317798] Lustre: DEBUG MARKER: == sanity-lsnapshot test 1b: mount snapshot without original filesystem mounted ========================================================== 18:02:43 (1695664963)
      [14771.487370] Lustre: DEBUG MARKER: /usr/sbin/lctl snapshot_create -F lustre -n lss_1b_0
      [14772.106064] Lustre: DEBUG MARKER: cat /var/log/lsnapshot.log
      [14772.479236] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-lsnapshot test_1b: @@@@@@ FAIL: \(1\) Fail to create lss_1b_0 
      [14772.681271] Lustre: DEBUG MARKER: sanity-lsnapshot test_1b: @@@@@@ FAIL: (1) Fail to create lss_1b_0
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/98866 - 4.18.0-477.21.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/98866 - 4.18.0-477.21.1.el8_lustre.x86_64

      <<Please provide additional information about the failure here>>

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lsnapshot test_1b - (1) Fail to create lss_1b_0

      Attachments

        Issue Links

          Activity

            [LU-17147] sanity-lsnapshot test_1b: Fail to create lss_1b_0

            I analyzed this issue. And It is really side effect of patch mgc: reconnection without pinger.
            The situation is next
            MGC reconnect to MGS and cancels all local locks. And when MGS sends glimpse ast, client replies with -22, EINVAL because client don't have a lock. And later MGS treat this reply wrong, so mgs_barrier_freeze ends with -22.

            client
            00010000:00010000:4.0:1693550583.414234:0:2655926:0:(ldlm_lockd.c:2456:ldlm_callback_handler()) callback on lock 0x28f467194556273 - lock disappeared
            00010000:00010000:4.0:1693550583.416981:0:2655926:0:(ldlm_lockd.c:2361:ldlm_callback_errmsg()) @@@ Operate with invalid parameter, NID=12345-0@lo lock=0x28f46719455636f: rc = 0  req@0000000079795088 x1775816258196096/t0(0) o106->b6ac8162-74c3-4f9e-abd8-ddbd41354f83@0@lo:279/0 lens 400/224 e 0 to 0 dl 1693550594 ref 1 fl Interpret:/0/0 rc -22/-22 job:'' uid:4294967295 gid:4294967295
            server
            20000000:00000001:5.0:1693550583.411102:0:2670065:0:(mgs_barrier.c:323:mgs_barrier_freeze()) Process entered
            20000000:00000001:5.0:1693550583.411149:0:2670065:0:(mgs_barrier.c:128:mgs_barrier_glimpse_lock()) Process entered
            20000000:00010000:5.0:1693550583.411348:0:2670065:0:(mgs_barrier.c:194:mgs_barrier_glimpse_lock()) ###  glimpse
            20000000:00000001:5.0:1693550583.418575:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered
            20000000:00000001:5.0:1693550583.418607:0:2670065:0:(mgs_barrier.c:71:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
            20000000:00000001:5.0:1693550583.419278:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered
            20000000:00000001:5.0:1693550583.419291:0:2670065:0:(mgs_barrier.c:71:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
            20000000:00000001:4.0:1693550583.423784:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered
            20000000:00000001:4.0:1693550583.423808:0:2670065:0:(mgs_barrier.c:88:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=0 : 0 : 0x0)
            20000000:00000001:4.0:1693550583.465200:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered
            20000000:00000001:4.0:1693550583.465212:0:2670065:0:(mgs_barrier.c:88:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=0 : 0 : 0x0)
            20000000:00000001:4.0:1693550583.466645:0:2670065:0:(mgs_barrier.c:206:mgs_barrier_glimpse_lock()) Process leaving via out (rc=0 : 0 : 0x0)
            20000000:00000001:4.0:1693550583.466677:0:2670065:0:(mgs_barrier.c:456:mgs_barrier_freeze()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
            

            I'll fix it at main patch LU-17142 .

            aboyko Alexander Boyko added a comment - I analyzed this issue. And It is really side effect of patch mgc: reconnection without pinger. The situation is next MGC reconnect to MGS and cancels all local locks. And when MGS sends glimpse ast, client replies with -22, EINVAL because client don't have a lock. And later MGS treat this reply wrong, so mgs_barrier_freeze ends with -22. client 00010000:00010000:4.0:1693550583.414234:0:2655926:0:(ldlm_lockd.c:2456:ldlm_callback_handler()) callback on lock 0x28f467194556273 - lock disappeared 00010000:00010000:4.0:1693550583.416981:0:2655926:0:(ldlm_lockd.c:2361:ldlm_callback_errmsg()) @@@ Operate with invalid parameter, NID=12345-0@lo lock=0x28f46719455636f: rc = 0 req@0000000079795088 x1775816258196096/t0(0) o106->b6ac8162-74c3-4f9e-abd8-ddbd41354f83@0@lo:279/0 lens 400/224 e 0 to 0 dl 1693550594 ref 1 fl Interpret:/0/0 rc -22/-22 job:'' uid:4294967295 gid:4294967295 server 20000000:00000001:5.0:1693550583.411102:0:2670065:0:(mgs_barrier.c:323:mgs_barrier_freeze()) Process entered 20000000:00000001:5.0:1693550583.411149:0:2670065:0:(mgs_barrier.c:128:mgs_barrier_glimpse_lock()) Process entered 20000000:00010000:5.0:1693550583.411348:0:2670065:0:(mgs_barrier.c:194:mgs_barrier_glimpse_lock()) ### glimpse 20000000:00000001:5.0:1693550583.418575:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered 20000000:00000001:5.0:1693550583.418607:0:2670065:0:(mgs_barrier.c:71:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea) 20000000:00000001:5.0:1693550583.419278:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered 20000000:00000001:5.0:1693550583.419291:0:2670065:0:(mgs_barrier.c:71:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea) 20000000:00000001:4.0:1693550583.423784:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered 20000000:00000001:4.0:1693550583.423808:0:2670065:0:(mgs_barrier.c:88:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=0 : 0 : 0x0) 20000000:00000001:4.0:1693550583.465200:0:2670065:0:(mgs_barrier.c:62:mgs_barrier_gl_interpret_reply()) Process entered 20000000:00000001:4.0:1693550583.465212:0:2670065:0:(mgs_barrier.c:88:mgs_barrier_gl_interpret_reply()) Process leaving via out (rc=0 : 0 : 0x0) 20000000:00000001:4.0:1693550583.466645:0:2670065:0:(mgs_barrier.c:206:mgs_barrier_glimpse_lock()) Process leaving via out (rc=0 : 0 : 0x0) 20000000:00000001:4.0:1693550583.466677:0:2670065:0:(mgs_barrier.c:456:mgs_barrier_freeze()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea) I'll fix it at main patch LU-17142 .

            This has only failed recently (and twice) with patch https://review.whamcloud.com/52498 "LU-17142 mgc: reconnection without pinger" so it looks like the problem is being caused by that patch somehow.

            adilger Andreas Dilger added a comment - This has only failed recently (and twice) with patch https://review.whamcloud.com/52498 " LU-17142 mgc: reconnection without pinger " so it looks like the problem is being caused by that patch somehow.

            People

              wc-triage WC Triage
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: