replay-single: test 135 Error: 'import is not in REPLAY_LOCKS state'

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • Lustre 2.17.0
    • Lustre 2.16.0

    Description

      This issue was created by maloo for James Simmons <uja.ornl@gmail.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c43792eb-a655-4405-8720-f1b31aee6d88

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/104489 - 4.18.0-513.5.1.el8_9.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/104489 - 4.18.0-513.18.1.el8_lustre.x86_64

      Started lustre-OST0000
      CMD: trevis-41vm1.trevis.whamcloud.com PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/opt/iozone/bin:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config TESTLOG_PREFIX=/autotest/autotest-2/2024-04-29/lustre-reviews_review-zfs_104489_2_4b20e343-e437-4963-b984-b19ca92bce9e//replay-single TESTNAME=test_135 bash rpc.sh wait_import_state_mount REPLAY_LOCKS osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
      End of sync
      trevis-41vm1.trevis.whamcloud.com: executing wait_import_state_mount REPLAY_LOCKS osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
      CMD: trevis-41vm1.trevis.whamcloud.com lctl get_param -n at_max
      rpc test_135: @@@@@@ FAIL: can't put import for osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid into REPLAY_LOCKS state after 1475 sec, have IDLE
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:7011:error()
      = /usr/lib64/lustre/tests/test-framework.sh:8433:_wait_import_state()
      = /usr/lib64/lustre/tests/test-framework.sh:8455:wait_import_state()
      = /usr/lib64/lustre/tests/test-framework.sh:8465:wait_import_state_mount()
      = rpc.sh:20:main()
      CMD: trevis-41vm1.trevis.whamcloud.com,trevis-41vm2,trevis-41vm3,trevis-41vm6 /usr/sbin/lctl dk > /autotest/autotest-2/2024-04-29/lustre-reviews_review-zfs_104489_2_4b20e343-e437-4963-b984-b19ca92bce9e//rpc.test_135.debug_log.$(hostname -s).1714424452.log;
      dmesg > /autotest/autotest-2/2024-04-29/lustre-reviews_review-zfs_104489_2_4b20e343-e437-4963-b984-b19ca92bce9e//rpc.test_135.dmesg.$(hostname -s).1714424452.log
      trevis-41vm1.trevis.whamcloud.com: Dumping lctl log to /autotest/autotest-2/2024-04-29/lustre-reviews_review-zfs_104489_2_4b20e343-e437-4963-b984-b19ca92bce9e//rpc.test_135.*.1714424452.log
      replay-single test_135: @@@@@@ FAIL: import is not in REPLAY_LOCKS state

          Activity

            [LU-17792] replay-single: test 135 Error: 'import is not in REPLAY_LOCKS state'

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56935/
            Subject: LU-17792 tests: fix replay-single 135 "not in REPLAY_LOCKS"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 568f1e807ce1902ae043e4e8c4625399fa7cd8db

            eaujames Etienne Aujames added a comment -

            I think I have found the issue. The ZFS replay barrier sets ZFS to RO (read-only), and this prevents object allocation from selecting the OST:

            /* check whether a target is available for new object allocation */
            static inline int lod_statfs_check(struct lu_tgt_descs *ltd,
                                               struct lu_tgt_desc *tgt)
            {
                    struct obd_statfs *sfs = &tgt->ltd_statfs;

                    if (sfs->os_state & OS_STATFS_ENOSPC ||
                        (sfs->os_state & OS_STATFS_ENOINO &&
                         /* OST allocation allowed while precreated objects available */
                         (ltd->ltd_is_mdt || sfs->os_fprecreated == 0)))
                            return -ENOSPC;

                    /* If the OST is readonly then we can't allocate objects there */
                    if (sfs->os_state & OS_STATFS_READONLY)    /* <-------- this check */
                            return -EROFS;

                    /* object creation is skipped on the OST with max_create_count=0 */
                    if (!ltd->ltd_is_mdt && sfs->os_state & OS_STATFS_NOCREATE)
                            return -ENOBUFS;

                    return 0;
            }

            gerrit Gerrit Updater added a comment - edited

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56935
            Subject: LU-17792 tests: fix replay-single 135 "not in REPLAY_LOCKS"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 16e27bad8f43321f753697065492bea87c237705

            eaujames Etienne Aujames added a comment - edited

            adilger, I tried to debug this some months ago, but I was not able to reproduce it.

            "import is not in REPLAY_LOCKS state" seems to occur more often on ZFS, but I ran the test more than 100 times on ZFS without any failure.
            So these failures are more complex and may have multiple causes. They may be caused by something in the environment and/or by state inherited from the previous tests.

            The debug logs from the failing cases seem to indicate that the client never enters the REPLAY_LOCKS state.
            If we are not able to find a reliable way to get the server into "REPLAY_LOCKS", we can just skip the test when this state is not reached.

            I will push a debug patch to get more debug information.
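
            The "skip the test if we don't enter this state" idea could look roughly like the bash sketch below. It is only an illustration under stated assumptions: poll_for_state and run_or_skip are hypothetical helper names, not functions from test-framework.sh, and the command that reads the import state is supplied by the caller.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: poll for an import state and skip, rather than fail,
# when the state is never observed. These helper names are illustrative and
# do not come from test-framework.sh.

# poll_for_state EXPECTED MAX_TRIES INTERVAL CMD...
# Runs CMD up to MAX_TRIES times until its output equals EXPECTED.
# Returns 0 on a match and 1 if the state was never seen.
poll_for_state() {
	local expected=$1 max_tries=$2 interval=$3
	shift 3
	local i state=""

	for ((i = 0; i < max_tries; i++)); do
		state=$("$@")
		if [[ $state == "$expected" ]]; then
			return 0
		fi
		sleep "$interval"
	done

	echo "never saw $expected, last state: $state" >&2
	return 1
}

# run_or_skip EXPECTED MAX_TRIES INTERVAL CMD...
# Prints RUN when the expected state was reached and SKIP otherwise, so the
# caller can decide whether to continue with the rest of the test.
run_or_skip() {
	if poll_for_state "$@"; then
		echo "RUN"
	else
		echo "SKIP"
	fi
}
```

            A caller would pass its own state-reading command, for example a wrapper around the lctl get_param call seen in the log above; how that output is parsed into a bare state name is left out here. Skipping hides the failure rather than explaining it, so this only makes sense once the root cause is understood.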

            adilger Andreas Dilger added a comment -

            eaujames, could you please take a look at this subtest failure? It looks like it was introduced with your patch.

            nangelinas Nikitas Angelinas added a comment - +1 on master: https://testing.whamcloud.com/test_sets/45681a0c-49fa-4a60-8ed4-a1be0badaddd
            yujian Jian Yu added a comment - Lustre 2.16.0 RC5: https://testing.whamcloud.com/test_sets/eb1d5bb8-8d6b-4973-94ed-61d0acf36b17
            emoly.liu Emoly Liu added a comment - +1 on master: https://testing.whamcloud.com/test_sets/381fd2ee-2b72-4f87-b902-88df4762d3c7
            yujian Jian Yu added a comment -

            The failure occurred consistently in failover-part-1 and failover-zfs-part-1 test sessions.

            yujian Jian Yu added a comment - +1 on master branch: https://testing.whamcloud.com/test_sets/70917f0b-ca74-4291-ac26-511cb8de1371

            People

              eaujames Etienne Aujames
              maloo Maloo