LU-10401: sanity test_133g: timeout during MDT mount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run:

      Info required for matching: sanity 133g
      Info required for matching: sanity 133h
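
      To reproduce only the failing subtests locally, the sanity suite can be restricted with the ONLY variable; a minimal sketch, assuming a standard Lustre test setup with the usual test configuration (e.g. cfg/local.sh) already in place:

      # Run just sanity subtests 133g and 133h; ONLY= is the standard
      # test-framework mechanism for limiting a suite to specific subtests.
      cd lustre/tests
      ONLY="133g 133h" bash sanity.sh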

    Activity

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38567/
            Subject: LU-10401 tests: add -F so list_param prints entry type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1c54733894f81e854363fbd2d49c141842f73ae4
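
            For context, the -F option makes "lctl list_param" append a type marker to each entry, in the style of ls -F, so a caller can tell parameter directories apart from plain parameter files. A minimal illustration, assuming a mounted Lustre client; the parameter pattern below is only an example:

            # With -F, each entry gets a trailing character indicating its type
            # (e.g. '/' for a parameter directory); without it, all names look alike.
            lctl list_param -F 'mdc.*'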

            gerrit Gerrit Updater added a comment -

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38679
            Subject: LU-10401 test: debug patch
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: a0a337f38866706829cd0a45b7ec63d97fc4406e
            ys Yang Sheng added a comment -

            Looks like the problem still exists. Investigating further.
            pjones Peter Jones added a comment -

            Hmm, so did the fix for LU-11761 (marked as included in 2.12.3) not work then?
            ys Yang Sheng added a comment -

            Judging from the stack trace, this issue appears to be a duplicate of LU-11761.

            gerrit Gerrit Updater added a comment -

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38567
            Subject: LU-10401 test: add parameter to print entry type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1cd538d7db07365a3edd5ef9af5c260d22a673a4
            ys Yang Sheng added a comment -

            Hi, Andreas,

            I have been working to find the root cause. The stranger thing is that a failover was triggered. It looks like some variable was exported incorrectly?

            cln..Failing mds1 on trevis-19vm4
            CMD: trevis-19vm4 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
            Stopping /mnt/lustre-mds1 (opts:) on trevis-19vm4
            CMD: trevis-19vm4 umount -d /mnt/lustre-mds1
            CMD: trevis-19vm4 lsmod | grep lnet > /dev/null &&
            lctl dl | grep ' ST ' || true
            

            Thanks,
            YangSheng

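            A quick way to check whether an unexpectedly exported variable is steering the framework into its failover path is to dump the failover-related environment on the test node. A minimal sketch; the variable names below are assumptions based on common test-framework.sh conventions:

            # Print the variables the failover/cleanup path is likely to consult;
            # the name list is an assumption, adjust to the local configuration.
            env | grep -E '^(FAILURE_MODE|MDSCOUNT|mds[0-9]*_HOST|MDSDEV[0-9]*)=' || true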

            adilger Andreas Dilger added a comment -

            YS, can you please make a patch to increase the timeout, or speed up recovery, so this test will pass consistently.
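
            One way to act on the "speed up recovery or wait longer" suggestion is for the test to wait explicitly until MDT recovery finishes before continuing. A rough sketch, assuming the usual mdt.*.recovery_status parameter is readable on the MDS node and reports a "status:" line:

            # Poll the MDT recovery status until it reports COMPLETE, or give up
            # after five minutes; path and output format are assumed from the
            # common recovery_status layout.
            deadline=$((SECONDS + 300))
            while [ "$SECONDS" -lt "$deadline" ]; do
                status=$(lctl get_param -n 'mdt.*.recovery_status' 2>/dev/null |
                         awk '/^status:/ { print $2; exit }')
                [ "$status" = "COMPLETE" ] && break
                sleep 5
            done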
            ys Yang Sheng added a comment -

            From client log:

            [10864.205961] ptlrpcd_rcv     S    0  7446      2 0x80000080
            [10864.206882] Call Trace:
            [10864.207350]  ? __schedule+0x253/0x830
            [10864.208002]  schedule+0x28/0x70
            [10864.208566]  schedule_timeout+0x16b/0x390
            [10864.209283]  ? __next_timer_interrupt+0xc0/0xc0
            [10864.210087]  ptlrpc_set_wait+0x4ba/0x6e0 [ptlrpc]
            [10864.210904]  ? finish_wait+0x80/0x80
            [10864.211568]  ptlrpc_queue_wait+0x7e/0x210 [ptlrpc]
            [10864.212407]  fld_client_rpc+0x277/0x580 [fld]
            [10864.213238]  ? cfs_trace_unlock_tcd+0x2e/0x80 [libcfs]
            [10864.214125]  fld_client_lookup+0x254/0x470 [fld]
            [10864.214952]  lmv_fld_lookup+0x8c/0x420 [lmv]
            [10864.215712]  lmv_lock_match+0x7c/0x3f0 [lmv]
            [10864.216621]  ll_have_md_lock+0x169/0x3b0 [lustre]
            [10864.217447]  ? vsnprintf+0x101/0x520
            [10864.218092]  ll_md_blocking_ast+0x60d/0xbd0 [lustre]
            [10864.218976]  ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
            [10864.219868]  ? ldlm_lock_remove_from_lru_nolock+0x38/0xf0 [ptlrpc]
            [10864.220929]  ldlm_lock_cancel+0x55/0x1c0 [ptlrpc]
            [10864.221772]  ldlm_cli_cancel_list_local+0x8f/0x300 [ptlrpc]
            [10864.222740]  ldlm_replay_locks+0x662/0x850 [ptlrpc]
            [10864.223611]  ptlrpc_import_recovery_state_machine+0x868/0x970 [ptlrpc]
            [10864.224735]  ptlrpc_connect_interpret+0x11f0/0x22d0 [ptlrpc]
            [10864.225717]  ? after_reply+0x8de/0xd30 [ptlrpc]
            [10864.226519]  ptlrpc_check_set+0x50c/0x21f0 [ptlrpc]
            [10864.227366]  ? schedule_timeout+0x173/0x390
            [10864.228119]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
            [10864.228953]  ptlrpcd+0x3d0/0x4c0 [ptlrpc]
            [10864.229660]  ? finish_wait+0x80/0x80
            [10864.230329]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
            [10864.231162]  kthread+0x112/0x130
            [10864.231747]  ? kthread_flush_work_fn+0x10/0x10
            [10864.232516]  ret_from_fork+0x35/0x40
            

            It looks like recovery has still not finished.

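            The trace shows the client stuck in lock replay (ldlm_replay_locks called from ptlrpc_import_recovery_state_machine), i.e. import recovery had not completed. A hedged way to confirm this from the client side is to inspect the MDC import state, which should read FULL once recovery is done; a minimal sketch, assuming the standard mdc.*.import parameter:

            # Anything other than "state: FULL" here means the import is still
            # connecting, recovering, or replaying.
            lctl get_param -n 'mdc.*.import' | grep 'state:'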

            adilger Andreas Dilger added a comment -

            YS, can you please look into why this test is failing? It only has problems with CentOS 7.8 and RHEL 8.1, not the regular CentOS 7.7 runs that are part of review testing.
            adilger Andreas Dilger added a comment -

            Some examples of recent failures, which are specific to CentOS 7.8 and RHEL 8.0:
            https://testing.whamcloud.com/test_sets/a85aeb21-b186-4e60-990c-cf89bb26d855
            https://testing.whamcloud.com/test_sets/75f7b5a2-05d8-4e7b-8ca7-6253da6a7add

            People

              Assignee: Yang Sheng
              Reporter: Maloo
              Votes: 0
              Watchers: 9
