
sanity test_133g: timeout during MDT mount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run:

      Info required for matching: sanity 133g
      Info required for matching: sanity 133h
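
      For reference, the affected subtests can be run in isolation with the test framework's ONLY filter. This is a sketch only; the exact invocation and repository path depend on the local test configuration.

      # Sketch only: ONLY= is the standard subtest filter for the Lustre test
      # scripts; the path below assumes a lustre-release checkout.
      ONLY="133g 133h" bash lustre/tests/sanity.sh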


          Activity

            [LU-10401] sanity test_133g: timeout during MDT mount
            ys Yang Sheng added a comment -

            From the stack trace, this issue appears to be a duplicate of LU-11761.


            gerrit Gerrit Updater added a comment -

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38567
            Subject: LU-10401 test: add parameter to print entry type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1cd538d7db07365a3edd5ef9af5c260d22a673a4

            ys Yang Sheng added a comment -

            Hi, Andreas,

            I have been working to find the root cause. The stranger thing is that a failover was triggered. It looks like some variable was exported incorrectly?

            cln..Failing mds1 on trevis-19vm4
            CMD: trevis-19vm4 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
            Stopping /mnt/lustre-mds1 (opts:) on trevis-19vm4
            CMD: trevis-19vm4 umount -d /mnt/lustre-mds1
            CMD: trevis-19vm4 lsmod | grep lnet > /dev/null &&
            lctl dl | grep ' ST ' || true
            

            Thanks,
            YangSheng
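
            One way to check that suspicion (a sketch only; which variable, if any, was mis-exported is an assumption) is to dump anything failover-related from the test environment on the node before the run:

            # Sketch only: show any failover-related variables exported in the
            # test environment, to see whether one of them could have steered
            # the cleanup path into fail()/facet_failover.
            env | grep -i fail || echo "nothing failover-related exported"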


            adilger Andreas Dilger added a comment -

            YS, can you please make a patch to increase the timeout, or speed up recovery, so this test will pass consistently.
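
            One possible shape for that, as a sketch only (not the actual patch): poll the MDT recovery status until it reports COMPLETE instead of relying on a fixed wait. do_facet and the recovery_status parameter are standard in the test framework; the helper name and the 300s cap here are assumptions.

            # Sketch only, not the actual fix: wait for MDT recovery to finish
            # (or give up after a cap) before the test proceeds.
            wait_mdt_recovery() {
                local facet=${1:-mds1} limit=${2:-300}
                local end=$((SECONDS + limit))
                while (( SECONDS < end )); do
                    do_facet "$facet" "lctl get_param -n mdt.*.recovery_status" 2>/dev/null |
                        grep -q 'status: COMPLETE' && return 0
                    sleep 2
                done
                echo "recovery did not complete within ${limit}s" >&2
                return 1
            }
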
            ys Yang Sheng added a comment -

            From the client log:

            [10864.205961] ptlrpcd_rcv     S    0  7446      2 0x80000080
            [10864.206882] Call Trace:
            [10864.207350]  ? __schedule+0x253/0x830
            [10864.208002]  schedule+0x28/0x70
            [10864.208566]  schedule_timeout+0x16b/0x390
            [10864.209283]  ? __next_timer_interrupt+0xc0/0xc0
            [10864.210087]  ptlrpc_set_wait+0x4ba/0x6e0 [ptlrpc]
            [10864.210904]  ? finish_wait+0x80/0x80
            [10864.211568]  ptlrpc_queue_wait+0x7e/0x210 [ptlrpc]
            [10864.212407]  fld_client_rpc+0x277/0x580 [fld]
            [10864.213238]  ? cfs_trace_unlock_tcd+0x2e/0x80 [libcfs]
            [10864.214125]  fld_client_lookup+0x254/0x470 [fld]
            [10864.214952]  lmv_fld_lookup+0x8c/0x420 [lmv]
            [10864.215712]  lmv_lock_match+0x7c/0x3f0 [lmv]
            [10864.216621]  ll_have_md_lock+0x169/0x3b0 [lustre]
            [10864.217447]  ? vsnprintf+0x101/0x520
            [10864.218092]  ll_md_blocking_ast+0x60d/0xbd0 [lustre]
            [10864.218976]  ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
            [10864.219868]  ? ldlm_lock_remove_from_lru_nolock+0x38/0xf0 [ptlrpc]
            [10864.220929]  ldlm_lock_cancel+0x55/0x1c0 [ptlrpc]
            [10864.221772]  ldlm_cli_cancel_list_local+0x8f/0x300 [ptlrpc]
            [10864.222740]  ldlm_replay_locks+0x662/0x850 [ptlrpc]
            [10864.223611]  ptlrpc_import_recovery_state_machine+0x868/0x970 [ptlrpc]
            [10864.224735]  ptlrpc_connect_interpret+0x11f0/0x22d0 [ptlrpc]
            [10864.225717]  ? after_reply+0x8de/0xd30 [ptlrpc]
            [10864.226519]  ptlrpc_check_set+0x50c/0x21f0 [ptlrpc]
            [10864.227366]  ? schedule_timeout+0x173/0x390
            [10864.228119]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
            [10864.228953]  ptlrpcd+0x3d0/0x4c0 [ptlrpc]
            [10864.229660]  ? finish_wait+0x80/0x80
            [10864.230329]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
            [10864.231162]  kthread+0x112/0x130
            [10864.231747]  ? kthread_flush_work_fn+0x10/0x10
            [10864.232516]  ret_from_fork+0x35/0x40
            

            Looks like recovery still has not finished.

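            A quick way to confirm that from the client side (a sketch; the field name is from the usual import dump, and the exact output varies by version):

            # Sketch: if the MDC import is still replaying locks rather than
            # fully connected, its state will not yet be FULL.
            lctl get_param mdc.*.import | grep 'state:'
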

            adilger Andreas Dilger added a comment -

            YS, can you please look into why this test is failing. It only has problems with CentOS 7.8 and RHEL 8.1, not the regular CentOS 7.7 runs that are part of review testing.

            adilger Andreas Dilger added a comment -

            Some examples of recent failures, which are specific to CentOS 7.8 and RHEL 8.0:
            https://testing.whamcloud.com/test_sets/a85aeb21-b186-4e60-990c-cf89bb26d855
            https://testing.whamcloud.com/test_sets/75f7b5a2-05d8-4e7b-8ca7-6253da6a7add

            adilger Andreas Dilger added a comment -

            It's not clear if this patch will solve the failures, but it might. I'd like to keep this issue open until I can get a passing result from e2fsprogs. Unfortunately, there is no way to test e2fsprogs in autotest against a specific patch, so this had to be landed before I could retry with e2fsprogs.

            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38444/
            Subject: LU-10401 tests: fix error from 'tr -d='
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 52b5f4a5c3fc942f2b6aef9dbed780bd2c2a6798
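
            For context on what a fix with that subject addresses (an illustration of the quoting pitfall only, not the actual patch hunk): with GNU tr, an unquoted '=' glued onto -d is parsed as part of the option cluster rather than as the character set to delete.

            # Illustration only; the parameter name is made up for the example.
            echo "llite.lustre-ffff.stats=" | tr -d=      # fails: '=' is treated as an (invalid) option
            echo "llite.lustre-ffff.stats=" | tr -d '='   # deletes the '=' as intended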

            adilger Andreas Dilger added a comment -

            I haven't been able to get RHEL 7.8 to pass at this point.

            People

              Assignee: ys Yang Sheng
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
