Lustre / LU-10401

sanity test_133g: timeout during MDT mount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.0
    • Component/s: None
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run:

      Info required for matching: sanity 133g
      Info required for matching: sanity 133h

    Activity

            ys Yang Sheng added a comment -

            It is my fault. Since awk gets its input from stdin, FILENAME is '-'. But I would like to know why this test case exists; it looks like it only checks whether the get_param output ends with a '\n' character. Does that make sense?

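            For illustration, a minimal sketch (not taken from sanity.sh) of the FILENAME behaviour described above: when awk (GNU awk here) reads from a pipe, FILENAME is '-', while passing the path as an argument keeps the real name available for error messages. The path below is only an example.

            # awk reading from stdin: no real file name is available
            printf 'x' | awk 'END { print "checked:", FILENAME }'
            # -> checked: -

            # awk reading the file directly: FILENAME holds the real path
            awk 'END { print "checked:", FILENAME }' /etc/hostname
            # -> checked: /etc/hostname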

            adilger Andreas Dilger added a comment -

            After the latest patch landed, I'm seeing a couple of places where sanity test_133h is failing:
            https://testing.whamcloud.com/test_sets/0a1bd601-7de7-4031-974e-bc138ca14637
            https://testing.whamcloud.com/test_sets/9238d945-2814-403b-8ab4-eb96075b4cd4
            https://testing.whamcloud.com/test_sets/1073e4a6-5e88-410a-8c80-f0cb15585635
            https://testing.whamcloud.com/test_sets/0f75133d-b81d-483b-bfc8-2b90699c9fb9

            files do not end with newline: -
            

            The reported filename is always '-', so I suspect something is wrong with how the test was modified: either it is reporting an error incorrectly, or it is not printing the filename properly when a real error is hit. Two of the above failures were with review-ldiskfs-arm and two were on e2fsprogs tiny sessions. It isn't yet clear whether these are permanent or intermittent failures, though I suspect they are permanent for those configurations, as I can't see how this test could be racy.

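            A hedged sketch (a shell helper written for illustration, not the actual sanity.sh code) of how such a check can report the real path: read the file by name rather than through a pipe, and rely on command substitution stripping a trailing newline, so a non-empty result means the last byte is not '\n'. The path below is only an example.

            check_trailing_newline() {
                    local f=$1
                    # $(tail -c1) is empty when the last byte is '\n'
                    [ -s "$f" ] && [ -n "$(tail -c1 "$f")" ] &&
                            echo "file does not end with newline: $f"
                    return 0
            }
            check_trailing_newline /etc/hostname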

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38567/
            Subject: LU-10401 tests: add -F so list_param prints entry type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1c54733894f81e854363fbd2d49c141842f73ae4

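            For context, the -F flag makes list_param append a marker showing each entry's type, similar to ls -F ('/' for a directory, '@' for a symlink, '=' for a writable file). A hedged usage sketch; the exact parameter names depend on the filesystem and targets of the test system.

            # list client-side MDC parameters with their entry types marked
            lctl list_param -F mdc.*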

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38679
            Subject: LU-10401 test: debug patch
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: a0a337f38866706829cd0a45b7ec63d97fc4406e

            ys Yang Sheng added a comment -

            Looks like it still exists. Investigating further.

            pjones Peter Jones added a comment -

            Hmm, so did the fix for LU-11761 (marked as included in 2.12.3) not work then?

            ys Yang Sheng added a comment -

            So, from the stack trace, this issue appears to be a duplicate of LU-11761.


            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38567
            Subject: LU-10401 test: add parameter to print entry type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1cd538d7db07365a3edd5ef9af5c260d22a673a4

            ys Yang Sheng added a comment -

            Hi Andreas,

            I have been working to find the root cause. The stranger thing is that a failover was triggered. It looks like some variable was exported incorrectly?

            cln..Failing mds1 on trevis-19vm4
            CMD: trevis-19vm4 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
            Stopping /mnt/lustre-mds1 (opts:) on trevis-19vm4
            CMD: trevis-19vm4 umount -d /mnt/lustre-mds1
            CMD: trevis-19vm4 lsmod | grep lnet > /dev/null &&
            lctl dl | grep ' ST ' || true
            

            Thanks,
            YangSheng

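            One possible way to check that hypothesis (not part of the test suite; the variable names below are only examples and may differ between test-framework versions) is to dump the failover-related variables the run inherited before sanity.sh starts:

            # show any failover/selection variables exported into the environment
            env | grep -E '^(FAILURE_MODE|ONLY|EXCEPT)=' || echo "none exported"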

            adilger Andreas Dilger added a comment -

            YS, can you please make a patch to increase the timeout, or speed up recovery, so this test passes consistently?

            ys Yang Sheng added a comment -

            From client log:

            [10864.205961] ptlrpcd_rcv     S    0  7446      2 0x80000080
            [10864.206882] Call Trace:
            [10864.207350]  ? __schedule+0x253/0x830
            [10864.208002]  schedule+0x28/0x70
            [10864.208566]  schedule_timeout+0x16b/0x390
            [10864.209283]  ? __next_timer_interrupt+0xc0/0xc0
            [10864.210087]  ptlrpc_set_wait+0x4ba/0x6e0 [ptlrpc]
            [10864.210904]  ? finish_wait+0x80/0x80
            [10864.211568]  ptlrpc_queue_wait+0x7e/0x210 [ptlrpc]
            [10864.212407]  fld_client_rpc+0x277/0x580 [fld]
            [10864.213238]  ? cfs_trace_unlock_tcd+0x2e/0x80 [libcfs]
            [10864.214125]  fld_client_lookup+0x254/0x470 [fld]
            [10864.214952]  lmv_fld_lookup+0x8c/0x420 [lmv]
            [10864.215712]  lmv_lock_match+0x7c/0x3f0 [lmv]
            [10864.216621]  ll_have_md_lock+0x169/0x3b0 [lustre]
            [10864.217447]  ? vsnprintf+0x101/0x520
            [10864.218092]  ll_md_blocking_ast+0x60d/0xbd0 [lustre]
            [10864.218976]  ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
            [10864.219868]  ? ldlm_lock_remove_from_lru_nolock+0x38/0xf0 [ptlrpc]
            [10864.220929]  ldlm_lock_cancel+0x55/0x1c0 [ptlrpc]
            [10864.221772]  ldlm_cli_cancel_list_local+0x8f/0x300 [ptlrpc]
            [10864.222740]  ldlm_replay_locks+0x662/0x850 [ptlrpc]
            [10864.223611]  ptlrpc_import_recovery_state_machine+0x868/0x970 [ptlrpc]
            [10864.224735]  ptlrpc_connect_interpret+0x11f0/0x22d0 [ptlrpc]
            [10864.225717]  ? after_reply+0x8de/0xd30 [ptlrpc]
            [10864.226519]  ptlrpc_check_set+0x50c/0x21f0 [ptlrpc]
            [10864.227366]  ? schedule_timeout+0x173/0x390
            [10864.228119]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
            [10864.228953]  ptlrpcd+0x3d0/0x4c0 [ptlrpc]
            [10864.229660]  ? finish_wait+0x80/0x80
            [10864.230329]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
            [10864.231162]  kthread+0x112/0x130
            [10864.231747]  ? kthread_flush_work_fn+0x10/0x10
            [10864.232516]  ret_from_fork+0x35/0x40
            

            Looks like recovery still has not finished.

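            A hedged way to confirm that (these parameters exist in lctl, but the exact output varies by version): check the recovery status on the MDS and the import state on the client.

            # on the MDS: has MDT recovery completed?
            lctl get_param mdt.*.recovery_status
            # on the client: current state of the MDC import
            lctl get_param mdc.*.import | grep state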

            People

              Assignee: Yang Sheng
              Reporter: Maloo
              Votes: 0
              Watchers: 9
