
sanity test_133g: timeout during MDT mount

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • Lustre 2.12.0
    • None
    • 3

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run:

      Info required for matching: sanity 133g
      Info required for matching: sanity 133h

          Activity

            [LU-10401] sanity test_133g: timeout during MDT mount
            ys Yang Sheng added a comment -

            From client log:

            [10864.205961] ptlrpcd_rcv     S    0  7446      2 0x80000080
            [10864.206882] Call Trace:
            [10864.207350]  ? __schedule+0x253/0x830
            [10864.208002]  schedule+0x28/0x70
            [10864.208566]  schedule_timeout+0x16b/0x390
            [10864.209283]  ? __next_timer_interrupt+0xc0/0xc0
            [10864.210087]  ptlrpc_set_wait+0x4ba/0x6e0 [ptlrpc]
            [10864.210904]  ? finish_wait+0x80/0x80
            [10864.211568]  ptlrpc_queue_wait+0x7e/0x210 [ptlrpc]
            [10864.212407]  fld_client_rpc+0x277/0x580 [fld]
            [10864.213238]  ? cfs_trace_unlock_tcd+0x2e/0x80 [libcfs]
            [10864.214125]  fld_client_lookup+0x254/0x470 [fld]
            [10864.214952]  lmv_fld_lookup+0x8c/0x420 [lmv]
            [10864.215712]  lmv_lock_match+0x7c/0x3f0 [lmv]
            [10864.216621]  ll_have_md_lock+0x169/0x3b0 [lustre]
            [10864.217447]  ? vsnprintf+0x101/0x520
            [10864.218092]  ll_md_blocking_ast+0x60d/0xbd0 [lustre]
            [10864.218976]  ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
            [10864.219868]  ? ldlm_lock_remove_from_lru_nolock+0x38/0xf0 [ptlrpc]
            [10864.220929]  ldlm_lock_cancel+0x55/0x1c0 [ptlrpc]
            [10864.221772]  ldlm_cli_cancel_list_local+0x8f/0x300 [ptlrpc]
            [10864.222740]  ldlm_replay_locks+0x662/0x850 [ptlrpc]
            [10864.223611]  ptlrpc_import_recovery_state_machine+0x868/0x970 [ptlrpc]
            [10864.224735]  ptlrpc_connect_interpret+0x11f0/0x22d0 [ptlrpc]
            [10864.225717]  ? after_reply+0x8de/0xd30 [ptlrpc]
            [10864.226519]  ptlrpc_check_set+0x50c/0x21f0 [ptlrpc]
            [10864.227366]  ? schedule_timeout+0x173/0x390
            [10864.228119]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
            [10864.228953]  ptlrpcd+0x3d0/0x4c0 [ptlrpc]
            [10864.229660]  ? finish_wait+0x80/0x80
            [10864.230329]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
            [10864.231162]  kthread+0x112/0x130
            [10864.231747]  ? kthread_flush_work_fn+0x10/0x10
            [10864.232516]  ret_from_fork+0x35/0x40
            

            It looks like recovery has still not finished.


            adilger Andreas Dilger added a comment -

            YS, can you please look into why this test is failing? It only has problems with CentOS 7.8 and RHEL 8.1, not the regular CentOS 7.7 runs that are part of review testing.

            adilger Andreas Dilger added a comment -

            Some examples of recent failures, which are specific to CentOS 7.8 and RHEL 8.0:

            https://testing.whamcloud.com/test_sets/a85aeb21-b186-4e60-990c-cf89bb26d855
            https://testing.whamcloud.com/test_sets/75f7b5a2-05d8-4e7b-8ca7-6253da6a7add

            adilger Andreas Dilger added a comment -

            It's not clear if this patch will solve the failures, but it might. I'd like to keep this issue open until I can get a passing result from e2fsprogs. Unfortunately, there is no way to test e2fsprogs in autotest against a specific patch, so this had to be landed before I could retry with e2fsprogs.

            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38444/
            Subject: LU-10401 tests: fix error from 'tr -d='
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 52b5f4a5c3fc942f2b6aef9dbed780bd2c2a6798

            adilger Andreas Dilger added a comment -

            I haven't been able to get RHEL7.8 to pass at this point.

            adilger Andreas Dilger added a comment -

            There are error messages printed during the test run:

            CMD: trevis-64vm4 /usr/sbin/lctl list_param -R '*' | grep '=' |
            				tr -d= | egrep -v 'force_lbug|changelog_mask' |
            				xargs badarea_io
            trevis-64vm4: tr: invalid option -- '='
            trevis-64vm4: Try 'tr --help' for more information.

            It isn't clear whether this error is causing the test to fail or is spurious, but sanity test_133g is failing repeatedly on RHEL7.8.
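
The `tr: invalid option -- '='` message above is consistent with how GNU `tr` parses `-d=`: the `=` is attached to `-d`, so it is read as a second short option rather than as the set of characters to delete. The following is a minimal sketch of that behavior and of one way to avoid it, assuming GNU coreutils `tr`. The helper name `strip_equals` and the sample parameter names are hypothetical, and this is not necessarily the exact change made in the merged patch.

```shell
#!/bin/sh
# Broken form, as seen in the test log (shown as a comment only):
#   ... | tr -d= | ...
# GNU tr reads "-d=" as the two short options -d and -=, and reports
# "invalid option -- '='" instead of deleting '=' characters.

# Hypothetical helper: delete every '=' from stdin.
# Passing '=' as a separate operand keeps it out of option parsing.
strip_equals() {
    tr -d '='
}

# Sample input standing in for "lctl list_param" output (made-up names).
printf '%s\n' 'osc.demo.max_dirty_mb=' 'mdc.demo.max_rpcs_in_flight=' |
    strip_equals
```

With the separate operand, each name reaches the later pipeline stages without its trailing `=`.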

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38444
            Subject: LU-10401 tests: fix error from 'tr -d='
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b18389f3f14d2c88268ec5da15a76295cda30c67

            yujian Jian Yu added a comment - +1 on b2_12: https://testing.whamcloud.com/test_sets/98320413-d7e1-4cb1-9c78-20b4d126575f
            hornc Chris Horn added a comment - +1 on master: https://testing.whamcloud.com/test_sets/1235bfca-16f4-11ea-98f1-52540065bddc

            People

              ys Yang Sheng
              maloo Maloo