Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Severity: 3

    Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8bc8339c-402c-46d0-8b05-420e0f7ead5a

      test_66 failed with the following error:

      Timeout occurred after 166 minutes, last suite running was recovery-small
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/100300 - 4.18.0-477.21.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/100300 - 4.18.0-477.21.1.el8_lustre.x86_64

      Test output ends with:

      CMD: onyx-49vm6 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.evict_client 93213b02-0ca9-4e1e-8437-34d3c201a0a9
      /usr/lib64/lustre/tests/test-framework.sh: line 3718: /mnt/lustre/recon: Input/output error
      CMD: onyx-49vm1.onyx.whamcloud.com,onyx-49vm2 /usr/bin/lfs df /mnt/lustre; uname -n >> /mnt/lustre/recon
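
      The write that fails here is the framework's reconnect probe, which keeps appending to /mnt/lustre/recon until the client has re-established its connection. A minimal sketch of that loop, inferred from the CMD lines above (the helper name and retry interval are assumptions, not the verbatim test-framework.sh code):

      # Hedged sketch of the reconnect probe implied by the output above.
      client_reconnect_probe() {
              local mnt=${MOUNT:-/mnt/lustre}

              lfs df $mnt > /dev/null     # trigger a reconnect attempt
              uname -n >> $mnt/recon      # fails with EIO until reconnected
      }

      # In this failure the probe never succeeds, so the loop spins until
      # the 166-minute session watchdog kills the run:
      until client_reconnect_probe; do
              sleep 1
      done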
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-small test_66 - Timeout occurred after 166 minutes, last suite running was recovery-small

    Attachments

    Issue Links

    Activity

    [LU-17286] recovery-small test_66 timeout

            adilger Andreas Dilger added a comment - Patch included in 2.15.60.

            sebastien Sebastien Buisson added a comment - Issue fixed with landing of patch "LU-17317 gss: do not continue using expired reverse context" https://review.whamcloud.com/c/fs/lustre-release/+/53375.

            eaujames Etienne Aujames added a comment - I hit the same type of issue on LU-17228.

            adilger Andreas Dilger added a comment - It looks like this subtest started failing on 2023-11-13 (2.15.58.101, master only) in runs of several LU-17173 patches, but those patches haven't landed yet.
            https://testing.whamcloud.com/search?horizon=2332800&status%5B%5D=TIMEOUT&test_set_script_id=f36cabd0-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=e7e8f4f4-3a66-11e4-b82a-5254006e85c2&source=sub_tests#redirect

            A fair number of patches were landed on 2023-11-08 and tagged 2.15.59.0, so it isn't clear which patch may have introduced the problem.

            6521c313f7  New tag 2.15.59
            6a6e4ee20f LU-17184 mgc: remove damaged local configs
            21295b169b LU-17213 llite: check sdio before freeing it
            87ca3cffe6 LU-17259 lnet: kgnilnd_nl_get should return 0 on success
            5afe3b0538 LU-17258 socklnd: ensure connection type established upon race
            c4c9a8eea3 LU-17256 debian: allow building client dkms on arm64
            982eca73a9 LU-17000 coverity: Fix Logically dead code under liblnetconfig.c
            b83156304d LU-17254 lnet: Fix ofed detection with specific kernel version
            9ba375983d LU-17249 ptlrpc: protect scp_rqbd_idle list operations
            ee56161ea0 LU-17000 coverity: Fix Dereference after null under client.c
            edb968d04f LU-17242 debug: remove CFS_CHECK_STACK
            4b290188fb LU-16868 tests: skip conf-sanity/32 in interop
            37a50f74da LU-16796 lfsck: Change lfsck_assistant_object to use kref
            d7e3e7c104 LU-16796 target: Change struct job_stat to use kref
            a12c352a3d LU-17205 utils: add lctl get_param -H option
            8fa3532b1e LU-17204 lod: don't panic on short LOVEA
            f2f8b6deaf LU-17203 libcfs: ignore remaining items
            1759ae751a LU-16796 ofd: Change struct ofd_seq to use refcount_t
            e420e92ac9 LU-17196 tests: sanity-lnet test_310 MR support
            6aede12548 LU-16518 rsync: fix new clang error in lustre_rsync.c
            36b14a23a6 LU-17207 lnet: race b/w monitor thr stop and discovery push
            1b694ad04f LU-16896 flr: resync could mess mirror content
            d1fadf8e8a LU-17132 kernel: update RHEL 8.8 [4.18.0-477.27.1.el8_8]
            67e0d9e40a LU-17191 osc: only call xa_insert for new entries
            ee0e9447e7 LU-17115 quota: fix race of acquiring locks in qmt
            ecea24d843 LU-17071 o2iblnd: IBLND_REJECT_EARLY condition causes Oops
            05c97b1096 LU-17232 build: fix ext4-misc for el7.6 server
            24d515367f LU-9859 libcfs: migrate libcfs_mem.c to lnet/lib-mem.c
            b0cc96a1ff LU-17131 ldiskfs: el9.2 encdata and filename-encode
            57ac32a223 LU-16097 quota: release preacquired quota when over limits
            6fbffd9c09 LU-14361 statahead: add tunable for fname pattern statahead
            753e058b4c LU-4974 lod: Change pool_desc to "[lod|lov]_pool_desc"
            cb5f92c0e3 LU-10391 ksocklnd: use ksocknal_protocol v4 for IPv6
            02b22df643 LU-17235 o2iblnd: adding alias ib interface causes crash
            6ad9ef1fec LU-17225 utils: check_iam print records
            b5fde4d6c0 LU-17197 obdclass: preserve fairness when waiting for rpc slot
            68254c484a LU-10391 lnet: handle discovery with Netlink
            4512347d6c LU-16356 hsm: add running ref to the coordinator
            a772e90243 LU-16032 osd: move unlink of large objects to separate thread
            2c97684db9 LU-17181 misc: don't block reclaim threads
            a8e66b899a LU-17103 lnet: use workqueue for lnd ping buffer updates
            

            adilger Andreas Dilger added a comment - It looks like this issue is related to GSS: the client is deliberately evicted from the MDS, gets an RPC error from the OSS, and then never makes further progress.
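
            The sequence can be condensed from the DEBUG MARKER lines in the logs below. A hedged reconstruction of what test_66 does (do_facet is the usual test-framework helper, and $UUID stands for the client UUID seen in the logs; the exact test body is not quoted here, so treat this as an approximation):

            # MDS drops one LDLM enqueue reply (fail_loc 0x157, one-shot bit set)
            do_facet mds1 "lctl set_param fail_loc=0x80000157"
            stat /mnt/lustre/f66.recovery-small &     # reply dropped, enqueue re-sent
            do_facet mds1 "lctl set_param fail_loc=0"
            # evict the client by UUID while the re-sent enqueue is outstanding
            do_facet mds1 "lctl set_param -n mdt.lustre-MDT0000.evict_client $UUID"
            wait                                      # stat fails: client was evicted
            client_reconnect                          # under GSS this never completes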

            MDS logs show:

            [ 4532.190289] Lustre: DEBUG MARKER: == recovery-small test 66: lock enqueue re-send vs client eviction === 05:45:08 (1701150308)
            [ 4532.458517] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000157
            [ 4532.670053] Lustre: *** cfs_fail_loc=157, val=2147483648***
            [ 4532.671470] LustreError: 14380:0:(ldlm_lib.c:3278:target_send_reply_msg()) @@@ dropping reply  req@00000000e6ef81e4 x1783780844763008/t0(0) o101->6f45c8eb-3f53-49b5-bd8e-541944a4c347@10.240.41.232@tcp:219/0 lens 576/688 e 0 to 0 dl 1701150364 ref 1 fl Interpret:/200/0 rc 0/0 job:'stat.0' uid:0 gid:0
            [ 4534.620423] Lustre: lustre-MDT0000: Client 6f45c8eb-3f53-49b5-bd8e-541944a4c347 (at 10.240.41.232@tcp) reconnecting
            [ 4534.622463] Lustre: Skipped 1 previous similar message
            [ 4534.787142] Lustre: DEBUG MARKER: lctl set_param fail_loc=0
            [ 4536.790311] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.evict_client 6f45c8eb-3f53-49b5-bd8e-541944a4c347
            [ 4536.954163] Lustre: 138089:0:(genops.c:1623:obd_export_evict_by_uuid()) lustre-MDT0000: evicting 6f45c8eb-3f53-49b5-bd8e-541944a4c347 at adminstrative request
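
            (Note: val=2147483648 in the cfs_fail_loc line is 0x80000000, the one-shot CFS_FAIL_ONCE bit, so fail_loc=0x80000157 drops exactly one enqueue reply and then clears itself; that matches the client reconnecting about two seconds later.)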
            

            OSS logs show:

            [ 4532.043858] Lustre: DEBUG MARKER: == recovery-small test 66: lock enqueue re-send vs client eviction === 05:45:08 (1701150308)
            [ 4534.268230] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x80000136
            [ 4538.317046] Lustre: 19301:0:(sec_gss.c:686:gss_cli_ctx_handle_err_notify()) req x1783781103120832/t0, ctx 00000000be17c5f3 idx 0x13da74db5b288b54(0->c): reverseserver respond (00080000/00000000)
            [ 4538.320271] Lustre: 19301:0:(sec_gss.c:686:gss_cli_ctx_handle_err_notify()) Skipped 1 previous similar message
            [ 4538.322145] LustreError: 19301:0:(client.c:1491:after_reply()) @@@ unwrap reply failed: rc = -22  req@0000000093c1e5d9 x1783781103120832/t0(0) o105->lustre-OST0000@10.240.41.232@tcp:15/16 lens 392/224 e 0 to 0 dl 1701150331 ref 1 fl Rpc:RQU/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295
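
            The rc = -22 here is -EINVAL from the GSS unwrap, and the refused context handle, idx 0x13da74db5b288b54, is the same one the client rejects with "Invalid gss ctx idx" in its log below: the eviction apparently invalidated the reverse context while one side kept using it, which is the behavior the LU-17317 patch in the resolution above addresses.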
            

            Client logs show (boot clock is offset by -87.25s based on "test 66" line):

            [ 4444.948121] Lustre: DEBUG MARKER: == recovery-small test 66: lock enqueue re-send vs client eviction === 05:45:08 (1701150308)
            [ 4445.418200] Lustre: DEBUG MARKER: stat /mnt/lustre/f66.recovery-small
            [ 4449.721569] LustreError: 167-0: lustre-MDT0000-mdc-ffff94079bcca000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            [ 4449.903814] Lustre: DEBUG MARKER: /usr/bin/lfs df /mnt/lustre; uname -n >> /mnt/lustre/recon
            [ 4451.221178] Lustre: 127393:0:(gss_svc_upcall.c:1620:gss_svc_upcall_get_ctx()) Invalid gss ctx idx 0x13da74db5b288b54 from 10.240.41.234@tcp
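
            For reference, the -87.25s clock offset quoted above falls directly out of the two "test 66" marker timestamps:

            4532.190289 (MDS) - 4444.948121 (client) = 87.242168 s ~= 87.25 s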
            

    People

      Assignee: wc-triage WC Triage
      Reporter: maloo Maloo
      Votes: 0
      Watchers: 5

    Dates

      Created:
      Updated:
      Resolved: