Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/37ce626a-22a6-11ea-bb75-52540065bddc

      test_30b failed with the following error:

      touch skn
      

      After correctly setting skn flavor for all components, the client cannot create a simple file:

      touch: setting times of '/mnt/lustre/d30b.sanity-sec/f0': Input/output error
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-sec test_30b - touch skn
      sanity-sec test_30b - mkdir skn

      Attachments

        Activity

          [LU-13116] sanity-sec test_30b: touch skn
          pjones Peter Jones added a comment -

          Looks like everything has landed for 2.14

          pjones Peter Jones added a comment - Looks like everything has landed for 2.14

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37151/
          Subject: LU-13116 tests: properly clean keyring in sanity-sec test_30
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: e628b7cd06ce0092b3003d27a2f45e6a0b7574d9

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37151/ Subject: LU-13116 tests: properly clean keyring in sanity-sec test_30 Project: fs/lustre-release Branch: master Current Patch Set: Commit: e628b7cd06ce0092b3003d27a2f45e6a0b7574d9

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37251/
          Subject: LU-13116 ptlrpc: return error for conn with NULL export
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: c54dc02d51e76146f3f273196f47204a2fb48345

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37251/ Subject: LU-13116 ptlrpc: return error for conn with NULL export Project: fs/lustre-release Branch: master Current Patch Set: Commit: c54dc02d51e76146f3f273196f47204a2fb48345

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37344/
          Subject: LU-13116 mgc: do not lose sptlrpc config lock
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 26e8f1137b82b26dca0219f58543ea6edd9a30ae

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37344/ Subject: LU-13116 mgc: do not lose sptlrpc config lock Project: fs/lustre-release Branch: master Current Patch Set: Commit: 26e8f1137b82b26dca0219f58543ea6edd9a30ae

          Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37344
          Subject: LU-13116 mgc: do not lose sptlrpc config lock
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: e6b9eac6a8ad0ad26e628e66b20cbdef34acf637

          gerrit Gerrit Updater added a comment - Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37344 Subject: LU-13116 mgc: do not lose sptlrpc config lock Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e6b9eac6a8ad0ad26e628e66b20cbdef34acf637
          sebastien Sebastien Buisson added a comment - - edited

          I think I finally managed to find the reason why sanity-sec test_30b is failing. In fact it happens when recovery-small is launched before sanity-sec, and the problem is due in particular to recovery-small test_23 which restarts the combined MGS/MDS.

          There is a bug in the Lustre code that makes the MGC that is used by the MDS server, lose the reference on the sptlrpc config log when the colocated MGS is stopped. As a consequence, even once the MGS is restarted, the MDS does not hold any reference on the sptlrpc config log anymore, and does not get any subsequent update of this llog. It explains why the newly set srpc flavor in sanity-sec test_30b is not taken into account by the MDS server, which refuses requests from the clients that do apply the new flavor.

          It seems this bug was introduced by patch 0ad54d5977 ("LU-11185 mgc: config lock leak").

          sebastien Sebastien Buisson added a comment - - edited I think I finally managed to find the reason why sanity-sec test_30b is failing. In fact it happens when recovery-small is launched before sanity-sec, and the problem is due in particular to recovery-small test_23 which restarts the combined MGS/MDS. There is a bug in the Lustre code that makes the MGC that is used by the MDS server, lose the reference on the sptlrpc config log when the colocated MGS is stopped. As a consequence, even once the MGS is restarted, the MDS does not hold any reference on the sptlrpc config log anymore, and does not get any subsequent update of this llog. It explains why the newly set srpc flavor in sanity-sec test_30b is not taken into account by the MDS server, which refuses requests from the clients that do apply the new flavor. It seems this bug was introduced by patch 0ad54d5977 (" LU-11185 mgc: config lock leak").

          Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37251
          Subject: LU-13116 ptlrpc: return error for conn with NULL export
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 983af71915a0652c4eb4f615db0f6d4ae7e64a67

          gerrit Gerrit Updater added a comment - Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/37251 Subject: LU-13116 ptlrpc: return error for conn with NULL export Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 983af71915a0652c4eb4f615db0f6d4ae7e64a67

          Sometimes sanity-sec test_30 triggers an assertion failed:

          [ 2768.772950] LustreError: 2528:0:(gss_keyring.c:838:gss_sec_lookup_ctx_kr()) failed request key: -126
          [ 2768.773485] LustreError: 2528:0:(sec.c:451:sptlrpc_req_get_ctx()) req ffff96a3cdc90900: fail to get context
          [ 2768.773984] LustreError: 2528:0:(lmv_obd.c:308:lmv_connect_mdc()) target seb-MDT0000_UUID connect error -111
          [ 2768.774048] LustreError: 2563:0:(import.c:597:import_select_connection()) ASSERTION( dlmexp != ((void *)0) ) failed:
          [ 2768.774050] LustreError: 2563:0:(import.c:597:import_select_connection()) LBUG
          [ 2768.774052] Pid: 2563, comm: lctl 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018
          [ 2768.774052] Call Trace:
          [ 2768.774071]  [<ffffffffc06498ac>] libcfs_call_trace+0x8c/0xc0 [libcfs]
          [ 2768.774086]  [<ffffffffc064995c>] lbug_with_loc+0x4c/0xa0 [libcfs]
          [ 2768.774146]  [<ffffffffc0ac161c>] ptlrpc_connect_import_locked+0x128c/0x1330 [ptlrpc]
          [ 2768.774168]  [<ffffffffc0ac16e1>] ptlrpc_connect_import+0x21/0x30 [ptlrpc]
          [ 2768.774188]  [<ffffffffc0a9a498>] ptlrpc_recover_import+0x318/0x790 [ptlrpc]
          [ 2768.774211]  [<ffffffffc0ac902c>] lprocfs_import_seq_write+0x3ec/0x520 [ptlrpc]
          [ 2768.774217]  [<ffffffffc0d78919>] mdc_import_seq_write+0x19/0x20 [mdc]
          [ 2768.774222]  [<ffffffff86a94d40>] proc_reg_write+0x40/0x80
          [ 2768.774226]  [<ffffffff86a1f180>] vfs_write+0xc0/0x1f0
          [ 2768.774227]  [<ffffffff86a1ffaf>] SyS_write+0x7f/0xf0
          [ 2768.774232]  [<ffffffff86f2579b>] system_call_fastpath+0x22/0x27
          [ 2768.774239]  [<ffffffffffffffff>] 0xffffffffffffffff
          [ 2768.774240] Kernel panic - not syncing: LBUG
          [ 2768.774242] CPU: 4 PID: 2563 Comm: lctl Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.11.6.el7.x86_64 #1
          [ 2768.774243] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
          [ 2768.774243] Call Trace:
          [ 2768.774247]  [<ffffffff86f135d4>] dump_stack+0x19/0x1b
          [ 2768.774250]  [<ffffffff86f0d11f>] panic+0xe8/0x21f
          [ 2768.774254]  [<ffffffffc06499ab>] lbug_with_loc+0x9b/0xa0 [libcfs]
          [ 2768.774274]  [<ffffffffc0ac161c>] ptlrpc_connect_import_locked+0x128c/0x1330 [ptlrpc]
          [ 2768.774292]  [<ffffffffc0a610bf>] ? import_set_conn+0x31f/0x7a0 [ptlrpc]
          [ 2768.774313]  [<ffffffffc0ac16e1>] ptlrpc_connect_import+0x21/0x30 [ptlrpc]
          [ 2768.774332]  [<ffffffffc0a9a498>] ptlrpc_recover_import+0x318/0x790 [ptlrpc]
          [ 2768.774353]  [<ffffffffc0ac902c>] lprocfs_import_seq_write+0x3ec/0x520 [ptlrpc]
          [ 2768.774356]  [<ffffffffc0d78919>] mdc_import_seq_write+0x19/0x20 [mdc]
          [ 2768.774358]  [<ffffffff86a94d40>] proc_reg_write+0x40/0x80
          [ 2768.774359]  [<ffffffff86a1f180>] vfs_write+0xc0/0x1f0
          [ 2768.774361]  [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146
          [ 2768.774363]  [<ffffffff86a1ffaf>] SyS_write+0x7f/0xf0
          [ 2768.774364]  [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146
          [ 2768.774366]  [<ffffffff86f2579b>] system_call_fastpath+0x22/0x27
          [ 2768.774368]  [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146
          

          This assertion should be replaced with test on dlmexp, and simply return an error in case it is NULL.

          sebastien Sebastien Buisson added a comment - Sometimes sanity-sec test_30 triggers an assertion failed: [ 2768.772950] LustreError: 2528:0:(gss_keyring.c:838:gss_sec_lookup_ctx_kr()) failed request key: -126 [ 2768.773485] LustreError: 2528:0:(sec.c:451:sptlrpc_req_get_ctx()) req ffff96a3cdc90900: fail to get context [ 2768.773984] LustreError: 2528:0:(lmv_obd.c:308:lmv_connect_mdc()) target seb-MDT0000_UUID connect error -111 [ 2768.774048] LustreError: 2563:0:(import.c:597:import_select_connection()) ASSERTION( dlmexp != ((void *)0) ) failed: [ 2768.774050] LustreError: 2563:0:(import.c:597:import_select_connection()) LBUG [ 2768.774052] Pid: 2563, comm: lctl 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 [ 2768.774052] Call Trace: [ 2768.774071] [<ffffffffc06498ac>] libcfs_call_trace+0x8c/0xc0 [libcfs] [ 2768.774086] [<ffffffffc064995c>] lbug_with_loc+0x4c/0xa0 [libcfs] [ 2768.774146] [<ffffffffc0ac161c>] ptlrpc_connect_import_locked+0x128c/0x1330 [ptlrpc] [ 2768.774168] [<ffffffffc0ac16e1>] ptlrpc_connect_import+0x21/0x30 [ptlrpc] [ 2768.774188] [<ffffffffc0a9a498>] ptlrpc_recover_import+0x318/0x790 [ptlrpc] [ 2768.774211] [<ffffffffc0ac902c>] lprocfs_import_seq_write+0x3ec/0x520 [ptlrpc] [ 2768.774217] [<ffffffffc0d78919>] mdc_import_seq_write+0x19/0x20 [mdc] [ 2768.774222] [<ffffffff86a94d40>] proc_reg_write+0x40/0x80 [ 2768.774226] [<ffffffff86a1f180>] vfs_write+0xc0/0x1f0 [ 2768.774227] [<ffffffff86a1ffaf>] SyS_write+0x7f/0xf0 [ 2768.774232] [<ffffffff86f2579b>] system_call_fastpath+0x22/0x27 [ 2768.774239] [<ffffffffffffffff>] 0xffffffffffffffff [ 2768.774240] Kernel panic - not syncing: LBUG [ 2768.774242] CPU: 4 PID: 2563 Comm: lctl Kdump: loaded Tainted: G OE ------------ 3.10.0-862.11.6.el7.x86_64 #1 [ 2768.774243] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [ 2768.774243] Call Trace: [ 2768.774247] [<ffffffff86f135d4>] dump_stack+0x19/0x1b [ 2768.774250] [<ffffffff86f0d11f>] panic+0xe8/0x21f [ 2768.774254] [<ffffffffc06499ab>] lbug_with_loc+0x9b/0xa0 [libcfs] [ 2768.774274] [<ffffffffc0ac161c>] ptlrpc_connect_import_locked+0x128c/0x1330 [ptlrpc] [ 2768.774292] [<ffffffffc0a610bf>] ? import_set_conn+0x31f/0x7a0 [ptlrpc] [ 2768.774313] [<ffffffffc0ac16e1>] ptlrpc_connect_import+0x21/0x30 [ptlrpc] [ 2768.774332] [<ffffffffc0a9a498>] ptlrpc_recover_import+0x318/0x790 [ptlrpc] [ 2768.774353] [<ffffffffc0ac902c>] lprocfs_import_seq_write+0x3ec/0x520 [ptlrpc] [ 2768.774356] [<ffffffffc0d78919>] mdc_import_seq_write+0x19/0x20 [mdc] [ 2768.774358] [<ffffffff86a94d40>] proc_reg_write+0x40/0x80 [ 2768.774359] [<ffffffff86a1f180>] vfs_write+0xc0/0x1f0 [ 2768.774361] [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146 [ 2768.774363] [<ffffffff86a1ffaf>] SyS_write+0x7f/0xf0 [ 2768.774364] [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146 [ 2768.774366] [<ffffffff86f2579b>] system_call_fastpath+0x22/0x27 [ 2768.774368] [<ffffffff86f256e1>] ? system_call_after_swapgs+0xae/0x146 This assertion should be replaced with test on dlmexp, and simply return an error in case it is NULL.

          The patch https://review.whamcloud.com/37151 is certainly helpful to cleanup sanity-sec test suite, but after more investigations the reason for test_30b failures is the following: on server side, the new srpc flavor is not taken into account. So clients send requests with the newly enforced flavor, but these requests get dropped on server side:

          Jan 08 18:34:12 onyx-65vm9.onyx.whamcloud.com kernel: Lustre: 6597:0:(sec.c:2105:sptlrpc_target_export_check()) exp ffff8d9e622ce000(lustre-MDT0000): req ffff8d9e614af600 (1|0|0|1|0|0) with unauthorized flavor 22, expect 20222|0(+0)|0(+0)
          Jan 08 18:34:12 onyx-65vm9.onyx.whamcloud.com kernel: LustreError: 6597:0:(service.c:2154:ptlrpc_server_handle_req_in()) @@@ DROPPING req with illegal security flavor  req@ffff8d9e614af600 x1655160908866112/t0(0) o400->aefb6a7b-4cfd-4@10.2.5.160@tcp:0/0 lens 224/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1 job:'kworker/0:3.0'
          

          (above, flavor 22 is skn, 20222 is ski)

          And, after a bunch of these, clients finally get evicted, and cannot reconnect because of incompatible srpc flavor:

          Jan 08 18:34:58 onyx-65vm9.onyx.whamcloud.com kernel: Lustre: lustre-MDT0000: haven't heard from client b7fdc252-f2b2-4 (at 10.2.5.159@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff8d9e5ab80c00, cur 1578508498 expire 1578508468 last 1578508451
          Jan 08 18:35:01 onyx-65vm9.onyx.whamcloud.com kernel: LustreError: 2578:0:(tgt_handler.c:929:tgt_connect_check_sptlrpc()) lustre-MDT0000: unauthorized rpc flavor 22 from 10.2.5.159@tcp, expect 20222
          

          test_30b can fail on either mkdir or touch, it should be the same root cause.

          I did not manage to reproduce the problem on my own test system, so I have triggered a number of test sessions in Maloo with various parameters, in the hope one of them would hit the bug. I still did not figure out how a server could possibly not take the new srpc flavor into account.

          sebastien Sebastien Buisson added a comment - The patch https://review.whamcloud.com/37151 is certainly helpful to cleanup sanity-sec test suite, but after more investigations the reason for test_30b failures is the following: on server side, the new srpc flavor is not taken into account. So clients send requests with the newly enforced flavor, but these requests get dropped on server side: Jan 08 18:34:12 onyx-65vm9.onyx.whamcloud.com kernel: Lustre: 6597:0:(sec.c:2105:sptlrpc_target_export_check()) exp ffff8d9e622ce000(lustre-MDT0000): req ffff8d9e614af600 (1|0|0|1|0|0) with unauthorized flavor 22, expect 20222|0(+0)|0(+0) Jan 08 18:34:12 onyx-65vm9.onyx.whamcloud.com kernel: LustreError: 6597:0:(service.c:2154:ptlrpc_server_handle_req_in()) @@@ DROPPING req with illegal security flavor req@ffff8d9e614af600 x1655160908866112/t0(0) o400->aefb6a7b-4cfd-4@10.2.5.160@tcp:0/0 lens 224/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1 job:'kworker/0:3.0' (above, flavor 22 is skn, 20222 is ski) And, after a bunch of these, clients finally get evicted, and cannot reconnect because of incompatible srpc flavor: Jan 08 18:34:58 onyx-65vm9.onyx.whamcloud.com kernel: Lustre: lustre-MDT0000: haven't heard from client b7fdc252-f2b2-4 (at 10.2.5.159@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp ffff8d9e5ab80c00, cur 1578508498 expire 1578508468 last 1578508451 Jan 08 18:35:01 onyx-65vm9.onyx.whamcloud.com kernel: LustreError: 2578:0:(tgt_handler.c:929:tgt_connect_check_sptlrpc()) lustre-MDT0000: unauthorized rpc flavor 22 from 10.2.5.159@tcp, expect 20222 test_30b can fail on either mkdir or touch, it should be the same root cause. I did not manage to reproduce the problem on my own test system, so I have triggered a number of test sessions in Maloo with various parameters, in the hope one of them would hit the bug. I still did not figure out how a server could possibly not take the new srpc flavor into account.

          I'm not sure if this is the same issue or not, but, we see sanity-sec test 30b fail with 'mkdir skn' whne SSK enabled. This test started to fail with the mkdir error starting on 16 DEC 2019. Please see the following for more information
          https://testing.whamcloud.com/test_sets/5e22a5a0-2041-11ea-b0f4-52540065bddc
          https://testing.whamcloud.com/test_sets/43fcab28-31dd-11ea-971c-52540065bddc

          If these are separate issues, I can open a new ticket for the mkdir error.

          jamesanunez James Nunez (Inactive) added a comment - I'm not sure if this is the same issue or not, but, we see sanity-sec test 30b fail with 'mkdir skn' whne SSK enabled. This test started to fail with the mkdir error starting on 16 DEC 2019. Please see the following for more information https://testing.whamcloud.com/test_sets/5e22a5a0-2041-11ea-b0f4-52540065bddc https://testing.whamcloud.com/test_sets/43fcab28-31dd-11ea-971c-52540065bddc If these are separate issues, I can open a new ticket for the mkdir error.

          People

            sebastien Sebastien Buisson
            maloo Maloo
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: