Lustre / LU-8880

sanity test_1: fails w/DNE @@@ faked source MDT

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.9.0
    • Severity: 3

    Description

      The `/usr/bin/lfs setdirstripe -i1 -c2 /mnt/lustre/d1.sanity` command issued by sanity.sh test_1 hangs when mdt2mdt connections are secured with the krb5p security flavor. This test has failed with SSK flavors skpi and ski; other flavors have not yet been tested.

      == MDS2 logs ==

      [  341.920887] Lustre: DEBUG MARKER: == sanity test 1: mkdir; remkdir; rmdir ============================================================== 09:13:33 (1480515213)
      [  341.954824] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff8800d368ed00 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
      [  343.088324] Lustre: 2602:0:(sec_gss.c:2063:gss_svc_handle_init()) create svc ctx ffff8802051cc440: user from 172.18.0.18@tcp authenticated as mds
      [  343.088328] Lustre: 2602:0:(sec_gss.c:2063:gss_svc_handle_init()) Skipped 9 previous similar messages
      [  343.088346] Lustre: 2602:0:(sec_gss.c:371:gss_cli_ctx_uptodate()) server installed reverse ctx ffff8802051a6900 idx 0x35ebf0324b78e726, expiry 1480599087(+83873s)
      [  343.088348] Lustre: 2602:0:(sec_gss.c:371:gss_cli_ctx_uptodate()) Skipped 9 previous similar messages
      [  352.954000] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff880203a13300 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      [  363.953599] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff880204ecc600 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      [  374.953137] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff880205722850 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      [  382.074673] LNet: Service thread pid 3152 was inactive for 40.12s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [  382.074679] Pid: 3152, comm: mdt00_003
      [  382.074681] 
      Call Trace:
      [  382.074692]  [<ffffffff8163bd69>] schedule+0x29/0x70
      [  382.074697]  [<ffffffff816399c5>] schedule_timeout+0x175/0x2d0
      [  382.074704]  [<ffffffff8108bff0>] ? process_timeout+0x0/0x10
      [  382.074723]  [<ffffffffa0642ef1>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
      [  382.074777]  [<ffffffffa0970fa0>] ? ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      [  382.074820]  [<ffffffffa0971791>] ldlm_completion_ast+0x5a1/0x910 [ptlrpc]
      [  382.074860]  [<ffffffffa095e730>] ? ldlm_resource_add_lock+0x70/0x1b0 [ptlrpc]
      [  382.074867]  [<ffffffff810b8940>] ? default_wake_function+0x0/0x20
      [  382.074910]  [<ffffffffa0973448>] ldlm_cli_enqueue_fini+0x938/0xdb0 [ptlrpc]
      [  382.074952]  [<ffffffffa09763b2>] ldlm_cli_enqueue+0x6c2/0x810 [ptlrpc]
      [  382.074994]  [<ffffffffa09711f0>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
      [  382.075012]  [<ffffffffa0cfda80>] ? mdt_remote_blocking_ast+0x0/0x590 [mdt]
      [  382.075032]  [<ffffffffa0eafc82>] osp_md_object_lock+0x172/0x2e0 [osp]
      [  382.075054]  [<ffffffffa0e56c51>] ? mdd_la_get+0x81/0x1a0 [mdd]
      [  382.075077]  [<ffffffffa0decb20>] lod_object_lock+0xf0/0x950 [lod]
      [  382.075093]  [<ffffffffa06524be>] ? cfs_hash_bd_add_locked+0x1e/0x80 [libcfs]
      [  382.075105]  [<ffffffffa0652869>] ? cfs_hash_spin_unlock+0x9/0x10 [libcfs]
      [  382.075171]  [<ffffffffa077fdb7>] ? lu_object_find_try+0x207/0x2b0 [obdclass]
      [  382.075177]  [<ffffffffa0e5410b>] mdd_object_lock+0x3b/0xd0 [mdd]
      [  382.075186]  [<ffffffffa0d0ca72>] mdt_remote_object_lock+0x1e2/0x710 [mdt]
      [  382.075202]  [<ffffffffa077ff0c>] ? lu_object_find_at+0xac/0xe0 [obdclass]
      [  382.075228]  [<ffffffffa09a3bc7>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
      [  382.075237]  [<ffffffffa0d0dfc6>] mdt_object_lock_internal+0x166/0x300 [mdt]
      [  382.075244]  [<ffffffffa0d0e174>] mdt_object_lock+0x14/0x20 [mdt]
      [  382.075254]  [<ffffffffa0d246a1>] mdt_md_create+0x421/0xba0 [mdt]
      [  382.075268]  [<ffffffffa075d838>] ? lprocfs_counter_add+0x128/0x1a0 [obdclass]
      [  382.075285]  [<ffffffffa0793ef1>] ? lprocfs_job_stats_log+0xd1/0x630 [obdclass]
      [  382.075294]  [<ffffffffa0d24f8b>] mdt_reint_create+0x16b/0x350 [mdt]
      [  382.075302]  [<ffffffffa0d26490>] mdt_reint_rec+0x80/0x210 [mdt]
      [  382.075309]  [<ffffffffa0d08db2>] mdt_reint_internal+0x5b2/0x9b0 [mdt]
      [  382.075317]  [<ffffffffa0d141e7>] mdt_reint+0x67/0x140 [mdt]
      [  382.075351]  [<ffffffffa0a05225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [  382.075378]  [<ffffffffa09b11ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [  382.075404]  [<ffffffffa09aed68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [  382.075407]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [  382.075409]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [  382.075434]  [<ffffffffa09b5260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [  382.075458]  [<ffffffffa09b47c0>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
      [  382.075461]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [  382.075463]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
      [  382.075466]  [<ffffffff81646c98>] ret_from_fork+0x58/0x90
      [  382.075469]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
      
      [  382.075472] LustreError: dumping log to /tmp/lustre-log.1480515253.3152
      [  385.952974] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff880205375800 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      [  396.952564] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff880204d3db00 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      [  418.952006] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff8800d72c8050 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      [  418.952150] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) Skipped 1 previous similar message
      [  451.951069] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff8800d04f0f00 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      [  451.951177] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) Skipped 2 previous similar messages
      [  517.949644] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff8800d0514600 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      [  517.949761] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) Skipped 5 previous similar messages
      [  641.950918] Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 172.18.0.18@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [  641.950949] LustreError: 3152:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1480515213, 300s ago), entering recovery for lustre-MDT0000_UUID@172.18.0.18@tcp ns: lustre-MDT0000-osp-MDT0001 lock: ffff880211f20e00/0xd408ec1632ce7580 lrc: 4/0,1 mode: --/EX res: [0x200000007:0x1:0x0].0x0 bits 0x2 rrc: 1 type: IBT flags: 0x1000001000000 nid: local remote: 0xf8f327625a9c3943 expref: -99 pid: 3152 timeout: 0 lvb_type: 0
      [  641.951489] Lustre: lustre-MDT0000-osp-MDT0001: Connection restored to 172.18.0.18@tcp (at 172.18.0.18@tcp)
      [  641.951493] Lustre: Skipped 3 previous similar messages
      [  652.950923] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff8802051a8900 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      [  652.951025] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) Skipped 12 previous similar messages
      [  916.946544] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) @@@ faked source MDT  req@ffff8800d04d3c00 x1552432553273552/t0(0) o105-><?>@<?>:0/0 lens 336/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
      [  916.946651] LustreError: 2582:0:(sec.c:2010:sptlrpc_svc_check_from()) Skipped 23 previous similar messages
      [  936.436828] Lustre: 2596:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
        req@ffff8800d368ea00 x1552432529149200/t0(0) o36->c91cef1e-2de0-ca74-2459-2c580f0f0023@172.18.0.22@tcp:-1/-1 lens 672/568 e 24 to 0 dl 1480515813 ref 2 fl Interpret:/0/0 rc 0/0
      [  942.942067] Lustre: lustre-MDT0001: Client c91cef1e-2de0-ca74-2459-2c580f0f0023 (at 172.18.0.22@tcp) reconnecting
      [  942.942090] Lustre: lustre-MDT0001: Connection restored to c91cef1e-2de0-ca74-2459-2c580f0f0023 (at 172.18.0.22@tcp)
      

      The test was executed with:

      export SEC=krb5p; ./auster -f multinode -rsv -d /import/results/krb5p sanity --only 1
      

      Attachments

        Issue Links

          Activity

            [LU-8880] sanity test_1: fails w/DNE @@@ faked source MDT
            pjones Peter Jones added a comment -

            Landed for 2.10


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24236/
            Subject: LU-8880 gss: fix GSS support for DNE
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8b7a8056ef50c1a27e9ade1f0523beef3a5fe393

            gerrit Gerrit Updater added a comment -

            Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/24236
            Subject: LU-8880 gss: fix GSS support for DNE
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4728fb8df9bd85db7df65a648f000454b1a5b3d5

            di.wang Di Wang (Inactive) added a comment -

            yes, looks good to me. Thanks.

            sbuisson Sebastien Buisson (Inactive) added a comment - edited

            Hi,

            Your proposal to use the cl_sp_to field of client_obd is interesting. I have tested the following code for import_to_gss_svc(), and it works:

            static inline
            __u32 import_to_gss_svc(struct obd_import *imp)
            {
            	int cl_sp_to = 0xFF;
            
            	if (imp->imp_obd)
            		cl_sp_to = imp->imp_obd->u.cli.cl_sp_to;
            
            	switch (cl_sp_to) {
            	case LUSTRE_SP_MDT:
            		return LUSTRE_GSS_TGT_MDS;
            	case LUSTRE_SP_OST:
            		return LUSTRE_GSS_TGT_OSS;
            	case LUSTRE_SP_MGC:
            	case LUSTRE_SP_MGS:
            		return LUSTRE_GSS_TGT_MGS;
            	case LUSTRE_SP_CLI:
            	case LUSTRE_SP_ANY:
            	default:
            		return 0;
            	}
            }
            

            But the more sensitive question was about the real meaning of the gsc_usr_{mds,oss,root} entries. After some investigation, I think we could also use cl_sp_to in gss_svc_upcall_install_rvs_ctx(). I tested the following modification, and it avoids the setdirstripe hang:

            @@ -833,12 +836,18 @@ int gss_svc_upcall_install_rvs_ctx(struct obd_import *imp,
                     }
                     rsci.h.expiry_time = (time_t) ctx_expiry;
             
            -        if (strcmp(imp->imp_obd->obd_type->typ_name, LUSTRE_MDC_NAME) == 0)
            -                rsci.ctx.gsc_usr_mds = 1;
            -        else if (strcmp(imp->imp_obd->obd_type->typ_name, LUSTRE_OSC_NAME) == 0)
            -                rsci.ctx.gsc_usr_oss = 1;
            -        else
            -                rsci.ctx.gsc_usr_root = 1;
            +	switch (imp->imp_obd->u.cli.cl_sp_to) {
            +	case LUSTRE_SP_MDT:
            +		rsci.ctx.gsc_usr_mds = 1;
            +		break;
            +	case LUSTRE_SP_OST:
            +		rsci.ctx.gsc_usr_oss = 1;
            +		break;
            +	case LUSTRE_SP_CLI:
            +		rsci.ctx.gsc_usr_root = 1;
            +	default:
            +		break;
            +	}
             
                     rscp = rsc_update(&rsci, rscp);
                     if (rscp == NULL)
            

            What do you think?
            Thanks,
            Sebastien.

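            A minimal userspace sketch of the failure mode discussed above (illustrative type names, not Lustre code): the pre-patch lines removed in the diff key the reverse GSS context off the obd type name, but under DNE the MDT-to-MDT import is carried by an osp device rather than an mdc, so it falls through to the plain gsc_usr_root case and the remote MDT then rejects its requests with "faked source MDT".

            /*
             * Sketch only: mimic the pre-patch mapping from an import's obd
             * type name to a reverse-context flag.  A DNE MDT-to-MDT import
             * uses an "osp" device, so it matches neither "mdc" nor "osc"
             * and ends up with a plain root context.
             */
            #include <stdio.h>
            #include <string.h>

            static const char *rvs_ctx_flag(const char *typ_name)
            {
                    if (strcmp(typ_name, "mdc") == 0)
                            return "gsc_usr_mds";
                    if (strcmp(typ_name, "osc") == 0)
                            return "gsc_usr_oss";
                    return "gsc_usr_root";  /* what an osp import fell into */
            }

            int main(void)
            {
                    printf("mdc import -> %s\n", rvs_ctx_flag("mdc"));
                    printf("osp import -> %s (expected gsc_usr_mds under DNE)\n",
                           rvs_ctx_flag("osp"));
                    return 0;
            }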

            di.wang Di Wang (Inactive) added a comment -

            I saw there is already cl_sp_to in client_obd. Any reason you can not use that?

            If it has to get the type from the obd name, then you probably should analyze a bit more, in case "OST" is included in some crazy fsname. Please check those examples in obd_mount.c, like server_name_is_ost() ...
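            The name-parsing pitfall mentioned above can be shown with a short sketch (hypothetical device names, plain userspace C): a naive substring match misclassifies any device whose fsname happens to contain "OST", which is why obd_mount.c has helpers such as server_name_is_ost() and why keying on cl_sp_to avoids name parsing altogether.

            /*
             * Hypothetical device names, for illustration only: classifying a
             * device by whether "OST" appears anywhere in its name breaks as
             * soon as the fsname itself contains "OST".
             */
            #include <stdbool.h>
            #include <stdio.h>
            #include <string.h>

            static bool is_ost_naive(const char *dev_name)
            {
                    return strstr(dev_name, "OST") != NULL;
            }

            int main(void)
            {
                    const char *mdt_dev = "myOSTfs-MDT0000";  /* fsname contains "OST" */
                    const char *ost_dev = "myOSTfs-OST0001";

                    printf("%s -> %s\n", mdt_dev,
                           is_ost_naive(mdt_dev) ? "OST (misclassified)" : "not OST");
                    printf("%s -> %s\n", ost_dev,
                           is_ost_naive(ost_dev) ? "OST" : "not OST");
                    return 0;
            }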

            People

              Assignee: jfilizetti Jeremy Filizetti
              Reporter: nblavend Nathan Lavender (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: