Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12001

sanity-gss issues

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.13.0
    • None
    • 3
    • 9223372036854775807

    Description

      It looks like in my testing sanity-gss test 150 has a 100% crash failure rate like this:

      [ 3757.187493] Lustre: DEBUG MARKER: == sanity-gss test 150: secure mgs connection: client flavor setting ================================= 07:38:39 (1551011919)
      [ 3759.003595] LustreError: 5651:0:(file.c:4273:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -1
      [ 3759.126390] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) page@ffff8800b2af7e00[2 ffff8800c2ee3558 2 1           (null)]
      
      [ 3759.131722] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) vvp-page@ffff8800b2af7e50(0:0) vm@ffffea0002721f88 1fffff0000282c 2:0 ffff8800b2af7e00 6144 lru
      
      [ 3759.136409] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) lov-page@ffff8800b2af7e90, comp index: 0, gen: 0
      
      [ 3759.139344] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) osc-page@ffff8800b2af7ec8 6144: 1< 0x845fed 258 0 + + > 2< 25165824 0 4096 0x5 0x520 |           (null) ffff8800c4288798 ffff8800cff6de60 > 3< 1 25 0 > 4< 0 0 8 227942400 - | - - + - > 5< - - + - | 0 - | 0 - ->
      
      [ 3759.145390] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) end page@ffff8800b2af7e00
      
      [ 3759.147726] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) uncovered page!
      [ 3759.149580] LustreError: 555:0:(ldlm_resource.c:1725:ldlm_resource_dump()) --- Resource: [0xa:0x0:0x0].0x0 (ffff8800c33fb9c0) refcount = 2
      [ 3759.152436] LustreError: 555:0:(ldlm_resource.c:1728:ldlm_resource_dump()) Granted locks (in reverse order):
      [ 3759.154733] LustreError: 555:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: lustre-OST0001-osc-ffff8800d231d800 lock: ffff8800da181d80/0x513d4886fdafe22b lrc: 2/0,0 mode: PW/PW res: [0xa:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x20000020000 nid: local remote: 0x6960c454aded7432 expref: -99 pid: 2547 timeout: 0 lvb_type: 1
      [ 3759.162297] Pid: 555, comm: kworker/u16:4 3.10.0-7.6-debug #2 SMP Wed Nov 7 02:25:20 EST 2018
      [ 3759.164316] Call Trace:
      [ 3759.165036]  [<ffffffffa01197dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [ 3759.166673]  [<ffffffffa0119836>] libcfs_debug_dumpstack+0x26/0x30 [libcfs]
      [ 3759.169212]  [<ffffffffa06b3cd0>] osc_req_attr_set+0x3e0/0x610 [osc]
      [ 3759.171461]  [<ffffffffa02acdf0>] cl_req_attr_set+0x60/0x150 [obdclass]
      [ 3759.173538]  [<ffffffffa06ae546>] osc_build_rpc+0x496/0xfe0 [osc]
      [ 3759.175067]  [<ffffffffa06c8d27>] osc_io_unplug0+0xe17/0x1920 [osc]
      [ 3759.176554]  [<ffffffffa06cd7e0>] osc_cache_writeback_range+0xc20/0x1260 [osc]
      [ 3759.178457]  [<ffffffffa06bc2c5>] osc_io_fsync_start+0x85/0x1a0 [osc]
      [ 3759.180085]  [<ffffffffa02acf45>] cl_io_start+0x65/0x130 [obdclass]
      [ 3759.181571]  [<ffffffffa07b9d05>] lov_io_call.isra.7+0x85/0x140 [lov]
      [ 3759.183270]  [<ffffffffa07b9ec6>] lov_io_start+0x56/0x150 [lov]
      [ 3759.184661]  [<ffffffffa02acf45>] cl_io_start+0x65/0x130 [obdclass]
      [ 3759.186366]  [<ffffffffa02af21c>] cl_io_loop+0xcc/0x1c0 [obdclass]
      [ 3759.187837]  [<ffffffffa0c278ab>] cl_sync_file_range+0x2db/0x380 [lustre]
      [ 3759.189677]  [<ffffffffa0c4519a>] ll_writepages+0x7a/0x200 [lustre]
      [ 3759.191845]  [<ffffffff811b92a1>] do_writepages+0x21/0x50
      [ 3759.193168]  [<ffffffff812654e0>] __writeback_single_inode+0x40/0x2a0
      [ 3759.194940]  [<ffffffff812662a6>] writeback_sb_inodes+0x2a6/0x4b0
      [ 3759.196596]  [<ffffffff81266c52>] wb_writeback+0x102/0x340
      [ 3759.198126]  [<ffffffff81267511>] bdi_writeback_workfn+0x141/0x4e0
      [ 3759.199598]  [<ffffffff810ad8ad>] process_one_work+0x18d/0x4a0
      [ 3759.201118]  [<ffffffff810adce6>] worker_thread+0x126/0x3b0
      [ 3759.202435]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
      [ 3759.203719]  [<ffffffff817c4c77>] ret_from_fork_nospec_end+0x0/0x39
      [ 3759.205294]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 3759.206488] LustreError: 555:0:(osc_object.c:401:osc_req_attr_set()) LBUG
      

      So I looked inside and it seems there are much deeper problems going on under the hood.

      It all starts in early init with this list warning:

      [  113.445869] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) create svc ctx ffff88009df5de40: user from 192.168.10.181@tcp authenticated as root
      [  113.449010] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) Skipped 1 previous similar message
      [  113.455283] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) server installed reverse ctx ffff8800a8b93f00 idx 0x96480a9894da0293, expiry 1551008338(+60s)
      [  113.460308] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) Skipped 2 previous similar messages
      [  115.146368] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) create svc ctx ffff88009ddffe40: user from 0@lo authenticated as mds
      [  115.148646] ------------[ cut here ]------------
      [  115.148653] WARNING: CPU: 2 PID: 15079 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
      [  115.148654] list_del corruption. prev->next should be ffff8800a6c5fba0, but was ffff8800d354af68
      [  115.148683] Modules linked in: zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) jbd2 mbcache lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod crc_t10dif crct10dif_generic crct10dif_common squashfs pcspkr i2c_piix4 i2c_core binfmt_misc ip_tables rpcsec_gss_krb5 ata_generic pata_acpi ata_piix serio_raw libata virtio_blk floppy
      [  115.148687] CPU: 2 PID: 15079 Comm: ll_ost00_001 Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #2
      [  115.148688] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  115.148689] Call Trace:
      [  115.148696]  [<ffffffff817afbf2>] dump_stack+0x19/0x1b
      [  115.148701]  [<ffffffff810895c8>] __warn+0xd8/0x100
      [  115.148704]  [<ffffffff8108964f>] warn_slowpath_fmt+0x5f/0x80
      [  115.148722]  [<ffffffff8109bfb0>] ? __internal_add_timer+0x130/0x130
      [  115.148724]  [<ffffffff813ff301>] __list_del_entry+0xa1/0xd0
      [  115.148726]  [<ffffffff813ff33d>] list_del+0xd/0x30
      [  115.148730]  [<ffffffff810b5c66>] remove_wait_queue+0x26/0x40
      [  115.148744]  [<ffffffffa0734937>] gss_svc_upcall_handle_init+0x267/0xf10 [ptlrpc_gss]
      [  115.148748]  [<ffffffff810caae0>] ? wake_up_state+0x20/0x20
      [  115.148766]  [<ffffffffa0726839>] gss_svc_handle_init+0x7e9/0xb70 [ptlrpc_gss]
      [  115.148769]  [<ffffffff810ac981>] ? try_to_grab_pending+0xb1/0x180
      [  115.148774]  [<ffffffffa072cffb>] gss_svc_accept+0x81b/0xad0 [ptlrpc_gss]
      [  115.148780]  [<ffffffffa0741fe8>] gss_svc_accept_kr+0x18/0x20 [ptlrpc_gss]
      [  115.148836]  [<ffffffffa04fc30e>] sptlrpc_svc_unwrap_request+0xee/0x600 [ptlrpc]
      [  115.148871]  [<ffffffffa04dd108>] ptlrpc_main+0xa28/0x2040 [ptlrpc]
      [  115.148873]  [<ffffffff810c32ed>] ? finish_task_switch+0x5d/0x1b0
      [  115.148919]  [<ffffffffa04dc6e0>] ? ptlrpc_register_service+0xfe0/0xfe0 [ptlrpc]
      [  115.148922]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
      [  115.148925]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
      [  115.148928]  [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
      [  115.148931]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
      [  115.148933] ---[ end trace 3829a662976cf54a ]---
      [  115.149029] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) server installed reverse ctx ffff88009cbe5f00 idx 0x88a4014f1934dac8, expiry 1551008339(+60s)
      [  115.149417] Lustre: 17039:0:(sec_gss.c:377:gss_cli_ctx_uptodate()) client refreshed ctx ffff8800bb3a6f00 idx 0x88a4014f1934daca (0->lustre-OST0000_UUID), expiry 1551008329(+50s)
      

      and then it kind of goes downhill from there complaining about errors authentication failures and such.

      [  711.023235] LustreError: 21817:0:(gss_keyring.c:1423:gss_kt_update()) negotiation: rpc err 0, gss err 0
      [  711.027252] LustreError: 21817:0:(gss_keyring.c:1423:gss_kt_update()) Skipped 629 previous similar messages
      [ 1208.887079] LustreError: 15487:0:(gss_svc_upcall.c:1044:gss_svc_upcall_handle_init()) authentication failed
      [ 1208.891203] LustreError: 15487:0:(gss_svc_upcall.c:1044:gss_svc_upcall_handle_init()) Skipped 2077 previous similar messages
      

      and eventually it dies in that crash.

      So clearly something very wrong is going on.

      This is on a dual-VM setup where client and server are on separate VMs.

      you can see all related logs from a sample run here:
      http://testing.linuxhacker.ru:3333/lustre-reports/72/testresults/sanity-gss-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

      Attachments

        Activity

          People

            wc-triage WC Triage
            green Oleg Drokin
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: