Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Affects Version: Lustre 2.13.0
Description
In my testing, sanity-gss test 150 crashes 100% of the time, like this:
[ 3757.187493] Lustre: DEBUG MARKER: == sanity-gss test 150: secure mgs connection: client flavor setting ================================= 07:38:39 (1551011919)
[ 3759.003595] LustreError: 5651:0:(file.c:4273:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -1
[ 3759.126390] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) page@ffff8800b2af7e00[2 ffff8800c2ee3558 2 1 (null)]
[ 3759.131722] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) vvp-page@ffff8800b2af7e50(0:0) vm@ffffea0002721f88 1fffff0000282c 2:0 ffff8800b2af7e00 6144 lru
[ 3759.136409] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) lov-page@ffff8800b2af7e90, comp index: 0, gen: 0
[ 3759.139344] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) osc-page@ffff8800b2af7ec8 6144: 1< 0x845fed 258 0 + + > 2< 25165824 0 4096 0x5 0x520 | (null) ffff8800c4288798 ffff8800cff6de60 > 3< 1 25 0 > 4< 0 0 8 227942400 - | - - + - > 5< - - + - | 0 - | 0 - ->
[ 3759.145390] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) end page@ffff8800b2af7e00
[ 3759.147726] LustreError: 555:0:(osc_object.c:391:osc_req_attr_set()) uncovered page!
[ 3759.149580] LustreError: 555:0:(ldlm_resource.c:1725:ldlm_resource_dump()) --- Resource: [0xa:0x0:0x0].0x0 (ffff8800c33fb9c0) refcount = 2
[ 3759.152436] LustreError: 555:0:(ldlm_resource.c:1728:ldlm_resource_dump()) Granted locks (in reverse order):
[ 3759.154733] LustreError: 555:0:(ldlm_resource.c:1731:ldlm_resource_dump()) ### ### ns: lustre-OST0001-osc-ffff8800d231d800 lock: ffff8800da181d80/0x513d4886fdafe22b lrc: 2/0,0 mode: PW/PW res: [0xa:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x20000020000 nid: local remote: 0x6960c454aded7432 expref: -99 pid: 2547 timeout: 0 lvb_type: 1
[ 3759.162297] Pid: 555, comm: kworker/u16:4 3.10.0-7.6-debug #2 SMP Wed Nov 7 02:25:20 EST 2018
[ 3759.164316] Call Trace:
[ 3759.165036] [<ffffffffa01197dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 3759.166673] [<ffffffffa0119836>] libcfs_debug_dumpstack+0x26/0x30 [libcfs]
[ 3759.169212] [<ffffffffa06b3cd0>] osc_req_attr_set+0x3e0/0x610 [osc]
[ 3759.171461] [<ffffffffa02acdf0>] cl_req_attr_set+0x60/0x150 [obdclass]
[ 3759.173538] [<ffffffffa06ae546>] osc_build_rpc+0x496/0xfe0 [osc]
[ 3759.175067] [<ffffffffa06c8d27>] osc_io_unplug0+0xe17/0x1920 [osc]
[ 3759.176554] [<ffffffffa06cd7e0>] osc_cache_writeback_range+0xc20/0x1260 [osc]
[ 3759.178457] [<ffffffffa06bc2c5>] osc_io_fsync_start+0x85/0x1a0 [osc]
[ 3759.180085] [<ffffffffa02acf45>] cl_io_start+0x65/0x130 [obdclass]
[ 3759.181571] [<ffffffffa07b9d05>] lov_io_call.isra.7+0x85/0x140 [lov]
[ 3759.183270] [<ffffffffa07b9ec6>] lov_io_start+0x56/0x150 [lov]
[ 3759.184661] [<ffffffffa02acf45>] cl_io_start+0x65/0x130 [obdclass]
[ 3759.186366] [<ffffffffa02af21c>] cl_io_loop+0xcc/0x1c0 [obdclass]
[ 3759.187837] [<ffffffffa0c278ab>] cl_sync_file_range+0x2db/0x380 [lustre]
[ 3759.189677] [<ffffffffa0c4519a>] ll_writepages+0x7a/0x200 [lustre]
[ 3759.191845] [<ffffffff811b92a1>] do_writepages+0x21/0x50
[ 3759.193168] [<ffffffff812654e0>] __writeback_single_inode+0x40/0x2a0
[ 3759.194940] [<ffffffff812662a6>] writeback_sb_inodes+0x2a6/0x4b0
[ 3759.196596] [<ffffffff81266c52>] wb_writeback+0x102/0x340
[ 3759.198126] [<ffffffff81267511>] bdi_writeback_workfn+0x141/0x4e0
[ 3759.199598] [<ffffffff810ad8ad>] process_one_work+0x18d/0x4a0
[ 3759.201118] [<ffffffff810adce6>] worker_thread+0x126/0x3b0
[ 3759.202435] [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[ 3759.203719] [<ffffffff817c4c77>] ret_from_fork_nospec_end+0x0/0x39
[ 3759.205294] [<ffffffffffffffff>] 0xffffffffffffffff
[ 3759.206488] LustreError: 555:0:(osc_object.c:401:osc_req_attr_set()) LBUG
So I looked inside, and it seems there are much deeper problems under the hood.
It all starts in early init with this list corruption warning:
[ 113.445869] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) create svc ctx ffff88009df5de40: user from 192.168.10.181@tcp authenticated as root
[ 113.449010] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) Skipped 1 previous similar message
[ 113.455283] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) server installed reverse ctx ffff8800a8b93f00 idx 0x96480a9894da0293, expiry 1551008338(+60s)
[ 113.460308] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) Skipped 2 previous similar messages
[ 115.146368] Lustre: 16553:0:(sec_gss.c:2066:gss_svc_handle_init()) create svc ctx ffff88009ddffe40: user from 0@lo authenticated as mds
[ 115.148646] ------------[ cut here ]------------
[ 115.148653] WARNING: CPU: 2 PID: 15079 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
[ 115.148654] list_del corruption. prev->next should be ffff8800a6c5fba0, but was ffff8800d354af68
[ 115.148683] Modules linked in: zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) jbd2 mbcache lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod crc_t10dif crct10dif_generic crct10dif_common squashfs pcspkr i2c_piix4 i2c_core binfmt_misc ip_tables rpcsec_gss_krb5 ata_generic pata_acpi ata_piix serio_raw libata virtio_blk floppy
[ 115.148687] CPU: 2 PID: 15079 Comm: ll_ost00_001 Kdump: loaded Tainted: P OE ------------ 3.10.0-7.6-debug #2
[ 115.148688] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 115.148689] Call Trace:
[ 115.148696] [<ffffffff817afbf2>] dump_stack+0x19/0x1b
[ 115.148701] [<ffffffff810895c8>] __warn+0xd8/0x100
[ 115.148704] [<ffffffff8108964f>] warn_slowpath_fmt+0x5f/0x80
[ 115.148722] [<ffffffff8109bfb0>] ? __internal_add_timer+0x130/0x130
[ 115.148724] [<ffffffff813ff301>] __list_del_entry+0xa1/0xd0
[ 115.148726] [<ffffffff813ff33d>] list_del+0xd/0x30
[ 115.148730] [<ffffffff810b5c66>] remove_wait_queue+0x26/0x40
[ 115.148744] [<ffffffffa0734937>] gss_svc_upcall_handle_init+0x267/0xf10 [ptlrpc_gss]
[ 115.148748] [<ffffffff810caae0>] ? wake_up_state+0x20/0x20
[ 115.148766] [<ffffffffa0726839>] gss_svc_handle_init+0x7e9/0xb70 [ptlrpc_gss]
[ 115.148769] [<ffffffff810ac981>] ? try_to_grab_pending+0xb1/0x180
[ 115.148774] [<ffffffffa072cffb>] gss_svc_accept+0x81b/0xad0 [ptlrpc_gss]
[ 115.148780] [<ffffffffa0741fe8>] gss_svc_accept_kr+0x18/0x20 [ptlrpc_gss]
[ 115.148836] [<ffffffffa04fc30e>] sptlrpc_svc_unwrap_request+0xee/0x600 [ptlrpc]
[ 115.148871] [<ffffffffa04dd108>] ptlrpc_main+0xa28/0x2040 [ptlrpc]
[ 115.148873] [<ffffffff810c32ed>] ? finish_task_switch+0x5d/0x1b0
[ 115.148919] [<ffffffffa04dc6e0>] ? ptlrpc_register_service+0xfe0/0xfe0 [ptlrpc]
[ 115.148922] [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[ 115.148925] [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[ 115.148928] [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
[ 115.148931] [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[ 115.148933] ---[ end trace 3829a662976cf54a ]---
[ 115.149029] Lustre: 15079:0:(sec_gss.c:370:gss_cli_ctx_uptodate()) server installed reverse ctx ffff88009cbe5f00 idx 0x88a4014f1934dac8, expiry 1551008339(+60s)
[ 115.149417] Lustre: 17039:0:(sec_gss.c:377:gss_cli_ctx_uptodate()) client refreshed ctx ffff8800bb3a6f00 idx 0x88a4014f1934daca (0->lustre-OST0000_UUID), expiry 1551008329(+50s)
and then it goes downhill from there, with repeated authentication failures and other errors:
[ 711.023235] LustreError: 21817:0:(gss_keyring.c:1423:gss_kt_update()) negotiation: rpc err 0, gss err 0
[ 711.027252] LustreError: 21817:0:(gss_keyring.c:1423:gss_kt_update()) Skipped 629 previous similar messages
[ 1208.887079] LustreError: 15487:0:(gss_svc_upcall.c:1044:gss_svc_upcall_handle_init()) authentication failed
[ 1208.891203] LustreError: 15487:0:(gss_svc_upcall.c:1044:gss_svc_upcall_handle_init()) Skipped 2077 previous similar messages
until it eventually dies in the crash above. Clearly something is very wrong here.
This is on a dual-VM setup where client and server are on separate VMs.
You can see all related logs from a sample run here:
http://testing.linuxhacker.ru:3333/lustre-reports/72/testresults/sanity-gss-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/