Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.12.3
-
lustre-b2_12-ib build #31 version=2.12.2_105_gec6b9a6
-
3
-
9223372036854775807
Description
2 OSS nodes hit "general protection fault: 0000 [#1] SMP" during the failover test after running for about 2 days.
Both soak-4 and soak-5 show:
[ 320.525059] LustreError: Skipped 5 previous similar messages
[ 320.554484] Lustre: Skipped 6 previous similar messages
[ 321.739504] Lustre: soaked-OST0000: Connection restored to cccf9223-12de-f413-f873-1abd2bca972e (at 172.16.1.29@o2ib1)
[ 321.751469] Lustre: Skipped 3 previous similar messages
[ 322.819814] LustreError: 137-5: soaked-OST0004_UUID: not available for connect from 192.168.1.111@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 322.839645] LustreError: Skipped 4 previous similar messages
[ 325.343054] Lustre: soaked-OST0004: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 326.721349] Lustre: soaked-OST0008: Connection restored to b6d310e5-89b5-4823-cd5f-bf69ac139c6a (at 172.16.1.26@o2ib1)
[ 326.721352] Lustre: soaked-OST0000: Connection restored to b6d310e5-89b5-4823-cd5f-bf69ac139c6a (at 172.16.1.26@o2ib1)
[ 326.721358] Lustre: Skipped 3 previous similar messages
[ 330.480118] Lustre: soaked-OST0004: Will be in recovery for at least 2:30, or until 28 clients reconnect
[ 330.543092] Lustre: soaked-OST000c: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 335.247236] Lustre: soaked-OST000c: Connection restored to (at 172.16.1.22@o2ib1)
[ 335.255753] Lustre: Skipped 20 previous similar messages
[ 351.605041] Lustre: soaked-OST0000: Connection restored to d33f0a29-feff-6666-e008-19168b75e455 (at 172.16.1.38@o2ib1)
[ 351.617080] Lustre: Skipped 65 previous similar messages
[ 354.213237] Lustre: soaked-OST0004: Recovery over after 0:23, of 28 clients 28 recovered and 0 were evicted.
[ 354.248446] Lustre: soaked-OST0004: deleting orphan objects from 0x0:201673384 to 0x0:201676484
[ 354.337426] Lustre: soaked-OST0004: deleting orphan objects from 0x400000402:131183875 to 0x400000402:131188552
[ 354.344896] Lustre: soaked-OST0004: deleting orphan objects from 0x400000401:194940686 to 0x400000401:194951539
[ 355.092341] Lustre: soaked-OST0008: Recovery over after 0:35, of 28 clients 28 recovered and 0 were evicted.
[ 355.112233] Lustre: soaked-OST0008: deleting orphan objects from 0x0:201739882 to 0x0:201746221
[ 355.439495] Lustre: soaked-OST0008: deleting orphan objects from 0x500000401:195199465 to 0x500000401:195206220
[ 355.439890] Lustre: soaked-OST0008: deleting orphan objects from 0x500000402:131227465 to 0x500000402:131237293
[ 430.413849] Lustre: soaked-OST0004: deleting orphan objects from 0x400000400:140327007 to 0x400000400:140331081
[ 430.413851] Lustre: soaked-OST0008: deleting orphan objects from 0x500000400:140581975 to 0x500000400:140583563
[ 430.493859] Lustre: soaked-OST0008: Connection restored to f9470b7c-9158-fc3c-884e-b494778ee289 (at 172.16.1.31@o2ib1)
[ 430.505816] Lustre: Skipped 3 previous similar messages
[ 503.499925] LustreError: 34183:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 172.16.1.35@o2ib1 ns: filter-soaked-OST0004_UUID lock: ffff9b2f0e700b40/0x39ef56938122067c lrc: 3/0,0 mode: PW/PW res: [0x400000400:0x85d13f5:0x0].0x0 rrc: 7 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400000020 nid: 172.16.1.35@o2ib1 remote: 0x4d7feaf65d51590b expref: 8 pid: 52997 timeout: 503 lvb_type: 0
[ 506.075756] Lustre: soaked-OST0004: Connection restored to 2f6488d4-9684-6185-e787-5c39dc9ffacd (at 172.16.1.35@o2ib1)
[ 506.087771] Lustre: Skipped 5 previous similar messages
[ 506.322934] Lustre: soaked-OST0000: recovery is timed out, evict stale exports
[ 506.331066] Lustre: soaked-OST0000: disconnecting 3 stale clients
[ 508.810352] Lustre: soaked-OST0000: Recovery over after 3:09, of 28 clients 25 recovered and 3 were evicted.
[ 508.836797] Lustre: soaked-OST0000: deleting orphan objects from 0x0:201827014 to 0x0:201838552
[ 509.128301] Lustre: soaked-OST0000: deleting orphan objects from 0x300000402:131294344 to 0x300000402:131298527
[ 509.182439] Lustre: soaked-OST0000: deleting orphan objects from 0x300000401:195138776 to 0x300000401:195149783
[ 517.586983] Lustre: soaked-OST000c: recovery is timed out, evict stale exports
[ 517.595101] Lustre: soaked-OST000c: disconnecting 1 stale clients
[ 519.522063] Lustre: soaked-OST000c: Recovery over after 3:09, of 28 clients 27 recovered and 1 was evicted.
[ 519.522711] Lustre: soaked-OST000c: deleting orphan objects from 0x600000401:140374471 to 0x600000401:140375679
[ 519.544629] general protection fault: 0000 [#1] SMP
[ 519.550190] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel iTCO_wdt lrw iTCO_vendor_support gf128mul glue_helper ablk_helper cryptd pcspkr dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) joydev ses enclosure ipmi_ssif lpc_ich mei_me sg ioatdma i2c_i801 mei ipmi_si ipmi_devintf ipmi_msghandler wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci isci igb drm mlx4_core(OE) crct10dif_pclmul libsas libahci crct10dif_common ptp mpt2sas crc32c_intel devlink pps_core libata raid_class dca scsi_transport_sas mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit
[ 519.565747] Lustre: soaked-OST000c: deleting orphan objects from 0x0:201738135 to 0x0:201745341
[ 519.677136] CPU: 3 PID: 60300 Comm: tgt_recover_12 Kdump: loaded Tainted: P OE ------------ 3.10.0-957.21.3.el7_lustre.x86_64 #1
[ 519.691306] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 519.703834] task: ffff9b2f27cb0000 ti: ffff9b2f20b18000 task.ti: ffff9b2f20b18000
[ 519.712184] RIP: 0010:[<ffffffffc0fdeefc>] [<ffffffffc0fdeefc>] keys_fill+0x5c/0x180 [obdclass]
[ 519.722046] RSP: 0018:ffff9b2f20b1bad0 EFLAGS: 00010246
[ 519.727974] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: ffff9b2f20b1bfd8
[ 519.735936] RDX: ffff9b2f20b1baf8 RSI: 0000000000000002 RDI: ffffffffc1044080
[ 519.743898] RBP: ffff9b2f20b1baf0 R08: 0000000000000000 R09: ffff9b287fc07b00
[ 519.751868] R10: ffffffffc14cf797 R11: ffff9b2f12860c00 R12: ffffffffc1044140
[ 519.759836] R13: ffff9b2f076a3120 R14: 0000000000000013 R15: ffff9b2f15fe62c8
[ 519.767799] FS: 0000000000000000(0000) GS:ffff9b2b2e0c0000(0000) knlGS:0000000000000000
[ 519.776834] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 519.783244] CR2: 00007f81db1e2140 CR3: 000000023c410000 CR4: 00000000000607e0
[ 519.791207] Call Trace:
[ 519.793962] [<ffffffffc0fe3961>] lu_context_refill+0x41/0x50 [obdclass]
[ 519.801466] [<ffffffffc0fe39f4>] lu_env_refill+0x24/0x30 [obdclass]
[ 519.808579] [<ffffffffc14cf831>] ofd_lvbo_init+0x2a1/0x7f0 [ofd]
[ 519.815426] [<ffffffffc12aa0fd>] ldlm_server_completion_ast+0x5fd/0x980 [ptlrpc]
[ 519.823809] [<ffffffffc12a9b00>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
[ 519.832181] [<ffffffffc127c748>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[ 519.839981] [<ffffffffc12c3bf2>] ptlrpc_set_wait+0x72/0x790 [ptlrpc]
[ 519.847171] [<ffffffffa401d75d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[ 519.854957] [<ffffffffc0fc1a79>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 519.862942] [<ffffffffc127c6a0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[ 519.871032] [<ffffffffc12ba472>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
[ 519.878452] [<ffffffffc1281f25>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[ 519.885847] [<ffffffffc12833e1>] __ldlm_reprocess_all+0x101/0x340 [ptlrpc]
[ 519.893651] [<ffffffffc1283986>] ldlm_reprocess_res+0x26/0x30 [ptlrpc]
[ 519.901043] [<ffffffffc0cf4fb0>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[ 519.909127] [<ffffffffc1283960>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[ 519.917499] [<ffffffffc1283960>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[ 519.925861] [<ffffffffc0cf8345>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[ 519.933945] [<ffffffffc12839cc>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[ 519.942416] [<ffffffffc1296211>] target_recovery_thread+0xcd1/0x1160 [ptlrpc]
[ 519.950516] [<ffffffffc1295540>] ? replay_request_or_update.isra.23+0x8c0/0x8c0 [ptlrpc]
[ 519.959660] [<ffffffffa3ec1da1>] kthread+0xd1/0xe0
[ 519.965102] [<ffffffffa3ec1cd0>] ? insert_kthread_work+0x40/0x40
[ 519.971919] [<ffffffffa4575c37>] ret_from_fork_nospec_begin+0x21/0x21
[ 519.979204] [<ffffffffa3ec1cd0>] ? insert_kthread_work+0x40/0x40
[ 519.986002] Code: ab 51 06 00 0f 1f 00 31 db eb 15 0f 1f 40 00 48 83 c3 08 48 81 fb 40 01 00 00 0f 84 9f 00 00 00 49 8b 45 10 4c 8b a3 e0 bf 10 c1 <48> 83 3c 18 00 75 dd 4d 85 e4 74 d8 41 8b 04 24 41 8b 55 00 85
[ 520.007678] RIP [<ffffffffc0fdeefc>] keys_fill+0x5c/0x180 [obdclass]
[ 520.014893] RSP <ffff9b2f20b1bad0>
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
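
When comparing the two affected nodes, it may help to pull the per-target recovery results and the faulting symbol out of each console log automatically. The following is only a minimal sketch, assuming the console output is available as plain text in the format shown above; the function names (recovery_summaries, faulting_symbol) and the SAMPLE text are hypothetical helpers for illustration, not part of Lustre or of the soak test framework.

```python
import re

# Matches lines such as:
#   Lustre: soaked-OST0004: Recovery over after 0:23, of 28 clients 28 recovered and 0 were evicted.
RECOVERY_RE = re.compile(
    r"Lustre: (?P<target>\S+): Recovery over after (?P<elapsed>\d+:\d+), "
    r"of (?P<total>\d+) clients (?P<recovered>\d+) recovered "
    r"and (?P<evicted>\d+) (?:was|were) evicted"
)

# Matches the final "RIP [<addr>] symbol+0xoff/0xlen [module]" line of an oops.
RIP_RE = re.compile(
    r"RIP\s+\[<[0-9a-f]+>\]\s+(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ "
    r"\[(?P<module>\w+)\]"
)


def recovery_summaries(log_text):
    """Return (target, elapsed, total, recovered, evicted) for each finished recovery."""
    return [
        (m["target"], m["elapsed"], int(m["total"]),
         int(m["recovered"]), int(m["evicted"]))
        for m in RECOVERY_RE.finditer(log_text)
    ]


def faulting_symbol(log_text):
    """Return (symbol, module) from the oops RIP line, or None if there is no oops."""
    m = RIP_RE.search(log_text)
    return (m["symbol"], m["module"]) if m else None


# A few lines copied from the log in this ticket, used as sample input.
SAMPLE = """\
[ 354.213237] Lustre: soaked-OST0004: Recovery over after 0:23, of 28 clients 28 recovered and 0 were evicted.
[ 519.522063] Lustre: soaked-OST000c: Recovery over after 3:09, of 28 clients 27 recovered and 1 was evicted.
[ 519.544629] general protection fault: 0000 [#1] SMP
[ 520.007678] RIP [<ffffffffc0fdeefc>] keys_fill+0x5c/0x180 [obdclass]
"""

if __name__ == "__main__":
    for row in recovery_summaries(SAMPLE):
        print(row)
    print(faulting_symbol(SAMPLE))
```

Running the same helpers over the soak-4 and soak-5 console logs would show whether both nodes faulted in the same function right after a recovery that evicted clients.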