[LU-12663] SOAK: OSS hit general protection fault: 0000 [#1] SMP
Created: 13/Aug/19  Updated: 14/Aug/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Sarah Liu
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: soak
Environment:

lustre-b2_12-ib build #31 version=2.12.2_105_gec6b9a6


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Two OSS nodes hit "general protection fault: 0000 [#1] SMP" during the failover test after running for about 2 days.

soak-4 and soak-5 both show:

[  320.525059] LustreError: Skipped 5 previous similar messages
[  320.554484] Lustre: Skipped 6 previous similar messages
[  321.739504] Lustre: soaked-OST0000: Connection restored to cccf9223-12de-f413-f873-1abd2bca972e (at 172.16.1.29@o2ib1)
[  321.751469] Lustre: Skipped 3 previous similar messages
[  322.819814] LustreError: 137-5: soaked-OST0004_UUID: not available for connect from 192.168.1.111@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[  322.839645] LustreError: Skipped 4 previous similar messages
[  325.343054] Lustre: soaked-OST0004: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[  326.721349] Lustre: soaked-OST0008: Connection restored to b6d310e5-89b5-4823-cd5f-bf69ac139c6a (at 172.16.1.26@o2ib1)
[  326.721352] Lustre: soaked-OST0000: Connection restored to b6d310e5-89b5-4823-cd5f-bf69ac139c6a (at 172.16.1.26@o2ib1)
[  326.721358] Lustre: Skipped 3 previous similar messages
[  330.480118] Lustre: soaked-OST0004: Will be in recovery for at least 2:30, or until 28 clients reconnect
[  330.543092] Lustre: soaked-OST000c: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[  335.247236] Lustre: soaked-OST000c: Connection restored to  (at 172.16.1.22@o2ib1)
[  335.255753] Lustre: Skipped 20 previous similar messages
[  351.605041] Lustre: soaked-OST0000: Connection restored to d33f0a29-feff-6666-e008-19168b75e455 (at 172.16.1.38@o2ib1)
[  351.617080] Lustre: Skipped 65 previous similar messages
[  354.213237] Lustre: soaked-OST0004: Recovery over after 0:23, of 28 clients 28 recovered and 0 were evicted.
[  354.248446] Lustre: soaked-OST0004: deleting orphan objects from 0x0:201673384 to 0x0:201676484
[  354.337426] Lustre: soaked-OST0004: deleting orphan objects from 0x400000402:131183875 to 0x400000402:131188552
[  354.344896] Lustre: soaked-OST0004: deleting orphan objects from 0x400000401:194940686 to 0x400000401:194951539
[  355.092341] Lustre: soaked-OST0008: Recovery over after 0:35, of 28 clients 28 recovered and 0 were evicted.
[  355.112233] Lustre: soaked-OST0008: deleting orphan objects from 0x0:201739882 to 0x0:201746221
[  355.439495] Lustre: soaked-OST0008: deleting orphan objects from 0x500000401:195199465 to 0x500000401:195206220
[  355.439890] Lustre: soaked-OST0008: deleting orphan objects from 0x500000402:131227465 to 0x500000402:131237293
[  430.413849] Lustre: soaked-OST0004: deleting orphan objects from 0x400000400:140327007 to 0x400000400:140331081
[  430.413851] Lustre: soaked-OST0008: deleting orphan objects from 0x500000400:140581975 to 0x500000400:140583563
[  430.493859] Lustre: soaked-OST0008: Connection restored to f9470b7c-9158-fc3c-884e-b494778ee289 (at 172.16.1.31@o2ib1)
[  430.505816] Lustre: Skipped 3 previous similar messages
[  503.499925] LustreError: 34183:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 172.16.1.35@o2ib1  ns: filter-soaked-OST0004_UUID lock: ffff9b2f0e700b40/0x39ef56938122067c lrc: 3/0,0 mode: PW/PW res: [0x400000400:0x85d13f5:0x0].0x0 rrc: 7 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400000020 nid: 172.16.1.35@o2ib1 remote: 0x4d7feaf65d51590b expref: 8 pid: 52997 timeout: 503 lvb_type: 0
[  506.075756] Lustre: soaked-OST0004: Connection restored to 2f6488d4-9684-6185-e787-5c39dc9ffacd (at 172.16.1.35@o2ib1)
[  506.087771] Lustre: Skipped 5 previous similar messages
[  506.322934] Lustre: soaked-OST0000: recovery is timed out, evict stale exports
[  506.331066] Lustre: soaked-OST0000: disconnecting 3 stale clients
[  508.810352] Lustre: soaked-OST0000: Recovery over after 3:09, of 28 clients 25 recovered and 3 were evicted.
[  508.836797] Lustre: soaked-OST0000: deleting orphan objects from 0x0:201827014 to 0x0:201838552
[  509.128301] Lustre: soaked-OST0000: deleting orphan objects from 0x300000402:131294344 to 0x300000402:131298527
[  509.182439] Lustre: soaked-OST0000: deleting orphan objects from 0x300000401:195138776 to 0x300000401:195149783
[  517.586983] Lustre: soaked-OST000c: recovery is timed out, evict stale exports
[  517.595101] Lustre: soaked-OST000c: disconnecting 1 stale clients
[  519.522063] Lustre: soaked-OST000c: Recovery over after 3:09, of 28 clients 27 recovered and 1 was evicted.
[  519.522711] Lustre: soaked-OST000c: deleting orphan objects from 0x600000401:140374471 to 0x600000401:140375679
[  519.544629] general protection fault: 0000 [#1] SMP 
[  519.550190] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel iTCO_wdt lrw iTCO_vendor_support gf128mul glue_helper ablk_helper cryptd pcspkr dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) joydev ses enclosure ipmi_ssif lpc_ich mei_me sg ioatdma i2c_i801 mei ipmi_si ipmi_devintf ipmi_msghandler wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci isci igb drm mlx4_core(OE) crct10dif_pclmul libsas libahci crct10dif_common ptp mpt2sas crc32c_intel devlink pps_core libata raid_class dca scsi_transport_sas mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit
[  519.565747] Lustre: soaked-OST000c: deleting orphan objects from 0x0:201738135 to 0x0:201745341
[  519.677136] CPU: 3 PID: 60300 Comm: tgt_recover_12 Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.21.3.el7_lustre.x86_64 #1
[  519.691306] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[  519.703834] task: ffff9b2f27cb0000 ti: ffff9b2f20b18000 task.ti: ffff9b2f20b18000
[  519.712184] RIP: 0010:[<ffffffffc0fdeefc>]  [<ffffffffc0fdeefc>] keys_fill+0x5c/0x180 [obdclass]
[  519.722046] RSP: 0018:ffff9b2f20b1bad0  EFLAGS: 00010246
[  519.727974] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: ffff9b2f20b1bfd8
[  519.735936] RDX: ffff9b2f20b1baf8 RSI: 0000000000000002 RDI: ffffffffc1044080
[  519.743898] RBP: ffff9b2f20b1baf0 R08: 0000000000000000 R09: ffff9b287fc07b00
[  519.751868] R10: ffffffffc14cf797 R11: ffff9b2f12860c00 R12: ffffffffc1044140
[  519.759836] R13: ffff9b2f076a3120 R14: 0000000000000013 R15: ffff9b2f15fe62c8
[  519.767799] FS:  0000000000000000(0000) GS:ffff9b2b2e0c0000(0000) knlGS:0000000000000000
[  519.776834] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  519.783244] CR2: 00007f81db1e2140 CR3: 000000023c410000 CR4: 00000000000607e0
[  519.791207] Call Trace:
[  519.793962]  [<ffffffffc0fe3961>] lu_context_refill+0x41/0x50 [obdclass]
[  519.801466]  [<ffffffffc0fe39f4>] lu_env_refill+0x24/0x30 [obdclass]
[  519.808579]  [<ffffffffc14cf831>] ofd_lvbo_init+0x2a1/0x7f0 [ofd]
[  519.815426]  [<ffffffffc12aa0fd>] ldlm_server_completion_ast+0x5fd/0x980 [ptlrpc]
[  519.823809]  [<ffffffffc12a9b00>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
[  519.832181]  [<ffffffffc127c748>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
[  519.839981]  [<ffffffffc12c3bf2>] ptlrpc_set_wait+0x72/0x790 [ptlrpc]
[  519.847171]  [<ffffffffa401d75d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
[  519.854957]  [<ffffffffc0fc1a79>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[  519.862942]  [<ffffffffc127c6a0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
[  519.871032]  [<ffffffffc12ba472>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
[  519.878452]  [<ffffffffc1281f25>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[  519.885847]  [<ffffffffc12833e1>] __ldlm_reprocess_all+0x101/0x340 [ptlrpc]
[  519.893651]  [<ffffffffc1283986>] ldlm_reprocess_res+0x26/0x30 [ptlrpc]
[  519.901043]  [<ffffffffc0cf4fb0>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
[  519.909127]  [<ffffffffc1283960>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[  519.917499]  [<ffffffffc1283960>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
[  519.925861]  [<ffffffffc0cf8345>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[  519.933945]  [<ffffffffc12839cc>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[  519.942416]  [<ffffffffc1296211>] target_recovery_thread+0xcd1/0x1160 [ptlrpc]
[  519.950516]  [<ffffffffc1295540>] ? replay_request_or_update.isra.23+0x8c0/0x8c0 [ptlrpc]
[  519.959660]  [<ffffffffa3ec1da1>] kthread+0xd1/0xe0
[  519.965102]  [<ffffffffa3ec1cd0>] ? insert_kthread_work+0x40/0x40
[  519.971919]  [<ffffffffa4575c37>] ret_from_fork_nospec_begin+0x21/0x21
[  519.979204]  [<ffffffffa3ec1cd0>] ? insert_kthread_work+0x40/0x40
[  519.986002] Code: ab 51 06 00 0f 1f 00 31 db eb 15 0f 1f 40 00 48 83 c3 08 48 81 fb 40 01 00 00 0f 84 9f 00 00 00 49 8b 45 10 4c 8b a3 e0 bf 10 c1 <48> 83 3c 18 00 75 dd 4d 85 e4 74 d8 41 8b 04 24 41 8b 55 00 85 
[  520.007678] RIP  [<ffffffffc0fdeefc>] keys_fill+0x5c/0x180 [obdclass]
[  520.014893]  RSP <ffff9b2f20b1bad0>
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
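
When triaging an oops like the one above, it can help to scan the register dump for values that cannot be valid pointers. The following is a minimal sketch, not part of the original report. The function names are hypothetical, and it assumes 48-bit x86-64 virtual addressing, where bits 63..47 of a valid address must all be equal. It also flags values made of a single repeated byte, which are often debug fill or poison patterns. Whether that is the case here is an assumption the ticket does not confirm.

```python
import re

# Matches general-purpose register pairs such as "RAX: 5a5a5a5a5a5a5a5a".
# Segment-prefixed values ("RSP: 0018:...") and control registers (CR2, CR3)
# are intentionally not matched.
REG_RE = re.compile(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})\b")


def is_canonical(value: int) -> bool:
    """Bits 63..47 must all be equal (assumes 48-bit virtual addresses)."""
    top = value >> 47
    return top in (0, 0x1FFFF)


def is_repeated_byte(value: int) -> bool:
    """All eight bytes are the same non-zero byte, e.g. 0x5a5a5a5a5a5a5a5a."""
    raw = value.to_bytes(8, "big")
    return raw[0] != 0 and len(set(raw)) == 1


def suspicious_registers(oops_text: str) -> dict:
    """Return {register: hex value} for values worth a closer look."""
    flagged = {}
    for name, hex_value in REG_RE.findall(oops_text):
        value = int(hex_value, 16)
        if not is_canonical(value) or is_repeated_byte(value):
            flagged[name] = hex_value
    return flagged
```

Run against the register lines in this log, the sketch flags only RAX (5a5a5a5a5a5a5a5a). That value is non-canonical, which is consistent with a general protection fault if the faulting instruction in keys_fill dereferenced it.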


 Comments   
Comment by James Nunez (Inactive) [ 14/Aug/19 ]

Here is what was loaded on soak and issues seen prior to this crash:
2019-8-3: Soak started with lustre-b2_12-ib #31
2019-8-7: OSS hit LU-9845, which hung the whole test run; soak was stopped and restarted
2019-8-12: OSS hit LU-12663; soak was stopped
