Lustre / LU-12663

SOAK: OSS hit general protection fault: 0000 [#1] SMP


Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.3
    • lustre-b2_12-ib build #31 version=2.12.2_105_gec6b9a6
    • 3

    Description

      Two OSS nodes hit "general protection fault: 0000 [#1] SMP" during the failover test after running for about 2 days.

      Both soak-4 and soak-5 show:

      [  320.525059] LustreError: Skipped 5 previous similar messages
      [  320.554484] Lustre: Skipped 6 previous similar messages
      [  321.739504] Lustre: soaked-OST0000: Connection restored to cccf9223-12de-f413-f873-1abd2bca972e (at 172.16.1.29@o2ib1)
      [  321.751469] Lustre: Skipped 3 previous similar messages
      [  322.819814] LustreError: 137-5: soaked-OST0004_UUID: not available for connect from 192.168.1.111@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [  322.839645] LustreError: Skipped 4 previous similar messages
      [  325.343054] Lustre: soaked-OST0004: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [  326.721349] Lustre: soaked-OST0008: Connection restored to b6d310e5-89b5-4823-cd5f-bf69ac139c6a (at 172.16.1.26@o2ib1)
      [  326.721352] Lustre: soaked-OST0000: Connection restored to b6d310e5-89b5-4823-cd5f-bf69ac139c6a (at 172.16.1.26@o2ib1)
      [  326.721358] Lustre: Skipped 3 previous similar messages
      [  330.480118] Lustre: soaked-OST0004: Will be in recovery for at least 2:30, or until 28 clients reconnect
      [  330.543092] Lustre: soaked-OST000c: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [  335.247236] Lustre: soaked-OST000c: Connection restored to  (at 172.16.1.22@o2ib1)
      [  335.255753] Lustre: Skipped 20 previous similar messages
      [  351.605041] Lustre: soaked-OST0000: Connection restored to d33f0a29-feff-6666-e008-19168b75e455 (at 172.16.1.38@o2ib1)
      [  351.617080] Lustre: Skipped 65 previous similar messages
      [  354.213237] Lustre: soaked-OST0004: Recovery over after 0:23, of 28 clients 28 recovered and 0 were evicted.
      [  354.248446] Lustre: soaked-OST0004: deleting orphan objects from 0x0:201673384 to 0x0:201676484
      [  354.337426] Lustre: soaked-OST0004: deleting orphan objects from 0x400000402:131183875 to 0x400000402:131188552
      [  354.344896] Lustre: soaked-OST0004: deleting orphan objects from 0x400000401:194940686 to 0x400000401:194951539
      [  355.092341] Lustre: soaked-OST0008: Recovery over after 0:35, of 28 clients 28 recovered and 0 were evicted.
      [  355.112233] Lustre: soaked-OST0008: deleting orphan objects from 0x0:201739882 to 0x0:201746221
      [  355.439495] Lustre: soaked-OST0008: deleting orphan objects from 0x500000401:195199465 to 0x500000401:195206220
      [  355.439890] Lustre: soaked-OST0008: deleting orphan objects from 0x500000402:131227465 to 0x500000402:131237293
      [  430.413849] Lustre: soaked-OST0004: deleting orphan objects from 0x400000400:140327007 to 0x400000400:140331081
      [  430.413851] Lustre: soaked-OST0008: deleting orphan objects from 0x500000400:140581975 to 0x500000400:140583563
      [  430.493859] Lustre: soaked-OST0008: Connection restored to f9470b7c-9158-fc3c-884e-b494778ee289 (at 172.16.1.31@o2ib1)
      [  430.505816] Lustre: Skipped 3 previous similar messages
      [  503.499925] LustreError: 34183:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 172.16.1.35@o2ib1  ns: filter-soaked-OST0004_UUID lock: ffff9b2f0e700b40/0x39ef56938122067c lrc: 3/0,0 mode: PW/PW res: [0x400000400:0x85d13f5:0x0].0x0 rrc: 7 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400000020 nid: 172.16.1.35@o2ib1 remote: 0x4d7feaf65d51590b expref: 8 pid: 52997 timeout: 503 lvb_type: 0
      [  506.075756] Lustre: soaked-OST0004: Connection restored to 2f6488d4-9684-6185-e787-5c39dc9ffacd (at 172.16.1.35@o2ib1)
      [  506.087771] Lustre: Skipped 5 previous similar messages
      [  506.322934] Lustre: soaked-OST0000: recovery is timed out, evict stale exports
      [  506.331066] Lustre: soaked-OST0000: disconnecting 3 stale clients
      [  508.810352] Lustre: soaked-OST0000: Recovery over after 3:09, of 28 clients 25 recovered and 3 were evicted.
      [  508.836797] Lustre: soaked-OST0000: deleting orphan objects from 0x0:201827014 to 0x0:201838552
      [  509.128301] Lustre: soaked-OST0000: deleting orphan objects from 0x300000402:131294344 to 0x300000402:131298527
      [  509.182439] Lustre: soaked-OST0000: deleting orphan objects from 0x300000401:195138776 to 0x300000401:195149783
      [  517.586983] Lustre: soaked-OST000c: recovery is timed out, evict stale exports
      [  517.595101] Lustre: soaked-OST000c: disconnecting 1 stale clients
      [  519.522063] Lustre: soaked-OST000c: Recovery over after 3:09, of 28 clients 27 recovered and 1 was evicted.
      [  519.522711] Lustre: soaked-OST000c: deleting orphan objects from 0x600000401:140374471 to 0x600000401:140375679
      [  519.544629] general protection fault: 0000 [#1] SMP 
      [  519.550190] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE)[  519.565747] Lustre: soaked-OST000c: deleting orphan objects from 0x0:201738135 to 0x0:201745341
       libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel iTCO_wdt lrw iTCO_vendor_support gf128mul glue_helper ablk_helper cryptd pcspkr dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) joydev ses enclosure ipmi_ssif lpc_ich mei_me sg ioatdma i2c_i801 mei ipmi_si ipmi_devintf ipmi_msghandler wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci isci igb drm mlx4_core(OE) crct10dif_pclmul libsas libahci crct10dif_common ptp mpt2sas crc32c_intel devlink pps_core libata raid_class dca scsi_transport_sas mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit
      [  519.677136] CPU: 3 PID: 60300 Comm: tgt_recover_12 Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.21.3.el7_lustre.x86_64 #1
      [  519.691306] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [  519.703834] task: ffff9b2f27cb0000 ti: ffff9b2f20b18000 task.ti: ffff9b2f20b18000
      [  519.712184] RIP: 0010:[<ffffffffc0fdeefc>]  [<ffffffffc0fdeefc>] keys_fill+0x5c/0x180 [obdclass]
      [  519.722046] RSP: 0018:ffff9b2f20b1bad0  EFLAGS: 00010246
      [  519.727974] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: ffff9b2f20b1bfd8
      [  519.735936] RDX: ffff9b2f20b1baf8 RSI: 0000000000000002 RDI: ffffffffc1044080
      [  519.743898] RBP: ffff9b2f20b1baf0 R08: 0000000000000000 R09: ffff9b287fc07b00
      [  519.751868] R10: ffffffffc14cf797 R11: ffff9b2f12860c00 R12: ffffffffc1044140
      [  519.759836] R13: ffff9b2f076a3120 R14: 0000000000000013 R15: ffff9b2f15fe62c8
      [  519.767799] FS:  0000000000000000(0000) GS:ffff9b2b2e0c0000(0000) knlGS:0000000000000000
      [  519.776834] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  519.783244] CR2: 00007f81db1e2140 CR3: 000000023c410000 CR4: 00000000000607e0
      [  519.791207] Call Trace:
      [  519.793962]  [<ffffffffc0fe3961>] lu_context_refill+0x41/0x50 [obdclass]
      [  519.801466]  [<ffffffffc0fe39f4>] lu_env_refill+0x24/0x30 [obdclass]
      [  519.808579]  [<ffffffffc14cf831>] ofd_lvbo_init+0x2a1/0x7f0 [ofd]
      [  519.815426]  [<ffffffffc12aa0fd>] ldlm_server_completion_ast+0x5fd/0x980 [ptlrpc]
      [  519.823809]  [<ffffffffc12a9b00>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
      [  519.832181]  [<ffffffffc127c748>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
      [  519.839981]  [<ffffffffc12c3bf2>] ptlrpc_set_wait+0x72/0x790 [ptlrpc]
      [  519.847171]  [<ffffffffa401d75d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
      [  519.854957]  [<ffffffffc0fc1a79>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [  519.862942]  [<ffffffffc127c6a0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
      [  519.871032]  [<ffffffffc12ba472>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
      [  519.878452]  [<ffffffffc1281f25>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
      [  519.885847]  [<ffffffffc12833e1>] __ldlm_reprocess_all+0x101/0x340 [ptlrpc]
      [  519.893651]  [<ffffffffc1283986>] ldlm_reprocess_res+0x26/0x30 [ptlrpc]
      [  519.901043]  [<ffffffffc0cf4fb0>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
      [  519.909127]  [<ffffffffc1283960>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
      [  519.917499]  [<ffffffffc1283960>] ? ldlm_lock_mode_downgrade+0x320/0x320 [ptlrpc]
      [  519.925861]  [<ffffffffc0cf8345>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
      [  519.933945]  [<ffffffffc12839cc>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
      [  519.942416]  [<ffffffffc1296211>] target_recovery_thread+0xcd1/0x1160 [ptlrpc]
      [  519.950516]  [<ffffffffc1295540>] ? replay_request_or_update.isra.23+0x8c0/0x8c0 [ptlrpc]
      [  519.959660]  [<ffffffffa3ec1da1>] kthread+0xd1/0xe0
      [  519.965102]  [<ffffffffa3ec1cd0>] ? insert_kthread_work+0x40/0x40
      [  519.971919]  [<ffffffffa4575c37>] ret_from_fork_nospec_begin+0x21/0x21
      [  519.979204]  [<ffffffffa3ec1cd0>] ? insert_kthread_work+0x40/0x40
      [  519.986002] Code: ab 51 06 00 0f 1f 00 31 db eb 15 0f 1f 40 00 48 83 c3 08 48 81 fb 40 01 00 00 0f 84 9f 00 00 00 49 8b 45 10 4c 8b a3 e0 bf 10 c1 <48> 83 3c 18 00 75 dd 4d 85 e4 74 d8 41 8b 04 24 41 8b 55 00 85 
      [  520.007678] RIP  [<ffffffffc0fdeefc>] keys_fill+0x5c/0x180 [obdclass]
      [  520.014893]  RSP <ffff9b2f20b1bad0>
      [    0.000000] Initializing cgroup subsys cpuset
      [    0.000000] Initializing cgroup subsys cpu
      [    0.000000] Initializing cgroup subsys cpuacct
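
A possible reading of the register dump above, offered as a triage aid and not as a confirmed root cause: RAX holds 5a5a5a5a5a5a5a5a, and the instruction bytes at the fault marker (`<48> 83 3c 18 00`) decode to a compare against memory at RAX+RBX, so the fault is consistent with dereferencing a non-canonical pointer in keys_fill(). A repeated 0x5a byte is a common poison fill (the kernel slab allocator's POISON_INUSE is 0x5a, and Lustre is understood to fill freed allocations with 0x5a as well), which would suggest a freed or uninitialized context, but the report itself does not state this. The sketch below scans a pasted oops for general-purpose registers holding non-canonical values. The function names and the 48-bit addressing assumption are illustrative and do not come from any Lustre or kernel tool.

```python
import re

# General-purpose registers as printed in an x86_64 oops register dump.
REG_RE = re.compile(
    r"\b(RAX|RBX|RCX|RDX|RSI|RDI|RBP|R0[89]|R1[0-5]): ([0-9a-f]{16})\b"
)


def is_canonical(addr: int) -> bool:
    """True if bits 63..47 are all equal (assumes 48-bit virtual addressing)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


def is_repeated_byte(addr: int) -> bool:
    """True if all eight bytes are the same non-zero value (typical poison fill)."""
    raw = addr.to_bytes(8, "little")
    return raw[0] != 0 and len(set(raw)) == 1


def flag_noncanonical(oops_text: str) -> dict:
    """Map register name -> (value, looks_like_poison) for non-canonical values."""
    flagged = {}
    for name, hexval in REG_RE.findall(oops_text):
        value = int(hexval, 16)
        if not is_canonical(value):
            flagged[name] = (value, is_repeated_byte(value))
    return flagged


# Register lines copied from the oops in this report.
SAMPLE = """
[  519.727974] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: ffff9b2f20b1bfd8
[  519.735936] RDX: ffff9b2f20b1baf8 RSI: 0000000000000002 RDI: ffffffffc1044080
[  519.743898] RBP: ffff9b2f20b1baf0 R08: 0000000000000000 R09: ffff9b287fc07b00
[  519.751868] R10: ffffffffc14cf797 R11: ffff9b2f12860c00 R12: ffffffffc1044140
[  519.759836] R13: ffff9b2f076a3120 R14: 0000000000000013 R15: ffff9b2f15fe62c8
"""

if __name__ == "__main__":
    for reg, (value, poison) in flag_noncanonical(SAMPLE).items():
        note = "repeated-byte pattern" if poison else "non-canonical"
        print(f"{reg} = {value:#018x} ({note})")
```

Running the same check against the soak-5 console log would show whether both nodes faulted on the same pattern.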
      

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Sarah Liu (sarah)