Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11457

osd_oi_insert(): the FID is used by two objects

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0
    • Lustre 2.12.0, Lustre 2.12.2, Lustre 2.12.4, Lustre 2.12.5
    • 3
    • 9223372036854775807

    Description

      tag-2.11.55

      MDS crash

      [15650.670434] device-mapper: multipath: Failing path 8:96.^M
      [15650.765276] BUG: unable to handle kernel NULL pointer dereference at           (null)^M
      [15650.775741] IP: [<          (null)>]           (null)^M
      [15650.783081] PGD 0 ^M
      [15650.786948] Oops: 0010 [#1] SMP ^M
      [15650.792218] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
      [15650.934649] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
      [15650.952002] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [15650.966961] task: ffff8be6a1253f40 ti: ffff8be68a044000 task.ti: ffff8be68a044000^M
      [15650.977791] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)^M
      [15650.988693] RSP: 0018:ffff8be68a047b58  EFLAGS: 00010246^M
      [15650.997167] RAX: 0000000000000000 RBX: ffff8be68b820000 RCX: 0000000000000002^M
      [15651.007733] RDX: ffffffffc164c7b0 RSI: ffff8be68a047b60 RDI: ffff8be68b820008^M
      [15651.018326] RBP: ffff8be68a047b98 R08: 0000000000000004 R09: 0000000000000000^M
      [15651.028930] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8be26f9fab00^M
      [15651.039547] R13: ffff8be279a448a0 R14: ffff8be68a160000 R15: ffff8be68b820008^M
      [15651.050168] FS:  0000000000000000(0000) GS:ffff8be6ad980000(0000) knlGS:0000000000000000^M
      [15651.061880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
      [15651.070980] CR2: 0000000000000000 CR3: 000000042c3b6000 CR4: 00000000000607e0^M
      [15651.081660] Call Trace:^M
      [15651.087091]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
      [15651.098058]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
      [15651.108370]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
      [15651.117554]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
      [15651.127457]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
      [15651.137132]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
      [15651.146553]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
      [15651.155992]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
      [15651.166379]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
      [15651.177493]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
      [15651.188033]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
      [15651.199251]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
      [15651.209399]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
      [15651.218931]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
      [15651.228026]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
      [15651.237575]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
      [15651.248365]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
      [15651.256344]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.265688]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
      [15651.275475]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.284736] Code:  Bad RIP value.^M
      [15651.290946] RIP  [<          (null)>]           (null)^M
      [15651.299236]  RSP <ffff8be68a047b58>^M
      [15651.305543] CR2: 0000000000000000^M
      [15651.315778] ---[ end trace 4ae4238c00f9aeec ]---^M
      [15651.336386] Kernel panic - not syncing: Fatal exception^M
      [15651.344613] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)^M
      [15651.369289] ------------[ cut here ]------------^M
      [15651.376397] WARNING: CPU: 14 PID: 9491 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x65/0x70^M
      [15651.388915] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
      [15651.529620] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P      D    OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
      [15651.546472] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [15651.561156] Call Trace:^M
      [15651.566023]  <IRQ>  [<ffffffff8430e84e>] dump_stack+0x19/0x1b^M
      [15651.574646]  [<ffffffff83c91e18>] __warn+0xd8/0x100^M
      [15651.582224]  [<ffffffff83c91f5d>] warn_slowpath_null+0x1d/0x20^M
      [15651.590851]  [<ffffffff83c54e95>] native_smp_send_reschedule+0x65/0x70^M
      [15651.600279]  [<ffffffff83cddf81>] trigger_load_balance+0x191/0x280^M
      [15651.609280]  [<ffffffff83ccdc0a>] scheduler_tick+0x10a/0x150^M
      [15651.617702]  [<ffffffff83d01c10>] ? tick_sched_do_timer+0x50/0x50^M
      [15651.626619]  [<ffffffff83ca4f65>] update_process_times+0x65/0x80^M
      [15651.635416]  [<ffffffff83d01a10>] tick_sched_handle+0x30/0x70^M
      [15651.643916]  [<ffffffff83d01c49>] tick_sched_timer+0x39/0x80^M
      [15651.652315]  [<ffffffff83cbf7e6>] __hrtimer_run_queues+0xd6/0x260^M
      [15651.661210]  [<ffffffff83cbfd7f>] hrtimer_interrupt+0xaf/0x1d0^M
      [15651.669814]  [<ffffffff83c5847b>] local_apic_timer_interrupt+0x3b/0x60^M
      [15651.679184]  [<ffffffff84325063>] smp_apic_timer_interrupt+0x43/0x60^M
      [15651.688352]  [<ffffffff843217b2>] apic_timer_interrupt+0x162/0x170^M
      [15651.697316]  <EOI>  [<ffffffff84308c3d>] ? panic+0x1d5/0x21f^M
      [15651.705715]  [<ffffffff84308ba1>] ? panic+0x139/0x21f^M
      [15651.713430]  [<ffffffff84318745>] oops_end+0xc5/0xe0^M
      [15651.721020]  [<ffffffff8430807e>] no_context+0x285/0x2a8^M
      [15651.728984]  [<ffffffff84308115>] __bad_area_nosemaphore+0x74/0x1d1^M
      [15651.738014]  [<ffffffff84308286>] bad_area_nosemaphore+0x14/0x16^M
      [15651.746760]  [<ffffffff8431b6e0>] __do_page_fault+0x330/0x4f0^M
      [15651.755199]  [<ffffffff8431b8d5>] do_page_fault+0x35/0x90^M
      [15651.763264]  [<ffffffff84317758>] page_fault+0x28/0x30^M
      [15651.771013]  [<ffffffffc164c7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]^M
      [15651.781105]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
      [15651.791402]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
      [15651.801028]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
      [15651.809517]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
      [15651.818761]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
      [15651.827808]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
      [15651.836533]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
      [15651.845269]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
      [15651.854937]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
      [15651.865302]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
      [15651.875084]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
      [15651.885522]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
      [15651.894881]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
      [15651.903621]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
      [15651.911880]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
      [15651.920585]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
      [15651.930460]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
      [15651.937468]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.945794]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
      [15651.954564]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.962806] ---[ end trace 4ae4238c00f9aeed ]---^M
      
      

      Attachments

        Issue Links

          Activity

            [LU-11457] osd_oi_insert(): the FID is used by two objects
            pjones Peter Jones added a comment -

            Landed for 2.16

            pjones Peter Jones added a comment - Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51601/
            Subject: LU-11457 osd-ldiskfs: scrub FID reuse
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5c59f8551a96ec0b3bbb935a3173c75e878d9ff0

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51601/ Subject: LU-11457 osd-ldiskfs: scrub FID reuse Project: fs/lustre-release Branch: master Current Patch Set: Commit: 5c59f8551a96ec0b3bbb935a3173c75e878d9ff0

            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51601
            Subject: LU-11457 osd-ldiskfs: scrub FID reuse
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a2627b8c26d5b0d5580971f227d1cb11d8ae8669

            gerrit Gerrit Updater added a comment - "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51601 Subject: LU-11457 osd-ldiskfs: scrub FID reuse Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a2627b8c26d5b0d5580971f227d1cb11d8ae8669

            This was hit on a customer OST. It looks like this can happen if a block in the inode table overwrites another in-use block in the inode table, and then there are two inodes that have the same content (lma FID, blocks), but different O/<seq>/dX/<oid> pathnames. Alternately, it is possible that the original object was cleared by e2fsck (maybe due to duplicate block processing) and a new empty object was created in its place with the same FID (it looks like both happened). OI Scrub was modified to do a full O/ tree walk in LU-13124 and check multiply-linked objects, but in this case it should detect the path/FID mismatch and either:

            • replace the "new" object if it has no blocks, but "bad" object has blocks
            • remove the "bad" object has no blocks or the same blocks, it should be removed
            adilger Andreas Dilger added a comment - This was hit on a customer OST. It looks like this can happen if a block in the inode table overwrites another in-use block in the inode table, and then there are two inodes that have the same content ( lma FID, blocks), but different O/<seq>/dX/<oid> pathnames. Alternately, it is possible that the original object was cleared by e2fsck (maybe due to duplicate block processing) and a new empty object was created in its place with the same FID (it looks like both happened). OI Scrub was modified to do a full O/ tree walk in LU-13124 and check multiply-linked objects, but in this case it should detect the path/FID mismatch and either: replace the "new" object if it has no blocks, but "bad" object has blocks remove the "bad" object has no blocks or the same blocks, it should be removed
            lixi_wc Li Xi added a comment -

            We checked the inodes via debugfs and they showed that the inodes did not have the same FID in the LMA, and in fact two of the inodes were completely unused, so it isn't clear why this error was being printed at all:

            I am not sure whether this related, but the osd_add_dot_dotdot_internal() seems save the wrong child fid to "..".

            lixi_wc Li Xi added a comment - We checked the inodes via debugfs and they showed that the inodes did not have the same FID in the LMA, and in fact two of the inodes were completely unused, so it isn't clear why this error was being printed at all: I am not sure whether this related, but the osd_add_dot_dotdot_internal() seems save the wrong child fid to "..".
            sarah Sarah Liu added a comment -

            soak hit the problem on 2.12.5 testing after running for 4 days.

            [18137.800259] Pid: 5007, comm: mdt01_017 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Wed May 27 23:31:53 UTC 2020
            [18137.811351] Call Trace:
            [18137.814107]  [<ffffffffc107f8fc>] top_trans_wait_result+0xa9/0x158 [ptlrpc]
            [18137.822008]  [<ffffffffc105fb79>] top_trans_stop+0x4e9/0xa70 [ptlrpc]
            [18137.829352]  [<ffffffffc1522dec>] lod_trans_stop+0x25c/0x340 [lod]
            [18137.836307]  [<ffffffffc15d56ee>] mdd_trans_stop+0x2e/0x174 [mdd]
            [18137.843150]  [<ffffffffc15bdc73>] mdd_unlink+0x3f3/0xaa0 [mdd]
            [18137.849700]  [<ffffffffc148139d>] mdo_unlink+0x46/0x48 [mdt]
            [18137.856087]  [<ffffffffc14452bd>] mdt_reint_unlink+0xbed/0x14b0 [mdt]
            [18137.863361]  [<ffffffffc1449963>] mdt_reint_rec+0x83/0x210 [mdt]
            [18137.870109]  [<ffffffffc1426273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
            [18137.877464]  [<ffffffffc14316e7>] mdt_reint+0x67/0x140 [mdt]
            [18137.883828]  [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            [18137.891641]  [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [18137.900298]  [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [18137.907411]  [<ffffffff9f2c6321>] kthread+0xd1/0xe0
            [18137.912887]  [<ffffffff9f98ed37>] ret_from_fork_nospec_end+0x0/0x39
            [18137.919936]  [<ffffffffffffffff>] 0xffffffffffffffff
            [18137.925528] LNet: Service thread pid 4804 was inactive for 202.30s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            [18152.716405] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds
            [18152.727780] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 111 previous similar messages
            [18210.718946] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591370247/real 1591370247]  req@ffff8b87f443ec00 x1668654025156672/t0(0) o105->soaked-MDT0001@192.168.1.111@o2ib:15/16 lens 304/224 e 0 to 1 dl 1591370254 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
            [18210.749830] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 110 previous similar messages
            [18232.719942] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds
            [18232.731320] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 91 previous similar messages
            [18235.829192] Lustre: soaked-MDT0001: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
            [18235.841908] Lustre: soaked-MDT0001: Connection restored to soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib)
            [18235.853001] Lustre: Skipped 1 previous similar message
            [18236.004566] LustreError: 4796:0:(osd_oi.c:762:osd_oi_insert()) dm-1: the FID [0x2400134c5:0x8001:0x0] is used by two objects: 477626398/123258462 477626399/123258463
            [18236.053964] LNet: 4615:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib
            [18236.794778] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [18236.803570] IP: [<          (null)>]           (null)
            [18236.809248] PGD 0 
            [18236.811519] Oops: 0010 [#1] SMP 
            [18236.815153] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin mei_me cryptd mei sg ioatdma ipmi_si i2c_i801 lpc_ich ipmi_devintf joydev ipmi_msghandler pcspkr pcc_cpufreq wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci ttm igb mlx4_core(OE) ahci libsas libahci ptp mpt2sas drm devlink pps_core crct10dif_pclmul raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit
            [18236.927648] CPU: 31 PID: 4694 Comm: mdt_rdpg01_001 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.18.1.el7_lustre.x86_64 #1
            [18236.941923] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [18236.954462] task: ffff8b8807dae2a0 ti: ffff8b88072f0000 task.ti: ffff8b88072f0000
            [18236.962824] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
            [18236.971207] RSP: 0018:ffff8b88072f3b68  EFLAGS: 00010246
            [18236.977142] RAX: 0000000000000000 RBX: ffff8b87ff710000 RCX: 0000000000000002
            [18236.985109] RDX: ffffffffc12e2ab0 RSI: ffff8b88072f3b70 RDI: ffff8b87ff710008
            [18236.993080] RBP: ffff8b88072f3ba8 R08: 0000000000000000 R09: 0000000000000001
            [18237.001053] R10: ffff8b842dbb5800 R11: 00000000007fffff R12: ffff8b87fef0f000
            [18237.009011] R13: ffff8b87eb34f628 R14: ffff8b88283e6800 R15: ffff8b87ff710008
            [18237.016993] FS:  0000000000000000(0000) GS:ffff8b882e5c0000(0000) knlGS:0000000000000000
            [18237.026029] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [18237.032449] CR2: 0000000000000000 CR3: 00000006c0a10000 CR4: 00000000000607e0
            [18237.040420] Call Trace:
            [18237.043201]  [<ffffffffc12e0eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [18237.051478]  [<ffffffffc12e1087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
            [18237.059104]  [<ffffffffc1538e07>] lod_it_load+0x27/0x90 [lod]
            [18237.065587]  [<ffffffffc0d0e258>] dt_index_walk+0xf8/0x430 [obdclass]
            [18237.072807]  [<ffffffffc15c0c60>] ? mdd_object_lock+0xe0/0xe0 [mdd]
            [18237.079818]  [<ffffffffc15c25bf>] mdd_readpage+0x25f/0x5a0 [mdd]
            [18237.086558]  [<ffffffffc141f88a>] mdt_readpage+0x69a/0x8e0 [mdt]
            [18237.093384]  [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            [18237.101137]  [<ffffffffc1027121>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [18237.109611]  [<ffffffffc0bc0bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [18237.117558]  [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [18237.126192]  [<ffffffffc0fef2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [18237.133784]  [<ffffffff9f2d3a33>] ? __wake_up+0x13/0x20
            [18237.139689]  [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [18237.146704]  [<ffffffff9f980e02>] ? __schedule+0x402/0x840
            [18237.152899]  [<ffffffffc0ff52c0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [18237.161175]  [<ffffffff9f2c6321>] kthread+0xd1/0xe0
            [18237.166629]  [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40
            [18237.173457]  [<ffffffff9f98ed37>] ret_from_fork_nospec_begin+0x21/0x21
            [18237.180748]  [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40
            [18237.187548] Code:  Bad RIP value.
            [18237.191284] RIP  [<          (null)>]           (null)
            [18237.197031]  RSP <ffff8b88072f3b68>
            [18237.200923] CR2: 0000000000000000
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            [    0.000000] Linux version 3.10.0-1062.18.1.el7_lustre.x86_64 (jenkins@trevis-307-el7-x8664-1.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Wed May 27 23:31:53 UTC 2020
            [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.18.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869816K
            [    0.000000] e820: BIOS-provided physical RAM map:
            [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
            [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008efff] usable
            [    0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000009ffff] reserved
            [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
            [    0.000000] BIOS-e820: [mem 0x000000002b000000-0x000000003516dfff] usable
            [    0.000000] BIOS-e820: [mem 0x00000000bb3c7000-0x00000000bdd2efff] reserved
            
            
            sarah Sarah Liu added a comment - soak hit the problem on 2.12.5 testing after running for 4 days. [18137.800259] Pid: 5007, comm: mdt01_017 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Wed May 27 23:31:53 UTC 2020 [18137.811351] Call Trace: [18137.814107] [<ffffffffc107f8fc>] top_trans_wait_result+0xa9/0x158 [ptlrpc] [18137.822008] [<ffffffffc105fb79>] top_trans_stop+0x4e9/0xa70 [ptlrpc] [18137.829352] [<ffffffffc1522dec>] lod_trans_stop+0x25c/0x340 [lod] [18137.836307] [<ffffffffc15d56ee>] mdd_trans_stop+0x2e/0x174 [mdd] [18137.843150] [<ffffffffc15bdc73>] mdd_unlink+0x3f3/0xaa0 [mdd] [18137.849700] [<ffffffffc148139d>] mdo_unlink+0x46/0x48 [mdt] [18137.856087] [<ffffffffc14452bd>] mdt_reint_unlink+0xbed/0x14b0 [mdt] [18137.863361] [<ffffffffc1449963>] mdt_reint_rec+0x83/0x210 [mdt] [18137.870109] [<ffffffffc1426273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] [18137.877464] [<ffffffffc14316e7>] mdt_reint+0x67/0x140 [mdt] [18137.883828] [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc] [18137.891641] [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [18137.900298] [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [18137.907411] [<ffffffff9f2c6321>] kthread+0xd1/0xe0 [18137.912887] [<ffffffff9f98ed37>] ret_from_fork_nospec_end+0x0/0x39 [18137.919936] [<ffffffffffffffff>] 0xffffffffffffffff [18137.925528] LNet: Service thread pid 4804 was inactive for 202.30s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. [18152.716405] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds [18152.727780] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 111 previous similar messages [18210.718946] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591370247/real 1591370247] req@ffff8b87f443ec00 x1668654025156672/t0(0) o105->soaked-MDT0001@192.168.1.111@o2ib:15/16 lens 304/224 e 0 to 1 dl 1591370254 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 [18210.749830] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 110 previous similar messages [18232.719942] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds [18232.731320] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 91 previous similar messages [18235.829192] Lustre: soaked-MDT0001: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID [18235.841908] Lustre: soaked-MDT0001: Connection restored to soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib) [18235.853001] Lustre: Skipped 1 previous similar message [18236.004566] LustreError: 4796:0:(osd_oi.c:762:osd_oi_insert()) dm-1: the FID [0x2400134c5:0x8001:0x0] is used by two objects: 477626398/123258462 477626399/123258463 [18236.053964] LNet: 4615:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib [18236.794778] BUG: unable to handle kernel NULL pointer dereference at (null) [18236.803570] IP: [< (null)>] (null) [18236.809248] PGD 0 [18236.811519] Oops: 0010 [#1] SMP [18236.815153] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin mei_me cryptd mei sg ioatdma ipmi_si i2c_i801 lpc_ich ipmi_devintf joydev ipmi_msghandler pcspkr pcc_cpufreq wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci ttm igb mlx4_core(OE) ahci libsas libahci ptp mpt2sas drm devlink pps_core crct10dif_pclmul raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit [18236.927648] CPU: 31 PID: 4694 Comm: mdt_rdpg01_001 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.18.1.el7_lustre.x86_64 #1 [18236.941923] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [18236.954462] task: ffff8b8807dae2a0 ti: ffff8b88072f0000 task.ti: ffff8b88072f0000 [18236.962824] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [18236.971207] RSP: 0018:ffff8b88072f3b68 EFLAGS: 00010246 [18236.977142] RAX: 0000000000000000 RBX: ffff8b87ff710000 RCX: 0000000000000002 [18236.985109] RDX: ffffffffc12e2ab0 RSI: ffff8b88072f3b70 RDI: ffff8b87ff710008 [18236.993080] RBP: ffff8b88072f3ba8 R08: 0000000000000000 R09: 0000000000000001 [18237.001053] R10: ffff8b842dbb5800 R11: 00000000007fffff R12: ffff8b87fef0f000 [18237.009011] R13: ffff8b87eb34f628 R14: ffff8b88283e6800 R15: ffff8b87ff710008 [18237.016993] FS: 0000000000000000(0000) GS:ffff8b882e5c0000(0000) knlGS:0000000000000000 [18237.026029] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [18237.032449] CR2: 0000000000000000 CR3: 00000006c0a10000 CR4: 00000000000607e0 [18237.040420] Call Trace: [18237.043201] [<ffffffffc12e0eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [18237.051478] [<ffffffffc12e1087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [18237.059104] [<ffffffffc1538e07>] lod_it_load+0x27/0x90 [lod] [18237.065587] [<ffffffffc0d0e258>] dt_index_walk+0xf8/0x430 [obdclass] [18237.072807] [<ffffffffc15c0c60>] ? mdd_object_lock+0xe0/0xe0 [mdd] [18237.079818] [<ffffffffc15c25bf>] mdd_readpage+0x25f/0x5a0 [mdd] [18237.086558] [<ffffffffc141f88a>] mdt_readpage+0x69a/0x8e0 [mdt] [18237.093384] [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc] [18237.101137] [<ffffffffc1027121>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [18237.109611] [<ffffffffc0bc0bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [18237.117558] [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [18237.126192] [<ffffffffc0fef2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [18237.133784] [<ffffffff9f2d3a33>] ? __wake_up+0x13/0x20 [18237.139689] [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [18237.146704] [<ffffffff9f980e02>] ? __schedule+0x402/0x840 [18237.152899] [<ffffffffc0ff52c0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [18237.161175] [<ffffffff9f2c6321>] kthread+0xd1/0xe0 [18237.166629] [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40 [18237.173457] [<ffffffff9f98ed37>] ret_from_fork_nospec_begin+0x21/0x21 [18237.180748] [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40 [18237.187548] Code: Bad RIP value. [18237.191284] RIP [< (null)>] (null) [18237.197031] RSP <ffff8b88072f3b68> [18237.200923] CR2: 0000000000000000 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.10.0-1062.18.1.el7_lustre.x86_64 (jenkins@trevis-307-el7-x8664-1.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Wed May 27 23:31:53 UTC 2020 [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.18.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869816K [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008efff] usable [ 0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved [ 0.000000] BIOS-e820: [mem 0x000000002b000000-0x000000003516dfff] usable [ 0.000000] BIOS-e820: [mem 0x00000000bb3c7000-0x00000000bdd2efff] reserved
            sarah Sarah Liu added a comment -

            Hit this issue again on lustre-b2_12-next #57
            soak-11 console:

            [16422.066032] Lustre: 79110:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 425 previous similar messages
            [16469.920777] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 225s  req@ffff89052d9cf980 x1656840860166528/t0(0) o103->s
            oaked-MDT0002-mdtlov_UUID@192.168.1.110@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
            [16469.946879] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) Skipped 370 previous similar messages
            [16511.443090] Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
            [16511.453496] Lustre: 79109:0:(service.c:1541:ptlrpc_at_check_timed()) earlyQ=3796 reqQ=17207 recA=0, svcEst=600, delay=85403
            [16511.466032] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-2s), not sending early reply. Consider increasing at_ear
            ly_margin (5)?  req@ffff890983671200 x1656840826067648/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:154/0 lens 328/0 e 0 to 0 dl 1580100394 ref 2 fl
             New:/0/ffffffff rc 0/-1
            [16511.499803] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 109 previous similar messages
            [16589.022832] Lustre: soaked-MDT0003: Client 58fa13bb-1d12-d81e-9977-655c7b0a359d (at 192.168.1.117@o2ib) reconnecting
            [16665.657814] Lustre: soaked-MDT0003: Client 2d96e5f6-af44-1f55-5648-3c5026213dae (at 192.168.1.130@o2ib) reconnecting
            [16751.557870] Lustre: soaked-MDT0003: Client d23f5b34-b590-b62f-81cf-55bba7beeb62 (at 192.168.1.142@o2ib) reconnecting
            [16756.583622] Lustre: soaked-MDT0003: Client bd3af862-653c-bd66-2072-7b69d8812dd7 (at 192.168.1.140@o2ib) reconnecting
            [16756.595384] Lustre: Skipped 1 previous similar message
            [16781.901025] Lustre: soaked-MDT0003: Client 55eecb6a-6cc1-b7f2-e17b-d2ce51f4ff57 (at 192.168.1.141@o2ib) reconnecting
            [16785.356557] perf: interrupt took too long (3961 > 3942), lowering kernel.perf_event_max_sample_rate to 50000
            [16840.006883] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1580100124/real 1580100124]  req@
            ffff890592e0d580 x1656840776727360/t0(0) o103->soaked-MDT0000-osp-MDT0003@192.168.1.108@o2ib:17/18 lens 328/224 e 7 to 1 dl 1580100725 ref 1 fl Rpc:X/2/ffffffff 
            rc -11/-1
            [16840.012891] Lustre: soaked-MDT0000-osp-MDT0003: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will 
            wait for recovery to complete
            [16840.057838] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1124 previous similar messages
            [16840.104230] Lustre: soaked-MDT0000-osp-MDT0003: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
            [16840.115694] Lustre: Skipped 8 previous similar messages
            [16844.361410] Lustre: soaked-MDT0003: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID
            [16844.374044] Lustre: Skipped 1 previous similar message
            [16879.242891] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
            [16879.242891]   req@ffff890969d8e300 x1656840800674880/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:529/0 lens 328/0 e 0 to 0 dl 1580100769 ref 2 f
            l New:/2/ffffffff rc 0/-1
            [16879.274271] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 34712 previous similar messages
            [16884.947622] LustreError: 79136:0:(osd_oi.c:762:osd_oi_insert()) dm-5: the FID [0x2c00543c1:0xe1d:0x0] is used by two objects: 212861283/3086025257 233832815/3
            086025258
            [16884.964548] LNet: 78257:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib
            [16887.446546] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [16887.455308] IP: [<          (null)>]           (null)
            [16887.460965] PGD 0
            [16887.463234] Oops: 0010 [#1] SMP
            [16887.466862] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov
            (OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE)
            rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) sb_edac intel_powe
            rclamp coretemp intel_rapl iosf_mbi kvm iTCO_wdt irqbypass iTCO_vendor_support crc32_pclmul ghash_clmulni_intel aesni_intel lrw ipmi_ssif gf128mul glue_helper me
            i_me dm_round_robin ablk_helper i2c_i801 mei ipmi_si lpc_ich cryptd ioatdma joydev pcspkr sg ipmi_devintf ipmi_msghandler wmi pcc_cpufreq dm_multipath dm_mod aut
            h_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect mlx4_core(OE) igb sysimgblt fb_sys
            _fops isci mpt3sas ttm ahci libsas libahci ptp crct10dif_pclmul drm devlink pps_core raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_tran
            sport_sas drm_panel_orientation_quirks i2c_algo_bit
            [16887.582758] CPU: 8 PID: 83890 Comm: mdt_out01_011 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.9.1.el7_lustre.x86_64 #1
            [16887.596833] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [16887.609356] task: ffff8905a85d41c0 ti: ffff8905765b0000 task.ti: ffff8905765b0000
            [16887.617710] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
            [16887.626075] RSP: 0018:ffff8905765b3b68  EFLAGS: 00010246
            [16887.631996] RAX: 0000000000000000 RBX: ffff89097aff8000 RCX: 0000000000000002
            [16887.639962] RDX: ffffffffc1564ab0 RSI: ffff8905765b3b70 RDI: ffff89097aff8008
            [16887.647926] RBP: ffff8905765b3ba8 R08: 0000000000000000 R09: 0000000000000001
            [16887.655891] R10: 0000000200000008 R11: 00000000007fffff R12: ffff8909a93ded00
            [16887.663863] R13: ffff89096ddb15d8 R14: ffff890981779a00 R15: ffff89097aff8008
            [16887.671829] FS:  0000000000000000(0000) GS:ffff8909ae200000(0000) knlGS:0000000000000000
            [16887.680870] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [16887.687285] CR2: 0000000000000000 CR3: 0000000604210000 CR4: 00000000000607e0
            [16887.695251] Call Trace:
            [16887.698006]  [<ffffffffc1562eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [16887.706274]  [<ffffffffc1563087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
            [16887.713907]  [<ffffffffc0dba208>] dt_index_walk+0xf8/0x430 [obdclass]
            [16887.721135]  [<ffffffffc0dba540>] ? dt_index_walk+0x430/0x430 [obdclass]
            [16887.728649]  [<ffffffffc0dbae34>] dt_index_read+0x394/0x6a0 [obdclass]
            [16887.736047]  [<ffffffffc10f4e42>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
            [16887.743503]  [<ffffffffc10f99ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            [16887.751242]  [<ffffffffc10d3101>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [16887.759709]  [<ffffffffc0c6ebde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [16887.767639]  [<ffffffffc109e47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [16887.776256]  [<ffffffffc109b295>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [16887.783833]  [<ffffffff892d3903>] ? __wake_up+0x13/0x20
            [16887.789723]  [<ffffffffc10a1de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [16887.796776]  [<ffffffffc10a12b0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [16887.805038]  [<ffffffff892c61f1>] kthread+0xd1/0xe0
            [16887.810487]  [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40
            [16887.817293]  [<ffffffff8998dd37>] ret_from_fork_nospec_begin+0x21/0x21
            [16887.824581]  [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40
            [16887.831383] Code:  Bad RIP value.
            [16887.835102] RIP  [<          (null)>]           (null)
            [16887.835102] RIP  [<          (null)>]           (null)
            [16887.840852]  RSP <ffff8905765b3b68>
            [16887.844734] CR2: 0000000000000000
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            [    0.000000] Linux version 3.10.0-1062.9.1.el7_lustre.x86_64 (jenkins@trevis-310-el7-x8664-6.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Thu Jan 23 14:10:41 UTC 2020
            [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.9.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869812K
            [    0.000000] e820: BIOS-provided physical RAM map:
            [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
            [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008ffff] usable
            [    0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] reserved
            

            core dump can be found under /scratch/dumps/soak-11.spirit.whamcloud.com/10.10.1.111-2020-01-27-04\:53\:21/

            sarah Sarah Liu added a comment - Hit this issue again on lustre-b2_12-next #57 soak-11 console: [16422.066032] Lustre: 79110:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 425 previous similar messages [16469.920777] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 225s req@ffff89052d9cf980 x1656840860166528/t0(0) o103->s oaked-MDT0002-mdtlov_UUID@192.168.1.110@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1 [16469.946879] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) Skipped 370 previous similar messages [16511.443090] Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound). [16511.453496] Lustre: 79109:0:(service.c:1541:ptlrpc_at_check_timed()) earlyQ=3796 reqQ=17207 recA=0, svcEst=600, delay=85403 [16511.466032] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-2s), not sending early reply. Consider increasing at_ear ly_margin (5)? req@ffff890983671200 x1656840826067648/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:154/0 lens 328/0 e 0 to 0 dl 1580100394 ref 2 fl New:/0/ffffffff rc 0/-1 [16511.499803] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 109 previous similar messages [16589.022832] Lustre: soaked-MDT0003: Client 58fa13bb-1d12-d81e-9977-655c7b0a359d (at 192.168.1.117@o2ib) reconnecting [16665.657814] Lustre: soaked-MDT0003: Client 2d96e5f6-af44-1f55-5648-3c5026213dae (at 192.168.1.130@o2ib) reconnecting [16751.557870] Lustre: soaked-MDT0003: Client d23f5b34-b590-b62f-81cf-55bba7beeb62 (at 192.168.1.142@o2ib) reconnecting [16756.583622] Lustre: soaked-MDT0003: Client bd3af862-653c-bd66-2072-7b69d8812dd7 (at 192.168.1.140@o2ib) reconnecting [16756.595384] Lustre: Skipped 1 previous similar message [16781.901025] Lustre: soaked-MDT0003: Client 55eecb6a-6cc1-b7f2-e17b-d2ce51f4ff57 (at 192.168.1.141@o2ib) reconnecting [16785.356557] perf: interrupt took too long (3961 > 3942), lowering kernel.perf_event_max_sample_rate to 50000 [16840.006883] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1580100124/real 1580100124] req@ ffff890592e0d580 x1656840776727360/t0(0) o103->soaked-MDT0000-osp-MDT0003@192.168.1.108@o2ib:17/18 lens 328/224 e 7 to 1 dl 1580100725 ref 1 fl Rpc:X/2/ffffffff rc -11/-1 [16840.012891] Lustre: soaked-MDT0000-osp-MDT0003: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete [16840.057838] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1124 previous similar messages [16840.104230] Lustre: soaked-MDT0000-osp-MDT0003: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib) [16840.115694] Lustre: Skipped 8 previous similar messages [16844.361410] Lustre: soaked-MDT0003: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID [16844.374044] Lustre: Skipped 1 previous similar message [16879.242891] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply [16879.242891] req@ffff890969d8e300 x1656840800674880/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:529/0 lens 328/0 e 0 to 0 dl 1580100769 ref 2 f l New:/2/ffffffff rc 0/-1 [16879.274271] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 34712 previous similar messages [16884.947622] LustreError: 79136:0:(osd_oi.c:762:osd_oi_insert()) dm-5: the FID [0x2c00543c1:0xe1d:0x0] is used by two objects: 212861283/3086025257 233832815/3 086025258 [16884.964548] LNet: 78257:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib [16887.446546] BUG: unable to handle kernel NULL pointer dereference at (null) [16887.455308] IP: [< (null)>] (null) [16887.460965] PGD 0 [16887.463234] Oops: 0010 [#1] SMP [16887.466862] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov (OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) sb_edac intel_powe rclamp coretemp intel_rapl iosf_mbi kvm iTCO_wdt irqbypass iTCO_vendor_support crc32_pclmul ghash_clmulni_intel aesni_intel lrw ipmi_ssif gf128mul glue_helper me i_me dm_round_robin ablk_helper i2c_i801 mei ipmi_si lpc_ich cryptd ioatdma joydev pcspkr sg ipmi_devintf ipmi_msghandler wmi pcc_cpufreq dm_multipath dm_mod aut h_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect mlx4_core(OE) igb sysimgblt fb_sys _fops isci mpt3sas ttm ahci libsas libahci ptp crct10dif_pclmul drm devlink pps_core raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_tran sport_sas drm_panel_orientation_quirks i2c_algo_bit [16887.582758] CPU: 8 PID: 83890 Comm: mdt_out01_011 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.9.1.el7_lustre.x86_64 #1 [16887.596833] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [16887.609356] task: ffff8905a85d41c0 ti: ffff8905765b0000 task.ti: ffff8905765b0000 [16887.617710] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [16887.626075] RSP: 0018:ffff8905765b3b68 EFLAGS: 00010246 [16887.631996] RAX: 0000000000000000 RBX: ffff89097aff8000 RCX: 0000000000000002 [16887.639962] RDX: ffffffffc1564ab0 RSI: ffff8905765b3b70 RDI: ffff89097aff8008 [16887.647926] RBP: ffff8905765b3ba8 R08: 0000000000000000 R09: 0000000000000001 [16887.655891] R10: 0000000200000008 R11: 00000000007fffff R12: ffff8909a93ded00 [16887.663863] R13: ffff89096ddb15d8 R14: ffff890981779a00 R15: ffff89097aff8008 [16887.671829] FS: 0000000000000000(0000) GS:ffff8909ae200000(0000) knlGS:0000000000000000 [16887.680870] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [16887.687285] CR2: 0000000000000000 CR3: 0000000604210000 CR4: 00000000000607e0 [16887.695251] Call Trace: [16887.698006] [<ffffffffc1562eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [16887.706274] [<ffffffffc1563087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [16887.713907] [<ffffffffc0dba208>] dt_index_walk+0xf8/0x430 [obdclass] [16887.721135] [<ffffffffc0dba540>] ? dt_index_walk+0x430/0x430 [obdclass] [16887.728649] [<ffffffffc0dbae34>] dt_index_read+0x394/0x6a0 [obdclass] [16887.736047] [<ffffffffc10f4e42>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [16887.743503] [<ffffffffc10f99ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] [16887.751242] [<ffffffffc10d3101>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [16887.759709] [<ffffffffc0c6ebde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [16887.767639] [<ffffffffc109e47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [16887.776256] [<ffffffffc109b295>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [16887.783833] [<ffffffff892d3903>] ? __wake_up+0x13/0x20 [16887.789723] [<ffffffffc10a1de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [16887.796776] [<ffffffffc10a12b0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [16887.805038] [<ffffffff892c61f1>] kthread+0xd1/0xe0 [16887.810487] [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40 [16887.817293] [<ffffffff8998dd37>] ret_from_fork_nospec_begin+0x21/0x21 [16887.824581] [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40 [16887.831383] Code: Bad RIP value. [16887.835102] RIP [< (null)>] (null) [16887.835102] RIP [< (null)>] (null) [16887.840852] RSP <ffff8905765b3b68> [16887.844734] CR2: 0000000000000000 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.10.0-1062.9.1.el7_lustre.x86_64 (jenkins@trevis-310-el7-x8664-6.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Thu Jan 23 14:10:41 UTC 2020 [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.9.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869812K [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008ffff] usable [ 0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] reserved core dump can be found under /scratch/dumps/soak-11.spirit.whamcloud.com/10.10.1.111-2020-01-27-04\:53\:21/
            ys Yang Sheng added a comment -

            From vmcore, looks like some thing really wrong:

            .....
            crash> inode ffff8da86e8e5d08 -x
            struct inode {
              i_mode = 0x41ed,
              i_opflags = 0x0,
              i_uid = {
                val = 0x21b
              },
              i_gid = {
                val = 0x21b
              },
              i_flags = 0x80,
              i_acl = 0x0,
              i_default_acl = 0xffffffffffffffff,
              i_op = 0xffffffffc1178d00 <ldiskfs_file_inode_operations>,
              i_sb = 0xffff8da5797df800,
              i_mapping = 0xffff8da86e8e5e58,
              i_security = 0x0,
              i_ino = 0x1c78042b,
              {
                i_nlink = 0x1,
                __i_nlink = 0x1
              },
              i_rdev = 0x0,
              i_size = 0x0,
              i_atime = {
                tv_sec = 0x5cd8196d,
                tv_nsec = 0x0
              },
              i_mtime = {
                tv_sec = 0x5cd8196d,
                tv_nsec = 0x0
              },
              i_ctime = {
                tv_sec = 0x5cd8196d,
                tv_nsec = 0x0
              },
              i_lock = {
                {
                  rlock = {
                    raw_lock = {
                      val = {
                        counter = 0x0
                      }
                    }
                  }
                }
              },
              i_bytes = 0x0,
              i_blkbits = 0xc,
              i_blocks = 0x0,
              i_state = 0x0,
              i_mutex = {
                count = {
                  counter = 0x1
                },
                wait_lock = {
                  {
                    rlock = {
                      raw_lock = {
                        val = {
                          counter = 0x0
                        }
                      }
                    }
                  }
                },
                wait_list = {
                  next = 0xffff8da86e8e5db8,
                  prev = 0xffff8da86e8e5db8
                },
                owner = 0x0,
                {
                  osq = {
                    tail = {
                      counter = 0x0
                    }
                  },
                  __UNIQUE_ID_rh_kabi_hide3 = {
                    spin_mlock = 0x0
                  },
                  {<No data fields>}
                }
              },
              dirtied_when = 0x0,
              i_hash = {
                next = 0x0,
                pprev = 0xffffab0282849760
              },
              i_io_list = {
                next = 0xffff8da86e8e5df0,
                prev = 0xffff8da86e8e5df0
              },
              i_lru = {
                next = 0xffff8da86e8e5e00,
                prev = 0xffff8da86e8e5e00
              },
              i_sb_list = {
                next = 0xffff8da8555f27a0,
                prev = 0xffff8da86ea655b0
              },
              {
                i_dentry = {
                  first = 0x0
                },
                i_rcu = {
                  next = 0x0,
                  func = 0x0
                }
              },
              i_version = 0x1,
              i_count = {
                counter = 0x1
              },
              i_dio_count = {
                counter = 0x0
              },
              i_writecount = {
                counter = 0x0
              },
              i_fop = 0xffffffffc1178dc0 <ldiskfs_file_operations>,
              i_flock = 0x0,
              i_data = {
                host = 0xffff8da86e8e5d08,
                page_tree = {
                  rh_reserved_height = 0x0,
                  gfp_mask = 0x20,
                  rnode = 0x0
                },
                tree_lock = {
                  {
                    rlock = {
                      raw_lock = {
                        val = {
                          counter = 0x0
                        }
                      }
                    }
                  }
                },
                {
                  i_mmap_writable = {
                    counter = 0x0
                  },
                  __UNIQUE_ID_rh_kabi_hide17 = {
                    i_mmap_writable = 0x0
                  },
                  {<No data fields>}
                },
                i_mmap = {
                  rb_node = 0x0
                },
                i_mmap_nonlinear = {
                  next = 0xffff8da86e8e5e80,
                  prev = 0xffff8da86e8e5e80
                },
                i_mmap_mutex = {
                  count = {
                    counter = 0x1
                  },
                  wait_lock = {
                    {
                      rlock = {
                        raw_lock = {
                          val = {
                            counter = 0x0
                          }
                        }
                      }
                    }
                  },
                  wait_list = {
                    next = 0xffff8da86e8e5e98,
                    prev = 0xffff8da86e8e5e98
                  },
                  owner = 0x0,
                  {
                    osq = {
                      tail = {
                        counter = 0x0
                      }
                    },
                    __UNIQUE_ID_rh_kabi_hide3 = {
                      spin_mlock = 0x0
                    },
                    {<No data fields>}
                  }
                },
                nrpages = 0x0,
                nrexceptional = 0x0,
                writeback_index = 0x0,
                a_ops = 0xffffffffc1178b20 <ldiskfs_journalled_aops>,
                flags = 0x200da,
                backing_dev_info = 0xffff8da8e3673b20,
                private_lock = {
                  {
                    rlock = {
                      raw_lock = {
                        val = {
                          counter = 0x0
                        }
                      }
                    }
                  }
                },
                private_list = {
                  next = 0xffff8da86e8e5ef0,
                  prev = 0xffff8da86e8e5ef0
                },
                private_data = 0x0
              },
              i_dquot = {0xffff8da8e5124100, 0xffff8da8e5124200, 0x0},
              i_devices = {
                next = 0xffff8da86e8e5f20,
                prev = 0xffff8da86e8e5f20
              },
              {
                i_pipe = 0x0,
                i_bdev = 0x0,
                i_cdev = 0x0
              },
              i_generation = 0x3f58114f,
              i_fsnotify_mask = 0x0,
              {
                i_fsnotify_marks = 0x0,
                __UNIQUE_ID_rh_kabi_hide18 = {
                  i_fsnotify_marks = {
                    first = 0x0
                  }
                },
                {<No data fields>}
              },
              i_readcount = {
                counter = 0x0
              },
              i_private = 0x0
            }
            ......
            

            The i_mode indicated this inode is a directory(0x4000). But the i_op & i_fop set to REG operations. Looks like some corner case causes such situation?

            Thanks,
            YangSheng

            ys Yang Sheng added a comment - From vmcore, looks like some thing really wrong: ..... crash> inode ffff8da86e8e5d08 -x struct inode { i_mode = 0x41ed, i_opflags = 0x0, i_uid = { val = 0x21b }, i_gid = { val = 0x21b }, i_flags = 0x80, i_acl = 0x0, i_default_acl = 0xffffffffffffffff, i_op = 0xffffffffc1178d00 <ldiskfs_file_inode_operations>, i_sb = 0xffff8da5797df800, i_mapping = 0xffff8da86e8e5e58, i_security = 0x0, i_ino = 0x1c78042b, { i_nlink = 0x1, __i_nlink = 0x1 }, i_rdev = 0x0, i_size = 0x0, i_atime = { tv_sec = 0x5cd8196d, tv_nsec = 0x0 }, i_mtime = { tv_sec = 0x5cd8196d, tv_nsec = 0x0 }, i_ctime = { tv_sec = 0x5cd8196d, tv_nsec = 0x0 }, i_lock = { { rlock = { raw_lock = { val = { counter = 0x0 } } } } }, i_bytes = 0x0, i_blkbits = 0xc, i_blocks = 0x0, i_state = 0x0, i_mutex = { count = { counter = 0x1 }, wait_lock = { { rlock = { raw_lock = { val = { counter = 0x0 } } } } }, wait_list = { next = 0xffff8da86e8e5db8, prev = 0xffff8da86e8e5db8 }, owner = 0x0, { osq = { tail = { counter = 0x0 } }, __UNIQUE_ID_rh_kabi_hide3 = { spin_mlock = 0x0 }, {<No data fields>} } }, dirtied_when = 0x0, i_hash = { next = 0x0, pprev = 0xffffab0282849760 }, i_io_list = { next = 0xffff8da86e8e5df0, prev = 0xffff8da86e8e5df0 }, i_lru = { next = 0xffff8da86e8e5e00, prev = 0xffff8da86e8e5e00 }, i_sb_list = { next = 0xffff8da8555f27a0, prev = 0xffff8da86ea655b0 }, { i_dentry = { first = 0x0 }, i_rcu = { next = 0x0, func = 0x0 } }, i_version = 0x1, i_count = { counter = 0x1 }, i_dio_count = { counter = 0x0 }, i_writecount = { counter = 0x0 }, i_fop = 0xffffffffc1178dc0 <ldiskfs_file_operations>, i_flock = 0x0, i_data = { host = 0xffff8da86e8e5d08, page_tree = { rh_reserved_height = 0x0, gfp_mask = 0x20, rnode = 0x0 }, tree_lock = { { rlock = { raw_lock = { val = { counter = 0x0 } } } } }, { i_mmap_writable = { counter = 0x0 }, __UNIQUE_ID_rh_kabi_hide17 = { i_mmap_writable = 0x0 }, {<No data fields>} }, i_mmap = { rb_node = 0x0 }, i_mmap_nonlinear = { next = 0xffff8da86e8e5e80, prev = 0xffff8da86e8e5e80 }, i_mmap_mutex = { count = { counter = 0x1 }, wait_lock = { { rlock = { raw_lock = { val = { counter = 0x0 } } } } }, wait_list = { next = 0xffff8da86e8e5e98, prev = 0xffff8da86e8e5e98 }, owner = 0x0, { osq = { tail = { counter = 0x0 } }, __UNIQUE_ID_rh_kabi_hide3 = { spin_mlock = 0x0 }, {<No data fields>} } }, nrpages = 0x0, nrexceptional = 0x0, writeback_index = 0x0, a_ops = 0xffffffffc1178b20 <ldiskfs_journalled_aops>, flags = 0x200da, backing_dev_info = 0xffff8da8e3673b20, private_lock = { { rlock = { raw_lock = { val = { counter = 0x0 } } } } }, private_list = { next = 0xffff8da86e8e5ef0, prev = 0xffff8da86e8e5ef0 }, private_data = 0x0 }, i_dquot = {0xffff8da8e5124100, 0xffff8da8e5124200, 0x0}, i_devices = { next = 0xffff8da86e8e5f20, prev = 0xffff8da86e8e5f20 }, { i_pipe = 0x0, i_bdev = 0x0, i_cdev = 0x0 }, i_generation = 0x3f58114f, i_fsnotify_mask = 0x0, { i_fsnotify_marks = 0x0, __UNIQUE_ID_rh_kabi_hide18 = { i_fsnotify_marks = { first = 0x0 } }, {<No data fields>} }, i_readcount = { counter = 0x0 }, i_private = 0x0 } ...... The i_mode indicated this inode is a directory(0x4000). But the i_op & i_fop set to REG operations. Looks like some corner case causes such situation? Thanks, YangSheng
            sarah Sarah Liu added a comment -

            I uploaded the vmcore again, please try if it works. Or if you can access spirit, the dumps are located under
            /scratch/dumps/soak-10.spirit.whamcloud.com/10.10.1.110-2019-05-12-13:06:18

            I was using the mlnx bulid
            https://build.whamcloud.com/job/lustre-b2_12-ib/16/arch=x86_64,build_type=server,distro=el7.6,ib_stack=mlx/

            sarah Sarah Liu added a comment - I uploaded the vmcore again, please try if it works. Or if you can access spirit, the dumps are located under /scratch/dumps/soak-10.spirit.whamcloud.com/10.10.1.110-2019-05-12-13:06:18 I was using the mlnx bulid https://build.whamcloud.com/job/lustre-b2_12-ib/16/arch=x86_64,build_type=server,distro=el7.6,ib_stack=mlx/
            ys Yang Sheng added a comment -

            Hi, Sarah,

            Which build do you test? I use

            https://build.whamcloud.com/job/lustre-b2_12/arch=x86_64,build_type=server,distro=el7.6,ib_stack=inkernel/15/artifact/artifacts/RPMS/x86_64/

            cannot decode the vmcore.

            Thanks,
            YangSheng

            ys Yang Sheng added a comment - Hi, Sarah, Which build do you test? I use https://build.whamcloud.com/job/lustre-b2_12/arch=x86_64,build_type=server,distro=el7.6,ib_stack=inkernel/15/artifact/artifacts/RPMS/x86_64/ cannot decode the vmcore. Thanks, YangSheng

            People

              ys Yang Sheng
              sarah Sarah Liu
              Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: