  1. Lustre
  2. LU-11781

MDS hit BUG: unable to handle kernel NULL pointer dereference


Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.12.0
    • None
    • 2.12-RC2 lustre-master-ib build #173 EL7.6 DNE
    • 3

    Description

      MDS 0 hit a kernel panic after running for about 24 hours.

      MDS 0 console

      [18476.473668] Lustre: MGS: haven't heard from client 8e54524b-a52c-7091-0e06-f2d4a89dd59c (at 192.168.1.109@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fcae4159400, cur 1544788503 expire 1544788353 last 1544788276
      [18520.421813] LNet: 28111:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.109@o2ib: 0 seconds
      [18520.433290] LNet: 28111:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Skipped 1 previous similar message
      [18521.243163] LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.126@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [18521.263041] LustreError: Skipped 122 previous similar messages
      [18571.421594] LNet: 28111:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.109@o2ib: 1 seconds
      [18620.421324] LNet: 28111:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.109@o2ib: 0 seconds
      [18703.424361] Lustre: MGS: Connection restored to 192.168.1.109@o2ib (at 192.168.1.109@o2ib)
      [18704.407564] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.109@o2ib, removing former export from same NID
      [18737.363675] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID
      [18737.376324] Lustre: Skipped 1 previous similar message
      [18737.382130] Lustre: soaked-MDT0000: Connection restored to 192.168.1.110@o2ib (at 192.168.1.110@o2ib)
      [18737.392457] Lustre: Skipped 2 previous similar messages
      [18737.423007] LustreError: 31941:0:(osd_oi.c:761:osd_oi_insert()) dm-2: the FID [0x20000c768:0x179ce:0x0] is used by two objects: 402128901/3006680378 357564421/3006680381
      [18737.440029] LNet: 28116:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 192.168.1.110@o2ib
      [18741.123580] LustreError: 167-0: soaked-MDT0001-osp-MDT0000: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
      [18753.014392] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [18753.023170] IP: [<          (null)>]           (null)
      [18753.028819] PGD 0 
      [18753.031075] Oops: 0010 [#1] SMP 
      [18753.034700] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) dm_round_robin sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm iTCO_wdt irqbypass iTCO_vendor_support sg crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr ipmi_ssif mei_me mei lpc_ich wmi i2c_i801 ioatdma ipmi_si ipmi_devintf ipmi_msghandler dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci igb ttm mpt2sas ahci crct10dif_pclmul crct10dif_common libsas libahci crc32c_intel ptp drm mlx4_core(OE) raid_class libata pps_core drm_panel_orientation_quirks scsi_transport_sas mlx_compat(OE) dca devlink i2c_algo_bit
      [18753.146014] CPU: 0 PID: 43000 Comm: mdt_out00_019 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.el7_lustre.x86_64 #1
      [18753.159604] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [18753.172121] task: ffff9fcf2591e180 ti: ffff9fcad0a8c000 task.ti: ffff9fcad0a8c000
      [18753.180476] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
      [18753.188859] RSP: 0018:ffff9fcad0a8fb60  EFLAGS: 00010246
      [18753.194788] RAX: 0000000000000000 RBX: ffff9fc7a8190000 RCX: 0000000000000002
      [18753.202754] RDX: ffffffffc12dc770 RSI: ffff9fcad0a8fb68 RDI: ffff9fc7a8190008
      [18753.210718] RBP: ffff9fcad0a8fba0 R08: 0000000000000004 R09: 0000000000000000
      [18753.218682] R10: 0000000000000001 R11: 00000000007fffff R12: ffff9fc736087300
      [18753.226645] R13: ffff9fcb00494c48 R14: ffff9fcefba3a200 R15: ffff9fc7a8190008
      [18753.234611] FS:  0000000000000000(0000) GS:ffff9fcb2e000000(0000) knlGS:0000000000000000
      [18753.243647] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [18753.250052] CR2: 0000000000000000 CR3: 0000000551a10000 CR4: 00000000000607f0
      [18753.258016] Call Trace:
      [18753.260766]  [<ffffffffc12dabee>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
      [18753.269028]  [<ffffffffc12dadc7>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
      [18753.276640]  [<ffffffffc0e84038>] dt_index_walk+0xf8/0x430 [obdclass]
      [18753.283850]  [<ffffffffc0e84370>] ? dt_index_walk+0x430/0x430 [obdclass]
      [18753.291350]  [<ffffffffc0e85444>] dt_index_read+0x394/0x6a0 [obdclass]
      [18753.298701]  [<ffffffffc10ceb32>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
      [18753.306117]  [<ffffffffc10d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [18753.313825]  [<ffffffffc10aaa51>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [18753.322284]  [<ffffffffc0bdebde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [18753.330186]  [<ffffffffc107592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [18753.338795]  [<ffffffffc10727b5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [18753.346380]  [<ffffffff9fcd67c2>] ? default_wake_function+0x12/0x20
      [18753.353379]  [<ffffffff9fccba9b>] ? __wake_up_common+0x5b/0x90
      [18753.359934]  [<ffffffffc107925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [18753.366951]  [<ffffffffc1078760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [18753.375209]  [<ffffffff9fcc1c31>] kthread+0xd1/0xe0
      [18753.380662]  [<ffffffff9fcc1b60>] ? insert_kthread_work+0x40/0x40
      [18753.387480]  [<ffffffffa0374c37>] ret_from_fork_nospec_begin+0x21/0x21
      [18753.394766]  [<ffffffff9fcc1b60>] ? insert_kthread_work+0x40/0x40
      [18753.401564] Code:  Bad RIP value.
      [18753.405279] RIP  [<          (null)>]           (null)
      [18753.411023]  RSP <ffff9fcad0a8fb60>
      [18753.414916] CR2: 0000000000000000
      [    0.000000] Initializing cgroup subsys cpuset
      [    0.000000] Initializing cgroup subsys cpu
      [    0.000000] Initializing cgroup subsys cpuacct
      [    0.000000] Linux version 3.10.0-957.el7_lustre.x86_64 (jenkins@trevis-309-el7-x8664-2.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Sat Dec 8 05:53:16 UTC 2018
      [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-957.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr disable_cpu_apicid=0 elfcorehdr=869816K
      [    0.000000] e820: BIOS-provided physical RAM map:
      [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
      [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008efff] usable
      [    0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000009ffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
      [    0.000000] BIOS-e820: [mem 0x000000002b000000-0x000000003516dfff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000bb3c7000-0x00000000bdd2efff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000bdd2f000-0x00000000bddccfff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000bddcd000-0x00000000bdea0fff] ACPI data
      [    0.000000] BIOS-e820: [mem 0x00000000bdea1000-0x00000000bdf2efff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000bdf2f000-0x00000000bdfabfff] ACPI data
      [    0.000000] BIOS-e820: [mem 0x00000000be000000-0x00000000cfffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fed19000-0x00000000fed19fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000ffa20000-0x00000000ffffffff] reserved
      [    0.000000] NX (Execute Disable) protection: active
      [    0.000000] SMBIOS 2.6 present.
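
      When reading the call trace above, it can help to separate the frames the kernel unwinder confirmed from the ones prefixed with "?", which are addresses found on the stack that may be stale. The following is only a minimal sketch for that filtering, not part of the original report: the function name reliable_frames, the regular expression, and the "(core kernel)" label for frames without a module are all assumptions made for illustration. The sample text is copied from the first lines of the trace in this issue.

```python
import re

# Matches an oops frame line such as:
#   [18753.269028]  [<ffffffffc12dadc7>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
# Group 1: optional "? " marker (frame not confirmed by the unwinder)
# Group 2: symbol name; Group 3: optional module name in brackets
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+\[<[0-9a-f]+>\]\s+(\?\s+)?(\S+?)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def reliable_frames(console_text):
    """Return (symbol, module) pairs for frames without the '?' marker."""
    frames = []
    for line in console_text.splitlines():
        m = FRAME_RE.search(line)
        if m and m.group(1) is None:
            # Frames with no [module] suffix belong to the core kernel image
            # (label chosen here for illustration).
            frames.append((m.group(2), m.group(3) or "(core kernel)"))
    return frames


# Sample copied from the console log in this issue.
SAMPLE = """
[18753.258016] Call Trace:
[18753.260766]  [<ffffffffc12dabee>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
[18753.269028]  [<ffffffffc12dadc7>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
[18753.276640]  [<ffffffffc0e84038>] dt_index_walk+0xf8/0x430 [obdclass]
[18753.283850]  [<ffffffffc0e84370>] ? dt_index_walk+0x430/0x430 [obdclass]
[18753.291350]  [<ffffffffc0e85444>] dt_index_read+0x394/0x6a0 [obdclass]
[18753.298701]  [<ffffffffc10ceb32>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
[18753.306117]  [<ffffffffc10d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
"""

if __name__ == "__main__":
    for symbol, module in reliable_frames(SAMPLE):
        print(f"{symbol} [{module}]")
```

      Run against the sample, the sketch keeps osd_it_ea_load, dt_index_walk, dt_index_read, tgt_obd_idx_read and tgt_request_handle, and drops the two "?" entries. With RIP at NULL, the innermost confirmed frame (osd_it_ea_load in osd_ldiskfs) is the natural place to start looking for a call through a NULL function pointer, though the trace alone does not prove where the pointer came from.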
      
      

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Sarah Liu (sarah)