Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11385

client hit BUG: unable to handle kernel NULL pointer dereference at 0000000000000028

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0, Lustre 2.12.4
    • Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3
    • lustre-master-ib build#128 tag-2.11.55 . DNE mode
    • 3
    • 9223372036854775807

    Description

      Nearly 1/3 of the clients(10/26 clients) hit following error when running soak for several hours

      soak-19 console

      [ 3041.376764] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2^M
      [ 3041.387652] alg: No test for adler32 (adler32-zlib)^M
      [ 3042.300215] Lustre: Lustre: Build Version: 2.11.55^M
      [ 3042.529425] LNet: Using FMR for registration^M
      [ 3042.547347] LNet: Added LNI 192.168.1.119@o2ib [8/256/0/180]^M
      Sep 15 03:27:01 soak-19 TIME: Time stamp for console^M
      Sep 15 04:01:01 soak-19 TIME: Time stamp for console^M
      [ 6150.233000] Lustre: Mounted soaked-client^M
      [ 8489.389803] LNetError: 3654:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds^M
      [ 8489.401106] LNetError: 3654:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 192.168.1.106@o2ib (32): c: 7, oc: 0, rc: 8^M
      [ 8489.414871] Lustre: 3694:0:(client.c:2126:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1536987077/real 1536987083]  req@ffff9c305ca50c00 x1611642077465472/t0(0) o400->soaked-OST0006-osc-ffff9c30acd99800@192.168.1.106@o2ib:28/4 lens 224/224 e 0 to 1 dl 1536987123 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1^M
      [ 8489.448242] Lustre: soaked-OST0006-osc-ffff9c30acd99800: Connection to soaked-OST0006 (at 192.168.1.106@o2ib) was lost; in progress operations using this service will wait for recovery to complete^M
      [ 8686.219497] perf: interrupt took too long (2607 > 2500), lowering kernel.perf_event_max_sample_rate to 76000^M
      [ 8701.565834] Lustre: soaked-MDT0000-mdc-ffff9c30acd99800: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)^M
      [ 8727.383766] LNet: 3654:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.106@o2ib: 1 seconds^M
      [ 8727.395186] LNet: 3654:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.106@o2ib: 5 seconds^M
      [ 8832.381649] LustreError: 11-0: soaked-OST0002-osc-ffff9c30acd99800: operation ost_connect to node 192.168.1.107@o2ib failed: rc = -16^M
      Sep 15 05:01:01 soak-19 TIME: Time stamp for console^M
      [ 9072.730020] perf: interrupt took too long (3297 > 3258), lowering kernel.perf_event_max_sample_rate to 60000^M
      [10828.883723] perf: interrupt took too long (4162 > 4121), lowering kernel.perf_event_max_sample_rate to 48000^M
      Sep 15 06:01:01 soak-19 TIME: Time stamp for console^M
      Sep 15 07:01:01 soak-19 TIME: Time stamp for console^M
      [18186.142873] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail^M
      [18201.920492] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028^M
      [18201.929290] IP: [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]^M
      [18201.937392] PGD 0 ^M
      [18201.939651] Oops: 0000 [#1] SMP ^M
      [18201.943287] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr iTCO_wdt iTCO_vendor_support wmi sg ipmi_si ipmi_devintf lpc_ich ipmi_msghandler i2c_i801 mei_me shpchp ioatdma mei auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb ptp drm mlx4_core(OE) isci ahci mpt2sas pps_core libsas libahci crct10dif_pclmul devlink dca crct10dif_common i2c_algo_bit crc32c_intel raid_class libata mlx_compat(OE) i2c_core scsi_transport_sas^M
      [18202.044478] CPU: 10 PID: 16503 Comm: IOR Tainted: G           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1^M
      [18202.055578] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [18202.068135] task: ffff9c2cabe70000 ti: ffff9c2a05b28000 task.ti: ffff9c2a05b28000^M
      [18202.076511] RIP: 0010:[<ffffffffc0c12c80>]  [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]^M
      [18202.087332] RSP: 0018:ffff9c2a05b2b798  EFLAGS: 00010202^M
      [18202.093277] RAX: 0000000000000000 RBX: ffff9c30ab70af80 RCX: 0000000000000106^M
      [18202.101268] RDX: ffff9c2faae20280 RSI: ffffffffc0c19710 RDI: ffff9c2faae20280^M
      [18202.110466] RBP: ffff9c2a05b2b7e8 R08: 0000000000000002 R09: ffffffffc0c20cf1^M
      [18202.119628] R10: ffff9c29ffc07900 R11: ffffffffc0c0536c R12: 00050000c0a8016c^M
      [18202.128791] R13: ffff9c2faae20280 R14: ffff9c2faae20280 R15: ffff9c30a73bc000^M
      [18202.137954] FS:  0000000000000000(0000) GS:ffff9c30ad880000(0000) knlGS:0000000000000000^M
      [18202.148175] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
      [18202.155776] CR2: 0000000000000028 CR3: 000000007320e000 CR4: 00000000000607e0^M
      [18202.164938] Call Trace:^M
      [18202.168854]  [<ffffffffc0c1531c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]^M
      [18202.177641]  [<ffffffffc0c15987>] kiblnd_send+0x357/0xa20 [ko2iblnd]^M
      [18202.185950]  [<ffffffffc09deed4>] lnet_ni_send+0x44/0xd0 [lnet]^M
      [18202.193751]  [<ffffffffc09e6752>] lnet_send+0x82/0x1c0 [lnet]^M
      [18202.201346]  [<ffffffffc09e6bec>] LNetPut+0x2cc/0xb60 [lnet]^M
      [18202.208850]  [<ffffffffc0c7f046>] ptl_send_buf+0x146/0x530 [ptlrpc]^M
      [18202.216995]  [<ffffffffc0c80d3d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]^M
      [18202.225113]  [<ffffffffc0c766e0>] ptlrpc_send_new_req+0x460/0xa70 [ptlrpc]^M
      [18202.233896]  [<ffffffffc0cbeaca>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]^M
      [18202.242650]  [<ffffffffc0c7b1c1>] ptlrpc_set_wait+0x291/0x790 [ptlrpc]^M
      [18202.251009]  [<ffffffffc0a7eba7>] ? lprocfs_oh_tally+0x17/0x40 [obdclass]^M
      [18202.259634]  [<ffffffffc0c868fa>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]^M
      [18202.268531]  [<ffffffffc0c7b73d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]^M
      [18202.276916]  [<ffffffffc0e4668b>] mdc_close+0x1eb/0x980 [mdc]^M
      [18202.284338]  [<ffffffffc084cf04>] lmv_close+0x184/0x2a0 [lmv]^M
      [18202.291745]  [<ffffffffc0e8d7c0>] ll_close_inode_openhandle+0x2e0/0xcd0 [lustre]^M
      [18202.300976]  [<ffffffffc0e91f50>] ll_md_real_close+0xf0/0x1e0 [lustre]^M
      [18202.309221]  [<ffffffffc0e9265b>] ll_file_release+0x61b/0x8c0 [lustre]^M
      [18202.317475]  [<ffffffff8ca1d74c>] __fput+0xec/0x260^M
      [18202.323850]  [<ffffffff8ca1d9ae>] ____fput+0xe/0x10^M
      [18202.330208]  [<ffffffff8c8b803b>] task_work_run+0xbb/0xe0^M
      [18202.337134]  [<ffffffff8c897f21>] do_exit+0x2d1/0xa40^M
      [18202.343652]  [<ffffffff8ca1b538>] ? vfs_write+0x168/0x1f0^M
      [18202.350551]  [<ffffffff8cf206e1>] ? system_call_after_swapgs+0xae/0x146^M
      [18202.358817]  [<ffffffff8c89870f>] do_group_exit+0x3f/0xa0^M
      [18202.365704]  [<ffffffff8c898784>] SyS_exit_group+0x14/0x20^M
      [18202.372663]  [<ffffffff8cf20795>] system_call_fastpath+0x1c/0x21^M
      [18202.380187]  [<ffffffff8cf206e1>] ? system_call_after_swapgs+0xae/0x146^M
      [18202.388370] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 f1 0c c2 c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 10 97 c1 c0 <48> 8b 78 28 e8 07 dc 99 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1 ^M
      [18202.411764] RIP  [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]^M
      [18202.420723]  RSP <ffff9c2a05b2b798>^M
      [18202.425371] CR2: 0000000000000028^M
      [18202.431391] ---[ end trace ccdeccc9915a17ce ]---^M
      [18202.509003] Kernel panic - not syncing: Fatal exception^M
      [18202.515717] Kernel Offset: 0xb800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)^M
      

      Attachments

        Issue Links

          Activity

            [LU-11385] client hit BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
            sarah Sarah Liu added a comment -

            Hit same issue on 2.12.3-RC1(b2_12-ib #64)

            [37224.564653] LNetError: 4595:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib (15): c: 7, oc: 0, rc: 8
            [37224.578319] LNetError: 4603:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 192.168.1.108@o2ib added to recovery queue. Health = 900
            [37224.628484] LNetError: 52343:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 192.168.1.128@o2ib added to recovery queue. Health = 900
            [37224.642005] LNetError: 52343:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 22 previous similar messages
            [37226.628596] Lustre: 4631:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1570787383/real 1570787390]  
            req@ffff971940601200 x1647054195451264/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1570787390 ref 1 fl Rpc:eXN/0/ffffff
            ff rc 0/-1
            [37226.660634] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fa
            il
            [37243.629633] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
            [37243.638397] IP: [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
            [37243.646478] PGD 0 
            [37243.648729] Oops: 0000 [#1] SMP 
            [37243.652352] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) r
            pcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_cor
            e(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmul
            ni_intel aesni_intel lrw gf128mul ipmi_ssif glue_helper sg lpc_ich mei_me mei ablk_helper cryptd ipmi_si pcspkr ipmi_devintf ioatdma ipmi_msghandler joydev i2c
            _i801 pcc_cpufreq wmi auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_
            helper isci syscopyarea igb sysfillrect sysimgblt fb_sys_fops ahci ttm libsas scsi_transport_sas libahci mlx4_core(OE) ptp drm crct10dif_pclmul pps_core crct10
            dif_common devlink crc32c_intel dca libata mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit
            [37243.755763] CPU: 8 PID: 53315 Comm: mdtest Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.1.1.el7.x86_64 #1
            [37243.768477] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [37243.780998] task: ffff9713b5a841c0 ti: ffff971601604000 task.ti: ffff971601604000
            [37243.789349] RIP: 0010:[<ffffffffc0ca6cd0>]  [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
            [37243.800137] RSP: 0018:ffff971601607798  EFLAGS: 00010202
            [37243.806064] RAX: 0000000000000000 RBX: ffff971bac9a6100 RCX: 0000000000000106
            [37243.814028] RDX: ffff971b142fde00 RSI: ffffffffc0cae070 RDI: ffff971b142fde00
            [37243.821990] RBP: ffff9716016077e8 R08: 0000000000000002 R09: ffffffffc0cb4cc4
            [37243.829953] R10: ffff9714ffc07900 R11: ffffffffc0c9945c R12: 00050000c0a8016c
            [37243.837916] R13: ffff971b142fde00 R14: ffff971b142fde00 R15: ffff971b94627a00
            [37243.845878] FS:  0000000000000000(0000) GS:ffff971bae200000(0000) knlGS:0000000000000000
            [37243.854906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [37243.861316] CR2: 0000000000000028 CR3: 000000030e010000 CR4: 00000000000607e0
            [37243.869279] Call Trace:
            [37243.872014]  [<ffffffffc0ca91ac>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
            [37243.879591]  [<ffffffffc0ca9817>] kiblnd_send+0x357/0xa20 [ko2iblnd]
            [37243.886691]  [<ffffffffc0a74594>] lnet_ni_send+0x44/0xd0 [lnet]
            [37243.893302]  [<ffffffffc0a7bd32>] lnet_send+0x82/0x1c0 [lnet]
            [37243.899718]  [<ffffffffc0a7c13c>] LNetPut+0x2cc/0xb50 [lnet]
            [37243.906056]  [<ffffffffc0d12856>] ptl_send_buf+0x146/0x530 [ptlrpc]
            [37243.913063]  [<ffffffffc0d1454d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]
            [37243.920070]  [<ffffffffc0d09dc0>] ptlrpc_send_new_req+0x450/0xa60 [ptlrpc]
            [37243.927760]  [<ffffffffc0d5318a>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
            [37243.935445]  [<ffffffffc0d0e891>] ptlrpc_set_wait+0x291/0x790 [ptlrpc]
            [37243.942749]  [<ffffffffc0b13e17>] ? lprocfs_oh_tally+0x17/0x40 [obdclass]
            [37243.950338]  [<ffffffffc0d1a50a>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]
            [37243.958216]  [<ffffffffc0d0ee13>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
            [37243.965593]  [<ffffffffc0edccf1>] mdc_close+0x201/0x9e0 [mdc]
            [37243.972007]  [<ffffffffc0f1d0ce>] lmv_close+0x1ae/0x3a0 [lmv]
            [37243.978429]  [<ffffffffc0f55987>] ll_close_inode_openhandle+0x2e7/0xcf0 [lustre]
            [37243.986688]  [<ffffffffc0f5a1e0>] ll_md_real_close+0xf0/0x1e0 [lustre]
            [37243.993976]  [<ffffffffc0f5a8eb>] ll_file_release+0x61b/0x8c0 [lustre]
            [37244.001262]  [<ffffffffa224a9cc>] __fput+0xec/0x260
            [37244.006705]  [<ffffffffa224ac2e>] ____fput+0xe/0x10
            [37244.012149]  [<ffffffffa20c1c0b>] task_work_run+0xbb/0xe0
            [37244.018173]  [<ffffffffa20a0f24>] do_exit+0x2d4/0xa50
            [37244.023809]  [<ffffffffa2787678>] ? __do_page_fault+0x238/0x500
            [37244.030417]  [<ffffffffa278ce21>] ? system_call_after_swapgs+0xae/0x146
            [37244.037798]  [<ffffffffa20a171f>] do_group_exit+0x3f/0xa0
            [37244.043821]  [<ffffffffa20a1794>] SyS_exit_group+0x14/0x20
            [37244.049942]  [<ffffffffa278cede>] system_call_fastpath+0x25/0x2a
            [37244.056645]  [<ffffffffa278ce21>] ? system_call_after_swapgs+0xae/0x146
            [37244.064026] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 c4 4c cb c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 70 e0 ca c0 <48> 8b 78 28 e8 57 8c 92 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1 
            [37244.085739] RIP  [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
            [37244.093907]  RSP <ffff971601607798>
            [37244.097797] CR2: 0000000000000028
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            [    0.000000] Linux version 3.10.0-1062.1.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Sep 13 22:55:44 UTC 2019
            [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.1.1.el7.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869812K
            
            sarah Sarah Liu added a comment - Hit same issue on 2.12.3-RC1(b2_12-ib #64) [37224.564653] LNetError: 4595:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib (15): c: 7, oc: 0, rc: 8 [37224.578319] LNetError: 4603:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 192.168.1.108@o2ib added to recovery queue. Health = 900 [37224.628484] LNetError: 52343:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 192.168.1.128@o2ib added to recovery queue. Health = 900 [37224.642005] LNetError: 52343:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 22 previous similar messages [37226.628596] Lustre: 4631:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1570787383/real 1570787390] req@ffff971940601200 x1647054195451264/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1570787390 ref 1 fl Rpc:eXN/0/ffffff ff rc 0/-1 [37226.660634] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fa il [37243.629633] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 [37243.638397] IP: [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd] [37243.646478] PGD 0 [37243.648729] Oops: 0000 [#1] SMP [37243.652352] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) r pcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_cor e(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmul ni_intel aesni_intel lrw gf128mul ipmi_ssif glue_helper sg lpc_ich mei_me mei ablk_helper cryptd ipmi_si pcspkr ipmi_devintf ioatdma ipmi_msghandler joydev i2c _i801 pcc_cpufreq wmi auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_ helper isci syscopyarea igb sysfillrect sysimgblt fb_sys_fops ahci ttm libsas scsi_transport_sas libahci mlx4_core(OE) ptp drm crct10dif_pclmul pps_core crct10 dif_common devlink crc32c_intel dca libata mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit [37243.755763] CPU: 8 PID: 53315 Comm: mdtest Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.1.1.el7.x86_64 #1 [37243.768477] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [37243.780998] task: ffff9713b5a841c0 ti: ffff971601604000 task.ti: ffff971601604000 [37243.789349] RIP: 0010:[<ffffffffc0ca6cd0>] [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd] [37243.800137] RSP: 0018:ffff971601607798 EFLAGS: 00010202 [37243.806064] RAX: 0000000000000000 RBX: ffff971bac9a6100 RCX: 0000000000000106 [37243.814028] RDX: ffff971b142fde00 RSI: ffffffffc0cae070 RDI: ffff971b142fde00 [37243.821990] RBP: ffff9716016077e8 R08: 0000000000000002 R09: ffffffffc0cb4cc4 [37243.829953] R10: ffff9714ffc07900 R11: ffffffffc0c9945c R12: 00050000c0a8016c [37243.837916] R13: ffff971b142fde00 R14: ffff971b142fde00 R15: ffff971b94627a00 [37243.845878] FS: 0000000000000000(0000) GS:ffff971bae200000(0000) knlGS:0000000000000000 [37243.854906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [37243.861316] CR2: 0000000000000028 CR3: 000000030e010000 CR4: 00000000000607e0 [37243.869279] Call Trace: [37243.872014] [<ffffffffc0ca91ac>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd] [37243.879591] [<ffffffffc0ca9817>] kiblnd_send+0x357/0xa20 [ko2iblnd] [37243.886691] [<ffffffffc0a74594>] lnet_ni_send+0x44/0xd0 [lnet] [37243.893302] [<ffffffffc0a7bd32>] lnet_send+0x82/0x1c0 [lnet] [37243.899718] [<ffffffffc0a7c13c>] LNetPut+0x2cc/0xb50 [lnet] [37243.906056] [<ffffffffc0d12856>] ptl_send_buf+0x146/0x530 [ptlrpc] [37243.913063] [<ffffffffc0d1454d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc] [37243.920070] [<ffffffffc0d09dc0>] ptlrpc_send_new_req+0x450/0xa60 [ptlrpc] [37243.927760] [<ffffffffc0d5318a>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc] [37243.935445] [<ffffffffc0d0e891>] ptlrpc_set_wait+0x291/0x790 [ptlrpc] [37243.942749] [<ffffffffc0b13e17>] ? lprocfs_oh_tally+0x17/0x40 [obdclass] [37243.950338] [<ffffffffc0d1a50a>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc] [37243.958216] [<ffffffffc0d0ee13>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc] [37243.965593] [<ffffffffc0edccf1>] mdc_close+0x201/0x9e0 [mdc] [37243.972007] [<ffffffffc0f1d0ce>] lmv_close+0x1ae/0x3a0 [lmv] [37243.978429] [<ffffffffc0f55987>] ll_close_inode_openhandle+0x2e7/0xcf0 [lustre] [37243.986688] [<ffffffffc0f5a1e0>] ll_md_real_close+0xf0/0x1e0 [lustre] [37243.993976] [<ffffffffc0f5a8eb>] ll_file_release+0x61b/0x8c0 [lustre] [37244.001262] [<ffffffffa224a9cc>] __fput+0xec/0x260 [37244.006705] [<ffffffffa224ac2e>] ____fput+0xe/0x10 [37244.012149] [<ffffffffa20c1c0b>] task_work_run+0xbb/0xe0 [37244.018173] [<ffffffffa20a0f24>] do_exit+0x2d4/0xa50 [37244.023809] [<ffffffffa2787678>] ? __do_page_fault+0x238/0x500 [37244.030417] [<ffffffffa278ce21>] ? system_call_after_swapgs+0xae/0x146 [37244.037798] [<ffffffffa20a171f>] do_group_exit+0x3f/0xa0 [37244.043821] [<ffffffffa20a1794>] SyS_exit_group+0x14/0x20 [37244.049942] [<ffffffffa278cede>] system_call_fastpath+0x25/0x2a [37244.056645] [<ffffffffa278ce21>] ? system_call_after_swapgs+0xae/0x146 [37244.064026] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 c4 4c cb c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 70 e0 ca c0 <48> 8b 78 28 e8 57 8c 92 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1 [37244.085739] RIP [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd] [37244.093907] RSP <ffff971601607798> [37244.097797] CR2: 0000000000000028 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.10.0-1062.1.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Sep 13 22:55:44 UTC 2019 [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.1.1.el7.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869812K
            scadmin SC Admin added a comment -

            Hi Folks,

            Just reporting that I think we're seeing this issue at Swinburne. AFAICT is triggered by just rebooting one of the LNet routers. Happened once last night, and I was somewhat  able to re-produce it this morning by rebooting the same LNet router a couple of times which resulted in another node crashing. Unlike Sarah, we're not on 2.12/2.13, we're using 2.10.x:

             

            Lustre clients:

            • lustre-client-2.10.7
            • kernel-3.10.0-957.21.3.el7.x86_64

             

            Lustre servers/lnet router in question:

            • lustre-2.10.5-1.el7.x86_64
            • kernel-3.10.0-862.9.1.el7.x86_64

            with patches:

            • lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
            • lu11111-lfsck-gerrit32796-693fe452.ported.patch
            • lu11201-lfsckDoesntFinish-gerrit33078-4829fb05.patch
            • lu11301-stuckMdtThreads2-c43baa1c.patch
            • lu11418-hungMdtZfs-gerrit33248-eaa3c60d.diff
            • lu11418-refreshStale-gerrit33401-v4-71f409c9.diff
            • lu11418-stopOrphCleanupDaThreadSpinning-gerrit33662-45434fd0.diff
            • lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff
            • lu11663-partialPageCorruption-gerrit33748-18d6b8fb.diff

             

            Dump:

            [3114246.460458] LNet: 3543:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.55.232@o2ib: 187 seconds
            [3114990.159188] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
            [3114990.168863] IP: [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd]
            [3114990.177907] PGD 0 
            [3114990.181713] Oops: 0000 [#1] SMP 
            [3114990.186738] Modules linked in: squashfs loop 8021q garp mrp stp llc mptctl nvidia_uvm(POE) nvidia(POE) ib_ucm nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter xfs libcrc32c snd_hda_codec_hdmi iTCO_wdt gpio_ich iTCO_vendor_support dm_mod intel_powerclamp coretemp kvm_intel kvm irqbypass rdma_ucm mgag200 ib_uverbs ib_umad snd_hda_intel ttm snd_hda_codec drm_kms_helper snd_hda_core syscopyarea sysfillrect sysimgblt snd_hwdep fb_sys_fops snd_seq drm snd_seq_device snd_pcm joydev drm_panel_orientation_quirks sg lpc_ich i2c_i801 snd_timer snd ipmi_si soundcore ipmi_devintf ipmi_msghandler ioatdma i7core_edac acpi_cpufreq
            [3114990.269028]  binfmt_misc ip_tables overlay(OET) osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic pata_acpi ib_qib rdmavt crc32c_intel ib_core serio_raw ahci firewire_ohci igb libahci firewire_core mptsas pata_jmicron crc_itu_t mptscsih i2c_algo_bit libata dca mptbase ptp scsi_transport_sas pps_core [last unloaded: pcspkr]
            [3114990.319897] CPU: 3 PID: 14147 Comm: sleep Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-957.21.2.el7.x86_64 #1
            [3114990.334596] Hardware name: SGI.COM C3108-TY11/X8DAH, BIOS 1.1     03/16/2011
            [3114990.343437] task: ffff9789ed7330c0 ti: ffff9786b4280000 task.ti: ffff9786b4280000
            [3114990.352701] RIP: 0010:[<ffffffffc061f6c9>]  [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd]
            [3114990.364161] RSP: 0018:ffff9786b42837a8  EFLAGS: 00010202
            [3114990.371231] RAX: 0000000000000000 RBX: ffff978a23a63040 RCX: 0000000000000106
            [3114990.380112] RDX: ffff97840ba5b000 RSI: ffffffffc0626180 RDI: ffff97840ba5b000
            [3114990.388975] RBP: ffff9786b42837f8 R08: 0000000000000002 R09: ffff97840ba5b000
            [3114990.397832] R10: ffff977f7fc03900 R11: ffffffffc061239c R12: 00050000c0a837e8
            [3114990.406665] R13: ffff97840ba5b000 R14: ffff97840ba5b000 R15: ffff978a2454ce00
            [3114990.415484] FS:  0000000000000000(0000) GS:ffff978427ac0000(0000) knlGS:0000000000000000
            [3114990.425245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [3114990.432659] CR2: 0000000000000028 CR3: 00000001c6c10000 CR4: 00000000000207e0
            [3114990.441448] Call Trace:
            [3114990.445532]  [<ffffffffc0621d0c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
            [3114990.453945]  [<ffffffffc0622377>] kiblnd_send+0x357/0xa10 [ko2iblnd]
            [3114990.461914]  [<ffffffffc03ecbc1>] lnet_ni_send+0x41/0xd0 [lnet]
            [3114990.469412]  [<ffffffffc03f1fb7>] lnet_send+0x77/0x180 [lnet]
            [3114990.476706]  [<ffffffffc03f2305>] LNetPut+0x245/0x7a0 [lnet]
            [3114990.483924]  [<ffffffffc068c226>] ptl_send_buf+0x146/0x530 [ptlrpc]
            [3114990.491693]  [<ffffffffc03e7a44>] ? LNetMDAttach+0x3f4/0x450 [lnet]
            [3114990.499468]  [<ffffffffc068de1d>] ptl_send_rpc+0x67d/0xe60 [ptlrpc]
            [3114990.507222]  [<ffffffffc06834a8>] ptlrpc_send_new_req+0x468/0xa60 [ptlrpc]
            [3114990.515569]  [<ffffffffc06cc7ea>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
            [3114990.523884]  [<ffffffffc06880c1>] ptlrpc_set_wait+0x3d1/0x920 [ptlrpc]
            [3114990.531849]  [<ffffffffc0481899>] ? lustre_get_jobid+0x99/0x4d0 [obdclass]
            [3114990.540136]  [<ffffffffc0693855>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc]
            [3114990.548572]  [<ffffffffc068868d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            [3114990.556551]  [<ffffffffc07a258c>] mdc_close+0x1bc/0x8a0 [mdc]
            [3114990.563646]  [<ffffffffc084d07c>] lmv_close+0x21c/0x550 [lmv]
            [3114990.570719]  [<ffffffffc088d6fe>] ll_close_inode_openhandle+0x2fe/0xe20 [lustre]
            [3114990.579413]  [<ffffffffc08905c0>] ll_md_real_close+0xf0/0x1e0 [lustre]
            [3114990.587228]  [<ffffffffc0890cf8>] ll_file_release+0x648/0xa80 [lustre]
            [3114990.595011]  [<ffffffffa2243bcc>] __fput+0xec/0x260
            [3114990.601134]  [<ffffffffa2243e2e>] ____fput+0xe/0x10
            [3114990.607263]  [<ffffffffa20be90b>] task_work_run+0xbb/0xe0
            [3114990.613888]  [<ffffffffa209ddd1>] do_exit+0x2d1/0xa40
            [3114990.620135]  [<ffffffffa2770628>] ? __do_page_fault+0x228/0x4f0
            [3114990.627243]  [<ffffffffa2775d21>] ? system_call_after_swapgs+0xae/0x146
            [3114990.635018]  [<ffffffffa209e5bf>] do_group_exit+0x3f/0xa0
            [3114990.641553]  [<ffffffffa209e634>] SyS_exit_group+0x14/0x20
            [3114990.648149]  [<ffffffffa2775ddb>] system_call_fastpath+0x22/0x27
            [3114990.655242]  [<ffffffffa2775d21>] ? system_call_after_swapgs+0xae/0x146
            [3114990.662922] Code: 0f 84 82 05 00 00 65 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 80 61 62 c0 <48> 8b 78 28 e8 9e 83 c5 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c8 
            [3114990.684832] RIP  [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd]
            [3114990.693213]  RSP <ffff9786b42837a8>
            [3114990.697727] CR2: 0000000000000028
            

             

            I see some folks poking around at the patch sets related to: https://jira.whamcloud.com/browse/LU-12236   (pretty much right ~now actually). 

            Suggestions for us to test anything?

             

            Cheers,

            Simon

             

             

            scadmin SC Admin added a comment - Hi Folks, Just reporting that I think we're seeing this issue at Swinburne. AFAICT is triggered by just rebooting one of the LNet routers. Happened once last night, and I was somewhat   able to re-produce it this morning by rebooting the same LNet router a couple of times which resulted in another node crashing. Unlike Sarah, we're not on 2.12/2.13, we're using 2.10.x:   Lustre clients: lustre-client-2.10.7 kernel-3.10.0-957.21.3.el7.x86_64   Lustre servers/lnet router in question: lustre-2.10.5-1.el7.x86_64 kernel-3.10.0-862.9.1.el7.x86_64 with patches: lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff lu11111-lfsck-gerrit32796-693fe452.ported.patch lu11201-lfsckDoesntFinish-gerrit33078-4829fb05.patch lu11301-stuckMdtThreads2-c43baa1c.patch lu11418-hungMdtZfs-gerrit33248-eaa3c60d.diff lu11418-refreshStale-gerrit33401-v4-71f409c9.diff lu11418-stopOrphCleanupDaThreadSpinning-gerrit33662-45434fd0.diff lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff lu11663-partialPageCorruption-gerrit33748-18d6b8fb.diff   Dump: [3114246.460458] LNet: 3543:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.55.232@o2ib: 187 seconds [3114990.159188] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 [3114990.168863] IP: [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd] [3114990.177907] PGD 0 [3114990.181713] Oops: 0000 [#1] SMP [3114990.186738] Modules linked in: squashfs loop 8021q garp mrp stp llc mptctl nvidia_uvm(POE) nvidia(POE) ib_ucm nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter xfs libcrc32c snd_hda_codec_hdmi iTCO_wdt gpio_ich iTCO_vendor_support dm_mod intel_powerclamp coretemp kvm_intel kvm irqbypass rdma_ucm mgag200 ib_uverbs ib_umad snd_hda_intel ttm snd_hda_codec drm_kms_helper snd_hda_core syscopyarea sysfillrect sysimgblt snd_hwdep fb_sys_fops snd_seq drm snd_seq_device snd_pcm joydev drm_panel_orientation_quirks sg lpc_ich i2c_i801 snd_timer snd ipmi_si soundcore ipmi_devintf ipmi_msghandler ioatdma i7core_edac acpi_cpufreq [3114990.269028] binfmt_misc ip_tables overlay(OET) osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic pata_acpi ib_qib rdmavt crc32c_intel ib_core serio_raw ahci firewire_ohci igb libahci firewire_core mptsas pata_jmicron crc_itu_t mptscsih i2c_algo_bit libata dca mptbase ptp scsi_transport_sas pps_core [last unloaded: pcspkr] [3114990.319897] CPU: 3 PID: 14147 Comm: sleep Kdump: loaded Tainted: P OE ------------ T 3.10.0-957.21.2.el7.x86_64 #1 [3114990.334596] Hardware name: SGI.COM C3108-TY11/X8DAH, BIOS 1.1 03/16/2011 [3114990.343437] task: ffff9789ed7330c0 ti: ffff9786b4280000 task.ti: ffff9786b4280000 [3114990.352701] RIP: 0010:[<ffffffffc061f6c9>] [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd] [3114990.364161] RSP: 0018:ffff9786b42837a8 EFLAGS: 00010202 [3114990.371231] RAX: 0000000000000000 RBX: ffff978a23a63040 RCX: 0000000000000106 [3114990.380112] RDX: ffff97840ba5b000 RSI: ffffffffc0626180 RDI: ffff97840ba5b000 [3114990.388975] RBP: ffff9786b42837f8 R08: 0000000000000002 R09: ffff97840ba5b000 [3114990.397832] R10: ffff977f7fc03900 R11: ffffffffc061239c R12: 00050000c0a837e8 [3114990.406665] R13: ffff97840ba5b000 R14: ffff97840ba5b000 R15: ffff978a2454ce00 [3114990.415484] FS: 0000000000000000(0000) GS:ffff978427ac0000(0000) knlGS:0000000000000000 [3114990.425245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3114990.432659] CR2: 0000000000000028 CR3: 00000001c6c10000 CR4: 00000000000207e0 [3114990.441448] Call Trace: [3114990.445532] [<ffffffffc0621d0c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd] [3114990.453945] [<ffffffffc0622377>] kiblnd_send+0x357/0xa10 [ko2iblnd] [3114990.461914] [<ffffffffc03ecbc1>] lnet_ni_send+0x41/0xd0 [lnet] [3114990.469412] [<ffffffffc03f1fb7>] lnet_send+0x77/0x180 [lnet] [3114990.476706] [<ffffffffc03f2305>] LNetPut+0x245/0x7a0 [lnet] [3114990.483924] [<ffffffffc068c226>] ptl_send_buf+0x146/0x530 [ptlrpc] [3114990.491693] [<ffffffffc03e7a44>] ? LNetMDAttach+0x3f4/0x450 [lnet] [3114990.499468] [<ffffffffc068de1d>] ptl_send_rpc+0x67d/0xe60 [ptlrpc] [3114990.507222] [<ffffffffc06834a8>] ptlrpc_send_new_req+0x468/0xa60 [ptlrpc] [3114990.515569] [<ffffffffc06cc7ea>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc] [3114990.523884] [<ffffffffc06880c1>] ptlrpc_set_wait+0x3d1/0x920 [ptlrpc] [3114990.531849] [<ffffffffc0481899>] ? lustre_get_jobid+0x99/0x4d0 [obdclass] [3114990.540136] [<ffffffffc0693855>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc] [3114990.548572] [<ffffffffc068868d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc] [3114990.556551] [<ffffffffc07a258c>] mdc_close+0x1bc/0x8a0 [mdc] [3114990.563646] [<ffffffffc084d07c>] lmv_close+0x21c/0x550 [lmv] [3114990.570719] [<ffffffffc088d6fe>] ll_close_inode_openhandle+0x2fe/0xe20 [lustre] [3114990.579413] [<ffffffffc08905c0>] ll_md_real_close+0xf0/0x1e0 [lustre] [3114990.587228] [<ffffffffc0890cf8>] ll_file_release+0x648/0xa80 [lustre] [3114990.595011] [<ffffffffa2243bcc>] __fput+0xec/0x260 [3114990.601134] [<ffffffffa2243e2e>] ____fput+0xe/0x10 [3114990.607263] [<ffffffffa20be90b>] task_work_run+0xbb/0xe0 [3114990.613888] [<ffffffffa209ddd1>] do_exit+0x2d1/0xa40 [3114990.620135] [<ffffffffa2770628>] ? __do_page_fault+0x228/0x4f0 [3114990.627243] [<ffffffffa2775d21>] ? system_call_after_swapgs+0xae/0x146 [3114990.635018] [<ffffffffa209e5bf>] do_group_exit+0x3f/0xa0 [3114990.641553] [<ffffffffa209e634>] SyS_exit_group+0x14/0x20 [3114990.648149] [<ffffffffa2775ddb>] system_call_fastpath+0x22/0x27 [3114990.655242] [<ffffffffa2775d21>] ? system_call_after_swapgs+0xae/0x146 [3114990.662922] Code: 0f 84 82 05 00 00 65 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 80 61 62 c0 <48> 8b 78 28 e8 9e 83 c5 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c8 [3114990.684832] RIP [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd] [3114990.693213] RSP <ffff9786b42837a8> [3114990.697727] CR2: 0000000000000028   I see some folks poking around at the patch sets related to: https://jira.whamcloud.com/browse/LU-12236    (pretty much right ~now actually).  Suggestions for us to test anything?   Cheers, Simon    
            sarah Sarah Liu added a comment -

            Just checking here that the soak is running with the below patch?

            https://review.whamcloud.com/34577

            No, the current build is tip of master which doesn't have the patch. I can test the patch or the one of LU-12236 in a few days(to let soak stays up for longer time).

            sarah Sarah Liu added a comment - Just checking here that the soak is running with the below patch? https://review.whamcloud.com/34577 No, the current build is tip of master which doesn't have the patch. I can test the patch or the one of LU-12236 in a few days(to let soak stays up for longer time).

            I think the solution from LU-12236 is a more complete solution to this problem.

            simmonsja James A Simmons added a comment - I think the solution from LU-12236 is a more complete solution to this problem.

            Just checking here that the soak is running with the below patch?

            https://review.whamcloud.com/34577

            sharmaso Sonia Sharma (Inactive) added a comment - Just checking here that the soak is running with the below patch? https://review.whamcloud.com/34577
            sarah Sarah Liu added a comment - - edited

            2 client nodes hit the same problem on master overnight, lustre-master-ib build #225 version 2.12.53_13_g4191e0c

            sarah Sarah Liu added a comment - - edited 2 client nodes hit the same problem on master overnight, lustre-master-ib build #225 version 2.12.53_13_g4191e0c
            sarah Sarah Liu added a comment - - edited

            A quick update, soak has been running with patch #34577 for almost 3 days, not seeing the bug so far.

            another update, soak has been up for 5 days, not seeing the bug. One thing to clarify, the bug was only hit on b2_12 branch, the patch is based on master

            sarah Sarah Liu added a comment - - edited A quick update, soak has been running with patch #34577 for almost 3 days, not seeing the bug so far. another update, soak has been up for 5 days, not seeing the bug. One thing to clarify, the bug was only hit on b2_12 branch, the patch is based on master
            sarah Sarah Liu added a comment - - edited

            soak hit the problem again when running b2_12-next build#6 on 2 clients when running about 24 hours
            soak-40.log-20190407

            [81137.079621] Lustre: Skipped 1 previous similar message
            [84291.211419] LNetError: 14031:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
            [84291.222786] LNetError: 14031:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib (11): c: 7, oc: 0, rc: 8
            [84291.236612] Lustre: 14052:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1554566512/real 1554566519]  req@ffff8e0101531
            e00 x1629993278173792/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1554566519 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
            [84291.268771] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            [84300.542162] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
            [84300.550923] IP: [<ffffffffc0800cc0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
            [84300.559000] PGD 0 
            [84300.561254] Oops: 0000 [#1] SMP 
            [84300.564873] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfs
            v4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_e
            n(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO
            _wdt iTCO_vendor_support mxm_wmi ipmi_ssif mei_me pcspkr joydev mei sg lpc_ich i2c_i801 ioatdma ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad acpi_power_meter auth_rpcgss su
            nrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb ahci ml
            x4_core(OE) drm crct10dif_pclmul ptp crct10dif_common libahci crc32c_intel pps_core devlink dca libata mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit
            [84300.667360] CPU: 9 PID: 114559 Comm: mdtest Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7.x86_64 #1
            [84300.680065] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
            [84300.691905] task: ffff8dfa8f7b6180 ti: ffff8df4cb728000 task.ti: ffff8df4cb728000
            [84300.700255] RIP: 0010:[<ffffffffc0800cc0>]  [<ffffffffc0800cc0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
            [84300.711042] RSP: 0018:ffff8df4cb72b750  EFLAGS: 00010202
            [84300.716968] RAX: 0000000000000000 RBX: ffff8e02842061c0 RCX: 0000000000000106
            [84300.724930] RDX: ffff8dfbce62cb80 RSI: ffffffffc0807750 RDI: ffff8dfbce62cb80
            [84300.732893] RBP: ffff8df4cb72b7a0 R08: 0000000000000002 R09: ffffffffc080ecf1
            [84300.740855] R10: ffff8df3bfc07900 R11: ffffffffffffff83 R12: 00050000c0a8016c
            [84300.748817] R13: ffff8dfbce62cb80 R14: ffff8dfbce62cb80 R15: ffff8e029b251600
            [84300.756780] FS:  0000000000000000(0000) GS:ffff8dfa9e840000(0000) knlGS:0000000000000000
            [84300.765809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [84300.772220] CR2: 0000000000000028 CR3: 0000000c73610000 CR4: 00000000003607e0
            [84300.780181] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [84300.788144] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            [84300.796104] Call Trace:
            [84300.798837]  [<ffffffffc080335c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
            [84300.806416]  [<ffffffffc08039c7>] kiblnd_send+0x357/0xa20 [ko2iblnd]
            [84300.813521]  [<ffffffffc0b1f1c4>] lnet_ni_send+0x44/0xd0 [lnet]
            [84300.820133]  [<ffffffffc0b26e92>] lnet_send+0x82/0x1c0 [lnet]
            [84300.826550]  [<ffffffffc0b2729c>] LNetPut+0x2cc/0xb50 [lnet]
            [84300.832893]  [<ffffffffc0dbe786>] ptl_send_buf+0x146/0x530 [ptlrpc]
            [84300.839904]  [<ffffffffc0dc047d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]
            [84300.846914]  [<ffffffffc0db5b00>] ptlrpc_send_new_req+0x450/0xa60 [ptlrpc]
            [84300.854604]  [<ffffffffc0dba728>] ptlrpc_set_wait+0x3f8/0x8d0 [ptlrpc]
            [84300.861905]  [<ffffffffc0dc4840>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc]
            [84300.869594]  [<ffffffffc0dc4857>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
            [84300.876807]  [<ffffffffc0bfaf15>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
            [84300.884588]  [<ffffffffc0bc0e47>] ? lprocfs_oh_tally+0x17/0x40 [obdclass]
            [84300.892171]  [<ffffffffc0dc610a>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]
            [84300.900053]  [<ffffffffc0dbac83>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
            [84300.907439]  [<ffffffffc0d4680b>] mdc_close+0x1eb/0x980 [mdc]
            [84300.913855]  [<ffffffffc0953bee>] lmv_close+0x1ae/0x3a0 [lmv]
            [84300.920280]  [<ffffffffc0f93857>] ll_close_inode_openhandle+0x2e7/0xcf0 [lustre]
            [84300.928541]  [<ffffffffc0f989f0>] ll_md_real_close+0xf0/0x1e0 [lustre]
            [84300.935828]  [<ffffffffc0f990fb>] ll_file_release+0x61b/0x8c0 [lustre]
            [84300.943114]  [<ffffffffae84364c>] __fput+0xec/0x260
            [84300.948555]  [<ffffffffae8438ae>] ____fput+0xe/0x10
            [84300.953999]  [<ffffffffae6be79b>] task_work_run+0xbb/0xe0
            [84300.960023]  [<ffffffffae69dc61>] do_exit+0x2d1/0xa40
            [84300.965663]  [<ffffffffaed6f608>] ? __do_page_fault+0x228/0x500
            [84300.972268]  [<ffffffffae69e44f>] do_group_exit+0x3f/0xa0
            [84300.978290]  [<ffffffffae69e4c4>] SyS_exit_group+0x14/0x20
            [84300.984411]  [<ffffffffaed74ddb>] system_call_fastpath+0x22/0x27
            [84300.991110] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 f1 ec 80 c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 50 77 80 c0 <48> 8b 78 28 e8 97 bd e9 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1 
            [84301.012866] RIP  [<ffffffffc0800cc0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
            [84301.021032]  RSP <ffff8df4cb72b750>
            
            sarah Sarah Liu added a comment - - edited soak hit the problem again when running b2_12-next build#6 on 2 clients when running about 24 hours soak-40.log-20190407 [81137.079621] Lustre: Skipped 1 previous similar message [84291.211419] LNetError: 14031:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds [84291.222786] LNetError: 14031:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib (11): c: 7, oc: 0, rc: 8 [84291.236612] Lustre: 14052:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1554566512/real 1554566519] req@ffff8e0101531 e00 x1629993278173792/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1554566519 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 [84291.268771] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail [84300.542162] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 [84300.550923] IP: [<ffffffffc0800cc0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd] [84300.559000] PGD 0 [84300.561254] Oops: 0000 [#1] SMP [84300.564873] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfs v4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_e n(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO _wdt iTCO_vendor_support mxm_wmi ipmi_ssif mei_me pcspkr joydev mei sg lpc_ich i2c_i801 ioatdma ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad acpi_power_meter auth_rpcgss su nrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb ahci ml x4_core(OE) drm crct10dif_pclmul ptp crct10dif_common libahci crc32c_intel pps_core devlink dca libata mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit [84300.667360] CPU: 9 PID: 114559 Comm: mdtest Kdump: loaded Tainted: G OE ------------ 3.10.0-957.1.3.el7.x86_64 #1 [84300.680065] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016 [84300.691905] task: ffff8dfa8f7b6180 ti: ffff8df4cb728000 task.ti: ffff8df4cb728000 [84300.700255] RIP: 0010:[<ffffffffc0800cc0>] [<ffffffffc0800cc0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd] [84300.711042] RSP: 0018:ffff8df4cb72b750 EFLAGS: 00010202 [84300.716968] RAX: 0000000000000000 RBX: ffff8e02842061c0 RCX: 0000000000000106 [84300.724930] RDX: ffff8dfbce62cb80 RSI: ffffffffc0807750 RDI: ffff8dfbce62cb80 [84300.732893] RBP: ffff8df4cb72b7a0 R08: 0000000000000002 R09: ffffffffc080ecf1 [84300.740855] R10: ffff8df3bfc07900 R11: ffffffffffffff83 R12: 00050000c0a8016c [84300.748817] R13: ffff8dfbce62cb80 R14: ffff8dfbce62cb80 R15: ffff8e029b251600 [84300.756780] FS: 0000000000000000(0000) GS:ffff8dfa9e840000(0000) knlGS:0000000000000000 [84300.765809] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [84300.772220] CR2: 0000000000000028 CR3: 0000000c73610000 CR4: 00000000003607e0 [84300.780181] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [84300.788144] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [84300.796104] Call Trace: [84300.798837] [<ffffffffc080335c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd] [84300.806416] [<ffffffffc08039c7>] kiblnd_send+0x357/0xa20 [ko2iblnd] [84300.813521] [<ffffffffc0b1f1c4>] lnet_ni_send+0x44/0xd0 [lnet] [84300.820133] [<ffffffffc0b26e92>] lnet_send+0x82/0x1c0 [lnet] [84300.826550] [<ffffffffc0b2729c>] LNetPut+0x2cc/0xb50 [lnet] [84300.832893] [<ffffffffc0dbe786>] ptl_send_buf+0x146/0x530 [ptlrpc] [84300.839904] [<ffffffffc0dc047d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc] [84300.846914] [<ffffffffc0db5b00>] ptlrpc_send_new_req+0x450/0xa60 [ptlrpc] [84300.854604] [<ffffffffc0dba728>] ptlrpc_set_wait+0x3f8/0x8d0 [ptlrpc] [84300.861905] [<ffffffffc0dc4840>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc] [84300.869594] [<ffffffffc0dc4857>] ? lustre_msg_buf+0x17/0x60 [ptlrpc] [84300.876807] [<ffffffffc0bfaf15>] ? lustre_get_jobid+0x185/0x2e0 [obdclass] [84300.884588] [<ffffffffc0bc0e47>] ? lprocfs_oh_tally+0x17/0x40 [obdclass] [84300.892171] [<ffffffffc0dc610a>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc] [84300.900053] [<ffffffffc0dbac83>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc] [84300.907439] [<ffffffffc0d4680b>] mdc_close+0x1eb/0x980 [mdc] [84300.913855] [<ffffffffc0953bee>] lmv_close+0x1ae/0x3a0 [lmv] [84300.920280] [<ffffffffc0f93857>] ll_close_inode_openhandle+0x2e7/0xcf0 [lustre] [84300.928541] [<ffffffffc0f989f0>] ll_md_real_close+0xf0/0x1e0 [lustre] [84300.935828] [<ffffffffc0f990fb>] ll_file_release+0x61b/0x8c0 [lustre] [84300.943114] [<ffffffffae84364c>] __fput+0xec/0x260 [84300.948555] [<ffffffffae8438ae>] ____fput+0xe/0x10 [84300.953999] [<ffffffffae6be79b>] task_work_run+0xbb/0xe0 [84300.960023] [<ffffffffae69dc61>] do_exit+0x2d1/0xa40 [84300.965663] [<ffffffffaed6f608>] ? __do_page_fault+0x228/0x500 [84300.972268] [<ffffffffae69e44f>] do_group_exit+0x3f/0xa0 [84300.978290] [<ffffffffae69e4c4>] SyS_exit_group+0x14/0x20 [84300.984411] [<ffffffffaed74ddb>] system_call_fastpath+0x22/0x27 [84300.991110] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 f1 ec 80 c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 50 77 80 c0 <48> 8b 78 28 e8 97 bd e9 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1 [84301.012866] RIP [<ffffffffc0800cc0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd] [84301.021032] RSP <ffff8df4cb72b750>

            Sonia Sharma (sharmaso@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34577
            Subject: LU-11385 lnet: check if current->nsproxy is NULL before using
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c49deb6566ca2ef531e2ab6d68937a857b7dc260

            gerrit Gerrit Updater added a comment - Sonia Sharma (sharmaso@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34577 Subject: LU-11385 lnet: check if current->nsproxy is NULL before using Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: c49deb6566ca2ef531e2ab6d68937a857b7dc260

            The problem is the use of current->nsproxy->net_ns. It could be network namespace is not even enabled in the kernel. The only place I see this used by the infinband layer is in the cma code for userland to kernel space interaction which makes sense since user space could be placed in an network namespace. The proper way to handle this is to use kobj_ns_grab_current(KOBJ_NS_TYPE_NET) but that only exist in the most recent kernels. I would suggest at this point to change "current->nsproxy->net_ns" to init_net which will always exist.

            simmonsja James A Simmons added a comment - The problem is the use of current->nsproxy->net_ns. It could be network namespace is not even enabled in the kernel. The only place I see this used by the infinband layer is in the cma code for userland to kernel space interaction which makes sense since user space could be placed in an network namespace. The proper way to handle this is to use kobj_ns_grab_current(KOBJ_NS_TYPE_NET) but that only exist in the most recent kernels. I would suggest at this point to change "current->nsproxy->net_ns" to init_net which will always exist.
            sarah Sarah Liu added a comment -

            No, it cannot be reproduced at this time with over 24h running

            sarah Sarah Liu added a comment - No, it cannot be reproduced at this time with over 24h running

            People

              ssmirnov Serguei Smirnov
              sarah Sarah Liu
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: