client hit BUG: unable to handle kernel NULL pointer dereference at 0000000000000028

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3
    • Environment: lustre-master-ib build #128, tag 2.11.55, DNE mode
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      More than a third of the clients (10 of 26) hit the following error after running soak for several hours.

      soak-19 console

      [ 3041.376764] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
      [ 3041.387652] alg: No test for adler32 (adler32-zlib)
      [ 3042.300215] Lustre: Lustre: Build Version: 2.11.55
      [ 3042.529425] LNet: Using FMR for registration
      [ 3042.547347] LNet: Added LNI 192.168.1.119@o2ib [8/256/0/180]
      Sep 15 03:27:01 soak-19 TIME: Time stamp for console
      Sep 15 04:01:01 soak-19 TIME: Time stamp for console
      [ 6150.233000] Lustre: Mounted soaked-client
      [ 8489.389803] LNetError: 3654:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
      [ 8489.401106] LNetError: 3654:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 192.168.1.106@o2ib (32): c: 7, oc: 0, rc: 8
      [ 8489.414871] Lustre: 3694:0:(client.c:2126:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1536987077/real 1536987083]  req@ffff9c305ca50c00 x1611642077465472/t0(0) o400->soaked-OST0006-osc-ffff9c30acd99800@192.168.1.106@o2ib:28/4 lens 224/224 e 0 to 1 dl 1536987123 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [ 8489.448242] Lustre: soaked-OST0006-osc-ffff9c30acd99800: Connection to soaked-OST0006 (at 192.168.1.106@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [ 8686.219497] perf: interrupt took too long (2607 > 2500), lowering kernel.perf_event_max_sample_rate to 76000
      [ 8701.565834] Lustre: soaked-MDT0000-mdc-ffff9c30acd99800: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
      [ 8727.383766] LNet: 3654:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.106@o2ib: 1 seconds
      [ 8727.395186] LNet: 3654:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.106@o2ib: 5 seconds
      [ 8832.381649] LustreError: 11-0: soaked-OST0002-osc-ffff9c30acd99800: operation ost_connect to node 192.168.1.107@o2ib failed: rc = -16
      Sep 15 05:01:01 soak-19 TIME: Time stamp for console
      [ 9072.730020] perf: interrupt took too long (3297 > 3258), lowering kernel.perf_event_max_sample_rate to 60000
      [10828.883723] perf: interrupt took too long (4162 > 4121), lowering kernel.perf_event_max_sample_rate to 48000
      Sep 15 06:01:01 soak-19 TIME: Time stamp for console
      Sep 15 07:01:01 soak-19 TIME: Time stamp for console
      [18186.142873] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
      [18201.920492] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
      [18201.929290] IP: [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
      [18201.937392] PGD 0
      [18201.939651] Oops: 0000 [#1] SMP
      [18201.943287] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr iTCO_wdt iTCO_vendor_support wmi sg ipmi_si ipmi_devintf lpc_ich ipmi_msghandler i2c_i801 mei_me shpchp ioatdma mei auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb ptp drm mlx4_core(OE) isci ahci mpt2sas pps_core libsas libahci crct10dif_pclmul devlink dca crct10dif_common i2c_algo_bit crc32c_intel raid_class libata mlx_compat(OE) i2c_core scsi_transport_sas
      [18202.044478] CPU: 10 PID: 16503 Comm: IOR Tainted: G           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
      [18202.055578] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [18202.068135] task: ffff9c2cabe70000 ti: ffff9c2a05b28000 task.ti: ffff9c2a05b28000
      [18202.076511] RIP: 0010:[<ffffffffc0c12c80>]  [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
      [18202.087332] RSP: 0018:ffff9c2a05b2b798  EFLAGS: 00010202
      [18202.093277] RAX: 0000000000000000 RBX: ffff9c30ab70af80 RCX: 0000000000000106
      [18202.101268] RDX: ffff9c2faae20280 RSI: ffffffffc0c19710 RDI: ffff9c2faae20280
      [18202.110466] RBP: ffff9c2a05b2b7e8 R08: 0000000000000002 R09: ffffffffc0c20cf1
      [18202.119628] R10: ffff9c29ffc07900 R11: ffffffffc0c0536c R12: 00050000c0a8016c
      [18202.128791] R13: ffff9c2faae20280 R14: ffff9c2faae20280 R15: ffff9c30a73bc000
      [18202.137954] FS:  0000000000000000(0000) GS:ffff9c30ad880000(0000) knlGS:0000000000000000
      [18202.148175] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [18202.155776] CR2: 0000000000000028 CR3: 000000007320e000 CR4: 00000000000607e0
      [18202.164938] Call Trace:
      [18202.168854]  [<ffffffffc0c1531c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
      [18202.177641]  [<ffffffffc0c15987>] kiblnd_send+0x357/0xa20 [ko2iblnd]
      [18202.185950]  [<ffffffffc09deed4>] lnet_ni_send+0x44/0xd0 [lnet]
      [18202.193751]  [<ffffffffc09e6752>] lnet_send+0x82/0x1c0 [lnet]
      [18202.201346]  [<ffffffffc09e6bec>] LNetPut+0x2cc/0xb60 [lnet]
      [18202.208850]  [<ffffffffc0c7f046>] ptl_send_buf+0x146/0x530 [ptlrpc]
      [18202.216995]  [<ffffffffc0c80d3d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]
      [18202.225113]  [<ffffffffc0c766e0>] ptlrpc_send_new_req+0x460/0xa70 [ptlrpc]
      [18202.233896]  [<ffffffffc0cbeaca>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
      [18202.242650]  [<ffffffffc0c7b1c1>] ptlrpc_set_wait+0x291/0x790 [ptlrpc]
      [18202.251009]  [<ffffffffc0a7eba7>] ? lprocfs_oh_tally+0x17/0x40 [obdclass]
      [18202.259634]  [<ffffffffc0c868fa>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]
      [18202.268531]  [<ffffffffc0c7b73d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
      [18202.276916]  [<ffffffffc0e4668b>] mdc_close+0x1eb/0x980 [mdc]
      [18202.284338]  [<ffffffffc084cf04>] lmv_close+0x184/0x2a0 [lmv]
      [18202.291745]  [<ffffffffc0e8d7c0>] ll_close_inode_openhandle+0x2e0/0xcd0 [lustre]
      [18202.300976]  [<ffffffffc0e91f50>] ll_md_real_close+0xf0/0x1e0 [lustre]
      [18202.309221]  [<ffffffffc0e9265b>] ll_file_release+0x61b/0x8c0 [lustre]
      [18202.317475]  [<ffffffff8ca1d74c>] __fput+0xec/0x260
      [18202.323850]  [<ffffffff8ca1d9ae>] ____fput+0xe/0x10
      [18202.330208]  [<ffffffff8c8b803b>] task_work_run+0xbb/0xe0
      [18202.337134]  [<ffffffff8c897f21>] do_exit+0x2d1/0xa40
      [18202.343652]  [<ffffffff8ca1b538>] ? vfs_write+0x168/0x1f0
      [18202.350551]  [<ffffffff8cf206e1>] ? system_call_after_swapgs+0xae/0x146
      [18202.358817]  [<ffffffff8c89870f>] do_group_exit+0x3f/0xa0
      [18202.365704]  [<ffffffff8c898784>] SyS_exit_group+0x14/0x20
      [18202.372663]  [<ffffffff8cf20795>] system_call_fastpath+0x1c/0x21
      [18202.380187]  [<ffffffff8cf206e1>] ? system_call_after_swapgs+0xae/0x146
      [18202.388370] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 f1 0c c2 c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 10 97 c1 c0 <48> 8b 78 28 e8 07 dc 99 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1
      [18202.411764] RIP  [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
      [18202.420723]  RSP <ffff9c2a05b2b798>
      [18202.425371] CR2: 0000000000000028
      [18202.431391] ---[ end trace ccdeccc9915a17ce ]---
      [18202.509003] Kernel panic - not syncing: Fatal exception
      [18202.515717] Kernel Offset: 0xb800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
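
The faulting address can be cross-checked against the `Code:` line of the oops. The sketch below is a hand-rolled decoder for a single x86-64 instruction form (`REX.W 8B /r` with an 8-bit displacement and no SIB byte), not a general disassembler, and the function names are illustrative only. Applied to the bytes at the `<48>` marker it yields `mov 0x28(%rax),%rdi`; with RAX = 0 in the register dump, the effective address is 0x28, which matches CR2.

```python
# Low eight 64-bit registers, indexed by their 3-bit ModRM encoding.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return the byte values starting at the <..> marker of an oops 'Code:' line."""
    tokens = code_line.split("Code:", 1)[1].split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:]]


def decode_mov_load(b):
    """Decode only 'REX.W 8B /r' with mod=01 (disp8) and no SIB byte."""
    rex, opcode, modrm, disp = b[0], b[1], b[2], b[3]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a plain 64-bit MOV load")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("addressing form not handled by this sketch")
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return REG64[reg], REG64[rm], disp


# Code line and RAX value copied from the soak-19 oops above.
code = ("Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 f1 0c c2 c0 "
        "41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 10 97 c1 c0 "
        "<48> 8b 78 28 e8 07 dc 99 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1")
registers = {"rax": 0x0}

dest, base, disp = decode_mov_load(faulting_bytes(code))
fault_addr = registers[base] + disp
print(f"mov {disp:#x}(%{base}),%{dest} -> fault at {fault_addr:#018x}")
```

A load at a small fixed offset from a zero base register is what reading a structure member through a NULL pointer looks like, which is consistent with the `current->nsproxy` patch subjects recorded under Activity.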
      

          Activity

            [LU-11385] client hit BUG: unable to handle kernel NULL pointer dereference at 0000000000000028

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37314/
            Subject: LU-11385 odbclass: Handle gracefully if nsproxy is NULL
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 1e519c24829aa5173ea848cd58ef5a65d0fe9739


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37313/
            Subject: LU-11385 lnet: check if current->nsproxy is NULL before using
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: fea400b3005982322c4e5c2df9b8376398a561ce


            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37314
            Subject: LU-11385 odbclass: Handle gracefully if nsproxy is NULL
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 2e4112c41d11dc2e8d00a5e1be2deab786ee74d4


            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37313
            Subject: LU-11385 lnet: check if current->nsproxy is NULL before using
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 19cb66f129d3bc5191e84f2517c812229d37ab2d

            pjones Peter Jones added a comment -

            Landed for 2.14


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36802/
            Subject: LU-11385 odbclass: Handle gracefully if nsproxy is NULL
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 15278c6d32a5a9a7a2b8ac9e08c8702383e0c2ff


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34577/
            Subject: LU-11385 lnet: check if current->nsproxy is NULL before using
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ef1783e282f6eba9d69b0957f1b5fed00be0cbd6
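
The call traces in this ticket reach `kiblnd_connect_peer` from `do_exit` via `task_work_run`, and the patch subject says `current->nsproxy` can be NULL at that point. The following is only a toy Python model of that situation, not the merged change: the class names, the `net_for_connect` helper, the fallback to a namespace saved earlier, and the assumption that namespaces are dropped before deferred work runs are all illustrative.

```python
class Net:
    """Stand-in for a network namespace object."""
    def __init__(self, name):
        self.name = name


class NsProxy:
    """Stand-in for the per-task namespace proxy."""
    def __init__(self, net_ns):
        self.net_ns = net_ns


class Task:
    """Stand-in for a task with deferred work that runs at exit."""
    def __init__(self, nsproxy):
        self.nsproxy = nsproxy
        self.pending_work = []


def net_unguarded(task):
    # Mirrors an unconditional nsproxy->net_ns read; fails when nsproxy is gone.
    return task.nsproxy.net_ns


def net_for_connect(task, saved_net):
    # Hypothetical guard: fall back to a namespace saved earlier (assumption).
    if task.nsproxy is None:
        return saved_net
    return task.nsproxy.net_ns


def exit_task(task):
    # Assumed ordering: namespaces are released before deferred work runs.
    task.nsproxy = None
    results = [work(task) for work in task.pending_work]
    task.pending_work.clear()
    return results


init_net = Net("init")
task = Task(NsProxy(init_net))
task.pending_work.append(lambda t: net_for_connect(t, init_net))
guarded_results = exit_task(task)
```

In this model the guarded helper still returns a usable namespace after the task's `nsproxy` has been cleared, whereas `net_unguarded` would raise `AttributeError`, the Python analogue of the NULL dereference.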


            Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36802
            Subject: LU-11385 odbclass: Handle gracefully if nsproxy is NULL
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e33f4d6e5dec98de7a7d97c7d6e09edd976fecfd

            sarah Sarah Liu added a comment -

            Hit the same issue on 2.12.3-RC1 (b2_12-ib #64):

            [37224.564653] LNetError: 4595:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib (15): c: 7, oc: 0, rc: 8
            [37224.578319] LNetError: 4603:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 192.168.1.108@o2ib added to recovery queue. Health = 900
            [37224.628484] LNetError: 52343:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 192.168.1.128@o2ib added to recovery queue. Health = 900
            [37224.642005] LNetError: 52343:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 22 previous similar messages
            [37226.628596] Lustre: 4631:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1570787383/real 1570787390]  req@ffff971940601200 x1647054195451264/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1570787390 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
            [37226.660634] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
            [37243.629633] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
            [37243.638397] IP: [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
            [37243.646478] PGD 0 
            [37243.648729] Oops: 0000 [#1] SMP 
            [37243.652352] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel aesni_intel lrw gf128mul ipmi_ssif glue_helper sg lpc_ich mei_me mei ablk_helper cryptd ipmi_si pcspkr ipmi_devintf ioatdma ipmi_msghandler joydev i2c_i801 pcc_cpufreq wmi auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper isci syscopyarea igb sysfillrect sysimgblt fb_sys_fops ahci ttm libsas scsi_transport_sas libahci mlx4_core(OE) ptp drm crct10dif_pclmul pps_core crct10dif_common devlink crc32c_intel dca libata mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit
            [37243.755763] CPU: 8 PID: 53315 Comm: mdtest Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.1.1.el7.x86_64 #1
            [37243.768477] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [37243.780998] task: ffff9713b5a841c0 ti: ffff971601604000 task.ti: ffff971601604000
            [37243.789349] RIP: 0010:[<ffffffffc0ca6cd0>]  [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
            [37243.800137] RSP: 0018:ffff971601607798  EFLAGS: 00010202
            [37243.806064] RAX: 0000000000000000 RBX: ffff971bac9a6100 RCX: 0000000000000106
            [37243.814028] RDX: ffff971b142fde00 RSI: ffffffffc0cae070 RDI: ffff971b142fde00
            [37243.821990] RBP: ffff9716016077e8 R08: 0000000000000002 R09: ffffffffc0cb4cc4
            [37243.829953] R10: ffff9714ffc07900 R11: ffffffffc0c9945c R12: 00050000c0a8016c
            [37243.837916] R13: ffff971b142fde00 R14: ffff971b142fde00 R15: ffff971b94627a00
            [37243.845878] FS:  0000000000000000(0000) GS:ffff971bae200000(0000) knlGS:0000000000000000
            [37243.854906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [37243.861316] CR2: 0000000000000028 CR3: 000000030e010000 CR4: 00000000000607e0
            [37243.869279] Call Trace:
            [37243.872014]  [<ffffffffc0ca91ac>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
            [37243.879591]  [<ffffffffc0ca9817>] kiblnd_send+0x357/0xa20 [ko2iblnd]
            [37243.886691]  [<ffffffffc0a74594>] lnet_ni_send+0x44/0xd0 [lnet]
            [37243.893302]  [<ffffffffc0a7bd32>] lnet_send+0x82/0x1c0 [lnet]
            [37243.899718]  [<ffffffffc0a7c13c>] LNetPut+0x2cc/0xb50 [lnet]
            [37243.906056]  [<ffffffffc0d12856>] ptl_send_buf+0x146/0x530 [ptlrpc]
            [37243.913063]  [<ffffffffc0d1454d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]
            [37243.920070]  [<ffffffffc0d09dc0>] ptlrpc_send_new_req+0x450/0xa60 [ptlrpc]
            [37243.927760]  [<ffffffffc0d5318a>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
            [37243.935445]  [<ffffffffc0d0e891>] ptlrpc_set_wait+0x291/0x790 [ptlrpc]
            [37243.942749]  [<ffffffffc0b13e17>] ? lprocfs_oh_tally+0x17/0x40 [obdclass]
            [37243.950338]  [<ffffffffc0d1a50a>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]
            [37243.958216]  [<ffffffffc0d0ee13>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
            [37243.965593]  [<ffffffffc0edccf1>] mdc_close+0x201/0x9e0 [mdc]
            [37243.972007]  [<ffffffffc0f1d0ce>] lmv_close+0x1ae/0x3a0 [lmv]
            [37243.978429]  [<ffffffffc0f55987>] ll_close_inode_openhandle+0x2e7/0xcf0 [lustre]
            [37243.986688]  [<ffffffffc0f5a1e0>] ll_md_real_close+0xf0/0x1e0 [lustre]
            [37243.993976]  [<ffffffffc0f5a8eb>] ll_file_release+0x61b/0x8c0 [lustre]
            [37244.001262]  [<ffffffffa224a9cc>] __fput+0xec/0x260
            [37244.006705]  [<ffffffffa224ac2e>] ____fput+0xe/0x10
            [37244.012149]  [<ffffffffa20c1c0b>] task_work_run+0xbb/0xe0
            [37244.018173]  [<ffffffffa20a0f24>] do_exit+0x2d4/0xa50
            [37244.023809]  [<ffffffffa2787678>] ? __do_page_fault+0x238/0x500
            [37244.030417]  [<ffffffffa278ce21>] ? system_call_after_swapgs+0xae/0x146
            [37244.037798]  [<ffffffffa20a171f>] do_group_exit+0x3f/0xa0
            [37244.043821]  [<ffffffffa20a1794>] SyS_exit_group+0x14/0x20
            [37244.049942]  [<ffffffffa278cede>] system_call_fastpath+0x25/0x2a
            [37244.056645]  [<ffffffffa278ce21>] ? system_call_after_swapgs+0xae/0x146
            [37244.064026] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 c4 4c cb c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 70 e0 ca c0 <48> 8b 78 28 e8 57 8c 92 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1 
            [37244.085739] RIP  [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
            [37244.093907]  RSP <ffff971601607798>
            [37244.097797] CR2: 0000000000000028
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            [    0.000000] Linux version 3.10.0-1062.1.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Sep 13 22:55:44 UTC 2019
            [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.1.1.el7.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869812K
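
When several reports are compared, as in this ticket, it can help to reduce each console log to a short signature. The sketch below is one way to do that in Python; the regular expressions and the `oops_signature` name are assumptions made for illustration, written against the line formats that appear in the logs above rather than against any Lustre or kernel tooling.

```python
import re

# Patterns written against the oops lines quoted in this ticket.
OOPS_RE = re.compile(r"NULL pointer dereference at ([0-9a-f]{16})")
IP_RE = re.compile(r"IP: \[<[0-9a-f]+>\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def oops_signature(log_text):
    """Return (function, module, fault_address), or None if no oops is found."""
    oops = OOPS_RE.search(log_text)
    ip = IP_RE.search(log_text)
    if not (oops and ip):
        return None
    return ip.group(1), ip.group(2), int(oops.group(1), 16)


# Two lines copied from the 2.12.3-RC1 console log above.
sample = (
    "[37243.629633] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028\n"
    "[37243.638397] IP: [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]\n"
)
signature = oops_signature(sample)
print(signature)
```

The instruction offset inside the function is deliberately left out of the signature, since it differs between builds (`+0x70` in the two soak logs, `+0x69` in the 2.10.x report) while the function, module, and fault address stay the same.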
            
            scadmin SC Admin added a comment -

            Hi Folks,

            Just reporting that I think we're seeing this issue at Swinburne. AFAICT it is triggered simply by rebooting one of the LNet routers. It happened once last night, and I was more or less able to reproduce it this morning by rebooting the same LNet router a couple of times, which resulted in another node crashing. Unlike Sarah, we're not on 2.12/2.13; we're using 2.10.x:

             

            Lustre clients:

            • lustre-client-2.10.7
            • kernel-3.10.0-957.21.3.el7.x86_64

             

            Lustre servers/lnet router in question:

            • lustre-2.10.5-1.el7.x86_64
            • kernel-3.10.0-862.9.1.el7.x86_64

            with patches:

            • lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
            • lu11111-lfsck-gerrit32796-693fe452.ported.patch
            • lu11201-lfsckDoesntFinish-gerrit33078-4829fb05.patch
            • lu11301-stuckMdtThreads2-c43baa1c.patch
            • lu11418-hungMdtZfs-gerrit33248-eaa3c60d.diff
            • lu11418-refreshStale-gerrit33401-v4-71f409c9.diff
            • lu11418-stopOrphCleanupDaThreadSpinning-gerrit33662-45434fd0.diff
            • lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff
            • lu11663-partialPageCorruption-gerrit33748-18d6b8fb.diff

             

            Dump:

            [3114246.460458] LNet: 3543:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.55.232@o2ib: 187 seconds
            [3114990.159188] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
            [3114990.168863] IP: [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd]
            [3114990.177907] PGD 0 
            [3114990.181713] Oops: 0000 [#1] SMP 
            [3114990.186738] Modules linked in: squashfs loop 8021q garp mrp stp llc mptctl nvidia_uvm(POE) nvidia(POE) ib_ucm nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter xfs libcrc32c snd_hda_codec_hdmi iTCO_wdt gpio_ich iTCO_vendor_support dm_mod intel_powerclamp coretemp kvm_intel kvm irqbypass rdma_ucm mgag200 ib_uverbs ib_umad snd_hda_intel ttm snd_hda_codec drm_kms_helper snd_hda_core syscopyarea sysfillrect sysimgblt snd_hwdep fb_sys_fops snd_seq drm snd_seq_device snd_pcm joydev drm_panel_orientation_quirks sg lpc_ich i2c_i801 snd_timer snd ipmi_si soundcore ipmi_devintf ipmi_msghandler ioatdma i7core_edac acpi_cpufreq
            [3114990.269028]  binfmt_misc ip_tables overlay(OET) osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic pata_acpi ib_qib rdmavt crc32c_intel ib_core serio_raw ahci firewire_ohci igb libahci firewire_core mptsas pata_jmicron crc_itu_t mptscsih i2c_algo_bit libata dca mptbase ptp scsi_transport_sas pps_core [last unloaded: pcspkr]
            [3114990.319897] CPU: 3 PID: 14147 Comm: sleep Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-957.21.2.el7.x86_64 #1
            [3114990.334596] Hardware name: SGI.COM C3108-TY11/X8DAH, BIOS 1.1     03/16/2011
            [3114990.343437] task: ffff9789ed7330c0 ti: ffff9786b4280000 task.ti: ffff9786b4280000
            [3114990.352701] RIP: 0010:[<ffffffffc061f6c9>]  [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd]
            [3114990.364161] RSP: 0018:ffff9786b42837a8  EFLAGS: 00010202
            [3114990.371231] RAX: 0000000000000000 RBX: ffff978a23a63040 RCX: 0000000000000106
            [3114990.380112] RDX: ffff97840ba5b000 RSI: ffffffffc0626180 RDI: ffff97840ba5b000
            [3114990.388975] RBP: ffff9786b42837f8 R08: 0000000000000002 R09: ffff97840ba5b000
            [3114990.397832] R10: ffff977f7fc03900 R11: ffffffffc061239c R12: 00050000c0a837e8
            [3114990.406665] R13: ffff97840ba5b000 R14: ffff97840ba5b000 R15: ffff978a2454ce00
            [3114990.415484] FS:  0000000000000000(0000) GS:ffff978427ac0000(0000) knlGS:0000000000000000
            [3114990.425245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [3114990.432659] CR2: 0000000000000028 CR3: 00000001c6c10000 CR4: 00000000000207e0
            [3114990.441448] Call Trace:
            [3114990.445532]  [<ffffffffc0621d0c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
            [3114990.453945]  [<ffffffffc0622377>] kiblnd_send+0x357/0xa10 [ko2iblnd]
            [3114990.461914]  [<ffffffffc03ecbc1>] lnet_ni_send+0x41/0xd0 [lnet]
            [3114990.469412]  [<ffffffffc03f1fb7>] lnet_send+0x77/0x180 [lnet]
            [3114990.476706]  [<ffffffffc03f2305>] LNetPut+0x245/0x7a0 [lnet]
            [3114990.483924]  [<ffffffffc068c226>] ptl_send_buf+0x146/0x530 [ptlrpc]
            [3114990.491693]  [<ffffffffc03e7a44>] ? LNetMDAttach+0x3f4/0x450 [lnet]
            [3114990.499468]  [<ffffffffc068de1d>] ptl_send_rpc+0x67d/0xe60 [ptlrpc]
            [3114990.507222]  [<ffffffffc06834a8>] ptlrpc_send_new_req+0x468/0xa60 [ptlrpc]
            [3114990.515569]  [<ffffffffc06cc7ea>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
            [3114990.523884]  [<ffffffffc06880c1>] ptlrpc_set_wait+0x3d1/0x920 [ptlrpc]
            [3114990.531849]  [<ffffffffc0481899>] ? lustre_get_jobid+0x99/0x4d0 [obdclass]
            [3114990.540136]  [<ffffffffc0693855>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc]
            [3114990.548572]  [<ffffffffc068868d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            [3114990.556551]  [<ffffffffc07a258c>] mdc_close+0x1bc/0x8a0 [mdc]
            [3114990.563646]  [<ffffffffc084d07c>] lmv_close+0x21c/0x550 [lmv]
            [3114990.570719]  [<ffffffffc088d6fe>] ll_close_inode_openhandle+0x2fe/0xe20 [lustre]
            [3114990.579413]  [<ffffffffc08905c0>] ll_md_real_close+0xf0/0x1e0 [lustre]
            [3114990.587228]  [<ffffffffc0890cf8>] ll_file_release+0x648/0xa80 [lustre]
            [3114990.595011]  [<ffffffffa2243bcc>] __fput+0xec/0x260
            [3114990.601134]  [<ffffffffa2243e2e>] ____fput+0xe/0x10
            [3114990.607263]  [<ffffffffa20be90b>] task_work_run+0xbb/0xe0
            [3114990.613888]  [<ffffffffa209ddd1>] do_exit+0x2d1/0xa40
            [3114990.620135]  [<ffffffffa2770628>] ? __do_page_fault+0x228/0x4f0
            [3114990.627243]  [<ffffffffa2775d21>] ? system_call_after_swapgs+0xae/0x146
            [3114990.635018]  [<ffffffffa209e5bf>] do_group_exit+0x3f/0xa0
            [3114990.641553]  [<ffffffffa209e634>] SyS_exit_group+0x14/0x20
            [3114990.648149]  [<ffffffffa2775ddb>] system_call_fastpath+0x22/0x27
            [3114990.655242]  [<ffffffffa2775d21>] ? system_call_after_swapgs+0xae/0x146
            [3114990.662922] Code: 0f 84 82 05 00 00 65 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 80 61 62 c0 <48> 8b 78 28 e8 9e 83 c5 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c8 
            [3114990.684832] RIP  [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd]
            [3114990.693213]  RSP <ffff9786b42837a8>
            [3114990.697727] CR2: 0000000000000028
            

             

            I see some folks are poking around at the patch sets related to https://jira.whamcloud.com/browse/LU-12236 (pretty much right now, actually).

            Is there anything you would like us to test?

             

            Cheers,

            Simon

             

             

            scadmin SC Admin added a comment
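As a cross-check on the oops in the comment above, the byte marked `<48>` on its `Code:` line can be decoded by hand and compared against the register dump. The following is a minimal sketch, not a disassembler: it handles only the one instruction form that appears at the fault (`REX.W 8B /r` with an 8-bit displacement and no SIB byte). The helper names `faulting_bytes` and `decode_mov_load_disp8` are hypothetical, and the byte string, RAX and CR2 values are copied from the dump. Which structure field lives at offset 0x28 cannot be determined from the dump alone, so the sketch makes no claim about it.

```python
# Register names for ModR/M fields when no REX.R/REX.B extension bit is set.
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return the bytes of an oops 'Code:' line, starting at the <xx> marker."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [int(t.strip("<>"), 16) for t in tokens[start:]]


def decode_mov_load_disp8(b):
    """Decode only 'REX.W 8B /r' with mod=01 (disp8) and no SIB byte."""
    rex, opcode, modrm, disp = b[0], b[1], b[2], b[3]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a plain 64-bit register load")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("addressing form not handled by this sketch")
    if disp >= 0x80:  # disp8 is a signed byte
        disp -= 0x100
    return REGS64[reg], REGS64[rm], disp


# Tail of the Code: line from the oops (the leading bytes are omitted here).
code = "Code: 48 c7 c6 80 61 62 c0 <48> 8b 78 28 e8 9e 83 c5 ff"
registers = {"rax": 0x0}  # RAX from the register dump
cr2 = 0x28                # CR2 from the register dump

dest, base, disp = decode_mov_load_disp8(faulting_bytes(code))
fault_addr = registers[base] + disp
print(f"mov {disp:#x}(%{base}),%{dest} -> fault at {fault_addr:#x}")
assert fault_addr == cr2
```

The decoded instruction is a load through RAX at displacement 0x28. With RAX equal to zero this matches the reported CR2, which suggests that the pointer held in RAX at `kiblnd_connect_peer+0x69` was NULL.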

            People

              ssmirnov Serguei Smirnov
              sarah Sarah Liu
              Votes:
              0
              Watchers:
              10

              Dates

                Created:
                Updated:
                Resolved: