Lustre / LU-11385

client hit BUG: unable to handle kernel NULL pointer dereference at 0000000000000028

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3
    • Environment: lustre-master-ib build#128, tag-2.11.55, DNE mode
    • Severity: 3

    Description

      More than a third of the clients (10 of 26) hit the following error after running soak for several hours.

      soak-19 console

      [ 3041.376764] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
      [ 3041.387652] alg: No test for adler32 (adler32-zlib)
      [ 3042.300215] Lustre: Lustre: Build Version: 2.11.55
      [ 3042.529425] LNet: Using FMR for registration
      [ 3042.547347] LNet: Added LNI 192.168.1.119@o2ib [8/256/0/180]
      Sep 15 03:27:01 soak-19 TIME: Time stamp for console
      Sep 15 04:01:01 soak-19 TIME: Time stamp for console
      [ 6150.233000] Lustre: Mounted soaked-client
      [ 8489.389803] LNetError: 3654:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
      [ 8489.401106] LNetError: 3654:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 192.168.1.106@o2ib (32): c: 7, oc: 0, rc: 8
      [ 8489.414871] Lustre: 3694:0:(client.c:2126:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1536987077/real 1536987083]  req@ffff9c305ca50c00 x1611642077465472/t0(0) o400->soaked-OST0006-osc-ffff9c30acd99800@192.168.1.106@o2ib:28/4 lens 224/224 e 0 to 1 dl 1536987123 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [ 8489.448242] Lustre: soaked-OST0006-osc-ffff9c30acd99800: Connection to soaked-OST0006 (at 192.168.1.106@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [ 8686.219497] perf: interrupt took too long (2607 > 2500), lowering kernel.perf_event_max_sample_rate to 76000
      [ 8701.565834] Lustre: soaked-MDT0000-mdc-ffff9c30acd99800: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
      [ 8727.383766] LNet: 3654:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.106@o2ib: 1 seconds
      [ 8727.395186] LNet: 3654:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.106@o2ib: 5 seconds
      [ 8832.381649] LustreError: 11-0: soaked-OST0002-osc-ffff9c30acd99800: operation ost_connect to node 192.168.1.107@o2ib failed: rc = -16
      Sep 15 05:01:01 soak-19 TIME: Time stamp for console
      [ 9072.730020] perf: interrupt took too long (3297 > 3258), lowering kernel.perf_event_max_sample_rate to 60000
      [10828.883723] perf: interrupt took too long (4162 > 4121), lowering kernel.perf_event_max_sample_rate to 48000
      Sep 15 06:01:01 soak-19 TIME: Time stamp for console
      Sep 15 07:01:01 soak-19 TIME: Time stamp for console
      [18186.142873] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
      [18201.920492] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
      [18201.929290] IP: [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
      [18201.937392] PGD 0
      [18201.939651] Oops: 0000 [#1] SMP
      [18201.943287] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr iTCO_wdt iTCO_vendor_support wmi sg ipmi_si ipmi_devintf lpc_ich ipmi_msghandler i2c_i801 mei_me shpchp ioatdma mei auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb ptp drm mlx4_core(OE) isci ahci mpt2sas pps_core libsas libahci crct10dif_pclmul devlink dca crct10dif_common i2c_algo_bit crc32c_intel raid_class libata mlx_compat(OE) i2c_core scsi_transport_sas
      [18202.044478] CPU: 10 PID: 16503 Comm: IOR Tainted: G           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
      [18202.055578] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [18202.068135] task: ffff9c2cabe70000 ti: ffff9c2a05b28000 task.ti: ffff9c2a05b28000
      [18202.076511] RIP: 0010:[<ffffffffc0c12c80>]  [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
      [18202.087332] RSP: 0018:ffff9c2a05b2b798  EFLAGS: 00010202
      [18202.093277] RAX: 0000000000000000 RBX: ffff9c30ab70af80 RCX: 0000000000000106
      [18202.101268] RDX: ffff9c2faae20280 RSI: ffffffffc0c19710 RDI: ffff9c2faae20280
      [18202.110466] RBP: ffff9c2a05b2b7e8 R08: 0000000000000002 R09: ffffffffc0c20cf1
      [18202.119628] R10: ffff9c29ffc07900 R11: ffffffffc0c0536c R12: 00050000c0a8016c
      [18202.128791] R13: ffff9c2faae20280 R14: ffff9c2faae20280 R15: ffff9c30a73bc000
      [18202.137954] FS:  0000000000000000(0000) GS:ffff9c30ad880000(0000) knlGS:0000000000000000
      [18202.148175] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [18202.155776] CR2: 0000000000000028 CR3: 000000007320e000 CR4: 00000000000607e0
      [18202.164938] Call Trace:
      [18202.168854]  [<ffffffffc0c1531c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
      [18202.177641]  [<ffffffffc0c15987>] kiblnd_send+0x357/0xa20 [ko2iblnd]
      [18202.185950]  [<ffffffffc09deed4>] lnet_ni_send+0x44/0xd0 [lnet]
      [18202.193751]  [<ffffffffc09e6752>] lnet_send+0x82/0x1c0 [lnet]
      [18202.201346]  [<ffffffffc09e6bec>] LNetPut+0x2cc/0xb60 [lnet]
      [18202.208850]  [<ffffffffc0c7f046>] ptl_send_buf+0x146/0x530 [ptlrpc]
      [18202.216995]  [<ffffffffc0c80d3d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]
      [18202.225113]  [<ffffffffc0c766e0>] ptlrpc_send_new_req+0x460/0xa70 [ptlrpc]
      [18202.233896]  [<ffffffffc0cbeaca>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
      [18202.242650]  [<ffffffffc0c7b1c1>] ptlrpc_set_wait+0x291/0x790 [ptlrpc]
      [18202.251009]  [<ffffffffc0a7eba7>] ? lprocfs_oh_tally+0x17/0x40 [obdclass]
      [18202.259634]  [<ffffffffc0c868fa>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]
      [18202.268531]  [<ffffffffc0c7b73d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
      [18202.276916]  [<ffffffffc0e4668b>] mdc_close+0x1eb/0x980 [mdc]
      [18202.284338]  [<ffffffffc084cf04>] lmv_close+0x184/0x2a0 [lmv]
      [18202.291745]  [<ffffffffc0e8d7c0>] ll_close_inode_openhandle+0x2e0/0xcd0 [lustre]
      [18202.300976]  [<ffffffffc0e91f50>] ll_md_real_close+0xf0/0x1e0 [lustre]
      [18202.309221]  [<ffffffffc0e9265b>] ll_file_release+0x61b/0x8c0 [lustre]
      [18202.317475]  [<ffffffff8ca1d74c>] __fput+0xec/0x260
      [18202.323850]  [<ffffffff8ca1d9ae>] ____fput+0xe/0x10
      [18202.330208]  [<ffffffff8c8b803b>] task_work_run+0xbb/0xe0
      [18202.337134]  [<ffffffff8c897f21>] do_exit+0x2d1/0xa40
      [18202.343652]  [<ffffffff8ca1b538>] ? vfs_write+0x168/0x1f0
      [18202.350551]  [<ffffffff8cf206e1>] ? system_call_after_swapgs+0xae/0x146
      [18202.358817]  [<ffffffff8c89870f>] do_group_exit+0x3f/0xa0
      [18202.365704]  [<ffffffff8c898784>] SyS_exit_group+0x14/0x20
      [18202.372663]  [<ffffffff8cf20795>] system_call_fastpath+0x1c/0x21
      [18202.380187]  [<ffffffff8cf206e1>] ? system_call_after_swapgs+0xae/0x146
      [18202.388370] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 f1 0c c2 c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 10 97 c1 c0 <48> 8b 78 28 e8 07 dc 99 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1
      [18202.411764] RIP  [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
      [18202.420723]  RSP <ffff9c2a05b2b798>
      [18202.425371] CR2: 0000000000000028
      [18202.431391] ---[ end trace ccdeccc9915a17ce ]---
      [18202.509003] Kernel panic - not syncing: Fatal exception
      [18202.515717] Kernel Offset: 0xb800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
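
      Reading the oops: the byte marked <48> in the Code: line is where the fault was taken, and the register dump shows RAX = 0 with CR2 = 0x28. The sketch below is not part of the original report; it is a minimal Python illustration, assuming only the Code: bytes and the RAX value above, of how the faulting instruction decodes to a load from offset 0x28 of a NULL pointer held in RAX. The helper names faulting_bytes and decode_mov_load are hypothetical, and the decoder handles only this one instruction form (REX.W 8B /r with an 8-bit displacement). Which structure and field sit at that offset cannot be determined from the log alone; the call trace does show the send was issued from do_exit() via task_work_run().

```python
# Register names as numbered in the ModRM byte (valid when REX.R and REX.B are clear).
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return the bytes from the <..> marker onward as a list of ints."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [int(t.strip("<>"), 16) for t in tokens[start:]]


def decode_mov_load(b):
    """Decode only `REX.W 8B /r` with an 8-bit displacement; else return None."""
    if len(b) < 4 or b[0] != 0x48 or b[1] != 0x8B:
        return None
    mod, reg, rm = b[2] >> 6, (b[2] >> 3) & 7, b[2] & 7
    if mod != 0b01 or rm == 0b100:  # rm == 100 would require a SIB byte
        return None
    disp = b[3] - 256 if b[3] >= 0x80 else b[3]  # sign-extend disp8
    return REGS[reg], REGS[rm], disp  # (destination, base, displacement)


# Tail of the Code: line from the console log above.
CODE = "Code: b9 06 01 00 00 4c 89 ea 48 c7 c6 10 97 c1 c0 <48> 8b 78 28 e8 07 dc 99 ff"

dest, base, disp = decode_mov_load(faulting_bytes(CODE))
registers = {"rax": 0x0}  # RAX value taken from the register dump
fault_addr = registers[base] + disp
print(f"mov {disp:#x}(%{base}),%{dest} -> fault address {fault_addr:#018x}")
# prints: mov 0x28(%rax),%rdi -> fault address 0x0000000000000028
```

      The computed address matches CR2, which is consistent with a NULL structure pointer being dereferenced at kiblnd_connect_peer+0x70 rather than with a corrupted or wild pointer.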
      

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: sarah Sarah Liu