[LU-11385] client hit BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 Created: 17/Sep/18  Updated: 27/Jan/20  Resolved: 23/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3
Fix Version/s: Lustre 2.14.0, Lustre 2.12.4

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: soak
Environment:

lustre-master-ib build#128 tag-2.11.55 . DNE mode


Attachments: File ko2iblnd.ko    
Issue Links:
Duplicate
Related
is related to LU-12236 Support more than the default root ne... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Nearly 1/3 of the clients(10/26 clients) hit following error when running soak for several hours

soak-19 console

[ 3041.376764] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2^M
[ 3041.387652] alg: No test for adler32 (adler32-zlib)^M
[ 3042.300215] Lustre: Lustre: Build Version: 2.11.55^M
[ 3042.529425] LNet: Using FMR for registration^M
[ 3042.547347] LNet: Added LNI 192.168.1.119@o2ib [8/256/0/180]^M
Sep 15 03:27:01 soak-19 TIME: Time stamp for console^M
Sep 15 04:01:01 soak-19 TIME: Time stamp for console^M
[ 6150.233000] Lustre: Mounted soaked-client^M
[ 8489.389803] LNetError: 3654:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds^M
[ 8489.401106] LNetError: 3654:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 192.168.1.106@o2ib (32): c: 7, oc: 0, rc: 8^M
[ 8489.414871] Lustre: 3694:0:(client.c:2126:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1536987077/real 1536987083]  req@ffff9c305ca50c00 x1611642077465472/t0(0) o400->soaked-OST0006-osc-ffff9c30acd99800@192.168.1.106@o2ib:28/4 lens 224/224 e 0 to 1 dl 1536987123 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1^M
[ 8489.448242] Lustre: soaked-OST0006-osc-ffff9c30acd99800: Connection to soaked-OST0006 (at 192.168.1.106@o2ib) was lost; in progress operations using this service will wait for recovery to complete^M
[ 8686.219497] perf: interrupt took too long (2607 > 2500), lowering kernel.perf_event_max_sample_rate to 76000^M
[ 8701.565834] Lustre: soaked-MDT0000-mdc-ffff9c30acd99800: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)^M
[ 8727.383766] LNet: 3654:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.106@o2ib: 1 seconds^M
[ 8727.395186] LNet: 3654:0:(o2iblnd_cb.c:3370:kiblnd_check_conns()) Timed out tx for 192.168.1.106@o2ib: 5 seconds^M
[ 8832.381649] LustreError: 11-0: soaked-OST0002-osc-ffff9c30acd99800: operation ost_connect to node 192.168.1.107@o2ib failed: rc = -16^M
Sep 15 05:01:01 soak-19 TIME: Time stamp for console^M
[ 9072.730020] perf: interrupt took too long (3297 > 3258), lowering kernel.perf_event_max_sample_rate to 60000^M
[10828.883723] perf: interrupt took too long (4162 > 4121), lowering kernel.perf_event_max_sample_rate to 48000^M
Sep 15 06:01:01 soak-19 TIME: Time stamp for console^M
Sep 15 07:01:01 soak-19 TIME: Time stamp for console^M
[18186.142873] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail^M
[18201.920492] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028^M
[18201.929290] IP: [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]^M
[18201.937392] PGD 0 ^M
[18201.939651] Oops: 0000 [#1] SMP ^M
[18201.943287] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr iTCO_wdt iTCO_vendor_support wmi sg ipmi_si ipmi_devintf lpc_ich ipmi_msghandler i2c_i801 mei_me shpchp ioatdma mei auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb ptp drm mlx4_core(OE) isci ahci mpt2sas pps_core libsas libahci crct10dif_pclmul devlink dca crct10dif_common i2c_algo_bit crc32c_intel raid_class libata mlx_compat(OE) i2c_core scsi_transport_sas^M
[18202.044478] CPU: 10 PID: 16503 Comm: IOR Tainted: G           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1^M
[18202.055578] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
[18202.068135] task: ffff9c2cabe70000 ti: ffff9c2a05b28000 task.ti: ffff9c2a05b28000^M
[18202.076511] RIP: 0010:[<ffffffffc0c12c80>]  [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]^M
[18202.087332] RSP: 0018:ffff9c2a05b2b798  EFLAGS: 00010202^M
[18202.093277] RAX: 0000000000000000 RBX: ffff9c30ab70af80 RCX: 0000000000000106^M
[18202.101268] RDX: ffff9c2faae20280 RSI: ffffffffc0c19710 RDI: ffff9c2faae20280^M
[18202.110466] RBP: ffff9c2a05b2b7e8 R08: 0000000000000002 R09: ffffffffc0c20cf1^M
[18202.119628] R10: ffff9c29ffc07900 R11: ffffffffc0c0536c R12: 00050000c0a8016c^M
[18202.128791] R13: ffff9c2faae20280 R14: ffff9c2faae20280 R15: ffff9c30a73bc000^M
[18202.137954] FS:  0000000000000000(0000) GS:ffff9c30ad880000(0000) knlGS:0000000000000000^M
[18202.148175] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[18202.155776] CR2: 0000000000000028 CR3: 000000007320e000 CR4: 00000000000607e0^M
[18202.164938] Call Trace:^M
[18202.168854]  [<ffffffffc0c1531c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]^M
[18202.177641]  [<ffffffffc0c15987>] kiblnd_send+0x357/0xa20 [ko2iblnd]^M
[18202.185950]  [<ffffffffc09deed4>] lnet_ni_send+0x44/0xd0 [lnet]^M
[18202.193751]  [<ffffffffc09e6752>] lnet_send+0x82/0x1c0 [lnet]^M
[18202.201346]  [<ffffffffc09e6bec>] LNetPut+0x2cc/0xb60 [lnet]^M
[18202.208850]  [<ffffffffc0c7f046>] ptl_send_buf+0x146/0x530 [ptlrpc]^M
[18202.216995]  [<ffffffffc0c80d3d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]^M
[18202.225113]  [<ffffffffc0c766e0>] ptlrpc_send_new_req+0x460/0xa70 [ptlrpc]^M
[18202.233896]  [<ffffffffc0cbeaca>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]^M
[18202.242650]  [<ffffffffc0c7b1c1>] ptlrpc_set_wait+0x291/0x790 [ptlrpc]^M
[18202.251009]  [<ffffffffc0a7eba7>] ? lprocfs_oh_tally+0x17/0x40 [obdclass]^M
[18202.259634]  [<ffffffffc0c868fa>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]^M
[18202.268531]  [<ffffffffc0c7b73d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]^M
[18202.276916]  [<ffffffffc0e4668b>] mdc_close+0x1eb/0x980 [mdc]^M
[18202.284338]  [<ffffffffc084cf04>] lmv_close+0x184/0x2a0 [lmv]^M
[18202.291745]  [<ffffffffc0e8d7c0>] ll_close_inode_openhandle+0x2e0/0xcd0 [lustre]^M
[18202.300976]  [<ffffffffc0e91f50>] ll_md_real_close+0xf0/0x1e0 [lustre]^M
[18202.309221]  [<ffffffffc0e9265b>] ll_file_release+0x61b/0x8c0 [lustre]^M
[18202.317475]  [<ffffffff8ca1d74c>] __fput+0xec/0x260^M
[18202.323850]  [<ffffffff8ca1d9ae>] ____fput+0xe/0x10^M
[18202.330208]  [<ffffffff8c8b803b>] task_work_run+0xbb/0xe0^M
[18202.337134]  [<ffffffff8c897f21>] do_exit+0x2d1/0xa40^M
[18202.343652]  [<ffffffff8ca1b538>] ? vfs_write+0x168/0x1f0^M
[18202.350551]  [<ffffffff8cf206e1>] ? system_call_after_swapgs+0xae/0x146^M
[18202.358817]  [<ffffffff8c89870f>] do_group_exit+0x3f/0xa0^M
[18202.365704]  [<ffffffff8c898784>] SyS_exit_group+0x14/0x20^M
[18202.372663]  [<ffffffff8cf20795>] system_call_fastpath+0x1c/0x21^M
[18202.380187]  [<ffffffff8cf206e1>] ? system_call_after_swapgs+0xae/0x146^M
[18202.388370] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 f1 0c c2 c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 10 97 c1 c0 <48> 8b 78 28 e8 07 dc 99 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1 ^M
[18202.411764] RIP  [<ffffffffc0c12c80>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]^M
[18202.420723]  RSP <ffff9c2a05b2b798>^M
[18202.425371] CR2: 0000000000000028^M
[18202.431391] ---[ end trace ccdeccc9915a17ce ]---^M
[18202.509003] Kernel panic - not syncing: Fatal exception^M
[18202.515717] Kernel Offset: 0xb800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)^M


 Comments   
Comment by Peter Jones [ 18/Sep/18 ]

Amir

Is there any concern that this issue might have been introduced with recent LNet changes?

Peter

Comment by Amir Shehata (Inactive) [ 19/Sep/18 ]

Looking at this it appears that the crash occurs here:

 130 #ifdef HAVE_RDMA_CREATE_ID_5ARG
 131 # define kiblnd_rdma_create_id(cb, dev, ps, qpt) rdma_create_id(current->nsproxy->net_ns, \
 132 »·······»·······»·······»·······»·······»·······»·······»·······cb, dev, \
 133 »·······»·······»·······»·······»·······»·······»·······»·······ps, qpt)
HAVE_RDMA_CREATE_ID_5ARG is defined

(gdb) print/x (int)&((struct nsproxy*)0)->net_ns
$8 = 0x28

From the crash info attached to the bug it looks like we're trying to access a field at offset 0x28

this looks to be current->nsproxy->net_ns

Would we be able to try and reproduce but with a kernel ofed build instead of a mofed build?

Also if we're able to collect the crash dump that would be helpful.

Comment by Peter Jones [ 25/Sep/18 ]

sarah

Is this issue still reoccurring while soak is running?

Peter

Comment by Sarah Liu [ 25/Sep/18 ]

No, it cannot be reproduced at this time with over 24h running

Comment by James A Simmons [ 08/Mar/19 ]

The problem is the use of current->nsproxy->net_ns. It could be network namespace is not even enabled in the kernel. The only place I see this used by the infinband layer is in the cma code for userland to kernel space interaction which makes sense since user space could be placed in an network namespace. The proper way to handle this is to use kobj_ns_grab_current(KOBJ_NS_TYPE_NET) but that only exist in the most recent kernels. I would suggest at this point to change "current->nsproxy->net_ns" to init_net which will always exist.

Comment by Gerrit Updater [ 02/Apr/19 ]

Sonia Sharma (sharmaso@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34577
Subject: LU-11385 lnet: check if current->nsproxy is NULL before using
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c49deb6566ca2ef531e2ab6d68937a857b7dc260

Comment by Sarah Liu [ 09/Apr/19 ]

soak hit the problem again when running b2_12-next build#6 on 2 clients when running about 24 hours
soak-40.log-20190407

[81137.079621] Lustre: Skipped 1 previous similar message
[84291.211419] LNetError: 14031:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[84291.222786] LNetError: 14031:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib (11): c: 7, oc: 0, rc: 8
[84291.236612] Lustre: 14052:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1554566512/real 1554566519]  req@ffff8e0101531
e00 x1629993278173792/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1554566519 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[84291.268771] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[84300.542162] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[84300.550923] IP: [<ffffffffc0800cc0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
[84300.559000] PGD 0 
[84300.561254] Oops: 0000 [#1] SMP 
[84300.564873] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfs
v4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_e
n(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO
_wdt iTCO_vendor_support mxm_wmi ipmi_ssif mei_me pcspkr joydev mei sg lpc_ich i2c_i801 ioatdma ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad acpi_power_meter auth_rpcgss su
nrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb ahci ml
x4_core(OE) drm crct10dif_pclmul ptp crct10dif_common libahci crc32c_intel pps_core devlink dca libata mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit
[84300.667360] CPU: 9 PID: 114559 Comm: mdtest Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7.x86_64 #1
[84300.680065] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
[84300.691905] task: ffff8dfa8f7b6180 ti: ffff8df4cb728000 task.ti: ffff8df4cb728000
[84300.700255] RIP: 0010:[<ffffffffc0800cc0>]  [<ffffffffc0800cc0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
[84300.711042] RSP: 0018:ffff8df4cb72b750  EFLAGS: 00010202
[84300.716968] RAX: 0000000000000000 RBX: ffff8e02842061c0 RCX: 0000000000000106
[84300.724930] RDX: ffff8dfbce62cb80 RSI: ffffffffc0807750 RDI: ffff8dfbce62cb80
[84300.732893] RBP: ffff8df4cb72b7a0 R08: 0000000000000002 R09: ffffffffc080ecf1
[84300.740855] R10: ffff8df3bfc07900 R11: ffffffffffffff83 R12: 00050000c0a8016c
[84300.748817] R13: ffff8dfbce62cb80 R14: ffff8dfbce62cb80 R15: ffff8e029b251600
[84300.756780] FS:  0000000000000000(0000) GS:ffff8dfa9e840000(0000) knlGS:0000000000000000
[84300.765809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[84300.772220] CR2: 0000000000000028 CR3: 0000000c73610000 CR4: 00000000003607e0
[84300.780181] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[84300.788144] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[84300.796104] Call Trace:
[84300.798837]  [<ffffffffc080335c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
[84300.806416]  [<ffffffffc08039c7>] kiblnd_send+0x357/0xa20 [ko2iblnd]
[84300.813521]  [<ffffffffc0b1f1c4>] lnet_ni_send+0x44/0xd0 [lnet]
[84300.820133]  [<ffffffffc0b26e92>] lnet_send+0x82/0x1c0 [lnet]
[84300.826550]  [<ffffffffc0b2729c>] LNetPut+0x2cc/0xb50 [lnet]
[84300.832893]  [<ffffffffc0dbe786>] ptl_send_buf+0x146/0x530 [ptlrpc]
[84300.839904]  [<ffffffffc0dc047d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]
[84300.846914]  [<ffffffffc0db5b00>] ptlrpc_send_new_req+0x450/0xa60 [ptlrpc]
[84300.854604]  [<ffffffffc0dba728>] ptlrpc_set_wait+0x3f8/0x8d0 [ptlrpc]
[84300.861905]  [<ffffffffc0dc4840>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc]
[84300.869594]  [<ffffffffc0dc4857>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
[84300.876807]  [<ffffffffc0bfaf15>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
[84300.884588]  [<ffffffffc0bc0e47>] ? lprocfs_oh_tally+0x17/0x40 [obdclass]
[84300.892171]  [<ffffffffc0dc610a>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]
[84300.900053]  [<ffffffffc0dbac83>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[84300.907439]  [<ffffffffc0d4680b>] mdc_close+0x1eb/0x980 [mdc]
[84300.913855]  [<ffffffffc0953bee>] lmv_close+0x1ae/0x3a0 [lmv]
[84300.920280]  [<ffffffffc0f93857>] ll_close_inode_openhandle+0x2e7/0xcf0 [lustre]
[84300.928541]  [<ffffffffc0f989f0>] ll_md_real_close+0xf0/0x1e0 [lustre]
[84300.935828]  [<ffffffffc0f990fb>] ll_file_release+0x61b/0x8c0 [lustre]
[84300.943114]  [<ffffffffae84364c>] __fput+0xec/0x260
[84300.948555]  [<ffffffffae8438ae>] ____fput+0xe/0x10
[84300.953999]  [<ffffffffae6be79b>] task_work_run+0xbb/0xe0
[84300.960023]  [<ffffffffae69dc61>] do_exit+0x2d1/0xa40
[84300.965663]  [<ffffffffaed6f608>] ? __do_page_fault+0x228/0x500
[84300.972268]  [<ffffffffae69e44f>] do_group_exit+0x3f/0xa0
[84300.978290]  [<ffffffffae69e4c4>] SyS_exit_group+0x14/0x20
[84300.984411]  [<ffffffffaed74ddb>] system_call_fastpath+0x22/0x27
[84300.991110] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 f1 ec 80 c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 50 77 80 c0 <48> 8b 78 28 e8 97 bd e9 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1 
[84301.012866] RIP  [<ffffffffc0800cc0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
[84301.021032]  RSP <ffff8df4cb72b750>
Comment by Sarah Liu [ 16/Apr/19 ]

A quick update, soak has been running with patch #34577 for almost 3 days, not seeing the bug so far.

another update, soak has been up for 5 days, not seeing the bug. One thing to clarify, the bug was only hit on b2_12 branch, the patch is based on master

Comment by Sarah Liu [ 07/May/19 ]

2 client nodes hit the same problem on master overnight, lustre-master-ib build #225 version 2.12.53_13_g4191e0c

Comment by Sonia Sharma (Inactive) [ 07/May/19 ]

Just checking here that the soak is running with the below patch?

https://review.whamcloud.com/34577

Comment by James A Simmons [ 07/May/19 ]

I think the solution from LU-12236 is a more complete solution to this problem.

Comment by Sarah Liu [ 07/May/19 ]

Just checking here that the soak is running with the below patch?

https://review.whamcloud.com/34577

No, the current build is tip of master which doesn't have the patch. I can test the patch or the one of LU-12236 in a few days(to let soak stays up for longer time).

Comment by SC Admin (Inactive) [ 19/Jul/19 ]

Hi Folks,

Just reporting that I think we're seeing this issue at Swinburne. AFAICT is triggered by just rebooting one of the LNet routers. Happened once last night, and I was somewhat  able to re-produce it this morning by rebooting the same LNet router a couple of times which resulted in another node crashing. Unlike Sarah, we're not on 2.12/2.13, we're using 2.10.x:

 

Lustre clients:

  • lustre-client-2.10.7
  • kernel-3.10.0-957.21.3.el7.x86_64

 

Lustre servers/lnet router in question:

  • lustre-2.10.5-1.el7.x86_64
  • kernel-3.10.0-862.9.1.el7.x86_64

with patches:

  • lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
  • lu11111-lfsck-gerrit32796-693fe452.ported.patch
  • lu11201-lfsckDoesntFinish-gerrit33078-4829fb05.patch
  • lu11301-stuckMdtThreads2-c43baa1c.patch
  • lu11418-hungMdtZfs-gerrit33248-eaa3c60d.diff
  • lu11418-refreshStale-gerrit33401-v4-71f409c9.diff
  • lu11418-stopOrphCleanupDaThreadSpinning-gerrit33662-45434fd0.diff
  • lu11419-lfsckDoesntFinish-gerrit33252-22503a1d.diff
  • lu11663-partialPageCorruption-gerrit33748-18d6b8fb.diff

 

Dump:

[3114246.460458] LNet: 3543:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.55.232@o2ib: 187 seconds
[3114990.159188] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[3114990.168863] IP: [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd]
[3114990.177907] PGD 0 
[3114990.181713] Oops: 0000 [#1] SMP 
[3114990.186738] Modules linked in: squashfs loop 8021q garp mrp stp llc mptctl nvidia_uvm(POE) nvidia(POE) ib_ucm nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter xfs libcrc32c snd_hda_codec_hdmi iTCO_wdt gpio_ich iTCO_vendor_support dm_mod intel_powerclamp coretemp kvm_intel kvm irqbypass rdma_ucm mgag200 ib_uverbs ib_umad snd_hda_intel ttm snd_hda_codec drm_kms_helper snd_hda_core syscopyarea sysfillrect sysimgblt snd_hwdep fb_sys_fops snd_seq drm snd_seq_device snd_pcm joydev drm_panel_orientation_quirks sg lpc_ich i2c_i801 snd_timer snd ipmi_si soundcore ipmi_devintf ipmi_msghandler ioatdma i7core_edac acpi_cpufreq
[3114990.269028]  binfmt_misc ip_tables overlay(OET) osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic pata_acpi ib_qib rdmavt crc32c_intel ib_core serio_raw ahci firewire_ohci igb libahci firewire_core mptsas pata_jmicron crc_itu_t mptscsih i2c_algo_bit libata dca mptbase ptp scsi_transport_sas pps_core [last unloaded: pcspkr]
[3114990.319897] CPU: 3 PID: 14147 Comm: sleep Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-957.21.2.el7.x86_64 #1
[3114990.334596] Hardware name: SGI.COM C3108-TY11/X8DAH, BIOS 1.1     03/16/2011
[3114990.343437] task: ffff9789ed7330c0 ti: ffff9786b4280000 task.ti: ffff9786b4280000
[3114990.352701] RIP: 0010:[<ffffffffc061f6c9>]  [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd]
[3114990.364161] RSP: 0018:ffff9786b42837a8  EFLAGS: 00010202
[3114990.371231] RAX: 0000000000000000 RBX: ffff978a23a63040 RCX: 0000000000000106
[3114990.380112] RDX: ffff97840ba5b000 RSI: ffffffffc0626180 RDI: ffff97840ba5b000
[3114990.388975] RBP: ffff9786b42837f8 R08: 0000000000000002 R09: ffff97840ba5b000
[3114990.397832] R10: ffff977f7fc03900 R11: ffffffffc061239c R12: 00050000c0a837e8
[3114990.406665] R13: ffff97840ba5b000 R14: ffff97840ba5b000 R15: ffff978a2454ce00
[3114990.415484] FS:  0000000000000000(0000) GS:ffff978427ac0000(0000) knlGS:0000000000000000
[3114990.425245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3114990.432659] CR2: 0000000000000028 CR3: 00000001c6c10000 CR4: 00000000000207e0
[3114990.441448] Call Trace:
[3114990.445532]  [<ffffffffc0621d0c>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
[3114990.453945]  [<ffffffffc0622377>] kiblnd_send+0x357/0xa10 [ko2iblnd]
[3114990.461914]  [<ffffffffc03ecbc1>] lnet_ni_send+0x41/0xd0 [lnet]
[3114990.469412]  [<ffffffffc03f1fb7>] lnet_send+0x77/0x180 [lnet]
[3114990.476706]  [<ffffffffc03f2305>] LNetPut+0x245/0x7a0 [lnet]
[3114990.483924]  [<ffffffffc068c226>] ptl_send_buf+0x146/0x530 [ptlrpc]
[3114990.491693]  [<ffffffffc03e7a44>] ? LNetMDAttach+0x3f4/0x450 [lnet]
[3114990.499468]  [<ffffffffc068de1d>] ptl_send_rpc+0x67d/0xe60 [ptlrpc]
[3114990.507222]  [<ffffffffc06834a8>] ptlrpc_send_new_req+0x468/0xa60 [ptlrpc]
[3114990.515569]  [<ffffffffc06cc7ea>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
[3114990.523884]  [<ffffffffc06880c1>] ptlrpc_set_wait+0x3d1/0x920 [ptlrpc]
[3114990.531849]  [<ffffffffc0481899>] ? lustre_get_jobid+0x99/0x4d0 [obdclass]
[3114990.540136]  [<ffffffffc0693855>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc]
[3114990.548572]  [<ffffffffc068868d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[3114990.556551]  [<ffffffffc07a258c>] mdc_close+0x1bc/0x8a0 [mdc]
[3114990.563646]  [<ffffffffc084d07c>] lmv_close+0x21c/0x550 [lmv]
[3114990.570719]  [<ffffffffc088d6fe>] ll_close_inode_openhandle+0x2fe/0xe20 [lustre]
[3114990.579413]  [<ffffffffc08905c0>] ll_md_real_close+0xf0/0x1e0 [lustre]
[3114990.587228]  [<ffffffffc0890cf8>] ll_file_release+0x648/0xa80 [lustre]
[3114990.595011]  [<ffffffffa2243bcc>] __fput+0xec/0x260
[3114990.601134]  [<ffffffffa2243e2e>] ____fput+0xe/0x10
[3114990.607263]  [<ffffffffa20be90b>] task_work_run+0xbb/0xe0
[3114990.613888]  [<ffffffffa209ddd1>] do_exit+0x2d1/0xa40
[3114990.620135]  [<ffffffffa2770628>] ? __do_page_fault+0x228/0x4f0
[3114990.627243]  [<ffffffffa2775d21>] ? system_call_after_swapgs+0xae/0x146
[3114990.635018]  [<ffffffffa209e5bf>] do_group_exit+0x3f/0xa0
[3114990.641553]  [<ffffffffa209e634>] SyS_exit_group+0x14/0x20
[3114990.648149]  [<ffffffffa2775ddb>] system_call_fastpath+0x22/0x27
[3114990.655242]  [<ffffffffa2775d21>] ? system_call_after_swapgs+0xae/0x146
[3114990.662922] Code: 0f 84 82 05 00 00 65 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 80 61 62 c0 <48> 8b 78 28 e8 9e 83 c5 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c8 
[3114990.684832] RIP  [<ffffffffc061f6c9>] kiblnd_connect_peer+0x69/0x660 [ko2iblnd]
[3114990.693213]  RSP <ffff9786b42837a8>
[3114990.697727] CR2: 0000000000000028

 

I see some folks poking around at the patch sets related to: https://jira.whamcloud.com/browse/LU-12236   (pretty much right ~now actually). 

Suggestions for us to test anything?

 

Cheers,

Simon

 

 

Comment by Sarah Liu [ 14/Oct/19 ]

Hit same issue on 2.12.3-RC1(b2_12-ib #64)

[37224.564653] LNetError: 4595:0:(o2iblnd_cb.c:3425:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib (15): c: 7, oc: 0, rc: 8
[37224.578319] LNetError: 4603:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 192.168.1.108@o2ib added to recovery queue. Health = 900
[37224.628484] LNetError: 52343:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 192.168.1.128@o2ib added to recovery queue. Health = 900
[37224.642005] LNetError: 52343:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 22 previous similar messages
[37226.628596] Lustre: 4631:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1570787383/real 1570787390]  
req@ffff971940601200 x1647054195451264/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1570787390 ref 1 fl Rpc:eXN/0/ffffff
ff rc 0/-1
[37226.660634] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fa
il
[37243.629633] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[37243.638397] IP: [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
[37243.646478] PGD 0 
[37243.648729] Oops: 0000 [#1] SMP 
[37243.652352] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) r
pcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_cor
e(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmul
ni_intel aesni_intel lrw gf128mul ipmi_ssif glue_helper sg lpc_ich mei_me mei ablk_helper cryptd ipmi_si pcspkr ipmi_devintf ioatdma ipmi_msghandler joydev i2c
_i801 pcc_cpufreq wmi auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_
helper isci syscopyarea igb sysfillrect sysimgblt fb_sys_fops ahci ttm libsas scsi_transport_sas libahci mlx4_core(OE) ptp drm crct10dif_pclmul pps_core crct10
dif_common devlink crc32c_intel dca libata mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit
[37243.755763] CPU: 8 PID: 53315 Comm: mdtest Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.1.1.el7.x86_64 #1
[37243.768477] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[37243.780998] task: ffff9713b5a841c0 ti: ffff971601604000 task.ti: ffff971601604000
[37243.789349] RIP: 0010:[<ffffffffc0ca6cd0>]  [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
[37243.800137] RSP: 0018:ffff971601607798  EFLAGS: 00010202
[37243.806064] RAX: 0000000000000000 RBX: ffff971bac9a6100 RCX: 0000000000000106
[37243.814028] RDX: ffff971b142fde00 RSI: ffffffffc0cae070 RDI: ffff971b142fde00
[37243.821990] RBP: ffff9716016077e8 R08: 0000000000000002 R09: ffffffffc0cb4cc4
[37243.829953] R10: ffff9714ffc07900 R11: ffffffffc0c9945c R12: 00050000c0a8016c
[37243.837916] R13: ffff971b142fde00 R14: ffff971b142fde00 R15: ffff971b94627a00
[37243.845878] FS:  0000000000000000(0000) GS:ffff971bae200000(0000) knlGS:0000000000000000
[37243.854906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37243.861316] CR2: 0000000000000028 CR3: 000000030e010000 CR4: 00000000000607e0
[37243.869279] Call Trace:
[37243.872014]  [<ffffffffc0ca91ac>] kiblnd_launch_tx+0x90c/0xc20 [ko2iblnd]
[37243.879591]  [<ffffffffc0ca9817>] kiblnd_send+0x357/0xa20 [ko2iblnd]
[37243.886691]  [<ffffffffc0a74594>] lnet_ni_send+0x44/0xd0 [lnet]
[37243.893302]  [<ffffffffc0a7bd32>] lnet_send+0x82/0x1c0 [lnet]
[37243.899718]  [<ffffffffc0a7c13c>] LNetPut+0x2cc/0xb50 [lnet]
[37243.906056]  [<ffffffffc0d12856>] ptl_send_buf+0x146/0x530 [ptlrpc]
[37243.913063]  [<ffffffffc0d1454d>] ptl_send_rpc+0x69d/0xe70 [ptlrpc]
[37243.920070]  [<ffffffffc0d09dc0>] ptlrpc_send_new_req+0x450/0xa60 [ptlrpc]
[37243.927760]  [<ffffffffc0d5318a>] ? null_alloc_reqbuf+0x19a/0x3a0 [ptlrpc]
[37243.935445]  [<ffffffffc0d0e891>] ptlrpc_set_wait+0x291/0x790 [ptlrpc]
[37243.942749]  [<ffffffffc0b13e17>] ? lprocfs_oh_tally+0x17/0x40 [obdclass]
[37243.950338]  [<ffffffffc0d1a50a>] ? lustre_msg_set_jobid+0x9a/0x110 [ptlrpc]
[37243.958216]  [<ffffffffc0d0ee13>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[37243.965593]  [<ffffffffc0edccf1>] mdc_close+0x201/0x9e0 [mdc]
[37243.972007]  [<ffffffffc0f1d0ce>] lmv_close+0x1ae/0x3a0 [lmv]
[37243.978429]  [<ffffffffc0f55987>] ll_close_inode_openhandle+0x2e7/0xcf0 [lustre]
[37243.986688]  [<ffffffffc0f5a1e0>] ll_md_real_close+0xf0/0x1e0 [lustre]
[37243.993976]  [<ffffffffc0f5a8eb>] ll_file_release+0x61b/0x8c0 [lustre]
[37244.001262]  [<ffffffffa224a9cc>] __fput+0xec/0x260
[37244.006705]  [<ffffffffa224ac2e>] ____fput+0xe/0x10
[37244.012149]  [<ffffffffa20c1c0b>] task_work_run+0xbb/0xe0
[37244.018173]  [<ffffffffa20a0f24>] do_exit+0x2d4/0xa50
[37244.023809]  [<ffffffffa2787678>] ? __do_page_fault+0x238/0x500
[37244.030417]  [<ffffffffa278ce21>] ? system_call_after_swapgs+0xae/0x146
[37244.037798]  [<ffffffffa20a171f>] do_group_exit+0x3f/0xa0
[37244.043821]  [<ffffffffa20a1794>] SyS_exit_group+0x14/0x20
[37244.049942]  [<ffffffffa278cede>] system_call_fastpath+0x25/0x2a
[37244.056645]  [<ffffffffa278ce21>] ? system_call_after_swapgs+0xae/0x146
[37244.064026] Code: 48 8b 04 25 80 0e 01 00 48 8b 80 60 07 00 00 49 c7 c1 c4 4c cb c0 41 b8 02 00 00 00 b9 06 01 00 00 4c 89 ea 48 c7 c6 70 e0 ca c0 <48> 8b 78 28 e8 57 8c 92 ff 48 3d 00 f0 ff ff 49 89 c6 0f 87 c1 
[37244.085739] RIP  [<ffffffffc0ca6cd0>] kiblnd_connect_peer+0x70/0x660 [ko2iblnd]
[37244.093907]  RSP <ffff971601607798>
[37244.097797] CR2: 0000000000000028
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-1062.1.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Sep 13 22:55:44 UTC 2019
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.1.1.el7.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869812K
Comment by Gerrit Updater [ 19/Nov/19 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36802
Subject: LU-11385 odbclass: Handle gracefully if nsproxy is NULL
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e33f4d6e5dec98de7a7d97c7d6e09edd976fecfd

Comment by Gerrit Updater [ 14/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34577/
Subject: LU-11385 lnet: check if current->nsproxy is NULL before using
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ef1783e282f6eba9d69b0957f1b5fed00be0cbd6

Comment by Gerrit Updater [ 23/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36802/
Subject: LU-11385 odbclass: Handle gracefully if nsproxy is NULL
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 15278c6d32a5a9a7a2b8ac9e08c8702383e0c2ff

Comment by Peter Jones [ 23/Jan/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 23/Jan/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37313
Subject: LU-11385 lnet: check if current->nsproxy is NULL before using
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 19cb66f129d3bc5191e84f2517c812229d37ab2d

Comment by Gerrit Updater [ 23/Jan/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37314
Subject: LU-11385 odbclass: Handle gracefully if nsproxy is NULL
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 2e4112c41d11dc2e8d00a5e1be2deab786ee74d4

Comment by Gerrit Updater [ 27/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37313/
Subject: LU-11385 lnet: check if current->nsproxy is NULL before using
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: fea400b3005982322c4e5c2df9b8376398a561ce

Comment by Gerrit Updater [ 27/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37314/
Subject: LU-11385 odbclass: Handle gracefully if nsproxy is NULL
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 1e519c24829aa5173ea848cd58ef5a65d0fe9739

Generated at Sat Feb 10 02:43:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.