[LU-11624]  BUG: unable to handle kernel NULL pointer at nid_hash() Created: 06/Nov/18  Updated: 10/Nov/18  Resolved: 10/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Shuichi Ihara Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

master (ae828cd)


Issue Links:
Related
is related to LU-8130 Migrate from libcfs hash to rhashtable Open
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

At the first Lustre mount after creating a new filesystem, multiple servers crashed when clients mounted. The trace is below.

[29913.329459] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[29913.330932] IP: [<ffffffffc090ca4d>] nid_hash+0x2d/0x50 [obdclass]
[29913.331962] PGD 800000169ea73067 PUD 169de48067 PMD 0 
[29913.332841] Oops: 0000 [#1] SMP 
[29913.333408] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) ksocklnd(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) virtio_scsi(OE) sd_mod crc_t10dif crct10dif_generic rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) sunrpc ppdev iTCO_wdt sb_edac iTCO_vendor_support iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc joydev pcspkr sg i2c_i801 parport lpc_ich i6300esb ip_tables ext4 mbcache jbd2 virtio_net virtio_blk mlx5_ib(OE) ib_core(OE) sr_mod cdrom bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm
[29913.345975]  libahci libata mlx5_core(OE) crct10dif_pclmul crct10dif_common crc32c_intel mlxfw(OE) ptp pps_core serio_raw virtio_pci i2c_core devlink igbvf virtio_ring virtio mlx_compat(OE) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: virtio_scsi]
[29913.349795] CPU: 8 PID: 25289 Comm: ll_ost02_003 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1
[29913.351764] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[29913.353573] task: ffff9d571e619fa0 ti: ffff9d4ad9e30000 task.ti: ffff9d4ad9e30000
[29913.354729] RIP: 0010:[<ffffffffc090ca4d>]  [<ffffffffc090ca4d>] nid_hash+0x2d/0x50 [obdclass]
[29913.356117] RSP: 0018:ffff9d4ad9e33b40  EFLAGS: 00010206
[29913.356943] RAX: 000000000002b5a5 RBX: ffff9d429e529b00 RCX: 0000000000000001
[29913.358050] RDX: 000000000000007f RSI: 0000000000000010 RDI: 000000000002a0a0
[29913.359161] RBP: ffff9d4ad9e33b68 R08: 0000000000000000 R09: ffffffffc0b56370
[29913.360267] R10: ffff9d572541baa0 R11: ffff9d440bc3dc00 R12: 0000000000000007
[29913.361371] R13: ffff9d4ad9e33b88 R14: ffff9d571f3206c0 R15: ffff9d56b3e90000
[29913.362479] FS:  0000000000000000(0000) GS:ffff9d5725400000(0000) knlGS:0000000000000000
[29913.363725] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[29913.364622] CR2: 0000000000000010 CR3: 000000169dffe000 CR4: 00000000003607e0
[29913.365728] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[29913.366838] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[29913.367945] Call Trace:
[29913.368360]  [<ffffffffc07ad2b8>] ? cfs_hash_bd_from_key+0x38/0xb0 [libcfs]
[29913.369451]  [<ffffffffc07ad355>] cfs_hash_bd_get+0x25/0x70 [libcfs]
[29913.370447]  [<ffffffffc07b0602>] cfs_hash_add+0x52/0x1a0 [libcfs]
[29913.371463]  [<ffffffffc0b20855>] target_handle_connect+0x1fe5/0x29b0 [ptlrpc]
[29913.372590]  [<ffffffffac8d8e4c>] ? dequeue_entity+0x11c/0x5e0
[29913.373575]  [<ffffffffc0bc4e8a>] tgt_request_handle+0x50a/0x1580 [ptlrpc]
[29913.374675]  [<ffffffffc0ba0e01>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[29913.375862]  [<ffffffffac8f944f>] ? __getnstimeofday64+0x3f/0xd0
[29913.376830]  [<ffffffffc0b6bccb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[29913.378044]  [<ffffffffc0b68b55>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[29913.379113]  [<ffffffffac8cf670>] ? wake_up_state+0x20/0x20
[29913.380006]  [<ffffffffc0b6f5fc>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[29913.380989]  [<ffffffffac8c9e50>] ? finish_task_switch+0x50/0x170
[29913.382888]  [<ffffffffc0b6eb00>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[29913.384960]  [<ffffffffac8bb621>] kthread+0xd1/0xe0
[29913.386652]  [<ffffffffac8bb550>] ? insert_kthread_work+0x40/0x40
[29913.388530]  [<ffffffffacf205f7>] ret_from_fork_nospec_begin+0x21/0x21
[29913.390460]  [<ffffffffac8bb550>] ? insert_kthread_work+0x40/0x40
[29913.392305] Code: 44 00 00 48 85 f6 74 37 b9 01 00 00 00 45 31 c0 b8 05 15 00 00 eb 0d 0f 1f 80 00 00 00 00 49 89 c8 48 89 f9 89 c7 c1 e7 05 01 f8 <42> 0f be 3c 06 01 f8 48 8d 79 01 48 83 ff 09 75 e2 21 d0 c3 55 
[29913.398550] RIP  [<ffffffffc090ca4d>] nid_hash+0x2d/0x50 [obdclass]
[29913.400487]  RSP <ffff9d4ad9e33b40>
[29913.401934] CR2: 0000000000000010
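
The register dump appears consistent with the fault hitting the very first byte read of the hash key. Reading the Code: bytes as a djb2-style loop (seed 0x1505 = 5381, shift left by 5, add, then load one signed byte from RSI + R8, stop after 8 bytes, mask with RDX) is a decoding made here, not something stated in the ticket. Under that assumption, the Python sketch below models the loop and reproduces the values seen in RDI and RAX at the time of the fault. All function names are hypothetical.

```python
SEED = 5381            # 0x1505, assumed from "b8 05 15 00 00" (mov $0x1505,%eax)
MASK32 = 0xFFFFFFFF    # the loop works on 32-bit registers (eax/edi)


def djb2_step(h, byte):
    """One iteration: h = h * 33 + (signed char)byte, wrapped to 32 bits."""
    signed = byte - 256 if byte > 127 else byte  # movsbl sign-extends the byte
    return (h * 33 + signed) & MASK32


def djb2_hash(key, mask):
    """Hash every byte of key, then apply the bucket mask (RDX in the dump)."""
    h = SEED
    for b in key:
        h = djb2_step(h, b)
    return h & mask


def state_before_first_load(seed=SEED):
    """Register values just before the first key byte is read.

    Returns (edi after 'shl $5', eax after 'add %edi,%eax').
    """
    shifted = (seed << 5) & MASK32
    total = (seed + shifted) & MASK32
    return shifted, total


if __name__ == "__main__":
    edi, eax = state_before_first_load()
    print(hex(edi), hex(eax))
```

With the seed alone, the shift gives 0x2a0a0 and the sum gives 0x2b5a5, which match RDI and RAX in the dump. Together with R8 = 0 (byte index) and RSI = CR2 = 0x10 (key pointer), this suggests that no key byte had been read yet when the load faulted.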


 Comments   
Comment by Shuichi Ihara [ 06/Nov/18 ]

Per Oleg's suggestion, I reverted commit 7b3f9e5d6c509fabcec3cbd71e541a84987db2ff and have not seen the crash since.

commit 7b3f9e5d6c509fabcec3cbd71e541a84987db2ff
 Author: NeilBrown <neilb@suse.com>
 Date: Tue Aug 28 17:05:42 2018 -0400

LU-8130 ptlrpc: convert conn_hash to rhashtable

Linux has a resizeable hashtable implementation in lib,
 so we should use that instead of having one in libcfs.

This patch converts the ptlrpc conn_hash to use rhashtable.
 In the process we gain lockless lookup.

As connections are never deleted until the hash table is destroyed,
 there is no need to count the reference in the hash table. There
 is also no need to enable automatic_shrinking.

Linux-commit: ac2370ac2bc5215daf78546cd8d925510065bb7f

Change-Id: I576daf314c3ac31a58df02d731292e1e8bb408c6
 Signed-off-by: NeilBrown <neilb@suse.com>
 Signed-off-by: James Simmons <uja.ornl@yahoo.com>
 Reviewed-on: [https://review.whamcloud.com/32036]
 Tested-by: Jenkins
 Tested-by: Maloo <hpdd-maloo@intel.com>
 Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
 Reviewed-by: Yang Sheng <ys@whamcloud.com>
 Reviewed-by: Oleg Drokin <green@whamcloud.com>
 
Comment by James A Simmons [ 06/Nov/18 ]

Really. I wonder whether both have to land at the same time, or whether the nid hash has to be the base patch. Lustre is one of those systems with very complex hash interactions. For example, the nid hash also influences the behavior of the flock hash. We have three different hashes impacting each other.

Comment by Gerrit Updater [ 07/Nov/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33616
Subject: LU-11624 ptlrpc: handle no ptlrpc no connection case
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 32df018c1af8fdbc833c537aafeea081ab3d8d7e
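
One reading that fits both the patch subject and the faulting address 0x10 is that the hash key was the address of a member inside a structure reached through a NULL pointer. The ticket does not confirm this. The Python sketch below only illustrates why a plain "key is not NULL" check would not catch such a pointer. The structure layout (16 bytes of linkage followed by a 64-bit NID) and all names are hypothetical and are not taken from the Lustre source.

```python
import ctypes


class HypotheticalConn(ctypes.Structure):
    """Assumed layout for illustration only, not the real Lustre structure."""
    _fields_ = [
        ("link", ctypes.c_uint64 * 2),  # 16 bytes of list/hash linkage (assumed)
        ("nid", ctypes.c_uint64),       # 64-bit NID used as the hash key
    ]


def member_address(base, field):
    """Address of a member given the address of the enclosing structure."""
    return base + field.offset


def naive_key_check(key_addr):
    """Mimics a 'key != NULL' check: passes for any nonzero address."""
    return key_addr != 0


def checked_key_address(conn_addr):
    """Validate the structure pointer itself before deriving the key address."""
    if conn_addr == 0:
        return None  # no connection: the caller must handle this case
    return member_address(conn_addr, HypotheticalConn.nid)


if __name__ == "__main__":
    key = member_address(0, HypotheticalConn.nid)
    print(hex(key), naive_key_check(key), checked_key_address(0))
```

In this model, a NULL structure pointer yields a key address of 0x10, which passes the nonzero check and faults only when dereferenced. Checking the structure pointer before taking the member address avoids that.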

Comment by James A Simmons [ 07/Nov/18 ]

I will send this patch to lustre-devel for Neil to look over as well.

Comment by Peter Jones [ 10/Nov/18 ]

The patch from LU-8130 was reverted, so it looks like you should move the work to fix it up under that ticket, simmonsja.
