
BUG: unable to handle kernel NULL pointer at nid_hash()

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.12.0
    • None
    • master (ae828cd)
    • 2
    • 9223372036854775807

    Description

      At the first Lustre mount after creating a new filesystem, multiple servers crashed when clients mounted. The trace is below.

      [29913.329459] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [29913.330932] IP: [<ffffffffc090ca4d>] nid_hash+0x2d/0x50 [obdclass]
      [29913.331962] PGD 800000169ea73067 PUD 169de48067 PMD 0 
      [29913.332841] Oops: 0000 [#1] SMP 
      [29913.333408] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) ksocklnd(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) virtio_scsi(OE) sd_mod crc_t10dif crct10dif_generic rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) sunrpc ppdev iTCO_wdt sb_edac iTCO_vendor_support iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc joydev pcspkr sg i2c_i801 parport lpc_ich i6300esb ip_tables ext4 mbcache jbd2 virtio_net virtio_blk mlx5_ib(OE) ib_core(OE) sr_mod cdrom bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm
      [29913.345975]  libahci libata mlx5_core(OE) crct10dif_pclmul crct10dif_common crc32c_intel mlxfw(OE) ptp pps_core serio_raw virtio_pci i2c_core devlink igbvf virtio_ring virtio mlx_compat(OE) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: virtio_scsi]
      [29913.349795] CPU: 8 PID: 25289 Comm: ll_ost02_003 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1
      [29913.351764] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
      [29913.353573] task: ffff9d571e619fa0 ti: ffff9d4ad9e30000 task.ti: ffff9d4ad9e30000
      [29913.354729] RIP: 0010:[<ffffffffc090ca4d>]  [<ffffffffc090ca4d>] nid_hash+0x2d/0x50 [obdclass]
      [29913.356117] RSP: 0018:ffff9d4ad9e33b40  EFLAGS: 00010206
      [29913.356943] RAX: 000000000002b5a5 RBX: ffff9d429e529b00 RCX: 0000000000000001
      [29913.358050] RDX: 000000000000007f RSI: 0000000000000010 RDI: 000000000002a0a0
      [29913.359161] RBP: ffff9d4ad9e33b68 R08: 0000000000000000 R09: ffffffffc0b56370
      [29913.360267] R10: ffff9d572541baa0 R11: ffff9d440bc3dc00 R12: 0000000000000007
      [29913.361371] R13: ffff9d4ad9e33b88 R14: ffff9d571f3206c0 R15: ffff9d56b3e90000
      [29913.362479] FS:  0000000000000000(0000) GS:ffff9d5725400000(0000) knlGS:0000000000000000
      [29913.363725] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [29913.364622] CR2: 0000000000000010 CR3: 000000169dffe000 CR4: 00000000003607e0
      [29913.365728] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [29913.366838] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [29913.367945] Call Trace:
      [29913.368360]  [<ffffffffc07ad2b8>] ? cfs_hash_bd_from_key+0x38/0xb0 [libcfs]
      [29913.369451]  [<ffffffffc07ad355>] cfs_hash_bd_get+0x25/0x70 [libcfs]
      [29913.370447]  [<ffffffffc07b0602>] cfs_hash_add+0x52/0x1a0 [libcfs]
      [29913.371463]  [<ffffffffc0b20855>] target_handle_connect+0x1fe5/0x29b0 [ptlrpc]
      [29913.372590]  [<ffffffffac8d8e4c>] ? dequeue_entity+0x11c/0x5e0
      [29913.373575]  [<ffffffffc0bc4e8a>] tgt_request_handle+0x50a/0x1580 [ptlrpc]
      [29913.374675]  [<ffffffffc0ba0e01>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [29913.375862]  [<ffffffffac8f944f>] ? __getnstimeofday64+0x3f/0xd0
      [29913.376830]  [<ffffffffc0b6bccb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [29913.378044]  [<ffffffffc0b68b55>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [29913.379113]  [<ffffffffac8cf670>] ? wake_up_state+0x20/0x20
      [29913.380006]  [<ffffffffc0b6f5fc>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
      [29913.380989]  [<ffffffffac8c9e50>] ? finish_task_switch+0x50/0x170
      [29913.382888]  [<ffffffffc0b6eb00>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [29913.384960]  [<ffffffffac8bb621>] kthread+0xd1/0xe0
      [29913.386652]  [<ffffffffac8bb550>] ? insert_kthread_work+0x40/0x40
      [29913.388530]  [<ffffffffacf205f7>] ret_from_fork_nospec_begin+0x21/0x21
      [29913.390460]  [<ffffffffac8bb550>] ? insert_kthread_work+0x40/0x40
      [29913.392305] Code: 44 00 00 48 85 f6 74 37 b9 01 00 00 00 45 31 c0 b8 05 15 00 00 eb 0d 0f 1f 80 00 00 00 00 49 89 c8 48 89 f9 89 c7 c1 e7 05 01 f8 <42> 0f be 3c 06 01 f8 48 8d 79 01 48 83 ff 09 75 e2 21 d0 c3 55 
      [29913.398550] RIP  [<ffffffffc090ca4d>] nid_hash+0x2d/0x50 [obdclass]
      [29913.400487]  RSP <ffff9d4ad9e33b40>
      [29913.401934] CR2: 0000000000000010
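
The register dump above is enough to reconstruct roughly what nid_hash() was doing when it faulted. The sketch below is a reading of the "Code:" bytes, not the Lustre source: the constant 0x1505 (5381), the shift-left-by-5 and add, the eight-iteration loop and the final mask suggest a djb2-style hash over an 8-byte key. The function name `djb2_hash` and the 8-byte key size are assumptions drawn from that reading.

```python
# Sketch of the hash loop implied by the "Code:" bytes in the oops.
# Assumption: a djb2-style hash over an 8-byte key, masked at the end.

U32 = 0xFFFFFFFF


def djb2_hash(key: bytes, mask: int) -> int:
    """Compute h = h * 33 + signed_byte for each byte, then h & mask."""
    h = 5381  # 0x1505, the value loaded into EAX before the loop
    for b in key:
        signed = b - 256 if b >= 128 else b  # movsbl sign-extends each byte
        h = (h * 33 + signed) & U32
    return h & mask


# Values copied from the register dump in the trace.
RAX, RDI, RSI, CR2, RDX = 0x2B5A5, 0x2A0A0, 0x10, 0x10, 0x7F

# State just before the first key byte is read:
first_shift = 5381 << 5          # shl $5 on a copy of the hash
first_sum = 5381 + first_shift   # add, which equals 5381 * 33

print(hex(first_shift), hex(first_sum))  # 0x2a0a0 0x2b5a5
```

The two intermediate values match RDI and RAX, and RCX = 1 with R8 = 0 points at the first loop iteration, so the fault appears to be the very first byte read through RSI = 0x10, which is also the address in CR2. A plain NULL check on the key would not catch this, because 0x10 is not zero.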
      

          Activity

            [LU-11624] BUG: unable to handle kernel NULL pointer at nid_hash()
            pjones Peter Jones added a comment -

            The patch from LU-8130 was reverted, so it looks like you should move the work to fix it up under that ticket, simmonsja.

            simmonsja James A Simmons added a comment -

            I will send this patch to lustre-devel for Neil to look over as well.

            gerrit Gerrit Updater added a comment -

            James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33616
            Subject: LU-11624 ptlrpc: handle no ptlrpc no connection case
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 32df018c1af8fdbc833c537aafeea081ab3d8d7e

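
The patch subject suggests that an export had no connection when its NID was hashed. One hypothesis consistent with the fault address 0x10 in the trace is that the hash key was the address of a field inside a NULL connection pointer. The sketch below models that with ctypes. The struct layout (an 8-byte hash link, an 8-byte self NID, then the peer NID) and every name in it are assumptions for illustration, and the guard shows the general idea only, not the content of change 33616.

```python
import ctypes


class PeerId(ctypes.Structure):
    # Hypothetical peer identifier: the NID is the first field.
    _fields_ = [("nid", ctypes.c_uint64), ("pid", ctypes.c_uint32)]


class Connection(ctypes.Structure):
    # Hypothetical connection layout, assumed for this sketch.
    _fields_ = [
        ("hash_link", ctypes.c_uint64),  # stand-in for one 8-byte pointer
        ("self_nid", ctypes.c_uint64),
        ("peer", PeerId),
    ]


def peer_nid_address(conn_address: int) -> int:
    """Address of conn->peer.nid for a connection at conn_address."""
    return conn_address + Connection.peer.offset + PeerId.nid.offset


def nid_key_or_none(conn_address: int):
    """Guarded variant: refuse to build a key from a missing connection."""
    if not conn_address:
        return None
    return peer_nid_address(conn_address)


print(hex(peer_nid_address(0)))  # 0x10
```

Under this assumed layout, a NULL connection yields a key of 0x10 rather than 0, so the check has to be made on the connection itself before the key is derived.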
            simmonsja James A Simmons added a comment - edited

            Really. I wonder whether both have to land at the same time, or whether the nid hash has to be the base patch. Lustre is one of those systems that has very complex hash interactions. For example, the nid hash also influences the behavior of the flock hash. We have three different hashes impacting each other.

            sihara Shuichi Ihara added a comment -

            Per Oleg's suggestion, I reverted commit 7b3f9e5d6c509fabcec3cbd71e541a84987db2ff and have not seen the crash so far.

            commit 7b3f9e5d6c509fabcec3cbd71e541a84987db2ff
            Author: NeilBrown <neilb@suse.com>
            Date: Tue Aug 28 17:05:42 2018 -0400

            LU-8130 ptlrpc: convert conn_hash to rhashtable

            Linux has a resizeable hashtable implementation in lib,
            so we should use that instead of having one in libcfs.

            This patch converts the ptlrpc conn_hash to use rhashtable.
            In the process we gain lockless lookup.

            As connections are never deleted until the hash table is destroyed,
            there is no need to count the reference in the hash table. There
            is also no need to enable automatic_shrinking.

            Linux-commit: ac2370ac2bc5215daf78546cd8d925510065bb7f

            Change-Id: I576daf314c3ac31a58df02d731292e1e8bb408c6
            Signed-off-by: NeilBrown <neilb@suse.com>
            Signed-off-by: James Simmons <uja.ornl@yahoo.com>
            Reviewed-on: https://review.whamcloud.com/32036
            Tested-by: Jenkins
            Tested-by: Maloo <hpdd-maloo@intel.com>
            Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
            Reviewed-by: Yang Sheng <ys@whamcloud.com>
            Reviewed-by: Oleg Drokin <green@whamcloud.com>
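
The commit message describes a table that resizes as it fills and never needs to shrink, because entries are only removed when the whole table is destroyed. The toy model below illustrates that design choice in Python. It is not the kernel rhashtable API and does not model lockless lookup; the class name, the starting bucket count and the 0.75 load threshold are all assumptions for illustration.

```python
class GrowOnlyTable:
    """Toy insert-only hash table that doubles when full and never shrinks."""

    def __init__(self, buckets: int = 4, max_load: float = 0.75):
        self.buckets = [[] for _ in range(buckets)]
        self.count = 0
        self.max_load = max_load

    def _bucket_for(self, key, buckets=None):
        buckets = self.buckets if buckets is None else buckets
        return buckets[hash(key) % len(buckets)]

    def insert(self, key, value) -> bool:
        """Insert key; return False if it is already present."""
        bucket = self._bucket_for(key)
        if any(k == key for k, _ in bucket):
            return False
        bucket.append((key, value))
        self.count += 1
        if self.count > self.max_load * len(self.buckets):
            self._grow()
        return True

    def lookup(self, key):
        """Return the stored value, or None if the key is absent."""
        for k, v in self._bucket_for(key):
            if k == key:
                return v
        return None

    def _grow(self):
        # Double the bucket array and rehash every entry into it.
        new_buckets = [[] for _ in range(2 * len(self.buckets))]
        for bucket in self.buckets:
            for k, v in bucket:
                self._bucket_for(k, new_buckets).append((k, v))
        self.buckets = new_buckets


table = GrowOnlyTable()
for nid in range(100):
    table.insert(nid, f"conn-{nid}")
```

Because nothing is deleted while the table is live, there is no per-entry reference to track and no shrink path to maintain, which mirrors the reasoning given in the commit message.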

            People

              wc-triage WC Triage
              sihara Shuichi Ihara