[LU-17071] o2iblnd: Oops caused by IBLND_REJECT_EARLY code

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0

    Description

      Patch "LU-16393 o2iblnd: add IBLND_REJECT_EARLY reject reason" introduced code that dereferences pointers without sufficient checks. This can cause a crash and needs to be fixed.

       

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52202/
            Subject: LU-17071 o2iblnd: IBLND_REJECT_EARLY condition causes Oops
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ecea24d843f7b2f5fdd5a72fde06eac9237542fc

            ssmirnov Serguei Smirnov added a comment -

            This wasn't reproduced on master until recently. It is indeed not an LBUG, and I can't remember now why I originally used LBUG in the title. Here's the recent trace from the description of LU-17247:

            [14161.702631] libcfs: HW NUMA nodes: 1, HW CPU cores: 24, npartitions: 4
            [14161.705274] alg: No test for adler32 (adler32-zlib)
            [14162.456545] Key type ._llcrypt registered
            [14162.457357] Key type .llcrypt registered
            [14162.484133] Lustre: Lustre: Build Version: 2.15.58_109_g40074d3
            [14162.540341] LNet: Using FastReg for registration
            [14162.750736] LNet: Added LNI 10.0.11.209@o2ib12 [32/1024/0/180]
            [14162.950680] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
            [14162.951989] PGD 0 
            [14162.952520] Oops: 0000 [#1] SMP NOPTI
            [14162.953250] CPU: 22 PID: 201160 Comm: kworker/22:4 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-425.13.1.el8_lustre.ddn17.x86_64 #1
            [14162.955184] Hardware name: DDN SFA400NVX2E, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
            [14162.956667] Workqueue: ib_cm cm_work_handler [ib_cm]
            [14162.957565] RIP: 0010:kiblnd_passive_connect+0x1395/0x1620 [ko2iblnd]
            [14162.958644] Code: c7 05 63 81 01 00 00 01 00 00 e8 26 03 f4 ff 48 89 df ba 40 00 00 00 48 89 c6 e8 06 10 f4 ff 45 8b b4 24 24 01 00 00 49 89 c7 <48> 8b 04 25 40 00 00 00 48 8d 58 38 e8 fa 02 f4 ff 48 89 df ba 40
            [14162.961535] RSP: 0018:ff7a599b4dca79a0 EFLAGS: 00010246
            [14162.962473] RAX: ffffffffc1038f00 RBX: 0005001614010bd1 RCX: 0000000000000000
            [14162.963534] LNet: Added LNI 20.1.11.209@o2ib22 [32/1024/0/180]
            [14162.963649] RDX: ffffffffc1038f12 RSI: 0000000000000000 RDI: 0000000000000000
            [14162.965863] RBP: ff36491ca4dbcc00 R08: 0000000000000001 R09: 0000000000000000
            [14162.967015] R10: ffffffffc1038f40 R11: ffffffffc1038f12 R12: ff364925b2ba2a00
            [14162.968167] R13: ff36492daa67a5b0 R14: 0000000000000000 R15: ffffffffc1038f00
            [14162.969313] FS:  0000000000000000(0000) GS:ff36493e31b80000(0000) knlGS:0000000000000000
            [14162.970594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [14162.971560] CR2: 0000000000000040 CR3: 0000000f8bc10003 CR4: 0000000000771ee0
            [14162.972711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [14162.973846] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            [14162.974976] PKRU: 55555554
            [14162.975553] Call Trace:
            [14162.976085]  ? xas_store+0x56/0x5a0
            [14162.976755]  kiblnd_cm_callback+0x3d7/0x1e90 [ko2iblnd]
            [14162.977639]  ? __xa_alloc_cyclic+0x49/0xe0
            [14162.978375]  cma_cm_event_handler+0x25/0xd0 [rdma_cm]
            [14162.979227]  cma_ib_req_handler+0x7d1/0x1260 [rdma_cm]
            [14162.980090]  ? update_group_capacity+0x25/0x220
            [14162.980872]  cm_process_work+0x22/0xf0 [ib_cm]
            [14162.981638]  cm_req_handler+0x7f1/0xf40 [ib_cm]
            [14162.982416]  cm_work_handler+0x79c/0xf30 [ib_cm]
            [14162.983198]  ? __switch_to+0x10c/0x450
            [14162.983872]  ? finish_task_switch+0xaf/0x2e0
            [14162.984607]  process_one_work+0x1a7/0x360
            [14162.985300]  ? create_worker+0x1a0/0x1a0
            [14162.985979]  worker_thread+0x30/0x390
            [14162.986623]  ? create_worker+0x1a0/0x1a0
            [14162.987292]  kthread+0x10b/0x130
            [14162.987874]  ? set_kthread_struct+0x50/0x50
            [14162.988577]  ret_from_fork+0x1f/0x40
            [14162.989205] Modules linked in: ko2iblnd(OE) ptlrpc(OE+) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) sunrpc intel_rapl_msr intel_rapl_common nfit libnvdimm kvm_intel kvm irqbypass iTCO_wdt ppdev iTCO_vendor_support crct10dif_pclmul crc32_pclmul bochs drm_vram_helper drm_ttm_helper ghash_clmulni_intel ttm rapl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr i2c_i801 drm joydev lpc_ich i6300esb parport_pc parport ext4 mbcache jbd2 sr_mod sd_mod cdrom t10_pi sg mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) pci_hyperv_intf ahci tls libahci psample mlxdevm(OE) virtio_net libata bnxt_en crc32c_intel net_failover serio_raw virtio_blk mlx_compat(OE) virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]

            I'll rename this ticket and corresponding commit to avoid confusion.

            hornc Chris Horn added a comment -

            ssmirnov What LBUG is tripped by this issue? It would be helpful for tickets like this to include the exact stack trace/panic/oops/etc so that it can be found when these issues are seen in the field.

            gerrit Gerrit Updater added a comment -

            "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52202
            Subject: LU-17071 o2iblnd: IBLND_REJECT_EARLY condition causes LBUG
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a0fa2440765ee81b173de810b85a5bdb325bd274

            People

              ssmirnov Serguei Smirnov
              ssmirnov Serguei Smirnov