[LU-10952] recovery-small test_57: BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 in ida_remove() Created: 25/Apr/18  Updated: 02/May/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for bfaccini <bruno.faccini@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/9d1f0ea0-44ed-11e8-8f8a-52540065bddc

test_57 failed with the following error:

Test crashed during recovery-small test_57

Crash infos/msgs from client Console log :

[ 7091.852836] Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash =================================== 22:51:04 (1524264664)
[ 7092.902971] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
[ 7092.912675] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
[ 7093.020798] Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1
[ 7093.049261] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[ 7093.050110] IP: [<ffffffff813261e1>] ida_remove+0x91/0xe0
[ 7093.050655] PGD 8000000064f1b067 PUD 5f9a7067 PMD 0 
[ 7093.051182] Oops: 0000 [#1] SMP 
[ 7093.051534] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel nfsd ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd i2c_piix4 i2c_core pcspkr joydev virtio_balloon nfs_acl lockd parport_pc parport grace auth_rpcgss sunrpc ip_tables ata_generic pata_acpi ext4 mbcache jbd2 ata_piix libata virtio_blk 8139too crct10dif_pclmul crct10dif_common crc32c_intel virtio_pci virtio_ring serio_raw virtio 8139cp floppy mii
[ 7093.060088] CPU: 0 PID: 23436 Comm: lctl Tainted: G           OE  ------------   3.10.0-693.17.1.el7.x86_64 #1
[ 7093.061013] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 7093.061555] task: ffff880062b33f40 ti: ffff8800622f8000 task.ti: ffff8800622f8000
[ 7093.062250] RIP: 0010:[<ffffffff813261e1>]  [<ffffffff813261e1>] ida_remove+0x91/0xe0
[ 7093.063009] RSP: 0018:ffff8800622fbe20  EFLAGS: 00010086
[ 7093.063511] RAX: 0000000000000180 RBX: 0000000000000000 RCX: 00000000000000f4
[ 7093.064172] RDX: 00000000001ec6f4 RSI: 00000000736a1480 RDI: ffff88007c0a0008
[ 7093.064838] RBP: ffff8800622fbe30 R08: ffff88007c0a0000 R09: 00000000000000f4
[ 7093.065506] R10: 00000000636a1001 R11: ffffea00018da840 R12: ffffffff81fd15e0
[ 7093.066164] R13: ffff88007ab27600 R14: ffff8800636a1000 R15: ffffffff81217790
[ 7093.066850] FS:  00007f4e87410740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 7093.067600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7093.068132] CR2: 0000000000000038 CR3: 000000006091e000 CR4: 00000000000606f0
[ 7093.068797] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7093.069461] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 7093.070114] Call Trace:
[ 7093.070355]  [<ffffffff8127880f>] proc_free_inum+0x2f/0x50
[ 7093.070879]  [<ffffffff81278c97>] pde_put+0x27/0x50
[ 7093.071333]  [<ffffffff81278e1c>] proc_readdir_de+0x15c/0x220
[ 7093.071887]  [<ffffffff81217790>] ? fillonedir+0xe0/0xe0
[ 7093.072382]  [<ffffffff81278f02>] proc_readdir+0x22/0x30
[ 7093.072884]  [<ffffffff81217680>] vfs_readdir+0xb0/0xe0
[ 7093.073379]  [<ffffffff81217aa5>] SyS_getdents+0x95/0x120
[ 7093.073894]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 7093.074459]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
[ 7093.075071] Code: 45 0f b3 48 08 4f 8b 44 c8 28 83 e9 08 4d 85 c0 75 e0 4d 85 c0 74 4b 0f b6 ca 49 8d 78 08 41 0f b3 48 08 4c 63 c9 4b 8b 5c c8 28 <0f> a3 43 08 45 19 c0 45 85 c0 74 2b 0f b3 43 08 48 83 2b 01 75 
[ 7093.078106] RIP  [<ffffffff813261e1>] ida_remove+0x91/0xe0
[ 7093.078655]  RSP <ffff8800622fbe20>
[ 7093.078987] CR2: 0000000000000038

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
recovery-small test_57 - Test crashed during recovery-small test_57



 Comments   
Comment by Bruno Faccini (Inactive) [ 25/Apr/18 ]

According to my first crash dump analysis results, the crash occurs in ida_remove() because the current/panic thread id trying to unalloc an already unallocated proc inode number, when done with its associated proc_dir_entry and releasing its reference on it using pde_put(). This should not happen at this point ("osc/lustre-OST0006-osc-ffff880063459800/checksum_type" tunable's pde being browsed/listed by a user-land/lctl process) if the proc_dir_entry is valid, but it is not and is itself an unallocated slab object now...

This could have only happened if the pde (and its associated inode number) has just already been unallocated, which is very likely to have been done by the following umount thread I have found in the process to remove the proc tunables in the same directory ("osc/lustre-OST0006-osc-ffff880063459800") than the one being presently browsed by the panicking lctl thread :

PID: 23433  TASK: ffff88000016eeb0  CPU: 1   COMMAND: "umount"
 #0 [ffff88007fd08e48] crash_nmi_callback at ffffffff8104fd11
 #1 [ffff88007fd08e58] nmi_handle at ffffffff816b0c57
 #2 [ffff88007fd08eb0] do_nmi at ffffffff816b0e8d
 #3 [ffff88007fd08ef0] end_repeat_nmi at ffffffff816b00b9
    [exception RIP: __pv_queued_spin_lock_slowpath+218]
    RIP: ffffffff810fbf0a  RSP: ffff8800608a7978  RFLAGS: 00000206
    RAX: 0000000000000005  RBX: ffffffff81fd15c0  RCX: 0000000000007b84
    RDX: 0000000000000005  RSI: 0000000000000005  RDI: 0000000000000246
    RBP: ffff8800608a79a8   R8: 0000000000000000   R9: 0000000000000040
    R10: 0000000000000000  R11: 0000000000000001  R12: ffff88007fd195c0
    R13: 0000000000000001  R14: 0000000000090000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff8800608a7978] __pv_queued_spin_lock_slowpath at ffffffff810fbf0a
 #5 [ffff8800608a79b0] queued_spin_lock_slowpath at ffffffff816a08df
 #6 [ffff8800608a79c0] _raw_spin_lock at ffffffff816adff0
 #7 [ffff8800608a79d0] remove_proc_subtree at ffffffff812791bc
 #8 [ffff8800608a7a20] proc_remove at ffffffff8127922e
 #9 [ffff8800608a7a30] lprocfs_obd_cleanup at ffffffffc0724b33 [obdclass]
#10 [ffff8800608a7a48] ptlrpc_lprocfs_unregister_obd at ffffffffc094d462 [ptlrpc]
#11 [ffff8800608a7a60] osc_precleanup at ffffffffc0ae25df [osc]
#12 [ffff8800608a7a78] class_cleanup at ffffffffc07388b9 [obdclass]
#13 [ffff8800608a7af0] class_process_config at ffffffffc0739edc [obdclass]
#14 [ffff8800608a7bb0] class_manual_cleanup at ffffffffc073c1b6 [obdclass]
#15 [ffff8800608a7c58] lov_putref at ffffffffc0b557f2 [lov]
#16 [ffff8800608a7cb8] lov_disconnect at ffffffffc0b5d062 [lov]
#17 [ffff8800608a7ce0] obd_disconnect at ffffffffc0bd9ecb [lustre]
#18 [ffff8800608a7d00] ll_put_super at ffffffffc0bdd361 [lustre]
#19 [ffff8800608a7e38] generic_shutdown_super at ffffffff812056d2
#20 [ffff8800608a7e60] kill_anon_super at ffffffff81205aa2
#21 [ffff8800608a7e78] lustre_kill_super at ffffffffc073ea75 [obdclass]
#22 [ffff8800608a7e90] deactivate_locked_super at ffffffff81205e59
#23 [ffff8800608a7eb0] deactivate_super at ffffffff812065c6
#24 [ffff8800608a7ec8] cleanup_mnt at ffffffff8122395f
#25 [ffff8800608a7ee0] __cleanup_mnt at ffffffff812239f2
#26 [ffff8800608a7ef0] task_work_run at ffffffff810aefe5
#27 [ffff8800608a7f30] do_notify_resume at ffffffff8102ab52
#28 [ffff8800608a7f50] int_signal at ffffffff816b8d37
    RIP: 00007f3942ac33e7  RSP: 00007fff3665cdd8  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 000055ab33267040  RCX: ffffffffffffffff
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 000055ab332682e0
    RBP: 000055ab332682e0   R8: 0000000000000008   R9: 000000000000000c
    R10: 00007fff3665c860  R11: 0000000000000246  R12: 00007f394363ad78
    R13: 0000000000000000  R14: 000055ab33267210  R15: 000055ab33267040
    ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b

So, this crash is very likely to be due to a race between proc_readdir_de() routine's loop (to gather the available proc tunables in the same directory for lctl command when having to parse a parameter specification using wildcard/, like "lctl get_param osc..*" in recovery-small/test_57), and remove_proc_subtree() loop (to destroy the same set of proc tunable entries) in a umount thread context running concurrently, as found for this crash.

The Kernel patch described at "https://lkml.org/lkml/2014/4/19/62" ("idr: fix NULL pointer dereference when ida_remove(unallocated_id)") may somewhat help to detect that the proc inode number has already been unallocated, but we will then certainly crash trying to double free/unallocate the proc_dir_entry in pde_put()/free_proc_entry().

Also, this problem is very likely to have been introduced by the recent switch to rb-tree for the proc_dir_entries in-memory organization/retrieval ("fs/proc: use a rb tree for the directory entries", at https://lkml.org/lkml/2014/10/6/235) in RHEL 7.x Kernels.

After more crash-dump analysis and concerned kernel code browsing, I think there are 2 joint causes for this crash, 1) rb_erase() seems to leave the rb-tree nodes it disconnects with out-dated left/right references which can puzzle current node's concurrent users ..., 2) the current sequence of spin_[un]lock()s in both remove_proc_subtree() and proc_readdir_de() loops may lead to the latter to sometime work-on/use a removed/freed pde ... :

int remove_proc_subtree(const char *name, struct proc_dir_entry *parent)
{
        struct proc_dir_entry *root = NULL, *de, *next;
        const char *fn = name;
        unsigned int len;

        spin_lock(&proc_subdir_lock);
        if (__xlate_proc_name(name, &parent, &fn) != 0) {
                spin_unlock(&proc_subdir_lock);
                return -ENOENT;
        }
        len = strlen(fn);

        root = pde_subdir_find(parent, fn, len);
        if (!root) {
                spin_unlock(&proc_subdir_lock);
                return -ENOENT;
        }
        rb_erase(&root->subdir_node, &parent->subdir);

        de = root;
        while (1) {
                next = pde_subdir_first(de); <<< 
                if (next) {
                        rb_erase(&next->subdir_node, &de->subdir); <<< rb_erase()  
                        de = next; <<< to work on next now
                        continue; <<<< down to subdir or process current entry
                }
                spin_unlock(&proc_subdir_lock);

                proc_entry_rundown(de);
                next = de->parent; <<<< go back/up to parent dir for loop
                if (S_ISDIR(de->mode))
                        next->nlink--;
                de->nlink = 0;
                if (de == root)
                        break;
                pde_put(de); <<< expect to release the last ref and where inode num and de be freed

                spin_lock(&proc_subdir_lock);
                de = next;
        }
        pde_put(root);
        return 0;
}
EXPORT_SYMBOL(remove_proc_subtree);

===============================================

/*
 * This returns non-zero if at EOF, so that the /proc
 * root directory can use this and check if it should
 * continue with the <pid> entries..
 *
 * Note that the VFS-layer doesn't care about the return
 * value of the readdir() call, as long as it's non-negative
 * for success..
 */
int proc_readdir_de(struct proc_dir_entry *de, struct file *filp, void *dirent,
                filldir_t filldir)
{
        unsigned int ino;
        int i;
        struct inode *inode = file_inode(filp);
        int ret = 0;

        ino = inode->i_ino;
        i = filp->f_pos;
        switch (i) {
                case 0:
                        if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0)
                                goto out;
                        i++;
                        filp->f_pos++;
                        /* fall through */
                case 1: 
                        if (filldir(dirent, "..", 2, i,
                                    parent_ino(filp->f_path.dentry),
                                    DT_DIR) < 0)
                                goto out;
                        i++;
                        filp->f_pos++;
                        /* fall through */
                default:
                        spin_lock(&proc_subdir_lock);
                        de = pde_subdir_first(de);
                        i -= 2;
                        for (;;) {
                                if (!de) {
                                        ret = 1;
                                        spin_unlock(&proc_subdir_lock);
                                        goto out;
                                }
                                if (!i) 
                                        break;
                                de = pde_subdir_next(de);
                                i--;
                        }

                        do {
                                struct proc_dir_entry *next;

                                /* filldir passes info to user space */
                                pde_get(de);  <<<<<<<<<<<<<<<< takes a ref on de
                                spin_unlock(&proc_subdir_lock); <<< umount can take lock and rb_erase() de
                                if (filldir(dirent, de->name, de->namelen, filp->f_pos, <<<< safe to use de here
                                            de->low_ino, de->mode >> 12) < 0) {
                                        pde_put(de);
                                        goto out;
                                }
                                spin_lock(&proc_subdir_lock);
                                filp->f_pos++;
                                next = pde_subdir_next(de); <<< de removed from rbtree with ood next ?
                                pde_put(de); <<< will release last ref on de, so inode num and de be freed now
                                de = next; use out-of-date/removed next reference ?
                        } while (de);
                        spin_unlock(&proc_subdir_lock);
        }
        ret = 1;
out:
        return ret;
}

and the combination of both causes/problems in these codes may lead to out-of-date/removed proc_dir_entry processing, hence this crash??

I will try to reproduce this scenario on a non-Lustre proc set/single-level-directory of tunables, to see if my assumptions are correct.

Comment by Bruno Faccini (Inactive) [ 26/Apr/18 ]

And bingo!!
Using the following simple module code :

#include <linux/module.h>
#include <linux/version.h>
#include <linux/kernel.h>
#include <linux/proc_fs.h>

#define NUM 100

static const struct file_operations bfi_seq_file_fops = {
};

int __init bfi_proc_init(void)
{
        int i;
        char path[NUM];

        if (!proc_mkdir("bfi", NULL))
                return -ENOMEM;

        for (i = 0; i < NUM; i++) {
                snprintf(path, NUM, "bfi/file%d", i);
                if (!proc_create(path , S_IRUGO, NULL, &bfi_seq_file_fops))
                        goto out;
        }

        return 0;

out:
        printk(KERN_ERR "failed to create %s for i=%d\n", path, i);
        remove_proc_subtree("bfi", NULL);
        return -ENOMEM;
}
module_init(bfi_proc_init);

void __exit bfi_proc_exit(void)
{
        remove_proc_subtree("bfi", NULL);
}
module_exit(bfi_proc_exit);

MODULE_AUTHOR("Bruno Faccini <bruno.faccini@intel.com>");
MODULE_DESCRIPTION("stupid module to create a hundred of dummy tunable files in a same dir");
MODULE_LICENSE("GPL");

only creating hundred of /proc/bfi/file[0-99] dummy tunables during load and removing the whole /proc/bfi directory, using remove_proc_subtree(), upon unload.

And running the following 2 scripts concurrently :

# while true ; do
> insmod bfi.ko
> rmmod bfi
> done &
---------------------------
# while true ; do
> ls -la /proc/bfi/
> done &

I have been able to quickly crash the same way :

[ 1253.343901] bfi: loading out-of-tree module taints kernel.^M
[ 1253.350174] bfi: module verification failed: signature and/or required key missing - tainting kernel^M
[ 1411.844012] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070^M
[ 1411.852845] IP: [<ffffffff81326511>] ida_remove+0x91/0xe0^M
[ 1411.858905] PGD 80000000b393f067 PUD b3a58067 PMD 0 ^M
[ 1411.864516] Oops: 0000 [#1] SMP ^M
[ 1411.868159] Modules linked in: bfi(OE-) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt ipmi_ssif iTCO_vendor_support joydev sg pcspkr ipmi_si ipmi_devintf ipmi_msghandler shpchp wmi ioatdma i2c_i801 mei_me mei lpc_ich nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea isci sysfillrect sysimgblt fb_sys_fops igb ttm libsas ahci crct10dif_pclmul libahci scsi_transport_sas ptp crct10dif_common mlx4_core drm crc32c_intel pps_core libata dca i2c_algo_bit i2c_core devlink [last unloaded: bfi]^M
[ 1411.965188] CPU: 21 PID: 33919 Comm: ls Tainted: G           OE  ------------   3.10.0-693.17.1.el7_lustre_10826.x86_64 #1^M
[ 1411.977563] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
[ 1411.990128] task: ffff8800b4ba9fa0 ti: ffff8800b06ac000 task.ti: ffff8800b06ac000^M
[ 1411.998512] RIP: 0010:[<ffffffff81326511>]  [<ffffffff81326511>] ida_remove+0x91/0xe0^M
[ 1412.007295] RSP: 0018:ffff8800b06afe20  EFLAGS: 00010086^M
[ 1412.013249] RAX: 0000000000000340 RBX: 0000000000000000 RCX: 00000000000000f7^M
[ 1412.021240] RDX: 00000000000f52f7 RSI: 0000000039772180 RDI: ffff88017a280008^M
[ 1412.029232] RBP: ffff8800b06afe30 R08: ffff88017a280000 R09: 00000000000000f7^M
[ 1412.037223] R10: 0000000029773e01 R11: ffffea0020a5dc80 R12: ffffffff81fd15e0^M
[ 1412.045213] R13: ffff880829772d80 R14: ffff880829772000 R15: ffffffff81217790^M
[ 1412.053204] FS:  00007fef74d83800(0000) GS:ffff88042e340000(0000) knlGS:0000000000000000^M
[ 1412.062270] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[ 1412.068709] CR2: 0000000000000070 CR3: 00000000b0576000 CR4: 00000000000607e0^M
[ 1412.076704] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
[ 1412.084702] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400^M
[ 1412.092698] Call Trace:^M
[ 1412.095444]  [<ffffffff8127890f>] proc_free_inum+0x2f/0x50^M
[ 1412.101587]  [<ffffffff81278d97>] pde_put+0x27/0x50^M
[ 1412.107049]  [<ffffffff81278f1c>] proc_readdir_de+0x15c/0x220^M
[ 1412.113489]  [<ffffffff81217790>] ? fillonedir+0xe0/0xe0^M
[ 1412.119444]  [<ffffffff81279002>] proc_readdir+0x22/0x30^M
[ 1412.125403]  [<ffffffff81217680>] vfs_readdir+0xb0/0xe0^M
[ 1412.131261]  [<ffffffff81217aa5>] SyS_getdents+0x95/0x120^M
[ 1412.138446]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b^M
[ 1412.146293]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214^M
[ 1412.154824] Code: 45 0f b3 48 08 4f 8b 44 c8 28 83 e9 08 4d 85 c0 75 e0 4d 85 c0 74 4b 0f b6 ca 49 8d 78 08 41 0f b3 48 08 4c 63 c9 4b 8b 5c c8 28 <0f> a3 43 08 45 19 c0 45 85 c0 74 2b 0f b3 43 08 48 83 2b 01 75 ^M
[ 1412.178892] RIP  [<ffffffff81326511>] ida_remove+0x91/0xe0^M
[ 1412.186228]  RSP <ffff8800b06afe20>^M
[ 1412.191273] CR2: 0000000000000070^M
[    0.000000] Initializing cgroup subsys cpuset^M

that my further analysis has confirmed that the scenario leading to it is exactly the same I have described before and where the only difference is the invalid memory address/offset (0x0000000000000070 vs 0x0000000000000038) which is explained by the computed offset in IDR bitmap, based on current PDE's low_ino value.

Comment by Bruno Faccini (Inactive) [ 26/Apr/18 ]

Based on my previous thoughts about the root cause and possible fix, I have created the following Kernel patch :

This patch fixes a bug in PDEs processing when a race occurs
between a thread presently browsing entries in a /proc
subdir using proc_readdir_de() and an other thread removing
the whole subdir using remove_proc_subtree().

Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>

--- orig/fs/proc/generic.c      2018-01-14 15:02:54.000000000 +0000
+++ bfi/fs/proc/generic.c       2018-04-26 13:53:25.139261920 +0000
@@ -295,8 +295,9 @@ int proc_readdir_de(struct proc_dir_entr
                filldir_t filldir)
 {
        unsigned int ino;
-       int i;
+       int i, j;
        struct inode *inode = file_inode(filp);
+       struct proc_dir_entry *pde[4];
        int ret = 0;
 
        ino = inode->i_ino;
@@ -317,6 +318,7 @@ int proc_readdir_de(struct proc_dir_entr
                        filp->f_pos++;
                        /* fall through */
                default:
+loop4more:
                        spin_lock(&proc_subdir_lock);
                        de = pde_subdir_first(de);
                        i -= 2;
@@ -332,24 +334,39 @@ int proc_readdir_de(struct proc_dir_entr
                                i--;
                        }
 
+                       i = 0;
+                       /* process entries 4 by 4 with lock held */
                        do {
                                struct proc_dir_entry *next;
 
-                               /* filldir passes info to user space */
+                               /* take a ref on de */
                                pde_get(de);
-                               spin_unlock(&proc_subdir_lock);
+                               filp->f_pos++;
+                               /* save de for further filldir usage without
+                                * lock
+                                */
+                               pde[i++] = de;
+                               next = pde_subdir_next(de);
+                               de = next;
+                       } while (de && i < 4);
+                       spin_unlock(&proc_subdir_lock);
+
+                       /* process entries found */
+                       for (j = 0; j < i ; j++) {
+                               de = pde[j];
+                               /* filldir passes info to user space, no lock
+                                * can be held */
                                if (filldir(dirent, de->name, de->namelen, filp->f_pos,
                                            de->low_ino, de->mode >> 12) < 0) {
-                                       pde_put(de);
+                                       while (j < i)
+                                               pde_put(pde[j]);
                                        goto out;
                                }
-                               spin_lock(&proc_subdir_lock);
-                               filp->f_pos++;
-                               next = pde_subdir_next(de);
                                pde_put(de);
-                               de = next;
-                       } while (de);
-                       spin_unlock(&proc_subdir_lock);
+                       }
+                       i = filp->f_pos;
+                       if (j == 4)
+                               goto loop4more;
        }
        ret = 1;
 out:

that I have successfully exposed to my previously described reproducer.

I will now report this problem+fix to Kernel maintainers and also at RedHat.

Comment by Bruno Faccini (Inactive) [ 02/May/18 ]

My related kernel.org bug report can be found at https://bugzilla.kernel.org/show_bug.cgi?id=199593

Generated at Sat Feb 10 02:39:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.