[LU-4451] Kernel Oops with NFS reexport using mainline 3.12 client

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Lustre 2.4.2
    • None
    • 4
    • 12201

    Description

      Jan 7 18:44:47 fltpu-login kernel: [31468.631107] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
      Jan 7 18:44:47 fltpu-login kernel: [31468.639140] IP: [<ffffffffa0b9c23d>] ll_sai_unplug+0x1d/0x470 [lustre]
      Jan 7 18:44:47 fltpu-login kernel: [31468.645869] PGD 0
      Jan 7 18:44:47 fltpu-login kernel: [31468.647983] Oops: 0000 [#1] SMP
      Jan 7 18:44:47 fltpu-login kernel: [31468.651330] Modules linked in: lmv(C) fld(C) mgc(C) lustre(C) lov(C) osc(C) mdc(C) fid(C) ptlrpc(C) obdclass(C) lvfs(C) zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) ksocklnd(C) ko2iblnd(C) lnet(C) sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic crc32 crc32c_intel libcfs(C) sg ib_umad ib_ipoib xfs libcrc32c nfsd exportfs ch st hid_generic usbhid ehci_pci ehci_hcd psmouse uhci_hcd lpc_ich mfd_core aacraid usbcore usb_common i7core_edac igb edac_core i2c_algo_bit ib_qib ixgbe mdio rdma_ucm rdma_cm ib_cm acpi_cpufreq iw_cm ib_sa ib_mad ib_addr processor ipv6 ib_uverbs ib_core qla2xxx blcr(O) scsi_transport_fc blcr_imports(O) scsi_tgt dm_mod
      Jan 7 18:44:47 fltpu-login kernel: [31468.712442] CPU: 2 PID: 15987 Comm: nfsd Tainted: P C O 3.12.5-ql-generic-15 #1
      Jan 7 18:44:47 fltpu-login kernel: [31468.720706] Hardware name: Supermicro X8DTN/X8DTN, BIOS 080015 05/04/2009
      Jan 7 18:44:47 fltpu-login kernel: [31468.727726] task: ffff8800bac99cc0 ti: ffff88019a610000 task.ti: ffff88019a610000
      Jan 7 18:44:47 fltpu-login kernel: [31468.735354] RIP: 0010:[<ffffffffa0b9c23d>] [<ffffffffa0b9c23d>] ll_sai_unplug+0x1d/0x470 [lustre]
      Jan 7 18:44:47 fltpu-login kernel: [31468.744483] RSP: 0018:ffff88019a611718 EFLAGS: 00010282
      Jan 7 18:44:47 fltpu-login kernel: [31468.749945] RAX: 000000005a5a5a5a RBX: ffff8800802fa000 RCX: ffff8800802fa060
      Jan 7 18:44:47 fltpu-login kernel: [31468.757177] RDX: ffff8800802fa060 RSI: ffff8800816ab580 RDI: ffff8800802fa000
      Jan 7 18:44:47 fltpu-login kernel: [31468.764405] RBP: ffff88019a611798 R08: ffff88019a610000 R09: 0000000000000211
      Jan 7 18:44:47 fltpu-login kernel: [31468.771737] R10: 0000000000000000 R11: 0140000000000000 R12: ffff8800816ab580
      Jan 7 18:44:47 fltpu-login kernel: [31468.779009] R13: 0000000000000000 R14: ffff88019a611848 R15: ffff8800802fa058
      Jan 7 18:44:47 fltpu-login kernel: [31468.786254] FS: 0000000000000000(0000) GS:ffff8801b9c80000(0000) knlGS:0000000000000000
      Jan 7 18:44:47 fltpu-login kernel: [31468.794586] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      Jan 7 18:44:47 fltpu-login kernel: [31468.800429] CR2: 0000000000000028 CR3: 000000000169e000 CR4: 00000000000007e0
      Jan 7 18:44:47 fltpu-login kernel: [31468.807677] Stack:
      Jan 7 18:44:47 fltpu-login kernel: [31468.809763] 0000000000000000 dead000000200200 00000001002f9aab ffff8801b7134000
      Jan 7 18:44:47 fltpu-login kernel: [31468.817515] ffffffff8104c950 ffff8800bac99cc0 ffff88019a611768 ffffffff8104dd0a
      Jan 7 18:44:47 fltpu-login kernel: [31468.825162] 0000000000000000 0000000000000282 ffff88019a611798 ffff8800816ab580
      Jan 7 18:44:47 fltpu-login kernel: [31468.832806] Call Trace:
      Jan 7 18:44:47 fltpu-login kernel: [31468.835341] [<ffffffff8104c950>] ? usleep_range+0x40/0x40
      Jan 7 18:44:47 fltpu-login kernel: [31468.840959] [<ffffffff8104dd0a>] ? recalc_sigpending+0x1a/0x50
      Jan 7 18:44:47 fltpu-login kernel: [31468.846969] [<ffffffffa0b9ff03>] do_statahead_enter+0x183/0x13a0 [lustre]
      Jan 7 18:44:47 fltpu-login kernel: [31468.854006] [<ffffffffa093b8bd>] ? ldlm_res_hop_get_locked+0xd/0x10 [ptlrpc]
      Jan 7 18:44:47 fltpu-login kernel: [31468.861226] [<ffffffff810aeaad>] ? from_kgid+0xd/0x10
      Jan 7 18:44:47 fltpu-login kernel: [31468.866452] [<ffffffffa0986495>] ? get_my_ctx+0x55/0x120 [ptlrpc]
      Jan 7 18:44:47 fltpu-login kernel: [31468.872786] [<ffffffff8106c070>] ? try_to_wake_up+0x290/0x290
      Jan 7 18:44:47 fltpu-login kernel: [31468.878734] [<ffffffffa0b8c0d2>] ll_lookup_it+0x552/0x970 [lustre]
      Jan 7 18:44:47 fltpu-login kernel: [31468.885130] [<ffffffffa0b8acdb>] ? ll_iget+0x13b/0x280 [lustre]
      Jan 7 18:44:47 fltpu-login kernel: [31468.891316] [<ffffffffa0c21444>] ? lmv_get_lustre_md+0xf4/0x290 [lmv]
      Jan 7 18:44:47 fltpu-login kernel: [31468.897920] [<ffffffffa0c21787>] ? lmv_free_lustre_md+0x1a7/0x4a0 [lmv]
      Jan 7 18:44:47 fltpu-login kernel: [31468.904777] [<ffffffffa0b519f2>] ? ll_dcompare+0x42/0x100 [lustre]
      Jan 7 18:44:47 fltpu-login kernel: [31468.911148] [<ffffffffa0b8d21a>] ll_lookup_nd+0x7a/0x170 [lustre]
      Jan 7 18:44:47 fltpu-login kernel: [31468.917413] [<ffffffff81140fa8>] lookup_real+0x18/0x50
      Jan 7 18:44:47 fltpu-login kernel: [31468.922782] [<ffffffff81141bd3>] __lookup_hash+0x33/0x40
      Jan 7 18:44:47 fltpu-login kernel: [31468.928260] [<ffffffff81147246>] lookup_one_len+0xc6/0x120
      Jan 7 18:44:47 fltpu-login kernel: [31468.933928] [<ffffffffa03fe880>] encode_entryplus_baggage+0x70/0x160 [nfsd]
      Jan 7 18:44:47 fltpu-login kernel: [31468.941157] [<ffffffffa03fecf5>] encode_entry.isra.11+0x2c5/0x300 [nfsd]
      Jan 7 18:44:47 fltpu-login kernel: [31468.948028] [<ffffffffa04000e0>] ? nfs3svc_encode_entry+0x10/0x10 [nfsd]
      Jan 7 18:44:47 fltpu-login kernel: [31468.954962] [<ffffffffa04000ef>] nfs3svc_encode_entry_plus+0xf/0x20 [nfsd]
      Jan 7 18:44:47 fltpu-login kernel: [31468.962102] [<ffffffffa03f5a37>] nfsd_readdir+0x177/0x270 [nfsd]
      Jan 7 18:44:47 fltpu-login kernel: [31468.968278] [<ffffffff8148d34c>] ? cache_check+0x5c/0x330
      Jan 7 18:44:47 fltpu-login kernel: [31468.973912] [<ffffffffa03f3550>] ? _get_posix_acl+0x60/0x60 [nfsd]
      Jan 7 18:44:47 fltpu-login kernel: [31468.980260] [<ffffffffa03fcfac>] nfsd3_proc_readdirplus+0xac/0x1b0 [nfsd]
      Jan 7 18:44:47 fltpu-login kernel: [31468.987298] [<ffffffffa03efca1>] nfsd_dispatch+0xa1/0x1b0 [nfsd]
      Jan 7 18:44:47 fltpu-login kernel: [31468.993474] [<ffffffff81481d3f>] svc_process_common+0x2ef/0x5a0
      Jan 7 18:44:47 fltpu-login kernel: [31468.999555] [<ffffffff8148233f>] svc_process+0xff/0x150
      Jan 7 18:44:47 fltpu-login kernel: [31469.004949] [<ffffffffa03ef6ef>] nfsd+0xbf/0x130 [nfsd]
      Jan 7 18:44:47 fltpu-login kernel: [31469.010341] [<ffffffffa03ef630>] ? nfsd_destroy+0x80/0x80 [nfsd]
      Jan 7 18:44:47 fltpu-login kernel: [31469.016520] [<ffffffff8106013b>] kthread+0xbb/0xc0
      Jan 7 18:44:47 fltpu-login kernel: [31469.021498] [<ffffffff81060080>] ? kthread_freezable_should_stop+0x70/0x70
      Jan 7 18:44:47 fltpu-login kernel: [31469.028541] [<ffffffff8150a8bc>] ret_from_fork+0x7c/0xb0
      Jan 7 18:44:47 fltpu-login kernel: [31469.034023] [<ffffffff81060080>] ? kthread_freezable_should_stop+0x70/0x70
      Jan 7 18:44:47 fltpu-login kernel: [31469.041063] Code: c7 00 7f bc a0 e8 84 f8 96 ff 0f 1f 40 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f4 53 48 89 fb 48 83 ec 58 4c 8b 2f 48 85 f6 <49> 8b 45 28 48 8b 80 f8 02 00 00 4c 8b 78 18 0f 84 26 03 00 00
      Jan 7 18:44:47 fltpu-login kernel: [31469.061700] RIP [<ffffffffa0b9c23d>] ll_sai_unplug+0x1d/0x470 [lustre]
      Jan 7 18:44:47 fltpu-login kernel: [31469.068471] RSP <ffff88019a611718>
      Jan 7 18:44:47 fltpu-login kernel: [31469.072091] CR2: 0000000000000028
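
      For reference, one way to pin ll_sai_unplug+0x1d down to a source line is to resolve the offset against the lustre.ko that produced the oops (a sketch; it assumes the module was built with debug info, and the path is illustrative — use the module actually loaded on the client):

        # resolve the faulting instruction to a file:line inside the lustre module
        gdb -batch -ex 'list *(ll_sai_unplug+0x1d)' ./lustre.ko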

    Attachments

    Issue Links

    Activity

            simmonsja James A Simmons added a comment -

            Can someone link this to LU-3270?

            rfehren Roland Fehrenbacher added a comment -

            Yes. The second patch is absolutely needed. Our system has been running fine for several months now with those two patches.

            simmonsja James A Simmons added a comment -

            Patch 1 was merged upstream as commit f236f69b48727d6459c02bfabcadb9bfaacbe504. The second patch has not been merged. Roland, is this problem still present in the latest kernel tree?
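
            For anyone checking whether a given kernel tree already carries that fix, a quick sketch (run from a clone of the mainline kernel; the commit hash is the one quoted above):

            # is the fix an ancestor of the currently checked-out branch?
            git merge-base --is-ancestor f236f69b48727d6459c02bfabcadb9bfaacbe504 HEAD && echo "fix present"
            # which release tags already contain it?
            git tag --contains f236f69b48727d6459c02bfabcadb9bfaacbe504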

            rfehren Roland Fehrenbacher added a comment -

            I've ported the patch to the in-kernel client. I also needed to add ll_revalidate_dentry and change ll_revalidate_nd as in master (see patch 1). The problem is gone.
            Can someone review the patches and make sure they are included upstream?
            laisiyao Lai Siyao added a comment -

            I don't have a test environment for 3.12.8, and Lustre client support for the vanilla kernel is handled by Peng Tao. I've just updated the patch to the latest master.

            rfehren Roland Fehrenbacher added a comment -

            Please note that we'd need a patch against the in-kernel code of vanilla 3.12.8, not against Lustre master.
            laisiyao Lai Siyao added a comment -

            Hmm, I'll rebase it to the latest master code later.

            rfehren Roland Fehrenbacher added a comment -

            Sorry for the late reply. I've been busy with other things ...

            The patch from http://review.whamcloud.com/#/c/6392 fails miserably when I try to apply it
            to the mainline 3.12 client code (see below). Do you have a patch that works?

            patch -l -p1 < ./0001-LU-3270-statahead-statahead-thread-wait-for-RPCs-to.patch
            patching file lustre/include/obd.h
            Hunk #1 succeeded at 1098 (offset -40 lines).
            patching file lustre/llite/dcache.c
            Hunk #1 FAILED at 376.
            1 out of 1 hunk FAILED -- saving rejects to file lustre/llite/dcache.c.rej
            patching file lustre/llite/file.c
            Hunk #1 succeeded at 321 (offset -49 lines).
            Hunk #2 FAILED at 541.
            Hunk #3 succeeded at 664 (offset -47 lines).
            1 out of 3 hunks FAILED -- saving rejects to file lustre/llite/file.c.rej
            patching file lustre/llite/llite_internal.h
            Hunk #1 FAILED at 141.
            Hunk #2 FAILED at 183.
            Hunk #3 succeeded at 240 (offset -4 lines).
            Hunk #4 FAILED at 511.
            Hunk #5 FAILED at 1252.
            Hunk #6 FAILED at 1284.
            Hunk #7 succeeded at 1248 (offset -64 lines).
            Hunk #8 FAILED at 1319.
            6 out of 8 hunks FAILED -- saving rejects to file lustre/llite/llite_internal.h.rej
            patching file lustre/llite/llite_lib.c
            Hunk #1 FAILED at 137.
            Hunk #2 FAILED at 719.
            Hunk #3 FAILED at 740.
            Hunk #4 succeeded at 926 (offset -29 lines).
            3 out of 4 hunks FAILED -- saving rejects to file lustre/llite/llite_lib.c.rej
            patching file lustre/llite/statahead.c
            Hunk #1 FAILED at 64.
            Hunk #2 FAILED at 212.
            Hunk #3 succeeded at 244 with fuzz 2 (offset -1 lines).
            Hunk #4 succeeded at 303 (offset 23 lines).
            Hunk #5 FAILED at 299.
            Hunk #6 succeeded at 360 with fuzz 1 (offset 23 lines).
            Hunk #7 succeeded at 378 with fuzz 1 (offset 23 lines).
            Hunk #8 succeeded at 418 (offset 23 lines).
            Hunk #9 succeeded at 440 (offset 23 lines).
            Hunk #10 FAILED at 441.
            Hunk #11 FAILED at 476.
            Hunk #12 FAILED at 524.
            Hunk #13 FAILED at 599.
            Hunk #14 FAILED at 616.
            Hunk #15 FAILED at 678.
            Hunk #16 succeeded at 792 (offset 9 lines).
            Hunk #17 succeeded at 814 (offset 9 lines).
            Hunk #18 FAILED at 930.
            Hunk #19 FAILED at 1002.
            Hunk #20 FAILED at 1036.
            Hunk #21 succeeded at 1070 (offset 3 lines).
            Hunk #22 succeeded at 1142 (offset 3 lines).
            Hunk #23 FAILED at 1195.
            Hunk #24 succeeded at 1252 with fuzz 1 (offset 2 lines).
            Hunk #25 FAILED at 1487.
            14 out of 25 hunks FAILED -- saving rejects to file lustre/llite/statahead.c.rej

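            As a side note, the in-kernel client in 3.12 lives under drivers/staging/lustre/, so a lustre-release patch has to be applied from inside that directory; the rejects above, though, look like genuine divergence between lustre master and the staging copy rather than a path problem. A dry run against the staging tree would look roughly like this (paths are illustrative):

            # check applicability without touching the tree
            cd linux-3.12/drivers/staging/lustre
            patch -l -p1 --dry-run < /path/to/0001-LU-3270-statahead-statahead-thread-wait-for-RPCs-to.patch
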
            laisiyao Lai Siyao added a comment -

            I committed it to the master branch, and you should be able to use `git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/92/6392/21 && git cherry-pick FETCH_HEAD` to cherry-pick it to your branch, e.g. 2.4.2.

            rfehren Roland Fehrenbacher added a comment -

            I can't find this commit in my git master clone. Where did you commit it?
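
            Note that an unmerged Gerrit change such as 6392 lives on a refs/changes/ ref rather than on the master branch itself, so it will not show up in a plain clone of master until it is actually merged. It can still be fetched and inspected directly (a sketch, using the ref from the earlier comment):

            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/92/6392/21
            git log -1 FETCH_HEAD          # inspect the cherry-pick candidate
            git cherry-pick FETCH_HEAD     # apply it onto the current branch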

    People

      Assignee: laisiyao Lai Siyao
      Reporter: rfehren Roland Fehrenbacher
      Votes: 0
      Watchers: 7

    Dates

      Created:
      Updated:
      Resolved: