[LU-4451] Kernel Oops with NFS reexport using mainline 3.12 client Created: 07/Jan/14  Updated: 19/Sep/16  Resolved: 19/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Roland Fehrenbacher Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:
  • Client mainline kernel 3.12.5 with patches (mentioned in LU-4400) / lustre-utils 2.4.2
  • Servers lustre 2.4.2/ZFS OSDs
  • ko2iblnd

Attachments: File 0001-QL-Lustre-Add-ported-ll_revalidate_dentry-ll_revalid.patch     File 0002-QL-Lustre-port-patch-LU-3270-statahead-statahead-thr.patch    
Issue Links:
Related
is related to LU-3270 ptlrpcd strnlen crash trying to log a... Resolved
is related to LU-4011 problems with upstream lustre client ... Closed
is related to LU-6215 Sync Lustre external tree with lustre... Resolved
is related to LU-4416 support for 3.12 linux kernel Resolved
Severity: 4
Rank (Obsolete): 12201

 Description   

Jan 7 18:44:47 fltpu-login kernel: [31468.631107] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
Jan 7 18:44:47 fltpu-login kernel: [31468.639140] IP: [<ffffffffa0b9c23d>] ll_sai_unplug+0x1d/0x470 [lustre]
Jan 7 18:44:47 fltpu-login kernel: [31468.645869] PGD 0
Jan 7 18:44:47 fltpu-login kernel: [31468.647983] Oops: 0000 [#1] SMP
Jan 7 18:44:47 fltpu-login kernel: [31468.651330] Modules linked in: lmv(C) fld(C) mgc(C) lustre(C) lov(C) osc(C) mdc(C) fid(C) ptlrpc(C) obdclass(C) lvfs(C) zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) ksocklnd(C) ko2iblnd(C) lnet(C) sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic crc32 crc32c_intel libcfs(C) sg ib_umad ib_ipoib xfs libcrc32c nfsd exportfs ch st hid_generic usbhid ehci_pci ehci_hcd psmouse uhci_hcd lpc_ich mfd_core aacraid usbcore usb_common i7core_edac igb edac_core i2c_algo_bit ib_qib ixgbe mdio rdma_ucm rdma_cm ib_cm acpi_cpufreq iw_cm ib_sa ib_mad ib_addr processor ipv6 ib_uverbs ib_core qla2xxx blcr(O) scsi_transport_fc blcr_imports(O) scsi_tgt dm_mod
Jan 7 18:44:47 fltpu-login kernel: [31468.712442] CPU: 2 PID: 15987 Comm: nfsd Tainted: P C O 3.12.5-ql-generic-15 #1
Jan 7 18:44:47 fltpu-login kernel: [31468.720706] Hardware name: Supermicro X8DTN/X8DTN, BIOS 080015 05/04/2009
Jan 7 18:44:47 fltpu-login kernel: [31468.727726] task: ffff8800bac99cc0 ti: ffff88019a610000 task.ti: ffff88019a610000
Jan 7 18:44:47 fltpu-login kernel: [31468.735354] RIP: 0010:[<ffffffffa0b9c23d>] [<ffffffffa0b9c23d>] ll_sai_unplug+0x1d/0x470 [lustre]
Jan 7 18:44:47 fltpu-login kernel: [31468.744483] RSP: 0018:ffff88019a611718 EFLAGS: 00010282
Jan 7 18:44:47 fltpu-login kernel: [31468.749945] RAX: 000000005a5a5a5a RBX: ffff8800802fa000 RCX: ffff8800802fa060
Jan 7 18:44:47 fltpu-login kernel: [31468.757177] RDX: ffff8800802fa060 RSI: ffff8800816ab580 RDI: ffff8800802fa000
Jan 7 18:44:47 fltpu-login kernel: [31468.764405] RBP: ffff88019a611798 R08: ffff88019a610000 R09: 0000000000000211
Jan 7 18:44:47 fltpu-login kernel: [31468.771737] R10: 0000000000000000 R11: 0140000000000000 R12: ffff8800816ab580
Jan 7 18:44:47 fltpu-login kernel: [31468.779009] R13: 0000000000000000 R14: ffff88019a611848 R15: ffff8800802fa058
Jan 7 18:44:47 fltpu-login kernel: [31468.786254] FS: 0000000000000000(0000) GS:ffff8801b9c80000(0000) knlGS:0000000000000000
Jan 7 18:44:47 fltpu-login kernel: [31468.794586] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 7 18:44:47 fltpu-login kernel: [31468.800429] CR2: 0000000000000028 CR3: 000000000169e000 CR4: 00000000000007e0
Jan 7 18:44:47 fltpu-login kernel: [31468.807677] Stack:
Jan 7 18:44:47 fltpu-login kernel: [31468.809763] 0000000000000000 dead000000200200 00000001002f9aab ffff8801b7134000
Jan 7 18:44:47 fltpu-login kernel: [31468.817515] ffffffff8104c950 ffff8800bac99cc0 ffff88019a611768 ffffffff8104dd0a
Jan 7 18:44:47 fltpu-login kernel: [31468.825162] 0000000000000000 0000000000000282 ffff88019a611798 ffff8800816ab580
Jan 7 18:44:47 fltpu-login kernel: [31468.832806] Call Trace:
Jan 7 18:44:47 fltpu-login kernel: [31468.835341] [<ffffffff8104c950>] ? usleep_range+0x40/0x40
Jan 7 18:44:47 fltpu-login kernel: [31468.840959] [<ffffffff8104dd0a>] ? recalc_sigpending+0x1a/0x50
Jan 7 18:44:47 fltpu-login kernel: [31468.846969] [<ffffffffa0b9ff03>] do_statahead_enter+0x183/0x13a0 [lustre]
Jan 7 18:44:47 fltpu-login kernel: [31468.854006] [<ffffffffa093b8bd>] ? ldlm_res_hop_get_locked+0xd/0x10 [ptlrpc]
Jan 7 18:44:47 fltpu-login kernel: [31468.861226] [<ffffffff810aeaad>] ? from_kgid+0xd/0x10
Jan 7 18:44:47 fltpu-login kernel: [31468.866452] [<ffffffffa0986495>] ? get_my_ctx+0x55/0x120 [ptlrpc]
Jan 7 18:44:47 fltpu-login kernel: [31468.872786] [<ffffffff8106c070>] ? try_to_wake_up+0x290/0x290
Jan 7 18:44:47 fltpu-login kernel: [31468.878734] [<ffffffffa0b8c0d2>] ll_lookup_it+0x552/0x970 [lustre]
Jan 7 18:44:47 fltpu-login kernel: [31468.885130] [<ffffffffa0b8acdb>] ? ll_iget+0x13b/0x280 [lustre]
Jan 7 18:44:47 fltpu-login kernel: [31468.891316] [<ffffffffa0c21444>] ? lmv_get_lustre_md+0xf4/0x290 [lmv]
Jan 7 18:44:47 fltpu-login kernel: [31468.897920] [<ffffffffa0c21787>] ? lmv_free_lustre_md+0x1a7/0x4a0 [lmv]
Jan 7 18:44:47 fltpu-login kernel: [31468.904777] [<ffffffffa0b519f2>] ? ll_dcompare+0x42/0x100 [lustre]
Jan 7 18:44:47 fltpu-login kernel: [31468.911148] [<ffffffffa0b8d21a>] ll_lookup_nd+0x7a/0x170 [lustre]
Jan 7 18:44:47 fltpu-login kernel: [31468.917413] [<ffffffff81140fa8>] lookup_real+0x18/0x50
Jan 7 18:44:47 fltpu-login kernel: [31468.922782] [<ffffffff81141bd3>] __lookup_hash+0x33/0x40
Jan 7 18:44:47 fltpu-login kernel: [31468.928260] [<ffffffff81147246>] lookup_one_len+0xc6/0x120
Jan 7 18:44:47 fltpu-login kernel: [31468.933928] [<ffffffffa03fe880>] encode_entryplus_baggage+0x70/0x160 [nfsd]
Jan 7 18:44:47 fltpu-login kernel: [31468.941157] [<ffffffffa03fecf5>] encode_entry.isra.11+0x2c5/0x300 [nfsd]
Jan 7 18:44:47 fltpu-login kernel: [31468.948028] [<ffffffffa04000e0>] ? nfs3svc_encode_entry+0x10/0x10 [nfsd]
Jan 7 18:44:47 fltpu-login kernel: [31468.954962] [<ffffffffa04000ef>] nfs3svc_encode_entry_plus+0xf/0x20 [nfsd]
Jan 7 18:44:47 fltpu-login kernel: [31468.962102] [<ffffffffa03f5a37>] nfsd_readdir+0x177/0x270 [nfsd]
Jan 7 18:44:47 fltpu-login kernel: [31468.968278] [<ffffffff8148d34c>] ? cache_check+0x5c/0x330
Jan 7 18:44:47 fltpu-login kernel: [31468.973912] [<ffffffffa03f3550>] ? _get_posix_acl+0x60/0x60 [nfsd]
Jan 7 18:44:47 fltpu-login kernel: [31468.980260] [<ffffffffa03fcfac>] nfsd3_proc_readdirplus+0xac/0x1b0 [nfsd]
Jan 7 18:44:47 fltpu-login kernel: [31468.987298] [<ffffffffa03efca1>] nfsd_dispatch+0xa1/0x1b0 [nfsd]
Jan 7 18:44:47 fltpu-login kernel: [31468.993474] [<ffffffff81481d3f>] svc_process_common+0x2ef/0x5a0
Jan 7 18:44:47 fltpu-login kernel: [31468.999555] [<ffffffff8148233f>] svc_process+0xff/0x150
Jan 7 18:44:47 fltpu-login kernel: [31469.004949] [<ffffffffa03ef6ef>] nfsd+0xbf/0x130 [nfsd]
Jan 7 18:44:47 fltpu-login kernel: [31469.010341] [<ffffffffa03ef630>] ? nfsd_destroy+0x80/0x80 [nfsd]
Jan 7 18:44:47 fltpu-login kernel: [31469.016520] [<ffffffff8106013b>] kthread+0xbb/0xc0
Jan 7 18:44:47 fltpu-login kernel: [31469.021498] [<ffffffff81060080>] ? kthread_freezable_should_stop+0x70/0x70
Jan 7 18:44:47 fltpu-login kernel: [31469.028541] [<ffffffff8150a8bc>] ret_from_fork+0x7c/0xb0
Jan 7 18:44:47 fltpu-login kernel: [31469.034023] [<ffffffff81060080>] ? kthread_freezable_should_stop+0x70/0x70
Jan 7 18:44:47 fltpu-login kernel: [31469.041063] Code: c7 00 7f bc a0 e8 84 f8 96 ff 0f 1f 40 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f4 53 48 89 fb 48 83 ec 58 4c 8b 2f 48 85 f6 <49> 8b 45 28 48 8b 80 f8 02 00 00 4c 8b 78 18 0f 84 26 03 00 00
Jan 7 18:44:47 fltpu-login kernel: [31469.061700] RIP [<ffffffffa0b9c23d>] ll_sai_unplug+0x1d/0x470 [lustre]
Jan 7 18:44:47 fltpu-login kernel: [31469.068471] RSP <ffff88019a611718>
Jan 7 18:44:47 fltpu-login kernel: [31469.072091] CR2: 0000000000000028
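
As a reading aid, the register dump and the `Code:` line above can be cross-checked against CR2. The following is a minimal Python sketch, not a general disassembler: it decodes only the one instruction form found at the marked byte (`REX.W 8B /r` with an 8-bit displacement), and all helper names are hypothetical. Under those assumptions it suggests the faulting instruction is a load from `0x28(%r13)` with R13 equal to zero, which agrees with the reported address 0x28.

```python
import re

# Lines copied from the oops above, with the syslog prefix removed.
# Only the tail of the "Code:" line is kept.
OOPS = """\
RAX: 000000005a5a5a5a RBX: ffff8800802fa000 RCX: ffff8800802fa060
RDX: ffff8800802fa060 RSI: ffff8800816ab580 RDI: ffff8800802fa000
RBP: ffff88019a611798 R08: ffff88019a610000 R09: 0000000000000211
R10: 0000000000000000 R11: 0140000000000000 R12: ffff8800816ab580
R13: 0000000000000000 R14: ffff88019a611848 R15: ffff8800802fa058
CR2: 0000000000000028 CR3: 000000000169e000 CR4: 00000000000007e0
Code: 4c 8b 2f 48 85 f6 <49> 8b 45 28 48 8b 80 f8 02 00 00
"""

# x86-64 general purpose registers in hardware encoding order.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"] + [
    f"r{i}" for i in range(8, 16)
]


def parse_registers(text):
    """Return {register name: value} for every 'NAME: 16-hex-digit' pair."""
    regs = {}
    for name, value in re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text):
        key = re.sub(r"^r0(\d)$", r"r\1", name.lower())  # R08 -> r8
        regs[key] = int(value, 16)
    return regs


def faulting_bytes(text):
    """Return the instruction bytes starting at the byte marked with <..>."""
    code = re.search(r"Code: (.*)", text).group(1)
    tail = code[code.index("<"):]
    return bytes(int(tok.strip("<>"), 16) for tok in tail.split())


def decode_load_disp8(insn):
    """Decode 'REX.W 8B /r' with mod=01: mov disp8(base), dest."""
    rex, opcode, modrm, disp = insn[0], insn[1], insn[2], insn[3]
    if rex & 0xF8 != 0x48 or opcode != 0x8B:
        raise ValueError("not a REX.W mov load")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only base + disp8 addressing is handled")
    dest = REG64[reg + (8 if rex & 0x04 else 0)]  # REX.R extends the reg field
    base = REG64[rm + (8 if rex & 0x01 else 0)]   # REX.B extends the rm field
    if disp >= 0x80:                              # sign-extend the displacement
        disp -= 0x100
    return dest, base, disp


regs = parse_registers(OOPS)
dest, base, disp = decode_load_disp8(faulting_bytes(OOPS))
print(f"mov {disp:#x}(%{base}),%{dest}")
print(f"{base} = {regs[base]:#x}, effective address = {regs[base] + disp:#x}")
print(f"matches CR2: {regs[base] + disp == regs['cr2']}")
```

The three bytes just before the marked one, `4c 8b 2f`, appear to encode `mov (%rdi),%r13`, so the zero in R13 would have been read from the first word of the structure passed as the first argument to `ll_sai_unplug`. This is an inference from the dump, not something confirmed in the ticket.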



 Comments   
Comment by Roland Fehrenbacher [ 07/Jan/14 ]

The Oops occurs after a couple of seconds when doing a "rm -r" on a large directory.

Comment by Jodi Levi (Inactive) [ 10/Jan/14 ]

Lai,
Could you please have a look and comment on this one?
Thank you!

Comment by Lai Siyao [ 14/Jan/14 ]

http://review.whamcloud.com/#/c/6392/ should be able to fix this. Could you apply this patch and verify?

Comment by Roland Fehrenbacher [ 14/Jan/14 ]

I can't find this commit in my git master clone. Where did you commit it?

Comment by Lai Siyao [ 15/Jan/14 ]

I did commit it to the master branch. You should be able to use `git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/92/6392/21 && git cherry-pick FETCH_HEAD` to cherry-pick it onto your branch, e.g. 2.4.2.
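
For readers unfamiliar with the ref in that command: Gerrit publishes each patch set under `refs/changes/<last two digits of the change number>/<change number>/<patch set>`. The sketch below only builds the ref and the two git command lines from the values quoted in the comment. The function names are hypothetical, and nothing is executed.

```python
def gerrit_change_ref(change: int, patch_set: int) -> str:
    """Build the Gerrit ref for one patch set of a change."""
    shard = change % 100  # last two digits, zero-padded
    return f"refs/changes/{shard:02d}/{change}/{patch_set}"


def fetch_and_pick_commands(repo_url: str, change: int, patch_set: int):
    """Return the argv lists for the fetch and the cherry-pick, in order."""
    ref = gerrit_change_ref(change, patch_set)
    return [
        ["git", "fetch", repo_url, ref],
        ["git", "cherry-pick", "FETCH_HEAD"],
    ]


# Values taken from the comment above: change 6392, patch set 21.
for argv in fetch_and_pick_commands(
    "http://review.whamcloud.com/fs/lustre-release", 6392, 21
):
    print(" ".join(argv))
```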

Comment by Roland Fehrenbacher [ 21/Jan/14 ]

Sorry for the late reply. Been busy with other stuff ...

The patch in http://review.whamcloud.com/#/c/6392 fails miserably when trying to apply
to the mainline 3.12 client code (see below). Do you have a patch that works?

patch -l -p 1 < ./0001LU-3270statahead-statahead-thread-wait-for-RPCs-to.patch
patching file lustre/include/obd.h
Hunk #1 succeeded at 1098 (offset -40 lines).
patching file lustre/llite/dcache.c
Hunk #1 FAILED at 376.
1 out of 1 hunk FAILED -- saving rejects to file lustre/llite/dcache.c.rej
patching file lustre/llite/file.c
Hunk #1 succeeded at 321 (offset -49 lines).
Hunk #2 FAILED at 541.
Hunk #3 succeeded at 664 (offset -47 lines).
1 out of 3 hunks FAILED -- saving rejects to file lustre/llite/file.c.rej
patching file lustre/llite/llite_internal.h
Hunk #1 FAILED at 141.
Hunk #2 FAILED at 183.
Hunk #3 succeeded at 240 (offset -4 lines).
Hunk #4 FAILED at 511.
Hunk #5 FAILED at 1252.
Hunk #6 FAILED at 1284.
Hunk #7 succeeded at 1248 (offset -64 lines).
Hunk #8 FAILED at 1319.
6 out of 8 hunks FAILED -- saving rejects to file lustre/llite/llite_internal.h.rej
patching file lustre/llite/llite_lib.c
Hunk #1 FAILED at 137.
Hunk #2 FAILED at 719.
Hunk #3 FAILED at 740.
Hunk #4 succeeded at 926 (offset -29 lines).
3 out of 4 hunks FAILED -- saving rejects to file lustre/llite/llite_lib.c.rej
patching file lustre/llite/statahead.c
Hunk #1 FAILED at 64.
Hunk #2 FAILED at 212.
Hunk #3 succeeded at 244 with fuzz 2 (offset -1 lines).
Hunk #4 succeeded at 303 (offset 23 lines).
Hunk #5 FAILED at 299.
Hunk #6 succeeded at 360 with fuzz 1 (offset 23 lines).
Hunk #7 succeeded at 378 with fuzz 1 (offset 23 lines).
Hunk #8 succeeded at 418 (offset 23 lines).
Hunk #9 succeeded at 440 (offset 23 lines).
Hunk #10 FAILED at 441.
Hunk #11 FAILED at 476.
Hunk #12 FAILED at 524.
Hunk #13 FAILED at 599.
Hunk #14 FAILED at 616.
Hunk #15 FAILED at 678.
Hunk #16 succeeded at 792 (offset 9 lines).
Hunk #17 succeeded at 814 (offset 9 lines).
Hunk #18 FAILED at 930.
Hunk #19 FAILED at 1002.
Hunk #20 FAILED at 1036.
Hunk #21 succeeded at 1070 (offset 3 lines).
Hunk #22 succeeded at 1142 (offset 3 lines).
Hunk #23 FAILED at 1195.
Hunk #24 succeeded at 1252 with fuzz 1 (offset 2 lines).
Hunk #25 FAILED at 1487.
14 out of 25 hunks FAILED -- saving rejects to file lustre/llite/statahead.c.rej
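
The per-file summary lines in this log add up to 25 failed hunks out of 42. A log like this can be tallied mechanically. The following is a small Python sketch that counts hunks per file from `patch` output. It is run here on a two-file excerpt of the log above, and the function name is hypothetical.

```python
import re

# Excerpt of the patch output above (two of the six files).
PATCH_OUTPUT = """\
patching file lustre/llite/file.c
Hunk #1 succeeded at 321 (offset -49 lines).
Hunk #2 FAILED at 541.
Hunk #3 succeeded at 664 (offset -47 lines).
patching file lustre/llite/llite_lib.c
Hunk #1 FAILED at 137.
Hunk #2 FAILED at 719.
Hunk #3 FAILED at 740.
Hunk #4 succeeded at 926 (offset -29 lines).
"""


def tally_hunks(output):
    """Return {file: {"failed": n, "total": m}} in the order files appear."""
    results = {}
    current = None
    for line in output.splitlines():
        started = re.match(r"patching file (\S+)", line)
        if started:
            current = started.group(1)
            results[current] = {"failed": 0, "total": 0}
            continue
        hunk = re.match(r"Hunk #\d+ (FAILED|succeeded)", line)
        if hunk and current is not None:
            results[current]["total"] += 1
            if hunk.group(1) == "FAILED":
                results[current]["failed"] += 1
    return results


for path, counts in tally_hunks(PATCH_OUTPUT).items():
    print(f"{path}: {counts['failed']} of {counts['total']} hunks failed")
```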

Comment by Lai Siyao [ 23/Jan/14 ]

Hmm, I'll rebase it to latest master code later.

Comment by Roland Fehrenbacher [ 23/Jan/14 ]

Please note that we'd need a patch against the in-kernel code of vanilla 3.12.8, not against Lustre master.

Comment by Lai Siyao [ 29/Jan/14 ]

I don't have the test environment for 3.12.8, and lustre client support for vanilla kernel is done by Peng Tao. I've just updated the patch to latest master.

Comment by Roland Fehrenbacher [ 18/Feb/14 ]

I've ported the patch to the in-kernel client. I also needed to add ll_revalidate_dentry and change ll_revalidate_nd as in master (see patch 1). The problem is gone.
Can someone review the patches and make sure they are included upstream?

Comment by James A Simmons [ 10/Jun/14 ]

Patch 1 was merged upstream as commit f236f69b48727d6459c02bfabcadb9bfaacbe504. The second patch has not been merged. Roland, is this problem still present in the latest kernel tree?

Comment by Roland Fehrenbacher [ 12/Jun/14 ]

Yes. The second patch is absolutely needed. Our system has been running fine for several months now with those two patches.

Comment by James A Simmons [ 16/Jun/14 ]

Can someone link this to LU-3270?
