[LU-5507] sanity-quota test_18: Oops: IP: lustre_msg_get_opc+0xe/0x110 [ptlrpc] Created: 20/Aug/14  Updated: 09/Jun/15  Resolved: 05/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/80/
Distro/Arch: SLES11SP3/x86_64 (client), RHEL6.5/x86_64 (server)


Issue Links:
Related
is related to LU-5169 Lustre client panic during MDS failover Resolved
Severity: 3
Rank (Obsolete): 15364

 Description   

While running sanity-quota test 18, one of the client nodes hit the following error:

[60756.462327] BUG: unable to handle kernel NULL pointer dereference at 0000000000000007
[60756.465418] IP: [<ffffffffa088a9d1>] lustre_msg_get_opc+0x1/0x100 [ptlrpc]
[60756.466234] PGD 0
[60756.466234] Oops: 0000 [#1] SMP
[60756.466234] CPU 0
[60756.466234] Modules linked in: lustre(EN) obdecho(EN) mgc(EN) lov(EN) osc(EN) mdc(EN) lmv(EN) fid(EN) fld(EN) ptlrpc(EN) obdclass(EN) lvfs(EN) ksocklnd(EN) lnet(EN) libcfs(EN) ext2 sha512_generic sha1_generic md5 crc32c nfs lockd fscache auth_rpcgss nfs_acl sunrpc rdma_ucm rdma_cm iw_cm ib_addr ib_srp scsi_transport_srp scsi_tgt ib_ipoib ib_cm ib_uverbs ib_umad iw_cxgb3 cxgb3 mdio mlx4_en mlx4_ib ib_sa mlx4_core ib_mthca ib_mad ib_core mperf loop dm_mod floppy 8139too ipv6 ipv6_lib rtc_cmos pcspkr virtio_balloon i2c_piix4 8139cp mii button ttm drm_kms_helper drm i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd usbcore usb_common intel_agp intel_gtt scsi_dh_emc scsi_dh_rdac scsi_dh_alua scsi_dh_hp_sw scsi_dh virtio_pci ata_generic virtio_blk virtio virtio_ring ata_piix edd ext3 mbcache jbd fan processor ahci libahci libata scsi_mod thermal thermal_sys hwmon [last unloaded: libcfs]
[60756.466234] Supported: No, Unsupported modules are loaded
[60756.466234]
[60756.466234] Pid: 12735, comm: ptlrpcd_rcv Tainted: G           EN  3.0.101-0.35-default #1 Red Hat KVM
[60756.466234] RIP: 0010:[<ffffffffa088a9d1>]  [<ffffffffa088a9d1>] lustre_msg_get_opc+0x1/0x100 [ptlrpc]
[60756.466234] RSP: 0018:ffff880078f3dcb0  EFLAGS: 00010286
[60756.466234] RAX: ffff8800201efa08 RBX: 0000000000000000 RCX: 0000000000000002
[60756.466234] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffffffffffffffff
[60756.466234] RBP: ffff88006a655500 R08: ffff8800201efa08 R09: 00000000000000d8
[60756.466234] R10: 000000000000000a R11: 0000000000000000 R12: ffff88006295c800
[60756.466234] R13: ffff8800201efa08 R14: ffff880079dcbee0 R15: ffff88006e9838f0
[60756.466234] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[60756.494980] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[60756.494980] CR2: 0000000000000007 CR3: 000000007ae8a000 CR4: 00000000000006f0
[60756.494980] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[60756.494980] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[60756.494980] Process ptlrpcd_rcv (pid: 12735, threadinfo ffff880078f3c000, task ffff880017dde540)
[60756.494980] Stack:
[60756.494980]  0000000000000000 ffffffffa099d04b ffff880079dcbc00 000000c10002ee7e
[60756.494980]  ffff880079dcbc00 000000c10002ee7e ffff880079dcbc00 ffff8800290c0a88
[60756.494980]  ffff8800290c0800 ffffffffa087eb5a 00000000ebc0de01 ffff880079dcbc00
[60756.494980] Call Trace:
[60756.494980]  [<ffffffffa099d04b>] mdc_replay_open+0xab/0x430 [mdc]
[60756.494980]  [<ffffffffa087eb5a>] ptlrpc_replay_interpret+0x14a/0x740 [ptlrpc]
[60756.494980]  [<ffffffffa0880452>] ptlrpc_check_set+0x532/0x1b30 [ptlrpc]
[60756.494980]  [<ffffffffa08abdcb>] ptlrpcd_check+0x52b/0x550 [ptlrpc]
[60756.494980]  [<ffffffffa08ac32b>] ptlrpcd+0x24b/0x3b0 [ptlrpc]
[60756.494980]  [<ffffffff810829a6>] kthread+0x96/0xa0
[60756.494980]  [<ffffffff8146b164>] kernel_thread_helper+0x4/0x10
[60756.494980] Code: 89 44 24 48 48 83 c4 58 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 45 31 ed e9 fb fe ff ff 66 66 66 2e 0f 1f 84 00 00 00 00 00 53
[60756.494980]  7f 08 d3 0b d0 0b 48 89 fb 74 73 c7 05 49 0[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu

Maloo report: https://testing.hpdd.intel.com/test_sets/4f4c437a-268b-11e4-84f2-5254006e85c2
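Note on the crash signature: the faulting address 0000000000000007 is only a few bytes above NULL, which fits lustre_msg_get_opc() loading an early field of a struct lustre_msg through a stale or already-freed rq_reqmsg pointer. A minimal userspace sketch of that failure mode follows; the layout and names are illustrative stand-ins, not the exact Lustre definitions.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for struct lustre_msg_v2; the real lustre_msg_get_opc()
 * dispatches on lm_magic before digging out the opcode. */
struct lustre_msg_sketch {
        uint32_t lm_bufcount;
        uint32_t lm_secflvr;
        uint32_t lm_magic;      /* loaded near the start of the struct */
};

static uint32_t msg_get_opc_sketch(const struct lustre_msg_sketch *msg)
{
        /* If msg is NULL (or points at freed, poisoned memory), this
         * load faults at a tiny address -- the same class of fault as
         * "NULL pointer dereference at 0000000000000007" above. */
        return msg->lm_magic;
}

int main(void)
{
        struct lustre_msg_sketch *msg = calloc(1, sizeof(*msg));

        msg->lm_magic = 0x0BD00BD3;     /* LUSTRE_MSG_MAGIC_V2 */
        printf("live msg magic: %#x\n", msg_get_opc_sketch(msg));

        free(msg);
        msg = NULL;
        /* Calling msg_get_opc_sketch(msg) here reproduces the fault:
         * in the kernel, the dangling pointer was the rq_reqmsg of an
         * open request freed by the close path (see the comments). */
        return 0;
}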



 Comments   
Comment by Jian Yu [ 20/Aug/14 ]

Lustre client build: https://build.hpdd.intel.com/job/lustre-b2_5/80/
Lustre server build: https://build.hpdd.intel.com/job/lustre-b2_4/73/ (2.4.3)
Distro/Arch: RHEL6.5/x86_64

The same failure occurred: https://testing.hpdd.intel.com/test_sets/ea35137e-266f-11e4-8ee8-5254006e85c2

Comment by Jian Yu [ 21/Aug/14 ]

So far, the failure has not occurred on Lustre b2_5 builds #82 and #83.

Comment by Jian Yu [ 31/Aug/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1)

The same failure occurred: https://testing.hpdd.intel.com/test_sets/651d9592-30da-11e4-b503-5254006e85c2

Comment by Peter Jones [ 04/Nov/14 ]

This seems to occur sometimes. Any idea why?

Comment by Niu Yawei (Inactive) [ 11/Nov/14 ]

This appears to be a race between close and open replay, introduced by the fix for LU-2613 (4322e0f9): to free the queued open and close requests promptly, we now free them on file close; however, open replay can jump in at that point to fix up the stale open handle in those open and close requests. I'm going to post a patch soon.
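For illustration, a compressed userspace model of that interleaving (pthreads, with hypothetical names; the real players are the mdc close path and mdc_replay_open() running in a ptlrpcd thread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the saved open request (think mod_open_req/rq_reqmsg). */
struct open_req {
        int opc;
};

static struct open_req *saved_open;

/* Close path: since the LU-2613 change, queued open/close requests
 * are freed at file close rather than kept until the handle dies. */
static void *close_path(void *arg)
{
        (void)arg;
        free(saved_open);
        saved_open = NULL;
        return NULL;
}

/* Replay path: open replay fixing up the stale open handle. If
 * close_path() wins the race between the NULL check and the use,
 * this touches freed memory -- the kernel analogue is the
 * lustre_msg_get_opc() oops under mdc_replay_open(). */
static void *replay_path(void *arg)
{
        struct open_req *req = saved_open;

        (void)arg;
        if (req != NULL)
                printf("replaying opc %d\n", req->opc);
        return NULL;
}

int main(void)
{
        pthread_t closer, replayer;

        saved_open = calloc(1, sizeof(*saved_open));
        saved_open->opc = 101;

        pthread_create(&replayer, NULL, replay_path, NULL);
        pthread_create(&closer, NULL, close_path, NULL);
        pthread_join(replayer, NULL);
        pthread_join(closer, NULL);
        return 0;
}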

Comment by Niu Yawei (Inactive) [ 11/Nov/14 ]

http://review.whamcloud.com/12667

Comment by Gerrit Updater [ 03/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12667/
Subject: LU-5507 recovery: don't replay closed open
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cfbfcc6ad9ebb5893be2d1e85fc959794fd914ed
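Going by the subject line alone, the idea is to stop treating an already-closed open as replayable. A minimal sketch of that approach, with hypothetical lock and field names rather than the actual code of this commit:

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct open_req {
        bool rq_replay;         /* still eligible for open replay? */
};

static pthread_mutex_t req_lock = PTHREAD_MUTEX_INITIALIZER;
static struct open_req *saved_open;

/* Close path: retire the open from replay, under the lock, before
 * the request can be freed. */
static void close_open_req(void)
{
        pthread_mutex_lock(&req_lock);
        if (saved_open != NULL)
                saved_open->rq_replay = false;
        pthread_mutex_unlock(&req_lock);
}

/* Replay path: take the same lock and skip opens already closed,
 * so it never dereferences a request the close path is retiring. */
static bool should_replay_open(void)
{
        bool replay;

        pthread_mutex_lock(&req_lock);
        replay = saved_open != NULL && saved_open->rq_replay;
        pthread_mutex_unlock(&req_lock);
        return replay;
}

int main(void)
{
        static struct open_req req = { .rq_replay = true };

        saved_open = &req;
        close_open_req();
        return should_replay_open() ? 1 : 0;    /* 0: closed open skipped */
}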

Comment by Niu Yawei (Inactive) [ 05/Jan/15 ]

Patch landed on master.
