[LU-5446] Test timeout lustre-rsync-test test_4: NULL deref osc_sync_interpret+0x147
Created: 04/Aug/14  Updated: 02/Oct/14  Resolved: 02/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Emoly Liu
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5261 user process is unkillable in wait_fo... Reopened
Severity: 3
Rank (Obsolete): 15164

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite runs: https://testing.hpdd.intel.com/test_sets/a119340e-180a-11e4-a93e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/90da9c4c-1ab0-11e4-9259-5254006e85c2

The sub-test test_4 failed with the following error:

test failed to respond and timed out

Info required for matching: lustre-rsync-test 4

Client Console Log:

15:42:52:Lustre: DEBUG MARKER: == lustre-rsync-test test 4: Replicate files created by iozone == 21:41:12 (1407015672)
15:42:52:BUG: unable to handle kernel NULL pointer dereference at (null)
15:42:52:IP: [<ffffffffa1846827>] osc_sync_interpret+0x147/0x200 [osc]
15:42:52:PGD 7c211067 PUD 7b680067 PMD 0 
15:42:52:Oops: 0002 [#1] SMP 
15:42:52:last sysfs file: /sys/devices/system/cpu/online
15:42:53:CPU 1 
15:42:53:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc_gss(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon i2c_piix4 i2c_core 8139too 8139cp mii ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
15:42:53:
15:42:53:Pid: 20333, comm: ptlrpcd_1 Not tainted 2.6.32-431.20.3.el6.x86_64 #1 Red Hat KVM
15:42:53:RIP: 0010:[<ffffffffa1846827>]  [<ffffffffa1846827>] osc_sync_interpret+0x147/0x200 [osc]
15:42:53:RSP: 0018:ffff88007d445cb0  EFLAGS: 00010282
15:42:53:RAX: ffff88007a042580 RBX: ffff8800713a4ae0 RCX: 000000000000001a
15:42:53:RDX: 0000000000000000 RSI: ffff88007a042580 RDI: 0000000000000000
15:42:53:RBP: ffff88007d445cd0 R08: 0000000000000000 R09: 0000000000000001
15:42:53:R10: ffff88007bff9800 R11: 00000000000002a0 R12: 0000000000000000
15:42:53:R13: ffff8800713a4800 R14: ffff88007d3bb000 R15: ffff8800713a48c8
15:42:53:FS:  0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
15:42:53:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
15:42:53:CR2: 0000000000000000 CR3: 0000000073c9a000 CR4: 00000000000006e0
15:42:53:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
15:42:53:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
15:42:53:Process ptlrpcd_1 (pid: 20333, threadinfo ffff88007d444000, task ffff88007d0b4ae0)
15:42:53:Stack:
15:42:53: ffff88005b4df7b0 ffff8800713a4800 ffff88005b4df7b0 ffff88005b4df780
15:42:53:<d> ffff88007d445d70 ffffffffa15e1531 0000000000000000 0000000000000286
15:42:53:<d> ffff88007d445d40 0000000100000001 ffff88007d445d20 ffff88007d0b5158
15:42:53:Call Trace:
15:42:53: [<ffffffffa15e1531>] ptlrpc_check_set+0x2c1/0x1b50 [ptlrpc]
15:42:53: [<ffffffffa160d5ab>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
15:42:53: [<ffffffffa160dbfb>] ptlrpcd+0x33b/0x3f0 [ptlrpc]
15:42:53: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
15:42:53: [<ffffffffa160d8c0>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
15:42:53: [<ffffffff8109abf6>] kthread+0x96/0xa0
15:42:53: [<ffffffff8100c20a>] child_rip+0xa/0x20
15:42:53: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
15:42:53: [<ffffffff8100c200>] ? child_rip+0x0/0x20
15:42:53:Code: ff 49 8d bd 70 03 00 00 48 c7 c6 60 f4 67 a1 e8 70 da dc ff 48 85 c0 74 1b 48 8b 13 b9 1a 00 00 00 48 89 c6 48 8b 52 40 48 89 d7 <f3> 48 a5 e9 fb fe ff ff 90 48 c7 c6 e7 3f 88 a1 48 c7 c7 20 58 


 Comments   
Comment by Jodi Levi (Inactive) [ 07/Aug/14 ]

Jinshan,
Can you please comment on this one?

Comment by Peter Jones [ 14/Aug/14 ]

Emoly

Do you think that this recent test failure on ZFS runs could be related to this commit?

http://git.whamcloud.com/fs/lustre-release.git/commit/2b3663dda896f669c87feb49e7f3c7d85a89cefe

Jinshan notes that it has been the only recent change in this area of the code.

Thanks

Peter

Comment by Emoly Liu [ 15/Aug/14 ]

I'm not sure if it's related to this patch. I will have a look.

Comment by Emoly Liu [ 22/Aug/14 ]

It only happened on ZFS and I can't reproduce it locally. I will work with Jinshan and see if it's related to the change of osc_io_fsync_end() made by http://review.whamcloud.com/11021.

Comment by Jinshan Xiong (Inactive) [ 27/Aug/14 ]

Yes, this issue is related to patch 11021.

For the fsync and setattr RPCs, we used memory from osc_io, but when the waiting process is interrupted, that memory is released. When the client later receives the reply and tries to access that memory, it hits this BUG.

As a solution, I would like to revert patch 11021 and make the SETATTR and PUNCH RPCs able to time out.
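
The lifetime problem described above can be illustrated with a minimal user-space sketch. This is not Lustre code and not the fix applied in this ticket (patch 11021 was reverted); every name below (struct attrs, struct sync_ctx, sync_interpret, demo_interrupted_sync) is hypothetical. It only shows one common way to keep a reply buffer valid after the waiter gives up: the in-flight request holds its own reference on the context that the reply callback writes into.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the attribute block copied out of a reply. */
struct attrs {
    unsigned long long size;
    unsigned long long blocks;
};

/* Hypothetical context shared by the waiter and the reply callback. */
struct sync_ctx {
    int refcount;         /* one reference per holder (waiter, request) */
    int done;             /* set by the callback once the reply arrives */
    struct attrs result;  /* owned by the context, not by the waiter */
};

static int live_contexts; /* allocations not yet freed, for the demo */

static struct sync_ctx *ctx_alloc(void)
{
    struct sync_ctx *ctx = calloc(1, sizeof(*ctx));

    if (ctx != NULL) {
        ctx->refcount = 1; /* the waiter's reference */
        live_contexts++;
    }
    return ctx;
}

/* Plain int counter: fine for this single-threaded sketch only;
 * concurrent code would need an atomic reference count. */
static void ctx_get(struct sync_ctx *ctx)
{
    ctx->refcount++;
}

static void ctx_put(struct sync_ctx *ctx)
{
    if (--ctx->refcount == 0) {
        free(ctx);
        live_contexts--;
    }
}

/* Reply callback: copies into memory that the request keeps alive. */
static void sync_interpret(struct sync_ctx *ctx, const struct attrs *reply)
{
    memcpy(&ctx->result, reply, sizeof(ctx->result));
    ctx->done = 1;
    ctx_put(ctx);          /* drop the request's reference */
}

/* Simulates: send, waiter interrupted, reply arrives afterwards. */
static int demo_interrupted_sync(void)
{
    struct attrs reply = { .size = 4096, .blocks = 8 };
    struct sync_ctx *ctx = ctx_alloc();

    if (ctx == NULL)
        return -1;
    ctx_get(ctx);                 /* reference held by the in-flight request */
    ctx_put(ctx);                 /* waiter is interrupted and gives up */
    sync_interpret(ctx, &reply);  /* late reply: ctx is still valid here */
    return live_contexts;         /* 0 when nothing leaked */
}
```

Calling demo_interrupted_sync() walks through the interrupted-waiter sequence. Without the ctx_get() call, the waiter's ctx_put() would free the context and the later memcpy() would touch released memory, which is the class of failure described in this comment.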

Comment by Emoly Liu [ 28/Aug/14 ]

Thanks Xiong!

Oleg, do you agree to revert the patch http://review.whamcloud.com/11021 per Xiong's comment?

Comment by Oleg Drokin [ 12/Sep/14 ]

Reverted patch 11021.

Comment by Emoly Liu [ 17/Sep/14 ]

This problem was caused by patch 11021, and it has gone away since Oleg reverted that patch. Can we lower its priority?

Comment by Jodi Levi (Inactive) [ 02/Oct/14 ]

Patch http://review.whamcloud.com/11021 was reverted, which resolved this problem.

Generated at Sat Feb 10 01:51:33 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.