[LU-5446] Test timeout lustre-rsync-test test_4: NULL deref osc_sync_interpret+0x147 Created: 04/Aug/14 Updated: 02/Oct/14 Resolved: 02/Oct/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Emoly Liu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 15164 | ||||||||
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a119340e-180a-11e4-a93e-5254006e85c2
The sub-test test_4 failed with the following error:
Info required for matching: lustre-rsync-test 4
Client Console Log:
15:42:52:Lustre: DEBUG MARKER: == lustre-rsync-test test 4: Replicate files created by iozone == 21:41:12 (1407015672)
15:42:52:BUG: unable to handle kernel NULL pointer dereference at (null)
15:42:52:IP: [<ffffffffa1846827>] osc_sync_interpret+0x147/0x200 [osc]
15:42:52:PGD 7c211067 PUD 7b680067 PMD 0
15:42:52:Oops: 0002 [#1] SMP
15:42:52:last sysfs file: /sys/devices/system/cpu/online
15:42:53:CPU 1
15:42:53:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc_gss(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon i2c_piix4 i2c_core 8139too 8139cp mii ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
15:42:53:
15:42:53:Pid: 20333, comm: ptlrpcd_1 Not tainted 2.6.32-431.20.3.el6.x86_64 #1 Red Hat KVM
15:42:53:RIP: 0010:[<ffffffffa1846827>] [<ffffffffa1846827>] osc_sync_interpret+0x147/0x200 [osc]
15:42:53:RSP: 0018:ffff88007d445cb0 EFLAGS: 00010282
15:42:53:RAX: ffff88007a042580 RBX: ffff8800713a4ae0 RCX: 000000000000001a
15:42:53:RDX: 0000000000000000 RSI: ffff88007a042580 RDI: 0000000000000000
15:42:53:RBP: ffff88007d445cd0 R08: 0000000000000000 R09: 0000000000000001
15:42:53:R10: ffff88007bff9800 R11: 00000000000002a0 R12: 0000000000000000
15:42:53:R13: ffff8800713a4800 R14: ffff88007d3bb000 R15: ffff8800713a48c8
15:42:53:FS: 0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
15:42:53:CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
15:42:53:CR2: 0000000000000000 CR3: 0000000073c9a000 CR4: 00000000000006e0
15:42:53:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
15:42:53:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
15:42:53:Process ptlrpcd_1 (pid: 20333, threadinfo ffff88007d444000, task ffff88007d0b4ae0)
15:42:53:Stack:
15:42:53: ffff88005b4df7b0 ffff8800713a4800 ffff88005b4df7b0 ffff88005b4df780
15:42:53:<d> ffff88007d445d70 ffffffffa15e1531 0000000000000000 0000000000000286
15:42:53:<d> ffff88007d445d40 0000000100000001 ffff88007d445d20 ffff88007d0b5158
15:42:53:Call Trace:
15:42:53: [<ffffffffa15e1531>] ptlrpc_check_set+0x2c1/0x1b50 [ptlrpc]
15:42:53: [<ffffffffa160d5ab>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
15:42:53: [<ffffffffa160dbfb>] ptlrpcd+0x33b/0x3f0 [ptlrpc]
15:42:53: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
15:42:53: [<ffffffffa160d8c0>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
15:42:53: [<ffffffff8109abf6>] kthread+0x96/0xa0
15:42:53: [<ffffffff8100c20a>] child_rip+0xa/0x20
15:42:53: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
15:42:53: [<ffffffff8100c200>] ? child_rip+0x0/0x20
15:42:53:Code: ff 49 8d bd 70 03 00 00 48 c7 c6 60 f4 67 a1 e8 70 da dc ff 48 85 c0 74 1b 48 8b 13 b9 1a 00 00 00 48 89 c6 48 8b 52 40 48 89 d7 <f3> 48 a5 e9 fb fe ff ff 90 48 c7 c6 e7 3f 88 a1 48 c7 c7 20 58 |
| Comments |
| Comment by Jodi Levi (Inactive) [ 07/Aug/14 ] |
|
Jinshan, |
| Comment by Peter Jones [ 14/Aug/14 ] |
|
Emoly, do you think that this recent test failure on ZFS runs could be related to this commit? http://git.whamcloud.com/fs/lustre-release.git/commit/2b3663dda896f669c87feb49e7f3c7d85a89cefe Jinshan notes that it has been the only recent change in this area of code. Thanks, Peter |
| Comment by Emoly Liu [ 15/Aug/14 ] |
|
I'm not sure if it's related to this patch. I will have a look. |
| Comment by Emoly Liu [ 22/Aug/14 ] |
|
It only happened on ZFS and I can't reproduce it locally. I will work with Jinshan and see if it's related to the change of osc_io_fsync_end() made by http://review.whamcloud.com/11021. |
| Comment by Jinshan Xiong (Inactive) [ 27/Aug/14 ] |
|
Yes, this issue is related to patch 11021. In the fsync and setattr RPCs, we use some memory from osc_io, but when the waiting process is interrupted, it releases that memory. Therefore, when the client receives the reply later and tries to access that memory, it hits this BUG. As a solution, I would like to revert patch 11021 and make the SETATTR and PUNCH RPCs able to time out. |
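The lifetime problem described in the comment above (a reply callback writing into memory that an interrupted waiter has already released) can be illustrated with a small user-space sketch. This is not Lustre code and not the fix that was applied here (the fix was to revert patch 11021). Every name below (`shared_result`, `request_interpret`, and so on) is hypothetical, and the plain integer reference count assumes a single thread; kernel code would need an atomic counter. The sketch shows one generic way to keep such a buffer alive: the in-flight request holds its own reference, so the buffer is freed only by whichever side finishes last.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the attributes carried in an RPC reply. */
struct attrs {
    unsigned long size;
    unsigned long mtime;
};

/* Hypothetical result area shared by the waiter and the reply callback. */
struct shared_result {
    int refcount;       /* not atomic: single-threaded sketch only */
    int done;
    struct attrs out;
};

static struct shared_result *result_alloc(void)
{
    struct shared_result *r = calloc(1, sizeof(*r));
    if (r != NULL)
        r->refcount = 1;    /* the waiter's reference */
    return r;
}

static void result_get(struct shared_result *r)
{
    r->refcount++;
}

/* Drops one reference; returns 1 if this call freed the object. */
static int result_put(struct shared_result *r)
{
    if (--r->refcount == 0) {
        free(r);
        return 1;
    }
    return 0;
}

/* Hypothetical in-flight request; its callback args own a reference. */
struct request {
    struct shared_result *res;
};

static void request_send(struct request *req, struct shared_result *r)
{
    result_get(r);          /* the request's own reference */
    req->res = r;
}

/* Reply callback: copy the reply out, then drop the request's reference.
 * Returns 1 if the callback was the last holder and freed the result. */
static int request_interpret(struct request *req, const struct attrs *reply)
{
    int freed;

    req->res->out = *reply;
    req->res->done = 1;
    freed = result_put(req->res);
    req->res = NULL;
    return freed;
}

/* Waiter is interrupted before the reply arrives.
 * Returns 1 if the late reply was handled and freed the result last. */
static int interrupted_wait_scenario(void)
{
    struct shared_result *r = result_alloc();
    struct request req;
    struct attrs reply = { 4096, 1407015672UL };
    int freed_by_waiter, freed_by_callback;

    if (r == NULL)
        return 0;
    request_send(&req, r);
    freed_by_waiter = result_put(r);    /* interrupted: waiter leaves */
    freed_by_callback = request_interpret(&req, &reply);
    return !freed_by_waiter && freed_by_callback;
}

/* Waiter stays until the reply arrives, reads it, then releases. */
static unsigned long normal_wait_scenario(void)
{
    struct shared_result *r = result_alloc();
    struct request req;
    struct attrs reply = { 4096, 1407015672UL };
    unsigned long size;

    if (r == NULL)
        return 0;
    request_send(&req, r);
    request_interpret(&req, &reply);
    size = r->done ? r->out.size : 0;
    result_put(r);
    return size;
}
```

In the interrupted case the waiter's `result_put()` leaves the buffer allocated because the request still holds a reference, so the late `request_interpret()` writes into valid memory and then frees it. Without the request's own reference, the same sequence would write into freed memory, which is the general class of failure the comment describes.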
| Comment by Emoly Liu [ 28/Aug/14 ] |
|
Thanks Xiong! Oleg, do you agree to revert the patch http://review.whamcloud.com/11021 per Xiong's comment? |
| Comment by Oleg Drokin [ 12/Sep/14 ] |
|
reverted patch 11021 |
| Comment by Emoly Liu [ 17/Sep/14 ] |
|
This problem was caused by patch 11021, and it has gone away since Oleg reverted that patch. So can we lower its priority? |
| Comment by Jodi Levi (Inactive) [ 02/Oct/14 ] |
|
Reverted http://review.whamcloud.com/11021 and resolved this problem. |