Test timeout lustre-rsync-test test_4: NULL deref osc_sync_interpret+0x147

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.7.0
    • None
    • None
    • 3
    • 15164

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a119340e-180a-11e4-a93e-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/90da9c4c-1ab0-11e4-9259-5254006e85c2

      The sub-test test_4 failed with the following error:

      test failed to respond and timed out

      Info required for matching: lustre-rsync-test 4

      Client Console Log:

      15:42:52:Lustre: DEBUG MARKER: == lustre-rsync-test test 4: Replicate files created by iozone == 21:41:12 (1407015672)
      15:42:52:BUG: unable to handle kernel NULL pointer dereference at (null)
      15:42:52:IP: [<ffffffffa1846827>] osc_sync_interpret+0x147/0x200 [osc]
      15:42:52:PGD 7c211067 PUD 7b680067 PMD 0 
      15:42:52:Oops: 0002 [#1] SMP 
      15:42:52:last sysfs file: /sys/devices/system/cpu/online
      15:42:53:CPU 1 
      15:42:53:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc_gss(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon i2c_piix4 i2c_core 8139too 8139cp mii ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      15:42:53:
      15:42:53:Pid: 20333, comm: ptlrpcd_1 Not tainted 2.6.32-431.20.3.el6.x86_64 #1 Red Hat KVM
      15:42:53:RIP: 0010:[<ffffffffa1846827>]  [<ffffffffa1846827>] osc_sync_interpret+0x147/0x200 [osc]
      15:42:53:RSP: 0018:ffff88007d445cb0  EFLAGS: 00010282
      15:42:53:RAX: ffff88007a042580 RBX: ffff8800713a4ae0 RCX: 000000000000001a
      15:42:53:RDX: 0000000000000000 RSI: ffff88007a042580 RDI: 0000000000000000
      15:42:53:RBP: ffff88007d445cd0 R08: 0000000000000000 R09: 0000000000000001
      15:42:53:R10: ffff88007bff9800 R11: 00000000000002a0 R12: 0000000000000000
      15:42:53:R13: ffff8800713a4800 R14: ffff88007d3bb000 R15: ffff8800713a48c8
      15:42:53:FS:  0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
      15:42:53:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      15:42:53:CR2: 0000000000000000 CR3: 0000000073c9a000 CR4: 00000000000006e0
      15:42:53:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      15:42:53:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      15:42:53:Process ptlrpcd_1 (pid: 20333, threadinfo ffff88007d444000, task ffff88007d0b4ae0)
      15:42:53:Stack:
      15:42:53: ffff88005b4df7b0 ffff8800713a4800 ffff88005b4df7b0 ffff88005b4df780
      15:42:53:<d> ffff88007d445d70 ffffffffa15e1531 0000000000000000 0000000000000286
      15:42:53:<d> ffff88007d445d40 0000000100000001 ffff88007d445d20 ffff88007d0b5158
      15:42:53:Call Trace:
      15:42:53: [<ffffffffa15e1531>] ptlrpc_check_set+0x2c1/0x1b50 [ptlrpc]
      15:42:53: [<ffffffffa160d5ab>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
      15:42:53: [<ffffffffa160dbfb>] ptlrpcd+0x33b/0x3f0 [ptlrpc]
      15:42:53: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      15:42:53: [<ffffffffa160d8c0>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
      15:42:53: [<ffffffff8109abf6>] kthread+0x96/0xa0
      15:42:53: [<ffffffff8100c20a>] child_rip+0xa/0x20
      15:42:53: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      15:42:53: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      15:42:53:Code: ff 49 8d bd 70 03 00 00 48 c7 c6 60 f4 67 a1 e8 70 da dc ff 48 85 c0 74 1b 48 8b 13 b9 1a 00 00 00 48 89 c6 48 8b 52 40 48 89 d7 <f3> 48 a5 e9 fb fe ff ff 90 48 c7 c6 e7 3f 88 a1 48 c7 c7 20 58 
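
The oops header in the log above can be read mechanically. On x86, the number printed after `Oops:` is the page-fault error code: bit 0 distinguishes a protection violation from a not-present page, bit 1 marks a write, and bit 2 marks a user-mode access. The sketch below decodes those three bits; the struct and function names are illustrative and are not kernel APIs. Separately, the marked bytes `<f3> 48 a5` in the `Code:` line encode `rep movsq`, and with RDI = 0 and RCX = 0x1a this is consistent with a 26-quadword (208-byte) copy into a NULL destination, although the log alone does not say which structure was being copied.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative decoding of the low three bits of an x86 page-fault error
 * code, i.e. the value printed after "Oops:" in the console log. */
struct pf_error {
    bool protection; /* bit 0: 1 = protection violation, 0 = page not present */
    bool write;      /* bit 1: 1 = write access, 0 = read access */
    bool user_mode;  /* bit 2: 1 = fault raised in user mode, 0 = kernel mode */
};

static struct pf_error decode_pf_error(unsigned long code)
{
    struct pf_error e;

    e.protection = (code & 0x1UL) != 0;
    e.write      = (code & 0x2UL) != 0;
    e.user_mode  = (code & 0x4UL) != 0;
    return e;
}

/* Print a one-line human-readable summary of the error code. */
static void print_pf_error(unsigned long code)
{
    struct pf_error e = decode_pf_error(code);

    printf("%s %s in %s mode\n",
           e.write ? "write to" : "read from",
           e.protection ? "a protected page" : "a not-present page",
           e.user_mode ? "user" : "kernel");
}
```

Calling `print_pf_error(0x0002)` with the value from this ticket reports a kernel-mode write to a not-present page, which matches a store through a NULL pointer.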
      

          Activity

            jlevi Jodi Levi (Inactive) added a comment -

            Patch http://review.whamcloud.com/11021 was reverted, which resolved this problem.

            emoly.liu Emoly Liu added a comment -

            This problem was caused by patch 11021, and it has gone away since Oleg reverted that patch. Can we lower its priority?

            green Oleg Drokin added a comment -

            Reverted patch 11021.

            emoly.liu Emoly Liu added a comment -

            Thanks Xiong!

            Oleg, do you agree to revert the patch http://review.whamcloud.com/11021 per Xiong's comment?

            jay Jinshan Xiong (Inactive) added a comment -

            Yes, this issue is related to patch 11021.

            In the fsync and setattr RPCs, we use memory that belongs to osc_io, but when the waiting process is interrupted, that memory is released. When the client later receives the reply and tries to access that memory, it hits this BUG.

            As a solution, I would like to revert patch 11021 and make the SETATTR and PUNCH RPCs able to time out.
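
The lifetime problem described in the comment above (a reply callback writing into memory that an interrupted waiter has already released) can be illustrated in user space. The sketch below is not Lustre code and is not the fix adopted in this ticket, which was to revert patch 11021. It shows a different, commonly used technique: giving the shared arguments a reference count so the callback never depends on the waiter's memory. All names (`async_args`, `reply_interpret`, `waiter_release`) are hypothetical, and the plain `int` counter assumes a single thread; real concurrent code would need an atomic counter.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the attributes carried in an RPC reply. */
struct reply_attrs {
    unsigned long long size;
    unsigned long long blocks;
};

/* Hypothetical arguments shared by the waiter and the reply callback.
 * The result storage lives here, not in the waiter's own I/O context. */
struct async_args {
    int refcount;           /* one reference each for waiter and callback */
    int done;               /* set once the reply has been processed */
    struct reply_attrs out; /* result copied from the reply */
};

static struct async_args *args_alloc(void)
{
    struct async_args *aa = calloc(1, sizeof(*aa));

    if (aa != NULL)
        aa->refcount = 2;
    return aa;
}

/* Drop one reference. Returns 1 if this call freed the structure. */
static int args_put(struct async_args *aa)
{
    assert(aa->refcount > 0);
    if (--aa->refcount > 0)
        return 0;
    free(aa);
    return 1;
}

/* Reply callback: may run after the waiter has already given up.
 * Writing to aa->out is safe because the callback holds its own reference. */
static int reply_interpret(struct async_args *aa,
                           const struct reply_attrs *reply)
{
    aa->out = *reply;
    aa->done = 1;
    return args_put(aa);
}

/* Waiter side: called on normal completion or when the wait is interrupted. */
static int waiter_release(struct async_args *aa)
{
    return args_put(aa);
}
```

If the waiter is interrupted first, `waiter_release()` only drops a reference, and the later `reply_interpret()` call still writes into live memory before freeing it. If the reply arrives first, the waiter can read `out` and then release the structure.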
            emoly.liu Emoly Liu added a comment -

            It only happened on ZFS and I can't reproduce it locally. I will work with Jinshan and see if it's related to the change of osc_io_fsync_end() made by http://review.whamcloud.com/11021.

            emoly.liu Emoly Liu added a comment -

            I'm not sure if it's related to this patch. I will have a look.

            pjones Peter Jones added a comment -

            Emoly

            Do you think that this recent test failure on ZFS runs could be related to this commit?

            http://git.whamcloud.com/fs/lustre-release.git/commit/2b3663dda896f669c87feb49e7f3c7d85a89cefe

            Jinshan notes that it has been the only recent change in this area of the code.

            Thanks

            Peter


            jlevi Jodi Levi (Inactive) added a comment -

            Jinshan,
            Can you please comment on this one?

            People

              emoly.liu Emoly Liu
              maloo Maloo
              Votes: 0
              Watchers: 7
