Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3270

ptlrpcd strnlen crash trying to log a message

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.7.0
    • None
    • None
    • 3
    • 8107

    Description

      Just hit this (use after free, I think) after a few days of recovery-small:

      [307783.003678] Lustre: DEBUG MARKER: == recovery-small test complete, duration 1814 sec == 08:30:02 (1367584202)
      [307783.153758] BUG: unable to handle kernel paging request at ffff880097114fa8
      [307783.154227] IP: [<ffffffff8127cc39>] strnlen+0x9/0x40
      [307783.154604] PGD 1a26063 PUD 501067 PMD 5ba067 PTE 8000000097114163
      [307783.155028] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [307783.155408] last sysfs file: /sys/devices/system/cpu/possible
      [307783.155791] CPU 0 
      [307783.155847] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
      [307783.156905] 
      [307783.156905] Pid: 17616, comm: ptlrpcd_3 Not tainted 2.6.32-rhe6.4-debug #2 Bochs Bochs
      [307783.156905] RIP: 0010:[<ffffffff8127cc39>]  [<ffffffff8127cc39>] strnlen+0x9/0x40
      [307783.156905] RSP: 0018:ffff8800af8e9980  EFLAGS: 00010206
      [307783.156905] RAX: ffffffff817a6cc1 RBX: ffff8800b00e1000 RCX: 0000000000000001
      [307783.156905] RDX: ffff880097114fa8 RSI: 0000000000000017 RDI: ffff880097114fa8
      [307783.156905] RBP: ffff8800af8e9980 R08: 000000000000002e R09: 0000000000000053
      [307783.156905] R10: 0000000000000001 R11: 0000000000000104 R12: ffff8800b00e0a4b
      [307783.156905] R13: ffff880097114fa8 R14: 00000000ffffffff R15: 0000000000000000
      [307783.156905] FS:  00007fbed06dc700(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
      [307783.156905] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [307783.156905] CR2: ffff880097114fa8 CR3: 0000000001a25000 CR4: 00000000000006f0
      [307783.156905] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [307783.156905] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [307783.156905] Process ptlrpcd_3 (pid: 17616, threadinfo ffff8800af8e8000, task ffff8800b08aa5c0)
      [307783.156905] Stack:
      [307783.156905]  ffff8800af8e99c0 ffffffff8127def0 ffff880099facdf0 ffff8800b00e0a4b
      [307783.156905] <d> ffffffffa13328e0 ffffffffa13328df ffff8800af8e9b40 ffff8800b00e1000
      [307783.156905] <d> ffff8800af8e9a60 ffffffff8127f338 0000000000000004 0000000affffffff
      [307783.156905] Call Trace:
      [307783.156905]  [<ffffffff8127def0>] string+0x40/0x100
      [307783.156905]  [<ffffffff8127f338>] vsnprintf+0x218/0x5e0
      [307783.156905]  [<ffffffffa0a9127b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
      [307783.156905]  [<ffffffffa0aa1b6b>] libcfs_debug_vmsg2+0x2fb/0xab0 [libcfs]
      [307783.156905]  [<ffffffffa0aa2361>] libcfs_debug_msg+0x41/0x50 [libcfs]
      [307783.156905]  [<ffffffffa0a9cde8>] ? libcfs_log_return+0x28/0x40 [libcfs]
      [307783.156905]  [<ffffffffa132c2e5>] mdc_finish_intent_lock+0x535/0x7f0 [mdc]
      [307783.156905]  [<ffffffffa0a9cde8>] ? libcfs_log_return+0x28/0x40 [libcfs]
      [307783.156905]  [<ffffffffa132d388>] ? mdc_finish_enqueue+0x208/0x1180 [mdc]
      [307783.156905]  [<ffffffffa132e626>] mdc_intent_getattr_async_interpret+0x326/0x4c0 [mdc]
      [307783.156905]  [<ffffffffa0ea46b2>] ptlrpc_check_set+0x2b2/0x1da0 [ptlrpc]
      [307783.156905]  [<ffffffffa0ed1e1b>] ptlrpcd_check+0x55b/0x590 [ptlrpc]
      [307783.156905]  [<ffffffffa0ed2363>] ptlrpcd+0x233/0x390 [ptlrpc]
      [307783.156905]  [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
      [307783.156905]  [<ffffffffa0ed2130>] ? ptlrpcd+0x0/0x390 [ptlrpc]
      [307783.156905]  [<ffffffff8100c10a>] child_rip+0xa/0x20
      [307783.156905]  [<ffffffffa0ed2130>] ? ptlrpcd+0x0/0x390 [ptlrpc]
      [307783.156905]  [<ffffffffa0ed2130>] ? ptlrpcd+0x0/0x390 [ptlrpc]
      [307783.156905]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
      [307783.156905] Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00 
      

      Crashdump and modules are in /exports/crashdumps/192.168.10.217-2013-05-03-08\:30\:08/
      source tag is 20130430

      Attachments

        Issue Links

          Activity

            [LU-3270] ptlrpcd strnlen crash trying to log a message

            Last patch has been merged. This ticket can be closed.

            simmonsja James A Simmons added a comment - Last patch has been merged. This ticket can be closed.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9667/
            Subject: LU-3270 statahead: small fixes and cleanup
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 672ab0e00d6148330561d21ccff9331124914d35

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9667/ Subject: LU-3270 statahead: small fixes and cleanup Project: fs/lustre-release Branch: master Current Patch Set: Commit: 672ab0e00d6148330561d21ccff9331124914d35

            Many patches have been landed. Only one patch left at http://review.whamcloud.com/#/c/9667. Please inspect.

            simmonsja James A Simmons added a comment - Many patches have been landed. Only one patch left at http://review.whamcloud.com/#/c/9667 . Please inspect.

            Refreshed http://review.whamcloud.com/#/c/9663. Patch is ready for inspection.

            simmonsja James A Simmons added a comment - Refreshed http://review.whamcloud.com/#/c/9663 . Patch is ready for inspection.

            Has this work been delayed until 2.8?

            simmonsja James A Simmons added a comment - Has this work been delayed until 2.8?
            laisiyao Lai Siyao added a comment -

            Thanks Ryan, and as is pointed out in LU-4985, this has been fixed by LU-3498, which has been landed to master, and waiting to land on 2.5.

            laisiyao Lai Siyao added a comment - Thanks Ryan, and as is pointed out in LU-4985 , this has been fixed by LU-3498 , which has been landed to master, and waiting to land on 2.5.
            haasken Ryan Haasken added a comment -

            Lai, I have created LU-4985 for the bug about incorrect handling of the kthread_run return value in start_statahead_thread. That bug is also for other calls to kthread_run which are handled incorrectly.

            haasken Ryan Haasken added a comment - Lai, I have created LU-4985 for the bug about incorrect handling of the kthread_run return value in start_statahead_thread. That bug is also for other calls to kthread_run which are handled incorrectly.
            laisiyao Lai Siyao added a comment -

            Thank you, I'll fix it in related patch soon.

            laisiyao Lai Siyao added a comment - Thank you, I'll fix it in related patch soon.

            >> Any idea why statahead may not have gotten started? The clue is probably this error message:
            >>
            >> (statahead.c:1660:do_statahead_enter()) can't start ll_sa thread, rc: -2080

            Finally figured out the cause of this bug. The 64-bit output of kthread_run is being assigned to an int and truncated. In the cases where a valid task pointer, when interpreted as an int, falls between 0 and -4095 it is treated as an error. The code in ll_start_agl handles the return from kthread_run correctly.

            amk Ann Koehler (Inactive) added a comment - >> Any idea why statahead may not have gotten started? The clue is probably this error message: >> >> (statahead.c:1660:do_statahead_enter()) can't start ll_sa thread, rc: -2080 Finally figured out the cause of this bug. The 64-bit output of kthread_run is being assigned to an int and truncated. In the cases where a valid task pointer, when interpreted as an int, falls between 0 and -4095 it is treated as an error. The code in ll_start_agl handles the return from kthread_run correctly.

            With patchset 20 backported to 2.5, an ll_agl_XXXX thread occasionally hits a general protection fault in __wake_up. In all cases, the corresponding ll_sa_XXXX thread has terminated. Looks like the sai struct has been deallocated so the ll_agl_XXXX thread panics when it goes to access it.

            Dump uploaded to: ftp.whamcloud.com:/upload/LU-3270/LU-3270_ll_agl_gpf.tgz
            Client is Lustre 2.5 with LU-3270 PS20 running on SLES11 SP3

            crash> ps | grep ll_sa
              20979      2   2  ffff88080f12f7e0  IN   0.0       0      0  [ll_sa_20914]
            crash> ps | grep ll_agl
            > 20978      2   0  ffff8807d9783080  RU   0.0       0      0  [ll_agl_20915]
              20980      2   3  ffff88080f12f040  IN   0.0       0      0  [ll_agl_20914]
            crash> bt
            PID: 20978  TASK: ffff8807d9783080  CPU: 0   COMMAND: "ll_agl_20915"
             #0 [ffff8807d9c1fcc0] die at ffffffff81005f6b
             #1 [ffff8807d9c1fcf0] do_general_protection at ffffffff8100361a
             #2 [ffff8807d9c1fd20] general_protection at ffffffff813628df
                [exception RIP: __wake_up_common+49]
                RIP: ffffffff8102fc91  RSP: ffff8807d9c1fdd0  RFLAGS: 00010096
                RAX: 5a5a5a5a5a5a5a5a  RBX: ffff88080095b118  RCX: 0000000000000000
                RDX: 0000000000000001  RSI: 0000000000000003  RDI: ffff88080095b118
                RBP: ffff8807d9c1fe00   R8: 5a5a5a5a5a5a5a42   R9: 0000000000000000
                R10: 0000000000000001  R11: 0000000000000001  R12: 0000000000000000
                R13: 0000000000000001  R14: ffff88080095b120  R15: 0000000000000000
                ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
             #3 [ffff8807d9c1fe08] __wake_up at ffffffff81032b68
             #4 [ffff8807d9c1fe48] ll_agl_thread at ffffffffa0795aa2 [lustre]
             #5 [ffff8807d9c1fee8] kthread at ffffffff8106610e
             #6 [ffff8807d9c1ff48] kernel_thread_helper at ffffffff81364034
            
            crash> ll_inode_info ffff8806325a9580 | grep magic
              lli_inode_magic = 287116773,
            crash> ll_inode_info ffff8806325a9580 | grep sai
                  d_sai = 0x0,
            
            amk Ann Koehler (Inactive) added a comment - With patchset 20 backported to 2.5, an ll_agl_XXXX thread occasionally hits a general protection fault in __wake_up. In all cases, the corresponding ll_sa_XXXX thread has terminated. Looks like the sai struct has been deallocated so the ll_agl_XXXX thread panics when it goes to access it. Dump uploaded to: ftp.whamcloud.com:/upload/ LU-3270 / LU-3270 _ll_agl_gpf.tgz Client is Lustre 2.5 with LU-3270 PS20 running on SLES11 SP3 crash> ps | grep ll_sa 20979 2 2 ffff88080f12f7e0 IN 0.0 0 0 [ll_sa_20914] crash> ps | grep ll_agl > 20978 2 0 ffff8807d9783080 RU 0.0 0 0 [ll_agl_20915] 20980 2 3 ffff88080f12f040 IN 0.0 0 0 [ll_agl_20914] crash> bt PID: 20978 TASK: ffff8807d9783080 CPU: 0 COMMAND: "ll_agl_20915" #0 [ffff8807d9c1fcc0] die at ffffffff81005f6b #1 [ffff8807d9c1fcf0] do_general_protection at ffffffff8100361a #2 [ffff8807d9c1fd20] general_protection at ffffffff813628df [exception RIP: __wake_up_common+49] RIP: ffffffff8102fc91 RSP: ffff8807d9c1fdd0 RFLAGS: 00010096 RAX: 5a5a5a5a5a5a5a5a RBX: ffff88080095b118 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88080095b118 RBP: ffff8807d9c1fe00 R8: 5a5a5a5a5a5a5a42 R9: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000001 R14: ffff88080095b120 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #3 [ffff8807d9c1fe08] __wake_up at ffffffff81032b68 #4 [ffff8807d9c1fe48] ll_agl_thread at ffffffffa0795aa2 [lustre] #5 [ffff8807d9c1fee8] kthread at ffffffff8106610e #6 [ffff8807d9c1ff48] kernel_thread_helper at ffffffff81364034 crash> ll_inode_info ffff8806325a9580 | grep magic lli_inode_magic = 287116773, crash> ll_inode_info ffff8806325a9580 | grep sai d_sai = 0x0,

            For what it's worth, we've also hit the "can't start ll_sa thread" bug four times, though with a slightly different followup assertion:

            LustreError: 14399:0:(statahead.c:1698:do_statahead_enter()) can't start ll_sa thread: rc: -2752
            LustreError: 14399:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed

            I don't have all four crashes, but I believe that rc was different each time.

            Lustre: Lustre: Build Version: 2.5.0-RC1--PRISTINE-2.6.32-279.14.1.el6.x86_64
            Linux hnod0032 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

            phils@dugeo.com Phil Schwan (Inactive) added a comment - For what it's worth, we've also hit the "can't start ll_sa thread" bug four times, though with a slightly different followup assertion: LustreError: 14399:0:(statahead.c:1698:do_statahead_enter()) can't start ll_sa thread: rc: -2752 LustreError: 14399:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed I don't have all four crashes, but I believe that rc was different each time. Lustre: Lustre: Build Version: 2.5.0-RC1--PRISTINE-2.6.32-279.14.1.el6.x86_64 Linux hnod0032 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

            People

              laisiyao Lai Siyao
              green Oleg Drokin
              Votes:
              1 Vote for this issue
              Watchers:
              18 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: