
Racer kernel panic in _ldlm_lock_debug

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • Lustre 1.8.6
    • Lustre 2.0.0, Lustre 2.1.0
    • None
    • 3
    • 24,099
    • 10109

    Description

      Oracle reports this failure:
      2010-11-02 12:42:17 Lustre: DEBUG MARKER: == runracer test 1: racer on clients: sfire31,sfire32
      DURATION=120 ================================== 12:39:34 (1288723174)
      2010-11-02 12:42:40 general protection fault: 0000 [1] SMP
      2010-11-02 12:42:40 last sysfs file: /devices/pci0000:00/0000:00:09.0/irq
      2010-11-02 12:42:40 CPU 3
      2010-11-02 12:42:40 Modules linked in: llite_lloop(U) lustre(U) mgc(U) lov(U) mdc(U) lmv(U) fid(U)
      fld(U) lquota(U) osc(U) obdecho(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U)
      autofs4 hidp rfcomm l2cap bluetooth ipv6 xfrm_nalgo crypto_api loop dm_mirror dm_log dm_multipath
      scsi_dh dm_mod raid1 video backlight sbs power_meter i2c_ec dell_wmi wmi button battery asus_acpi
      acpi_memhotplug ac parport_pc lp parport sd_mod sg sata_nv pcspkr ohci_hcd i2c_nforce2 shpchp
      libata i2c_core ehci_hcd scsi_mod forcedeth k8temp k8_edac edac_mc hwmon serio_raw tg3 nfs lockd
      fscache nfs_acl sunrpc
      2010-11-02 12:42:40 Pid: 27192, comm: ldlm_bl_08 Tainted: G 2.6.18-194.17.1.0.1.el5 #1
      2010-11-02 12:42:40 RIP: 0010:[<ffffffff885c4215>] [<ffffffff885c4215>]
      :ptlrpc:_ldlm_lock_debug+0x545/0x6d0
      2010-11-02 12:42:40 RSP: 0000:ffff81010f451c80 EFLAGS: 00010246
      2010-11-02 12:42:40 RAX: 5a5a5a5a5a5a5a5a RBX: ffff8101080e9840 RCX: ffffffff88657298
      2010-11-02 12:42:40 RDX: 0000000000010000 RSI: 0000000000010000 RDI: 0000000000000000
      2010-11-02 12:42:40 RBP: 0000000000006f10 R08: ffffffff8864e220 R09: 00000000000005b0
      2010-11-02 12:42:40 R10: ffff8101122b9900 R11: 0000000000000000 R12: 00000000ffffff9d
      2010-11-02 12:42:40 R13: ffff81010f451ee0 R14: ffffffff88657428 R15: 0000000000010000
      2010-11-02 12:42:40 FS: 00002ac83f14e230(0000) GS:ffff81011fc819c0(0000) knlGS:00000000f7f1b6c0
      2010-11-02 12:42:40 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      2010-11-02 12:42:40 CR2: 0000003168952990 CR3: 0000000105cd8000 CR4: 00000000000006e0
      2010-11-02 12:42:40 Process ldlm_bl_08 (pid: 27192, threadinfo ffff81010f450000, task
      ffff810111ca7100)
      2010-11-02 12:42:40 Stack: ffffffff00000000 ffffffff88661d78 ffffffff88661d87 0000000200000407
      2010-11-02 12:42:40 ffff8101122b9900 45bcf320a97fb383 0000006000000005 ffffffff00000001
      2010-11-02 12:42:40 0000000000000000 ffffffff88661d78 ffffffff88661d81 5a5a5a5a5a5a5a5a
      2010-11-02 12:42:40 Call Trace:
      2010-11-02 12:42:40 [<ffffffff80062ff8>] thread_return+0x62/0xfe
      2010-11-02 12:42:40 [<ffffffff885e46b0>] :ptlrpc:ldlm_handle_bl_callback+0x90/0x260
      2010-11-02 12:42:40 [<ffffffff8003bc4c>] remove_wait_queue+0x1c/0x2c
      2010-11-02 12:42:40 [<ffffffff885eaf54>] :ptlrpc:ldlm_bl_thread_main+0x284/0x410
      2010-11-02 12:42:40 [<ffffffff8008cfbc>] default_wake_function+0x0/0xe
      2010-11-02 12:42:40 [<ffffffff800b7aa5>] audit_syscall_exit+0x336/0x362
      2010-11-02 12:42:40 [<ffffffff8005dfb1>] child_rip+0xa/0x11
      2010-11-02 12:42:40 [<ffffffff885eacd0>] :ptlrpc:ldlm_bl_thread_main+0x0/0x410
      2010-11-02 12:42:40 [<ffffffff8005dfa7>] child_rip+0x0/0x11

      We have no debug data of our own for this failure.

          Activity

            [LU-76] Racer kernel panic in _ldlm_lock_debug

            adilger Andreas Dilger added a comment - Close old bug.
            yujian Jian Yu added a comment -

            Branch: b1_8
            Client Distro/Arch: RHEL6.0/x86_64 (patchless kernel version: 2.6.32-71.18.2.el6.x86_64)
            Server Distro/Arch: CentOS5.5/x86_64 (kernel version: 2.6.18-194.17.1.el5_lustre.20110407083448)
            Network Type: IB (in-kernel OFED)
            Client Nodes: client-8, client-9
            MDS Node: client-16
            OSS Node: fat-amd-4 (6 OSTs)

            While running the racer test on the Toro cluster, one client node (client-8) hit the following kernel panic:

            Lustre: DEBUG MARKER: -----============= acceptance-small: racer ============----- Tue Apr 12 05:03:34 PDT 2011
            Lustre: DEBUG MARKER: excepting tests:
            Lustre: DEBUG MARKER: Using TIMEOUT=20
            Lustre: DEBUG MARKER: == test 1: racer on clients: client-8-ib,client-9-ib DURATION=900 == 05:03:36 (1302609816)
            LustreError: 10180:0:(file.c:3329:ll_inode_revalidate_fini()) failure -2 inode 94042
            BUG: unable to handle kernel paging request at 0000000273713030
            IP: [<ffffffffa09504c6>] _ldlm_lock_debug+0xf6/0x680 [ptlrpc]
            PGD 0 
            Oops: 0000 [#1] SMP 
            last sysfs file: /sys/devices/virtual/block/lloop14/removable
            CPU 3 
            Modules linked in: llite_lloop(U) lustre(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) ko2iblnd(U) ptlrpc(U) obdclass(U) 
            lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 rdma_cm iw_cm ib_addr nfs lockd fscache nfs_acl auth_rpcgss autofs4 sunrpc 
            ib_ipoib ib_cm ib_sa ipv6 serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core 
            mlx4_ib ib_mad ib_core mlx4_en mlx4_core igb dca ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mod [last unloaded: libcfs]
            
            Modules linked in: llite_lloop(U) lustre(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) ko2iblnd(U) ptlrpc(U) obdclass(U) 
            lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 rdma_cm iw_cm ib_addr nfs lockd fscache nfs_acl auth_rpcgss autofs4 sunrpc 
            ib_ipoib ib_cm ib_sa ipv6 serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core 
            mlx4_ib ib_mad ib_core mlx4_en mlx4_core igb dca ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mod [last unloaded: libcfs]
            Pid: 12199, comm: ldlm_bl_00 Not tainted 2.6.32-71.18.2.el6.x86_64 #1 X8DTT
            RIP: 0010:[<ffffffffa09504c6>]  [<ffffffffa09504c6>] _ldlm_lock_debug+0xf6/0x680 [ptlrpc]
            RSP: 0018:ffff8802f7f1bcc0  EFLAGS: 00010246
            RAX: 0000000000000000 RBX: ffff8802f35b5800 RCX: ffffffffa09d2070
            RDX: 0000000010000000 RSI: 0000000000010000 RDI: ffff8802f364e000
            RBP: ffff8802f7f1be10 R08: ffffffffa09caa70 R09: 000000000000058d
            R10: 0000000000010000 R11: ffffffffa09d1e40 R12: 000000005a5a5a5a
            R13: 00000000ffffff9d R14: 0000000000007646 R15: 0000000000000000
            FS:  00007fcf9ca59700(0000) GS:ffff880032e60000(0000) knlGS:0000000000000000
            CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
            CR2: 0000000273713030 CR3: 0000000001001000 CR4: 00000000000006e0
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Process ldlm_bl_00 (pid: 12199, threadinfo ffff8802f7f1a000, task ffff8802f763b520)
            Stack:
             ffffffffa09da57c 0000000000026928 ffff8802f7f1bd10 ffffffff8105c806
            <0> ffff880200000002 ffffffffa09da594 ffff880032e169f0 ffff8802f763b558
            <0> 0000000000000001 ffff8802f763b520 ffff8802f7f1bd40 ffffffff81061c21
            Call Trace:
             [<ffffffff8105c806>] ? update_curr+0xe6/0x1e0
             [<ffffffff81061c21>] ? dequeue_entity+0x1a1/0x1e0
             [<ffffffff81059dc2>] ? finish_task_switch+0x42/0xd0
             [<ffffffff814c8fb6>] ? thread_return+0x4e/0x778
             [<ffffffffa0952fed>] ? ldlm_lock_put+0x19d/0x450 [ptlrpc]
             [<ffffffffa09751dd>] ldlm_handle_bl_callback+0x1ad/0x260 [ptlrpc]
             [<ffffffff810921ac>] ? remove_wait_queue+0x3c/0x50
             [<ffffffffa097df71>] ldlm_bl_thread_main+0x1f1/0x440 [ptlrpc]
             [<ffffffff8111f059>] ? free_pages+0x49/0x50
             [<ffffffff8105c540>] ? default_wake_function+0x0/0x20
             [<ffffffff810141ca>] child_rip+0xa/0x20
             [<ffffffffa097dd80>] ? ldlm_bl_thread_main+0x0/0x440 [ptlrpc]
             [<ffffffff810141c0>] ? child_rip+0x0/0x20
            Code: 44 89 b4 24 90 00 00 00 44 89 ac 24 88 00 00 00 48 8b 97 c8 00 00 00 48 89 94 24 80 00 00 00 48 8b 97 f0 00 00 00
             48 89 54 24 78 <4a> 8b 14 e5 60 5d 9e a0 48 89 54 24 70 48 8b 93 88 00 00 00 48 
            RIP  [<ffffffffa09504c6>] _ldlm_lock_debug+0xf6/0x680 [ptlrpc]
             RSP <ffff8802f7f1bcc0>
            CR2: 0000000273713030
            ---[ end trace ec850569dd6fda5e ]---
            Kernel panic - not syncing: Fatal exception
            

            The console log of client-8 is in the attachment.
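
            A quick cross-check of the oops above, offered as a sketch under assumptions: the faulting instruction bytes `<4a> 8b 14 e5 60 5d 9e a0` appear to decode as a load of the form `mov disp32(,%r12,8),%rdx`, that is, a table lookup of 8-byte entries indexed by R12. R12 holds 0x5a5a5a5a, which looks like a poison fill pattern rather than a valid index. The decoding, the poison interpretation, and the helper names below are a reading of the dump, not something stated in this report. The arithmetic itself uses only values from the log and reproduces CR2.

```python
MASK64 = (1 << 64) - 1


def sign_extend32(value):
    """Sign-extend a 32-bit displacement to a 64-bit unsigned value."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        return value | 0xFFFFFFFF00000000
    return value


def effective_address(index, scale, disp32):
    """Compute index*scale + disp32 with 64-bit wraparound.

    Models an x86-64 SIB operand with no base register (assumed decoding).
    """
    return (index * scale + sign_extend32(disp32)) & MASK64


# Values copied from the oops above.
r12 = 0x5A5A5A5A               # R12 from the register dump
disp32 = 0xA09E5D60            # bytes "60 5d 9e a0", little-endian
cr2 = 0x0000000273713030       # faulting address reported in CR2

addr = effective_address(r12, 8, disp32)
print(hex(addr), addr == cr2)
```

            If this reading is right, _ldlm_lock_debug indexed a lookup table with a field taken from a lock whose memory had already been overwritten, which would fit a lock being freed while the blocking callback thread still held a pointer to it. That remains a hypothesis; the ticket was closed without a confirmed root cause.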


            People

              green Oleg Drokin
              green Oleg Drokin
              Votes: 0
              Watchers: 2
