[LU-76] Racer kernel panic in _ldlm_lock_debug Created: 09/Feb/11  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0, Lustre 2.1.0
Fix Version/s: Lustre 1.8.6

Type: Bug
Priority: Minor
Reporter: Oleg Drokin
Assignee: Oleg Drokin
Resolution: Cannot Reproduce
Votes: 0
Labels: None

Attachments: client-8.console.log
Severity: 3
Bugzilla ID: 24099
Rank (Obsolete): 10109

 Description   

Oracle reports this failure:
2010-11-02 12:42:17 Lustre: DEBUG MARKER: == runracer test 1: racer on clients: sfire31,sfire32 DURATION=120 ================================== 12:39:34 (1288723174)
2010-11-02 12:42:40 general protection fault: 0000 [1] SMP
2010-11-02 12:42:40 last sysfs file: /devices/pci0000:00/0000:00:09.0/irq
2010-11-02 12:42:40 CPU 3
2010-11-02 12:42:40 Modules linked in: llite_lloop(U) lustre(U) mgc(U) lov(U) mdc(U) lmv(U) fid(U)
fld(U) lquota(U) osc(U) obdecho(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U)
autofs4 hidp rfcomm l2cap bluetooth ipv6 xfrm_nalgo crypto_api loop dm_mirror dm_log dm_multipath
scsi_dh dm_mod raid1 video backlight sbs power_meter i2c_ec dell_wmi wmi button battery asus_acpi
acpi_memhotplug ac parport_pc lp parport sd_mod sg sata_nv pcspkr ohci_hcd i2c_nforce2 shpchp
libata i2c_core ehci_hcd scsi_mod forcedeth k8temp k8_edac edac_mc hwmon serio_raw tg3 nfs lockd
fscache nfs_acl sunrpc
2010-11-02 12:42:40 Pid: 27192, comm: ldlm_bl_08 Tainted: G 2.6.18-194.17.1.0.1.el5 #1
2010-11-02 12:42:40 RIP: 0010:[<ffffffff885c4215>] [<ffffffff885c4215>] :ptlrpc:_ldlm_lock_debug+0x545/0x6d0
2010-11-02 12:42:40 RSP: 0000:ffff81010f451c80 EFLAGS: 00010246
2010-11-02 12:42:40 RAX: 5a5a5a5a5a5a5a5a RBX: ffff8101080e9840 RCX: ffffffff88657298
2010-11-02 12:42:40 RDX: 0000000000010000 RSI: 0000000000010000 RDI: 0000000000000000
2010-11-02 12:42:40 RBP: 0000000000006f10 R08: ffffffff8864e220 R09: 00000000000005b0
2010-11-02 12:42:40 R10: ffff8101122b9900 R11: 0000000000000000 R12: 00000000ffffff9d
2010-11-02 12:42:40 R13: ffff81010f451ee0 R14: ffffffff88657428 R15: 0000000000010000
2010-11-02 12:42:40 FS: 00002ac83f14e230(0000) GS:ffff81011fc819c0(0000) knlGS:00000000f7f1b6c0
2010-11-02 12:42:40 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2010-11-02 12:42:40 CR2: 0000003168952990 CR3: 0000000105cd8000 CR4: 00000000000006e0
2010-11-02 12:42:40 Process ldlm_bl_08 (pid: 27192, threadinfo ffff81010f450000, task ffff810111ca7100)
2010-11-02 12:42:40 Stack: ffffffff00000000 ffffffff88661d78 ffffffff88661d87 0000000200000407
2010-11-02 12:42:40 ffff8101122b9900 45bcf320a97fb383 0000006000000005 ffffffff00000001
2010-11-02 12:42:40 0000000000000000 ffffffff88661d78 ffffffff88661d81 5a5a5a5a5a5a5a5a
2010-11-02 12:42:40 Call Trace:
2010-11-02 12:42:40 [<ffffffff80062ff8>] thread_return+0x62/0xfe
2010-11-02 12:42:40 [<ffffffff885e46b0>] :ptlrpc:ldlm_handle_bl_callback+0x90/0x260
2010-11-02 12:42:40 [<ffffffff8003bc4c>] remove_wait_queue+0x1c/0x2c
2010-11-02 12:42:40 [<ffffffff885eaf54>] :ptlrpc:ldlm_bl_thread_main+0x284/0x410
2010-11-02 12:42:40 [<ffffffff8008cfbc>] default_wake_function+0x0/0xe
2010-11-02 12:42:40 [<ffffffff800b7aa5>] audit_syscall_exit+0x336/0x362
2010-11-02 12:42:40 [<ffffffff8005dfb1>] child_rip+0xa/0x11
2010-11-02 12:42:40 [<ffffffff885eacd0>] :ptlrpc:ldlm_bl_thread_main+0x0/0x410
2010-11-02 12:42:40 [<ffffffff8005dfa7>] child_rip+0x0/0x11

We have no debug data of our own for this failure.
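
In the register dump above, RAX holds 5a5a5a5a5a5a5a5a, and the same value appears as the last stack word. A minimal sketch for flagging such values in an oops follows. It assumes that a repeated 0x5a byte is a fill pattern written into freed memory, which would point to a use-after-free of the lock; the log alone does not confirm this. The helper name poisoned_registers and its parameters are hypothetical.

```python
import re

# Register lines copied from the general protection fault above
# (timestamps dropped for brevity).
OOPS = """\
RAX: 5a5a5a5a5a5a5a5a RBX: ffff8101080e9840 RCX: ffffffff88657298
RDX: 0000000000010000 RSI: 0000000000010000 RDI: 0000000000000000
RBP: 0000000000006f10 R08: ffffffff8864e220 R09: 00000000000005b0
R10: ffff8101122b9900 R11: 0000000000000000 R12: 00000000ffffff9d
R13: ffff81010f451ee0 R14: ffffffff88657428 R15: 0000000000010000
"""

# Matches "RAX: <16 hex digits>", "R08: <16 hex digits>", and so on.
REG_RE = re.compile(r"\b(R[A-Z0-9]{2})\s*:\s*([0-9a-f]{16})\b")


def poisoned_registers(text, fill=0x5A, min_bytes=4):
    """Return {register: value} where the low `min_bytes` bytes all equal `fill`.

    ASSUMPTION: `fill` is a freed-memory poison byte; adjust it if the
    allocator in use writes a different pattern.
    """
    hits = {}
    for name, value in REG_RE.findall(text):
        raw = bytes.fromhex(value)
        if all(b == fill for b in raw[-min_bytes:]):
            hits[name] = value
    return hits


if __name__ == "__main__":
    for name, value in poisoned_registers(OOPS).items():
        print(f"{name} = {value} looks like a poison fill")
```

With min_bytes=4 the same check also catches 32-bit values such as 000000005a5a5a5a, at the cost of a slightly higher chance of a coincidental match.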



 Comments   
Comment by Jian Yu [ 12/Apr/11 ]

Branch: b1_8
Client Distro/Arch: RHEL6.0/x86_64 (patchless kernel version: 2.6.32-71.18.2.el6.x86_64)
Server Distro/Arch: CentOS5.5/x86_64 (kernel version: 2.6.18-194.17.1.el5_lustre.20110407083448)
Network Type: IB (in-kernel OFED)
Client Nodes: client-8, client-9
MDS Node: client-16
OSS Node: fat-amd-4 (6 OSTs)

While running the racer test on the Toro cluster, one client node (client-8) hit a kernel panic as follows:

Lustre: DEBUG MARKER: -----============= acceptance-small: racer ============----- Tue Apr 12 05:03:34 PDT 2011
Lustre: DEBUG MARKER: excepting tests:
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: == test 1: racer on clients: client-8-ib,client-9-ib DURATION=900 == 05:03:36 (1302609816)
LustreError: 10180:0:(file.c:3329:ll_inode_revalidate_fini()) failure -2 inode 94042
BUG: unable to handle kernel paging request at 0000000273713030
IP: [<ffffffffa09504c6>] _ldlm_lock_debug+0xf6/0x680 [ptlrpc]
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/block/lloop14/removable
CPU 3 
Modules linked in: llite_lloop(U) lustre(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) ko2iblnd(U) ptlrpc(U) obdclass(U) 
lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 rdma_cm iw_cm ib_addr nfs lockd fscache nfs_acl auth_rpcgss autofs4 sunrpc 
ib_ipoib ib_cm ib_sa ipv6 serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core 
mlx4_ib ib_mad ib_core mlx4_en mlx4_core igb dca ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mod [last unloaded: libcfs]

Modules linked in: llite_lloop(U) lustre(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) ko2iblnd(U) ptlrpc(U) obdclass(U) 
lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 rdma_cm iw_cm ib_addr nfs lockd fscache nfs_acl auth_rpcgss autofs4 sunrpc 
ib_ipoib ib_cm ib_sa ipv6 serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core 
mlx4_ib ib_mad ib_core mlx4_en mlx4_core igb dca ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mod [last unloaded: libcfs]
Pid: 12199, comm: ldlm_bl_00 Not tainted 2.6.32-71.18.2.el6.x86_64 #1 X8DTT
RIP: 0010:[<ffffffffa09504c6>]  [<ffffffffa09504c6>] _ldlm_lock_debug+0xf6/0x680 [ptlrpc]
RSP: 0018:ffff8802f7f1bcc0  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8802f35b5800 RCX: ffffffffa09d2070
RDX: 0000000010000000 RSI: 0000000000010000 RDI: ffff8802f364e000
RBP: ffff8802f7f1be10 R08: ffffffffa09caa70 R09: 000000000000058d
R10: 0000000000010000 R11: ffffffffa09d1e40 R12: 000000005a5a5a5a
R13: 00000000ffffff9d R14: 0000000000007646 R15: 0000000000000000
FS:  00007fcf9ca59700(0000) GS:ffff880032e60000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000273713030 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ldlm_bl_00 (pid: 12199, threadinfo ffff8802f7f1a000, task ffff8802f763b520)
Stack:
 ffffffffa09da57c 0000000000026928 ffff8802f7f1bd10 ffffffff8105c806
<0> ffff880200000002 ffffffffa09da594 ffff880032e169f0 ffff8802f763b558
<0> 0000000000000001 ffff8802f763b520 ffff8802f7f1bd40 ffffffff81061c21
Call Trace:
 [<ffffffff8105c806>] ? update_curr+0xe6/0x1e0
 [<ffffffff81061c21>] ? dequeue_entity+0x1a1/0x1e0
 [<ffffffff81059dc2>] ? finish_task_switch+0x42/0xd0
 [<ffffffff814c8fb6>] ? thread_return+0x4e/0x778
 [<ffffffffa0952fed>] ? ldlm_lock_put+0x19d/0x450 [ptlrpc]
 [<ffffffffa09751dd>] ldlm_handle_bl_callback+0x1ad/0x260 [ptlrpc]
 [<ffffffff810921ac>] ? remove_wait_queue+0x3c/0x50
 [<ffffffffa097df71>] ldlm_bl_thread_main+0x1f1/0x440 [ptlrpc]
 [<ffffffff8111f059>] ? free_pages+0x49/0x50
 [<ffffffff8105c540>] ? default_wake_function+0x0/0x20
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffffa097dd80>] ? ldlm_bl_thread_main+0x0/0x440 [ptlrpc]
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
Code: 44 89 b4 24 90 00 00 00 44 89 ac 24 88 00 00 00 48 8b 97 c8 00 00 00 48 89 94 24 80 00 00 00 48 8b 97 f0 00 00 00
 48 89 54 24 78 <4a> 8b 14 e5 60 5d 9e a0 48 89 54 24 70 48 8b 93 88 00 00 00 48 
RIP  [<ffffffffa09504c6>] _ldlm_lock_debug+0xf6/0x680 [ptlrpc]
 RSP <ffff8802f7f1bcc0>
CR2: 0000000273713030
---[ end trace ec850569dd6fda5e ]---
Kernel panic - not syncing: Fatal exception

The console log of client-8 is in the attachment.
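
The faulting address can be cross-checked against the Code: line of the oops above. The sketch below decodes the eight bytes starting at the marked byte <4a> as a 64-bit load of the form disp32(,%index,scale) and recomputes the effective address from R12. The result equals CR2, which is consistent with a table lookup indexed by a 0x5a5a5a5a value read from the lock. Which lock field supplied that index is not established by the log, and the decoder handles only this one instruction form; the function name is hypothetical.

```python
# Values copied from the oops above.
R12 = 0x000000005A5A5A5A
CR2 = 0x0000000273713030
# The eight code bytes starting at the faulting instruction (marked <4a>).
INSN = bytes.fromhex("4a8b14e5605d9ea0")


def decode_mov_sib_disp32(insn):
    """Decode only: REX.W MOV r64, [index*scale + disp32] (no base register)."""
    rex, opcode, modrm, sib = insn[0], insn[1], insn[2], insn[3]
    assert rex & 0xF0 == 0x40 and rex & 0x08, "expected a REX.W prefix"
    assert opcode == 0x8B, "expected MOV r64, r/m64"
    assert modrm >> 6 == 0 and modrm & 7 == 4, "expected mod=00 with a SIB byte"
    assert sib & 7 == 5, "expected disp32 with no base register"
    scale = 1 << (sib >> 6)
    index = ((sib >> 3) & 7) | ((rex & 0x02) << 2)   # REX.X extends the index
    dest = ((modrm >> 3) & 7) | ((rex & 0x04) << 1)  # REX.R extends the destination
    disp = int.from_bytes(insn[4:8], "little", signed=True)
    return dest, index, scale, disp


dest, index, scale, disp = decode_mov_sib_disp32(INSN)
# Register numbers: 2 is RDX, 12 is R12.
effective = (R12 * scale + disp) & 0xFFFFFFFFFFFFFFFF

if __name__ == "__main__":
    print(f"dest=r{dest} index=r{index} scale={scale} disp={disp & 0xFFFFFFFFFFFFFFFF:#x}")
    print(f"effective={effective:#018x} matches CR2: {effective == CR2}")
```

The displacement sign-extends to 0xffffffffa09e5d60, which lies in the same address range as the ptlrpc symbols in the trace, so the table being indexed is plausibly module data rather than a heap object.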

Comment by Andreas Dilger [ 29/May/17 ]

Close old bug.
