[LU-6902] recovery-small test 51 target_recovery_thread crash Created: 25/Jul/15  Updated: 15/Sep/15  Resolved: 15/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running a recent master I hit a recovery-small test 51 crash that I don't think is related to any of the patches I have applied.

<1>[120791.287489] BUG: unable to handle kernel paging request at ffff880089130fe0
<1>[120791.287690] IP: [<ffffffffa12ec9cd>] target_recovery_thread+0x5cd/0x1f00 [ptlrpc]
<4>[120791.288059] PGD 1a26063 PUD 501067 PMD 54a067 PTE 8000000089130060
<4>[120791.288314] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
<4>[120791.288495] last sysfs file: /sys/devices/system/cpu/possible
<4>[120791.288675] CPU 0 
<4>[120791.288702] Modules linked in: lustre ofd osp lod ost mdt mdd mgs osd_ldiskfs ldiskfs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet libcfs zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate exportfs jbd sha512_generic sha256_generic ext4 jbd2 mbcache virtio_balloon virtio_console i2c_piix4 i2c_core virtio_net virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
<4>[120791.290740] 
<4>[120791.290740] Pid: 23779, comm: tgt_recov Tainted: P           ---------------    2.6.32-rhe6.6-debug #1 Red Hat KVM
<4>[120791.290740] RIP: 0010:[<ffffffffa12ec9cd>]  [<ffffffffa12ec9cd>] target_recovery_thread+0x5cd/0x1f00 [ptlrpc]
<4>[120791.290740] RSP: 0018:ffff88003c241e10  EFLAGS: 00010202
<4>[120791.290740] RAX: 0000000000000001 RBX: ffff8800443f30b0 RCX: 0000000000000000
<4>[120791.290740] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88008caf2bf0
<4>[120791.290740] RBP: ffff88003c241ed0 R08: 00000000fffffffb R09: 00000000fffffffe
<4>[120791.290740] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88001d32a300
<4>[120791.290740] R13: ffff880089130f30 R14: ffff88001d32a300 R15: ffff880039446f30
<4>[120791.290740] FS:  0000000000000000(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
<4>[120791.290740] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[120791.290740] CR2: ffff880089130fe0 CR3: 0000000001a25000 CR4: 00000000000006f0
<4>[120791.290740] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[120791.290740] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[120791.290740] Process tgt_recov (pid: 23779, threadinfo ffff88003c240000, task ffff880029800180)
<4>[120791.290740] Stack:
<4>[120791.290740]  0000000000000000 ffff88002069fa30 ffff88002069fa30 0000000000000001
<4>[120791.290740] <d> 0000000000000000 0000000101cba328 ffff88001d32a478 0000000081061642
<4>[120791.290740] <d> ffff880029800738 ffff880029800180 000000000000ebe8 ffff880029800738
<4>[120791.290740] Call Trace:
<4>[120791.290740]  [<ffffffffa12ec400>] ? target_recovery_thread+0x0/0x1f00 [ptlrpc]
<4>[120791.290740]  [<ffffffff8109ce4e>] kthread+0x9e/0xc0
<4>[120791.290740]  [<ffffffff8100c24a>] child_rip+0xa/0x20
<4>[120791.290740]  [<ffffffff8109cdb0>] ? kthread+0x0/0xc0
<4>[120791.290740]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
<4>[120791.290740] Code: 00 48 8b 7d b8 e8 44 5b 23 e0 48 8d 75 cc 48 89 df e8 a8 9e ff ff 48 89 c1 8b 45 cc 83 f8 01 0f 94 c2 4d 85 ed 74 11 84 d2 74 0d <49> 3b 8d b0 00 00 00 0f 84 56 06 00 00 48 85 c9 0f 85 3d 02 00 
<1>[120791.290740] RIP  [<ffffffffa12ec9cd>] target_recovery_thread+0x5cd/0x1f00 [ptlrpc]

Crashdump is in /exports/crashdumps/192.168.10.223-2015-07-24-20\:40\:25
The tag in my tree is master-20150723.
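
For reference, roughly how I would reproduce and inspect this (a sketch only: it assumes the usual ONLY= test-framework convention and the crash utility, and the vmcore file name under the crashdump directory is an assumption, not taken from the dump above):

    # re-run only test 51 of recovery-small against an already configured test setup
    ONLY=51 sh lustre/tests/recovery-small.sh

    # open the saved dump with a matching debug vmlinux (vmlinux path illustrative,
    # vmcore file name assumed to sit under the crashdump directory)
    crash path/to/debug/vmlinux \
          /exports/crashdumps/192.168.10.223-2015-07-24-20\:40\:25/vmcore

    crash> bt                                    # back trace of the panicking tgt_recov task (pid 23779)
    crash> dis -l target_recovery_thread+0x5cd   # disassemble around the faulting RIP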



 Comments   
Comment by Oleg Drokin [ 25/Jul/15 ]

It appears I also hit this on the same day on a different node: /exports/crashdumps/192.168.10.217-2015-07-24-05\:04\:56/

And checking through my logs, the first occurrence of this was seen on July 15th, though the list of patches in use at that time is a bit hard to guess now.

Comment by Oleg Drokin [ 17/Aug/15 ]

Just hit this again on current master

Comment by Peter Jones [ 15/Sep/15 ]

Oleg has not seen this for some weeks
