[LU-3638] GPF crash in osp_key_exit Created: 25/Jul/13  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9371

 Description   

Just hit this running recent master:

<4>[113366.463322] Lustre: DEBUG MARKER: == replay-single test 0c: check replay-barrier == 10:40:38 (1374676838)
<3>[113367.223979] LustreError: 22867:0:(osd_handler.c:1191:osd_ro()) *** setting lustre-MDT0000 read-only ***
<4>[113367.225361] Turning device loop0 (0x700000) read-only
<4>[113367.303612] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
<4>[113367.345628] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
<4>[113367.526705] Lustre: Unmounted lustre-client
<4>[113367.907894] Lustre: Failing over lustre-MDT0000
<1>[113368.092078] BUG: unable to handle kernel paging request at ffff8800b6966c68
<1>[113368.092813] IP: [<ffffffffa0936019>] osp_key_exit+0x9/0x20 [osp]
<4>[113368.093486] PGD 1a26063 PUD 501067 PMD 6b6067 PTE 80000000b6966060
<4>[113368.094166] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
<4>[113368.094774] last sysfs file: /sys/devices/system/cpu/possible
<4>[113368.095424] CPU 2 
<4>[113368.095515] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon i2c_piix4 i2c_core virtio_console virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
<3>[113368.096298] LustreError: 11-0: lustre-MDT0000-lwp-OST0001: Communicating with 0@lo, operation obd_ping failed with -107.
<4>[113368.096304] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
<3>[113368.116346] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target)
<4>[113368.096034] 
<4>[113368.096034] Pid: 22149, comm: mdt00_002 Not tainted 2.6.32-rhe6.4-debug #2 Red Hat KVM
<4>[113368.096034] RIP: 0010:[<ffffffffa0936019>]  [<ffffffffa0936019>] osp_key_exit+0x9/0x20 [osp]
<4>[113368.096034] RSP: 0018:ffff8800973dfe10  EFLAGS: 00010282
<4>[113368.096034] RAX: ffffffffa0936010 RBX: 00000000000000c8 RCX: 0000000000000000
<4>[113368.096034] RDX: ffff8800b6966bf0 RSI: ffffffffa095ac00 RDI: ffff8800b7121610
<4>[113368.096034] RBP: ffff8800973dfe10 R08: 0000000000000001 R09: 0000000000000000
<4>[113368.096034] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800b7121610
<4>[113368.096034] R13: ffff88008a49af30 R14: ffff88006bbafef0 R15: ffff8800b52aac20
<4>[113368.096034] FS:  0000000000000000(0000) GS:ffff880006280000(0000) knlGS:0000000000000000
<4>[113368.096034] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[113368.096034] CR2: ffff8800b6966c68 CR3: 0000000001a25000 CR4: 00000000000006e0
<4>[113368.096034] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[113368.096034] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[113368.096034] Process mdt00_002 (pid: 22149, threadinfo ffff8800973de000, task ffff8800973dc380)
<4>[113368.096034] Stack:
<4>[113368.096034]  ffff8800973dfe30 ffffffffa0ff4488 ffff8800b69677f0 ffff8800b52aabf0
<4>[113368.096034] <d> ffff8800973dfed0 ffffffffa1190899 ffff8800973dfe50 ffff880000000000
<4>[113368.096034] <d> ffff8800b52aac80 00000000973dffd8 ffff88006bbafef0 00000000973dc938
<4>[113368.096034] Call Trace:
<4>[113368.096034]  [<ffffffffa0ff4488>] lu_context_exit+0x58/0xa0 [obdclass]
<4>[113368.096034]  [<ffffffffa1190899>] ptlrpc_main+0x9d9/0x1650 [ptlrpc]
<4>[113368.096034]  [<ffffffffa118fec0>] ? ptlrpc_main+0x0/0x1650 [ptlrpc]
<4>[113368.096034]  [<ffffffff81094606>] kthread+0x96/0xa0
<4>[113368.096034]  [<ffffffff8100c10a>] child_rip+0xa/0x20
<4>[113368.096034]  [<ffffffff81094570>] ? kthread+0x0/0xa0
<4>[113368.096034]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
<4>[113368.096034] Code: <48> c7 42 78 00 00 00 00 c9 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 

Crashdump and modules are in /exports/crashdumps/192.168.10.221-2013-07-24-10\:40\:42
source branch in my tree: master-20130723
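For reference, a minimal userland sketch of what the oops appears to show; this is an assumed, illustrative layout, not Lustre source. The `Code:` bytes decode to `movq $0, 0x78(%rdx)`, i.e. osp_key_exit() clears a single pointer-sized field at offset 0x78 of the per-thread value that lu_context_exit() hands to the key's exit method. CR2 equals RDX + 0x78 and the page is unmapped (DEBUG_PAGEALLOC), which is consistent with the per-thread info having already been freed (e.g. during the MDT failover logged just before the crash) while the ptlrpc service thread still held a stale pointer to it:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the crash site; names (thread_val, cached,
 * key_exit) are illustrative and are NOT the Lustre identifiers.
 */
struct thread_val {
	char  pad[0x78];	/* fields before the one being cleared */
	void *cached;		/* pointer-sized field at offset 0x78 */
};

/* Stand-in for the key's exit method: the single NULL store that
 * faults when 'v' points into an already-freed, unmapped page. */
static void key_exit(struct thread_val *v)
{
	v->cached = NULL;
}
```

With DEBUG_PAGEALLOC enabled, freed pages are unmapped immediately, so a use-after-free like this faults deterministically at exactly freed_value + 0x78 instead of silently scribbling on reused memory.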



 Comments   
Comment by Oleg Drokin [ 25/Jul/13 ]

Hm, apparently I hit it twice; here's another report:

<4>[113381.495165] Lustre: DEBUG MARKER: == replay-single test 13: open chmod 0 |x| write close == 10:40:54 (1374676854)
<4>[113382.445236] Turning device loop0 (0x700000) read-only
<4>[113382.590732] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
<4>[113382.610038] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
<1>[113383.065720] BUG: unable to handle kernel paging request at ffff880094800c68
<1>[113383.068910] IP: [<ffffffffa0944019>] osp_key_exit+0x9/0x20 [osp]
<4>[113383.069122] PGD 1a26063 PUD 501067 PMD 5a6067 PTE 8000000094800060
<4>[113383.069122] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
<4>[113383.069122] last sysfs file: /sys/devices/system/cpu/possible
<4>[113383.069122] CPU 3 
<4>[113383.069122] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
<4>[113383.069122] 
<4>[113383.069122] Pid: 30405, comm: mdt00_001 Not tainted 2.6.32-rhe6.4-debug #2 Red Hat KVM
<4>[113383.069122] RIP: 0010:[<ffffffffa0944019>]  [<ffffffffa0944019>] osp_key_exit+0x9/0x20 [osp]
<4>[113383.069122] RSP: 0018:ffff8800954c7e10  EFLAGS: 00010282
<4>[113383.069122] RAX: ffffffffa0944010 RBX: 00000000000000c8 RCX: 0000000000000000
<4>[113383.069122] RDX: ffff880094800bf0 RSI: ffffffffa0968c00 RDI: ffff8800b3328df8
<4>[113383.069122] RBP: ffff8800954c7e10 R08: ffff88009a39c7f8 R09: 000000000000007c
<4>[113383.069122] R10: 0000000000000001 R11: 20736e6172742029 R12: ffff8800b3328df8
<4>[113383.069122] R13: ffff8800b3d9df30 R14: ffff8800846fcef0 R15: ffff880077d93c20
<4>[113383.069122] FS:  0000000000000000(0000) GS:ffff8800062c0000(0000) knlGS:0000000000000000
<4>[113383.069122] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[113383.069122] CR2: ffff880094800c68 CR3: 0000000001a25000 CR4: 00000000000006e0
<4>[113383.069122] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[113383.069122] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[113383.069122] Process mdt00_001 (pid: 30405, threadinfo ffff8800954c6000, task ffff8800b03882c0)
<4>[113383.069122] Stack:
<4>[113383.069122]  ffff8800954c7e30 ffffffffa1000488 0000000000000001 ffff880077d93bf0
<4>[113383.069122] <d> ffff8800954c7ed0 ffffffffa119c8eb 00000000000f4240 ffff880000000000
<4>[113383.069122] <d> ffff880077d93c80 00000000954c7fd8 ffff8800846fcef0 00000000b0388878
<4>[113383.069122] Call Trace:
<4>[113383.069122]  [<ffffffffa1000488>] lu_context_exit+0x58/0xa0 [obdclass]
<4>[113383.069122]  [<ffffffffa119c8eb>] ptlrpc_main+0xa2b/0x1650 [ptlrpc]
<4>[113383.069122]  [<ffffffffa119bec0>] ? ptlrpc_main+0x0/0x1650 [ptlrpc]
<4>[113383.069122]  [<ffffffff81094606>] kthread+0x96/0xa0
<4>[113383.069122]  [<ffffffff8100c10a>] child_rip+0xa/0x20
<4>[113383.069122]  [<ffffffff81094570>] ? kthread+0x0/0xa0
<4>[113383.069122]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
<4>[113383.069122] Code: <48> c7 42 78 00 00 00 00 c9 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 

Crashdump is in /exports/crashdumps/192.168.10.222-2013-07-24-10\:40\:58
Same modules and source base as in the main report.

Comment by Oleg Drokin [ 01/Apr/14 ]

This is still happening; I just got two more crashes there in current master.

Comment by Andreas Dilger [ 09/Jan/20 ]

Close old bug
