[LU-3387] Oops in osp_statfs_timer_cb Created: 23/May/13  Updated: 23/Dec/16

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 8389

 Description   

Running replay-single in a loop, I hit this oops today:

[352448.539528] Lustre: DEBUG MARKER: == replay-single test 3b: replay failed open -ENOMEM == 15:27:17 (1369337237)
[352449.238483] LustreError: 539:0:(osd_handler.c:1134:osd_ro()) *** setting lustre-MDT0000 read-only ***
[352449.238948] Turning device loop0 (0x700000) read-only
[352449.349409] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[352449.367296] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
[352449.530689] Lustre: *** cfs_fail_loc=114, val=0***
[352450.067118] LustreError: 27045:0:(client.c:1048:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880081bc77f0 x1435853793530172/t0(0) o13->lustre-OST0001-osc-MDT0000@0@lo:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[352450.031173] BUG: unable to handle kernel paging request at ffff8800893ef9d0
[352450.031173] IP: [<ffffffff812872cf>] _raw_spin_lock+0xdf/0x180
[352450.031173] PGD 1a26063 PUD 501067 PMD 54b067 PTE 80000000893ef060
[352450.031173] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[352450.031173] last sysfs file: /sys/devices/system/cpu/possible
[352450.031173] CPU 7 
[352450.031173] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_net virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[352450.031173] 
[352450.031173] Pid: 0, comm: swapper Not tainted 2.6.32-rhe6.4-debug #2 Bochs Bochs
[352450.031173] RIP: 0010:[<ffffffff812872cf>]  [<ffffffff812872cf>] _raw_spin_lock+0xdf/0x180
[352450.031173] RSP: 0018:ffff8800063c3d80  EFLAGS: 00010016
[352450.031173] RAX: 000000006ec735fa RBX: ffff8800893ef9d0 RCX: 000000006ec735fa
[352450.031173] RDX: 0000000000000361 RSI: 0000000000000003 RDI: 0000000000000001
[352450.031173] RBP: ffff8800063c3dc0 R08: ffff880096082608 R09: 0000000000000000
[352450.031173] R10: 0000000000000800 R11: ffffffff81d5fb40 R12: 00000000ca3244a0
[352450.031173] R13: ffff8800bac08440 R14: ffff8800bac08ae0 R15: 000000000008295c
[352450.031173] FS:  0000000000000000(0000) GS:ffff8800063c0000(0000) knlGS:0000000000000000
[352450.031173] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[352450.031173] CR2: ffff8800893ef9d0 CR3: 0000000001a25000 CR4: 00000000000006e0
[352450.031173] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[352450.031173] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[352450.031173] Process swapper (pid: 0, threadinfo ffff8800bac0a000, task ffff8800bac08440)
[352450.031173] Stack:
[352450.031173]  ffff8800063d6880 0000000100000286 ffff8800063c3db0 0000000000000282
[352450.031173] <d> ffff8800893efc48 0000000000000003 0000000000000001 0000000000000000
[352450.031173] <d> ffff8800063c3de0 ffffffff814fe204 0000000700000001 ffff8800893ef9d0
[352450.031173] Call Trace:
[352450.031173]  <IRQ> 
[352450.031173]  [<ffffffff814fe204>] _spin_lock_irqsave+0x24/0x30
[352450.031173]  [<ffffffff810545f2>] __wake_up+0x32/0x70
[352450.031173]  [<ffffffffa09095f0>] ? osp_statfs_timer_cb+0x0/0x50 [osp]
[352450.031173]  [<ffffffffa0df078a>] cfs_waitq_signal+0x1a/0x20 [libcfs]
[352450.031173]  [<ffffffffa090960a>] osp_statfs_timer_cb+0x1a/0x50 [osp]
[352450.031173]  [<ffffffff8107f3f2>] run_timer_softirq+0x192/0x330
[352450.031173]  [<ffffffff8102e2fd>] ? lapic_next_event+0x1d/0x30
[352450.031173]  [<ffffffff810749d1>] __do_softirq+0xc1/0x1e0
[352450.031173]  [<ffffffff8109940a>] ? hrtimer_interrupt+0x14a/0x270
[352450.031173]  [<ffffffff8100c20c>] call_softirq+0x1c/0x30
[352450.031173]  [<ffffffff8100de45>] do_softirq+0x65/0xa0
[352450.031173]  [<ffffffff810747b5>] irq_exit+0x85/0x90
[352450.031173]  [<ffffffff81504ed0>] smp_apic_timer_interrupt+0x70/0x9b
[352450.031173]  [<ffffffff8100bbd3>] apic_timer_interrupt+0x13/0x20
[352450.031173]  <EOI> 
[352450.031173]  [<ffffffff8103a86b>] ? native_safe_halt+0xb/0x10
[352450.031173]  [<ffffffff81014a0d>] default_idle+0x4d/0xb0
[352450.031173]  [<ffffffff81009a26>] cpu_idle+0xb6/0x110
[352450.031173]  [<ffffffff814f4289>] start_secondary+0x2ac/0x2ef
[352450.031173] Code: 85 7a 00 fa 00 00 00 c7 45 cc 01 00 00 00 4d 8d b5 a0 06 00 00 eb 0c 0f 1f 44 00 00 8b 45 cc 85 c0 75 63 45 31 ff 4d 39 fc 76 f1 <8b> 03 89 c2 c1 c0 10 39 c2 8d 90 00 00 01 00 75 04 f0 0f b1 13 

base code is 2.4.0-RC1, my code branch master-20130523

Crashdump and modules are in /exports/crashdumps/192.168.10.221-2013-05-23-15\:27\:24



 Comments   
Comment by Oleg Drokin [ 23/May/13 ]

A somewhat similar crash that happened in another VM a few hours later:

[365601.849681] BUG: unable to handle kernel paging request at ffff88008494caa8
[365601.849681] IP: [<ffffffff812872cf>] _raw_spin_lock+0xdf/0x180
[365601.849681] PGD 1a26063 PUD 501067 PMD 526067 PTE 800000008494c060
[365601.849681] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[365601.849681] last sysfs file: /sys/devices/system/cpu/possible
[365601.849681] CPU 1 
[365601.849681] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[365601.849681] 
[365601.849681] Pid: 20495, comm: osp-syn-0 Not tainted 2.6.32-rhe6.4-debug #2 Bochs Bochs
[365601.849681] RIP: 0010:[<ffffffff812872cf>]  [<ffffffff812872cf>] _raw_spin_lock+0xdf/0x180
[365601.866734] RSP: 0018:ffff8800861e1d60  EFLAGS: 00010006
[365601.866734] RAX: ffffffffac3ad717 RBX: ffff88008494caa8 RCX: 00000000ac3ad717
[365601.866734] RDX: 00000000012a7ae5 RSI: 0000000000000003 RDI: 0000000000000001
[365601.866734] RBP: ffff8800861e1da0 R08: 0000000000000000 R09: ffff880000000000
[365601.866734] R10: 2222222222222222 R11: 0000000000000000 R12: 00000000ca3244a0
[365601.866734] R13: ffff88006bcf20c0 R14: ffff88006bcf2760 R15: 000000000002fb00
[365601.866734] FS:  00007fe0073d7700(0000) GS:ffff880006240000(0000) knlGS:0000000000000000
[365601.866734] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[365601.866734] CR2: ffff88008494caa8 CR3: 0000000001a25000 CR4: 00000000000006e0
[365601.866734] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[365601.866734] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[365601.866734] Process osp-syn-0 (pid: 20495, threadinfo ffff8800861e0000, task ffff88006bcf20c0)
[365601.866734] Stack:
[365601.866734]  ffff8800898b55d8 00000001861e1e80 0000000000000002 0000000000000282
[365601.866734] <d> ffff8800861e1e80 0000000000000003 0000000000000001 0000000000000000
[365601.866734] <d> ffff8800861e1dc0 ffffffff814fe204 ffff8800861e1dc0 ffff88008494caa8
[365601.866734] Call Trace:
[365601.866734]  [<ffffffff814fe204>] _spin_lock_irqsave+0x24/0x30
[365601.866734]  [<ffffffff810545f2>] __wake_up+0x32/0x70
[365601.866734]  [<ffffffffa0f5278a>] cfs_waitq_signal+0x1a/0x20 [libcfs]
[365601.866734]  [<ffffffffa0850ca4>] osp_sync_thread+0x414/0x800 [osp]
[365601.866734]  [<ffffffff814fe09e>] ? _spin_unlock_irq+0xe/0x20
[365601.866734]  [<ffffffffa0850890>] ? osp_sync_thread+0x0/0x800 [osp]
[365601.866734]  [<ffffffff8100c10a>] child_rip+0xa/0x20
[365601.866734]  [<ffffffffa0850890>] ? osp_sync_thread+0x0/0x800 [osp]
[365601.866734]  [<ffffffffa0850890>] ? osp_sync_thread+0x0/0x800 [osp]
[365601.866734]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
[365601.866734] Code: 85 7a 00 fa 00 00 00 c7 45 cc 01 00 00 00 4d 8d b5 a0 06 00 00 eb 0c 0f 1f 44 00 00 8b 45 cc 85 c0 75 63 45 31 ff 4d 39 fc 76 f1 <8b> 03 89 c2 c1 c0 10 39 c2 8d 90 00 00 01 00 75 04 f0 0f b1 13 
[365601.866734] RIP  [<ffffffff812872cf>] _raw_spin_lock+0xdf/0x180
[365601.866734]  RSP <ffff8800861e1d60>
[365601.866734] CR2: ffff88008494caa8

Crashdump is in /exports/crashdumps/192.168.10.222-2013-05-23-19\:08\:53

Comment by Oleg Drokin [ 18/Jun/13 ]

another hit on current master:

<4>[110783.415924] Lustre: DEBUG MARKER: == replay-single test 19: |X| mcreate, open, write, rename == 08:18:37 (1371298717)
<4>[110783.917909] Turning device loop0 (0x700000) read-only
<4>[110784.028980] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
<4>[110784.056261] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
<1>[110784.612016] BUG: unable to handle kernel paging request at ffff880081e1c9d0
<1>[110784.612016] IP: [<ffffffff812872cf>] _raw_spin_lock+0xdf/0x180
<4>[110784.612016] PGD 1a26063 PUD 501067 PMD 511067 PTE 8000000081e1c060
<4>[110784.612016] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
<4>[110784.612016] last sysfs file: /sys/devices/system/cpu/possible
<4>[110784.612016] CPU 4 
<4>[110784.612016] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
<4>[110784.638459] 
<4>[110784.638459] Pid: 0, comm: swapper Not tainted 2.6.32-rhe6.4-debug #2 Red Hat KVM
<4>[110784.638459] RIP: 0010:[<ffffffff812872cf>]  [<ffffffff812872cf>] _raw_spin_lock+0xdf/0x180
<4>[110784.638459] RSP: 0018:ffff880006303d80  EFLAGS: 00010012
<4>[110784.638459] RAX: ffffffffc7256127 RBX: ffff880081e1c9d0 RCX: 00000000c7256127
<4>[110784.638459] RDX: 0000000000000027 RSI: 0000000000000003 RDI: 0000000000000001
<4>[110784.638459] RBP: ffff880006303dc0 R08: ffff88000630e0e0 R09: 000064c20d34d800
<4>[110784.638459] R10: 0000000000000001 R11: 0000000000000003 R12: 00000000caa837f0
<4>[110784.638459] R13: ffff8800bb2e8080 R14: ffff8800bb2e8720 R15: 00000000000657e3
<4>[110784.638459] FS:  0000000000000000(0000) GS:ffff880006300000(0000) knlGS:0000000000000000
<4>[110784.638459] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[110784.638459] CR2: ffff880081e1c9d0 CR3: 0000000001a25000 CR4: 00000000000006e0
<4>[110784.638459] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[110784.638459] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[110784.638459] Process swapper (pid: 0, threadinfo ffff8800bb2ea000, task ffff8800bb2e8080)
<4>[110784.638459] Stack:
<4>[110784.638459]  ffff8800bb2e8080 0000000106303d98 ffffffff8103a80c 0000000000000282
<4>[110784.638459] <d> ffff880081e1cc48 0000000000000003 0000000000000001 0000000000000000
<3>[110784.672878] LustreError: 137-5: lustre-MDT0000_UUID: not available for con<4>[110784.638459]  [<ffffffffa092b580>] ? osp_statfs_timer_cb+0x0/0x50 [osp]
<4>[110784.638459]  [<ffffffffa0e3d78a>] cfs_waitq_signal+0x1a/0x20 [libcfs]
<4>[110784.638459]  [<ffffffffa092b59a>] osp_statfs_timer_cb+0x1a/0x50 [osp]
<4>[110784.638459]  [<ffffffff8107f3f2>] run_timer_softirq+0x192/0x330
<4>[110784.638459]  [<ffffffff8102e2fd>] ? lapic_next_event+0x1d/0x30
<4>[110784.638459]  [<ffffffff810749d1>] __do_softirq+0xc1/0x1e0
<4>[110784.638459]  [<ffffffff8109940a>] ? hrtimer_interrupt+0x14a/0x270
<4>[110784.638459]  [<ffffffff8100c20c>] call_softirq+0x1c/0x30
<4>[110784.638459]  [<ffffffff8100de45>] do_softirq+0x65/0xa0
<4>[110784.638459]  [<ffffffff810747b5>] irq_exit+0x85/0x90
<4>[110784.638459]  [<ffffffff81504ed0>] smp_apic_timer_interrupt+0x70/0x9b
<4>[110784.638459]  [<ffffffff8100bbd3>] apic_timer_interrupt+0x13/0x20
<4>[110784.638459]  <EOI> 
<4>[110784.638459]  [<ffffffff8103a86b>] ? native_safe_halt+0xb/0x10
<4>[110784.638459]  [<ffffffff81014a0d>] default_idle+0x4d/0xb0
<4>[110784.638459]  [<ffffffff81009a26>] cpu_idle+0xb6/0x110
<4>[110784.638459]  [<ffffffff814f4289>] start_secondary+0x2ac/0x2ef
<4>[110784.638459] Code: 85 7a 00 fa 00 00 00 c7 45 cc 01 00 00 00 4d 8d b5 a0 06 00 00 eb 0c 0f 1f 44 00 00 8b 45 cc 85 c0 75 63 45 31 ff 4d 39 fc 76 f1 <8b> 03 89 c2 c1 c0 10 39 c2 8d 90 00 00 01 00 75 04 f0 0f b1 13 
<1>[110784.638459] RIP  [<ffffffff812872cf>] _raw_spin_lock+0xdf/0x180
<4>[110784.638459]  RSP <ffff880006303d80>
<4>[110784.638459] CR2: ffff880081e1c9d0

vmcore and modules are in /exports/crashdumps/192.168.10.222-2013-06-15-08\:18\:40

Comment by Oleg Drokin [ 01/May/16 ]

And I just hit this once more:

<1>[244260.988016] BUG: unable to handle kernel paging request at ffff88008081c990
<1>[244260.988016] IP: [<ffffffff812a08cf>] _raw_spin_lock+0xdf/0x180
<4>[244260.988016] PGD 1a2e063 PUD 501067 PMD 506067 PTE 800000008081c060
<4>[244260.988016] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
<4>[244260.988016] last sysfs file: /sys/devices/system/cpu/online
<4>[244260.988016] CPU 7 
<4>[244260.988016] Modules linked in: lustre ofd osp lod ost mdt mdd mgs osd_ldiskfs ldiskfs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet libcfs zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate exportfs sha512_generic sha256_generic ext4 jbd2 mbcache virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio cxgb3i libcxgbi ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
<4>[244260.988016] 
<4>[244260.988016] Pid: 0, comm: swapper Tainted: P           -- ------------    2.6.32-rhe6.7-debug #1 Red Hat KVM
<4>[244260.988016] RIP: 0010:[<ffffffff812a08cf>]  [<ffffffff812a08cf>] _raw_spin_lock+0xdf/0x180
<4>[244260.988016] RSP: 0018:ffff8800063c3d90  EFLAGS: 00010012
<4>[244260.988016] RAX: ffffffffd7eb4dc0 RBX: ffff88008081c990 RCX: 00000000d7eb4dc0
<4>[244260.988016] RDX: 00000000041c0d28 RSI: 0000000000000003 RDI: 0000000000000001
<4>[244260.988016] RBP: ffff8800063c3dd0 R08: 0000000000000001 R09: 0000de27712f0e00
<4>[244260.988016] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000cb40a760
<4>[244260.988016] R13: ffff8800bac86480 R14: ffff8800bac86b20 R15: 000000000005b1eb
<4>[244260.988016] FS:  0000000000000000(0000) GS:ffff8800063c0000(0000) knlGS:0000000000000000
<4>[244260.988016] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[244260.988016] CR2: ffff88008081c990 CR3: 0000000001a2d000 CR4: 00000000000006e0
<4>[244260.988016] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[244260.988016] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[244260.988016] Process swapper (pid: 0, threadinfo ffff8800bac88000, task ffff8800bac86480)
<4>[244260.988016] Stack:
<4>[244260.988016]  ffff8800063c3db0 0000000100000286 ffff8800063c3db0 0000000000000282
<4>[244260.988016] <d> ffff88008081cc90 0000000000000003 0000000000000001 0000000000000000
<4>[244260.988016] <d> ffff8800063c3df0 ffffffff81530c14 0000000000015c00 ffff88008081c990
<4>[244260.988016] Call Trace:
<4>[244260.988016]  <IRQ> 
<4>[244260.988016]  [<ffffffff81530c14>] _spin_lock_irqsave+0x24/0x30
<4>[244260.988016]  [<ffffffff8105d562>] __wake_up+0x32/0x70
<4>[244260.988016]  [<ffffffffa0ab1450>] ? osp_statfs_timer_cb+0x0/0x80 [osp]
<4>[244260.988016]  [<ffffffffa0ab1490>] osp_statfs_timer_cb+0x40/0x80 [osp]
<4>[244260.988016]  [<ffffffff81088b52>] run_timer_softirq+0x192/0x330
<4>[244260.988016]  [<ffffffff81042f1c>] ? kvm_clock_read+0x1c/0x20
<4>[244260.988016]  [<ffffffff81034eed>] ? lapic_next_event+0x1d/0x30
<4>[244260.988016]  [<ffffffff8107e4b1>] __do_softirq+0xc1/0x1e0
<4>[244260.988016]  [<ffffffff810b1e2f>] ? tick_program_event+0x2f/0x40
<4>[244260.988016]  [<ffffffff8100c3cc>] call_softirq+0x1c/0x30
<4>[244260.988016]  [<ffffffff8100fc15>] do_softirq+0x65/0xa0
<4>[244260.988016]  [<ffffffff8107e365>] irq_exit+0x85/0x90
<4>[244260.988016]  [<ffffffff815378da>] smp_apic_timer_interrupt+0x4a/0x60
<4>[244260.988016]  [<ffffffff8100bc53>] apic_timer_interrupt+0x13/0x20
<4>[244260.988016]  <EOI> 
<4>[244260.988016]  [<ffffffff81042f7b>] ? native_safe_halt+0xb/0x10
<4>[244260.988016]  [<ffffffff8101684d>] default_idle+0x4d/0xb0
<4>[244260.988016]  [<ffffffff81009a46>] cpu_idle+0xb6/0x110
<4>[244260.988016]  [<ffffffff81526085>] start_secondary+0x2c0/0x316
<4>[244260.988016] Code: 6f 79 00 fa 00 00 00 c7 45 cc 01 00 00 00 4d 8d b5 a0 06 00 00 eb 0c 0f 1f 44 00 00 8b 45 cc 85 c0 75 63 45 31 ff 4d 39 fc 76 f1 <8b> 03 89 c2 c1 c0 10 39 c2 8d 90 00 00 01 00 75 04 f0 0f b1 13 
<1>[244260.988016] RIP  [<ffffffff812a08cf>] _raw_spin_lock+0xdf/0x180
<4>[244260.988016]  RSP <ffff8800063c3d90>
<4>[244260.988016] CR2: ffff88008081c990
Comment by Oleg Drokin [ 23/Dec/16 ]

I finally traced this to a race between osp_statfs_timer_cb and osp_precreate_fini.

osp_precreate_fini:

        cfs_timer_disarm(&d->opd_statfs_timer);

        if (d->opd_pre == NULL)
                RETURN_EXIT;

        thread = &d->opd_pre_thread;

        thread->t_flags = SVC_STOPPING;
        wake_up(&d->opd_pre_waitq);

        wait_event(thread->t_ctl_waitq, thread->t_flags & SVC_STOPPED);

        OBD_FREE_PTR(d->opd_pre);
        d->opd_pre = NULL;

So if we enter here while the timer cb is running, disabling it does not help, we fall through here and free the opd_pre while cb is doing something else and so

static void osp_statfs_timer_cb(unsigned long _d)
{
        struct osp_device *d = (struct osp_device *) _d;

        LASSERT(d);
        if (d->opd_pre != NULL && osp_precreate_running(d))
                wake_up(&d->opd_pre_waitq);

wake_up goes to access freed memory.

I suspect a refcount on opd_pre is too heavy, but I don't see anything supereasy to fix this otherwise.

Generated at Sat Feb 10 01:33:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.