[LU-4558] Crash in cl_lock_put on racer Created: 29/Jan/14  Updated: 19/May/14  Resolved: 17/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: mn4, sdsc

Issue Links:
Duplicate
is duplicated by LU-4874 (osc_lock.c:240:osc_lock_fini()) ASSE... Resolved
Related
is related to LU-4797 ASSERTION( cl_lock_is_mutexed(slice->... Resolved
is related to LU-5071 statahead.c:1704:do_statahead_enter()... Resolved
Severity: 3
Rank (Obsolete): 12447

 Description   

I am seeing this from time to time.

<1>[  997.876619] BUG: unable to handle kernel paging request at ffff88006558ef30
<1>[  997.876963] IP: [<ffffffffa05c6cbd>] cl_lock_put+0x10d/0x420 [obdclass]
<4>[  997.877297] PGD 1a26063 PUD 300067 PMD 42b067 PTE 800000006558e060
<4>[  997.877614] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
<4>[  997.877886] last sysfs file: /sys/devices/system/cpu/possible
<4>[  997.878160] CPU 1 
<4>[  997.878199] Modules linked in: lustre ofd osp lod ost mdt mdd mgs nodemap osd_ldiskfs ldiskfs exportfs lquota lfsck jbd obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet sha512_generic sha256_generic libcfs ext4 jbd2 mbcache ppdev parport_pc parport virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: speedstep_lib]
<4>[  997.881412] 
<4>[  997.881412] Pid: 15574, comm: dd Not tainted 2.6.32-rhe6.4-debug2 #1 Bochs Bochs
<4>[  997.881412] RIP: 0010:[<ffffffffa05c6cbd>]  [<ffffffffa05c6cbd>] cl_lock_put+0x10d/0x420 [obdclass]
<4>[  997.881412] RSP: 0000:ffff88000b20db58  EFLAGS: 00010202
<4>[  997.881412] RAX: ffff88006558ef01 RBX: ffff88006558eee0 RCX: 0000000000000001
<4>[  997.881412] RDX: 0000000000000000 RSI: ffff88006558eee0 RDI: ffff880079754f50
<4>[  997.881412] RBP: ffff88000b20db78 R08: ffffffffa05e5e20 R09: 00000000000002f2
<4>[  997.881412] R10: 0000000000000003 R11: 000000000000000f R12: ffff880079754f50
<4>[  997.881412] R13: ffff880068b7de80 R14: ffff880068b7de80 R15: ffff880068b7df78
<4>[  997.881412] FS:  00007f4cfcdfd700(0000) GS:ffff880006280000(0000) knlGS:0000000000000000
<4>[  997.881412] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[  997.881412] CR2: ffff88006558ef30 CR3: 00000000297f4000 CR4: 00000000000006e0
<4>[  997.881412] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[  997.881412] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[  997.881412] Process dd (pid: 15574, threadinfo ffff88000b20c000, task ffff8800ae354400)
<4>[  997.881412] Stack:
<4>[  997.881412]  ffff880068b7df70 ffff880079754f50 ffff880068b7de80 ffff880068b7de80
<4>[  997.881412] <d> ffff88000b20dbb8 ffffffffa05c8360 ffff880087cef140 ffff880079754f50
<4>[  997.881412] <d> ffff880087cef140 ffff880068b7df70 0000000000000000 ffff880012baaaf8
<4>[  997.881412] Call Trace:
<4>[  997.881412]  [<ffffffffa05c8360>] cl_lock_disclosure+0xa0/0x110 [obdclass]
<4>[  997.881412]  [<ffffffffa095866f>] lov_sublock_unlock+0x5f/0x140 [lov]
<4>[  997.881412]  [<ffffffffa09595e9>] lov_lock_delete+0xc9/0x310 [lov]
<4>[  997.881412]  [<ffffffffa05c55b5>] cl_lock_delete0+0xb5/0x1d0 [obdclass]
<4>[  997.881412]  [<ffffffffa05c5b8e>] cl_lock_hold_release+0x19e/0x2a0 [obdclass]
<4>[  997.881412]  [<ffffffffa05c9d44>] cl_lock_request+0x1a4/0x270 [obdclass]
<4>[  997.881412]  [<ffffffffa05cea8c>] cl_io_lock+0x3cc/0x560 [obdclass]
<4>[  997.881412]  [<ffffffffa05cecc2>] cl_io_loop+0xa2/0x1b0 [obdclass]
<4>[  997.881412]  [<ffffffffa0ec1b66>] ll_file_io_generic+0x2b6/0x710 [lustre]
<4>[  997.881412]  [<ffffffffa05bed59>] ? cl_env_get+0x29/0x350 [obdclass]
<4>[  997.881412]  [<ffffffffa0ec2832>] ll_file_aio_write+0x142/0x2c0 [lustre]
<4>[  997.881412]  [<ffffffffa0ec2b1c>] ll_file_write+0x16c/0x2a0 [lustre]
<4>[  997.881412]  [<ffffffff811818a8>] vfs_write+0xb8/0x1a0
<4>[  997.881412]  [<ffffffff81182111>] sys_write+0x51/0x90
<4>[  997.881412]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

Sample crash report in /exports/crashdumps/192.168.10.218-2014-01-23-00\:33\:42 and source tag master-20140127



 Comments   
Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ]

This may be a race window in accessing the cl_lock state.

Thread 1 got here:

        if (atomic_dec_and_test(&lock->cll_ref)) {

>>>>>> preempted here

                if (lock->cll_state == CLS_FREEING) {

Thread 2 then destroyed and freed the lock.

Thread 1 resumed, dereferenced the freed lock, and hit the bug.

Comment by Jinshan Xiong (Inactive) [ 19/Mar/14 ]

As long as cl_lock is dying, we're not going to fix the problem.

Comment by Cory Spitz [ 03/Apr/14 ]

b2_4: http://review.whamcloud.com/#/c/9876
master: http://review.whamcloud.com/#/c/9881

Comment by Bob Glossman (Inactive) [ 04/Apr/14 ]

b2_5: http://review.whamcloud.com/9887

Comment by Jinshan Xiong (Inactive) [ 12/Apr/14 ]

patch for b2_1: http://review.whamcloud.com/9939

Comment by Jinshan Xiong (Inactive) [ 17/Apr/14 ]

patch landed.

Generated at Sat Feb 10 01:43:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.