[LU-2208] deadlock in add_lsmref Created: 18/Oct/12  Updated: 13/Nov/12  Resolved: 13/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File lu2208.txt.gz    
Issue Links:
Related
is related to LU-1612 Test failure on test suite parallel-s... Resolved
Severity: 3
Rank (Obsolete): 5253

 Description   

Hit this in sanity 118k

[19970.051046] Lustre: DEBUG MARKER: == sanity test 118k: bio alloc -ENOMEM and IO TERM handling =========== 22:02:17 (1350525737)
[20160.612156] INFO: task flush-lustre-5:18802 blocked for more than 120 seconds.
[20160.612995] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20160.613824] flush-lustre- D 0000000000000000  3504 18802      2 0x00000000
[20160.614395]  ffff880047a39958 0000000000000046 0000000000000000 ffff880047a39900
[20160.615163]  ffff880047a399a0 0000000100000000 0000000000000020 ffff88005f620c20
[20160.615529]  ffff880020730738 ffff880047a39fd8 000000000000fba8 ffff880020730738
[20160.615893] Call Trace:
[20160.616057]  [<ffffffff814fabbd>] rwsem_down_failed_common+0x8d/0x1d0
[20160.616272]  [<ffffffff814fad56>] rwsem_down_read_failed+0x26/0x30
[20160.616483]  [<ffffffff8127c104>] call_rwsem_down_read_failed+0x14/0x30
[20160.616718]  [<ffffffff814f9ec7>] ? down_read+0x37/0x40
[20160.616954]  [<ffffffffa1527b24>] lov_lsm_addref+0x34/0x150 [lov]
[20160.617186]  [<ffffffffa1528043>] lov_io_init+0x73/0x160 [lov]
[20160.617417]  [<ffffffffa10eb4e8>] cl_io_init0+0x98/0x160 [obdclass]
[20160.617647]  [<ffffffffa10ee2a4>] cl_io_init+0x64/0x100 [obdclass]
[20160.617869]  [<ffffffffa095e9ce>] cl_sync_file_range+0x11e/0x560 [lustre]
[20160.618793]  [<ffffffffa0984342>] ll_writepages+0x72/0x1b0 [lustre]
[20160.619027]  [<ffffffff81127f44>] do_writepages+0x24/0x40
[20160.619228]  [<ffffffff811a5594>] writeback_single_inode+0xe4/0x2d0
[20160.619439]  [<ffffffff811a5a13>] writeback_sb_inodes+0xd3/0x190
[20160.619646]  [<ffffffff811a5b4b>] writeback_inodes_wb+0x7b/0x1a0
[20160.619854]  [<ffffffff811a5efb>] wb_writeback+0x28b/0x3d0
[20160.620060]  [<ffffffff8107c6e2>] ? del_timer_sync+0x22/0x30
[20160.620266]  [<ffffffff811a61e5>] wb_do_writeback+0x1a5/0x250
[20160.620474]  [<ffffffff811a62f3>] bdi_writeback_task+0x63/0x1b0
[20160.620714]  [<ffffffff8108fc27>] ? bit_waitqueue+0x17/0xd0
[20160.620928]  [<ffffffff81136dc0>] ? bdi_start_fn+0x0/0x100
[20160.621172]  [<ffffffff81136e46>] bdi_start_fn+0x86/0x100
[20160.621396]  [<ffffffff81136dc0>] ? bdi_start_fn+0x0/0x100
[20160.621598]  [<ffffffff8108fa16>] kthread+0x96/0xa0
[20160.621790]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[20160.621985]  [<ffffffff8108f980>] ? kthread+0x0/0xa0
[20160.622191]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[20160.622411] INFO: task dd:26845 blocked for more than 120 seconds.
[20160.622631] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20160.622987] dd            D 0000000000000007  2608 26845      1 0x00000000
[20160.623220]  ffff88005bea3470 0000000000000086 0000000000000000 0000000000000286
[20160.623585]  0000000000000092 ffff88003b6a8548 000000015bea3458 ffffffff81f61808
[20160.623949]  ffff880046f00b78 ffff88005bea3fd8 000000000000fba8 ffff880046f00b78
[20160.624318] Call Trace:
[20160.624476]  [<ffffffff81044f4e>] ? kernel_map_pages+0xfe/0x110
[20160.624700]  [<ffffffff814fabbd>] rwsem_down_failed_common+0x8d/0x1d0
[20160.624978]  [<ffffffffa08d1afe>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[20160.625319]  [<ffffffff814fad56>] rwsem_down_read_failed+0x26/0x30
[20160.625557]  [<ffffffff8127c104>] call_rwsem_down_read_failed+0x14/0x30
[20160.625774]  [<ffffffff814f9ec7>] ? down_read+0x37/0x40
[20160.625992]  [<ffffffffa1527b24>] lov_lsm_addref+0x34/0x150 [lov]
[20160.626238]  [<ffffffffa1528043>] lov_io_init+0x73/0x160 [lov]
[20160.626467]  [<ffffffffa10eb4e8>] cl_io_init0+0x98/0x160 [obdclass]
[20160.626690]  [<ffffffffa10ee2a4>] cl_io_init+0x64/0x100 [obdclass]
[20160.626922]  [<ffffffffa149b6d3>] osc_lru_shrink+0x4a3/0x8c0 [osc]
[20160.627143]  [<ffffffff8116133a>] ? cache_alloc_debugcheck_after+0x14a/0x210
[20160.627378]  [<ffffffffa149bfa8>] osc_page_init+0x4b8/0xb40 [osc]
[20160.627610]  [<ffffffffa10de7b5>] ? cl_page_slice_add+0x55/0x140 [obdclass]
[20160.627856]  [<ffffffffa10e2e7b>] cl_page_find0+0x2db/0x900 [obdclass]
[20160.628081]  [<ffffffff8116133a>] ? cache_alloc_debugcheck_after+0x14a/0x210
[20160.628327]  [<ffffffffa10e34b8>] cl_page_find_sub+0x18/0x20 [obdclass]
[20160.628575]  [<ffffffffa152a503>] lov_page_init_raid0+0x1a3/0x780 [lov]
[20160.628806]  [<ffffffffa1527f58>] lov_page_init+0x68/0xe0 [lov]
[20160.629038]  [<ffffffffa10e2e7b>] cl_page_find0+0x2db/0x900 [obdclass]
[20160.629268]  [<ffffffffa08e83e2>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
[20160.629492]  [<ffffffff811704f5>] ? mem_cgroup_charge_common+0xa5/0xd0
[20160.629728]  [<ffffffffa10e34d1>] cl_page_find+0x11/0x20 [obdclass]
[20160.629962]  [<ffffffffa0985ab4>] ll_cl_init+0x154/0x5b0 [lustre]
[20160.630190]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
[20160.630430]  [<ffffffffa0986163>] ll_prepare_write+0x53/0x1a0 [lustre]
[20160.630672]  [<ffffffffa099dfbe>] ll_write_begin+0x7e/0x1a0 [lustre]
[20160.630898]  [<ffffffff81112d23>] generic_file_buffered_write+0x123/0x300
[20160.631130]  [<ffffffff8106fea7>] ? current_fs_time+0x27/0x30
[20160.631341]  [<ffffffff811147e0>] __generic_file_aio_write+0x250/0x480
[20160.631563]  [<ffffffff81114a7f>] generic_file_aio_write+0x6f/0xe0
[20160.631790]  [<ffffffffa09b171c>] vvp_io_write_start+0x9c/0x240 [lustre]
[20160.632034]  [<ffffffffa10eb26a>] cl_io_start+0x6a/0x140 [obdclass]
[20160.632285]  [<ffffffffa10efa54>] cl_io_loop+0xb4/0x1b0 [obdclass]
[20160.632529]  [<ffffffffa095e05b>] ll_file_io_generic+0x42b/0x550 [lustre]
[20160.632763]  [<ffffffffa095ef4c>] ll_file_aio_write+0x13c/0x2c0 [lustre]
[20160.632994]  [<ffffffffa095f239>] ll_file_write+0x169/0x2a0 [lustre]
[20160.633222]  [<ffffffff8117b2e8>] vfs_write+0xb8/0x1a0
[20160.633423]  [<ffffffff8117bbb1>] sys_write+0x51/0x90
[20160.633624]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[20160.633837] INFO: task dd:26851 blocked for more than 120 seconds.

There are a number of other dd processes hung there as well.
Jinshan, I am leaving the centos6-1 node in this state for you to look at when you have time. Feel free to crash it with echo c > /proc/sysrq-trigger if you think using crash to investigate the dump would help.
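Both stuck tasks are blocked in down_read() inside lov_lsm_addref, and the dd trace shows a nested cl_io_init (via osc_lru_shrink) underneath an outer cl_io that already holds the same read lock. The following is a minimal toy model (not Lustre code; the lock and thread names are illustrative) of why a re-entrant down_read() on a fair rwsem deadlocks once a writer queues in between:

```python
from collections import deque

class FairRWSem:
    """Toy fair reader-writer semaphore: waiters are served FIFO,
    so a queued writer blocks any read request arriving after it."""
    def __init__(self):
        self.readers = 0        # count of active readers
        self.writer = False     # True while a writer holds the lock
        self.waiters = deque()  # queued ('r'|'w', name) requests

    def try_down_read(self, name):
        # A new reader must wait if a writer is active OR already
        # queued (fairness prevents writer starvation).
        if self.writer or any(kind == 'w' for kind, _ in self.waiters):
            self.waiters.append(('r', name))
            return False        # would block
        self.readers += 1
        return True

    def try_down_write(self, name):
        if self.writer or self.readers:
            self.waiters.append(('w', name))
            return False        # would block
        self.writer = True
        return True

sem = FairRWSem()
# dd thread takes the layout rwsem for read in the outer cl_io:
assert sem.try_down_read('dd: lov_lsm_addref')
# A writer (e.g. a layout change) queues behind the active reader:
assert not sem.try_down_write('layout writer')
# The same dd thread re-enters via osc_lru_shrink -> lov_io_init and
# tries down_read() again; fairness queues it behind the writer:
assert not sem.try_down_read('dd: nested lov_lsm_addref')
# The thread now waits on itself: it can never release its first read
# reference, so neither the writer nor the nested read ever proceeds.
```

This is only a sketch of the lock ordering; in the real trace the re-entry happens because page allocation under memory pressure triggers LRU shrinking, which starts a new cl_io on the same object.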



 Comments   
Comment by Oleg Drokin [ 18/Oct/12 ]

Might be related to LU-1612

Comment by Oleg Drokin [ 18/Oct/12 ]

Oh, and test output to date (hanging at this stage for many hours already):

== sanity test 118k: bio alloc -ENOMEM and IO TERM handling =========== 22:02:17 (1350525737)
fail_loc=0x20e
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.0580369 s, 181 MB/s
sanity.sh: line 6428: kill: (26840) - No such process
sanity.sh: line 6429: 26843 Terminated              ( dd if=/dev/zero of=$DIR/$tdir/$tfile-$i bs=1M count=10 || error "dd to $DIR/$tdir/$tfile-$i failed" )
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.0606453 s, 173 MB/s
sanity.sh: line 6428: kill: (26846) - No such process
sanity.sh: line 6429: 26849 Terminated              ( dd if=/dev/zero of=$DIR/$tdir/$tfile-$i bs=1M count=10 || error "dd to $DIR/$tdir/$tfile-$i failed" )
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.0595693 s, 176 MB/s
sanity.sh: line 6428: kill: (26852) - No such process
sanity.sh: line 6429: 26855 Terminated              ( dd if=/dev/zero of=$DIR/$tdir/$tfile-$i bs=1M count=10 || error "dd to $DIR/$tdir/$tfile-$i failed" )
sanity.sh: line 6429: 26858 Terminated              ( dd if=/dev/zero of=$DIR/$tdir/$tfile-$i bs=1M count=10 || error "dd to $DIR/$tdir/$tfile-$i failed" )
sanity.sh: line 6429: 26861 Terminated              ( dd if=/dev/zero of=$DIR/$tdir/$tfile-$i bs=1M count=10 || error "dd to $DIR/$tdir/$tfile-$i failed" )
sanity.sh: line 6429: 26864 Terminated              ( dd if=/dev/zero of=$DIR/$tdir/$tfile-$i bs=1M count=10 || error "dd to $DIR/$tdir/$tfile-$i failed" )
sanity.sh: line 6429: 26867 Terminated              ( dd if=/dev/zero of=$DIR/$tdir/$tfile-$i bs=1M count=10 || error "dd to $DIR/$tdir/$tfile-$i failed" )
fail_loc=0
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 2.63154 s, 4.0 MB/s
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.63022 s, 6.4 MB/s
Comment by Oleg Drokin [ 28/Oct/12 ]

Hit this again; the debug log is attached.

Comment by Jinshan Xiong (Inactive) [ 01/Nov/12 ]

patch is at http://review.whamcloud.com/4416

Comment by Jinshan Xiong (Inactive) [ 13/Nov/12 ]

This problem was introduced by the layout lock client implementation. Setting this as a duplicate of LU-1876, since it will be fixed there.

Generated at Sat Feb 10 01:23:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.