[LU-2297] Test failure on test suite replay-single, subtest test_74: client umount hang
Created: 07/Nov/12  Updated: 05/Dec/12  Resolved: 05/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: NFBlocker

Severity: 3
Rank (Obsolete): 5490

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c2d37650-2819-11e2-aa14-52540035b04c.

The sub-test test_74 failed with the following error:

test failed to respond and timed out

client umount hang

12:25:18:Lustre: DEBUG MARKER: == replay-single test 74: Ensure applications don't fail waiting for OST recovery == 12:25:18 (1352147118)
12:25:30:Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
12:25:30:if [ $running -ne 0 ] ; then
12:25:30:echo Stopping client $(hostname) /mnt/lustre opts:;
12:25:30:lsof /mnt/lustre || need_kill=no;
12:25:30:if [ x != x -a x$need_kill != xno ]; then
12:25:30:    pids=$(lsof -t /mnt/lustre | sort -u);
12:25:30:    if 
12:29:22:INFO: task umount:29383 blocked for more than 120 seconds.
12:29:22:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
12:29:22:umount        D 0000000000000000     0 29383  29376 0x00000080
12:29:22: ffff880079f97b68 0000000000000082 ffff8800ffffffff 00001d5805116b96
12:29:22: ffff880079f97ad8 ffff880037f6ae50 00000000004059ee ffffffffaf68090e
12:29:22: ffff88003a921058 ffff880079f97fd8 000000000000fb88 ffff88003a921058
12:29:22:Call Trace:
12:29:22: [<ffffffff8109cd49>] ? ktime_get_ts+0xa9/0xe0
12:29:22: [<ffffffff811141f0>] ? sync_page+0x0/0x50
12:29:22: [<ffffffff814fe0f3>] io_schedule+0x73/0xc0
12:29:22: [<ffffffff8111422d>] sync_page+0x3d/0x50
12:29:22: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
12:29:22: [<ffffffff81114463>] wait_on_page_bit+0x73/0x80
12:29:22: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
12:29:22: [<ffffffff8112a965>] ? pagevec_lookup_tag+0x25/0x40
12:29:22: [<ffffffff811148db>] wait_on_page_writeback_range+0xfb/0x190
12:29:22: [<ffffffff8111499f>] filemap_fdatawait+0x2f/0x40
12:29:22: [<ffffffff811a4874>] sync_inodes_sb+0x114/0x190
12:29:22: [<ffffffff811aa312>] __sync_filesystem+0x82/0x90
12:29:22: [<ffffffff811aa51b>] sync_filesystem+0x4b/0x70
12:29:22: [<ffffffff8117d317>] generic_shutdown_super+0x27/0xe0
12:29:22: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
12:29:22: [<ffffffffa052a94a>] lustre_kill_super+0x4a/0x60 [obdclass]
12:29:22: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
12:29:22: [<ffffffff8119a4ef>] mntput_no_expire+0xbf/0x110
12:29:22: [<ffffffff8119af8b>] sys_umount+0x7b/0x3a0
12:29:22: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
12:29:23: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
12:31:14:INFO: task umount:29383 blocked for more than 120 seconds.
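
When triaging repeated instances of this failure, it can help to pull the hung-task reports out of a console log automatically. The following is a minimal sketch, not part of the Lustre test framework: the function name find_hung_tasks and the sample list are hypothetical, and the pattern assumes the "HH:MM:SS:" prefix and kernel message wording seen in the log above.

```python
import re

# Assumed line shape, taken from the console log in this ticket:
#   12:29:22:INFO: task umount:29383 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}):INFO: task (?P<comm>[^:]+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(lines):
    """Return one (timestamp, command, pid, seconds) tuple per hung-task report."""
    reports = []
    for line in lines:
        m = HUNG_RE.match(line.strip())
        if m:
            reports.append((m["ts"], m["comm"], int(m["pid"]), int(m["secs"])))
    return reports


# Hypothetical sample built from lines quoted in this ticket.
sample = [
    "12:29:22:INFO: task umount:29383 blocked for more than 120 seconds.",
    "12:29:22:Call Trace:",
    "12:31:14:INFO: task umount:29383 blocked for more than 120 seconds.",
]

for report in find_hung_tasks(sample):
    print(report)
```

Seeing the same pid reported at successive timestamps, as here, suggests the task stayed blocked rather than hanging and recovering between reports.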


 Comments   
Comment by nasf (Inactive) [ 13/Nov/12 ]

Another failure instance:

https://maloo.whamcloud.com/sub_tests/ac4fd6f8-2d82-11e2-89bf-52540035b04c

Comment by Peter Jones [ 28/Nov/12 ]

Bobijam will look into this one

Comment by Mikhail Pershin [ 02/Dec/12 ]

This has become a critical bug. It happens very often and wastes Maloo testing effort: about 25 times over the weekend. It is almost impossible to pass tests because of this bug and a couple of others. Maybe it was caused by some recent landing?

Comment by Oleg Drokin [ 03/Dec/12 ]

There's a patch at http://review.whamcloud.com/#change,4717

Comment by Peter Jones [ 05/Dec/12 ]

Landed for 2.4
