[LU-2297] Test failure on test suite replay-single, subtest test_74: client umount hang Created: 07/Nov/12 Updated: 05/Dec/12 Resolved: 05/Dec/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | NFBlocker | ||
| Severity: | 3 |
| Rank (Obsolete): | 5490 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. It relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c2d37650-2819-11e2-aa14-52540035b04c. The sub-test test_74 failed with the following error:
client umount hang
12:25:18:Lustre: DEBUG MARKER: == replay-single test 74: Ensure applications don't fail waiting for OST recovery == 12:25:18 (1352147118)
12:25:30:Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
12:25:30:if [ $running -ne 0 ] ; then
12:25:30:echo Stopping client $(hostname) /mnt/lustre opts:;
12:25:30:lsof /mnt/lustre || need_kill=no;
12:25:30:if [ x != x -a x$need_kill != xno ]; then
12:25:30: pids=$(lsof -t /mnt/lustre | sort -u);
12:25:30: if
12:29:22:INFO: task umount:29383 blocked for more than 120 seconds.
12:29:22:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
12:29:22:umount D 0000000000000000 0 29383 29376 0x00000080
12:29:22: ffff880079f97b68 0000000000000082 ffff8800ffffffff 00001d5805116b96
12:29:22: ffff880079f97ad8 ffff880037f6ae50 00000000004059ee ffffffffaf68090e
12:29:22: ffff88003a921058 ffff880079f97fd8 000000000000fb88 ffff88003a921058
12:29:22:Call Trace:
12:29:22: [<ffffffff8109cd49>] ? ktime_get_ts+0xa9/0xe0
12:29:22: [<ffffffff811141f0>] ? sync_page+0x0/0x50
12:29:22: [<ffffffff814fe0f3>] io_schedule+0x73/0xc0
12:29:22: [<ffffffff8111422d>] sync_page+0x3d/0x50
12:29:22: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
12:29:22: [<ffffffff81114463>] wait_on_page_bit+0x73/0x80
12:29:22: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
12:29:22: [<ffffffff8112a965>] ? pagevec_lookup_tag+0x25/0x40
12:29:22: [<ffffffff811148db>] wait_on_page_writeback_range+0xfb/0x190
12:29:22: [<ffffffff8111499f>] filemap_fdatawait+0x2f/0x40
12:29:22: [<ffffffff811a4874>] sync_inodes_sb+0x114/0x190
12:29:22: [<ffffffff811aa312>] __sync_filesystem+0x82/0x90
12:29:22: [<ffffffff811aa51b>] sync_filesystem+0x4b/0x70
12:29:22: [<ffffffff8117d317>] generic_shutdown_super+0x27/0xe0
12:29:22: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
12:29:22: [<ffffffffa052a94a>] lustre_kill_super+0x4a/0x60 [obdclass]
12:29:22: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
12:29:22: [<ffffffff8119a4ef>] mntput_no_expire+0xbf/0x110
12:29:22: [<ffffffff8119af8b>] sys_umount+0x7b/0x3a0
12:29:22: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
12:29:23: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
12:31:14:INFO: task umount:29383 blocked for more than 120 seconds. |
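The kernel prints the hung-task call trace innermost frame first and mixes in "?"-marked frames that may be stale stack contents, so the path from the umount syscall down to the blocking wait is easy to misread. As a triage aid, here is a minimal Python sketch (not part of the Lustre test framework; `FRAME_RE`, `call_chain` and `SAMPLE` are hypothetical names) that extracts only the reliable frames from such a console log and prints them caller first. It assumes the x86_64 trace format shown in the log above.

```python
import re

# One frame of a kernel call trace, e.g.
#   "12:29:22: [<ffffffff8111422d>] sync_page+0x3d/0x50"
# Group 1 is "? " when the kernel flags the frame as unreliable (an address
# found on the stack that may be stale); group 2 is the function name.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def call_chain(console_log: str) -> list[str]:
    """Return the reliable frames of the first call trace, outermost caller first."""
    frames = []
    in_trace = False
    for line in console_log.splitlines():
        if not in_trace:
            # Skip everything up to the first "Call Trace:" header.
            in_trace = "Call Trace:" in line
            continue
        match = FRAME_RE.search(line)
        if match is None:
            if frames:
                break  # the first non-frame line after the trace ends it
            continue
        if match.group(1):
            continue  # drop "?"-marked (possibly stale) frames
        frames.append(match.group(2))
    # The kernel prints the innermost frame first; reverse to read caller -> callee.
    return frames[::-1]


# Trace copied from the console log in this issue.
SAMPLE = """\
12:29:22:Call Trace:
12:29:22: [<ffffffff8109cd49>] ? ktime_get_ts+0xa9/0xe0
12:29:22: [<ffffffff811141f0>] ? sync_page+0x0/0x50
12:29:22: [<ffffffff814fe0f3>] io_schedule+0x73/0xc0
12:29:22: [<ffffffff8111422d>] sync_page+0x3d/0x50
12:29:22: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
12:29:22: [<ffffffff81114463>] wait_on_page_bit+0x73/0x80
12:29:22: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
12:29:22: [<ffffffff8112a965>] ? pagevec_lookup_tag+0x25/0x40
12:29:22: [<ffffffff811148db>] wait_on_page_writeback_range+0xfb/0x190
12:29:22: [<ffffffff8111499f>] filemap_fdatawait+0x2f/0x40
12:29:22: [<ffffffff811a4874>] sync_inodes_sb+0x114/0x190
12:29:22: [<ffffffff811aa312>] __sync_filesystem+0x82/0x90
12:29:22: [<ffffffff811aa51b>] sync_filesystem+0x4b/0x70
12:29:22: [<ffffffff8117d317>] generic_shutdown_super+0x27/0xe0
12:29:22: [<ffffffff8117d436>] kill_anon_super+0x16/0x60
12:29:22: [<ffffffffa052a94a>] lustre_kill_super+0x4a/0x60 [obdclass]
12:29:22: [<ffffffff8117e4b0>] deactivate_super+0x70/0x90
12:29:22: [<ffffffff8119a4ef>] mntput_no_expire+0xbf/0x110
12:29:22: [<ffffffff8119af8b>] sys_umount+0x7b/0x3a0
12:29:22: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
12:29:23: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
12:31:14:INFO: task umount:29383 blocked for more than 120 seconds.
"""

print(" -> ".join(call_chain(SAMPLE)))
```

Read caller first, the chain goes from sys_umount through lustre_kill_super and generic_shutdown_super into sync_filesystem and filemap_fdatawait, ending in wait_on_page_bit/io_schedule. In other words, the unmount is blocked in the final filesystem sync, waiting for page writeback on the Lustre client to complete.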
| Comments |
| Comment by nasf (Inactive) [ 13/Nov/12 ] |
|
Another failure instance: https://maloo.whamcloud.com/sub_tests/ac4fd6f8-2d82-11e2-89bf-52540035b04c |
| Comment by Peter Jones [ 28/Nov/12 ] |
|
Bobijam will look into this one |
| Comment by Mikhail Pershin [ 02/Dec/12 ] |
|
This has become a critical bug: it happens very often and wastes Maloo testing effort, about 25 times over the weekend. It is almost impossible to pass tests because of this bug and a couple of others. Maybe it was caused by a recent landing? |
| Comment by Oleg Drokin [ 03/Dec/12 ] |
|
There's a patch at http://review.whamcloud.com/#change,4717 |
| Comment by Peter Jones [ 05/Dec/12 ] |
|
Landed for 2.4 |