Lustre / LU-5584

EL7 client: Test failure on test suite runtests, subtest test_1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: None
    • Labels: None
    • Severity: 3
    • Rank: 15577

    Description

      This issue was created by maloo for Minh Diep <minh.diep@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/d54472f6-2fc6-11e4-957a-5254006e85c2.

      The sub-test test_1 failed with the following error:

      old and new files are different: rc=22

      Info required for matching: runtests 1

      Attachments

        1. after-umount-log (3.39 MB)
        2. copy-end-log (3.41 MB)

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.7

          bogl Bob Glossman (Inactive) added a comment - edited

          I've verified that with the recent landing of http://review.whamcloud.com/#/c/12103 in master I no longer see this failure when running the test on el7 clients. I think this ticket can be marked Fixed.

          ys Yang Sheng added a comment -

          The umount will hang when invoked with CL_FSYNC_ALL in replay-single test_89:

          {noformat}
          [ 8119.181431] umount D ffff88003fc14580 0 4225 3612 0x00000080
          [ 8119.181431] ffff880020cb7ae0 0000000000000086 ffff880020cb7fd8 0000000000014580
          [ 8119.181431] ffff880020cb7fd8 0000000000014580 ffff880012100b60 ffff880009f50e80
          [ 8119.181431] ffff880009f50e88 7fffffffffffffff ffff880012100b60 0000000000000000
          [ 8119.181431] Call Trace:
          [ 8119.181431] [<ffffffff815e71b9>] schedule+0x29/0x70
          [ 8119.181431] [<ffffffff815e50b9>] schedule_timeout+0x209/0x2d0
          [ 8119.181431] [<ffffffffa0783d60>] ? ptlrpcd_add_req+0x210/0x300 [ptlrpc]
          [ 8119.181431] [<ffffffffa0764370>] ? lustre_swab_obdo+0x100/0x100 [ptlrpc]
          [ 8119.181431] [<ffffffff815e76e6>] wait_for_completion+0x116/0x170
          [ 8119.181431] [<ffffffff81097700>] ? wake_up_state+0x20/0x20
          [ 8119.181431] [<ffffffffa096a944>] osc_io_fsync_end+0x74/0xa0 [osc]
          [ 8119.181431] [<ffffffffa0572a2d>] cl_io_end+0x5d/0x150 [obdclass]
          [ 8119.181431] [<ffffffffa09b61fb>] lov_io_end_wrapper+0xdb/0xe0 [lov]
          [ 8119.181431] [<ffffffffa09b63a4>] lov_io_fsync_end+0x84/0x1c0 [lov]
          [ 8119.181431] [<ffffffffa0572a2d>] cl_io_end+0x5d/0x150 [obdclass]
          [ 8119.181431] [<ffffffffa0576733>] cl_io_loop+0xb3/0x190 [obdclass]
          [ 8119.181431] [<ffffffffa0f0022b>] cl_sync_file_range+0x40b/0x610 [lustre]
          [ 8119.181431] [<ffffffffa0f13fba>] ll_delete_inode+0x10a/0x230 [lustre]
          [ 8119.181431] [<ffffffff811db91e>] ? inode_wait_for_writeback+0x2e/0x40
          [ 8119.181431] [<ffffffff811cae27>] evict+0xa7/0x170
          [ 8119.181431] [<ffffffff811caf2e>] dispose_list+0x3e/0x50
          [ 8119.181431] [<ffffffff811cbb14>] evict_inodes+0x114/0x140
          [ 8119.181431] [<ffffffff811b1fc8>] generic_shutdown_super+0x48/0xe0
          [ 8119.181431] [<ffffffff811b2242>] kill_anon_super+0x12/0x20
          [ 8119.181431] [<ffffffffa055880a>] lustre_kill_super+0x7a/0x80 [obdclass]
          [ 8119.181431] [<ffffffff811b265d>] deactivate_locked_super+0x3d/0x60
          [ 8119.181431] [<ffffffff811b26c6>] deactivate_super+0x46/0x60
          [ 8119.181431] [<ffffffff811cf455>] mntput_no_expire+0xc5/0x120
          [ 8119.181431] [<ffffffff811d058f>] SyS_umount+0x9f/0x3c0
          [ 8119.181431] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
          {noformat}

          But it does not hang with CL_FSYNC_LOCAL. I have updated the patch.
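
          For context, a hedged sketch of the fsync modes involved (paraphrased from Lustre's lustre/include/cl_object.h; the exact values and comments may differ between versions):

          {noformat}
          /* Approximate paraphrase of enum cl_fsync_mode from cl_object.h. */
          enum cl_fsync_mode {
                  CL_FSYNC_DISCARD = 0, /* discard dirty pages without writing them */
                  CL_FSYNC_NONE    = 1, /* start writeback, don't wait for it */
                  CL_FSYNC_LOCAL   = 2, /* start writeback and wait for the
                                         * write RPCs to complete */
                  CL_FSYNC_ALL     = 3, /* additionally wait for the OSTs to
                                         * commit the data; during recovery
                                         * (replay-single test_89) that commit
                                         * may be delayed indefinitely, hence
                                         * the hang in osc_io_fsync_end() above */
          };
          {noformat}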


          jay Jinshan Xiong (Inactive) added a comment -

          It looks like redirty_tail() is called on the inode, which puts it back on the b_dirty list with a newer timestamp. This is why it was skipped by the 2nd __sync_filesystem(), and therefore the dirty data was not flushed. I think Yang Sheng's patch is the right fix for the problem.
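
          A hedged sketch of the kernel-side mechanics (condensed from 3.10-era fs/fs-writeback.c, as shipped in EL7; the function names are real but the bodies are simplified):

          {noformat}
          /* Condensed from fs/fs-writeback.c (3.10-era kernels); simplified. */
          static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
          {
                  if (!list_empty(&wb->b_dirty)) {
                          struct inode *tail = wb_inode(wb->b_dirty.next);

                          /* Keep b_dirty sorted by dirtied_when: if this inode
                           * would sort before the newest entry, re-stamp it so
                           * it looks freshly dirtied. */
                          if (time_before(inode->dirtied_when, tail->dirtied_when))
                                  inode->dirtied_when = jiffies;
                  }
                  list_move(&inode->i_wb_list, &wb->b_dirty);
          }

          /* When writeback work is later queued, inodes dirtied after the
           * work's cutoff are skipped -- including one that redirty_tail()
           * just re-stamped, so the 2nd __sync_filesystem() pass never
           * writes it out. */
          {noformat}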


          jay Jinshan Xiong (Inactive) added a comment -

          When a filesystem is being unmounted, it syncs dirty data by calling sync_filesystem(), which calls __sync_filesystem() twice: the first call just starts the writeback and the second waits for the writeback to finish. Both calls should reach ll_writepages(), but for an unknown reason the 2nd call didn't reach it.
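
          For reference, a sketch of that path as it appears in 3.10-era fs/sync.c (trimmed; the real sync_filesystem() also checks for read-only superblocks):

          {noformat}
          /* Trimmed from fs/sync.c (3.10-era kernels, as shipped in EL7). */
          static int __sync_filesystem(struct super_block *sb, int wait)
          {
                  if (wait)
                          sync_inodes_sb(sb);     /* write and wait on everything */
                  else
                          writeback_inodes_sb(sb, WB_REASON_SYNC); /* start only */

                  if (sb->s_op->sync_fs)
                          sb->s_op->sync_fs(sb, wait);
                  return __sync_blockdev(sb->s_bdev, wait);
          }

          int sync_filesystem(struct super_block *sb)
          {
                  int ret;

                  ret = __sync_filesystem(sb, 0); /* pass 1: kick off writeback */
                  if (ret < 0)
                          return ret;
                  return __sync_filesystem(sb, 1); /* pass 2: wait for it */
          }
          {noformat}

          Both passes flush dirty inodes through the filesystem's ->writepages(), which for a Lustre client is ll_writepages().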

          My temporary solution is to make ll_writepages() always wait for the writeback to finish:

          {noformat}
          diff --git a/lustre/llite/rw.c b/lustre/llite/rw.c
          index ff25525..778609b 100644
          --- a/lustre/llite/rw.c
          +++ b/lustre/llite/rw.c
          @@ -1080,7 +1080,7 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
                          }
                  }

          -       mode = CL_FSYNC_NONE;
          +       mode = CL_FSYNC_LOCAL;
                  if (wbc->sync_mode == WB_SYNC_ALL)
                          mode = CL_FSYNC_LOCAL;
          {noformat}

          I can't see this issue any more after applying the above patch.

          Yang Sheng, if you have any time, can you please investigate it further to figure out why the 2nd __sync_filesystem() didn't reach ll_writepages()?

          Thanks,
          Jinshan


          jay Jinshan Xiong (Inactive) added a comment -

          Hi Yang Sheng, thank you very much for the reproducer script and log. I'm looking at it now.


          People

            Assignee: ys Yang Sheng
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 9
