Lustre / LU-5584

EL7 client: Test failure on test suite runtests, subtest test_1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: None
    • Labels: None
    • Severity: 3
    • Rank: 15577

    Description

      This issue was created by maloo for Minh Diep <minh.diep@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/d54472f6-2fc6-11e4-957a-5254006e85c2.

      The sub-test test_1 failed with the following error:

      old and new files are different: rc=22

      Info required for matching: runtests 1

      Attachments

        1. after-umount-log (3.39 MB)
        2. copy-end-log (3.41 MB)

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.7

          bogl Bob Glossman (Inactive) added a comment - edited

          I've verified that with the recent landing of http://review.whamcloud.com/#/c/12103 in master I no longer see this failure when running the test on el7 clients. I think this ticket can be marked Fixed.

          ys Yang Sheng added a comment -

          The umount will hang when invoked with CL_FSYNC_ALL in replay-single test_89:

          {noformat}
          [ 8119.181431] umount D ffff88003fc14580 0 4225 3612 0x00000080
          [ 8119.181431] ffff880020cb7ae0 0000000000000086 ffff880020cb7fd8 0000000000014580
          [ 8119.181431] ffff880020cb7fd8 0000000000014580 ffff880012100b60 ffff880009f50e80
          [ 8119.181431] ffff880009f50e88 7fffffffffffffff ffff880012100b60 0000000000000000
          [ 8119.181431] Call Trace:
          [ 8119.181431] [<ffffffff815e71b9>] schedule+0x29/0x70
          [ 8119.181431] [<ffffffff815e50b9>] schedule_timeout+0x209/0x2d0
          [ 8119.181431] [<ffffffffa0783d60>] ? ptlrpcd_add_req+0x210/0x300 [ptlrpc]
          [ 8119.181431] [<ffffffffa0764370>] ? lustre_swab_obdo+0x100/0x100 [ptlrpc]
          [ 8119.181431] [<ffffffff815e76e6>] wait_for_completion+0x116/0x170
          [ 8119.181431] [<ffffffff81097700>] ? wake_up_state+0x20/0x20
          [ 8119.181431] [<ffffffffa096a944>] osc_io_fsync_end+0x74/0xa0 [osc]
          [ 8119.181431] [<ffffffffa0572a2d>] cl_io_end+0x5d/0x150 [obdclass]
          [ 8119.181431] [<ffffffffa09b61fb>] lov_io_end_wrapper+0xdb/0xe0 [lov]
          [ 8119.181431] [<ffffffffa09b63a4>] lov_io_fsync_end+0x84/0x1c0 [lov]
          [ 8119.181431] [<ffffffffa0572a2d>] cl_io_end+0x5d/0x150 [obdclass]
          [ 8119.181431] [<ffffffffa0576733>] cl_io_loop+0xb3/0x190 [obdclass]
          [ 8119.181431] [<ffffffffa0f0022b>] cl_sync_file_range+0x40b/0x610 [lustre]
          [ 8119.181431] [<ffffffffa0f13fba>] ll_delete_inode+0x10a/0x230 [lustre]
          [ 8119.181431] [<ffffffff811db91e>] ? inode_wait_for_writeback+0x2e/0x40
          [ 8119.181431] [<ffffffff811cae27>] evict+0xa7/0x170
          [ 8119.181431] [<ffffffff811caf2e>] dispose_list+0x3e/0x50
          [ 8119.181431] [<ffffffff811cbb14>] evict_inodes+0x114/0x140
          [ 8119.181431] [<ffffffff811b1fc8>] generic_shutdown_super+0x48/0xe0
          [ 8119.181431] [<ffffffff811b2242>] kill_anon_super+0x12/0x20
          [ 8119.181431] [<ffffffffa055880a>] lustre_kill_super+0x7a/0x80 [obdclass]
          [ 8119.181431] [<ffffffff811b265d>] deactivate_locked_super+0x3d/0x60
          [ 8119.181431] [<ffffffff811b26c6>] deactivate_super+0x46/0x60
          [ 8119.181431] [<ffffffff811cf455>] mntput_no_expire+0xc5/0x120
          [ 8119.181431] [<ffffffff811d058f>] SyS_umount+0x9f/0x3c0
          [ 8119.181431] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
          {noformat}

          But it does not hang with CL_FSYNC_LOCAL. I have updated the patch.
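
          For context, a hedged sketch of the fsync modes involved (paraphrased from Lustre's lustre/include/cl_object.h; the exact values and comments may differ between versions):

          {noformat}
          /* Approximate paraphrase of enum cl_fsync_mode from cl_object.h. */
          enum cl_fsync_mode {
                  CL_FSYNC_DISCARD = 0, /* discard dirty pages without writing them */
                  CL_FSYNC_NONE    = 1, /* start writeback, don't wait for it */
                  CL_FSYNC_LOCAL   = 2, /* start writeback and wait for the
                                         * write RPCs to complete */
                  CL_FSYNC_ALL     = 3, /* additionally wait for the OSTs to
                                         * commit the data; during recovery
                                         * (replay-single test_89) that commit
                                         * may be delayed indefinitely, hence
                                         * the hang in osc_io_fsync_end() above */
          };
          {noformat}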


          jay Jinshan Xiong (Inactive) added a comment -

          It looks like redirty_tail() is called on the inode, which puts it back on the b_dirty list with a newer timestamp. This is why it was skipped by the 2nd __sync_filesystem(), and therefore the dirty data was not flushed. I think Yang Sheng's patch is the right fix for the problem.
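
          A hedged sketch of the kernel-side mechanics (condensed from 3.10-era fs/fs-writeback.c, as shipped in EL7; the function names are real but the bodies are simplified):

          {noformat}
          /* Condensed from fs/fs-writeback.c (3.10-era kernels); simplified. */
          static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
          {
                  if (!list_empty(&wb->b_dirty)) {
                          struct inode *tail = wb_inode(wb->b_dirty.next);

                          /* Keep b_dirty sorted by dirtied_when: if this inode
                           * would sort before the newest entry, re-stamp it so
                           * it looks freshly dirtied. */
                          if (time_before(inode->dirtied_when, tail->dirtied_when))
                                  inode->dirtied_when = jiffies;
                  }
                  list_move(&inode->i_wb_list, &wb->b_dirty);
          }

          /* When writeback work is later queued, inodes dirtied after the
           * work's cutoff are skipped -- including one that redirty_tail()
           * just re-stamped, so the 2nd __sync_filesystem() pass never
           * writes it out. */
          {noformat}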


          jay Jinshan Xiong (Inactive) added a comment -

          When a filesystem is being unmounted, it syncs dirty data by calling sync_filesystem(), which calls __sync_filesystem() twice: the first call just starts the writeback and the second waits for the writeback to finish. Both calls should reach ll_writepages(), but for an unknown reason the 2nd call didn't reach it.
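
          For reference, a sketch of that path as it appears in 3.10-era fs/sync.c (trimmed; the real sync_filesystem() also checks for read-only superblocks):

          {noformat}
          /* Trimmed from fs/sync.c (3.10-era kernels, as shipped in EL7). */
          static int __sync_filesystem(struct super_block *sb, int wait)
          {
                  if (wait)
                          sync_inodes_sb(sb);     /* write and wait on everything */
                  else
                          writeback_inodes_sb(sb, WB_REASON_SYNC); /* start only */

                  if (sb->s_op->sync_fs)
                          sb->s_op->sync_fs(sb, wait);
                  return __sync_blockdev(sb->s_bdev, wait);
          }

          int sync_filesystem(struct super_block *sb)
          {
                  int ret;

                  ret = __sync_filesystem(sb, 0); /* pass 1: kick off writeback */
                  if (ret < 0)
                          return ret;
                  return __sync_filesystem(sb, 1); /* pass 2: wait for it */
          }
          {noformat}

          Both passes flush dirty inodes through the filesystem's ->writepages(), which for a Lustre client is ll_writepages().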

          My temporary solution is to make ll_writepages() always wait for the writeback to finish:

          {noformat}
          diff --git a/lustre/llite/rw.c b/lustre/llite/rw.c
          index ff25525..778609b 100644
          --- a/lustre/llite/rw.c
          +++ b/lustre/llite/rw.c
          @@ -1080,7 +1080,7 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
                          }
                  }

          -       mode = CL_FSYNC_NONE;
          +       mode = CL_FSYNC_LOCAL;
                  if (wbc->sync_mode == WB_SYNC_ALL)
                          mode = CL_FSYNC_LOCAL;
          {noformat}

          I can't see this issue any more after applying the above patch.

          Yang Sheng, if you have any time, can you please investigate it further to figure out why the 2nd __sync_filesystem() didn't reach ll_writepages()?

          Thanks,
          Jinshan


          jay Jinshan Xiong (Inactive) added a comment -

          Hi Yang Sheng, thank you very much for the reproducer script and log. I'm looking at it now.


          People

            Assignee: ys Yang Sheng
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 9
