
[LU-5951] sanity test_39k: mtime is lost on close

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0, Lustre 2.8.0, Lustre 2.5.4
    • Severity: 3
    • Rank (Obsolete): 16613

    Description

      This issue was created by maloo for John Hammond <john.hammond@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/01943f56-7150-11e4-b80a-5254006e85c2.

      The sub-test test_39k failed with the following error:

      mtime is lost on close: 1416505386, should be 1384969360
      

      I ran 39k in a loop locally and saw the same failure in 2 out of 256 runs.
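
      For reference, the failing check is semantically equivalent to the sketch below: write to a file, set its mtime explicitly, and verify that the close-time attribute update does not overwrite the explicit mtime. (The real test is a shell script in the sanity suite; the path and timestamp here are illustrative.)

      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/stat.h>
      #include <sys/time.h>
      #include <unistd.h>

      int main(void)
      {
              const char *path = "/mnt/lustre/f39k";  /* illustrative path */
              struct timeval tv[2] = { { 1384969360, 0 }, { 1384969360, 0 } };
              struct stat st;
              int fd;

              fd = open(path, O_CREAT | O_WRONLY, 0644);
              if (fd < 0 || write(fd, "x", 1) != 1)
                      return 1;

              /* Explicitly set atime/mtime after the write; this setattr is
               * the last mtime-changing operation before close. */
              if (utimes(path, tv) < 0)
                      return 1;

              /* close() must not clobber the explicit mtime with the write's
               * timestamps; the reported failure is exactly that clobbering. */
              close(fd);

              if (stat(path, &st) < 0)
                      return 1;
              if (st.st_mtime != 1384969360) {
                      fprintf(stderr, "mtime is lost on close: %ld, should be %d\n",
                              (long)st.st_mtime, 1384969360);
                      return 1;
              }
              return 0;
      }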

      Here are all the instances from maloo:

      https://testing.hpdd.intel.com/sub_tests/7e8dab70-069d-11e2-9e80-52540035b04c ~2012-09-24
      https://testing.hpdd.intel.com/sub_tests/9f92a846-0c4e-11e2-8132-52540035b04c ~2012-10-01

      https://testing.hpdd.intel.com/sub_tests/bed3a506-06c6-11e4-9c81-5254006e85c2 2014-07-08 01:42:48 UTC
      https://testing.hpdd.intel.com/sub_tests/127d427e-0c86-11e4-8fe6-5254006e85c2 2014-07-15 21:43:59 UTC
      https://testing.hpdd.intel.com/sub_tests/ab1c3752-2a76-11e4-8657-5254006e85c2 2014-08-23 00:04:17 UTC

      https://testing.hpdd.intel.com/sub_tests/1791dfdc-6d01-11e4-8bd3-5254006e85c2 2014-11-14 08:35:54 UTC
      https://testing.hpdd.intel.com/sub_tests/545725c0-6db6-11e4-a728-5254006e85c2 2014-11-15 20:02:55 UTC
      https://testing.hpdd.intel.com/sub_tests/9c2721d8-7078-11e4-a6ba-5254006e85c2 2014-11-19 16:37:26 UTC
      https://testing.hpdd.intel.com/sub_tests/bb99d730-712d-11e4-9495-5254006e85c2 2014-11-20 17:12:53 UTC
      https://testing.hpdd.intel.com/sub_tests/2a99b35e-7150-11e4-b80a-5254006e85c2 2014-11-20 17:12:53 UTC
      https://testing.hpdd.intel.com/sub_tests/1db886d4-7177-11e4-89a9-5254006e85c2 2014-11-21 00:07:35 UTC

      Info required for matching: sanity 39k

      Activity

            [LU-5951] sanity test_39k: mtime is lost on close

            Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15473/
            Subject: LU-5951 ptlrpc: track unreplied requests
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c77e504fdac12d3be7d19a652d6c7da497018c76
            James Nunez (Inactive) added a comment - edited

            I've seen this issue again:

            2015-07-02 18:54:42 - https://testing.hpdd.intel.com/test_sets/4ee3283c-2102-11e5-8eb6-5254006e85c2
            2015-07-03 18:45:11 - https://testing.hpdd.intel.com/test_sets/f4ce726e-21e9-11e5-a388-5254006e85c2
            2015-07-10 05:07:18 - https://testing.hpdd.intel.com/test_sets/0d7b77d2-26fc-11e5-925d-5254006e85c2

            Niu Yawei (Inactive) added a comment -

            OK, I'll update the patch to maintain an unreplied XID list for each import.
            Alex Zhuravlev added a comment - edited

            Well, if we don't track that, then it's very easy to "lose" some slots: at moment X we used 8 slots, then later we were using at most 2 slots. Using tags we can reuse only those 2 slots, but we can't report that the other slots can be reused. There is no strong need to maintain that list absolutely up to date.
            Technically it should be possible (and not very complex) to introduce another list: ptlrpc_next_xid() (or its callers) atomically puts the RPC on the list, and after_reply() and __ptlrpc_req_free() delete the RPC from the list. A sketch of this idea follows.
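
            A minimal sketch of the list described above, under simplified assumptions: the structure layouts, field names (imp_unreplied_list, rq_unreplied_list) and helpers below are illustrative, not the actual ptlrpc code, and initialization and error paths are elided.

                #include <linux/list.h>
                #include <linux/spinlock.h>
                #include <linux/types.h>

                struct obd_import {
                        spinlock_t        imp_lock;
                        struct list_head  imp_unreplied_list;  /* sorted by rq_xid */
                        __u64             imp_last_xid;
                };

                struct ptlrpc_request {
                        __u64              rq_xid;
                        struct list_head   rq_unreplied_list;
                        struct obd_import *rq_import;
                };

                /* Assign the XID and publish the request atomically, so the set
                 * of unreplied XIDs is consistent at any point in time.  New
                 * XIDs are monotonically increasing, so adding at the tail
                 * keeps the list sorted. */
                static __u64 ptlrpc_assign_next_xid(struct ptlrpc_request *req)
                {
                        struct obd_import *imp = req->rq_import;

                        spin_lock(&imp->imp_lock);
                        req->rq_xid = ++imp->imp_last_xid;
                        list_add_tail(&req->rq_unreplied_list,
                                      &imp->imp_unreplied_list);
                        spin_unlock(&imp->imp_lock);
                        return req->rq_xid;
                }

                /* Called from both after_reply() and __ptlrpc_req_free(), so
                 * replied requests and requests that were never sent both
                 * leave the list and cannot pin the lowest unreplied XID. */
                static void ptlrpc_unreplied_del(struct ptlrpc_request *req)
                {
                        struct obd_import *imp = req->rq_import;

                        spin_lock(&imp->imp_lock);
                        if (!list_empty(&req->rq_unreplied_list))
                                list_del_init(&req->rq_unreplied_list);
                        spin_unlock(&imp->imp_lock);
                }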

            Andreas Dilger added a comment -

            It isn't totally clear that we need the change from http://review.whamcloud.com/14793 in order for the multi-slot code to work. While it would make the tracking of unreplied RPCs a bit more complex, an atomic XID assignment at "send" time is not quite the same as "unreplied", so there still needs to be a mechanism to track which RPCs have replies.

            The one major difference is that there needs to be some mechanism to track RPC XIDs that are never sent, so that they don't permanently get stuck as the lowest unreplied XID. It would seem possible to do this in __ptlrpc_req_free(), I think? A sketch of deriving the lowest unreplied XID follows.
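
            Building on the sketch above, the highest XID known to be replied (i.e. one below the lowest unreplied XID) could be derived as follows; again this is illustrative, not the actual implementation:

                /* Highest XID for which a reply was received and which has no
                 * unreplied lower-numbered XID: one below the head of the
                 * sorted unreplied list, or the last assigned XID if the list
                 * is empty (everything issued so far was replied or freed). */
                static __u64 ptlrpc_known_replied_xid(struct obd_import *imp)
                {
                        struct ptlrpc_request *req;
                        __u64 xid;

                        spin_lock(&imp->imp_lock);
                        if (list_empty(&imp->imp_unreplied_list)) {
                                xid = imp->imp_last_xid;
                        } else {
                                req = list_first_entry(&imp->imp_unreplied_list,
                                                       struct ptlrpc_request,
                                                       rq_unreplied_list);
                                xid = req->rq_xid - 1;
                        }
                        spin_unlock(&imp->imp_lock);
                        return xid;
                }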

            Gerrit Updater added a comment -

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/15473
            Subject: LU-5951 osc: set ioepoch to ost setattr/punch/write
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fafc374824db7d69bed1c527989ea60d825200dd

            Niu Yawei (Inactive) added a comment -

            I think the regression was introduced by:

            commit bf3e7f67cb33f3b4e0590ef8af3843ac53d0a4e8
            Author: Gregoire Pichon <gregoire.pichon@bull.net>
            Date:   Wed May 13 16:42:44 2015 +0200

                LU-5319 ptlrpc: embed highest XID in each request

                Atomically assign XIDs and put requests on the sending list so
                we can learn the lowest unreplied XID at any point.

                This allows embedding in every request the highest XID for
                which a reply has been received and which has no unreplied
                lower-numbered XID.

                This will be used by the MDT target to release in-memory
                reply data corresponding to XIDs of replies received by the
                client.

                Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
                Signed-off-by: Gregoire Pichon <gregoire.pichon@bull.net>
                Change-Id: Ic88fb6db704d8e9a78a34fe16f64abb2cdffc4c4
                Reviewed-on: http://review.whamcloud.com/14793
                Tested-by: Jenkins
                Tested-by: Maloo <hpdd-maloo@intel.com>
                Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
                Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

            This deferred the XID assignment from request packing to request sending, which breaks the fix for bug 10150; see osc_build_rpc():

                    /* Need to update the timestamps after the request is built in case
                     * we race with setattr (locally or in queue at OST).  If OST gets
                     * later setattr before earlier BRW (as determined by the request xid),
                     * the OST will not use BRW timestamps.  Sadly, there is no obvious
                     * way to do this in a single call.  bug 10150 */

            Looks like we have to fix the setattr vs. BRW race some other way, or just fix the multi-slot patch. Any suggestions? (A sketch of the ordering constraint follows.)
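
            To make the quoted comment concrete: the OST orders a racing setattr against a BRW by comparing request XIDs, so the BRW's timestamps must be sampled only after its XID is fixed. A minimal userspace sketch of that constraint (hypothetical helper and field names, not the Lustre code):

                #include <stdint.h>
                #include <time.h>

                /* hypothetical stand-in for ptlrpc XID assignment */
                static uint64_t last_xid;
                static uint64_t assign_next_xid(void) { return ++last_xid; }

                struct brw_rpc {
                        uint64_t xid;    /* the OST compares XIDs to order requests */
                        uint64_t mtime;  /* timestamp the OST may apply to the object */
                };

                static void build_brw(struct brw_rpc *rpc)
                {
                        /* Fix the XID first: any setattr issued after this
                         * point carries a higher XID ... */
                        rpc->xid = assign_next_xid();
                        /* ... and only then sample the timestamps, so the
                         * OST's XID comparison agrees with real time order.
                         * Assigning the XID later, at send time, lets a BRW
                         * carry stale timestamps yet win the XID comparison
                         * against a newer setattr: the bug-10150 race. */
                        rpc->mtime = (uint64_t)time(NULL);
                }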
            Di Wang (Inactive) added a comment -

            It seems a regression; I saw it twice recently:

            https://testing.hpdd.intel.com/sub_tests/3032df32-1fbc-11e5-bc94-5254006e85c2
            https://testing.hpdd.intel.com/sub_tests/7e4a1dbe-1fe1-11e5-9b0e-5254006e85c2

            Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13261/
            Subject: LU-5951 clio: update timestamps after buiding rpc
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: 7b70c1598a9b484cfe7f50c584caaca5ab64f0ba

            Gerrit Updater added a comment -

            Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13261
            Subject: LU-5951 clio: update timestamps after buiding rpc
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 7844e1db6d17ae1721c7b1955404ea12bb08b8ad

            Niu Yawei (Inactive) added a comment -

            Patch landed on master.

            People

              Assignee: Niu Yawei (Inactive)
              Reporter: Maloo
              Votes: 0
              Watchers: 16
