[LU-5951] sanity test_39k: mtime is lost on close Created: 24/Nov/14  Updated: 13/Dec/16  Resolved: 09/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0, Lustre 2.5.4
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: 22pl

Issue Links:
Blocker
is blocking LU-3289 IU Shared Secret Key authentication a... Resolved
Duplicate
Related
is related to LU-5006 chown/chgrp doesn't work for files cr... Resolved
is related to LU-5319 Support multiple slots per client in ... Resolved
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened
is related to LU-7252 replay-single test_116b:process_req_l... Resolved
is related to LU-7182 LBUG during key reestablishment with ... Resolved
Severity: 3
Rank (Obsolete): 16613

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/01943f56-7150-11e4-b80a-5254006e85c2.

The sub-test test_39k failed with the following error:

mtime is lost on close: 1416505386, should be 1384969360

I ran 39k in a loop locally and saw the same failure in 2 out of 256 runs.
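
For readers unfamiliar with the check, the following is a minimal sketch of what "mtime is lost on close" means. It is not the sanity.sh test itself. It assumes the test writes to a file that is still open, sets mtime to a past value, closes the file, and then compares the mtime. The helper name sk_mtime_survives_close() is hypothetical. Note that the two values in the error above differ by 31,536,026 seconds, which is a 365-day year (31,536,000 s) plus 26 seconds; that would be consistent with a mtime set about one year into the past being overwritten by a current timestamp.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Lustre or its test suite).
 * Write to an open file, set its mtime to want_mtime while it is still
 * open, close it, and report whether the mtime survived the close.
 * Returns 1 if preserved, 0 if lost, -1 on an I/O error.
 */
static int sk_mtime_survives_close(const char *path, time_t want_mtime)
{
	struct timespec ts[2];
	struct stat st;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	/* Dirty the file so that close has write timestamps to flush. */
	if (write(fd, "x", 1) != 1) {
		close(fd);
		return -1;
	}
	ts[0].tv_sec = 0;
	ts[0].tv_nsec = UTIME_OMIT;	/* leave atime alone */
	ts[1].tv_sec = want_mtime;	/* explicit mtime, like a setattr */
	ts[1].tv_nsec = 0;
	if (futimens(fd, ts) != 0) {
		close(fd);
		return -1;
	}
	if (close(fd) != 0)
		return -1;
	if (stat(path, &st) != 0)
		return -1;
	return st.st_mtime == want_mtime ? 1 : 0;
}
```

On a local file system a call such as sk_mtime_survives_close(path, 1384969360) is expected to return 1; the failure reported here corresponds to the 0 case on a Lustre mount.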

Here are all the instances from maloo:

https://testing.hpdd.intel.com/sub_tests/7e8dab70-069d-11e2-9e80-52540035b04c ~2012-09-24
https://testing.hpdd.intel.com/sub_tests/9f92a846-0c4e-11e2-8132-52540035b04c ~2012-10-01

https://testing.hpdd.intel.com/sub_tests/bed3a506-06c6-11e4-9c81-5254006e85c2 2014-07-08 01:42:48 UTC
https://testing.hpdd.intel.com/sub_tests/127d427e-0c86-11e4-8fe6-5254006e85c2 2014-07-15 21:43:59 UTC
https://testing.hpdd.intel.com/sub_tests/ab1c3752-2a76-11e4-8657-5254006e85c2 2014-08-23 00:04:17 UTC

https://testing.hpdd.intel.com/sub_tests/1791dfdc-6d01-11e4-8bd3-5254006e85c2 2014-11-14 08:35:54 UTC
https://testing.hpdd.intel.com/sub_tests/545725c0-6db6-11e4-a728-5254006e85c2 2014-11-15 20:02:55 UTC
https://testing.hpdd.intel.com/sub_tests/9c2721d8-7078-11e4-a6ba-5254006e85c2 2014-11-19 16:37:26 UTC
https://testing.hpdd.intel.com/sub_tests/bb99d730-712d-11e4-9495-5254006e85c2 2014-11-20 17:12:53 UTC
https://testing.hpdd.intel.com/sub_tests/2a99b35e-7150-11e4-b80a-5254006e85c2 2014-11-20 17:12:53 UTC
https://testing.hpdd.intel.com/sub_tests/1db886d4-7177-11e4-89a9-5254006e85c2 2014-11-21 00:07:35 UTC

Info required for matching: sanity 39k



 Comments   
Comment by Andreas Dilger [ 24/Nov/14 ]

Looking back at patches that landed before Nov 14, I found http://review.whamcloud.com/12243 from LU-5006, which landed on Nov 11. It is the only patch from that period that appears related to this part of the code.

Comment by Andreas Dilger [ 24/Nov/14 ]

Another possibility is http://review.whamcloud.com/10858 "LU-3259 clio: cl_lock simplification" which was also present in the earlier failures on 2014-07-08 and 2014-07-15 on their development branches, and it was landed to master on 2014-11-04.

Comment by Niu Yawei (Inactive) [ 27/Nov/14 ]

It looks like this was introduced when integrating the OFD stack:
a67ea1c5

Author: Mikhail Pershin <tappro@whamcloud.com>
Date:   Wed May 23 23:00:33 2012 +0400

    LU-1406 ofd: IO operations
    
    add IO functions to OFD

See ofd_commitrw():

+       if (cmd == OBD_BRW_WRITE) {
+               /* Don't update timestamps if this write is older than a
+                * setattr which modifies the timestamps. b=10150 */
+
+               /* XXX when we start having persistent reservations this needs
+                * to be changed to ofd_fmd_get() to create the fmd if it
+                * doesn't already exist so we can store the reservation handle
+                * there. */
+               valid = OBD_MD_FLUID | OBD_MD_FLGID;
+               fmd = ofd_fmd_find(exp, &info->fti_fid);
+               if (!fmd || fmd->fmd_mactime_xid < info->fti_xid)
+                       valid |= OBD_MD_FLATIME | OBD_MD_FLMTIME |
+                                OBD_MD_FLCTIME;

This actually should be:

if (fmd && fmd->fmd_mactime_xid > info->fti_xid)
        valid &= ~time_flags;

I'm going to cook a patch soon.
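
The rule being discussed can be sketched in isolation as follows. This is a simplified illustration, not the ofd_commitrw() code: the flag values, struct sk_fmd, and sk_write_valid() are hypothetical stand-ins, and it assumes that in the proposed form `valid` starts with the time flags set (the comment above does not show that part). Under that assumption the original and proposed conditions differ only when the two XIDs are equal.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; not the real OBD_MD_* bits. */
#define SK_FLUID	0x01u
#define SK_FLGID	0x02u
#define SK_FLATIME	0x04u
#define SK_FLMTIME	0x08u
#define SK_FLCTIME	0x10u
#define SK_TIME_FLAGS	(SK_FLATIME | SK_FLMTIME | SK_FLCTIME)

/* Simplified stand-in for the per-file modification data (fmd). */
struct sk_fmd {
	uint64_t mactime_xid;	/* XID of the last setattr that changed times */
};

/*
 * Return the attribute mask that a write with XID write_xid may update.
 * fmd may be NULL when no setattr has been recorded for the file.
 */
static unsigned int sk_write_valid(const struct sk_fmd *fmd, uint64_t write_xid)
{
	/* Assumption: the write starts out allowed to update the times. */
	unsigned int valid = SK_FLUID | SK_FLGID | SK_TIME_FLAGS;

	/* A setattr with a higher XID is newer than this write: keep its times. */
	if (fmd != NULL && fmd->mactime_xid > write_xid)
		valid &= ~SK_TIME_FLAGS;

	return valid;
}
```

With a recorded setattr XID of 10, a write with XID 5 gets only the UID/GID bits, while a write with XID 11 (or any write when fmd is NULL) keeps the time bits.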

Comment by Gerrit Updater [ 27/Nov/14 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/12865
Subject: LU-5951 ofd: typo in ofd_commitrw()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0c0b41d98d4c9d76082652d637c0e58726a5674b

Comment by Niu Yawei (Inactive) [ 27/Nov/14 ]

patch for master: http://review.whamcloud.com/12865

Comment by Jian Yu [ 30/Nov/14 ]

While verifying patch http://review.whamcloud.com/12804 on the Lustre b2_5 branch, the same failure occurred:
https://testing.hpdd.intel.com/test_sets/fdc66684-7176-11e4-89a9-5254006e85c2

Comment by Gerrit Updater [ 04/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12865/
Subject: LU-5951 clio: update timestamps after buiding rpc
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 665ad328b368f1cbd8646690999b609d7b0feaf9

Comment by Jian Yu [ 08/Dec/14 ]

Another instance on the Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/65e1bd20-7d90-11e4-8c81-5254006e85c2

Comment by Niu Yawei (Inactive) [ 05/Jan/15 ]

The patch has landed on master.

Comment by Gerrit Updater [ 07/Jan/15 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13261
Subject: LU-5951 clio: update timestamps after buiding rpc
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 7844e1db6d17ae1721c7b1955404ea12bb08b8ad

Comment by Gerrit Updater [ 27/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13261/
Subject: LU-5951 clio: update timestamps after buiding rpc
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 7b70c1598a9b484cfe7f50c584caaca5ab64f0ba

Comment by Di Wang [ 01/Jul/15 ]

This seems to be a regression; I saw it twice recently:

https://testing.hpdd.intel.com/sub_tests/3032df32-1fbc-11e5-bc94-5254006e85c2
https://testing.hpdd.intel.com/sub_tests/7e4a1dbe-1fe1-11e5-9b0e-5254006e85c2

Comment by Niu Yawei (Inactive) [ 02/Jul/15 ]

I think the regression was introduced by:

commit bf3e7f67cb33f3b4e0590ef8af3843ac53d0a4e8
Author: Gregoire Pichon <gregoire.pichon@bull.net>
Date:   Wed May 13 16:42:44 2015 +0200

    LU-5319 ptlrpc: embed highest XID in each request

    Atomically assign XIDs and put request and sending list so
    we can learn the lowest unreplied XID at any point.

    This allows to embed in every resquests the highest XID for
    which a reply has been received and does not have an unreplied
    lower-numbered XID.

    This will be used by the MDT target to release in-memory
    reply data corresponding to XIDs of reply received by the client.

    Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
    Signed-off-by: Gregoire Pichon <gregoire.pichon@bull.net>
    Change-Id: Ic88fb6db704d8e9a78a34fe16f64abb2cdffc4c4
    Reviewed-on: http://review.whamcloud.com/14793
    Tested-by: Jenkins
    Tested-by: Maloo <hpdd-maloo@intel.com>
    Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
    Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

That commit deferred the XID assignment from request packing to request sending, which breaks the fix for bug 10150. See osc_build_rpc():

        /* Need to update the timestamps after the request is built in case
         * we race with setattr (locally or in queue at OST).  If OST gets
         * later setattr before earlier BRW (as determined by the request xid),
         * the OST will not use BRW timestamps.  Sadly, there is no obvious
         * way to do this in a single call.  bug 10150 */

It looks like we have to either fix the setattr vs. BRW race in another way or fix the multi-slot patch. Any suggestions?
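
One possible interleaving behind this race can be modelled as below. This is a toy model under stated assumptions, not ptlrpc code: sk_req, sk_assign_xid(), and sk_setattr_wins() are hypothetical names, and the scenario assumed is that the client builds a BRW first (capturing its timestamps), then builds and sends a setattr, and only afterwards sends the BRW.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model; none of these names are Lustre APIs. */
static uint64_t sk_xid_counter = 1;

struct sk_req {
	uint64_t xid;
};

static void sk_assign_xid(struct sk_req *req)
{
	req->xid = sk_xid_counter++;
}

/*
 * The BRW is built first and the setattr second, but the setattr is sent
 * first. Returns true if the OST would still treat the setattr as the
 * newer request (setattr XID > BRW XID) and so ignore the BRW timestamps.
 */
static bool sk_setattr_wins(bool xid_at_pack_time)
{
	struct sk_req brw = { 0 }, setattr = { 0 };

	sk_xid_counter = 1;
	if (xid_at_pack_time) {
		sk_assign_xid(&brw);		/* packed first */
		sk_assign_xid(&setattr);	/* packed second */
	} else {
		sk_assign_xid(&setattr);	/* sent first */
		sk_assign_xid(&brw);		/* sent second */
	}
	return setattr.xid > brw.xid;
}
```

In this model sk_setattr_wins(true) holds, while sk_setattr_wins(false) does not: with send-time assignment the BRW carries the higher XID, so the OST would apply its older timestamps over the ones set by the setattr.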

Comment by Gerrit Updater [ 02/Jul/15 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/15473
Subject: LU-5951 osc: set ioepoch to ost setattr/punch/write
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fafc374824db7d69bed1c527989ea60d825200dd

Comment by Andreas Dilger [ 03/Jul/15 ]

It isn't totally clear that we need the change from http://review.whamcloud.com/14793 in order for the multi-slot code to work. While it would make the tracking of unreplied RPCs a bit more complex, having an atomic XID assignment set at "send" time is not quite the same as "unreplied" so there still needs to be a mechanism used to track which RPCs have replies.

The one major difference would be that there needs to be some mechanism to track RPC XIDs which are never sent, so that they don't permanently get stuck as the lowest unreplied XID. It would seem possible to do this in __ptlrpc_req_free() I think?

Comment by Alex Zhuravlev [ 03/Jul/15 ]

Well, if we don't track that, it's very easy to "lose" some slots: at moment X we used 8 slots, then later we were using at most 2. Using tags we can reuse only those 2 slots, but we can't report that the other slots can be reused. There is no strong need to keep that absolutely up to date.
Technically it should be possible (and not very complex) to introduce another list: ptlrpc_next_xid() (or its callers) atomically puts the RPC on the list, and after_reply() and __ptlrpc_req_free() delete the RPC from the list.
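
The list idea described above can be sketched as follows. This is only an illustration under assumptions: struct sk_import and the sk_* functions are hypothetical, a fixed-size array stands in for the list, and the locking that would make assignment and insertion atomic in real code is omitted. The landed patch may be structured quite differently.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_INFLIGHT 64

/* Hypothetical per-import tracker; not the real struct obd_import. */
struct sk_import {
	uint64_t next_xid;			/* next XID to hand out */
	uint64_t unreplied[SK_MAX_INFLIGHT];	/* assigned, not yet replied or freed */
	size_t nr_unreplied;
};

static void sk_import_init(struct sk_import *imp)
{
	imp->next_xid = 1;
	imp->nr_unreplied = 0;
}

/*
 * Assign an XID and record it as unreplied in one step. Real code would
 * do both under one lock. Returns 0 if the table is full.
 */
static uint64_t sk_next_xid(struct sk_import *imp)
{
	if (imp->nr_unreplied == SK_MAX_INFLIGHT)
		return 0;
	imp->unreplied[imp->nr_unreplied++] = imp->next_xid;
	return imp->next_xid++;
}

/* Drop an XID; used both on reply and when a request is freed unsent. */
static void sk_forget_xid(struct sk_import *imp, uint64_t xid)
{
	for (size_t i = 0; i < imp->nr_unreplied; i++) {
		if (imp->unreplied[i] == xid) {
			/* Swap the last entry into the freed position. */
			imp->unreplied[i] = imp->unreplied[--imp->nr_unreplied];
			return;
		}
	}
}

/* Highest XID that has no unreplied XID at or below it. */
static uint64_t sk_highest_replied(const struct sk_import *imp)
{
	uint64_t lowest = imp->next_xid;

	for (size_t i = 0; i < imp->nr_unreplied; i++)
		if (imp->unreplied[i] < lowest)
			lowest = imp->unreplied[i];
	return lowest - 1;
}
```

Because a request that is freed without ever being sent goes through the same sk_forget_xid() path, its XID cannot stay stuck as the lowest unreplied one, which is the concern raised about __ptlrpc_req_free().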

Comment by Niu Yawei (Inactive) [ 03/Jul/15 ]

Ok, I'll update the patch to maintain an unreplied xid list for each import.

Comment by James Nunez (Inactive) [ 06/Jul/15 ]

I've seen this issue again:
2015-07-02 18:54:42 - https://testing.hpdd.intel.com/test_sets/4ee3283c-2102-11e5-8eb6-5254006e85c2
2015-07-03 18:45:11 - https://testing.hpdd.intel.com/test_sets/f4ce726e-21e9-11e5-a388-5254006e85c2
2015-07-10 05:07:18 - https://testing.hpdd.intel.com/test_sets/0d7b77d2-26fc-11e5-925d-5254006e85c2

Comment by Gerrit Updater [ 02/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15473/
Subject: LU-5951 ptlrpc: track unreplied requests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c77e504fdac12d3be7d19a652d6c7da497018c76

Comment by Joseph Gmitter (Inactive) [ 02/Oct/15 ]

Landed for 2.8.0

Comment by Gerrit Updater [ 06/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/16734
Subject: Revert "LU-5951 ptlrpc: track unreplied requests"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e12c29a48b33a0ae7bd4147dab57dae5597954aa

Comment by Gerrit Updater [ 06/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16734/
Subject: Revert "LU-5951 ptlrpc: track unreplied requests"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c250f40ef3222dbeb92d7914a0d9f38a3525d2fb

Comment by Joseph Gmitter (Inactive) [ 06/Oct/15 ]

Reopening because the recent landing caused LU-7252. That landing has been reverted.

Comment by Joseph Gmitter (Inactive) [ 06/Oct/15 ]

The fixVersion has been updated to 2.9.0 to properly address the issue that was being addressed by http://review.whamcloud.com/15473/
This issue is not something we are currently seeing as a failure on master.

Comment by Gerrit Updater [ 08/Oct/15 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/16759
Subject: LU-5951 ptlrpc: track unreplied requests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c4af36bfdb61fdbdf7d43778859d3e8b441531b5

Comment by Jeremy Filizetti [ 01/Dec/15 ]

The patch (http://review.whamcloud.com/#/c/16759/) here is necessary for GSS Shared Key (and I assume Kerberos) to function without generating an LBUG.

Comment by Gerrit Updater [ 09/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16759/
Subject: LU-5951 ptlrpc: track unreplied requests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8c69ef1e5caf3b6800e83bb73696b4bd1ae6e613

Comment by Peter Jones [ 09/Dec/15 ]

Landed for 2.8
