[LU-3036] Interop 1.8.9<->2.4 Failure on test suite sanityn test_23: atime doesn't update among nodes Created: 26/Mar/13  Updated: 17/Jun/13  Resolved: 17/Jun/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0, Lustre 1.8.9

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: None
Environment:

client: 1.8.9
server: lustre-master build #1338


Severity: 3
Rank (Obsolete): 7407

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/25418d26-948b-11e2-93c6-52540035b04c.

The sub-test test_23 failed with the following error:

test_23 failed with 1

Info required for matching: sanityn 23



 Comments   
Comment by Oleg Drokin [ 27/Mar/13 ]

I noticed that in the b2_1 the test is different, it sends the signal to multiop before reading the time with a comment that says:

        multiop_bg_pause $DIR1/f23 or20_c || return 1
        # with SOM and opencache enabled, we need to close a file and cancel
        # open lock to get atime propogated to MDS
        kill -USR1 $!

I imagine it could be that b1_8 interop needs the same thing?

Also somebody needs to check correctness of the updated test in 2.x code, it looks pretty strange now to me, what's the point of a pause if we stop it the very next thing?

Comment by Peter Jones [ 28/Mar/13 ]

Emoly

Could you please look into this one?

Thanks

Peter

Comment by Emoly Liu [ 29/Mar/13 ]

I will try per Oleg's advice.

Comment by Emoly Liu [ 06/May/13 ]

This failure is easy to reproduce on my local VMs. I tried the method that Oleg mentioned but failed.

I will investigate more.

Comment by Emoly Liu [ 08/May/13 ]

sanity test_203 on b1_8 is another atime test, whose script is similar to sanityN test_23. But I found there was some lustre version judgement there, introduced by bz=23766 for 1.8.6<->2.1.0 interop (https://projectlava.xyratex.com/show_bug.cgi?id=23766#c56).

 test_203() {
+        local lustre_version=$(get_lustre_version mds)
+        if [[ $lustre_version != 1.8* ]]; then
+               skip bug23766 mds running $lustre_version
+               return
+        fi
+
         local ATIME=`do_facet mds lctl get_param -n mds.*.atime_diff`
         echo "atime should be updated on the MDS when closing file" > $DIR/$tfile
         sync

Why we added this ? If I remove this judgement, apparently test_203 will fail for b1_8<->master interop test, as same as test_23 does. Is there anything blocking interop atime update since then?

BTW, I know the path of parameter "atime_diff" is a little different between b1_8 and others, but I don't think that will be a reason to skip this test, we can use md*.*.atime_diff instead.

Comment by Emoly Liu [ 09/May/13 ]

I forgot to say in my last two comments that the problem of "no such process" really can be fixed by Oleg's suggestion, but atime update problem still exists in b18<->master interop. The maloo test report is at https://maloo.whamcloud.com/test_sessions/a1e209cc-b872-11e2-891d-52540035b04c

On master, we update atime on close if (la_valid == LA_ATIME),

int mdt_mfd_close(struct mdt_thread_info *info, struct mdt_file_data *mfd)
{
...
        /* Update atime on close only. */
        if ((mode & MDS_FMODE_EXEC || mode & FMODE_READ || mode & FMODE_WRITE)
            && (ma->ma_valid & MA_INODE) && (ma->ma_attr.la_valid & LA_ATIME)) {
                /* Set the atime only. */
                ma->ma_valid = MA_INODE;
                ma->ma_attr.la_valid = LA_ATIME;
                rc = mo_attr_set(info->mti_env, next, ma);
        }
...
}

static int mdd_fix_attr(const struct lu_env *env, struct mdd_object *obj,
                        struct lu_attr *la, const unsigned long flags)
{
...
        if (la->la_valid == LA_ATIME) {
                /* This is atime only set for read atime update on close. */
                if (la->la_atime >= tmp_la->la_atime &&
                    la->la_atime < (tmp_la->la_atime +
                                    mdd_obj2mdd_dev(obj)->mdd_atime_diff))
                        la->la_valid &= ~LA_ATIME;
                RETURN(0);
        }
...
}

but my debug message showed that "valid" from b18 client wasn't mapped to LA_ATIME finally. I will investigate more.

Comment by Emoly Liu [ 10/May/13 ]

After looking at the code between master and b1_8, I found they used different attribute flags to trigger atime update.

MASTER: ll_close_inode_openhandle()->ll_prepare_close(): ATTR_MODE | ATTR_ATIME_SET | ATTR_MTIME_SET | ATTR_CTIME_SET;

----------------------------------------------------------------
     client          |           server
----------------------------------------------------------------
 ATTR_* ----->   MDS_ATTR*   ------>   ATTR_*   ------>   LA_* 
      attr_pack()
                          attr_unpack()
                                        mdt_attr_valid_xlate()  --->  if (in & ATTR_ATIME_SET)
----------------------------------------------------------------          out |= LA_ATIME;     

B1_8: ll_close_inode_openhandle(): OBD_MD_FLTYPE | OBD_MD_FLMODE | OBD_MD_FLATIME | OBD_MD_FLMTIME | OBD_MD_FLCTIME;

--------------------------------------------------
     client          |           server
--------------------------------------------------
 OBD_MD_FL* ----->   MDS_ATTR*   ------>   ATTR_*  
    mdc_close_pack_20()
                     OBD_MD_FL*
    mdc_close_pack_18()
                            mds_mfd_close()    -----> if ((request_body->valid & OBD_MD_FLATIME) &&
--------------------------------------------------        ((request_body->atime > LTIME_S(inode->i_atime) + mds->mds_atime_diff) )) {                      
                                                                 LTIME_S(iattr.ia_atime) = request_body->atime;
                                                                 iattr.ia_valid |= ATTR_ATIME;
                                                      }

I think we don't need to force 2.x to understand the flags from b18 client, so I decide to add lustre version check to skip this test for interop test.

I will push a patch later.

Comment by Emoly Liu [ 10/May/13 ]

patch for b18 is at http://review.whamcloud.com/#change,6289

Comment by Andreas Dilger [ 10/May/13 ]

Sorry, I don't understand your comment. The 1.8 client has support for the 2.x protocol in mdc_close_pack_20(), so it should be sending the same MDS_ATTR_* flags as the 2.x client. The mdc_close_pack_18() codepath is only used when the client is communicating with a 1.8 MDS.

So, I don't think we should be so quick to just skip this test. If the 1.8 client sends the same MDS_ATTR_* flags as the 2.x client, then the 2.x server should treat the atime update in the same way.

Comment by Emoly Liu [ 10/May/13 ]

Probably the code path in my last comment misled you.

The flags translation between 1.8 and 2.x has no problem. The problem is that they use different flags to trigger atime update.

  • 2.x uses ATTR_ATIME_SET. From the code we can see, 2.x updates atime on close only if (la->la_valid == LA_ATIME), and only ATTR_ATIME_SET can be translated into LA_ATIME.
  • but 1.8 uses ATTR_ATIME.
Comment by Andreas Dilger [ 10/May/13 ]

The protocol is supposed to be that ATTR_ATIME means the atime should be updated, but the server generates the timestamp if ATTR_ATIME_SET is not also set. See e.g. mds_fix_attr() in 1.8 updating xtime locally if ATTR_xTIME_SET is missing, and ll_setattr_raw() on the client setting ATTR_xTIME_SET if it is actually sending the atime value along in the RPC.

Looking at the code more closely, I see that there some more improvements that should be made. The conversion in mdt_setatt_unpack_rec() from MDS_ATTR_ to ATTR_ to LA_ doesn't make sense. mdt_attr_valid_xlate() should be changed to convert from MDS_ATTR_ directly to LA_* and attr_unpack() should just be removed (it is not used anywhere else).

There are several patches that are needed to fix this issue properly (in order of decreasing priority):

  • 2.1/2.4 MDS should check in mdt_setattr_unpack_rec() if MDS_ATTR_xTIME is set without MDS_ATTR_xTIME_SET, and set MDS_ATTR_xTIME_SET if the client does not have OBD_CONNECT_FULL20 (i.e. 1.8 client), before calling mdt_attr_valid_xlate(). This will fix old 1.8 clients with new 2.x servers.
  • 2.1/2.4 clients should set both MDS_ATTR_xTIME | MDS_ATTR_xTIME_SET for timestamps in ll_prepare_close(). This is for protocol correctness and allows us to fix the server-side timestamp setting in the future.
  • 2.1/2.4 clients should add sanity.sh test_203() if this is checking something different than test_23(). This improves test coverage.
  • 2.4 MDS should remove attr_unpack()
  • 1.8 clients should set both MDS_ATTR_xTIME | MDS_ATTR_xTIME_SET when converting from OBD_MD_FLATIME in mdc_close_pack_20(). This will fix new 1.8 clients with old 2.x servers.
  • 1.8 sanity.sh test_203() should remove the "skip" check, since it should be working properly with 2.x MDS. This ensures interop is working again.
Comment by Emoly Liu [ 11/May/13 ]

Thanks, Andreas! I will prepare for these patches.

BTW, do we keep supporting 1.8 client for 2.5 and later?

Comment by Andreas Dilger [ 11/May/13 ]

No, interoperability with 1.8 clients will end with Lustre 2.4, and 2.5 will not need to interoperate with these older clients anymore.

Comment by Emoly Liu [ 13/May/13 ]

patch for 2.4 is at http://review.whamcloud.com/6327. It includes the following fixes

  • 2.1/2.4 MDS should check in mdt_setattr_unpack_rec() if MDS_ATTR_xTIME is set without MDS_ATTR_xTIME_SET, and set MDS_ATTR_xTIME_SET if the client does not have OBD_CONNECT_FULL20 (i.e. 1.8 client), before calling mdt_attr_valid_xlate(). This will fix old 1.8 clients with new 2.x servers.
  • 2.1/2.4 clients should set both MDS_ATTR_xTIME | MDS_ATTR_xTIME_SET for timestamps in ll_prepare_close(). This is for protocol correctness and allows us to fix the server-side timestamp setting in the future.
  • 2.1/2.4 clients should add sanity.sh test_203() if this is checking something different than test_23(). This improves test coverage.
  • 2.4 MDS should remove attr_unpack()

BTW, sanityn.sh test_23 does the same check to sanity.sh test_203, so I didn't add test_203 but improve test_23 instead.

Comment by Emoly Liu [ 20/May/13 ]

Patch landed for 2.4 and b18

Generated at Sat Feb 10 01:30:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.