[LU-4592] mdt_reint_open()) @@@ OPEN & CREAT not in open replay Created: 05/Feb/14  Updated: 23/Jan/16  Resolved: 23/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oz Rentas Assignee: Hongchao Zhang
Resolution: Done Votes: 0
Labels: None
Environment:

Lustre 2.1.5 servers, LLNL Chaos clients


Attachments: File client.807442     File mds00.20140131.17     File mds01.20140131.17     File slurm-807442.out    
Severity: 2
Rank (Obsolete): 12545

 Description   

As part of preparation testing, the customer performed a failover test: the primary MDS was rebooted to confirm that the standby MDS would take over without interrupting a running job. The job died when the client was unable to open a file.

Three files are attached:
mds00.20140131.17 - primary MDS that was rebooted
mds01.20140131.17 - secondary MDS that took over when mds00 went down
client.807442 - client logs from the two compute nodes, mu0104 and mu0105, running the job (#807442) that failed

Error reported on MDS01 -
Jan 31 17:07:12 l1-mds01 kernel: : LustreError: 18626:0:(mdt_open.c:1314:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff881006dda400 x1458783605491287/t0(30064772087) o101->8eb15a41-9744-ff91-d294-57256d6605bc@10.11.16.104@tcp:0/0 lens 544/4552 e 0 to 0 dl 1391213274 ref 1 fl Interpret:/4/0 rc 0/0

Errors reported on client (status 116 is ESTALE) -
Jan 31 17:07:12 mu0104 kernel: : LustreError: 2376:0:(client.c:2634:ptlrpc_replay_interpret()) @@@ status 116, old was 0 req@ffff88025258e400 x1458783605491285/t30064772084(30064772084) o35->l1-MDT0000-mdc-ffff8804014a9000@10.1.15.2@o2ib5:23/10 lens 360/424 e 0 to 0 dl 1391213270 ref 2 fl Interpret:R/4/0 rc -116/-116
Jan 31 17:07:13 mu0104 kernel: : LustreError: 2376:0:(client.c:2634:ptlrpc_replay_interpret()) @@@ status 116, old was 0 req@ffff88017c924400 x1458783605697158/t30064772108(30064772108) o35->l1-MDT0000-mdc-ffff8804014a9000@10.1.15.2@o2ib5:23/10 lens 360/424 e 0 to 0 dl 1391213270 ref 2 fl Interpret:R/4/0 rc -116/-116



 Comments   
Comment by Peter Jones [ 06/Feb/14 ]

Hongchao

Could you please advise on this one?

Thanks

Peter

Comment by Hongchao Zhang [ 07/Feb/14 ]

Hi Oz,

Do you mount Lustre with ACL enabled and with "identity_upcall" disabled?

Jan 31 17:17:55 l1-mds01 kernel: : Lustre: 22128:0:(mdt_lproc.c:414:lprocfs_wr_identity_upcall()) l1-MDT0000: disable "identity_upcall" with ACL enabled maybe cause unexpected "EACCESS"
Jan 31 17:17:55 l1-mds01 kernel: : Lustre: 22128:0:(mdt_lproc.c:416:lprocfs_wr_identity_upcall()) l1-MDT0000: identity upcall set to NONE

The actual failure in the job is just -EACCES (Permission denied):

...
Rank 26 Host mu0104.localdomain FATAL ERROR 1391213863: Unable to open file /lustre/lscratch1/atorrez/out.1391213561.26 for read. (errno=Permission denied) (MPI_Error = 42)
Rank 28 Host mu0104.localdomain FATAL ERROR 1391213863: Unable to open file /lustre/lscratch1/atorrez/out.1391213561.28 for read. (errno=Permission denied) (MPI_Error = 42)
Rank 29 Host mu0104.localdomain FATAL ERROR 1391213863: Unable to open file /lustre/lscratch1/atorrez/out.1391213561.29 for read. (errno=Permission denied) (MPI_Error = 42)
Rank 47 Host mu0104.localdomain FATAL ERROR 1391213863: Unable to open file /lustre/lscratch1/atorrez/out.1391213561.47 for read. (errno=Permission denied) (MPI_Error = 42)
...

Could you please test without ACL to check whether that is the cause? Thanks!
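
For reference, here is a minimal sketch of how to check both settings on the MDS; the target name l1-MDT0000 and device /dev/mapper/vg_l1-mdt below are taken from the logs above, so adjust them for your system:

[root@l1-mds01 ~]# lctl get_param mdt.l1-MDT0000.identity_upcall
[root@l1-mds01 ~]# tunefs.lustre --dryrun /dev/mapper/vg_l1-mdt | grep -i "mount opts"

The first command shows whether the identity upcall is disabled (NONE); the second prints the persistent mount options recorded on the MDT without modifying it, which is where an "acl" flag would appear.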

Comment by Oz Rentas [ 14/Feb/14 ]

The customer reports they are not mounting with ACL support, as seen here:
/dev/mapper/vg_l1-mdt on /lustre/l1/mdt type lustre (rw)

Any other suggestions on where we can look?

Side note - on my system I was able to reproduce the error they received by setting identity_upcall to NONE and mounting with ACL:
Feb 11 09:56:12 es0 kernel: : Lustre: 7799:0:(mdt_lproc.c:372:lprocfs_wr_identity_upcall()) testfs-MDT0000: disable "identity_upcall" with ACL enabled maybe cause unexpected "EACCESS"

[root@es0 ~]# mount |grep mdt
/dev/mapper/vg_testfs-mdt on /lustre/testfs/mdt type lustre (rw,acl)
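
For completeness, the reproduction boiled down to roughly the following (testfs names as in my output above):

[root@es0 ~]# mount -t lustre -o acl /dev/mapper/vg_testfs-mdt /lustre/testfs/mdt
[root@es0 ~]# lctl set_param mdt.testfs-MDT0000.identity_upcall=NONE

Writing NONE to identity_upcall while the MDT is mounted with ACL is what triggers the lprocfs_wr_identity_upcall() warning shown above.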

Comment by Oz Rentas [ 18/Feb/14 ]

Any updates on this one?

Comment by Hongchao Zhang [ 19/Feb/14 ]

Sorry for the delayed response.

From the code, this warning is only printed when the MDT is mounted with ACL enabled:

static int lprocfs_wr_identity_upcall(struct file *file, const char *buffer,
                                      unsigned long count, void *data)
{
        ...
        /* reached only when mo_acl is 1, i.e. ACL is enabled */
        if (strcmp(hash->uc_upcall, "NONE") == 0 && mdt->mdt_opts.mo_acl)
                CWARN("%s: disable \"identity_upcall\" with ACL enabled maybe "
                      "cause unexpected \"EACCESS\"\n", mdt_obd_name(mdt));
        ...
}

Is it possible that mds00 mounts the MDT without ACL but mds01 mounts it with ACL enabled?

Thanks!

Comment by Bobbie Lind (Inactive) [ 26/Mar/14 ]

After being onsite with the customer, I can confirm that, per the output of the mount command, the system does not appear to be mounting with ACLs.

/dev/mapper/vg_l1-mdt on /lustre/l1/mdt type lustre (rw)

Following up on Oz's question: is there somewhere else the target might show as being mounted with ACLs, which I could check the next time I'm onsite?

Comment by Hongchao Zhang [ 11/Apr/14 ]

Currently, the mount options are not printed in the mount info when the mount type is "lustre" (they do show when the device is mounted with the "ldiskfs" type).

The default mount options recorded in the superblock can contain "acl" (that is the case on my local RHEL6.5/x86_64 node). Could you please mount the MDT with "-o noacl" explicitly and retest? Thanks
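
As a sketch, assuming the device and mount point from the earlier comments: dumpe2fs reads the ldiskfs superblock, where "acl" may be recorded as a default mount option, and the remount makes the noacl choice explicit.

[root@l1-mds01 ~]# dumpe2fs -h /dev/mapper/vg_l1-mdt | grep "Default mount options"
[root@l1-mds01 ~]# umount /lustre/l1/mdt
[root@l1-mds01 ~]# mount -t lustre -o noacl /dev/mapper/vg_l1-mdt /lustre/l1/mdt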

Comment by Hongchao Zhang [ 22/Jan/16 ]

Hi Oz,
Do you need any more work on this ticket, or can we close it?
Thanks

Comment by Oz Rentas [ 22/Jan/16 ]

Yes, it can be closed. Thanks.

Comment by John Fuchs-Chesney (Inactive) [ 23/Jan/16 ]

Thanks Oz and Hongchao.

~ jfc.
