Details
-
Bug
-
Resolution: Done
-
Major
-
None
-
Lustre 2.1.5
-
None
-
Lustre 2.1.5 servers, LLNL Chaos clients
-
2
-
12545
Description
As part of preparation testing, the customer performed a failover test. The customer rebooted the primary MDS to confirm that the standby MDS would take over without interrupting the job. The job died when the client was unable to open a file.
3 files attached.
mds00.20140131.17 Primary MDS that was rebooted
mds01.20140131.17 Secondary MDS that took over when mds00 went down
client.807442 Client logs from the 2 compute nodes running the job (#807422) that failed. (The two nodes are mu0104 and mu0105)
Error reported on MDS01 -
Jan 31 17:07:12 l1-mds01 kernel: : LustreError: 18626:0:(mdt_open.c:1314:mdt_reint_open()) @@@ OPEN & CREAT not in open replay. req@ffff881006dda400 x1458783605491287/t0(30064772087) o101->8eb15a41-9744-ff91-d294-57256d6605bc@10.11.16.104@tcp:0/0 lens 544/4552 e 0 to 0 dl 1391213274 ref 1 fl Interpret:/4/0 rc 0/0
Errors reported on client -
Jan 31 17:07:12 mu0104 kernel: : LustreError: 2376:0:(client.c:2634:ptlrpc_replay_interpret()) @@@ status 116, old was 0 req@ffff88025258e400 x1458783605491285/t30064772084(30064772084) o35>l1-MDT0000-mdc-ffff8804014a9000@10.1.15.2@o2ib5:23/10 lens 360/424 e 0 to 0 dl 1391213270 ref 2 fl Interpret:R/4/0 rc -116/-116
Jan 31 17:07:13 mu0104 kernel: : LustreError: 2376:0:(client.c:2634:ptlrpc_replay_interpret()) @@@ status 116, old was 0 req@ffff88017c924400 x1458783605697158/t30064772108(30064772108) o35>l1-MDT0000-mdc-ffff8804014a9000@10.1.15.2@o2ib5:23/10 lens 360/424 e 0 to 0 dl 1391213270 ref 2 fl Interpret:R/4/0 rc -116/-116
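For triage, the request summaries in the messages above can be pulled apart with a short script. This is only a sketch, not part of Lustre: the regular expression is inferred from the three log lines quoted in this ticket, and the function name parse_req and the field names (xid, transno, msg_transno, opcode, rc, reply_rc) are labels chosen here for illustration. The errno table covers only the value seen in these logs and assumes Linux numbering, where 116 is ESTALE.

```python
import re

# Inferred from the log lines in this ticket; field names are illustrative
# labels, not official Lustre terminology.
REQ_RE = re.compile(
    r"x(?P<xid>\d+)/t(?P<transno>\d+)\((?P<msg_transno>\d+)\)\s+"
    r"o(?P<opcode>\d+)-?>.*?\brc\s+(?P<rc>-?\d+)/(?P<reply_rc>-?\d+)"
)

# Assumption: Linux errno numbering. Only the value seen in this ticket is listed.
LINUX_ERRNO_NAMES = {116: "ESTALE"}


def parse_req(line):
    """Return the numeric request fields found in one log line, or None."""
    m = REQ_RE.search(line)
    if m is None:
        return None
    fields = {name: int(value) for name, value in m.groupdict().items()}
    # Map a negative rc to a symbolic name when it is in the small table above.
    fields["rc_name"] = LINUX_ERRNO_NAMES.get(-fields["rc"])
    return fields


if __name__ == "__main__":
    sample = (
        "req@ffff88025258e400 x1458783605491285/t30064772084(30064772084) "
        "o35>l1-MDT0000-mdc-ffff8804014a9000@10.1.15.2@o2ib5:23/10 "
        "lens 360/424 e 0 to 0 dl 1391213270 ref 2 fl Interpret:R/4/0 rc -116/-116"
    )
    print(parse_req(sample))
```

Run against the attached client and MDS logs, this makes it easier to line up the failing replay requests by xid and to see that both client errors carry the same return code.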