Details
Type: Bug
Resolution: Duplicate
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.8.0
Severity: 3
Description
We saw this during a DNE failover soak test.
Lustre: 4374:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455822992/real 1455822992] req@ffff8807bba3b980 x1526529651962748/t0(0) o250->MGC192.168.1.108@o2ib10@0@lo:26/25 lens 520/544 e 0 to 1 dl 1455823038 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 4374:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 37 previous similar messages
LustreError: 8115:0:(tgt_lastrcvd.c:1464:tgt_clients_data_init()) soaked-MDT0001: duplicate export for client generation 3
LustreError: 8115:0:(obd_config.c:578:class_setup()) setup soaked-MDT0001 failed (-114)
LustreError: 8115:0:(obd_config.c:1666:class_config_llog_handler()) MGC192.168.1.108@o2ib10: cfg command failed: rc = -114
Lustre: cmd=cf003 0:soaked-MDT0001 1:soaked-MDT0001_UUID 2:1 3:soaked-MDT0001-mdtlov 4:f
As a result, the failed-over MDT cannot be mounted on the new MDS. This is similar to LU-7430, but there was no panic this time.
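For context, rc = -114 is -EALREADY. Below is a minimal, self-contained sketch of the kind of duplicate-generation check that can return it while per-client records from a last_rcvd-style file are scanned at mount time; the struct and function names are hypothetical and only illustrate the idea, not the actual tgt_clients_data_init() code.

/*
 * Hypothetical sketch only: reject a second client record that carries an
 * already-seen generation number, returning -EALREADY (-114).  Names and
 * layout are invented for illustration; this is not the Lustre code.
 */
#include <errno.h>
#include <stdio.h>

struct client_record {
	const char   *uuid;       /* client UUID from the on-disk slot */
	unsigned int  generation; /* connection generation counter     */
};

/* Return 0 if all generations are unique, -EALREADY on a duplicate. */
static int check_client_records(const struct client_record *recs, int count)
{
	for (int i = 0; i < count; i++) {
		for (int j = 0; j < i; j++) {
			if (recs[i].generation == recs[j].generation) {
				fprintf(stderr,
					"duplicate export for client generation %u\n",
					recs[i].generation);
				return -EALREADY; /* -114 on Linux */
			}
		}
	}
	return 0;
}

int main(void)
{
	/* Two slots ended up with generation 3, as in the log above. */
	const struct client_record recs[] = {
		{ "client-a", 2 },
		{ "client-b", 3 },
		{ "client-c", 3 },
	};

	int rc = check_client_records(recs, 3);
	printf("setup rc = %d\n", rc);
	return rc ? 1 : 0;
}

Whatever the exact trigger, once that check fails the setup of soaked-MDT0001 aborts with -114, which matches the class_setup() and cfg command failures in the log.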
I uploaded the debug log to ftp.hpdd.intel.com:/uploads/lu-7794/
The setup is as follows: there are 4 MDS nodes, each serving 2 MDTs: MDS0 (MDT0/1), MDS1 (MDT2/3), MDS2 (MDT4/5), MDS3 (MDT6/7). MDS0 and MDS1 are configured as an active/active failover pair, and MDS2 and MDS3 are configured as another active/active failover pair.
The test randomly chooses one MDS to reboot, after which its MDTs should fail over to the paired MDS (see the sketch below). In this case MDS0 was restarted, so MDT0/MDT1 should have been mounted on MDS1, but the mount failed because of this error. No, I do not think this can be easily reproduced.
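For reference, here is a tiny sketch of the pairing described above, assuming the node and target names are just shorthand for the soak cluster rather than actual configuration: given the MDS that was rebooted, it lists the MDTs that should come up on its peer.

/*
 * Illustrative sketch of the soak-test failover layout described above.
 * Node/target names are shorthand, not real cluster configuration.
 */
#include <stdio.h>
#include <string.h>

struct mdt_map {
	const char *mdt;      /* metadata target                */
	const char *primary;  /* MDS that normally serves it    */
	const char *peer;     /* active/active failover partner */
};

static const struct mdt_map layout[] = {
	{ "MDT0", "MDS0", "MDS1" }, { "MDT1", "MDS0", "MDS1" },
	{ "MDT2", "MDS1", "MDS0" }, { "MDT3", "MDS1", "MDS0" },
	{ "MDT4", "MDS2", "MDS3" }, { "MDT5", "MDS2", "MDS3" },
	{ "MDT6", "MDS3", "MDS2" }, { "MDT7", "MDS3", "MDS2" },
};

int main(void)
{
	const char *rebooted = "MDS0"; /* the MDS taken down by the test */

	/* Every MDT whose primary was rebooted should mount on the peer. */
	for (size_t i = 0; i < sizeof(layout) / sizeof(layout[0]); i++) {
		if (strcmp(layout[i].primary, rebooted) == 0)
			printf("%s fails over to %s\n",
			       layout[i].mdt, layout[i].peer);
	}
	return 0;
}

With rebooted = "MDS0" this prints that MDT0 and MDT1 fail over to MDS1, which is exactly the mount that failed here.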
Maybe Frank or Cliff will know more.