Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.4.1, Lustre 2.5.0
- None
- failover MDS/MGS, failover OSTs
- 3
- 11555
Description
Since Lustre 2.4.0 we have had problems with clients that could not connect after, e.g., an MGS failover. We did most of our experiments with a client on the standby MGS server. The symptom was that the client only worked from the active MGS/MDS node, not from the passive one!
The reason for the problem seems to be commit d9d27cad and the following hunk of the patch:
@@ -1447,13 +1481,11 @@ static int mgs_write_log_failnids(const struct lu_env *env,
                                failnodeuuid, cliname);
                        rc = record_add_uuid(env, llh, nid, failnodeuuid);
                }
-               if (failnodeuuid) {
+               if (failnodeuuid)
                        rc = record_add_conn(env, llh, cliname, failnodeuuid);
-                       name_destroy(&failnodeuuid);
-                       failnodeuuid = NULL;
-               }
        }
+       name_destroy(&failnodeuuid);
        return rc;
 }
This leads to a wrong Lustre client llog when the Lustre block devices are formatted with multiple --servicenode options! Here is an example:
#09 (224)marker 6 (flags=0x01, v2.5.0.0) lnec-MDT0000 'add mdc' Mon Nov 11 17:03:21 2013-
#10 (080)add_uuid nid=10.3.0.34@o2ib(0x500000a030022) 0: 1:10.3.0.34@o2ib
#11 (128)attach 0:lnec-MDT0000-mdc 1:mdc 2:lnec-clilmv_UUID
#12 (136)setup 0:lnec-MDT0000-mdc 1:lnec-MDT0000_UUID 2:10.3.0.34@o2ib
#13 (080)add_uuid nid=10.3.0.34@o2ib(0x500000a030022) 0: 1:10.3.0.34@o2ib
#14 (104)add_conn 0:lnec-MDT0000-mdc 1:10.3.0.34@o2ib
#15 (080)add_uuid nid=10.3.0.35@o2ib(0x500000a030023) 0: 1:10.3.0.34@o2ib
#16 (104)add_conn 0:lnec-MDT0000-mdc 1:10.3.0.34@o2ib
#17 (160)modify_mdc_tgts add 0:lnec-clilmv 1:lnec-MDT0000_UUID 2:0 3:1 4:lnec-MDT0000-mdc_UUID
#18 (224)marker 6 (flags=0x02, v2.5.0.0) lnec-MDT0000 'add mdc' Mon Nov 11 17:03:21 2013-
The last add_uuid should have 1:10.3.0.35@o2ib instead of 1:10.3.0.34@o2ib.
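The reason is that failnodeuuid is no longer destroyed and reset to NULL after each --servicenode AKA --failnode entry, so the uuid created from the first nid of the first entry is reused for every later entry.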
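To illustrate the effect, here is a minimal stand-alone sketch of the loop logic. It is not the code from mgs_llog.c: struct failnode, struct fake_llog, write_failnids() and the reset_per_node flag are hypothetical names made up for this example, and the records are plain strings rather than real llog records. With reset_per_node set to 0 it behaves like the patched code (uuid kept across entries), with 1 it behaves like the code before commit d9d27cad.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_RECORDS 16
#define MAX_NIDS 4

/* One --servicenode (AKA --failnode) entry: a list of NID strings. */
struct failnode {
	const char *nids[MAX_NIDS];
	int nnids;
};

/* Hypothetical stand-in for the client llog: one string per record. */
struct fake_llog {
	char rec[MAX_RECORDS][96];
	int nrec;
};

static void add_uuid(struct fake_llog *log, const char *nid, const char *uuid)
{
	if (log->nrec < MAX_RECORDS)
		snprintf(log->rec[log->nrec++], sizeof(log->rec[0]),
			 "add_uuid nid=%s 1:%s", nid, uuid);
}

static void add_conn(struct fake_llog *log, const char *uuid)
{
	if (log->nrec < MAX_RECORDS)
		snprintf(log->rec[log->nrec++], sizeof(log->rec[0]),
			 "add_conn 1:%s", uuid);
}

/*
 * Simplified model of the failnid loop.
 * reset_per_node != 0: forget the uuid after each entry (old behaviour).
 * reset_per_node == 0: keep the uuid across entries (behaviour reported here).
 */
static void write_failnids(struct fake_llog *log, const struct failnode *nodes,
			   int nnodes, int reset_per_node)
{
	const char *uuid = NULL;
	int i, j;

	log->nrec = 0;
	for (i = 0; i < nnodes; i++) {
		for (j = 0; j < nodes[i].nnids; j++) {
			/* The failover node name is unknown, so the first
			 * nid seen while uuid is unset becomes the uuid. */
			if (uuid == NULL)
				uuid = nodes[i].nids[j];
			add_uuid(log, nodes[i].nids[j], uuid);
		}
		if (uuid != NULL) {
			add_conn(log, uuid);
			if (reset_per_node)
				uuid = NULL;
		}
	}
}
```

Calling write_failnids() with the two entries 10.3.0.34@o2ib and 10.3.0.35@o2ib and reset_per_node = 0 produces the same add_uuid/add_conn pattern as records #13 to #16 above. With reset_per_node = 1 the second entry gets its own uuid.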
Please revert that little patch of mgs_llog.c.
Regards,
Erich
Issue Links
- is duplicated by
  LU-4043 clients unable to reconnect after OST failover (Resolved)