Test failure on test suite insanity, subtest test_0

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.1.0
    • None
    • None
    • 3
    • 5074

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b573b3c0-5c15-11e0-a272-52540025f9af.

      The sub-test test_0 failed with the following error:

      post-failover df: 1

        Activity

          [LU-184] Test failure on test suite insanity, subtest test_0

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master » client,ubuntu-x86_64 #27
          LU-184 Keep orphan on failover umount

          Oleg Drokin : 5b3eecce0bad98c81a45712594037b6ec7a9024f
          Files :

          • lustre/include/lustre/lustre_idl.h
          • lustre/mdd/mdd_object.c
          • lustre/mdt/mdt_open.c
          • lustre/mdt/mdt_handler.c

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master » client,el5-i686 #27
          LU-184 Keep orphan on failover umount

          Oleg Drokin : 5b3eecce0bad98c81a45712594037b6ec7a9024f
          Files :

          • lustre/include/lustre/lustre_idl.h
          • lustre/mdt/mdt_open.c
          • lustre/mdt/mdt_handler.c
          • lustre/mdd/mdd_object.c

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master-centos5 #199
          LU-184 Keep orphan on failover umount

          Oleg Drokin : 5b3eecce0bad98c81a45712594037b6ec7a9024f
          Files :

          • lustre/mdt/mdt_open.c
          • lustre/mdt/mdt_handler.c
          • lustre/mdd/mdd_object.c
          • lustre/include/lustre/lustre_idl.h

          hudson Build Master (Inactive) added a comment:

          Integrated in lustre-master » client,el5-x86_64 #27
          LU-184 Keep orphan on failover umount

          Oleg Drokin : 5b3eecce0bad98c81a45712594037b6ec7a9024f
          Files :

          • lustre/mdt/mdt_handler.c
          • lustre/include/lustre/lustre_idl.h
          • lustre/mdt/mdt_open.c
          • lustre/mdd/mdd_object.c

          niu Niu Yawei (Inactive) added a comment:

          Yes, without this patch it can be reproduced every time by running runracer + insanity, and 'runracer + insanity' passed after the patch was applied.

          tappro Mikhail Pershin added a comment:

          You were able to reproduce that issue; does it disappear with that patch?

          tappro Mikhail Pershin added a comment:

          OK, got it, but I wonder: shouldn't we use one check for both client removal and orphan cleanup? Currently obd_fail is checked to decide about client record removal, but the OBD_OPT_FAILOVER flag is checked for orphans. I think we need a common check for both cases.
          niu Niu Yawei (Inactive) added a comment (edited):

          Hi, Tappro

          In 1.8, the 'unlink_orphan' param of mds_mfd_close() should be false when it is called from umount, since the 'OBD_OPT_FAILOVER' flag should have been set in exp->exp_flags when it is not a force umount.

          (See server_put_super() -> class_manual_cleanup() -> class_process_config() -> class_cleanup() -> class_disconnect_exports() -> class_disconnect_export_list() -> mds_disconnect() -> mds_cleanup_mfd() -> mds_mfd_close(). obd_fail is set in server_put_super(); exp_flags is set in class_disconnect_export_list().)
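
The 1.8 behaviour described in this comment can be modelled with a small sketch. It is only an illustration of the reasoning, not Lustre code: every `sketch_` name, the reduced structs, and the flag values are hypothetical stand-ins, and treating every non-forced umount as a failover umount is an assumption taken from the wording of the comment.

```c
#include <stdbool.h>

/* Illustrative values only; stand-ins for OBD_OPT_FORCE / OBD_OPT_FAILOVER. */
enum {
    SKETCH_OPT_FORCE    = 0x1,
    SKETCH_OPT_FAILOVER = 0x2
};

/* Hypothetical, heavily reduced stand-ins for the obd device and an export. */
struct sketch_obd {
    bool fail;   /* models obd_fail */
    bool force;  /* models a forced umount */
};

struct sketch_export {
    unsigned int flags; /* models exp->exp_flags */
};

/* Models the umount entry point: a non-forced umount is treated as failover
 * (an assumption based on the comment, not on the real code). */
static void sketch_begin_umount(struct sketch_obd *obd, bool force)
{
    obd->force = force;
    obd->fail = !force;
}

/* Models the disconnect step that copies the umount mode into the export. */
static void sketch_disconnect_export(const struct sketch_obd *obd,
                                     struct sketch_export *exp)
{
    exp->flags = 0;
    if (obd->fail)
        exp->flags |= SKETCH_OPT_FAILOVER;
    if (obd->force)
        exp->flags |= SKETCH_OPT_FORCE;
}

/* The value this model would pass as 'unlink_orphan' on the final close:
 * orphans are unlinked only when the export is not flagged for failover. */
static bool sketch_unlink_orphan(const struct sketch_export *exp)
{
    return (exp->flags & SKETCH_OPT_FAILOVER) == 0;
}
```

In this model a plain umount leaves the failover flag set on the export, so the final close keeps the orphan; a forced umount clears it and the orphan is unlinked.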

          tappro Mikhail Pershin added a comment:

          Niu, yes, I agree, though I can't find how 1.8 keeps orphans. Can you show me, please?

          niu Niu Yawei (Inactive) added a comment:

          As far as I can see, the problem is that we have to preserve client data on failover umount for future recovery; however, the orphan was cleared while closing files on umount, and this inconsistency causes an open replay error.

          For 1.8, we don't have this problem, because the final close on failover umount doesn't clear the orphan.

          My proposal is that we should keep the orphan on failover umount, just like 1.8 does. This requires some code changes in the close/orphan handling path, and in the post-recovery orphan cleanup as well. Tappro/Andreas/Wangdi/Oleg: does that sound OK? If you all agree, I will try to make a patch to fix it this way. Thank you.
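
The inconsistency described in this comment can be shown with a toy state model. This is a sketch under stated assumptions, not the actual patch: the struct, the `sketch_` functions, and the choice of `-ENOENT` as the replay error are all hypothetical, since the ticket only reports the symptom "post-failover df: 1".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced model of what survives an MDS failover umount. */
struct sketch_mds_state {
    bool client_record; /* client entry preserved in last_rcvd */
    bool orphan;        /* open-unlinked file still present on disk */
};

/* Final close during a failover umount. The client record is always kept
 * for recovery; the orphan survives only if keep_orphan is true. */
static void sketch_failover_umount(struct sketch_mds_state *mds,
                                   bool keep_orphan)
{
    mds->client_record = true;
    if (!keep_orphan)
        mds->orphan = false; /* the behaviour described as the bug */
}

/* Open replay after restart: 0 on success, negative errno on failure.
 * The specific error code is an assumption made for illustration. */
static int sketch_open_replay(const struct sketch_mds_state *mds)
{
    if (!mds->client_record)
        return 0; /* no client to recover, nothing to replay */
    return mds->orphan ? 0 : -ENOENT;
}

/* Post-recovery cleanup: an orphan that no recovered client still holds
 * open can be removed at this point. */
static void sketch_post_recovery_cleanup(struct sketch_mds_state *mds,
                                         bool still_open)
{
    if (!still_open)
        mds->orphan = false;
}
```

With `keep_orphan` false the model ends up with a client record but no orphan, so the replay fails; with `keep_orphan` true the replay succeeds and the orphan is removed later by the post-recovery cleanup.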

          tappro Mikhail Pershin added a comment:

          obd_fail is set to 1 by server_stop_servers() just before the class_manual_cleanup() call. That is why we have clients in last_rcvd but no orphans.

          People

            niu Niu Yawei (Inactive)
            maloo Maloo