Lustre / LU-8295

conf-sanity test_41c: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.10.0
    • Severity: 3

    Description

      This issue was created by maloo for nasf <fan.yong@intel.com>

      Please provide additional information about the failure here.

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/dc7b4098-3372-11e6-bbf5-5254006e85c2.

      The logs on the OST show that two mount threads were blocked:

      [12811.430006] mount           S ffff88007b4e72f0     0 25441  25436 0x00000080
      [12811.430006]  ffff8800465b7e70 0000000000000086 ffff88007b4e7300 ffff8800465b7fd8
      [12811.430006]  ffff8800465b7fd8 ffff8800465b7fd8 ffff88007b4e7300 ffff8800465b7ef8
      [12811.430006]  ffff88007b4e77f0 ffff88007b4e7300 ffff88007b4e7300 ffff88007b4e72f0
      [12811.430006] Call Trace:
      [12811.430006]  [<ffffffff8163b349>] schedule+0x29/0x70
      [12811.430006]  [<ffffffff81080f83>] do_wait+0x203/0x260
      [12811.430006]  [<ffffffff810820a0>] SyS_wait4+0x80/0x110
      [12811.430006]  [<ffffffff8107fb20>] ? task_stopped_code+0x60/0x60
      [12811.430006]  [<ffffffff816463c9>] system_call_fastpath+0x16/0x1b
      [12811.430006] sh              S ffff88007b4e2df0     0 25442  25440 0x00000080
      [12811.430006]  ffff88007a95fe70 0000000000000086 ffff88007b4e2e00 ffff88007a95ffd8
      [12811.430006]  ffff88007a95ffd8 ffff88007a95ffd8 ffff88007b4e2e00 ffff88007a95fef8
      [12811.430006]  ffff88007b4e32f0 ffff88007b4e2e00 ffff88007b4e2e00 ffff88007b4e2df0
      [12811.430006] Call Trace:
      [12811.430006]  [<ffffffff8163b349>] schedule+0x29/0x70
      [12811.430006]  [<ffffffff81080f83>] do_wait+0x203/0x260
      [12811.430006]  [<ffffffff810820a0>] SyS_wait4+0x80/0x110
      [12811.430006]  [<ffffffff8107fb20>] ? task_stopped_code+0x60/0x60
      [12811.430006]  [<ffffffff816463c9>] system_call_fastpath+0x16/0x1b
      [12811.430006] mount.lustre    S 0000000000000000     0 25443  25441 0x00000080
      [12811.430006]  ffff8800438d7ce0 0000000000000082 ffff8800438c5c00 ffff8800438d7fd8
      [12811.430006]  ffff8800438d7fd8 ffff8800438d7fd8 ffff8800438c5c00 ffff880047f4c800
      [12811.430006]  ffff8800438c5c00 ffff8800438d7e08 ffff880043836800 0000000000000000
      [12811.430006] Call Trace:
      [12811.430006]  [<ffffffff8163b349>] schedule+0x29/0x70
      [12811.430006]  [<ffffffffa0abaed6>] server_fill_super+0x1656/0x184c [obdclass]
      [12811.430006]  [<ffffffff810a6ac0>] ? wake_up_atomic_t+0x30/0x30
      [12811.430006]  [<ffffffffa0a8f9d8>] lustre_fill_super+0x328/0x950 [obdclass]
      [12811.430006]  [<ffffffffa0a8f6b0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [12811.430006]  [<ffffffff811e1f2d>] mount_nodev+0x4d/0xb0
      [12811.430006]  [<ffffffffa0a87908>] lustre_mount+0x38/0x60 [obdclass]
      [12811.430006]  [<ffffffff811e28d9>] mount_fs+0x39/0x1b0
      [12811.430006]  [<ffffffff811fe1af>] vfs_kern_mount+0x5f/0xf0
      [12811.430006]  [<ffffffff812006fe>] do_mount+0x24e/0xa40
      [12811.430006]  [<ffffffff8116e15e>] ? __get_free_pages+0xe/0x50
      [12811.430006]  [<ffffffff81200f86>] SyS_mount+0x96/0xf0
      [12811.430006]  [<ffffffff816463c9>] system_call_fastpath+0x16/0x1b
      [12811.430006] mount           S ffff88003dd65070     0 25446  25442 0x00000080
      [12811.430006]  ffff8800434d3e70 0000000000000086 ffff88003dd65080 ffff8800434d3fd8
      [12811.430006]  ffff8800434d3fd8 ffff8800434d3fd8 ffff88003dd65080 ffff8800434d3ef8
      [12811.430006]  ffff88003dd65570 ffff88003dd65080 ffff88003dd65080 ffff88003dd65070
      [12811.430006] Call Trace:
      [12811.430006]  [<ffffffff8163b349>] schedule+0x29/0x70
      [12811.430006]  [<ffffffff81080f83>] do_wait+0x203/0x260
      [12811.430006]  [<ffffffff810820a0>] SyS_wait4+0x80/0x110
      [12811.430006]  [<ffffffff8107fb20>] ? task_stopped_code+0x60/0x60
      [12811.430006]  [<ffffffff816463c9>] system_call_fastpath+0x16/0x1b
      [12811.430006] mount.lustre    S 0000000000000000     0 25449  25446 0x00000080
      [12811.430006]  ffff88003db0bce0 0000000000000082 ffff88007c141700 ffff88003db0bfd8
      [12811.430006]  ffff88003db0bfd8 ffff88003db0bfd8 ffff88007c141700 ffff88005146c000
      [12811.430006]  ffff88007c141700 ffff88003db0be08 ffff88007a89c000 0000000000000000
      [12811.430006] Call Trace:
      [12811.430006]  [<ffffffff8163b349>] schedule+0x29/0x70
      [12811.430006]  [<ffffffffa0abaed6>] server_fill_super+0x1656/0x184c [obdclass]
      [12811.430006]  [<ffffffff810a6ac0>] ? wake_up_atomic_t+0x30/0x30
      [12811.430006]  [<ffffffffa0a8f9d8>] lustre_fill_super+0x328/0x950 [obdclass]
      [12811.430006]  [<ffffffffa0a8f6b0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [12811.430006]  [<ffffffff811e1f2d>] mount_nodev+0x4d/0xb0
      [12811.430006]  [<ffffffffa0a87908>] lustre_mount+0x38/0x60 [obdclass]
      [12811.430006]  [<ffffffff811e28d9>] mount_fs+0x39/0x1b0
      [12811.430006]  [<ffffffff811fe1af>] vfs_kern_mount+0x5f/0xf0
      [12811.430006]  [<ffffffff812006fe>] do_mount+0x24e/0xa40
      [12811.430006]  [<ffffffff8116e15e>] ? __get_free_pages+0xe/0x50
      [12811.430006]  [<ffffffff81200f86>] SyS_mount+0x96/0xf0
      [12811.430006]  [<ffffffff816463c9>] system_call_fastpath+0x16/0x1b
      

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.10


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21117/
          Subject: LU-8295 tests: comply with OBD_RACE() behavior
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: a3becc1092524366e13b3c72d61c6b0ea82709f7


          gerrit Gerrit Updater added a comment -

          Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/21117
          Subject: LU-8295 tests: comply with OBD_RACE() behavior
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: e3050d4414667a941c501728de2bfb0b78e8b097

          jhammond John Hammond added a comment -

          > Do you want me to push a patch to fix these 2 holes (4 in fact, since fail_loc is wrongly set/cleared for the MDS and OSS cases) in the test?

          Yes, please.

          I'm still not confident that I understand how this is working, but I can see several potential issues with this code: cfs_race_state should be declared volatile; there is a race between the two threads setting cfs_race_state, and between one sleeping and the other waking; and proc_fail_loc() calls wake_up(&cfs_race_waitq) but does not set cfs_race_state to 1, so it seems the sleeping task would just go back to sleep.

          My best guess is that usually "do_facet ost1 $LCTL set_param fail_loc=0x0" runs before either mount command, so the CFS_RACE() is never triggered.
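
          For reference, the handshake being discussed looks roughly like this (condensed from the cfs_race() inline in libcfs_fail.h of this era; exact code may differ between versions):

          /* The first thread whose fail check fires goes to sleep; the next
           * thread whose precheck matches is expected to wake it. Note that
           * cfs_race_state is a plain int (not volatile, no barriers), which
           * is one of the issues raised above. */
          static inline void cfs_race(__u32 id)
          {
                  if (CFS_FAIL_PRECHECK(id)) {
                          if (unlikely(__cfs_fail_check_set(id, 0, CFS_FAIL_LOC_NOSET))) {
                                  int rc;

                                  cfs_race_state = 0;
                                  CERROR("cfs_race id %x sleeping\n", id);
                                  rc = wait_event_interruptible(cfs_race_waitq,
                                                                cfs_race_state != 0);
                                  CERROR("cfs_fail_race id %x awake, rc=%d\n", id, rc);
                          } else {
                                  CERROR("cfs_fail_race id %x waking\n", id);
                                  cfs_race_state = 1;
                                  wake_up(&cfs_race_waitq);
                          }
                  }
          }
          #define CFS_RACE(id) cfs_race(id)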


          bfaccini Bruno Faccini (Inactive) added a comment -

          John, since I am the one responsible for this test and its associated code, I am quite interested in this problem and ticket.

          My understanding was that OBD_RACE()/CFS_RACE()/cfs_race() only work safely with 2 threads: the first stops and waits for the second to wake it.

          And like you, I think that for this to work as expected, fail_loc must not be cleared in between. There is a window for that to happen here, because the second mount thread is run asynchronously while the clearing of fail_loc is launched at the same time.

          On the other hand, I previously thought that CFS_FAIL_ONCE was unnecessary for OBD_RACE() to work, but after a closer look at the __cfs_fail_check_set() core routine, its absence may also, as you point out, be the reason both threads get stuck: if they race closely and enter the routine at about the same time, particularly while the CFS_FAILED bit has still not been set, both can take the sleeping path.

          Do you want me to push a patch to fix these 2 holes (4 in fact, since fail_loc is wrongly set/cleared for the MDS and OSS cases) in the test?
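
          A minimal sketch of that window, modeled on the CFS_FAIL_ONCE/CFS_FAILED handling in __cfs_fail_check_set() (an illustration only, not the verbatim routine):

          extern unsigned long cfs_fail_loc;      /* set via lctl set_param fail_loc= */

          /* Without CFS_FAIL_ONCE nothing ever marks the fail_loc as consumed,
           * so both racing mount threads can see the check succeed and both
           * take the sleeping branch of cfs_race(). Even with CFS_FAIL_ONCE,
           * two threads entering at nearly the same time can both pass the
           * test below before either sets the bit. */
          int fail_check(__u32 id)                /* models __cfs_fail_check_set() */
          {
                  if ((cfs_fail_loc & CFS_FAIL_ONCE) &&
                      test_bit(CFS_FAILED_BIT, &cfs_fail_loc))
                          return 0;               /* already fired: caller wakes the sleeper */

                  /* window: both racers can reach this point before either one
                   * sets CFS_FAILED_BIT below */
                  if (cfs_fail_loc & CFS_FAIL_ONCE)
                          set_bit(CFS_FAILED_BIT, &cfs_fail_loc);

                  return 1;                       /* caller takes the sleeping branch */
          }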

          jhammond John Hammond added a comment -

          And we need CFS_FAIL_ONCE along with OBD_FAIL_TGT_MOUNT_RACE when setting fail_loc.
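
          In fail_loc terms that means OR-ing in the once-only modifier bit (CFS_FAIL_ONCE is the standard 0x80000000 modifier in fail_loc values):

          #define CFS_FAIL_ONCE           0x80000000      /* fire the matching fail_loc only once */
          #define OBD_FAIL_TGT_MOUNT_RACE 0x716

          /* i.e. the test should set fail_loc=0x80000716 rather than plain 0x716 */
          __u32 fail_loc = OBD_FAIL_TGT_MOUNT_RACE | CFS_FAIL_ONCE;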

          jhammond John Hammond added a comment -

          There's a bug in the test:

                  #define OBD_FAIL_TGT_MOUNT_RACE 0x716                                             
                  do_facet ost1 "$LCTL set_param fail_loc=0x716"
                  start ost1 $(ostdevname 1) $OST_MOUNT_OPTS &
                  pid=$!
                  start ost1 $(ostdevname 1) $OST_MOUNT_OPTS &
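        # BUG: this clear races with the two mounts above; if it lands
        # between the two OBD_RACE() calls, the sleeper is never woken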
                  do_facet ost1 "$LCTL set_param fail_loc=0x0"
        pid2=$!
          

          Since both mount commands run asynchronously, we may clear fail_loc between the two invocations of OBD_RACE(). In this case both threads will hang.

          > Also there's only one print in the logs about the OBD_RACE actually being hit

          Shouldn't we expect that CDEBUG rate limiting will prevent the second message from being printed?

          green Oleg Drokin added a comment -

          So OBD_RACE() is supposed to stop the first caller and then let it continue when somebody else reaches the same point.

          Yet we see both threads stuck in the same place in OBD_RACE? Something strange is going on here. Also, there is only one message in the logs about the OBD_RACE actually being hit.


          People

            Assignee: jhammond John Hammond
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 6