[LU-8760] sanity-lfsck test 31g hung - Whamcloud Community JIRA

Gerrit Updater added a comment - 08/Apr/19 5:32 AM

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34502/
Subject: LU-8760 lfsck: fix bit operations lfsck_assistant_data
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f0ead95dd1275ee906eccdf117abb92b36949a1b

Andrew Perepechko added a comment - 29/Mar/19 10:40 AM - edited

The Linux kernel guarantees that wait_event(..., event_var); on one side and event_var = 1; wake_up(...); on the other are always correctly ordered with respect to each other and do not require any explicit memory barriers. From the kernel's memory-barriers documentation:

SLEEP AND WAKE-UP FUNCTIONS
---------------------------

Sleeping and waking on an event flagged in global data can be viewed as an
interaction between two pieces of data: the task state of the task waiting for
the event and the global data used to indicate the event.  To make sure that
these appear to happen in the right order, the primitives to begin the process
of going to sleep, and the primitives to initiate a wake up imply certain
barriers.

Firstly, the sleeper normally follows something like this sequence of events:

        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (event_indicated)
                        break;
                schedule();
        }

A general memory barrier is interpolated automatically by set_current_state()
after it has altered the task state:

        CPU 1
        ===============================
        set_current_state();
          set_mb();
            STORE current->state
            <general barrier>
        LOAD event_indicated

set_current_state() may be wrapped by:

        prepare_to_wait();
        prepare_to_wait_exclusive();

which therefore also imply a general memory barrier after setting the state.
The whole sequence above is available in various canned forms, all of which
interpolate the memory barrier in the right place:

        wait_event();
        wait_event_interruptible();
        wait_event_interruptible_exclusive();
        wait_event_interruptible_timeout();
        wait_event_killable();
        wait_event_timeout();
        wait_on_bit();
        wait_on_bit_lock();


Secondly, code that performs a wake up normally follows something like this:

        event_indicated = 1;
        wake_up(&event_wait_queue);

or:

        event_indicated = 1;
        wake_up_process(event_daemon);

A write memory barrier is implied by wake_up() and co. if and only if they wake
something up.  The barrier occurs before the task state is cleared, and so sits
between the STORE to indicate the event and the STORE to set TASK_RUNNING:

        CPU 1                           CPU 2
        =============================== ===============================
        set_current_state();            STORE event_indicated
          set_mb();                     wake_up();
            STORE current->state          <write barrier>
            <general barrier>             STORE current->state
        LOAD event_indicated

l_wait_event() differs from wait_event(), but it appears to follow the same sequence, even without the added smp_mb().

Gerrit Updater added a comment - 26/Mar/19 11:46 AM

Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/34502
Subject: LU-8760 lfsck: fix bit operations lfsck_assistant_data
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ce2a56e1aca04d7cf211c50e12e17d34f6587a57

Alexander Boyko added a comment - 26/Mar/19 11:39 AM

I hit the same failure on version 2.12.

crash> ps | grep lfs
   6537      2   7  ffff880ffb9aeeb0  IN   0.0       0      0  [lfsck]
   6539      2   6  ffff880ffb9aaf70  IN   0.0       0      0  [lfsck_layout]
   6540      2   0  ffff880ffb9acf10  IN   0.0       0      0  [lfsck_namespace]
crash> bt 6537
PID: 6537   TASK: ffff880ffb9aeeb0  CPU: 7   COMMAND: "lfsck"
 #0 [ffff880178d13c08] __schedule at ffffffff816b6de4
 #1 [ffff880178d13c90] schedule at ffffffff816b7409
 #2 [ffff880178d13ca0] lfsck_double_scan_generic at ffffffffc115a66e [lfsck]
 #3 [ffff880178d13d18] lfsck_layout_master_double_scan at ffffffffc1181bc0 [lfsck]
 #4 [ffff880178d13d60] lfsck_double_scan at ffffffffc115af0f [lfsck]
 #5 [ffff880178d13df0] lfsck_master_engine at ffffffffc115fe16 [lfsck]
 #6 [ffff880178d13ec8] kthread at ffffffff810b4031
 #7 [ffff880178d13f50] ret_from_fork at ffffffff816c455d
crash> bt 6540
PID: 6540   TASK: ffff880ffb9acf10  CPU: 0   COMMAND: "lfsck_namespace"
 #0 [ffff880fa73e3ce8] __schedule at ffffffff816b6de4
 #1 [ffff880fa73e3d70] schedule at ffffffff816b7409
 #2 [ffff880fa73e3d80] lfsck_assistant_engine at ffffffffc1161e0d [lfsck]
 #3 [ffff880fa73e3ec8] kthread at ffffffff810b4031
 #4 [ffff880fa73e3f50] ret_from_fork at ffffffff816c455d
crash> bt 6539
PID: 6539   TASK: ffff880ffb9aaf70  CPU: 6   COMMAND: "lfsck_layout"
 #0 [ffff880f67be7ce8] __schedule at ffffffff816b6de4
 #1 [ffff880f67be7d70] schedule at ffffffff816b7409
 #2 [ffff880f67be7d80] lfsck_assistant_engine at ffffffffc1161e0d [lfsck]
 #3 [ffff880f67be7ec8] kthread at ffffffff810b4031
 #4 [ffff880f67be7f50] ret_from_fork at ffffffff816c455d

lfsck_master_engine->lfsck_double_scan_generic() sleeps, waiting for a wake-up from lfsck_layout->lfsck_assistant_engine(), while lfsck_assistant_engine() sleeps, waiting for lfsck_master_engine to start some operation.
The state of lfsck_assistant_data is:

struct lfsck_assistant_data {
  lad_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  lad_req_list = {
    next = 0xffff880fa720d208,
    prev = 0xffff880fa720d208
  },
  lad_ost_list = {
    next = 0xffff880fd8680ce8,
    prev = 0xffff880ff9310568
  },
  lad_ost_phase1_list = {
    next = 0xffff880fd8680cf8,
    prev = 0xffff880ff9310578
  },
  lad_ost_phase2_list = {
    next = 0xffff880fa720d238,
    prev = 0xffff880fa720d238
  },
  lad_mdt_list = {
    next = 0xffff880fa720d248,
    prev = 0xffff880fa720d248
  },
  lad_mdt_phase1_list = {
    next = 0xffff880fa720d258,
    prev = 0xffff880fa720d258
  },
  lad_mdt_phase2_list = {
    next = 0xffff880fa720d268,
    prev = 0xffff880fa720d268
  },
  lad_name = 0xffffffffc11b2423 "lfsck_layout",
  lad_thread = {
    t_link = {
      next = 0x0,
      prev = 0x0
    },
    t_data = 0x0,
    t_flags = 8,
    t_id = 0,
    t_pid = 0,
    t_watchdog = 0x0,
    t_svcpt = 0x0,
    t_ctl_waitq = {
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      },
      task_list = {
        next = 0xffff880f67be7e80,
        prev = 0xffff880f67be7e80
      }
    },
    t_env = 0x0,
    t_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  },
  lad_task = 0xffff880ffb9aaf70,
  lad_ops = 0xffffffffc11d2d40 <lfsck_layout_assistant_ops>,
  lad_bitmap = 0xffff88103b1862e0,
  lad_touch_gen = 97,
  lad_prefetched = 0,
  lad_assistant_status = 0,
  lad_post_result = 1,
  lad_to_post = 0,
  lad_to_double_scan = 0,
  lad_in_double_scan = 0,
  lad_exit = 0,
  lad_incomplete = 0,
  lad_advance_lock = false
}

Before sleeping, lfsck_double_scan_generic() sets lad_to_double_scan to 1. lfsck_assistant_engine() could then have cleared it and set lad_in_double_scan to 1, but the data does not show this.

                if (lad->lad_to_double_scan) {
                        lad->lad_to_double_scan = 0;
                        atomic_inc(&lfsck->li_double_scan_count);
                        lad->lad_in_double_scan = 1;
                        wake_up_all(&mthread->t_ctl_waitq);

li_double_scan_count is 0 too.
The sleep state matches this issue; the first comment shows the same stacks. But that patch is already included in 2.12, so I think the real root cause of both failures is different from what it addressed.
The log shows lad_to_double_scan being set and lad_to_post being cleared at the same time.

00100000:10000000:7.0:1552569516.765338:0:6537:0:(lfsck_namespace.c:4595:lfsck_namespace_post()) testfs-MDT0001-osd: namespace LFSCK post done: rc = 0
00100000:10000000:6.0:1552569516.765352:0:6539:0:(lfsck_engine.c:1661:lfsck_assistant_engine()) testfs-MDT0001-osd: lfsck_layout LFSCK assistant thread post
00100000:10000000:7.0:1552569516.765354:0:6537:0:(lfsck_lib.c:2614:lfsck_double_scan_generic()) testfs-MDT0001-osd: waiting for assistant to do lfsck_layout double_scan, status 2

00100000:10000000:6.0:1552569516.765473:0:6539:0:(lfsck_engine.c:1680:lfsck_assistant_engine()) testfs-MDT0001-osd: LFSCK assistant notified others for lfsck_layout post: rc = 0

00100000:10000000:3.0:1552569516.771127:0:20498:0:(lfsck_layout.c:6395:lfsck_layout_master_in_notify()) testfs-MDT0001-osd: layout LFSCK master handles notify 3 from MDT 0, status 1, flags 0, flags2 0

The real root cause is a race between the set and clear bit operations:
lad->lad_to_post = 0; vs lad->lad_to_double_scan = 1;
The landed patch "avoid unexpected out of order execution" did not help and is probably wrong, because set_current_state() already uses a barrier:
#define set_current_state(state_value) \
set_mb(current->state, (state_value))

I'm reopening the ticket and pushing a patch for the race.
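
The suspected race between lad->lad_to_post = 0 and lad->lad_to_double_scan = 1 can be replayed deterministically. This is a minimal userspace sketch, assuming the two flags share one machine word (for example as adjacent bit-fields; the ticket does not show the declaration), so that each assignment becomes a load, modify, store sequence. TO_POST, TO_DOUBLE_SCAN and both function names are hypothetical, and the C11 atomics in the second function are only one possible remedy, not necessarily what the patch does.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical bit positions standing in for lad_to_post and
 * lad_to_double_scan; the real layout is an assumption. */
#define TO_POST		(1u << 0)
#define TO_DOUBLE_SCAN	(1u << 1)

static_assert((TO_POST & TO_DOUBLE_SCAN) == 0, "flags must not overlap");

/* Replays one bad interleaving of two non-atomic read-modify-write
 * updates on the same word, one step at a time. */
unsigned int racy_interleaving(void)
{
	unsigned int word = TO_POST;	/* lad_to_post is set initially */
	unsigned int a, m;

	a = word;			/* assistant: load */
	m = word;			/* master: load */
	m |= TO_DOUBLE_SCAN;		/* master: lad_to_double_scan = 1 */
	word = m;			/* master: store */
	a &= ~TO_POST;			/* assistant: lad_to_post = 0 */
	word = a;			/* assistant: store, drops master's bit */

	return word;
}

/* Same two updates done as atomic read-modify-write operations:
 * neither update can overwrite the other, whatever the order. */
unsigned int atomic_version(void)
{
	atomic_uint word = TO_POST;

	atomic_fetch_or(&word, TO_DOUBLE_SCAN);	/* master */
	atomic_fetch_and(&word, ~TO_POST);	/* assistant */

	return atomic_load(&word);
}
```

Under these assumptions racy_interleaving() returns 0, i.e. both flags clear, which matches the dumped lad_to_post = 0 and lad_to_double_scan = 0, while atomic_version() keeps TO_DOUBLE_SCAN set.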

Gerrit Updater added a comment - 06/Sep/17 4:31 PM

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28322/
Subject: LU-8760 lib: avoid unexpected out of order execution
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 9785fb53d0c939b2d94a69a580bdf0b6d968a25e

Gerrit Updater added a comment - 02/Aug/17 4:11 PM

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28322
Subject: LU-8760 lib: avoid unexpected out of order execution
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 11b5fddf2dc3c40cdf9bce8cd19db8f162a5dffb

Minh Diep added a comment - 02/Aug/17 4:10 PM

Landed for 2.11

Gerrit Updater added a comment - 29/Jul/17 12:02 AM

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23564/
Subject: LU-8760 lib: avoid unexpected out of order execution
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c2b6030e9217e54e7153c0a33cce0c2ea4afa54c

Gerrit Updater added a comment - 03/Nov/16 3:05 PM

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/23564
Subject: LU-8760 lib: avoid unexpected out of order execution
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a66b0c9809dd4e0ce75944a4763e333421bae8fd

nasf (Inactive) added a comment - 03/Nov/16 2:49 PM - edited

The failure itself is not important, but its cause may well affect the whole of Lustre.
The logs show that the namespace LFSCK on the 2nd MDS deadlocked as follows:

07:39:35:[21041.906022] lfsck           S ffff880053dff300     0 20397      2 0x00000080
07:39:35:[21041.906022]  ffff880053fa7c90 0000000000000046 ffff880053dff300 ffff880053fa7fd8
07:39:35:[21041.906022]  ffff880053fa7fd8 ffff880053fa7fd8 ffff880053dff300 ffff8800457ecc00
07:39:35:[21041.906022]  ffff880077f06800 ffff88004e3a80d0 0000000000000000 ffff880053dff300
07:39:35:[21041.906022] Call Trace:
07:39:35:[21041.906022]  [<ffffffff8163bc39>] schedule+0x29/0x70
07:39:35:[21041.906022]  [<ffffffffa0ca596e>] lfsck_double_scan_generic+0x22e/0x2c0 [lfsck]
07:39:35:[21041.906022]  [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
07:39:35:[21041.906022]  [<ffffffffa0caed90>] lfsck_namespace_double_scan+0x30/0x140 [lfsck]
07:39:35:[21041.906022]  [<ffffffffa0ca5cc9>] lfsck_double_scan+0x59/0x200 [lfsck]
07:39:35:[21041.906022]  [<ffffffffa0c4961a>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
07:39:35:[21041.906022]  [<ffffffff811c0bed>] ? kfree+0xfd/0x140
07:39:35:[21041.906022]  [<ffffffffa0caab04>] lfsck_master_engine+0x434/0x1310 [lfsck]
07:39:35:[21041.906022]  [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
07:39:35:[21041.906022]  [<ffffffffa0caa6d0>] ? lfsck_master_oit_engine+0x14c0/0x14c0 [lfsck]
07:39:35:[21041.906022]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
07:39:35:[21041.906022]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
07:39:35:[21041.906022]  [<ffffffff81646b98>] ret_from_fork+0x58/0x90
07:39:35:[21041.906022]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
07:39:35:[21041.906022] lfsck_namespace S ffffffffa0d0fba0     0 20399      2 0x00000080
07:39:35:[21041.906022]  ffff8800793f7d48 0000000000000046 ffff880053df8b80 ffff8800793f7fd8
07:39:35:[21041.906022]  ffff8800793f7fd8 ffff8800793f7fd8 ffff880053df8b80 0000000000000000
07:39:35:[21041.906022]  ffff8800457ecc08 ffff8800457ecc00 ffff88004e3a8000 ffffffffa0d0fba0
07:39:35:[21041.906022] Call Trace:
07:39:35:[21041.906022]  [<ffffffff8163bc39>] schedule+0x29/0x70
07:39:35:[21041.906022]  [<ffffffffa0cacbdd>] lfsck_assistant_engine+0x11fd/0x2150 [lfsck]
07:39:35:[21041.906022]  [<ffffffff810c1a96>] ? dequeue_entity+0x106/0x520
07:39:35:[21041.906022]  [<ffffffff8163b5e8>] ? __schedule+0x2d8/0x900
07:39:35:[21041.906022]  [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
07:39:35:[21041.906022]  [<ffffffffa0cab9e0>] ? lfsck_master_engine+0x1310/0x1310 [lfsck]
07:39:35:[21041.906022]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
07:39:35:[21041.906022]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
07:39:35:[21041.906022]  [<ffffffff81646b98>] ret_from_fork+0x58/0x90
07:39:35:[21041.906022]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140

Debugging based on the stack:

# gdb lfsck.ko
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/Work/Lustre/master/lustre-release/lustre/lfsck/lfsck.ko...done.
(gdb) l *lfsck_double_scan_generic+0x22e
0x1197e is in lfsck_double_scan_generic (/root/Work/Lustre/master/lustre-release/lustre/lfsck/lfsck_lib.c:2606).
2601		CDEBUG(D_LFSCK, "%s: waiting for assistant to do %s double_scan, "
2602		       "status %d\n",
2603		       lfsck_lfsck2name(com->lc_lfsck), lad->lad_name, status);
2604	
2605		wake_up_all(&athread->t_ctl_waitq);
2606		l_wait_event(mthread->t_ctl_waitq,
2607			     lad->lad_in_double_scan ||
2608			     thread_is_stopped(athread),
2609			     &lwi);
2610	
(gdb) l *lfsck_assistant_engine+0x11fd
0x18bfd is in lfsck_assistant_engine (/root/Work/Lustre/master/lustre-release/lustre/lfsck/lfsck_engine.c:1628).
warning: Source file is more recent than executable.
1623				lao->la_req_fini(env, lar);
1624				if (rc < 0 && bk->lb_param & LPF_FAILOUT)
1625					GOTO(cleanup, rc);
1626			}
1627	
1628			l_wait_event(athread->t_ctl_waitq,
1629				     !lfsck_assistant_req_empty(lad) ||
1630				     lad->lad_exit ||
1631				     lad->lad_to_post ||
1632				     lad->lad_to_double_scan,
(gdb)

That means the master engine "lfsck" was blocked in lfsck_double_scan_generic(), waiting for the assistant thread "lfsck_namespace" to complete the second-stage scanning:

int lfsck_double_scan_generic(const struct lu_env *env,
                              struct lfsck_component *com, int status)
{
        struct lfsck_assistant_data     *lad     = com->lc_data;
        struct ptlrpc_thread            *mthread = &com->lc_lfsck->li_thread;
        struct ptlrpc_thread            *athread = &lad->lad_thread;
        struct l_wait_info               lwi     = { 0 };

        if (status != LS_SCANNING_PHASE2)
                lad->lad_exit = 1;
        else
                lad->lad_to_double_scan = 1;

        CDEBUG(D_LFSCK, "%s: waiting for assistant to do %s double_scan, "
               "status %d\n",
               lfsck_lfsck2name(com->lc_lfsck), lad->lad_name, status);

        wake_up_all(&athread->t_ctl_waitq);
(line 2606)==>        l_wait_event(mthread->t_ctl_waitq,
                     lad->lad_in_double_scan ||
                     thread_is_stopped(athread),
                     &lwi);

        CDEBUG(D_LFSCK, "%s: the assistant has done %s double_scan, "
               "status %d\n", lfsck_lfsck2name(com->lc_lfsck), lad->lad_name,
               lad->lad_assistant_status);

        if (lad->lad_assistant_status < 0)
                return lad->lad_assistant_status;

        return 0;
}

Meanwhile, the assistant engine "lfsck_namespace" was waiting for the master engine to set the flag "lad->lad_to_double_scan" or "lad->lad_exit", as follows:

int lfsck_assistant_engine(void *args)
{
...
(line 1628)==>                l_wait_event(athread->t_ctl_waitq,
                             !lfsck_assistant_req_empty(lad) ||
                             lad->lad_exit ||
                             lad->lad_to_post ||
                             lad->lad_to_double_scan,
                             &lwi);
...
}

In fact, the "lfsck" thread had already set "lad->lad_to_double_scan" or "lad->lad_exit" before waking up the "lfsck_namespace" thread, and then went to wait itself. But the "lfsck_namespace" thread did NOT see the condition. The logic is not complex, so the issue should be inside l_wait_event(), which is widely used throughout Lustre.

I suspect it is related to a race condition caused by out-of-order execution in the following code:

#define __l_wait_event(wq, condition, info, ret, l_add_wait)                   \
do {                                                                           \
        wait_queue_t __wait;                                                   \
...
        for (;;) {                                                             \
(line 278)==>                set_current_state(TASK_INTERRUPTIBLE);                         \
                                                                               \
(line 280)==>                if (condition)                                                 \
                        break;                                                 \

                if (__timeout == 0) {                                          \
                        schedule();                                            \
...
}

On modern CPUs, changing the thread's state (line 278) and checking the condition (line 280) may execute out of order. Consider the following actual execution order:

1. Thread1 checks the condition on CPU1 and gets false.
2. Thread2 sets the condition on CPU2.
3. Thread2 calls wake_up() on CPU2 to wake threads in state TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, but Thread1's state is still TASK_RUNNING at that time.
4. Thread1 sets its state to TASK_INTERRUPTIBLE on CPU1, then calls schedule().

If the '__timeout' variable is zero, Thread1 has no chance to check the condition again. Generally, the interval between the reordered step 1 and step 4 is tiny, so steps 2 and 3 rarely fall inside it. To some degree, that explains why we seldom hit this problem. But the race really exists, especially considering that Thread1 can be interrupted between step 1 and step 4.

So I will make a patch that adds smp_mb() between changing the thread's state and checking the condition, to avoid the out-of-order execution.
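
The four steps above can be checked with a small model. The following is a minimal sketch, not Lustre or kernel code: it assumes a sequentially consistent interleaving of two sleeper steps and two waker steps, and represents the suspected reordering simply by swapping the sleeper's two steps. All names (struct model, sleeper_step, waker_step, lost_wakeup_possible) are hypothetical. Whether real hardware can produce that swap depends on the barrier inside set_current_state(), which this model does not address.

```c
#include <assert.h>
#include <stdbool.h>

enum task_state { MODEL_RUNNING, MODEL_SLEEPING };

struct model {
	enum task_state state;	/* sleeper's task state */
	bool cond;		/* event flag set by the waker */
	bool saw_cond;		/* what the sleeper's check observed */
};

/* Sleeper: in program order it sets its state, then checks the condition.
 * check_first models the hypothesized reordering (check before state). */
static void sleeper_step(struct model *m, int step, bool check_first)
{
	bool is_check = check_first ? (step == 0) : (step == 1);

	if (is_check)
		m->saw_cond = m->cond;
	else
		m->state = MODEL_SLEEPING;
}

/* Waker: set the condition, then wake the task only if it is sleeping. */
static void waker_step(struct model *m, int step)
{
	if (step == 0)
		m->cond = true;
	else if (m->state == MODEL_SLEEPING)
		m->state = MODEL_RUNNING;
}

/* Try all 6 interleavings of the 2 sleeper steps and 2 waker steps.
 * Returns true if some interleaving ends with the sleeper about to
 * schedule() in a sleeping state after the wake-up has already run. */
bool lost_wakeup_possible(bool check_first)
{
	for (unsigned int mask = 0; mask < 16; mask++) {
		struct model m = { MODEL_RUNNING, false, false };
		int s = 0, w = 0, bits = 0;

		for (int pos = 0; pos < 4; pos++)
			bits += (mask >> pos) & 1;
		if (bits != 2)
			continue;	/* need exactly two sleeper slots */

		for (int pos = 0; pos < 4; pos++) {
			if (mask & (1u << pos))
				sleeper_step(&m, s++, check_first);
			else
				waker_step(&m, w++);
		}
		assert(s == 2 && w == 2);

		if (!m.saw_cond && m.state == MODEL_SLEEPING)
			return true;
	}
	return false;
}
```

Under these assumptions lost_wakeup_possible(true) reports a lost wake-up (the check, set condition, wake_up, set state order listed above), while lost_wakeup_possible(false), with the state set before the check, does not.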

nasf (Inactive) added a comment - 03/Nov/16 2:49 PM - edited

The failure itself is not important, but its cause may well affect all of Lustre. The logs show that the namespace LFSCK on the 2nd MDS deadlocked as follows:

        07:39:35:[21041.906022] lfsck S ffff880053dff300 0 20397 2 0x00000080
        07:39:35:[21041.906022] ffff880053fa7c90 0000000000000046 ffff880053dff300 ffff880053fa7fd8
        07:39:35:[21041.906022] ffff880053fa7fd8 ffff880053fa7fd8 ffff880053dff300 ffff8800457ecc00
        07:39:35:[21041.906022] ffff880077f06800 ffff88004e3a80d0 0000000000000000 ffff880053dff300
        07:39:35:[21041.906022] Call Trace:
        07:39:35:[21041.906022] [<ffffffff8163bc39>] schedule+0x29/0x70
        07:39:35:[21041.906022] [<ffffffffa0ca596e>] lfsck_double_scan_generic+0x22e/0x2c0 [lfsck]
        07:39:35:[21041.906022] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
        07:39:35:[21041.906022] [<ffffffffa0caed90>] lfsck_namespace_double_scan+0x30/0x140 [lfsck]
        07:39:35:[21041.906022] [<ffffffffa0ca5cc9>] lfsck_double_scan+0x59/0x200 [lfsck]
        07:39:35:[21041.906022] [<ffffffffa0c4961a>] ? osd_otable_it_fini+0xca/0x240 [osd_ldiskfs]
        07:39:35:[21041.906022] [<ffffffff811c0bed>] ? kfree+0xfd/0x140
        07:39:35:[21041.906022] [<ffffffffa0caab04>] lfsck_master_engine+0x434/0x1310 [lfsck]
        07:39:35:[21041.906022] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
        07:39:35:[21041.906022] [<ffffffffa0caa6d0>] ? lfsck_master_oit_engine+0x14c0/0x14c0 [lfsck]
        07:39:35:[21041.906022] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
        07:39:35:[21041.906022] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
        07:39:35:[21041.906022] [<ffffffff81646b98>] ret_from_fork+0x58/0x90
        07:39:35:[21041.906022] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
        07:39:35:[21041.906022] lfsck_namespace S ffffffffa0d0fba0 0 20399 2 0x00000080
        07:39:35:[21041.906022] ffff8800793f7d48 0000000000000046 ffff880053df8b80 ffff8800793f7fd8
        07:39:35:[21041.906022] ffff8800793f7fd8 ffff8800793f7fd8 ffff880053df8b80 0000000000000000
        07:39:35:[21041.906022] ffff8800457ecc08 ffff8800457ecc00 ffff88004e3a8000 ffffffffa0d0fba0
        07:39:35:[21041.906022] Call Trace:
        07:39:35:[21041.906022] [<ffffffff8163bc39>] schedule+0x29/0x70
        07:39:35:[21041.906022] [<ffffffffa0cacbdd>] lfsck_assistant_engine+0x11fd/0x2150 [lfsck]
        07:39:35:[21041.906022] [<ffffffff810c1a96>] ? dequeue_entity+0x106/0x520
        07:39:35:[21041.906022] [<ffffffff8163b5e8>] ? __schedule+0x2d8/0x900
        07:39:35:[21041.906022] [<ffffffff810b8940>] ? wake_up_state+0x20/0x20
        07:39:35:[21041.906022] [<ffffffffa0cab9e0>] ? lfsck_master_engine+0x1310/0x1310 [lfsck]
        07:39:35:[21041.906022] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
        07:39:35:[21041.906022] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
        07:39:35:[21041.906022] [<ffffffff81646b98>] ret_from_fork+0x58/0x90
        07:39:35:[21041.906022] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140

Debugging according to the stack:

        # gdb lfsck.ko
        GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
        Copyright (C) 2013 Free Software Foundation, Inc.
        License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
        This is free software: you are free to change and redistribute it.
        There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
        and "show warranty" for details.
        This GDB was configured as "x86_64-redhat-linux-gnu".
        For bug reporting instructions, please see:
        <http://www.gnu.org/software/gdb/bugs/>...
        Reading symbols from /root/Work/Lustre/master/lustre-release/lustre/lfsck/lfsck.ko...done.
        (gdb) l *lfsck_double_scan_generic+0x22e
        0x1197e is in lfsck_double_scan_generic (/root/Work/Lustre/master/lustre-release/lustre/lfsck/lfsck_lib.c:2606).
        2601            CDEBUG(D_LFSCK, "%s: waiting for assistant to do %s double_scan, "
        2602                   "status %d\n",
        2603                   lfsck_lfsck2name(com->lc_lfsck), lad->lad_name, status);
        2604
        2605            wake_up_all(&athread->t_ctl_waitq);
        2606            l_wait_event(mthread->t_ctl_waitq,
        2607                         lad->lad_in_double_scan ||
        2608                         thread_is_stopped(athread),
        2609                         &lwi);
        2610
        (gdb) l *lfsck_assistant_engine+0x11fd
        0x18bfd is in lfsck_assistant_engine (/root/Work/Lustre/master/lustre-release/lustre/lfsck/lfsck_engine.c:1628).
        warning: Source file is more recent than executable.
        1623                    lao->la_req_fini(env, lar);
        1624                    if (rc < 0 && bk->lb_param & LPF_FAILOUT)
        1625                            GOTO(cleanup, rc);
        1626            }
        1627
        1628            l_wait_event(athread->t_ctl_waitq,
        1629                         !lfsck_assistant_req_empty(lad) ||
        1630                         lad->lad_exit ||
        1631                         lad->lad_to_post ||
        1632                         lad->lad_to_double_scan,
        (gdb)

That means the master engine "lfsck" was blocked in the function lfsck_double_scan_generic(), waiting for the assistant thread "lfsck_namespace" to complete the second-stage scanning:

        int lfsck_double_scan_generic(const struct lu_env *env,
                                      struct lfsck_component *com, int status)
        {
                struct lfsck_assistant_data *lad = com->lc_data;
                struct ptlrpc_thread *mthread = &com->lc_lfsck->li_thread;
                struct ptlrpc_thread *athread = &lad->lad_thread;
                struct l_wait_info lwi = { 0 };

                if (status != LS_SCANNING_PHASE2)
                        lad->lad_exit = 1;
                else
                        lad->lad_to_double_scan = 1;

                CDEBUG(D_LFSCK, "%s: waiting for assistant to do %s double_scan, "
                       "status %d\n",
                       lfsck_lfsck2name(com->lc_lfsck), lad->lad_name, status);

                wake_up_all(&athread->t_ctl_waitq);
        (line 2606)==> l_wait_event(mthread->t_ctl_waitq,
                             lad->lad_in_double_scan ||
                             thread_is_stopped(athread),
                             &lwi);

                CDEBUG(D_LFSCK, "%s: the assistant has done %s double_scan, "
                       "status %d\n",
                       lfsck_lfsck2name(com->lc_lfsck), lad->lad_name,
                       lad->lad_assistant_status);

                if (lad->lad_assistant_status < 0)
                        return lad->lad_assistant_status;

                return 0;
        }

Meanwhile, the assistant engine "lfsck_namespace" was waiting for the master engine to set the flag "lad->lad_to_double_scan" or "lad->lad_exit", as follows:

        int lfsck_assistant_engine(void *args)
        {
                ...
        (line 1628)==> l_wait_event(athread->t_ctl_waitq,
                             !lfsck_assistant_req_empty(lad) ||
                             lad->lad_exit ||
                             lad->lad_to_post ||
                             lad->lad_to_double_scan,
                             &lwi);
                ...
        }

In fact, the "lfsck" thread had already set "lad->lad_to_double_scan" or "lad->lad_exit" before waking up the "lfsck_namespace" thread, and then went to wait itself. But the "lfsck_namespace" thread did NOT see the condition become true.

The logic is not complex, so the issue should be inside l_wait_event(), which is widely used throughout Lustre. I suspect that it is related to a race condition caused by out-of-order execution in the following code:

        #define __l_wait_event(wq, condition, info, ret, l_add_wait)    \
        do {                                                            \
                wait_queue_t __wait;                                    \
                ...
                for (;;) {                                              \
        (line 278)==>   set_current_state(TASK_INTERRUPTIBLE);          \
                                                                        \
        (line 280)==>   if (condition)                                  \
                                break;                                  \
                        if (__timeout == 0) {                           \
                                schedule();                             \
                ...
        }

On a modern CPU, there may be out-of-order execution between changing the thread's state (line 278) and checking the condition (line 280). Consider the following real execution order:

1. Thread1 checks the condition on CPU1 and gets false.
2. Thread2 sets the condition on CPU2.
3. Thread2 calls wake_up() on CPU2 to wake the threads with state TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE. But Thread1's state is still TASK_RUNNING at that time.
4. Thread1 sets its state to TASK_INTERRUPTIBLE on CPU1, then calls schedule().

If the '__timeout' variable is zero, Thread1 will have no chance to check the condition again.

Generally, the interval between the reordered step 1 and step 4 is so small that steps 2 and 3 rarely fall inside it. To some degree, that explains why we seldom hit this kind of trouble. But the race really exists, especially considering that Thread1 can be interrupted between step 1 and step 4. So I will make a patch to add smp_mb() between changing the thread's state and checking the condition to avoid out-of-order execution.
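The ordering requirement behind the four steps above can be modelled in user space. The following is only a minimal sketch, not Lustre or kernel code: C11 atomics stand in for the task state and the event flag, and every name in it (task_state, event_flag, sleeper, waker, count_lost_wakeups) is hypothetical. The default sequentially consistent stores and loads play the role of "store, then full barrier, then load" on each side, so the C11 memory model rules out the outcome in which the sleeper misses the flag and the waker also misses the sleeping state.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins: task_state models current->state,
 * event_flag models a flag such as lad->lad_to_double_scan. */
enum { RUNNING = 0, SLEEPING = 1 };

static atomic_int task_state;
static atomic_int event_flag;
static int sleeper_saw_event;   /* written by sleeper, read after join */
static int waker_saw_sleeper;   /* written by waker, read after join */

/* Sleeper side: publish "about to sleep", then check the condition. */
static void *sleeper(void *arg)
{
        (void)arg;
        atomic_store(&task_state, SLEEPING);            /* seq_cst store */
        sleeper_saw_event = atomic_load(&event_flag);   /* seq_cst load */
        return NULL;
}

/* Waker side: publish the condition, then look for a sleeping task. */
static void *waker(void *arg)
{
        (void)arg;
        atomic_store(&event_flag, 1);
        waker_saw_sleeper = (atomic_load(&task_state) == SLEEPING);
        return NULL;
}

/* Run the handshake `rounds` times and count rounds in which neither
 * side observed the other, i.e. the modelled lost wake-up. */
int count_lost_wakeups(int rounds)
{
        int lost = 0;

        for (int i = 0; i < rounds; i++) {
                pthread_t s, w;

                atomic_store(&task_state, RUNNING);
                atomic_store(&event_flag, 0);
                pthread_create(&s, NULL, sleeper, NULL);
                pthread_create(&w, NULL, waker, NULL);
                pthread_join(s, NULL);
                pthread_join(w, NULL);

                if (!sleeper_saw_event && !waker_saw_sleeper)
                        lost++;
        }
        return lost;
}
```

Calling count_lost_wakeups() from a small main() should always report zero with the orderings shown. If the stores and loads were weakened to memory_order_relaxed, the both-missed outcome would become permitted by the memory model, which corresponds to the lost wake-up suspected in the comment above.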
