Details

    • 3
    • 9223372036854775807

    Description

      With patch http://review.whamcloud.com/7200 on master branch, sanity-lfsck test 31g hung as follows:

      == sanity-lfsck test 31g: Repair the corrupted slave LMV EA ========================================== 06:38:18 (1476970698)
      #####
      For some reason, the stripe index in the slave LMV EA is
      corrupted. The LFSCK should repair the slave LMV EA.
      #####
      Inject failure stub on MDT0 to simulate the case that the
      slave LMV EA on the first shard of the striped directory
      claims the same index as the second shard claims
      CMD: onyx-35vm7 /usr/sbin/lctl set_param fail_loc=0x162b fail_val=0
      fail_loc=0x162b
      fail_val=0
      CMD: onyx-35vm7 /usr/sbin/lctl set_param fail_loc=0x0 fail_val=0
      fail_loc=0x0
      fail_val=0
      Trigger namespace LFSCK to repair the slave LMV EA
      CMD: onyx-35vm7 /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -t namespace -r -A
      Started LFSCK on the device lustre-MDT0000: scrub namespace
      CMD: onyx-35vm7 /usr/sbin/lctl lfsck_query -t namespace -M lustre-MDT0000 -w |
      		      awk '/^namespace_mdts_completed/ { print \$2 }'
      

      Maloo report: https://testing.hpdd.intel.com/test_sets/7e295098-96fa-11e6-a763-5254006e85c2

      Attachments

        Activity

          [LU-8760] sanity-lfsck test 31g hung
          eaujames Etienne Aujames added a comment - +1 on b2_12: https://testing.whamcloud.com/test_sets/3daca484-09f5-4839-a786-67650dccaf6c
          mdiep Minh Diep added a comment -

          Landed in 2.13

          mdiep Minh Diep added a comment - Landed in 2.13

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34502/
          Subject: LU-8760 lfsck: fix bit operations lfsck_assistant_data
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: f0ead95dd1275ee906eccdf117abb92b36949a1b

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34502/ Subject: LU-8760 lfsck: fix bit operations lfsck_assistant_data Project: fs/lustre-release Branch: master Current Patch Set: Commit: f0ead95dd1275ee906eccdf117abb92b36949a1b
          panda Andrew Perepechko added a comment - - edited

          The Linux kernel guarantees that wait_event(..., event_var); and event_var = 1; wake_up(...); are always atomic and consistent and do not require any explicit memory barriers:

          SLEEP AND WAKE-UP FUNCTIONS
          ---------------------------
          
          Sleeping and waking on an event flagged in global data can be viewed as an
          interaction between two pieces of data: the task state of the task waiting for
          the event and the global data used to indicate the event.  To make sure that
          these appear to happen in the right order, the primitives to begin the process
          of going to sleep, and the primitives to initiate a wake up imply certain
          barriers.
          
          Firstly, the sleeper normally follows something like this sequence of events:
          
                  for (;;) {
                          set_current_state(TASK_UNINTERRUPTIBLE);
                          if (event_indicated)
                                  break;
                          schedule();
                  }
          
          A general memory barrier is interpolated automatically by set_current_state()
          after it has altered the task state:
          
                  CPU 1
                  ===============================
                  set_current_state();
                    set_mb();
                      STORE current->state
                      <general barrier>
                  LOAD event_indicated
          
          set_current_state() may be wrapped by:
          
                  prepare_to_wait();
                  prepare_to_wait_exclusive();
          
          which therefore also imply a general memory barrier after setting the state.
          The whole sequence above is available in various canned forms, all of which
          interpolate the memory barrier in the right place:
          
                  wait_event();
                  wait_event_interruptible();
                  wait_event_interruptible_exclusive();
                  wait_event_interruptible_timeout();
                  wait_event_killable();
                  wait_event_timeout();
                  wait_on_bit();
                  wait_on_bit_lock();
          
          
          Secondly, code that performs a wake up normally follows something like this:
          
                  event_indicated = 1;
                  wake_up(&event_wait_queue);
          
          or:
          
                  event_indicated = 1;
                  wake_up_process(event_daemon);
          
          A write memory barrier is implied by wake_up() and co. if and only if they wake
          something up.  The barrier occurs before the task state is cleared, and so sits
          between the STORE to indicate the event and the STORE to set TASK_RUNNING:
          
                  CPU 1                           CPU 2
                  =============================== ===============================
                  set_current_state();            STORE event_indicated
                    set_mb();                     wake_up();
                      STORE current->state          <write barrier>
                      <general barrier>             STORE current->state
                  LOAD event_indicated
          

          l_wait_event() is different from wait_event() but seems to mimic the same logic even without the smp_mb() modification.

          panda Andrew Perepechko added a comment - - edited The Linux kernel guarantees that wait_event(..., event_var); and event_var = 1; wake_up(...); are always atomic and consistent and do not require any explicit memory barriers: SLEEP AND WAKE-UP FUNCTIONS --------------------------- Sleeping and waking on an event flagged in global data can be viewed as an interaction between two pieces of data: the task state of the task waiting for the event and the global data used to indicate the event. To make sure that these appear to happen in the right order, the primitives to begin the process of going to sleep, and the primitives to initiate a wake up imply certain barriers. Firstly, the sleeper normally follows something like this sequence of events: for (;;) { set_current_state(TASK_UNINTERRUPTIBLE); if (event_indicated) break ; schedule(); } A general memory barrier is interpolated automatically by set_current_state() after it has altered the task state: CPU 1 =============================== set_current_state(); set_mb(); STORE current->state <general barrier> LOAD event_indicated set_current_state() may be wrapped by: prepare_to_wait(); prepare_to_wait_exclusive(); which therefore also imply a general memory barrier after setting the state. The whole sequence above is available in various canned forms, all of which interpolate the memory barrier in the right place: wait_event(); wait_event_interruptible(); wait_event_interruptible_exclusive(); wait_event_interruptible_timeout(); wait_event_killable(); wait_event_timeout(); wait_on_bit(); wait_on_bit_lock(); Secondly, code that performs a wake up normally follows something like this : event_indicated = 1; wake_up(&event_wait_queue); or: event_indicated = 1; wake_up_process(event_daemon); A write memory barrier is implied by wake_up() and co. if and only if they wake something up. The barrier occurs before the task state is cleared, and so sits between the STORE to indicate the event and the STORE to set TASK_RUNNING: CPU 1 CPU 2 =============================== =============================== set_current_state(); STORE event_indicated set_mb(); wake_up(); STORE current->state <write barrier> <general barrier> STORE current->state LOAD event_indicated l_wait_event() is different from wait_event() but seems to mimic the same logic even without the smp_mb() modification.

          Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/34502
          Subject: LU-8760 lfsck: fix bit operations lfsck_assistant_data
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: ce2a56e1aca04d7cf211c50e12e17d34f6587a57

          gerrit Gerrit Updater added a comment - Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/34502 Subject: LU-8760 lfsck: fix bit operations lfsck_assistant_data Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ce2a56e1aca04d7cf211c50e12e17d34f6587a57

          I have the same failure for 2.12 version.

          crash> ps | grep lfs
             6537      2   7  ffff880ffb9aeeb0  IN   0.0       0      0  [lfsck]
             6539      2   6  ffff880ffb9aaf70  IN   0.0       0      0  [lfsck_layout]
             6540      2   0  ffff880ffb9acf10  IN   0.0       0      0  [lfsck_namespace]
          crash> bt 6537
          PID: 6537   TASK: ffff880ffb9aeeb0  CPU: 7   COMMAND: "lfsck"
           #0 [ffff880178d13c08] __schedule at ffffffff816b6de4
           #1 [ffff880178d13c90] schedule at ffffffff816b7409
           #2 [ffff880178d13ca0] lfsck_double_scan_generic at ffffffffc115a66e [lfsck]
           #3 [ffff880178d13d18] lfsck_layout_master_double_scan at ffffffffc1181bc0 [lfsck]
           #4 [ffff880178d13d60] lfsck_double_scan at ffffffffc115af0f [lfsck]
           #5 [ffff880178d13df0] lfsck_master_engine at ffffffffc115fe16 [lfsck]
           #6 [ffff880178d13ec8] kthread at ffffffff810b4031
           #7 [ffff880178d13f50] ret_from_fork at ffffffff816c455d
          crash> bt 6540
          PID: 6540   TASK: ffff880ffb9acf10  CPU: 0   COMMAND: "lfsck_namespace"
           #0 [ffff880fa73e3ce8] __schedule at ffffffff816b6de4
           #1 [ffff880fa73e3d70] schedule at ffffffff816b7409
           #2 [ffff880fa73e3d80] lfsck_assistant_engine at ffffffffc1161e0d [lfsck]
           #3 [ffff880fa73e3ec8] kthread at ffffffff810b4031
           #4 [ffff880fa73e3f50] ret_from_fork at ffffffff816c455d
          crash> bt 6539
          PID: 6539   TASK: ffff880ffb9aaf70  CPU: 6   COMMAND: "lfsck_layout"
           #0 [ffff880f67be7ce8] __schedule at ffffffff816b6de4
           #1 [ffff880f67be7d70] schedule at ffffffff816b7409
           #2 [ffff880f67be7d80] lfsck_assistant_engine at ffffffffc1161e0d [lfsck]
           #3 [ffff880f67be7ec8] kthread at ffffffff810b4031
           #4 [ffff880f67be7f50] ret_from_fork at ffffffff816c455d
          

          lfsck_master_engine->lfsck_double_scan_generic() sleeps and waits wakeup from lfsck_layout->lfsck_assistant_engine(). And lfsck_assistant_engine sleeps and wait when lfsck_master_engine starts some operation.
          The state of the lfsck_assistant_data is

          struct lfsck_assistant_data {
            lad_lock = {
              {
                rlock = {
                  raw_lock = {
                    val = {
                      counter = 0
                    }
                  }
                }
              }
            },
            lad_req_list = {
              next = 0xffff880fa720d208,
              prev = 0xffff880fa720d208
            },
            lad_ost_list = {
              next = 0xffff880fd8680ce8,
              prev = 0xffff880ff9310568
            },
            lad_ost_phase1_list = {
              next = 0xffff880fd8680cf8,
              prev = 0xffff880ff9310578
            },
            lad_ost_phase2_list = {
              next = 0xffff880fa720d238,
              prev = 0xffff880fa720d238
            },
            lad_mdt_list = {
              next = 0xffff880fa720d248,
              prev = 0xffff880fa720d248
            },
            lad_mdt_phase1_list = {
              next = 0xffff880fa720d258,
              prev = 0xffff880fa720d258
            },
            lad_mdt_phase2_list = {
              next = 0xffff880fa720d268,
              prev = 0xffff880fa720d268
            },
            lad_name = 0xffffffffc11b2423 "lfsck_layout",
            lad_thread = {
              t_link = {
                next = 0x0,
                prev = 0x0
              },
              t_data = 0x0,
              t_flags = 8,
              t_id = 0,
              t_pid = 0,
              t_watchdog = 0x0,
              t_svcpt = 0x0,
              t_ctl_waitq = {
                lock = {
                  {
                    rlock = {
                      raw_lock = {
                        val = {
                          counter = 0
                        }
                      }
                    }
                  }
                },
                task_list = {
                  next = 0xffff880f67be7e80,
                  prev = 0xffff880f67be7e80
                }
              },
              t_env = 0x0,
              t_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
            },
            lad_task = 0xffff880ffb9aaf70,
            lad_ops = 0xffffffffc11d2d40 <lfsck_layout_assistant_ops>,
            lad_bitmap = 0xffff88103b1862e0,
            lad_touch_gen = 97,
            lad_prefetched = 0,
            lad_assistant_status = 0,
            lad_post_result = 1,
            lad_to_post = 0,
            lad_to_double_scan = 0,
            lad_in_double_scan = 0,
            lad_exit = 0,
            lad_incomplete = 0,
            lad_advance_lock = false
          }
          

          Before the sleep lfsck_double_scan_generic set lad_to_double_scan to 1. lfsck_assistant_engine could zeroed it and set lad_in_double_scan to 1. But the data doesn't show this.

                          if (lad->lad_to_double_scan) {
                                  lad->lad_to_double_scan = 0;
                                  atomic_inc(&lfsck->li_double_scan_count);
                                  lad->lad_in_double_scan = 1;
                                  wake_up_all(&mthread->t_ctl_waitq);
          
          

          li_double_scan_count is 0 too.
          The sleep state is equal to current issue, the first comment shows the same. But the patch already included at 2.12. So the real root cause is different for both fails I think.
          The log shows setting lad_to_double_scan and clearing lad_to_post at the same time.

          00100000:10000000:7.0:1552569516.765338:0:6537:0:(lfsck_namespace.c:4595:lfsck_namespace_post()) testfs-MDT0001-osd: namespace LFSCK post done: rc = 0
          00100000:10000000:6.0:1552569516.765352:0:6539:0:(lfsck_engine.c:1661:lfsck_assistant_engine()) testfs-MDT0001-osd: lfsck_layout LFSCK assistant thread post
          00100000:10000000:7.0:1552569516.765354:0:6537:0:(lfsck_lib.c:2614:lfsck_double_scan_generic()) testfs-MDT0001-osd: waiting for assistant to do lfsck_layout double_scan, status 2
          
          00100000:10000000:6.0:1552569516.765473:0:6539:0:(lfsck_engine.c:1680:lfsck_assistant_engine()) testfs-MDT0001-osd: LFSCK assistant notified others for lfsck_layout post: rc = 0
          
          00100000:10000000:3.0:1552569516.771127:0:20498:0:(lfsck_layout.c:6395:lfsck_layout_master_in_notify()) testfs-MDT0001-osd: layout LFSCK master handles notify 3 from MDT 0, status 1, flags 0, flags2 0
          

          The real root cause is race during set/clear bit operation.
          lad->lad_to_post = 0; vs lad->lad_to_double_scan = 1;
          And the landed patch avoid unexpected out of order execution didn't help and probably is wrong. Because set_current_state use barrier already.
          #define set_current_state(state_value) \
          set_mb(current->state, (state_value))

           I'm reopening ticket and pushing the patch for race.

          aboyko Alexander Boyko added a comment - I have the same failure for 2.12 version. crash> ps | grep lfs 6537 2 7 ffff880ffb9aeeb0 IN 0.0 0 0 [lfsck] 6539 2 6 ffff880ffb9aaf70 IN 0.0 0 0 [lfsck_layout] 6540 2 0 ffff880ffb9acf10 IN 0.0 0 0 [lfsck_namespace] crash> bt 6537 PID: 6537 TASK: ffff880ffb9aeeb0 CPU: 7 COMMAND: "lfsck" #0 [ffff880178d13c08] __schedule at ffffffff816b6de4 #1 [ffff880178d13c90] schedule at ffffffff816b7409 #2 [ffff880178d13ca0] lfsck_double_scan_generic at ffffffffc115a66e [lfsck] #3 [ffff880178d13d18] lfsck_layout_master_double_scan at ffffffffc1181bc0 [lfsck] #4 [ffff880178d13d60] lfsck_double_scan at ffffffffc115af0f [lfsck] #5 [ffff880178d13df0] lfsck_master_engine at ffffffffc115fe16 [lfsck] #6 [ffff880178d13ec8] kthread at ffffffff810b4031 #7 [ffff880178d13f50] ret_from_fork at ffffffff816c455d crash> bt 6540 PID: 6540 TASK: ffff880ffb9acf10 CPU: 0 COMMAND: "lfsck_namespace" #0 [ffff880fa73e3ce8] __schedule at ffffffff816b6de4 #1 [ffff880fa73e3d70] schedule at ffffffff816b7409 #2 [ffff880fa73e3d80] lfsck_assistant_engine at ffffffffc1161e0d [lfsck] #3 [ffff880fa73e3ec8] kthread at ffffffff810b4031 #4 [ffff880fa73e3f50] ret_from_fork at ffffffff816c455d crash> bt 6539 PID: 6539 TASK: ffff880ffb9aaf70 CPU: 6 COMMAND: "lfsck_layout" #0 [ffff880f67be7ce8] __schedule at ffffffff816b6de4 #1 [ffff880f67be7d70] schedule at ffffffff816b7409 #2 [ffff880f67be7d80] lfsck_assistant_engine at ffffffffc1161e0d [lfsck] #3 [ffff880f67be7ec8] kthread at ffffffff810b4031 #4 [ffff880f67be7f50] ret_from_fork at ffffffff816c455d lfsck_master_engine->lfsck_double_scan_generic() sleeps and waits wakeup from lfsck_layout->lfsck_assistant_engine(). And lfsck_assistant_engine sleeps and wait when lfsck_master_engine starts some operation. The state of the lfsck_assistant_data is struct lfsck_assistant_data { lad_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, lad_req_list = { next = 0xffff880fa720d208, prev = 0xffff880fa720d208 }, lad_ost_list = { next = 0xffff880fd8680ce8, prev = 0xffff880ff9310568 }, lad_ost_phase1_list = { next = 0xffff880fd8680cf8, prev = 0xffff880ff9310578 }, lad_ost_phase2_list = { next = 0xffff880fa720d238, prev = 0xffff880fa720d238 }, lad_mdt_list = { next = 0xffff880fa720d248, prev = 0xffff880fa720d248 }, lad_mdt_phase1_list = { next = 0xffff880fa720d258, prev = 0xffff880fa720d258 }, lad_mdt_phase2_list = { next = 0xffff880fa720d268, prev = 0xffff880fa720d268 }, lad_name = 0xffffffffc11b2423 "lfsck_layout", lad_thread = { t_link = { next = 0x0, prev = 0x0 }, t_data = 0x0, t_flags = 8, t_id = 0, t_pid = 0, t_watchdog = 0x0, t_svcpt = 0x0, t_ctl_waitq = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff880f67be7e80, prev = 0xffff880f67be7e80 } }, t_env = 0x0, t_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000" }, lad_task = 0xffff880ffb9aaf70, lad_ops = 0xffffffffc11d2d40 <lfsck_layout_assistant_ops>, lad_bitmap = 0xffff88103b1862e0, lad_touch_gen = 97, lad_prefetched = 0, lad_assistant_status = 0, lad_post_result = 1, lad_to_post = 0, lad_to_double_scan = 0, lad_in_double_scan = 0, lad_exit = 0, lad_incomplete = 0, lad_advance_lock = false } Before the sleep lfsck_double_scan_generic set lad_to_double_scan to 1. lfsck_assistant_engine could zeroed it and set lad_in_double_scan to 1. But the data doesn't show this. if (lad->lad_to_double_scan) { lad->lad_to_double_scan = 0; atomic_inc(&lfsck->li_double_scan_count); lad->lad_in_double_scan = 1; wake_up_all(&mthread->t_ctl_waitq); li_double_scan_count is 0 too. The sleep state is equal to current issue, the first comment shows the same. But the patch already included at 2.12. So the real root cause is different for both fails I think. The log shows setting lad_to_double_scan and clearing lad_to_post at the same time. 00100000:10000000:7.0:1552569516.765338:0:6537:0:(lfsck_namespace.c:4595:lfsck_namespace_post()) testfs-MDT0001-osd: namespace LFSCK post done: rc = 0 00100000:10000000:6.0:1552569516.765352:0:6539:0:(lfsck_engine.c:1661:lfsck_assistant_engine()) testfs-MDT0001-osd: lfsck_layout LFSCK assistant thread post 00100000:10000000:7.0:1552569516.765354:0:6537:0:(lfsck_lib.c:2614:lfsck_double_scan_generic()) testfs-MDT0001-osd: waiting for assistant to do lfsck_layout double_scan, status 2 00100000:10000000:6.0:1552569516.765473:0:6539:0:(lfsck_engine.c:1680:lfsck_assistant_engine()) testfs-MDT0001-osd: LFSCK assistant notified others for lfsck_layout post: rc = 0 00100000:10000000:3.0:1552569516.771127:0:20498:0:(lfsck_layout.c:6395:lfsck_layout_master_in_notify()) testfs-MDT0001-osd: layout LFSCK master handles notify 3 from MDT 0, status 1, flags 0, flags2 0 The real root cause is race during set/clear bit operation. lad->lad_to_post = 0; vs lad->lad_to_double_scan = 1; And the landed patch avoid unexpected out of order execution didn't help and probably is wrong. Because set_current_state use barrier already. #define set_current_state(state_value) \ set_mb(current->state, (state_value))  I'm reopening ticket and pushing the patch for race.

          John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28322/
          Subject: LU-8760 lib: avoid unexpected out of order execution
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set:
          Commit: 9785fb53d0c939b2d94a69a580bdf0b6d968a25e

          gerrit Gerrit Updater added a comment - John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28322/ Subject: LU-8760 lib: avoid unexpected out of order execution Project: fs/lustre-release Branch: b2_10 Current Patch Set: Commit: 9785fb53d0c939b2d94a69a580bdf0b6d968a25e

          Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28322
          Subject: LU-8760 lib: avoid unexpected out of order execution
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set: 1
          Commit: 11b5fddf2dc3c40cdf9bce8cd19db8f162a5dffb

          gerrit Gerrit Updater added a comment - Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28322 Subject: LU-8760 lib: avoid unexpected out of order execution Project: fs/lustre-release Branch: b2_10 Current Patch Set: 1 Commit: 11b5fddf2dc3c40cdf9bce8cd19db8f162a5dffb
          mdiep Minh Diep added a comment -

          Landed for 2.11

          mdiep Minh Diep added a comment - Landed for 2.11

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23564/
          Subject: LU-8760 lib: avoid unexpected out of order execution
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: c2b6030e9217e54e7153c0a33cce0c2ea4afa54c

          gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23564/ Subject: LU-8760 lib: avoid unexpected out of order execution Project: fs/lustre-release Branch: master Current Patch Set: Commit: c2b6030e9217e54e7153c0a33cce0c2ea4afa54c

          People

            yong.fan nasf (Inactive)
            yujian Jian Yu
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: