Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11975

sanity test 60a hangs in llog_test_10

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.13.0
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Labels:
      None
    • Severity:
      3
    • Rank (Obsolete):
      9223372036854775807

      Description

      sanity test_60a hangs when running llog_test.c. This test started hanging on 18 February 2019 with 2.12.51.51.

      Looking at the logs for one of the failures, https://testing.whamcloud.com/test_sets/41ea8004-348f-11e9-a05f-52540065bddc , the last thing seen in the client test_log is

      trevis-10vm12: + RC=0
      trevis-10vm12: + eval '/sbin/lctl <<-EOF || RC=2
      trevis-10vm12: 	attach llog_test llt_name llt_uuid
      trevis-10vm12: 	ignore_errors
      trevis-10vm12: 	setup MGS
      trevis-10vm12: 	device llt_name
      trevis-10vm12: 	cleanup
      trevis-10vm12: 	detach
      trevis-10vm12: EOF'
      trevis-10vm12: ++ /sbin/lctl
      

      Looking at the MDS console log, we see the test in hung in llog_test 10h

      [ 2700.758677] Lustre: 29346:0:(llog_test.c:2045:llog_test_10()) 10h: write 64767 more log records
      [ 2700.758851] Lustre: 29365:0:(llog_test.c:1469:cat_check_old_cb()) seeing record at index 2 - [0x1:0x4e:0x0] in log [0xa:0x11:0x0]
      [ 2700.760785] LustreError: 29346:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 1317 sleeping
      [ 2880.228053] INFO: task jbd2/dm-3-8:6113 blocked for more than 120 seconds.
      [ 2880.228819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 2880.229589] jbd2/dm-3-8     D ffff9862a41e9040     0  6113      2 0x00000080
      [ 2880.230319] Call Trace:
      [ 2880.230599]  [<ffffffff85767c49>] schedule+0x29/0x70
      [ 2880.231117]  [<ffffffffc039c2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2]
      [ 2880.231941]  [<ffffffff8502a59e>] ? __switch_to+0xce/0x580
      [ 2880.232676]  [<ffffffff850d0880>] ? finish_task_switch+0x50/0x1c0
      [ 2880.233269]  [<ffffffff850c2d00>] ? wake_up_atomic_t+0x30/0x30
      [ 2880.233825]  [<ffffffff8576778f>] ? __schedule+0x3ff/0x890
      [ 2880.234424]  [<ffffffff850aa0de>] ? try_to_del_timer_sync+0x5e/0x90
      [ 2880.235057]  [<ffffffffc03a2e89>] kjournald2+0xc9/0x260 [jbd2]
      [ 2880.235611]  [<ffffffff850c2d00>] ? wake_up_atomic_t+0x30/0x30
      [ 2880.236169]  [<ffffffffc03a2dc0>] ? commit_timeout+0x10/0x10 [jbd2]
      [ 2880.236834]  [<ffffffff850c1c31>] kthread+0xd1/0xe0
      [ 2880.237302]  [<ffffffff850c1b60>] ? insert_kthread_work+0x40/0x40
      [ 2880.237882]  [<ffffffff85774c37>] ret_from_fork_nospec_begin+0x21/0x21
      [ 2880.238500]  [<ffffffff850c1b60>] ? insert_kthread_work+0x40/0x40
      

      The sanity test 60a hang at https://testing.whamcloud.com/test_sets/80bd5466-339d-11e9-bd83-52540065bddc has a different call trace in the MDS console log

      [ 4208.005408] Lustre: 9960:0:(llog_test.c:2045:llog_test_10()) 10h: write 64767 more log records
      [ 4208.005528] Lustre: 13221:0:(llog_test.c:1469:cat_check_old_cb()) seeing record at index 2 - [0x1:0x35:0x0] in log [0xa:0x15:0x0]
      [ 4208.008970] LustreError: 9960:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 1317 sleeping
      [ 4440.279385] INFO: task txg_quiesce:17743 blocked for more than 120 seconds.
      [ 4440.280697] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 4440.282045] txg_quiesce     D ffff9910f6268000     0 17743      2 0x00000080
      [ 4440.283468] Call Trace:
      [ 4440.283971]  [<ffffffffa2f67c49>] schedule+0x29/0x70
      [ 4440.284879]  [<ffffffffc048c2d5>] cv_wait_common+0x125/0x150 [spl]
      [ 4440.285989]  [<ffffffffa28c2d00>] ? wake_up_atomic_t+0x30/0x30
      [ 4440.287197]  [<ffffffffc048c315>] __cv_wait+0x15/0x20 [spl]
      [ 4440.288234]  [<ffffffffc05f2c2b>] txg_quiesce_thread+0x2fb/0x410 [zfs]
      [ 4440.289475]  [<ffffffffc05f2930>] ? txg_init+0x2b0/0x2b0 [zfs]
      [ 4440.290514]  [<ffffffffc0487063>] thread_generic_wrapper+0x73/0x80 [spl]
      [ 4440.291768]  [<ffffffffc0486ff0>] ? __thread_exit+0x20/0x20 [spl]
      [ 4440.292837]  [<ffffffffa28c1c31>] kthread+0xd1/0xe0
      [ 4440.293814]  [<ffffffffa28c1b60>] ? insert_kthread_work+0x40/0x40
      [ 4440.294891]  [<ffffffffa2f74c37>] ret_from_fork_nospec_begin+0x21/0x21
      [ 4440.296120]  [<ffffffffa28c1b60>] ? insert_kthread_work+0x40/0x40
      

      The patch for LU-11591, https://review.whamcloud.com/#/c/33683/ , was the last patch to modify llog_test test 10 and landed around the time sanity test 60a started failing.

      Logs for additional failures can be found at
      https://testing.whamcloud.com/test_sets/6f903f48-348b-11e9-ae87-52540065bddc
      https://testing.whamcloud.com/test_sets/86731228-3376-11e9-ae87-52540065bddc

        Attachments

          Activity

            People

            • Assignee:
              aboyko Alexander Boyko
              Reporter:
              jamesanunez James Nunez
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: