Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14300

sanity-sec test 18 hangs in txg_quiesce on MDS

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0
    • Lustre 2.14.0
    • None
    • ZFS and single MDS
    • 3
    • 9223372036854775807

    Description

      sanity-sec test_18 hangs. It looks like this issue started on 15 MAY 2020 with https://testing.whamcloud.com/test_sets/8514ff93-a5ac-45f0-9584-dff4d4e42eae for Lustre 2.13.53.163.

      Looking at the suite_log for https://testing.whamcloud.com/test_sets/6875529f-ecc2-432e-9f20-e8810a08448c, the last thing we see is

      CMD: trevis-201vm1.trevis.whamcloud.com runas -u60003 -g60003 -G60003 lfs quota -q /mnt/lustre
      running as uid/gid/euid/egid 60003/60003/60003/60003, groups: 60003
       [lfs] [quota] [-q] [/mnt/lustre]
      CMD: trevis-201vm2 chmod 000 /mnt/lustre/d18.sanity-sec
      CMD: trevis-201vm2 chmod 001 /mnt/lustre/d18.sanity-sec
      CMD: trevis-201vm2 chmod 002 /mnt/lustre/d18.sanity-sec
      CMD: trevis-201vm2 chmod 003 /mnt/lustre/d18.sanity-sec
      CMD: trevis-201vm2 chmod 004 /mnt/lustre/d18.sanity-sec
      CMD: trevis-201vm2 chmod 005 /mnt/lustre/d18.sanity-sec
      CMD: trevis-201vm2 chmod 006 /mnt/lustre/d18.sanity-sec
      

      Looking at the MDS (vm4) console log, we see the hang

      [32726.826140] Lustre: mdt00_003: service thread pid 819155 was inactive for 41.957 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [32726.826328] Pid: 816983, comm: mdt_rdpg00_000 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
      [32726.829004] Lustre: Skipped 1 previous similar message
      [32726.830561] Call Trace TBD:
      [32726.831789] Pid: 819155, comm: mdt00_003 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
      [32726.833277] Call Trace TBD:
      [32808.745683] INFO: task txg_quiesce:816666 blocked for more than 120 seconds.
      [32808.746826]       Tainted: P           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
      [32808.748143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [32808.749328] txg_quiesce     D    0 816666      2 0x80004080
      [32808.750184] Call Trace:
      [32808.750601]  __schedule+0x2a6/0x700
      [32808.751160]  schedule+0x38/0xa0
      [32808.751671]  cv_wait_common+0xfb/0x130 [spl]
      [32808.752325]  ? finish_wait+0x80/0x80
      [32808.752958]  txg_quiesce_thread+0x2ac/0x3a0 [zfs]
      [32808.753741]  ? txg_sync_thread+0x480/0x480 [zfs]
      [32808.754470]  ? __thread_exit+0x20/0x20 [spl]
      [32808.755136]  thread_generic_wrapper+0x6f/0x80 [spl]
      [32808.755892]  kthread+0x112/0x130
      [32808.756397]  ? kthread_flush_work_fn+0x10/0x10
      [32808.757084]  ret_from_fork+0x35/0x40
      [32808.757657] INFO: task mdt_rdpg00_000:816983 blocked for more than 120 seconds.
      [32808.758764]       Tainted: P           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
      [32808.760077] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [32808.761260] mdt_rdpg00_000  D    0 816983      2 0x80004080
      [32808.762114] Call Trace:
      [32808.762502]  __schedule+0x2a6/0x700
      [32808.763046]  schedule+0x38/0xa0
      [32808.763536]  rwsem_down_write_slowpath+0x32d/0x4e0
      [32808.764303]  ? dmu_tx_callback_register+0x57/0x80 [zfs]
      [32808.765124]  qti_lqes_write_lock+0x87/0x190 [lquota]
      [32808.765896]  qmt_dqacq0+0x4eb/0x2380 [lquota]
      [32808.766566]  ? lqe_locate_find+0x546/0x7c0 [lquota]
      [32808.767316]  ? qmt_dqacq+0x668/0x790 [lquota]
      [32808.767996]  qmt_dqacq+0x668/0x790 [lquota]
      [32808.768669]  mdt_quota_dqacq+0x59/0x120 [mdt]
      [32808.769391]  tgt_request_handle+0xc78/0x1910 [ptlrpc]
      [32808.770205]  ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
      [32808.771130]  ptlrpc_main+0xba4/0x14a0 [ptlrpc]
      [32808.771910]  ? __schedule+0x2ae/0x700
      [32808.772528]  ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
      [32808.773381]  kthread+0x112/0x130
      [32808.773908]  ? kthread_flush_work_fn+0x10/0x10
      [32808.774601]  ret_from_fork+0x35/0x40
      [32808.775157] INFO: task qmt_reba_lustre:817021 blocked for more than 120 seconds.
      [32808.776291]       Tainted: P           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
      [32808.777627] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [32808.778806] qmt_reba_lustre D    0 817021      2 0x80004080
      [32808.779652] Call Trace:
      [32808.780033]  __schedule+0x2a6/0x700
      [32808.780586]  schedule+0x38/0xa0
      [32808.781080]  rwsem_down_write_slowpath+0x32d/0x4e0
      [32808.781827]  ? cfs_hash_bd_lookup_intent+0x2a/0xc0 [libcfs]
      [32808.782688]  qmt_reba_thread+0x6ca/0x9b0 [lquota]
      [32808.783407]  ? qmt_glimpse_lock.isra.19+0xfb0/0xfb0 [lquota]
      [32808.784269]  kthread+0x112/0x130
      [32808.784777]  ? kthread_flush_work_fn+0x10/0x10
      [32808.785452]  ret_from_fork+0x35/0x40
      [32808.786011] INFO: task mdt00_003:819155 blocked for more than 120 seconds.
      [32808.787045]       Tainted: P           OE    --------- -  - 4.18.0-240.1.1.el8_lustre.x86_64 #1
      [32808.788345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [32808.789522] mdt00_003       D    0 819155      2 0x80004080
      [32808.790365] Call Trace:
      [32808.790760]  __schedule+0x2a6/0x700
      [32808.791290]  ? __wake_up_common+0x7a/0x190
      [32808.791922]  schedule+0x38/0xa0
      [32808.792406]  io_schedule+0x12/0x40
      [32808.792941]  cv_wait_common+0xaf/0x130 [spl]
      [32808.793592]  ? finish_wait+0x80/0x80
      [32808.794184]  txg_wait_synced_impl+0xc2/0x110 [zfs]
      [32808.794967]  txg_wait_synced+0xc/0x40 [zfs]
      [32808.795619]  osd_trans_stop+0x51f/0x560 [osd_zfs]
      [32808.796374]  top_trans_stop+0x9f/0xd30 [ptlrpc]
      [32808.797089]  ? mdd_acl_chmod+0x185/0x4b0 [mdd]
      [32808.797791]  lod_trans_stop+0x9b/0x340 [lod]
      [32808.798453]  ? mdd_attr_set+0x54a/0xc20 [mdd]
      [32808.799129]  mdd_trans_stop+0x29/0x17f [mdd]
      [32808.799808]  mdt_reint_setattr+0xc7a/0x12a0 [mdt]
      [32808.800565]  ? nodemap_get_from_exp+0xea/0x270 [ptlrpc]
      [32808.801372]  mdt_reint_rec+0x11f/0x250 [mdt]
      [32808.802044]  mdt_reint_internal+0x498/0x780 [mdt]
      [32808.802778]  mdt_reint+0x5e/0x100 [mdt]
      [32808.803400]  tgt_request_handle+0xc78/0x1910 [ptlrpc]
      [32808.804212]  ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
      [32808.805135]  ptlrpc_main+0xba4/0x14a0 [ptlrpc]
      [32808.805824]  ? __schedule+0x2ae/0x700
      [32808.806412]  ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
      [32808.808415]  kthread+0x112/0x130
      [32808.808950]  ? kthread_flush_work_fn+0x10/0x10
      [32808.809648]  ret_from_fork+0x35/0x40
      

      Logs for more hangs are at
      https://testing.whamcloud.com/test_sets/a8822b08-9e99-4320-8750-fbfffad657a5
      https://testing.whamcloud.com/test_sets/713a62aa-3393-4b43-8178-87398949b6c0
      https://testing.whamcloud.com/test_sets/a801aef1-a4d8-4470-a54e-49c953f2c246

      Attachments

        Issue Links

          Activity

            People

              scherementsev Sergey Cheremencev
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: