Lustre / LU-15616

sanity-lnet test_226: Timeout occurred after 112 minutes, last suite running was sanity-lnet

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0

    Description

      This issue was created by maloo for Chris Horn <chris.horn@hpe.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/01cd6eae-ac03-48f6-980d-3977224dfaad

      test_226 failed with the following error:

      Timeout occurred after 112 minutes, last suite running was sanity-lnet
      

      LNetNIFini() and the discovery thread appear to have deadlocked:

      [Thu Mar  3 19:42:41 2022] INFO: task lnet_discovery:424118 blocked for more than 120 seconds.
      [Thu Mar  3 19:42:41 2022]       Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_lustre.x86_64 #1
      [Thu Mar  3 19:42:41 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [Thu Mar  3 19:42:41 2022] lnet_discovery  D    0 424118      2 0x80004080
      [Thu Mar  3 19:42:41 2022] Call Trace:
      [Thu Mar  3 19:42:41 2022]  __schedule+0x2c4/0x700
      [Thu Mar  3 19:42:41 2022]  schedule+0x38/0xa0
      [Thu Mar  3 19:42:41 2022]  schedule_preempt_disabled+0xa/0x10
      [Thu Mar  3 19:42:41 2022]  __mutex_lock.isra.5+0x2d0/0x4a0
      [Thu Mar  3 19:42:41 2022]  lnet_peer_discovery+0x929/0x16c0 [lnet]
      [Thu Mar  3 19:42:41 2022]  ? finish_wait+0x80/0x80
      [Thu Mar  3 19:42:41 2022]  ? lnet_peer_merge_data+0xff0/0xff0 [lnet]
      [Thu Mar  3 19:42:41 2022]  kthread+0x112/0x130
      [Thu Mar  3 19:42:41 2022]  ? kthread_flush_work_fn+0x10/0x10
      [Thu Mar  3 19:42:41 2022]  ret_from_fork+0x35/0x40
      [Thu Mar  3 19:42:41 2022] INFO: task lnetctl:428295 blocked for more than 120 seconds.
      [Thu Mar  3 19:42:41 2022]       Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_lustre.x86_64 #1
      [Thu Mar  3 19:42:41 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [Thu Mar  3 19:42:41 2022] lnetctl         D    0 428295 428283 0x00004080
      [Thu Mar  3 19:42:41 2022] Call Trace:
      [Thu Mar  3 19:42:41 2022]  __schedule+0x2c4/0x700
      [Thu Mar  3 19:42:41 2022]  ? __wake_up_common_lock+0x89/0xc0
      [Thu Mar  3 19:42:41 2022]  schedule+0x38/0xa0
      [Thu Mar  3 19:42:41 2022]  lnet_peer_discovery_stop+0x112/0x260 [lnet]
      [Thu Mar  3 19:42:41 2022]  ? finish_wait+0x80/0x80
      [Thu Mar  3 19:42:41 2022]  LNetNIFini+0x5e/0x100 [lnet]
      [Thu Mar  3 19:42:41 2022]  lnet_ioctl+0x220/0x260 [lnet]
      [Thu Mar  3 19:42:41 2022]  notifier_call_chain+0x47/0x70
      [Thu Mar  3 19:42:41 2022]  blocking_notifier_call_chain+0x3e/0x60
      [Thu Mar  3 19:42:41 2022]  libcfs_psdev_ioctl+0x346/0x590 [libcfs]
      [Thu Mar  3 19:42:41 2022]  do_vfs_ioctl+0xa4/0x640
      [Thu Mar  3 19:42:41 2022]  ? syscall_trace_enter+0x1d3/0x2c0
      [Thu Mar  3 19:42:41 2022]  ksys_ioctl+0x60/0x90
      [Thu Mar  3 19:42:41 2022]  __x64_sys_ioctl+0x16/0x20
      [Thu Mar  3 19:42:41 2022]  do_syscall_64+0x5b/0x1a0
      [Thu Mar  3 19:42:41 2022]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      

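      LNetNIFini() holds ln_api_mutex and is waiting in lnet_peer_discovery_stop() for the discovery thread to stop. The discovery thread is blocked in lnet_peer_discovery() trying to take the same mutex, so neither task can make progress.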
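
      The lock ordering described above can be reproduced in a small user-space model. This is only an illustrative sketch, not LNet code: api_mutex stands in for ln_api_mutex, and discovery_worker, shutdown_holding_mutex and shutdown_dropping_mutex are hypothetical names invented for the example. Releasing the mutex before waiting is one common way to break this kind of cycle; it is not necessarily the change that resolved this ticket.

```python
import threading

api_mutex = threading.Lock()        # stands in for ln_api_mutex (assumption)
stop_requested = threading.Event()  # stands in for the "stop discovery" request


def discovery_worker():
    """Model of the discovery thread: it needs the mutex on every pass."""
    while True:
        with api_mutex:             # blocks while the shutdown path holds the mutex
            if stop_requested.is_set():
                return
        stop_requested.wait(0.01)   # idle briefly between passes


def shutdown_holding_mutex(worker, timeout=0.5):
    """Problem pattern: wait for the worker while still holding the mutex.

    The timeout exists only so the model can return; a wait without a
    timeout would hang here forever. Returns True if the worker exited.
    """
    with api_mutex:
        stop_requested.set()
        worker.join(timeout)
        return not worker.is_alive()


def shutdown_dropping_mutex(worker, timeout=0.5):
    """Possible remedy: release the mutex before waiting for the worker."""
    with api_mutex:
        stop_requested.set()
    worker.join(timeout)
    return not worker.is_alive()
```

      In this model, shutdown_holding_mutex() reports that the worker never exits while the mutex is held, because the worker can only observe the stop request after acquiring the mutex. shutdown_dropping_mutex() lets the worker take the mutex, see the request, and return.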

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lnet test_226 - Timeout occurred after 112 minutes, last suite running was sanity-lnet

            People

              hornc Chris Horn
              maloo Maloo