LU-10872

obdfilter-survey test 1a hangs in lctl

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.13.0, Lustre 2.10.6, Lustre 2.10.7, Lustre 2.15.2
    • Labels: None
    • Environment: ZFS
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      obdfilter-survey test_1a hangs. The last output seen in the client test_log is:

      == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 12:43:34 (1522586614)
      Unable to detect ip address for host: ''
      + NETTYPE=tcp thrlo=8 nobjhi=1 thrhi=16 size=1024 case=disk rslt_loc=/tmp targets="10.9.6.25:lustre-OST0000 10.9.6.25:lustre-OST0001 10.9.6.25:lustre-OST0002 10.9.6.25:lustre-OST0003 10.9.6.25:lustre-OST0004 10.9.6.25:lustre-OST0005 10.9.6.25:lustre-OST0006 10.9.6.25:lustre-OST0007" /usr/bin/obdfilter-survey
      Warning: Permanently added '10.9.6.25' (ECDSA) to the list of known hosts.
      Sun Apr  1 12:43:46 UTC 2018 Obdfilter-survey for case=disk from trevis-45vm5.trevis.hpdd.intel.com
      ost  8 sz  8388608K rsz 1024K obj    8 thr   64 write    8.68 [   0.00,   74.99] rewrite    8.34 [   0.00,   36.00] read   28.79 [   0.00,   66.99]
      ost  8 sz  8388608K rsz 1024K obj    8 thr  128 write   10.07 [   0.00,   71.99] rewrite
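
The final result line above stops right after the word "rewrite", which suggests the survey stalled during the rewrite phase of the 128-thread run. A minimal Python sketch for spotting such truncated lines in a test_log follows. The field layout is inferred only from the two result lines quoted in this ticket, and the names `parse_survey_line`, `HEADER`, `PHASE`, and `LAST_LINE` are hypothetical helpers, not part of obdfilter-survey.

```python
import re

# Leading fields of a result line: OST count, total size, record size,
# object count, and thread count (layout assumed from the lines in this ticket).
HEADER = re.compile(
    r"ost\s+(\d+)\s+sz\s+(\d+)K\s+rsz\s+(\d+)K\s+obj\s+(\d+)\s+thr\s+(\d+)"
)
# A phase name, optionally followed by "rate [min, max]".
PHASE = re.compile(
    r"\b(rewrite|write|read)\b"
    r"(?:\s+([\d.]+)\s+\[\s*([\d.]+),\s*([\d.]+)\])?"
)


def parse_survey_line(line):
    """Return the parsed fields of one result line, or None if it is not one."""
    header = HEADER.search(line)
    if header is None:
        return None
    ost, sz_kb, rsz_kb, obj, thr = (int(g) for g in header.groups())
    phases = {}
    for match in PHASE.finditer(line, header.end()):
        name, rate, low, high = match.groups()
        # A phase name with no numbers means the line was cut off there.
        phases[name] = (
            None if rate is None else (float(rate), float(low), float(high))
        )
    # First phase that was announced but never reported a result.
    stalled = next((n for n, v in phases.items() if v is None), None)
    return {"ost": ost, "sz_kb": sz_kb, "rsz_kb": rsz_kb, "obj": obj,
            "thr": thr, "phases": phases, "stalled": stalled}


# The last line of the client test_log quoted in this ticket.
LAST_LINE = ("ost  8 sz  8388608K rsz 1024K obj    8 thr  128 "
             "write   10.07 [   0.00,   71.99] rewrite")
result = parse_survey_line(LAST_LINE)
print(result["thr"], result["stalled"])
```

Applied to each line of a test_log, a non-empty `stalled` value on the last parsed line gives the thread count and phase at which the run stopped producing output.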
      

       

      The console logs for the OSS show that lctl is hung.

      [45533.585190] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 12:43:34 \(1522586614\)
      [45533.839472] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 12:43:34 (1522586614)
      [45534.132589] LustreError: 0-0: lustre-OST0000: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
      [45534.139448] LustreError: Skipped 15 previous similar messages
      [45535.043556] Lustre: Echo OBD driver; http://www.lustre.org/
      [47640.146175] INFO: task lctl:15920 blocked for more than 120 seconds.
      [47640.149032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [47640.151858] lctl            D ffff880057268fd0     0 15920  15879 0x00000080
      [47640.154680] Call Trace:
      [47640.157152]  [<ffffffffc08cef72>] ? zio_nowait+0xc2/0x160 [zfs]
      [47640.159858]  [<ffffffff816b40e9>] schedule+0x29/0x70
      [47640.162388]  [<ffffffff816b1a49>] schedule_timeout+0x239/0x2c0
      [47640.164992]  [<ffffffff81063f5e>] ? kvm_clock_get_cycles+0x1e/0x20
      [47640.167590]  [<ffffffff816b35ed>] io_schedule_timeout+0xad/0x130
      [47640.170180]  [<ffffffff810b4cb6>] ? prepare_to_wait_exclusive+0x56/0x90
      [47640.172812]  [<ffffffff816b3688>] io_schedule+0x18/0x20
      [47640.175437]  [<ffffffffc0726502>] cv_wait_common+0xb2/0x150 [spl]
      [47640.178083]  [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
      [47640.180646]  [<ffffffffc07265f8>] __cv_wait_io+0x18/0x20 [spl]
      [47640.183211]  [<ffffffffc08ce833>] zio_wait+0x113/0x1c0 [zfs]
      [47640.185714]  [<ffffffffc08184d4>] dmu_buf_hold_array_by_dnode+0x154/0x4a0 [zfs]
      [47640.188421]  [<ffffffffc0818889>] dmu_buf_hold_array_by_bonus+0x69/0x90 [zfs]
      [47640.191206]  [<ffffffffc10d8782>] osd_bufs_get+0x422/0xd00 [osd_zfs]
      [47640.193941]  [<ffffffffc12160cb>] ofd_preprw+0x6bb/0x1170 [ofd]
      [47640.196494]  [<ffffffff8118bb0e>] ? __get_free_pages+0xe/0x40
      [47640.198992]  [<ffffffff811e146e>] ? kmalloc_order_trace+0x2e/0xa0
      [47640.201483]  [<ffffffff811e5011>] ? __kmalloc+0x211/0x230
      [47640.203908]  [<ffffffffc12fc134>] echo_client_prep_commit.isra.51+0x2f4/0xcb0 [obdecho]
      [47640.206558]  [<ffffffffc1303f7f>] echo_client_iocontrol+0x95f/0x1bb0 [obdecho]
      [47640.209174]  [<ffffffffc0c95229>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [47640.211622]  [<ffffffffc0c7fa93>] class_handle_ioctl+0x18e3/0x1df0 [obdclass]
      [47640.214026]  [<ffffffff811b1f16>] ? do_read_fault.isra.44+0xe6/0x130
      [47640.216353]  [<ffffffff812b72be>] ? security_capable+0x1e/0x20
      [47640.218605]  [<ffffffffc0c647f2>] obd_class_ioctl+0xd2/0x170 [obdclass]
      [47640.220979]  [<ffffffff81219e90>] do_vfs_ioctl+0x350/0x560
      [47640.223400]  [<ffffffff816bb521>] ? __do_page_fault+0x171/0x450
      [47640.226015]  [<ffffffff8121a141>] SyS_ioctl+0xa1/0xc0
      [47640.228072]  [<ffffffff816c0655>] ? system_call_after_swapgs+0xa2/0x146
      [47640.230250]  [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
      [47640.232290]  [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
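
To compare this trace with traces from other tickets, it can help to reduce it to the frames the kernel considers reliable. In kernel stack dumps a leading "?" marks an address that was found on the stack but is not confirmed as part of the call chain. The Python sketch below drops those entries and keeps the symbol and module of the rest. The names `FRAME`, `reliable_frames`, `module_chain`, and `SAMPLE` are hypothetical, and the frame format is assumed from the trace in this ticket only.

```python
import re

# One stack frame: "[<address>] [? ]symbol+0xoff/0xsize [module]".
# The module suffix is absent for symbols in the core kernel.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)


def reliable_frames(trace):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.search(line)
        if match is None or match.group(1):
            continue  # not a frame line, or an unreliable '?' entry
        frames.append((match.group(2), match.group(3)))
    return frames


def module_chain(frames):
    """Collapse frames into the ordered list of modules, innermost first."""
    chain = []
    for _, module in frames:
        if module and (not chain or chain[-1] != module):
            chain.append(module)
    return chain


# A few frames copied from the OSS console log in this ticket.
SAMPLE = """\
[47640.157152]  [<ffffffffc08cef72>] ? zio_nowait+0xc2/0x160 [zfs]
[47640.159858]  [<ffffffff816b40e9>] schedule+0x29/0x70
[47640.183211]  [<ffffffffc08ce833>] zio_wait+0x113/0x1c0 [zfs]
[47640.193941]  [<ffffffffc10d8782>] osd_bufs_get+0x422/0xd00 [osd_zfs]
[47640.196494]  [<ffffffffc12160cb>] ofd_preprw+0x6bb/0x1170 [ofd]
[47640.203908]  [<ffffffffc12fc134>] echo_client_prep_commit.isra.51+0x2f4/0xcb0 [obdecho]
"""
frames = reliable_frames(SAMPLE)
print(module_chain(frames))
```

For the excerpt, the module chain runs from zfs through osd_zfs and ofd to obdecho, matching the path in the full trace where the echo client ioctl ends up waiting in `zio_wait`.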
      

       

      This stack trace is very similar to the ones seen in LU-6649 and LU-5775.

       

      So far, this has only been seen in ZFS testing.

       

      Logs for this failure are at

      2.11.0-RC3 - https://testing.hpdd.intel.com/test_sets/33a1e65c-35be-11e8-8f8a-52540065bddc

      2.11.0-RC1 - https://testing.hpdd.intel.com/test_sessions/854854b1-df4e-4c3a-82ee-67efd0b2e5da

      2.10.3 - https://testing.hpdd.intel.com/test_sets/e688805a-2cf4-11e8-b74b-52540065bddc

      2.10.3 - https://testing.hpdd.intel.com/test_sessions/32fecab7-704a-44b5-9645-5b9c10bc41e9

            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)