  1. Lustre
  2. LU-7837

sanity test_27o times out, hung in reset_enospc


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Environment: autotest review-dne-part-1
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      sanity test 27o times out in review-dne-part-1. The last output in the test_log is:

      osc.lustre-OST0007-osc-MDT0003.prealloc_reserved=0
      osc.lustre-OST0007-osc-MDT0003.prealloc_status=-28
      CMD: onyx-42vm8 lctl set_param fail_loc=0x215
      fail_loc=0x215
      CMD: onyx-42vm7 lctl get_param -n lov.*.qos_maxage
      touch: cannot touch `/mnt/lustre/d27o.sanity/f27o.sanity': No space left on device
      CMD: onyx-42vm8 lctl set_param fail_loc=0
      fail_loc=0
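
The prealloc_status=-28 value in the excerpt above is -ENOSPC on Linux, which matches the "No space left on device" error from touch. As a rough sketch for scanning a longer test_log for OSCs in that state, the helper below filters "name=value" lines. The function name enospc_targets is hypothetical and is not part of the Lustre test framework; the sample input is copied from the log above.

```shell
# Hypothetical helper (not part of the Lustre test framework).
# Reads "name=value" lines, such as lctl get_param output, on stdin and
# prints the name of every prealloc_status parameter whose value is -28,
# which is -ENOSPC on Linux.
enospc_targets() {
    awk -F= '$1 ~ /\.prealloc_status$/ && $2 == "-28" { print $1 }'
}

# Sample input copied from the test_log excerpt in this ticket.
printf '%s\n' \
    'osc.lustre-OST0007-osc-MDT0003.prealloc_reserved=0' \
    'osc.lustre-OST0007-osc-MDT0003.prealloc_status=-28' |
    enospc_targets
```

With the two sample lines, only the prealloc_status parameter name is printed; the prealloc_reserved line is ignored.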
      

      The ‘No space left on device’ error is expected. The test appears to be hung in the reset_enospc() routine, possibly in the call to sync:

      # OSCs keep a NOSPC flag that will be reset after ~5s (qos_maxage)
      # if the OST isn't full anymore.
      reset_enospc() {
              local OSTIDX=${1:-""}

              local list=$(comma_list $(osts_nodes))
              [ "$OSTIDX" ] && list=$(facet_host ost$((OSTIDX + 1)))

              do_nodes $list lctl set_param fail_loc=0
              sync    # initiate all OST_DESTROYs from MDS to OST
              sleep_maxage
      }
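
One way to test the suspicion that sync is where reset_enospc() blocks would be to run it under a simple watchdog, so that a stuck call is reported instead of silently stalling the test. The sketch below is only an illustration under that assumption: wait_or_report is a hypothetical helper, not a test-framework.sh function, and the 124 return code is chosen here to mirror the convention of coreutils timeout.

```shell
# Hypothetical helper, not part of test-framework.sh.
# Runs a command in the background and polls once per second, so the caller
# regains control even if the command never returns.
# Usage: wait_or_report <seconds> <command> [args...]
wait_or_report() {
    local limit=$1
    shift
    "$@" &
    local pid=$!
    local waited=0

    # kill -0 only checks whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "still running after ${limit}s: $*" >&2
            return 124
        fi
        sleep 1
        waited=$((waited + 1))
    done

    # The command finished in time; pass its exit status through.
    wait "$pid"
}
```

A call such as `wait_or_report 60 sync` would then return 124 and log a message if sync had not completed within a minute. The background process is left running in that case, since a sync blocked in uninterruptible sleep cannot be killed anyway.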
      

      The logs are incomplete and aren’t much help. The only things that look out of place are some console disconnect notices for client 2:

      03:26:34:Lustre: DEBUG MARKER: == sanity test 27o: create file with all full OSTs (should error) ====== 01:24:20 (1456824260)
      03:26:34:
      03:26:34:<ConMan> Console [onyx-42vm2] disconnected from <onyx-42:6001> at 03-01 03:24.
      03:26:34:
      03:26:34:<ConMan> Console [onyx-42vm2] connected to <onyx-42:6001> at 03-01 03:24.
      03:26:34:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
      

      and for MDS1:

      03:26:47:Lustre: DEBUG MARKER: == sanity test 27o: create file with all full OSTs (should error) ====== 01:24:20 (1456824260)
      03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
      03:26:47:
      03:26:47:<ConMan> Console [onyx-42vm7] disconnected from <onyx-42:6006> at 03-01 03:26.
      03:26:47:
      03:26:47:<ConMan> Console [onyx-42vm7] connected to <onyx-42:6006> at 03-01 03:26.
      03:26:47:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
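
To see at a glance which nodes lost their console during a run, the ConMan notices can be pulled out of the console logs with a small filter. This is a minimal sketch that assumes the notice format shown in the two excerpts above; conman_disconnects is a hypothetical name, not an existing tool.

```shell
# Hypothetical helper: reads console log lines on stdin and prints the
# unique node names from "<ConMan> Console [node] disconnected from ..."
# notices. Reconnect notices ("connected to") are not matched.
conman_disconnects() {
    sed -n 's/.*<ConMan> Console \[\([^]]*\)\] disconnected from .*/\1/p' |
        sort -u
}

# Sample input copied from the console log excerpts in this ticket.
printf '%s\n' \
    '03:26:34:<ConMan> Console [onyx-42vm2] disconnected from <onyx-42:6001> at 03-01 03:24.' \
    '03:26:34:<ConMan> Console [onyx-42vm2] connected to <onyx-42:6001> at 03-01 03:24.' \
    '03:26:47:<ConMan> Console [onyx-42vm7] disconnected from <onyx-42:6006> at 03-01 03:26.' |
    conman_disconnects
```

For the sample lines this lists onyx-42vm2 and onyx-42vm7, the two nodes whose consoles dropped during test 27o.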
      

      Logs are at https://testing.hpdd.intel.com/test_sets/4a1a8e4a-dfce-11e5-9020-5254006e85c2

          People

            Assignee: WC Triage (wc-triage)
            Reporter: James Nunez (jamesanunez, Inactive)