Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8472

sanity-scrub test_5 times out

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.9.0
    • Lustre 2.9.0
    • None
    • autotest review-dne-part-2
    • 3
    • 9223372036854775807

    Description

      sanity-scrub test 5 hangs and times out. Looking at the console logs, it’s not clear why the test is hanging.

      From the test_log, the last thing we see

      Starting client: onyx-40vm1.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
      CMD: onyx-40vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
      CMD: onyx-40vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
      CMD: onyx-40vm3,onyx-40vm7 /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
      CMD: onyx-40vm3,onyx-40vm7 /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
      fail_val=3
      fail_loc=0x190
      fail_val=3
      fail_loc=0x190
        File: '/mnt/lustre/d5.sanity-scrub/mds1/f5.sanity-scrub800'
        Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
      Device: 2c54f966h/743766374d	Inode: 144115205322834801  Links: 1
      Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2016-08-01 19:00:37.000000000 -0700
      Modify: 2016-08-01 18:59:51.000000000 -0700
      Change: 2016-08-01 19:00:37.000000000 -0700
       Birth: -
      

      I don’t see any LBUG or ASSERTIONs in any of the console logs. In client logs when this test completes, we see a mds_connect to the MDS fail and then later connect. From the client dmesg, we see the problem connection to the MDS, but we don’t connect with the MDS:

      [13304.508640] Lustre: Unmounted lustre-client
      [13304.511334] Lustre: Skipped 2 previous similar messages
      [13367.608943] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
      [13367.616737] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
      [13368.970139] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
      [13368.978423] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
      [13369.048679] Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1
      [13449.961128] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
      [13449.968913] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
      [13450.020802] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
      [13450.024064] LustreError: Skipped 2 previous similar messages
      [13485.017940] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
      [13485.023276] LustreError: Skipped 6 previous similar messages
      [13550.024059] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
      [13550.031793] LustreError: Skipped 12 previous similar messages
      [13680.018274] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
      [13680.023602] LustreError: Skipped 25 previous similar messages
      [13940.017773] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
      [13940.023074] LustreError: Skipped 51 previous similar messages
      [14455.018070] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
      [14455.023708] LustreError: Skipped 102 previous similar messages
      [15060.018374] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
      [15060.024085] LustreError: Skipped 120 previous similar messages
      [15665.017755] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
      [15665.024000] LustreError: Skipped 120 previous similar messages
      [16270.017567] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
      [16270.023955] LustreError: Skipped 120 previous similar messages
      [16816.123140] SysRq : Show State
      [16816.124024]   task                        PC stack   pid father
      [16816.124024] systemd         S 0000000000000000     0     1      0 0x00000000
      [16816.124024]  ffff88007c7b3db8 0000000000000086 ffff88007c7a8000 ffff88007c7b3fd8
      [16816.124024]  ffff88007c7b3fd8 ffff88007c7b3fd8 ffff88007c7a8000 0000000000000000
      [16816.124024]  0000000000000000 ffff88003693d5a0 ffff88007c7a8000 0000000000000000
      [16816.124024] Call Trace:
      [16816.124024]  [<ffffffff8163b809>] schedule+0x29/0x70
      [16816.124024]  [<ffffffff8163a99d>] schedule_hrtimeout_range_clock+0x12d/0x150
      [16816.124024]  [<ffffffff81228a59>] ? ep_scan_ready_list.isra.9+0x1b9/0x1f0
      [16816.124024]  [<ffffffff8163a9d3>] schedule_hrtimeout_range+0x13/0x20
      [16816.124024]  [<ffffffff81228cee>] ep_poll+0x23e/0x360
      [16816.124024]  [<ffffffff8122b23d>] ? do_timerfd_settime+0x2ed/0x3a0
      [16816.124024]  [<ffffffff810b88d0>] ? wake_up_state+0x20/0x20
      [16816.124024]  [<ffffffff81229ded>] SyS_epoll_wait+0xed/0x120
      [16816.124024]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
      

      These failures do NOT look like LU-8070 nor LU-8399.

      Recent failure logs are at
      2016-07-22 - https://testing.hpdd.intel.com/test_sets/b5c71fbc-5087-11e6-8968-5254006e85c2
      2016-07-25 - https://testing.hpdd.intel.com/test_sets/cdc385e8-52d9-11e6-bf87-5254006e85c2
      2016-07-25 - https://testing.hpdd.intel.com/test_sets/eaaeec4a-52d6-11e6-bf87-5254006e85c2
      2016-07-29 - https://testing.hpdd.intel.com/test_sets/97c36552-5604-11e6-b5b1-5254006e85c2
      2016-07-29 - https://testing.hpdd.intel.com/test_sets/5d523e04-55e9-11e6-aa74-5254006e85c2
      2016-08-01 - https://testing.hpdd.intel.com/test_sets/0dca7564-583b-11e6-aa74-5254006e85c2
      2016-08-01 - https://testing.hpdd.intel.com/test_sets/64a45f36-584e-11e6-aa74-5254006e85c2
      2016-08-02 - https://testing.hpdd.intel.com/test_sets/058493fc-588c-11e6-b5b1-5254006e85c2

      Attachments

        Issue Links

          Activity

            People

              yong.fan nasf (Inactive)
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: