Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Fix Version: Lustre 2.9.0
- Components: None
- Environment: autotest review-dne-part-2
- Severity: 3
- Rank: 9223372036854775807
Description
sanity-scrub test 5 hangs and times out. Looking at the console logs, it’s not clear why the test is hanging.
From the test_log, the last thing we see is:
Starting client: onyx-40vm1.onyx.hpdd.intel.com: -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
CMD: onyx-40vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
CMD: onyx-40vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
CMD: onyx-40vm3,onyx-40vm7 /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
CMD: onyx-40vm3,onyx-40vm7 /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
fail_val=3
fail_loc=0x190
fail_val=3
fail_loc=0x190
  File: '/mnt/lustre/d5.sanity-scrub/mds1/f5.sanity-scrub800'
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d    Inode: 144115205322834801  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-08-01 19:00:37.000000000 -0700
Modify: 2016-08-01 18:59:51.000000000 -0700
Change: 2016-08-01 19:00:37.000000000 -0700
 Birth: -
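For anyone trying to reproduce this locally, the usual way to re-run just this subtest with the Lustre test framework is roughly the following (a sketch assuming the standard lustre/tests layout and an already-configured test cluster; adjust to your local config):

# re-run only sanity-scrub test_5
cd lustre/tests
ONLY=5 sh sanity-scrub.sh

# the fail_loc=0x190 the test sets on the servers appears to be
# OBD_FAIL_OSD_SCRUB_DELAY (it slows the OI scrub so the test can race against it);
# if a run is aborted, it can be cleared by hand on the server nodes:
lctl set_param fail_loc=0 fail_val=0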
I don’t see any LBUGs or ASSERTIONs in any of the console logs. In client logs from runs where this test completes, we see an mds_connect to the MDS fail and then a later successful connection. In the client dmesg for this failure, we see the failing connection attempts to the MDS, but the client never manages to connect:
[13304.508640] Lustre: Unmounted lustre-client
[13304.511334] Lustre: Skipped 2 previous similar messages
[13367.608943] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
[13367.616737] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
[13368.970139] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
[13368.978423] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
[13369.048679] Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1
[13449.961128] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
[13449.968913] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock onyx-40vm7@tcp:/lustre /mnt/lustre
[13450.020802] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[13450.024064] LustreError: Skipped 2 previous similar messages
[13485.017940] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[13485.023276] LustreError: Skipped 6 previous similar messages
[13550.024059] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[13550.031793] LustreError: Skipped 12 previous similar messages
[13680.018274] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[13680.023602] LustreError: Skipped 25 previous similar messages
[13940.017773] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[13940.023074] LustreError: Skipped 51 previous similar messages
[14455.018070] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[14455.023708] LustreError: Skipped 102 previous similar messages
[15060.018374] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[15060.024085] LustreError: Skipped 120 previous similar messages
[15665.017755] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[15665.024000] LustreError: Skipped 120 previous similar messages
[16270.017567] LustreError: 11-0: lustre-MDT0001-mdc-ffff880051c8a800: operation mds_connect to node 10.2.4.185@tcp failed: rc = -16
[16270.023955] LustreError: Skipped 120 previous similar messages
[16816.123140] SysRq : Show State
[16816.124024]   task                        PC stack   pid father
[16816.124024] systemd         S 0000000000000000     0     1      0 0x00000000
[16816.124024]  ffff88007c7b3db8 0000000000000086 ffff88007c7a8000 ffff88007c7b3fd8
[16816.124024]  ffff88007c7b3fd8 ffff88007c7b3fd8 ffff88007c7a8000 0000000000000000
[16816.124024]  0000000000000000 ffff88003693d5a0 ffff88007c7a8000 0000000000000000
[16816.124024] Call Trace:
[16816.124024]  [<ffffffff8163b809>] schedule+0x29/0x70
[16816.124024]  [<ffffffff8163a99d>] schedule_hrtimeout_range_clock+0x12d/0x150
[16816.124024]  [<ffffffff81228a59>] ? ep_scan_ready_list.isra.9+0x1b9/0x1f0
[16816.124024]  [<ffffffff8163a9d3>] schedule_hrtimeout_range+0x13/0x20
[16816.124024]  [<ffffffff81228cee>] ep_poll+0x23e/0x360
[16816.124024]  [<ffffffff8122b23d>] ? do_timerfd_settime+0x2ed/0x3a0
[16816.124024]  [<ffffffff810b88d0>] ? wake_up_state+0x20/0x20
[16816.124024]  [<ffffffff81229ded>] SyS_epoll_wait+0xed/0x120
[16816.124024]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
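For what it's worth, rc = -16 is -EBUSY, which the MDT normally returns to mds_connect while it is still in recovery or otherwise not ready to accept the client. A rough sketch of what could be checked on the MDS while the client loops like this (the target name lustre-MDT0001 is taken from the log above; parameter names follow the usual lctl conventions):

# on the MDS: is MDT0001 still stuck in recovery?
lctl get_param mdt.lustre-MDT0001.recovery_status

# is the OI scrub started by test_5 still running or stalled?
lctl get_param -n osd-ldiskfs.lustre-MDT0001.oi_scrub

# dump server-side task states, to compare with the client SysRq output above
echo t > /proc/sysrq-trigger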
These failures do NOT look like LU-8070 or LU-8399.
Recent failure logs are at:
2016-07-22 - https://testing.hpdd.intel.com/test_sets/b5c71fbc-5087-11e6-8968-5254006e85c2
2016-07-25 - https://testing.hpdd.intel.com/test_sets/cdc385e8-52d9-11e6-bf87-5254006e85c2
2016-07-25 - https://testing.hpdd.intel.com/test_sets/eaaeec4a-52d6-11e6-bf87-5254006e85c2
2016-07-29 - https://testing.hpdd.intel.com/test_sets/97c36552-5604-11e6-b5b1-5254006e85c2
2016-07-29 - https://testing.hpdd.intel.com/test_sets/5d523e04-55e9-11e6-aa74-5254006e85c2
2016-08-01 - https://testing.hpdd.intel.com/test_sets/0dca7564-583b-11e6-aa74-5254006e85c2
2016-08-01 - https://testing.hpdd.intel.com/test_sets/64a45f36-584e-11e6-aa74-5254006e85c2
2016-08-02 - https://testing.hpdd.intel.com/test_sets/058493fc-588c-11e6-b5b1-5254006e85c2