[LU-7837] sanity test_27o times out hung in reset_enospc Created: 02/Mar/16  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

autotest review-dne-part-1


Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_27o times out in review-dne-part-1. The last output seen in the test_log is:

osc.lustre-OST0007-osc-MDT0003.prealloc_reserved=0
osc.lustre-OST0007-osc-MDT0003.prealloc_status=-28
CMD: onyx-42vm8 lctl set_param fail_loc=0x215
fail_loc=0x215
CMD: onyx-42vm7 lctl get_param -n lov.*.qos_maxage
touch: cannot touch `/mnt/lustre/d27o.sanity/f27o.sanity': No space left on device
CMD: onyx-42vm8 lctl set_param fail_loc=0
fail_loc=0
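
In the excerpt above, prealloc_status=-28 corresponds to -ENOSPC (errno 28 on Linux), so that device is reporting itself as full. As a rough aid for scanning a longer test_log, the sketch below pulls out the device names that report this status. The function name full_osts is hypothetical and is not part of the Lustre test framework; it only assumes the "osc.<device>.prealloc_status=<value>" line format shown in the log.

```shell
# Hypothetical helper (not part of test-framework.sh).
# Reads "param=value" lines on stdin and prints the device name of every
# osc.*.prealloc_status entry whose value is -28 (-ENOSPC).
full_osts() {
        sed -n 's/^osc\.\(.*\)\.prealloc_status=-28$/\1/p'
}
```

For example, `full_osts < test_log` would print one device name per matching line; lines in any other format are ignored.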

The ‘No space left on device’ error is expected. The test appears to hang in the reset_enospc() routine, possibly on the call to sync.

# OSCs keep a NOSPC flag that will be reset after ~5s (qos_maxage)
# if the OST isn't full anymore.
reset_enospc() {
        local OSTIDX=${1:-""}

        local list=$(comma_list $(osts_nodes))
        [ "$OSTIDX" ] && list=$(facet_host ost$((OSTIDX + 1)))

        do_nodes $list lctl set_param fail_loc=0
        sync    # initiate all OST_DESTROYs from MDS to OST
        sleep_maxage
}
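
One way to test the suspicion that sync is the call that hangs in reset_enospc() above would be to run it under a deadline, so that a stuck call shows up as an explicit message rather than a test-set timeout. The following is only a sketch: run_with_deadline is a hypothetical helper, not an existing test-framework function, and it assumes the GNU coreutils `timeout` utility, which exits with status 124 when the limit expires.

```shell
# Hypothetical helper (not part of test-framework.sh).
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
# Runs COMMAND under coreutils `timeout` and reports on stderr if the
# limit was reached. Returns the exit status from `timeout`.
run_with_deadline() {
        local limit=$1
        shift

        timeout "$limit" "$@"
        local rc=$?

        # GNU timeout exits with 124 when the command ran past the limit
        if [ "$rc" -eq 124 ]; then
                echo "'$*' did not finish within ${limit}s" >&2
        fi
        return "$rc"
}
```

In a debugging run, the bare `sync` could be replaced by something like `run_with_deadline 60 sync`; the 60-second limit is an arbitrary choice. Note that a sync blocked in uninterruptible kernel I/O may not exit when signalled, so the deadline may not fire in that case.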

The logs are incomplete and aren’t much help. The only things that look out of place are some console disconnect notices for client2:

03:26:34:Lustre: DEBUG MARKER: == sanity test 27o: create file with all full OSTs (should error) ====== 01:24:20 (1456824260)
03:26:34:
03:26:34:<ConMan> Console [onyx-42vm2] disconnected from <onyx-42:6001> at 03-01 03:24.
03:26:34:
03:26:34:<ConMan> Console [onyx-42vm2] connected to <onyx-42:6001> at 03-01 03:24.
03:26:34:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true

and for MDS1:

03:26:47:Lustre: DEBUG MARKER: == sanity test 27o: create file with all full OSTs (should error) ====== 01:24:20 (1456824260)
03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
03:26:47:Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
03:26:47:
03:26:47:<ConMan> Console [onyx-42vm7] disconnected from <onyx-42:6006> at 03-01 03:26.
03:26:47:
03:26:47:<ConMan> Console [onyx-42vm7] connected to <onyx-42:6006> at 03-01 03:26.
03:26:47:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true

Logs are at https://testing.hpdd.intel.com/test_sets/4a1a8e4a-dfce-11e5-9020-5254006e85c2

