[LU-13230] sanity-benchmark test fsx hangs Created: 10/Feb/20  Updated: 11/Feb/20  Resolved: 11/Feb/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: James Nunez (Inactive)
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-12234 sanity-benchmark test iozone hangs in... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-benchmark test_fsx hangs.

In the hang at https://testing.whamcloud.com/test_sets/9c812454-4b41-11ea-a1c8-52540065bddc, the last output in the client test_log is:

== sanity-benchmark test fsx: fsx ==================================================================== 15:41:36 (1581176496)
debug=0
Using: fsx -c 50 -p 1000 -S 20400 -P /tmp -l 3438416         -N 100000  /mnt/lustre/f0.fsxfile
Chance of close/open is 1 in 50
Seed set to 20400
truncating to largest ever: 0x1240bb
truncating to largest ever: 0x1b0cac
truncating to largest ever: 0x331506
truncating to largest ever: 0x338a0b
truncating to largest ever: 0x3443e1
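
For reference, the hexadecimal sizes in the fsx output above can be compared with the -l value from the command line. The sketch below is only illustrative: it assumes -l is fsx's upper bound on file size, and the names FILE_LIMIT, LOG_LINES and largest_sizes are made up for this example. It shows that the last size reported before the hang is close to that limit.

```python
# Values copied from the fsx command line and output in this ticket.
FILE_LIMIT = 3438416  # argument of -l (assumed to be the upper bound on file size)
LOG_LINES = [
    "truncating to largest ever: 0x1240bb",
    "truncating to largest ever: 0x1b0cac",
    "truncating to largest ever: 0x331506",
    "truncating to largest ever: 0x338a0b",
    "truncating to largest ever: 0x3443e1",
]


def largest_sizes(lines):
    """Return the sizes in bytes reported by 'truncating to largest ever' lines."""
    prefix = "truncating to largest ever: "
    return [int(line[len(prefix):], 16) for line in lines if line.startswith(prefix)]


sizes = largest_sizes(LOG_LINES)
for size in sizes:
    # Print each size in hex, in decimal, and as a share of the -l limit.
    print(f"{size:#x} = {size} bytes ({size / FILE_LIMIT:.1%} of the -l limit)")
```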

Looking at the console logs, there are no call traces and too few error messages to explain why the test hangs. The client1 (vm6) console log shows:

[31869.584851] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-benchmark test fsx: fsx ==================================================================== 15:41:36 \(1581176496\)
[31869.821328] Lustre: DEBUG MARKER: == sanity-benchmark test fsx: fsx ==================================================================== 15:41:36 (1581176496)
[31869.901535] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000401:0x6ac8:0x0], use llapi_layout_get_by_path()
[31869.917359] Lustre: lustre-OST0002-osc-ffff98bc9b52c000: reconnect after 1s idle
[31869.918649] Lustre: Skipped 5 previous similar messages
[31916.662148] Lustre: 24080:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1581176536/real 1581176536]  req@ffff98bc805cb600 x1657965685190528/t0(0) o400->lustre-MDT0000-mdc-ffff98bc9b52c000@10.9.5.70@tcp:12/10 lens 224/224 e 0 to 1 dl 1581176543 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[31916.667085] Lustre: lustre-MDT0000-mdc-ffff98bc9b52c000: Connection to lustre-MDT0000 (at 10.9.5.70@tcp) was lost; in progress operations using this service will wait for recovery to complete
[31916.669901] LustreError: 166-1: MGC10.9.5.70@tcp: Connection to MGS (at 10.9.5.70@tcp) was lost; in progress operations using this service will fail

<ConMan> Console [trevis-26vm6] disconnected from <trevis-26:6005> at 02-08 16:14.

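The timestamps in the console log above can be lined up to see how soon after the test started the first request timed out. The following is a minimal sketch, not a Lustre tool: the helper names (uptime, parse_timeout, utc) are hypothetical, the two input lines are copied from this ticket, and the mapping of opcode 400 to OBD_PING comes from the Lustre headers rather than from this ticket, so treat it as an assumption.

```python
import re
from datetime import datetime, timezone

# Both lines are copied from the client console log above; the run of "="
# characters in the marker line is shortened for readability.
MARKER = ("[31869.821328] Lustre: DEBUG MARKER: == sanity-benchmark test fsx: "
          "fsx ==== 15:41:36 (1581176496)")
TIMEOUT = ("[31916.662148] Lustre: 24080:0:(client.c:2133:ptlrpc_expire_one_request()) "
           "@@@ Request sent has timed out for slow reply: "
           "[sent 1581176536/real 1581176536]  req@ffff98bc805cb600 "
           "x1657965685190528/t0(0) "
           "o400->lustre-MDT0000-mdc-ffff98bc9b52c000@10.9.5.70@tcp:12/10 "
           "lens 224/224 e 0 to 1 dl 1581176543 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1")

# Assumption (Lustre headers, not this ticket): opcode 400 is OBD_PING.
OPCODES = {400: "OBD_PING"}


def uptime(line):
    """Kernel uptime in seconds from the leading [seconds.micros] stamp."""
    return float(re.match(r"\[\s*(\d+\.\d+)\]", line).group(1))


def parse_timeout(line):
    """Extract send time, deadline, opcode and target from an expired-request line."""
    sent = int(re.search(r"\[sent (\d+)/", line).group(1))
    deadline = int(re.search(r" dl (\d+) ", line).group(1))
    opc, target = re.search(r" o(\d+)->([^@\s]+)@", line).groups()
    return {"sent": sent, "deadline": deadline, "opcode": int(opc), "target": target}


def utc(epoch):
    """Format an epoch timestamp as HH:MM:SS in UTC."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%H:%M:%S")


test_start = int(re.search(r"\((\d+)\)\s*$", MARKER).group(1))
req = parse_timeout(TIMEOUT)

print(f"test start    {utc(test_start)} UTC")
print(f"request sent  {utc(req['sent'])} UTC (+{req['sent'] - test_start}s), "
      f"{OPCODES.get(req['opcode'], req['opcode'])} to {req['target']}")
print(f"deadline      {utc(req['deadline'])} UTC (+{req['deadline'] - req['sent']}s after send)")
print(f"console gap   {uptime(TIMEOUT) - uptime(MARKER):.1f}s between marker and timeout")
```

Under these assumptions, the expired request was sent to the MDT 40 seconds after the test started and hit its deadline 7 seconds later, which agrees with the gap of roughly 47 seconds between the two console stamps.
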
Although sanity-benchmark test fsx has hung in the past, there isn’t enough information here to match this hang to those earlier failures.



 Comments   
Comment by Gerrit Updater [ 10/Feb/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37522
Subject: LU-13230 tests: reproduce test fsx hang
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 8180e0423a2de7e31e6f06cee009db61368faded

Comment by Andreas Dilger [ 11/Feb/20 ]

There were also failures in sanityn fsx runs caused by the landing of patch https://review.whamcloud.com/8201 "LU-4198 clio: turn on lockless for some kind of IO", which may be related to this issue.

Comment by Andreas Dilger [ 11/Feb/20 ]

Never mind - that patch has not yet landed to b2_12.

Comment by James Nunez (Inactive) [ 11/Feb/20 ]

Closing this ticket as a duplicate of LU-12234.

In this case, look at the console logs for test_iozone and not test_fsx.
