Details
Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.4
Labels: None
Severity: 3
Description
sanity-benchmark test_fsx hangs.
Looking at the hang at https://testing.whamcloud.com/test_sets/9c812454-4b41-11ea-a1c8-52540065bddc, the last thing seen in the client test_log is
== sanity-benchmark test fsx: fsx ==================================================================== 15:41:36 (1581176496)
debug=0
Using: fsx -c 50 -p 1000 -S 20400 -P /tmp -l 3438416 -N 100000 /mnt/lustre/f0.fsxfile
Chance of close/open is 1 in 50
Seed set to 20400
truncating to largest ever: 0x1240bb
truncating to largest ever: 0x1b0cac
truncating to largest ever: 0x331506
truncating to largest ever: 0x338a0b
truncating to largest ever: 0x3443e1
The console logs contain no call traces and too few error messages to explain why the test hangs. The client1 (vm6) console log shows
[31869.584851] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-benchmark test fsx: fsx ==================================================================== 15:41:36 \(1581176496\)
[31869.821328] Lustre: DEBUG MARKER: == sanity-benchmark test fsx: fsx ==================================================================== 15:41:36 (1581176496)
[31869.901535] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000401:0x6ac8:0x0], use llapi_layout_get_by_path()
[31869.917359] Lustre: lustre-OST0002-osc-ffff98bc9b52c000: reconnect after 1s idle
[31869.918649] Lustre: Skipped 5 previous similar messages
[31916.662148] Lustre: 24080:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1581176536/real 1581176536] req@ffff98bc805cb600 x1657965685190528/t0(0) o400->lustre-MDT0000-mdc-ffff98bc9b52c000@10.9.5.70@tcp:12/10 lens 224/224 e 0 to 1 dl 1581176543 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[31916.667085] Lustre: lustre-MDT0000-mdc-ffff98bc9b52c000: Connection to lustre-MDT0000 (at 10.9.5.70@tcp) was lost; in progress operations using this service will wait for recovery to complete
[31916.669901] LustreError: 166-1: MGC10.9.5.70@tcp: Connection to MGS (at 10.9.5.70@tcp) was lost; in progress operations using this service will fail
<ConMan> Console [trevis-26vm6] disconnected from <trevis-26:6005> at 02-08 16:14.
Although there are other examples of sanity-benchmark test fsx hanging in the past, there isn’t enough information here to match this hang to past failures.
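To put the client console excerpt on a timeline, the following is a minimal sketch (not part of the Lustre test framework) that pulls the kernel uptime stamps and epoch values out of two of the console lines quoted in this report. The two sample lines are abbreviated copies of the originals, with "[...]" marking text cut for brevity; the helper names parse_entries and utc are hypothetical and exist only for this illustration.

```python
import re
from datetime import datetime, timezone

# Abbreviated copies of two console lines from this report.
# "[...]" marks text removed for brevity; timestamps are unchanged.
CONSOLE = """\
[31869.821328] Lustre: DEBUG MARKER: == sanity-benchmark test fsx: fsx == 15:41:36 (1581176496)
[31916.662148] Lustre: 24080:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1581176536/real 1581176536] [...] dl 1581176543 ref 1
"""

# One console entry per line: "[<uptime seconds>] <message>".
ENTRY = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$", re.MULTILINE)


def parse_entries(text):
    """Return a list of (uptime_seconds, message) tuples (hypothetical helper)."""
    return [(float(ts), msg) for ts, msg in ENTRY.findall(text)]


def utc(epoch):
    """Format an epoch value as a UTC wall-clock string (hypothetical helper)."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


entries = parse_entries(CONSOLE)
start_uptime, start_msg = entries[0]
timeout_uptime, timeout_msg = entries[1]

# Epoch printed in the test marker, then the RPC send time and deadline.
test_start = int(re.search(r"\((\d{10})\)", start_msg).group(1))
sent = int(re.search(r"sent (\d+)/", timeout_msg).group(1))
deadline = int(re.search(r"\bdl (\d+)", timeout_msg).group(1))

uptime_gap = timeout_uptime - start_uptime

print("test started:      ", utc(test_start))
print("RPC sent:          ", utc(sent), f"(+{sent - test_start}s)")
print("RPC deadline:      ", utc(deadline), f"(+{deadline - sent}s after send)")
print(f"uptime gap marker -> timeout: {uptime_gap:.3f}s")
```

Run against these two lines, the marker epoch 1581176496 decodes to 2020-02-08 15:41:36 UTC, which matches the time printed in the marker itself. The timed-out request was sent 40 seconds after the test started and had a deadline 7 seconds after that. The ConMan disconnect at 16:14 comes roughly half an hour later with no further console output in between.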