[LU-9893] replay-single test_70c: test failed to respond and timed out Created: 18/Aug/17  Updated: 24/Aug/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1, Lustre 2.11.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: James Casper
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None
Environment:

Trevis2, failover
server: RHEL 7.3, ldiskfs, branch master, v2.10.51, b3620
client: RHEL 7.3, branch master, v2.10.51, b3620


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/9b7c7e8e-7b5a-4f4d-af09-400c586a8340

This looks like another MDS hang-on-umount issue for this build. It may be related to these tickets:

LU-9791 (parallel-scale-nfsv3 & v4)
LU-9856 (racer)
LU-9469 (conf-sanity)

However, this one does not show this message in the MDS console log:

BUG: unable to handle kernel NULL pointer dereference at (null)

What this failure shares with the three tickets above is that each has an MDS umount as the last entry in the suite_log, with no further activity after it.

From suite_log:

test_70c fail mds1 1 times
Failing mds1 on trevis-41vm3
CMD: trevis-41vm3 grep -c /mnt/lustre-mds1' ' /proc/mounts
Stopping /mnt/lustre-mds1 (opts:) on trevis-41vm3
CMD: trevis-41vm3 umount -d /mnt/lustre-mds1  (end of log)
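
The pattern described above (a suite_log whose last entry is an umount command) could be checked mechanically when triaging other sessions. The following is a minimal sketch, not part of the test framework: the function name suite_log_ends_in_umount is hypothetical, and it assumes the log uses the "CMD: <host> <command>" line format shown in the excerpt.

```shell
# Hypothetical helper: succeed when the last non-blank line of a suite_log
# is a "CMD: <host> umount ..." line, as in the excerpt above.
# Usage: suite_log_ends_in_umount <suite_log>
suite_log_ends_in_umount() {
    # Drop blank lines, then keep only the final remaining line.
    last=$(grep -v '^[[:space:]]*$' "$1" | tail -n 1)
    case $last in
        "CMD: "*" umount "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller could run it over a set of downloaded suite_log files and print the names of those for which it succeeds.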


 Comments   
Comment by John Hammond [ 23/Aug/17 ]

Jim, can you grab /kdumproot/scratch//dumps/trevis-41vm3.trevis.hpdd.intel.com/10.9.5.239-2017-08-04-12:28:28/vmcore-dmesg.txt?

See https://testing.hpdd.intel.com/test_logs/f0c4415a-799c-11e7-8e1f-5254006e85c2/show_text

Comment by James Casper [ 23/Aug/17 ]

Looks like that directory no longer exists:

[root@trevis-41 trevis-41vm3.trevis.hpdd.intel.com]# pwd
/scratch/dumps/trevis-41vm3.trevis.hpdd.intel.com
[root@trevis-41 trevis-41vm3.trevis.hpdd.intel.com]# ls -al
total 100
drwxr-xr-x 7 root root 4096 Aug 21 23:05 .
drwxr-xr-x 1433 root root 73728 Aug 10 17:23 ..
drwxr-xr-x 2 root root 4096 Aug 5 03:29 10.9.5.239-2017-08-05-03:29:12
drwxr-xr-x 2 root root 4096 Aug 7 18:48 10.9.5.239-2017-08-07-18:48:49
drwxr-xr-x 2 root root 4096 Aug 7 21:50 10.9.5.239-2017-08-07-21:50:33
drwxr-xr-x 2 root root 4096 Aug 14 06:19 10.9.5.239-2017-08-14-06:19:45
drwxr-xr-x 2 root root 4096 Aug 15 03:22 10.9.5.239-2017-08-15-03:21:47
[root@trevis-41 trevis-41vm3.trevis.hpdd.intel.com]#

Comment by James Nunez (Inactive) [ 23/Aug/17 ]

I looked in Maloo for all replay-single test 70c timeouts (hangs) this year. I found 14 occurrences of this test hanging, but none of them hung on umount.

If we see this issue again, we need to look for the vmcore-dmesg.txt file as early as possible.
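
Since the 2017-08-04 dump directory was already gone when it was requested, one option is to copy the vmcore-dmesg.txt files somewhere else soon after a hang. This is only a sketch: the function name save_vmcore_dmesg and the destination directory are hypothetical, and it assumes each dump sits in its own timestamped directory as in the listing above.

```shell
# Hypothetical helper: copy every vmcore-dmesg.txt under <dump_root> into
# <keep_dir>, prefixing each copy with the name of its dump directory.
# Usage: save_vmcore_dmesg <dump_root> <keep_dir>
save_vmcore_dmesg() {
    mkdir -p "$2" || return 1
    find "$1" -type f -name vmcore-dmesg.txt | while IFS= read -r f; do
        # <dump_root>/<dump_dir>/vmcore-dmesg.txt
        #   becomes <keep_dir>/<dump_dir>.vmcore-dmesg.txt
        dump=$(basename "$(dirname "$f")")
        cp -p "$f" "$2/$dump.vmcore-dmesg.txt"
    done
}
```

For example, save_vmcore_dmesg /scratch/dumps/trevis-41vm3.trevis.hpdd.intel.com /some/keep/dir, where /some/keep/dir stands in for any location that is not cleaned up automatically.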

Comment by John Hammond [ 24/Aug/17 ]

James, maybe try something like

find /scratch/dumps -name vmcore-dmesg.txt -exec grep --with-filename test_70c {} \;

to find other instances of this crash.
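
A variant of the same search, wrapped as a function, is sketched below. It prints only the names of the matching files and batches many files into each grep call. The function name find_dumps_mentioning is hypothetical; the find and grep options used (-type, -name, -exec ... +, -l, -F, -e) are standard.

```shell
# Hypothetical helper: list every vmcore-dmesg.txt under <dump_root>
# that contains the fixed string <pattern>.
# Usage: find_dumps_mentioning <dump_root> <pattern>
find_dumps_mentioning() {
    # grep -l prints only the names of matching files; -F disables regex
    # interpretation of the pattern. "{} +" passes many files per grep call.
    find "$1" -type f -name vmcore-dmesg.txt \
        -exec grep -lF -e "$2" {} + | sort
}
```

For example, find_dumps_mentioning /scratch/dumps test_70c would list the dumps whose kernel log mentions the test.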

Comment by Andreas Dilger [ 26/Jan/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/079de990-f2fe-47b8-a70f-cb455e084ec8

Comment by Qian Yingjin [ 24/Aug/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/102cd218-cda2-4092-b4f1-991fa8aeda2e
