[LU-9511] parallel-scale-stress-hw_parallel_grouplock test stuck on subtest 12, timeout 2hours, normally takes < 400sec Created: 16/May/17  Updated: 19/Jul/17  Resolved: 19/Jul/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug
Priority: Minor
Reporter: VIKRAM BABASO JADHAV (Inactive)
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-9429 parallel-scale test_parallel_grouploc... Open
Severity: 3

 Description   

stdout.log
17:34:15: All tests passed!
parallel_grouplock subtests -t 10 PASS
/usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/fs1/d0.parallel_grouplock -t 11
chmod 0777 /mnt/fs1
drwxrwxrwx 8 root root 270336 Aug 12 17:31 /mnt/fs1
su mpiuser sh -c "/usr/bin/mpirun -np 32 /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/fs1/d0.parallel_grouplock -t 11 "
/usr/lib64/lustre/tests/parallel_grouplock is running with 32 task(es) in DEBUG mode
17:34:16: Running test #/usr/lib64/lustre/tests/parallel_grouplock(iter 0)
17:34:16: Beginning subtest 11
17:34:56: Finished subtest 11 (39.317 sec)
17:34:56: All tests passed!
parallel_grouplock subtests -t 11 PASS
/usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/fs1/d0.parallel_grouplock -t 12
su mpiuser sh -c "/usr/bin/mpirun -np 32 /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/fs1/d0.parallel_grouplock -t 12 "
17:34:57: Running test #/usr/lib64/lustre/tests/parallel_grouplock(iter 0)
17:34:57: Beginning subtest 12

stderr.log



 Comments   
Comment by Gerrit Updater [ 16/May/17 ]

jadhav.vikram (jadhav.vikram@seagate.com) uploaded a new patch: https://review.whamcloud.com/27127
Subject: LU-9511 utils: fix parallel_grouplock test timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 80c6a9089cdccdd6e3096d5bfc7cdd717cf122ea

Comment by Andreas Dilger [ 17/Jun/17 ]

It looks like this is the same as LU-9429.

Comment by Andreas Dilger [ 17/Jun/17 ]

Both this patch and the change from LU-9425 are changing the userspace code to avoid the problem, but that means it is still possible for real applications using group lock to become deadlocked. That shouldn't be allowed.

Comment by Gerrit Updater [ 19/Jul/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27127/
Subject: LU-9511 utils: fix parallel_grouplock test timeout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4a33e1edd48c5db0e0c54c1226787c6575301bee

Comment by Peter Jones [ 19/Jul/17 ]

Landed for 2.11
