[LU-11774] replay-single test 0c fails with ‘mount fails’ Created: 13/Dec/18 Updated: 09/Apr/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.12.1, Lustre 2.14.0, Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-single test_0c fails with ‘mount fails’. Looking at a recent failure, https://testing.whamcloud.com/test_sets/5532dea2-fd87-11e8-93ea-52540065bddc , we can see in the client test_log that there is a problem unmounting the file system because it is busy, and then MDS1 (vm9) is failed over:

CMD: trevis-9vm9 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
umount: /mnt/lustre: target is busy
(In some cases useful info about processes that
use the device is found by lsof(8) or fuser(1).)
Failing mds1 on trevis-9vm9

Then we see that the file system fails to mount because it is already mounted:

CMD: trevis-9vm6 mkdir -p /mnt/lustre
CMD: trevis-9vm6 mount -t lustre -o user_xattr,flock trevis-9vm9@tcp:/lustre /mnt/lustre
mount.lustre: according to /etc/mtab trevis-9vm9@tcp:/lustre is already mounted on /mnt/lustre
replay-single test_0c: @@@@@@ FAIL: mount fails

It’s possible that the file system never unmounted. replay-single test_0d always fails after test_0c and in the same way. Looking over all branches for the past year, this test started to fail on July 31, 2018. Logs for past failures are at: aarch64, ppc64, SUSE 12.
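One way to see what keeps the client mount busy, following the umount error's pointer to lsof(8)/fuser(1), would be a check like the following (a hypothetical debugging sketch, not part of replay-single):

# Assumption: /mnt/lustre is the client mount point from the logs above.
MOUNT_POINT=/mnt/lustre
if mountpoint -q "$MOUNT_POINT"; then
	# Processes with open files or working directories on the mount point
	fuser -vm "$MOUNT_POINT" || true
	# Open files on the mounted file system
	lsof +f -- "$MOUNT_POINT" || true
fi
|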
| Comments |
| Comment by Peter Jones [ 13/Dec/18 ] |
|
Hongchao, could you please assess this issue? Thanks, Peter |
| Comment by Hongchao Zhang [ 18/Dec/18 ] |
|
As per the logs, there is a "cp" process that has been running in the background for a long time, which affects the mount/umount:

00020000:00200000:1.0:1533057091.189815:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 878, mirror count: 2
00020000:00200000:0.0:1533057092.190236:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
00020000:00200000:0.0:1533057092.194654:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 879, mirror count: 2
00020000:00200000:0.0:1533057093.194224:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
...
00020000:00200000:0.0:1533057124.268068:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 911, mirror count: 2
00020000:00200000:0.0:1533057125.267217:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
00020000:00200000:0.0:1533057125.273560:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 912, mirror count: 2
00020000:00200000:0.0:1533057126.273228:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
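As a hypothetical pre-test check (an assumption, not an existing helper in the test framework), something like the following could report processes that still hold files open on the client mount before the next test runs:

# Assumption: /mnt/lustre is the client mount point.
MOUNT_POINT=/mnt/lustre
leftover=$(lsof -t +f -- "$MOUNT_POINT" 2>/dev/null)
if [ -n "$leftover" ]; then
	echo "PIDs still using $MOUNT_POINT: $leftover"
	# Show how long each leftover process has been running
	for pid in $leftover; do
		ps -o pid,etime,cmd -p "$pid"
	done
fi
|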
| Comment by Hongchao Zhang [ 21/Dec/18 ] |
|
As per the retry count of this running "cp" process, it should be a process left over from the previous test "racer", likely from the loop below (a hypothetical cleanup sketch follows the loop):

while /bin/true ; do
file=$((RANDOM % MAX))
cp -p $PROG $DIR/$file > /dev/null 2>&1
$DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
sleep $((RANDOM % 3))
done 2>&1 | egrep -v "Segmentation fault|Bus error"
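
One illustrative way a cleanup step might stop such a leftover loop (an assumption, not necessarily what the patch referenced below implements; the script name racer/file_exec.sh is assumed here):

# Kill any racer loop still running from the previous test
pkill -f 'racer/file_exec' 2>/dev/null || true
# Give the killed processes a moment to close their files on the mount
sleep 2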
|
| Comment by Gerrit Updater [ 21/Dec/18 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33904 |
| Comment by Jian Yu [ 28/Aug/19 ] |
|
+1 on Lustre b2_12 branch: https://testing.whamcloud.com/test_sets/124eefae-c997-11e9-a25b-52540065bddc |
| Comment by Jian Yu [ 19/May/20 ] |
|
+1 on master branch: |