[LU-11774] replay-single test 0c fails with ‘mount fails’ Created: 13/Dec/18  Updated: 09/Apr/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.14.0, Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

replay-single test_0c fails with ‘mount fails’.
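
For reference, test 0c can be run on its own with the test framework's ONLY filter. A minimal sketch, assuming a standard lustre/tests checkout and an already-configured test filesystem (node and device setup are environment-specific):

# Run only test 0c of replay-single; the framework performs the
# replay barrier and MDS failover itself.
cd lustre/tests
ONLY=0c bash replay-single.sh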

Looking at a recent failure, https://testing.whamcloud.com/test_sets/5532dea2-fd87-11e8-93ea-52540065bddc , we can see in the client test_log that unmounting the file system fails because it is busy, and then MDS1 (vm9) is failed over:

CMD: trevis-9vm9 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
umount: /mnt/lustre: target is busy
        (In some cases useful info about processes that
         use the device is found by lsof(8) or fuser(1).)
Failing mds1 on trevis-9vm9
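
As the umount error itself suggests, fuser(1) or lsof(8) can identify the process keeping the client mount busy. A quick triage sketch (mount point taken from the log above):

# -v lists PID, user and command; -m restricts to the given mount point.
fuser -vm /mnt/lustre
# Alternatively, list open files on that filesystem.
lsof /mnt/lustre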

Then we see that the file system fails to mount because it is already mounted:

CMD: trevis-9vm6 mkdir -p /mnt/lustre
CMD: trevis-9vm6 mount -t lustre -o user_xattr,flock trevis-9vm9@tcp:/lustre /mnt/lustre
mount.lustre: according to /etc/mtab trevis-9vm9@tcp:/lustre is already mounted on /mnt/lustre
 replay-single test_0c: @@@@@@ FAIL: mount fails 

It’s possible that the file system was never unmounted.
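
That could be confirmed on the client by comparing /etc/mtab, which the mount.lustre error quotes, with the kernel's view. A hedged sketch (the forced unmount is for test environments only):

# /etc/mtab claimed the fs was still mounted; check whether /proc/mounts agrees.
grep lustre /proc/mounts
# If the mount is genuinely stuck, force the unmount (-f), falling back
# to a lazy detach (-l).
umount -f /mnt/lustre || umount -l /mnt/lustre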

replay-single test_0d always fails after 0c does, and in the same way.

Looking over all branches for the past year, this test started to fail on July 31, 2018.

Logs for past failures are at:
https://testing.whamcloud.com/test_sets/d02535e4-958e-11e8-8ee3-52540065bddc
https://testing.whamcloud.com/test_sets/ab6bf330-9838-11e8-b0aa-52540065bddc
https://testing.whamcloud.com/test_sets/26d280e4-a35f-11e8-8853-52540065bddc
https://testing.whamcloud.com/test_sets/9438cbfa-dee9-11e8-89f8-52540065bddc

aarch64
https://testing.whamcloud.com/test_sets/6d97765e-ed94-11e8-815b-52540065bddc

ppc64
https://testing.whamcloud.com/test_sets/925cab5a-a783-11e8-80f7-52540065bddc
https://testing.whamcloud.com/test_sets/33d0ef58-fb55-11e8-8a18-52540065bddc

SUSE 12
https://testing.whamcloud.com/test_sets/3c98eace-9a8d-11e8-8ee3-52540065bddc



Comments
Comment by Peter Jones [ 13/Dec/18 ]

Hongchao

Could you please assess this issue?

Thanks

Peter

Comment by Hongchao Zhang [ 18/Dec/18 ]

Per the logs, a "cp" process has been running in the background for a long time, which interferes with the mount/umount:

00020000:00200000:1.0:1533057091.189815:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 878, mirror count: 2
00020000:00200000:0.0:1533057092.190236:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
00020000:00200000:0.0:1533057092.194654:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 879, mirror count: 2
00020000:00200000:0.0:1533057093.194224:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
...
00020000:00200000:0.0:1533057124.268068:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 911, mirror count: 2
00020000:00200000:0.0:1533057125.267217:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
00020000:00200000:0.0:1533057125.273560:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 912, mirror count: 2
00020000:00200000:0.0:1533057126.273228:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
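
The sixth colon-separated field of these debug lines (12598 here) is the PID of the retrying task, so on a live node the process can be inspected directly. A minimal sketch:

# Confirm the PID from the debug log is the background "cp".
ps -fp 12598
# Kernel stack of the task, to see where it is blocked (needs root).
cat /proc/12598/stack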
Comment by Hongchao Zhang [ 21/Dec/18 ]

Judging by the retry count of this running "cp", it is likely a process left over from the previous test, racer, possibly started by the test script "file_exec.sh":

# Excerpt from racer's file_exec.sh (MAX, PROG and DIR are defined earlier
# in the script). The loop endlessly copies a test program into the Lustre
# mount and executes the copy; nothing waits for the loop to finish.
while /bin/true ; do
        file=$((RANDOM % MAX))
        cp -p $PROG $DIR/$file > /dev/null 2>&1
        $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
        sleep $((RANDOM % 3))
done 2>&1 | egrep -v "Segmentation fault|Bus error"
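
Nothing here reaps a straggler if the script is killed mid-iteration, so a cp stuck retrying on a mirrored file can outlive racer and keep /mnt/lustre busy. A hypothetical illustration of the idea behind the patch below (not the actual change; the cleanup function and trap are invented for this sketch):

# Hypothetical cleanup: on exit, kill any child this shell still has
# running and reap it, so no stray cp or test binary pins the mount.
cleanup() {
        local pids=$(jobs -p)
        [ -n "$pids" ] && kill $pids 2> /dev/null
        wait
}
trap cleanup EXIT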
Comment by Gerrit Updater [ 21/Dec/18 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33904
Subject: LU-11774 test: wait to finish in racer's file_exec
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3197ce11538db8953be09ef940ce0cf5f0cda1a3

Comment by Jian Yu [ 28/Aug/19 ]

+1 on Lustre b2_12 branch: https://testing.whamcloud.com/test_sets/124eefae-c997-11e9-a25b-52540065bddc

Comment by Jian Yu [ 19/May/20 ]

+1 on master branch:
https://testing.whamcloud.com/test_sets/fc1418b8-b684-4a7e-94e5-79c70f711f2b
