[LU-17154] parallel-scale-nfsv4: hangs on umount after racer_on_nfs Created: 28/Sep/23 Updated: 25/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Alex Deiter |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/568e0c21-9347-476a-beac-081e9b2ee112

Test session details:
<<Please provide additional information about the failure here>>

parallel-scale-nfsv4 hangs on:

Stopping client trevis-27vm4 /mnt/lustre (opts:-f)
CMD: trevis-27vm4 lsof -t /mnt/lustre
pdsh@trevis-27vm1: trevis-27vm4: ssh exited with exit code 1
CMD: trevis-27vm4 umount -f /mnt/lustre 2>&1

Console log on trevis-27vm4:

[70712.060132] Lustre: DEBUG MARKER: umount -f /mnt/lustre 2>&1
[70712.213680] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[70712.215066] LustreError: 2067684:0:(file.c:245:ll_close_inode_openhandle()) lustre-clilmv-ffffa03d96d7f000: inode [0x200000bd3:0x2c31:0x0] mdc close failed: rc = -108
[70712.243116] Lustre: 1411383:0:(llite_lib.c:3965:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.38.143@tcp:/lustre/fid: [0x28000040a:0x3699:0x0]/ may get corrupted (rc -108)
[70712.243167] Lustre: 1411382:0:(llite_lib.c:3965:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.38.143@tcp:/lustre/fid: [0x2c000040a:0x3318:0x0]/ may get corrupted (rc -108)
<~snip~>
[70742.217783] Lustre: lustre-MDT0000: haven't heard from client 0e545e12-9ad6-4857-a78b-e65f011477b4 (at 0@lo) in 31 seconds. I think it's dead, and I am evicting it. exp 00000000320f809c, cur 1695838270 expire 1695838240 last 1695838239
[70745.262062] Lustre: lustre-MDT0002: haven't heard from client 0e545e12-9ad6-4857-a78b-e65f011477b4 (at 0@lo) in 34 seconds. I think it's dead, and I am evicting it. exp 00000000887f97a0, cur 1695838273 expire 1695838243 last 1695838239 |
| Comments |
| Comment by Andreas Dilger [ 03/Oct/23 ] |
|
It looks like parallel-scale-nfsv4 fails to unmount cleanly when racer_on_nfs is run, and unmounts properly when racer_on_nfs is skipped. I haven't checked why racer is sometimes skipped, but I wonder whether it should always be skipped; otherwise it makes all of the NFS testing unreliable. There do appear to be some cases where racer_on_nfs is run AND the test still unmounts properly, but unfortunately it is difficult to search for this in Maloo easily. |
| Comment by Andreas Dilger [ 03/Oct/23 ] |
|
Deiter, could you please submit a patch to master and b_es6_0 to add racer_on_nfs to the always_except list. It looks like the parallel-scale-nfsv4.sh script is incorrectly checking:
export ALWAYS_EXCEPT="$PARALLEL_SCALE_NFSV3_EXCEPT "
so this should also be updated to check PARALLEL_SCALE_NFSV4_EXCEPT. It may be that this is why racer_on_nfs is sometimes being skipped when running parallel-scale-nfsv4 - when it is run after parallel-scale-nfsv3?

It would also be useful to see if the passing cases (without the unmount timeout) are only for interop (I saw one case with ddn31 or similar, not sure of others). Checking this, and going back through the test history to see if there was a time this was passing regularly, would help isolate whether a landed patch caused this problem.

yujian, can you please re-add NFS testing to your patches, but skip racer:

Test-Parameters: trivial
Test-Parameters: testlist=env=PARALLEL_SCALE_NFSV3_EXCEPT=racer_on_nfs

This should hopefully now pass. |
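For context, a minimal sketch of the mismatch being described (the first export line is the NFSv3 pattern quoted above; the PARALLEL_SCALE_NFSV4_EXCEPT variable name is an assumption taken from this discussion, not a confirmed excerpt of the script):

# pattern from parallel-scale-nfsv3.sh, as quoted above:
export ALWAYS_EXCEPT="$PARALLEL_SCALE_NFSV3_EXCEPT "
# what parallel-scale-nfsv4.sh would be expected to check instead
# (assumed variable name, per the suggestion above):
export ALWAYS_EXCEPT="$PARALLEL_SCALE_NFSV4_EXCEPT "

With a line like the second one in place, passing env=PARALLEL_SCALE_NFSV4_EXCEPT=racer_on_nfs through Test-Parameters would skip racer_on_nfs for the NFSv4 run only.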
| Comment by Jian Yu [ 03/Oct/23 ] |
|
Sure, adilger. I found that ALWAYS_EXCEPT was not defined in parallel-scale-nfsv4.sh, so while updating patch https://review.whamcloud.com/52533, I added it and also added the racer_on_nfs test to the always_except list. |
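For illustration, the change described here might look roughly like the following line in parallel-scale-nfsv4.sh (a sketch only; the exact content of https://review.whamcloud.com/52533 may differ):

# define the exception list (previously missing from this script) and
# skip racer_on_nfs until the umount hang tracked in this ticket is fixed
export ALWAYS_EXCEPT="$PARALLEL_SCALE_NFSV4_EXCEPT racer_on_nfs"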
| Comment by Andreas Dilger [ 04/Oct/23 ] |
|
You are right. I guess I was mistakenly looking at the nfsv3.sh file, and there is nothing in the nfsv4.sh file that allows setting ALWAYS_EXCEPT. |