[LU-14358] interop: sanity-pcc and sanity-flr tests fail with 'cannot open volatile file' on the MDS Created: 22/Jan/21  Updated: 05/Feb/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Qian Yingjin
Resolution: Unresolved Votes: 0
Labels: interop
Environment:

2.13.0 servers and master clients with Lustre version >= 2.13.55.16


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A variety of sanity-pcc tests fail with different error messages, but all of them have the following error in the MDS console log:

[67028.845779] LustreError: 3916:0:(mdt_open.c:1613:mdt_reint_open()) lustre-MDT0000: cannot open volatile file [0x2000766a8:0x4:0x0], orphan file will be left in PENDING directory until next reboot, rc = -2

and the action on that file does not complete. Note that rc = -2 is -ENOENT.
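For triage, the message is easy to count on the MDS console between subtests. A minimal sketch, assuming the standard Lustre test environment (do_facet comes from test-framework.sh, and mds1 is the usual facet name):

# Count volatile-open failures on the first MDS since boot.
do_facet mds1 "dmesg | grep -c 'cannot open volatile file'"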

This issue has only been seen in interop testing with 2.13.0 servers and master clients with Lustre version >= 2.13.55.16. It looks like this error message and the associated failures started on 08 Aug 2020 with sanity-pcc and sanity-flr failures in the test session https://testing.whamcloud.com/test_sessions/7175020e-210c-436a-a2a6-be91c6c12aad

A few examples of this failure are:
2021-01-16 Lustre server 2.13.0 and Lustre client 2.13.57.53 - https://testing.whamcloud.com/test_sets/c1c698e2-e40e-4035-a3bb-de8acfacc5b5
sanity-pcc test_1a fails with 'request on 0x2000766a8:0x2:0x0 is not SUCCEED on mds1'
sanity-pcc test_1e fails with 'failed to attach file /mnt/lustre/d1e.sanity-pcc/f1e.sanity-pcc'
sanity-pcc test_3a fails with 'failed to attach file \/mnt\/lustre\/d3a.sanity-pcc\/f3a.sanity-pcc'

2021-01-07 Lustre server 2.13.0 and Lustre client 2.13.57.44 - https://testing.whamcloud.com/test_sets/c445c494-4d76-4185-9554-f5eb131d5b03
2020-12-25 Lustre server 2.13.0 and Lustre client 2.13.57.36 - https://testing.whamcloud.com/test_sets/e26503ed-449e-4070-84e3-8af9d129fad9
2020-12-10 Lustre server 2.13.0 and Lustre client 2.13.57 - https://testing.whamcloud.com/test_sets/9e2b1a2a-1d6a-4f56-b375-cd177c058eb0
sanity-pfl test_6 fails with 'Migrate(v1 -> composite) /mnt/lustre/d6.sanity-pfl/f6.sanity-pfl failed'

2020-12-25 Lustre server 2.13.0 and Lustre client 2.13.57.12 - https://testing.whamcloud.com/test_sets/3f774e46-eb0f-4b4b-90e5-a70a48ae9c56
sanity-pcc test_1c fails with 'request on 0x200076e71:0x2:0x0 is not SUCCEED on mds1'
sanity-pcc test_7b fails with 'multiop mmap write failed'
sanity-pcc test_12 fails with 'request on 0x200079d51:0x14:0x0 is not SUCCEED on mds1'
sanity-pcc test_19 fails with 'Failed to attach /mnt/lustre/f19.sanity-pcc'

We also occasionally see this error message in ost-pools test_28, but the test passes:
2020-12-25 Lustre server 2.13.0 and Lustre client 2.13.57.36 - https://testing.whamcloud.com/test_sets/fe713439-9aa7-4a71-beb0-b1b715fca596
2020-12-10 Lustre server 2.13.0 and Lustre client 2.13.57 - https://testing.whamcloud.com/test_sets/f48d3beb-c6d3-4451-8dc1-dcbe7dfe0b14



 Comments   
Comment by Gerrit Updater [ 22/Jan/21 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41301
Subject: LU-14358 test: check for error 2.13 server/client
Project: fs/lustre-release
Branch: b2_13
Current Patch Set: 1
Commit: 784779d06ed7f2b23f0a56281dd9fdc25309b239
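The patch body is not quoted here; for context, interop-sensitive sanity subtests are usually gated on server version with helpers from test-framework.sh. A minimal sketch of such a guard (whether patch 41301 adds exactly this check is an assumption):

# Skip the subtest on servers that predate the volatile-open change;
# version_code, skip, and MDS1_VERSION come from test-framework.sh.
[ "$MDS1_VERSION" -lt $(version_code 2.13.55) ] &&
	skip "need MDS version >= 2.13.55 for volatile file open"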

Comment by James Nunez (Inactive) [ 28/Jan/21 ]

The question was asked whether we see this error using 2.13.0 for both servers and clients. In patch https://review.whamcloud.com/#/c/41301/, we tried to reproduce the error with 2.13.0 clients and servers but, so far, have not been able to.

Comment by John Hammond [ 28/Jan/21 ]

https://testing.whamcloud.com/test_sessions/7175020e-210c-436a-a2a6-be91c6c12aad

Earlier in the same session I see sanity test_185 pass, and sanity-hsm passes.

This needs to be reproduced with trace enabled on the MDT. Unfortunately, the MDT debug logs for sanity-pcc test_1c are useless due to 829055 messages of the form:

00000004:00020000:0.0:1596889762.316083:0:25563:0:(mdd_orphans.c:329:mdd_orphan_destroy()) lustre-MDD0000: could not delete orphan [0x1d6:0xf096f8eb:0x0]: rc = -2
00000004:00080000:0.0:1596889762.316093:0:25563:0:(mdd_orphans.c:376:mdd_orphan_key_test_and_delete()) Found orphan [0x1d6:0xf096f8eb:0x0], delete it
00000004:00020000:0.0:1596889762.316098:0:25563:0:(mdd_orphans.c:329:mdd_orphan_destroy()) lustre-MDD0000: could not delete orphan [0x1d6:0xf096f8eb:0x0]: rc = -2

I wonder if this is related.
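For reference, trace can be enabled on the MDT with standard lctl parameters before re-running the failing subtest. A sketch (the mds1 facet name and buffer size are assumptions):

# Add the trace mask and enlarge the debug buffer so the high-volume
# orphan-cleanup messages do not push the open path out of the log.
do_facet mds1 "lctl set_param debug=+trace debug_mb=1024"
# ... re-run the failing subtest, then dump the kernel debug log:
do_facet mds1 "lctl dk /tmp/mdt-trace.log"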

Comment by Peter Jones [ 29/Jan/21 ]

Lai

Could you please comment on this one?

Thanks

Peter

Comment by Lai Siyao [ 05/Feb/21 ]

I reproduced it on a local system, but I'm not familiar with the PCC code and test case; Qian will look into it.
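For anyone picking this up: a single failing subtest can usually be re-run standalone from the client's test directory. A sketch, assuming the usual Lustre test conventions (ONLY selects subtests; the install path varies):

# Re-run only sanity-pcc test_1a against the mounted 2.13.0 filesystem.
cd /usr/lib64/lustre/tests	# path is an assumption; depends on install
ONLY=1a ./sanity-pcc.sh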
