[LU-13358] sanityn test_43j: Timeout occurred after 348 mins, last suite running was sanityn Created: 11/Mar/20  Updated: 23/Sep/22  Resolved: 20/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Yang Sheng <ys@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/febc8b7c-b581-4673-ae0d-0acc45b22ca7

test_43j failed with the following error:

Timeout occurred after 348 mins, last suite running was sanityn
======================================================== 23:32:57 \(1583883177\)
[17311.109132] Lustre: DEBUG MARKER: == sanityn test 43j: racy mkdir return EEXIST ======================================================== 23:32:57 (1583883177)
[17311.329173] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17311.517750] LustreError: 29009:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping
[17311.519410] LustreError: 29007:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking
[17311.520591] LustreError: 29009:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0
[17311.917773] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17312.103328] LustreError: 29009:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping
[17312.103439] LustreError: 29007:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking
[17312.105603] LustreError: 29009:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0
[17312.286194] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17312.656299] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17313.023848] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17313.211668] LustreError: 29009:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking
[17313.211671] LustreError: 29958:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping
[17313.211701] LustreError: 29958:0:(libcfs_fail.h:169:cfs_race()) Skipped 2 previous similar messages
[17313.215085] LustreError: 29009:0:(libcfs_fail.h:174:cfs_race()) Skipped 2 previous similar messages
[17313.216256] LustreError: 29958:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0
[17313.217446] LustreError: 29958:0:(libcfs_fail.h:172:cfs_race()) Skipped 2 previous similar messages
[17313.395036] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17313.769611] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17314.135975] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17314.505179] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17314.879328] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17315.250075] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17315.440261] LustreError: 29958:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping
[17315.441404] LustreError: 29958:0:(libcfs_fail.h:169:cfs_race()) Skipped 5 previous similar messages
[17315.442663] LustreError: 14962:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking
[17315.443955] LustreError: 14962:0:(libcfs_fail.h:174:cfs_race()) Skipped 5 previous similar messages
[17315.446280] LustreError: 29958:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0
[17315.447552] LustreError: 29958:0:(libcfs_fail.h:172:cfs_race()) Skipped 5 previous similar messages
[17315.627940] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17315.999870] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17316.367260] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17316.737816] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17317.116644] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17317.488198] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17317.862823] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17318.247053] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17318.669421] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17319.048331] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17319.414480] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17319.602497] LustreError: 20329:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping
[17319.602911] LustreError: 14963:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking
[17319.602913] LustreError: 14963:0:(libcfs_fail.h:174:cfs_race()) Skipped 10 previous similar messages
[17319.606183] LustreError: 20329:0:(libcfs_fail.h:169:cfs_race()) Skipped 10 previous similar messages
[17319.607404] LustreError: 20329:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0
[17319.608561] LustreError: 20329:0:(libcfs_fail.h:172:cfs_race()) Skipped 10 previous similar messages
[17319.789546] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17320.160652] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17320.533914] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17320.902690] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17321.271216] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17321.639185] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17322.010093] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17322.381671] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17322.752167] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17323.124004] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17323.488809] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17323.858050] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17324.229766] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17324.596745] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17324.966558] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17325.331734] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17325.698313] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17326.059736] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17326.422889] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17326.791051] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17327.158072] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17327.523903] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17327.714309] LustreError: 29009:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping
[17327.715587] LustreError: 29009:0:(libcfs_fail.h:169:cfs_race()) Skipped 21 previous similar messages
[17327.715632] LustreError: 29007:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking
[17327.715634] LustreError: 29007:0:(libcfs_fail.h:174:cfs_race()) Skipped 21 previous similar messages
[17327.719196] LustreError: 29009:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0
[17327.720335] LustreError: 29009:0:(libcfs_fail.h:172:cfs_race()) Skipped 21 previous similar messages
[17327.899149] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17328.264351] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17328.632777] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17329.003813] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17329.371423] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17329.740128] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17330.109495] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17330.480179] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17330.851411] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17331.217019] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17331.584481] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17331.953570] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17332.319811] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17332.689562] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17333.071429] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17333.442313] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17333.813061] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17334.182470] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17334.548590] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17334.915354] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17335.281713] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17335.648787] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17336.016108] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17336.379247] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17336.748132] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17337.119766] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17337.486196] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17337.869241] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17338.237004] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17338.630641] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17338.997588] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17339.364136] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17339.730623] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17340.096813] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17340.462864] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17340.831538] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167
[17403.223365] Lustre: mdt00_002: service thread pid 20331 was inactive for 62.204 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[17403.225741] Pid: 20331, comm: mdt00_002 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Wed Feb 12 09:50:45 UTC 2020
[17403.226985] Call Trace:
[17403.227524]  [<ffffffffc150b281>] mdt_create+0xf31/0x1350 [mdt]
[17403.228564]  [<ffffffffc150b80b>] mdt_reint_create+0x16b/0x360 [mdt]
[17403.229409]  [<ffffffffc1510d33>] mdt_reint_rec+0x83/0x210 [mdt]
[17403.230297]  [<ffffffffc14ea970>] mdt_reint_internal+0x7b0/0xba0 [mdt]
[17403.231291]  [<ffffffffc14f5fe7>] mdt_reint+0x67/0x140 [mdt]
[17403.232119]  [<ffffffffc1168f9a>] tgt_request_handle+0x95a/0x1630 [ptlrpc]
[17403.233926]  [<ffffffffc110e0e6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[17403.234967]  [<ffffffffc1112574>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[17403.235890]  [<ffffffffb06c61f1>] kthread+0xd1/0xe0
[17403.236704]  [<ffffffffb0d8dd37>] ret_from_fork_nospec_end+0x0/0x39
[17403.237634]  [<ffffffffffffffff>] 0xffffffffffffffff
[17936.088290] Lustre: 29006:0:(service.c:1440:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff8b1d15c90480 x1660819790875584/t0(0) o36->e483fb96-13aa-4@10.9.3.3@tcp:262/0 lens 512/448 e 23 to 0 dl 1583883807 ref 2 fl Interpret:/0/0 rc 0/0 job:'mkdir.0'
[17942.007808] Lustre: lustre-MDT0000: Client e483fb96-13aa-4 (at 10.9.3.3@tcp) reconnecting
[17942.009423] Lustre: lustre-MDT0000: Connection restored to 66d0d404-6645-4 (at 10.9.3.3@tcp)
[17942.010579] Lustre: Skipped 2 previous similar messages
[18542.999441] Lustre: lustre-MDT0000: Client e483fb96-13aa-4 (at 10.9.3.3@tcp) reconnecting
[18543.000625] Lustre: lustre-MDT0000: Connection restored to 66d0d404-6645-4 (at 10.9.3.3@tcp)
[19143.990549] Lustre: lustre-MDT0000: Client e483fb96-13aa-4 (at 10.9.3.3@tcp) reconnecting
[19143.991784] Lustre: lustre-MDT0000: Connection restored to 66d0d404-6645-4 (at 10.9.3.3@tcp)
[19744.981248] Lustre: lustre-MDT0000: Client e483fb96-13aa-4 (at 10.9.3.3@tcp) reconnecting
[19744.982495] Lustre: lustre-MDT0000: Connection restored to 66d0d404-6645-4 (at 10.9.3.3@tcp)
[20345.972751] Lustre: lustre-MDT0000: Client e483fb96-13aa-4 (at 10.9.3.3@tcp) reconnecting
[20345.974154] Lustre: lustre-MDT0000: Connection restored to 66d0d404-6645-4 (at 10.9.3.3@tcp)

Looks like OBD_RACE wakeup was lost cause mdt thread sleep forever.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanityn test_43j - Timeout occurred after 348 mins, last suite running was sanityn



 Comments   
Comment by Emoly Liu [ 20/Apr/20 ]

+1 on master: https://testing.whamcloud.com/test_sets/c25c2f47-cefa-4921-837a-398084ffdc44

Comment by Andreas Dilger [ 30/Apr/20 ]

This failed on master with:
https://testing.whamcloud.com/test_sets/84f50a21-ef1e-4e20-bdfe-95787bc3a18f

mkdir: cannot create directory '/mnt/lustre2/d43j.sanityn': File exists
CMD: trevis-63vm9 lctl set_param fail_loc=0x80000167
fail_loc=0x80000167
mkdir: cannot create directory '/mnt/lustre2/d43j.sanityn': File exists
CMD: trevis-63vm9 lctl set_param fail_loc=0x80000167
fail_loc=0x80000167
Comment by James A Simmons [ 29/Mar/21 ]

Is this the same problem as LU-13329?

Comment by Alex Zhuravlev [ 30/Mar/21 ]

Is this the same problem as LU-13329?

doesn't look so.
looking at the code:

                        int rc;
                        cfs_race_state = 0;
                        CERROR("cfs_race id %x sleeping\n", id);
                        rc = wait_event_interruptible(cfs_race_waitq,
                                                      cfs_race_state != 0);
                        CERROR("cfs_fail_race id %x awake: rc=%d\n", id, rc);
                } else {
                        CERROR("cfs_fail_race id %x waking\n", id);
                        cfs_race_state = 1;
                        wake_up(&cfs_race_waitq);
                }

I don't understand how is it guaranteed that the first branch (with cfs_race_state = 0) is executed before another one (with wake_up())

Comment by Gerrit Updater [ 30/Mar/21 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43161
Subject: LU-13358 libcfs: add timeout to cfs_race()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 30d3c7afb04767d3e30467a1061762b0c3956dc2

Comment by Gerrit Updater [ 20/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43161/
Subject: LU-13358 libcfs: add timeout to cfs_race() to fix race
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2d2d381f35ee004319a20f5d2d8e70d13480d6c7

Comment by Peter Jones [ 20/Nov/21 ]

Landed for 2.15

Generated at Sat Feb 10 03:00:36 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.