[LU-13358] sanityn test_43j: Timeout occurred after 348 mins, last suite running was sanityn Created: 11/Mar/20 Updated: 23/Sep/22 Resolved: 20/Nov/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Yang Sheng <ys@whamcloud.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/febc8b7c-b581-4673-ae0d-0acc45b22ca7 test_43j failed with the following error: Timeout occurred after 348 mins, last suite running was sanityn ======================================================== 23:32:57 \(1583883177\) [17311.109132] Lustre: DEBUG MARKER: == sanityn test 43j: racy mkdir return EEXIST ======================================================== 23:32:57 (1583883177) [17311.329173] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17311.517750] LustreError: 29009:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping [17311.519410] LustreError: 29007:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking [17311.520591] LustreError: 29009:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0 [17311.917773] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17312.103328] LustreError: 29009:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping [17312.103439] LustreError: 29007:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking [17312.105603] LustreError: 29009:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0 [17312.286194] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17312.656299] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17313.023848] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17313.211668] LustreError: 29009:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking [17313.211671] LustreError: 29958:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping [17313.211701] LustreError: 29958:0:(libcfs_fail.h:169:cfs_race()) Skipped 2 previous similar messages [17313.215085] LustreError: 29009:0:(libcfs_fail.h:174:cfs_race()) Skipped 2 previous similar messages [17313.216256] LustreError: 29958:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0 [17313.217446] LustreError: 29958:0:(libcfs_fail.h:172:cfs_race()) Skipped 2 previous similar messages [17313.395036] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17313.769611] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17314.135975] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17314.505179] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17314.879328] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17315.250075] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17315.440261] LustreError: 29958:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping [17315.441404] LustreError: 29958:0:(libcfs_fail.h:169:cfs_race()) Skipped 5 previous similar messages [17315.442663] LustreError: 14962:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking [17315.443955] LustreError: 14962:0:(libcfs_fail.h:174:cfs_race()) Skipped 5 previous similar messages [17315.446280] LustreError: 29958:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0 [17315.447552] LustreError: 29958:0:(libcfs_fail.h:172:cfs_race()) Skipped 5 previous similar messages [17315.627940] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17315.999870] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17316.367260] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17316.737816] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17317.116644] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17317.488198] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17317.862823] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17318.247053] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17318.669421] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17319.048331] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17319.414480] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17319.602497] LustreError: 20329:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping [17319.602911] LustreError: 14963:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking [17319.602913] LustreError: 14963:0:(libcfs_fail.h:174:cfs_race()) Skipped 10 previous similar messages [17319.606183] LustreError: 20329:0:(libcfs_fail.h:169:cfs_race()) Skipped 10 previous similar messages [17319.607404] LustreError: 20329:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0 [17319.608561] LustreError: 20329:0:(libcfs_fail.h:172:cfs_race()) Skipped 10 previous similar messages [17319.789546] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17320.160652] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17320.533914] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17320.902690] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17321.271216] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17321.639185] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17322.010093] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17322.381671] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17322.752167] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17323.124004] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17323.488809] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17323.858050] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17324.229766] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17324.596745] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17324.966558] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17325.331734] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17325.698313] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17326.059736] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17326.422889] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17326.791051] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17327.158072] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17327.523903] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17327.714309] LustreError: 29009:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 167 sleeping [17327.715587] LustreError: 29009:0:(libcfs_fail.h:169:cfs_race()) Skipped 21 previous similar messages [17327.715632] LustreError: 29007:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 167 waking [17327.715634] LustreError: 29007:0:(libcfs_fail.h:174:cfs_race()) Skipped 21 previous similar messages [17327.719196] LustreError: 29009:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 167 awake: rc=0 [17327.720335] LustreError: 29009:0:(libcfs_fail.h:172:cfs_race()) Skipped 21 previous similar messages [17327.899149] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17328.264351] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17328.632777] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17329.003813] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17329.371423] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17329.740128] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17330.109495] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17330.480179] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17330.851411] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17331.217019] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17331.584481] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17331.953570] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17332.319811] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17332.689562] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17333.071429] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17333.442313] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17333.813061] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17334.182470] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17334.548590] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17334.915354] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17335.281713] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17335.648787] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17336.016108] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17336.379247] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17336.748132] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17337.119766] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17337.486196] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17337.869241] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17338.237004] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17338.630641] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17338.997588] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17339.364136] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17339.730623] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17340.096813] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17340.462864] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17340.831538] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000167 [17403.223365] Lustre: mdt00_002: service thread pid 20331 was inactive for 62.204 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [17403.225741] Pid: 20331, comm: mdt00_002 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Wed Feb 12 09:50:45 UTC 2020 [17403.226985] Call Trace: [17403.227524] [<ffffffffc150b281>] mdt_create+0xf31/0x1350 [mdt] [17403.228564] [<ffffffffc150b80b>] mdt_reint_create+0x16b/0x360 [mdt] [17403.229409] [<ffffffffc1510d33>] mdt_reint_rec+0x83/0x210 [mdt] [17403.230297] [<ffffffffc14ea970>] mdt_reint_internal+0x7b0/0xba0 [mdt] [17403.231291] [<ffffffffc14f5fe7>] mdt_reint+0x67/0x140 [mdt] [17403.232119] [<ffffffffc1168f9a>] tgt_request_handle+0x95a/0x1630 [ptlrpc] [17403.233926] [<ffffffffc110e0e6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc] [17403.234967] [<ffffffffc1112574>] ptlrpc_main+0xbb4/0x1550 [ptlrpc] [17403.235890] [<ffffffffb06c61f1>] kthread+0xd1/0xe0 [17403.236704] [<ffffffffb0d8dd37>] ret_from_fork_nospec_end+0x0/0x39 [17403.237634] [<ffffffffffffffff>] 0xffffffffffffffff [17936.088290] Lustre: 29006:0:(service.c:1440:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply req@ffff8b1d15c90480 x1660819790875584/t0(0) o36->e483fb96-13aa-4@10.9.3.3@tcp:262/0 lens 512/448 e 23 to 0 dl 1583883807 ref 2 fl Interpret:/0/0 rc 0/0 job:'mkdir.0' [17942.007808] Lustre: lustre-MDT0000: Client e483fb96-13aa-4 (at 10.9.3.3@tcp) reconnecting [17942.009423] Lustre: lustre-MDT0000: Connection restored to 66d0d404-6645-4 (at 10.9.3.3@tcp) [17942.010579] Lustre: Skipped 2 previous similar messages [18542.999441] Lustre: lustre-MDT0000: Client e483fb96-13aa-4 (at 10.9.3.3@tcp) reconnecting [18543.000625] Lustre: lustre-MDT0000: Connection restored to 66d0d404-6645-4 (at 10.9.3.3@tcp) [19143.990549] Lustre: lustre-MDT0000: Client e483fb96-13aa-4 (at 10.9.3.3@tcp) reconnecting [19143.991784] Lustre: lustre-MDT0000: Connection restored to 66d0d404-6645-4 (at 10.9.3.3@tcp) [19744.981248] Lustre: lustre-MDT0000: Client e483fb96-13aa-4 (at 10.9.3.3@tcp) reconnecting [19744.982495] Lustre: lustre-MDT0000: Connection restored to 66d0d404-6645-4 (at 10.9.3.3@tcp) [20345.972751] Lustre: lustre-MDT0000: Client e483fb96-13aa-4 (at 10.9.3.3@tcp) reconnecting [20345.974154] Lustre: lustre-MDT0000: Connection restored to 66d0d404-6645-4 (at 10.9.3.3@tcp) Looks like OBD_RACE wakeup was lost cause mdt thread sleep forever. VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Emoly Liu [ 20/Apr/20 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/c25c2f47-cefa-4921-837a-398084ffdc44 |
| Comment by Andreas Dilger [ 30/Apr/20 ] |
|
This failed on master with: mkdir: cannot create directory '/mnt/lustre2/d43j.sanityn': File exists CMD: trevis-63vm9 lctl set_param fail_loc=0x80000167 fail_loc=0x80000167 mkdir: cannot create directory '/mnt/lustre2/d43j.sanityn': File exists CMD: trevis-63vm9 lctl set_param fail_loc=0x80000167 fail_loc=0x80000167 |
| Comment by James A Simmons [ 29/Mar/21 ] |
|
Is this the same problem as LU-13329? |
| Comment by Alex Zhuravlev [ 30/Mar/21 ] |
doesn't look so.
int rc;
cfs_race_state = 0;
CERROR("cfs_race id %x sleeping\n", id);
rc = wait_event_interruptible(cfs_race_waitq,
cfs_race_state != 0);
CERROR("cfs_fail_race id %x awake: rc=%d\n", id, rc);
} else {
CERROR("cfs_fail_race id %x waking\n", id);
cfs_race_state = 1;
wake_up(&cfs_race_waitq);
}
I don't understand how is it guaranteed that the first branch (with cfs_race_state = 0) is executed before another one (with wake_up()) |
| Comment by Gerrit Updater [ 30/Mar/21 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43161 |
| Comment by Gerrit Updater [ 20/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43161/ |
| Comment by Peter Jones [ 20/Nov/21 ] |
|
Landed for 2.15 |