[LU-16747] concurrency issue in get_root_path_slow() Created: 17/Apr/23 Updated: 16/Jul/23 Resolved: 20/Jun/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Etienne Aujames | Assignee: | Etienne Aujames |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Description |
|
Since the integration of https://review.whamcloud.com/c/fs/lustre-release/+/36603 ("LU-8585 llapi: use open_by_handle_at in llapi_open_by_fid") at CEA, we observed the following errors when starting a robinhood policy process:

rh_purge.log:2023/04/11 11:43:33 [24565/6] lhsm | ERROR performing HSM request(RELEASE, root=/mnt/lustre, fid=[0x200000bd1:0x599:0x0]): Bad file descriptor
rh_purge.log:2023/04/11 11:43:33 [24565/6] purge | Error applying action on entry /mnt/lustre/file.5: Bad file descriptor

The error is triggered inside "llapi_hsm_request()". This is because several threads are started at the same time with an uninitialized root_cache entry. Multiple threads are running inside get_root_path_slow() at the same moment:
Here is a reproducer:

#define TEST13_THR_NBR 20

/* Thread body: resolve the FID through the second mount point */
void *test13_thr(void *arg)
{
	char *fidstr = arg;
	char path[PATH_MAX];
	long long recno = -1;
	int linkno = 0;
	long long rc;

	rc = llapi_fid2path(lustre_dir2, fidstr, path, sizeof(path),
			    &recno, &linkno);

	return (void *) rc;
}

/* Test llapi root cache on multi-threading context */
static void test13(void)
{
	static pthread_t thread[TEST13_THR_NBR];
	int fd, i, iter;
	long long rc;
	struct lu_fid fid;
	char fidstr[FID_LEN + 1];

	fd = creat(mainpath, 00660);
	ASSERTF(fd >= 0, "creat failed for '%s': %s",
		mainpath, strerror(errno));

	rc = llapi_fd2fid(fd, &fid);
	ASSERTF(rc == 0, "llapi_fd2fid failed for '%s': %s",
		mainpath, strerror(-rc));
	close(fd);

	snprintf(fidstr, sizeof(fidstr), DFID_NOBRACE, PFID(&fid));

	for (iter = 0; iter < 100; iter++) {
		/* reset cache on first mountpoint */
		fd = llapi_open_by_fid(lustre_dir, &fid, O_RDONLY);
		ASSERTF(fd >= 0, "llapi_open_by_fid for " DFID_NOBRACE ": %d",
			PFID(&fid), fd);
		close(fd);

		/* start threads with llapi_fid2path() */
		for (i = 0; i < TEST13_THR_NBR; i++)
			pthread_create(&thread[i], NULL, &test13_thr, fidstr);

		for (i = 0; i < TEST13_THR_NBR; i++) {
			pthread_join(thread[i], (void **) &rc);
			ASSERTF(rc == 0,
				"llapi_fid2path for " DFID_NOBRACE " (iter: %d, thr:%d): %s",
				PFID(&fid), iter, i, strerror(-rc));
		}
	}
} |
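For illustration only, here is a minimal standalone sketch of the kind of race described in this ticket (several threads hitting an uninitialized cache entry at once) and one common way to avoid it: doing both the validity check and the slow fill under a single lock. This is not the merged patch and does not use Lustre's real structures; `root_cache_sketch`, `fill_cache_slow()`, `get_root_fd()`, `run_sketch()` and the placeholder descriptor value 42 are all hypothetical names and values chosen for the example.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_THREADS 20
#define SKETCH_FAKE_FD 42	/* placeholder value, no real open() is done */

/* Hypothetical stand-in for a root cache entry (not Lustre's structure). */
struct root_cache_sketch {
	pthread_mutex_t lock;
	int valid;		/* non-zero once the entry has been filled */
	int fd;			/* simulated cached descriptor */
	char path[64];		/* mount point the entry belongs to */
};

static struct root_cache_sketch cache = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.fd = -1,
};

static char root_path[] = "/mnt/lustre";
static int slow_lookups;	/* protected by cache.lock */

/* Simulated slow path; the caller must hold cache.lock. */
static void fill_cache_slow(const char *path)
{
	slow_lookups++;
	snprintf(cache.path, sizeof(cache.path), "%s", path);
	cache.fd = SKETCH_FAKE_FD;
	cache.valid = 1;
}

/*
 * Check and fill under the same lock, so no thread can observe a
 * half-initialized entry and only one thread runs the slow path.
 */
static int get_root_fd(const char *path)
{
	int fd;

	pthread_mutex_lock(&cache.lock);
	if (!cache.valid || strcmp(cache.path, path) != 0)
		fill_cache_slow(path);
	fd = cache.fd;
	pthread_mutex_unlock(&cache.lock);

	return fd;
}

static void *worker(void *arg)
{
	return (void *)(intptr_t)get_root_fd(arg);
}

/* Start all threads at once; return how many got an unexpected fd. */
static int run_sketch(void)
{
	pthread_t thr[SKETCH_THREADS];
	int started[SKETCH_THREADS];
	int i, bad = 0;

	for (i = 0; i < SKETCH_THREADS; i++) {
		started[i] = pthread_create(&thr[i], NULL, worker,
					    root_path) == 0;
		if (!started[i])
			bad++;
	}

	for (i = 0; i < SKETCH_THREADS; i++) {
		void *res;

		if (!started[i])
			continue;
		pthread_join(thr[i], &res);
		if ((intptr_t)res != SKETCH_FAKE_FD)
			bad++;
	}

	return bad;
}
```

Calling `run_sketch()` from a `main()` and building with `-pthread` should report zero failures, with the slow path executed once. Holding the lock across the whole check-and-fill sequence trades some parallelism for simplicity; the actual change in llapi may be structured differently.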
| Comments |
| Comment by Gerrit Updater [ 18/Apr/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50682 |
| Comment by Gerrit Updater [ 19/Jun/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51367 |
| Comment by Gerrit Updater [ 20/Jun/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50682/ |
| Comment by Peter Jones [ 20/Jun/23 ] |
|
Landed for 2.16 |