Details
Bug
Resolution: Fixed
Minor
Description
Since the integration of https://review.whamcloud.com/c/fs/lustre-release/+/36603 ("LU-8585 llapi: use open_by_handle_at in llapi_open_by_fid") at CEA, we have observed the following errors when starting a robinhood policy process:
rh_purge.log:2023/04/11 11:43:33 [24565/6] lhsm | ERROR performing HSM request(RELEASE, root=/mnt/lustre, fid=[0x200000bd1:0x599:0x0]): Bad file descriptor
rh_purge.log:2023/04/11 11:43:33 [24565/6] purge | Error applying action on entry /mnt/lustre/file.5: Bad file descriptor
The error is triggered inside "llapi_hsm_request()".
This happens because several threads are started at the same time with an uninitialized root_cache entry, so multiple threads run inside get_root_path_slow() at the same moment:
- Thread 1 takes the write lock on root_cache and updates root_cache.fd.
- Thread 2 takes the write lock on root_cache, closes root_cache.fd and then updates it.
- Thread 1 returns the now-closed file descriptor to llapi_hsm_request().
Here is a reproducer:
#define TEST13_THR_NBR 20

void *test13_thr(void *arg)
{
	char *fidstr = arg;
	char path[PATH_MAX];
	long long recno = -1;
	int linkno = 0;
	long long rc;

	rc = llapi_fid2path(lustre_dir2, fidstr, path, sizeof(path), &recno,
			    &linkno);

	return (void *) rc;
}

/* Test llapi root cache on multi-threading context */
static void test13(void)
{
	static pthread_t thread[TEST13_THR_NBR];
	int fd, i, iter;
	long long rc;
	struct lu_fid fid;
	char fidstr[FID_LEN + 1];

	fd = creat(mainpath, 00660);
	ASSERTF(fd >= 0, "creat failed for '%s': %s",
		mainpath, strerror(errno));

	rc = llapi_fd2fid(fd, &fid);
	ASSERTF(rc == 0, "llapi_fd2fid failed for '%s': %s",
		mainpath, strerror(-rc));

	close(fd);

	snprintf(fidstr, sizeof(fidstr), DFID_NOBRACE, PFID(&fid));

	for (iter = 0; iter < 100; iter++) {
		/* reset cache on first mountpoint */
		fd = llapi_open_by_fid(lustre_dir, &fid, O_RDONLY);
		ASSERTF(fd >= 0, "llapi_open_by_fid for " DFID_NOBRACE ": %d",
			PFID(&fid), fd);
		close(fd);

		/* start threads with llapi_fid2path() */
		for (i = 0; i < TEST13_THR_NBR; i++)
			pthread_create(&thread[i], NULL, &test13_thr, fidstr);

		for (i = 0; i < TEST13_THR_NBR; i++) {
			pthread_join(thread[i], (void **) &rc);
			ASSERTF(rc == 0,
				"llapi_fid2path for " DFID_NOBRACE " (iter: %d, thr:%d): %s",
				PFID(&fid), iter, i, strerror(-rc));
		}
	}
}