[LU-16747] concurrency issue in get_root_path_slow() Created: 17/Apr/23  Updated: 16/Jul/23  Resolved: 20/Jun/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Etienne Aujames Assignee: Etienne Aujames
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Since the integration of https://review.whamcloud.com/c/fs/lustre-release/+/36603 ("LU-8585 llapi: use open_by_handle_at in llapi_open_by_fid") at CEA, we have observed the following errors when starting a robinhood policy process:

rh_purge.log:2023/04/11 11:43:33 [24565/6] lhsm | ERROR performing HSM request(RELEASE, root=/mnt/lustre, fid=[0x200000bd1:0x599:0x0]): Bad file descriptor
rh_purge.log:2023/04/11 11:43:33 [24565/6] purge | Error applying action on entry /mnt/lustre/file.5: Bad file descriptor

The error is triggered inside "llapi_hsm_request()".
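
The "Bad file descriptor" text in the log is the standard message for EBADF, the errno that the kernel returns when a system call is given a descriptor that is no longer open. As a minimal sketch (the helper name demo_errno_on_closed_fd() is hypothetical and is not part of llapi), the following shows how using a descriptor after close() produces that error:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: open 'path', close the
 * descriptor, then try to use it again. Returns the errno reported by
 * the failing call, or 0 if the call unexpectedly succeeded.
 */
static int demo_errno_on_closed_fd(const char *path)
{
	struct stat st;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return errno;

	close(fd);

	/* The descriptor number is stale now, so fstat() should fail */
	if (fstat(fd, &st) == 0)
		return 0;

	return errno;
}
```

In a single-threaded caller, demo_errno_on_closed_fd("/") is expected to return EBADF, which strerror() renders as "Bad file descriptor".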

This happens because several threads start at the same time while the root_cache entry is still uninitialized, so multiple threads run inside get_root_path_slow() concurrently:

  1. Thread 1 takes the write lock on root_cache, updates root_cache.fd and releases the lock.
  2. Thread 2 takes the write lock on root_cache, closes root_cache.fd and then updates it.
  3. Thread 1 returns the file descriptor it stored in step 1, which is now closed, to llapi_hsm_request().

Here is a reproducer:

#define TEST13_THR_NBR 20

/* Thread body: resolve the FID to a path through the llapi root cache */
static void *test13_thr(void *arg)
{
        char *fidstr = arg;
        char path[PATH_MAX];
        long long recno = -1;
        int linkno = 0;
        long rc;

        rc = llapi_fid2path(lustre_dir2, fidstr, path,
                            sizeof(path), &recno, &linkno);

        /* long has pointer width on Linux, so the cast is lossless */
        return (void *) rc;
}

/* Test llapi root cache on multi-threading context */
static void test13(void)
{
        static pthread_t thread[TEST13_THR_NBR];
        int fd, i, iter;
        long long rc;
        void *res;
        struct lu_fid fid;
        char fidstr[FID_LEN + 1];

        fd = creat(mainpath, 00660);
        ASSERTF(fd >= 0, "creat failed for '%s': %s",
                mainpath, strerror(errno));

        rc = llapi_fd2fid(fd, &fid);
        ASSERTF(rc == 0, "llapi_fd2fid failed for '%s': %s",
                mainpath, strerror(-rc));
        close(fd);

        snprintf(fidstr, sizeof(fidstr), DFID_NOBRACE, PFID(&fid));
        for (iter = 0; iter < 100; iter++) {
                /* reset cache on first mountpoint */
                fd = llapi_open_by_fid(lustre_dir, &fid, O_RDONLY);
                ASSERTF(fd >= 0, "llapi_open_by_fid for " DFID_NOBRACE ": %d",
                        PFID(&fid), fd);
                close(fd);

                /* start threads with llapi_fid2path() */
                for (i = 0; i < TEST13_THR_NBR; i++) {
                        rc = pthread_create(&thread[i], NULL, &test13_thr,
                                            fidstr);
                        ASSERTF(rc == 0, "pthread_create failed: %s",
                                strerror(rc));
                }

                for (i = 0; i < TEST13_THR_NBR; i++) {
                        /* join through a void * to avoid a partial write to rc */
                        pthread_join(thread[i], &res);
                        rc = (long) res;
                        ASSERTF(rc == 0,
                                "llapi_fid2path for " DFID_NOBRACE " (iter: %d, thr:%d): %s",
                                PFID(&fid), iter, i, strerror(-rc));
                }
        }
}


 Comments   
Comment by Gerrit Updater [ 18/Apr/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50682
Subject: LU-16747 llapi: fix race in get_root_path_slow()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2ee9f3eb401b37e4a5c8f74bcf2e07ecbc38966d

Comment by Gerrit Updater [ 19/Jun/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51367
Subject: LU-16747 llapi: fix race in get_root_path_slow()
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: e364909d72a9ae48699e171b1ee10170b9a8fbce

Comment by Gerrit Updater [ 20/Jun/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50682/
Subject: LU-16747 llapi: fix race in get_root_path_slow()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9ef1e097d53000233f9ba23319268f467c276173

Comment by Peter Jones [ 20/Jun/23 ]

Landed for 2.16
