Lustre / LU-16747

concurrency issue in get_root_path_slow()

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0

    Description

      Since the integration of https://review.whamcloud.com/c/fs/lustre-release/+/36603 ("LU-8585 llapi: use open_by_handle_at in llapi_open_by_fid") at CEA, we have observed the following errors when starting a robinhood policy process:

      rh_purge.log:2023/04/11 11:43:33 [24565/6] lhsm | ERROR performing HSM request(RELEASE, root=/mnt/lustre, fid=[0x200000bd1:0x599:0x0]): Bad file descriptor
      rh_purge.log:2023/04/11 11:43:33 [24565/6] purge | Error applying action on entry /mnt/lustre/file.5: Bad file descriptor
      

      The error is triggered inside "llapi_hsm_request()".

      The cause is a race: several threads start at the same time while the root_cache entry is still uninitialized, so more than one of them runs get_root_path_slow() concurrently:

      1. Thread 1 takes the write lock on root_cache and updates root_cache.fd.
      2. Thread 2 takes the write lock on root_cache, closes root_cache.fd and then updates it.
      3. Thread 1 returns a closed file descriptor to llapi_hsm_request(), which then fails with EBADF ("Bad file descriptor").

      Here is a reproducer:

      #define TEST13_THR_NBR 20                                                                       
      static void *test13_thr(void *arg)
      {                                                                                               
              char *fidstr = arg;                                                                     
              char path[PATH_MAX];                                                                    
              long long recno = -1;                                                                   
              int linkno = 0;                                                                         
              long long rc;                                                                           
                                                                                                      
              rc = llapi_fid2path(lustre_dir2, fidstr, path,                                          
                                  sizeof(path), &recno, &linkno);                                     
                                                                                                      
              return (void *) rc;                                                                     
      }                                                                                               
                                                                                                      
      /* Test the llapi root cache in a multi-threaded context */
      static void test13(void)                                                                        
      {                                                                                               
              static pthread_t thread[TEST13_THR_NBR];                                                
              int fd, i, iter;                                                                        
              long long rc;                                                                           
              struct lu_fid fid;                                                                      
              char fidstr[FID_LEN + 1];                                                               
                                                                                                      
              fd = creat(mainpath, 00660);                                                            
              ASSERTF(fd >= 0, "creat failed for '%s': %s",                                           
                      mainpath, strerror(errno));                                                     
                                                                                                      
              rc = llapi_fd2fid(fd, &fid);                                                            
              ASSERTF(rc == 0, "llapi_fd2fid failed for '%s': %s",                                    
                      mainpath, strerror(-rc));                                                       
              close(fd);                                                                              
                                                                                                      
              snprintf(fidstr, sizeof(fidstr), DFID_NOBRACE, PFID(&fid));                             
              for (iter = 0; iter < 100; iter++) {                                                    
                      /* reset cache on first mountpoint */                                           
                      fd = llapi_open_by_fid(lustre_dir, &fid, O_RDONLY);                             
                      ASSERTF(fd >= 0, "llapi_open_by_fid for " DFID_NOBRACE ": %d",                  
                              PFID(&fid), fd);                                                        
                      close(fd);                                                                      
                                                                                                      
                      /* start threads with llapi_fid2path() */                                    
                      for (i = 0; i < TEST13_THR_NBR; i++)                                            
                              pthread_create(&thread[i], NULL, &test13_thr, fidstr);                  
                                                                                                      
                      for (i = 0; i < TEST13_THR_NBR; i++) {                                          
                              pthread_join(thread[i], (void **) &rc);                                 
                              ASSERTF(rc == 0,                                                        
                                      "llapi_fid2path for " DFID_NOBRACE " (iter: %d, thr:%d): %s",   
                                      PFID(&fid), iter, i, strerror(-rc));                            
                      }                                                                               
              }                                                                                       
      }                                                                                               
      

          People

            eaujames Etienne Aujames