Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.4.0, Lustre 2.1.6
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 7394

    Description

      We are using lustre 2.1.4-3chaos on our server clusters.

      Running a test application, one of our archive storage folks discovered that Lustre's directory listings are rather unreliable. The first thing she noticed is that directory entries can appear multiple times:

      > cd /p/lscratchrza/apotts/divt_rzstagg0/htar_1st_27475
      > find . -type f > ../test.lst0 ; echo $? ; wc -l ../test.lst0
      0
      34339 ../test.lst0
      > find . -type f > ../test.lst1 ; echo $? ; wc -l ../test.lst1
      0
      35006 ../test.lst1
      

      When the two directory listings are sorted and run through uniq, there are only 34339 unique entries.

      One of our sysadmins investigated further and found that sometimes entries are missing from the listing altogether. But when the missing files are checked with ls, they are present.

      This has been noticed with the above find command, and also using "/bin/ls -laR .". Both files and subdirectories have appeared twice in the directory listing.

      The Lustre clients that have reproduced this behaviour are running 2.1.2-4chaos and 1.8.5.0-6chaos.

      Attachments

        Activity

          [LU-3029] Directory listings are unreliable
          yong.fan nasf (Inactive) added a comment - - edited

          It is NOT true that all lustre-1.8.5 releases support 64bithash. I have checked your branches and found that the oldest branch which supports 64bithash is lustre-1.8.5.0-6chaos. The earlier versions, such as lustre-1.8.5.0-{1/2/3/4/5}chaos, do NOT support 64bithash.


          nedbass Ned Bass (Inactive) added a comment -

          No, I don't believe so. The PPC clients are running 2.3.58; all other clients should be 2.1.x or 1.8.5.
          yong.fan nasf (Inactive) added a comment - - edited

          Were there any liblustre clients or old b2_0 clients that accessed your system since the last MDS reboot? They do not support 64-bit dirhash.


          nedbass Ned Bass (Inactive) added a comment -

          > The "old client" means a client that was ever old, for example, a client upgraded from lustre-1.8.x to lustre-1.8.5.0-6chaos. If the client ever triggered a readdir RPC before the upgrade, then it would cause the above issues.

          The 1.8 clients have been running the same version since long before the MDS was last rebooted.

          yong.fan nasf (Inactive) added a comment -

          The "old client" means a client that was ever old, for example, a client upgraded from lustre-1.8.x to lustre-1.8.5.0-6chaos. If the client ever triggered a readdir RPC before the upgrade, then it would cause the above issues.

          As for the race checking, I am not sure. Since it is a race, it should be rare, which means most RPC service threads should NOT have "FMODE_32BITHASH" set. But the results do not support such a conclusion. On the other hand, if the "race" existed, it should be possible for me to reproduce it locally. But I have tested with your branches thousands of times and never reproduced it. So the "race" would have to be extremely rare...

          nedbass Ned Bass (Inactive) added a comment -

          nasf, thanks for the analysis. It does seem plausible, except that our oldest client is lustre-1.8.5.0-6chaos, and it seems to claim OBD_CONNECT_64BITHASH:

          https://github.com/chaos/lustre/blob/1.8.5.0-6chaos/lustre/llite/llite_lib.c#L284

          Another unsubstantiated theory is that a swabbing bug could make a PPC client look like 32-bit. But I checked the import file in /proc for a PPC client and it said 64bithash.

          The only other idea I have is that maybe there is some race when checking exp_connect_flags here:

          https://github.com/chaos/lustre/blob/2.1.4-4chaos/lustre/mdt/mdt_handler.c#L1488

          yong.fan nasf (Inactive) added a comment - - edited

          The situation should be like this:

          1) When you upgraded your MDS with the patch http://review.whamcloud.com/#patch,sidebyside,3467,7,lustre/osd-ldiskfs/osd_handler.c, the 64bithash/32bithash issue was introduced into your system, because the "osd_thread_info" is reused without being fully reset when switching from one RPC to the next.

          2) For an old client, whether 32-bit or 64-bit, as long as it did NOT claim the OBD_CONNECT_64BITHASH flag when it connected to the MDS, a readdir RPC from that client would cause "osd_thread_info::oti_it_ea::oie_file::f_mode" to be set to "FMODE_32BITHASH". Once such a readdir RPC happened, the "FMODE_32BITHASH" flag on the related "osd_thread_info" would not be cleared until the RPC service was restarted on the MDS.

          3) As long as "FMODE_32BITHASH" was set, directory hashes processed by that RPC service thread would use the major hash only - 32 bits. That is why we saw 32-bit dirhash values returned.

          4) Not all RPC service threads have "osd_thread_info::oti_it_ea::oie_file::f_mode" set to "FMODE_32BITHASH"; it depends on whether an old client ever triggered that thread to serve a readdir RPC. If an RPC service thread did not have "FMODE_32BITHASH", then it would generate 64-bit hashes, which is why we also saw some 64-bit dirhash values returned.

          5) A readdir RPC from a client can be served by any RPC (readpage) service thread. So sometimes a readdir RPC was served by a thread with "FMODE_32BITHASH" set, and sometimes by a thread without it. For a large directory, one "ls -l dir" command may trigger several readdir RPCs; if these RPCs were handled by different RPC service threads, some with "FMODE_32BITHASH" and some without, then when the client sent a 32-bit hash to a thread that did NOT have "FMODE_32BITHASH", that thread would interpret the 32-bit hash (from the client) as "major = 0, minor = 32bithash", which was wrong, so it could not locate the right position. (See the sketch after this list.)

          6) For a 2.x client, one readdir RPC fetches back at most 256 pages, but for a 1.8 client, only a single page per RPC. So the readdir RPC counts for the same-sized directory are different, and more readdir RPCs mean a higher chance of failure. That is why the failure is easier to reproduce on a 1.8 client than on a 2.x client.
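
          To make point 5 concrete, here is a minimal, self-contained sketch. It is not Lustre code: the helper names and the exact bit layout are illustrative assumptions loosely modeled on the ext4/ldiskfs 64-bit hash cookie scheme. It shows why a cookie produced by a thread stuck in 32-bit mode, when decoded by a thread in 64-bit mode, comes out as "major = 0, minor = old hash" and therefore seeks to the wrong place.

          #include <stdint.h>
          #include <stdio.h>

          /* Assumed layout for the sketch: a 64-bit readdir cookie packs the
           * 32-bit major hash in the high word and the minor hash in the low
           * word, while a 32-bit cookie keeps only the major hash.  (The real
           * code also reserves low bits for end-of-dir markers; omitted here.) */
          static uint64_t hash2pos_64(uint32_t major, uint32_t minor)
          {
                  return ((uint64_t)major << 32) | minor;
          }

          static uint64_t hash2pos_32(uint32_t major, uint32_t minor)
          {
                  (void)minor;            /* the minor hash is simply dropped */
                  return major;
          }

          int main(void)
          {
                  uint32_t major = 0x8a3d91c2, minor = 0xbeef;

                  /* A thread in 64-bit mode emits the full cookie... */
                  uint64_t good = hash2pos_64(major, minor);
                  /* ...but a thread stuck with FMODE_32BITHASH emits a truncated one. */
                  uint64_t bad = hash2pos_32(major, minor);

                  /* When another thread in 64-bit mode decodes the truncated cookie,
                   * the major hash lands in the high word, which is now zero. */
                  printf("good cookie: major=%#x minor=%#x\n",
                         (uint32_t)(good >> 32), (uint32_t)good);
                  printf("bad  cookie: major=%#x minor=%#x\n",
                         (uint32_t)(bad >> 32), (uint32_t)bad);

                  /* The bad position matches no real directory offset, so entries
                   * get skipped or returned twice, as seen in the listings above. */
                  return 0;
          }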


          yong.fan nasf (Inactive) added a comment -

          Agreed, the "|=" will be changed back to "=".
          nedbass Ned Bass (Inactive) added a comment - - edited

          Reviewing this bit of code:

          osd_handler.c
          3474 static struct dt_it *osd_it_ea_init(const struct lu_env *env,
          3475                                     struct dt_object *dt,
          3476                                     __u32 attr,
          3477                                     struct lustre_capa *capa)
          3478 {
          3479         struct osd_object       *obj  = osd_dt_obj(dt);
          3480         struct osd_thread_info  *info = osd_oti_get(env);
          3481         struct osd_it_ea        *it   = &info->oti_it_ea;
          3482         struct lu_object        *lo   = &dt->do_lu;
          3483         struct dentry           *obj_dentry = &info->oti_it_dentry;
          3484         ENTRY;
          3485         LASSERT(lu_object_exists(lo));
          3486
          3487         obj_dentry->d_inode = obj->oo_inode;
          3488         obj_dentry->d_sb = osd_sb(osd_obj2dev(obj));
          3489         obj_dentry->d_name.hash = 0;
          3490
          3491         it->oie_rd_dirent       = 0;
          3492         it->oie_it_dirent       = 0;
          3493         it->oie_dirent          = NULL;
          3494         it->oie_buf             = info->oti_it_ea_buf;
          3495         it->oie_obj             = obj;
          3496         it->oie_file.f_pos      = 0;
          3497         it->oie_file.f_dentry   = obj_dentry;
          3498         if (attr & LUDA_64BITHASH)
          3499                 it->oie_file.f_mode |= FMODE_64BITHASH;
          3500         else
          3501                 it->oie_file.f_mode |= FMODE_32BITHASH;
          

          Shouldn't lines 3499 and 3501 just use an assignment operator = rather than |=?

          It seems the iterator context is reused between requests handled by the mdt_rdpg_xx threads, so once set, the bits are never cleared. If a mix of 32- and 64-bit clients connect, we will end up with both FMODE_64BITHASH and FMODE_32BITHASH set, as we are seeing. (I don't think we have any 32-bit clients, so it's still a mystery why we're seeing that.)

          Prior to this change it was an assignment:

          http://review.whamcloud.com/#patch,sidebyside,3467,7,lustre/osd-ldiskfs/osd_handler.c

          This bug surfaced after we started running 2.1.4-4chaos which included that change. I don't think it introduced the bug, but it may have unmasked it.
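
          For illustration, here is a minimal user-space sketch, not the actual Lustre patch: struct fake_file, the init_* helpers and the value of LUDA_64BITHASH are demo stand-ins, while the FMODE_* values are chosen to match the f_mode readings reported via systemtap further down. It shows how |= on a reused per-thread file mode accumulates both hash-width bits, while the plain assignment discussed above does not.

          #include <stdio.h>

          /* Flag values consistent with the observed f_mode readings
           * (0x600 = both bits set, 0x400 = 64-bit only). */
          #define FMODE_32BITHASH 0x200u
          #define FMODE_64BITHASH 0x400u
          #define LUDA_64BITHASH  0x1u      /* demo stand-in value */

          struct fake_file { unsigned int f_mode; };

          /* Buggy variant: |= lets hash-width bits accumulate when the
           * per-thread iterator (and its embedded file) is reused. */
          static void init_or(struct fake_file *f, unsigned int attr)
          {
                  if (attr & LUDA_64BITHASH)
                          f->f_mode |= FMODE_64BITHASH;
                  else
                          f->f_mode |= FMODE_32BITHASH;
          }

          /* Fixed variant discussed in the thread: plain assignment, so a
           * stale bit from a previous request cannot survive into the next. */
          static void init_assign(struct fake_file *f, unsigned int attr)
          {
                  if (attr & LUDA_64BITHASH)
                          f->f_mode = FMODE_64BITHASH;
                  else
                          f->f_mode = FMODE_32BITHASH;
          }

          int main(void)
          {
                  struct fake_file reused = { 0 };

                  init_or(&reused, 0);                 /* old 32-bit-hash client     */
                  init_or(&reused, LUDA_64BITHASH);    /* same thread, 64-bit client */
                  printf("with |= : f_mode=%#x\n", reused.f_mode);   /* 0x600, both  */

                  init_assign(&reused, 0);
                  init_assign(&reused, LUDA_64BITHASH);
                  printf("with  = : f_mode=%#x\n", reused.f_mode);   /* 0x400 only   */
                  return 0;
          }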


          nedbass Ned Bass (Inactive) added a comment -

          It turns out an earlier version of the patch I linked to was carried in the ldiskfs patch series for earlier kernels.

          nedbass Ned Bass (Inactive) added a comment -

          I peeked at filp->f_mode in ldiskfs_readdir() with systemtap. It is always 0x600 on the server where we see this problem, so both FMODE_32BITHASH and FMODE_64BITHASH are set. On a server where we've never seen this bug, it is 0x400, i.e. only FMODE_64BITHASH is set. I'm not sure how it's getting set that way, but that explains why the major hash is always zero.
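
          For context on why both bits being set zeroes the major hash: the sketch below is a simplification modeled on the upstream ext4 hash2pos() helper, not the ldiskfs code itself, and the FMODE_* values follow the readings above. Because the FMODE_32BITHASH check comes first, an f_mode of 0x600 takes the 32-bit branch, so the high word of the returned readdir position (where a 64-bit reader expects the major hash) is always zero.

          #include <stdint.h>
          #include <stdio.h>

          #define FMODE_32BITHASH 0x200u   /* per the 0x600/0x400 readings above */
          #define FMODE_64BITHASH 0x400u

          /* Simplified sketch of a hash-to-position helper: if the 32-bit flag
           * is set it wins, even when the 64-bit flag is also set. */
          static uint64_t hash2pos_sketch(unsigned int f_mode, uint32_t major,
                                          uint32_t minor)
          {
                  if (f_mode & FMODE_32BITHASH)
                          return major;                      /* 32-bit branch wins */
                  return ((uint64_t)major << 32) | minor;    /* full 64-bit cookie */
          }

          int main(void)
          {
                  uint32_t major = 0x8a3d91c2, minor = 0xbeef;

                  uint64_t healthy = hash2pos_sketch(0x400, major, minor);
                  uint64_t broken  = hash2pos_sketch(0x600, major, minor);

                  printf("f_mode=0x400: pos=%#llx (major in high word)\n",
                         (unsigned long long)healthy);
                  printf("f_mode=0x600: pos=%#llx (high word always zero)\n",
                         (unsigned long long)broken);
                  return 0;
          }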

          People

            Assignee:
            yong.fan nasf (Inactive)
            Reporter:
            morrone Christopher Morrone (Inactive)
            Votes:
            0
            Watchers:
            15
