Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15055

spurious qmt_pool_add_rem() message when adding OST to pool: -EEXIST

Details

    • 3
    • 9223372036854775807

    Description

      The qmt_pool_add_rem() message started appearing on 2021-08-18 in test output, and there have been a few thousand hits per day:

      LustreError: 5659:0:(qmt_pool.c:1390:qmt_pool_add_rem()) add to: can't scratch-QMT0000 scratch-OST0000_UUID pool pool1: rc = -17
      LustreError: 5666:0:(qmt_pool.c:1390:qmt_pool_add_rem()) add to: can't scratch-QMT0000 scratch-OST0001_UUID pool pool2: rc = -17
      LustreError: 5674:0:(qmt_pool.c:1390:qmt_pool_add_rem()) add to: can't scratch-QMT0000 scratch-OST0002_UUID pool pool3: rc = -17
      

      Patches landed on that day are listed below (no other patches landed after 2021-08-10 or before 2021-08-25):

      $ git log --oneline --before 2021-08-22  --after 2021-08-16
      d8204f903a (tag: v2_14_54, tag: 2.14.54) New tag 2.14.54
      5220160648 LU-14093 lutf: fix build with gcc10
      a205334da5 LU-14903 doc: update lfs-setdirstripe man page
      1313cad7a1 LU-14899 ldiskfs: Add 5.4.136 mainline kernel support
      c44afcfb72 LU-12815 socklnd: set conns_per_peer based on link speed
      6e30cd0844 LU-14871 kernel: kernel update RHEL7.9 [3.10.0-1160.36.2.el7]
      14b8276e06 LU-14865 utils: llog_reader.c printf type mismatch
      aa5d081237 LU-9859 lnet: fold lprocfs_call_handler functionality into lnet_debugfs_*
      e423a0bd7a LU-14787 libcfs: Proved an abstraction for AS_EXITING
      76c71a167b LU-14775 kernel: kernel update SLES12 SP5 [4.12.14-122.74.1]
      67752f6db2 LU-14773 tests: skip check_network() on working node
      024f9303bc LU-14668 lnet: Lock primary NID logic
      684943e2d0 LU-14668 lnet: peer state to lock primary nid
      16321de596 LU-14661 obdclass: Add peer/peer NI when processing llog *
      ac201366ad LU-14661 lnet: Provide kernel API for adding peers
      51350e9b73 LU-14531 osd: serialize access to object vs object destroy
      a5cbe7883d LU-12815 socklnd: allow dynamic setting of conns_per_peer
      d13d8158e8 LU-14093 mgc: rework mgc_apply_recover_logs() for gcc10
      8dd4488a07 LU-6142 tests: remove iam_ut binary
      301d76a711 LU-14876 out: don't connect to busy MDS-MDS export *
      

      It isn't really clear which of those patches started causing this problem, but the message is being printed in different subtests that create/remove OST pools.

       

      The graph shows occurrences by subtest since 2021-08-18, it looks like this happens in any subtest that is adding a pool:

      Attachments

        Issue Links

          Activity

            [LU-15055] spurious qmt_pool_add_rem() message when adding OST to pool: -EEXIST

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56939/
            Subject: LU-15055 lod: run qmt_pool_* only from the MDT0000 config
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 842a02df7a6b26ab096824d8ecea95474d649240

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56939/ Subject: LU-15055 lod: run qmt_pool_* only from the MDT0000 config Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 842a02df7a6b26ab096824d8ecea95474d649240

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56939
            Subject: LU-15055 lod: run qmt_pool_* only from the MDT0000 config
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 9f6b858fa7d9f2ac48744aec07446754869147c8

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56939 Subject: LU-15055 lod: run qmt_pool_* only from the MDT0000 config Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 9f6b858fa7d9f2ac48744aec07446754869147c8
            pjones Peter Jones added a comment -

            Landed for 2.16

            pjones Peter Jones added a comment - Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47059/
            Subject: LU-15055 lod: run qmt_pool_* only from the MDT0000 config
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0f158c6a093e059d89f637f31d34742078c38209

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47059/ Subject: LU-15055 lod: run qmt_pool_* only from the MDT0000 config Project: fs/lustre-release Branch: master Current Patch Set: Commit: 0f158c6a093e059d89f637f31d34742078c38209

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47059
            Subject: LU-15055 lod: run qmt_pool_* only from the MDT0000 config
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9b4ea520fa3f05da47b20758a81360d03d25e2db

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47059 Subject: LU-15055 lod: run qmt_pool_* only from the MDT0000 config Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 9b4ea520fa3f05da47b20758a81360d03d25e2db

            Hello,

            I was debugging the https://review.whamcloud.com/#/c/46955 ("LU-15707 lod: force creation of a component without a pool") when I saw the same message.

            I think the root cause is the following code:

            static int lod_pool_add_q(struct obd_device *obd, char *poolname, char *ostname)
            {                                                                               
                    int err = lod_pool_add(obd, poolname, ostname);                         
                                                                                            
                    if (!err) {                                                             
                            obd = obd_find_qmt0(obd->obd_name);                             <------
                            if (obd)                                                        
                                    obd_pool_add(obd, poolname, ostname);                   
                    }                                                                       
                                                                                            
                    return err;                                                             
            }                                                                               
            

            If the mds0 has 2 MDT, the MGS will add 1 pool_add record in CONFIGS/lustre-MDT0000 and CONFIGS/lustre-MDT0001. So obd_pool_add/lod_pool_add_q will be call 2 times (1 for each MDT), QMT0000 will be found 2 times (on mds0) and finally qmt_pool_add/qmt_pool_add_rem will try to add the same OST on QMT0000.

            So maybe we can add/remove the pool to QMT0000 only for the MDT0000:

            static int lod_pool_add_q(struct obd_device *obd, char *poolname, char *ostname)
            {                                                                               
                    int err = lod_pool_add(obd, poolname, ostname);                         
                                                                                            
                    if (!err && strstr(obd->obd_name, "-MDT0000")) {                                                             
                            obd = obd_find_qmt0(obd->obd_name);                           
                            if (obd)                                                        
                                    obd_pool_add(obd, poolname, ostname);                   
                    }                                                                       
                                                                                            
                    return err;                                                             
            }                                                                               
            
            eaujames Etienne Aujames added a comment - Hello, I was debugging the https://review.whamcloud.com/#/c/46955 (" LU-15707 lod: force creation of a component without a pool") when I saw the same message. I think the root cause is the following code: static int lod_pool_add_q( struct obd_device *obd, char *poolname, char *ostname) { int err = lod_pool_add(obd, poolname, ostname); if (!err) { obd = obd_find_qmt0(obd->obd_name); <------ if (obd) obd_pool_add(obd, poolname, ostname); } return err; } If the mds0 has 2 MDT, the MGS will add 1 pool_add record in CONFIGS/lustre-MDT0000 and CONFIGS/lustre-MDT0001. So obd_pool_add/lod_pool_add_q will be call 2 times (1 for each MDT), QMT0000 will be found 2 times (on mds0) and finally qmt_pool_add/qmt_pool_add_rem will try to add the same OST on QMT0000. So maybe we can add/remove the pool to QMT0000 only for the MDT0000: static int lod_pool_add_q( struct obd_device *obd, char *poolname, char *ostname) { int err = lod_pool_add(obd, poolname, ostname); if (!err && strstr(obd->obd_name, "-MDT0000" )) { obd = obd_find_qmt0(obd->obd_name); if (obd) obd_pool_add(obd, poolname, ostname); } return err; }

            People

              scherementsev Sergey Cheremencev
              adilger Andreas Dilger
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: