
LBUG: ASSERTION( inode->i_data.nrpages == 0 ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.7
    • Affects Version/s: Lustre 2.10.5
    • Labels: None
    • Environment: Client: CentOS 7.5 Lustre 2.10.5
      Server (Oak) CentOS 7.4 Lustre 2.10.4
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Hi,
      We have a type of job that keeps crashing Lustre client version 2.10.5 with the following trace. It is very likely that this job has files open on Oak (Lustre 2.10.4). This looks like old tickets LU-1414 and LU-118... fixed in Lustre 1.8! The issue happens on a bigmem node (1.5TB of RAM), and doesn't seem to happen on nodes with less memory. I'll try to upload a crash dump file to your ftp.

      [11497.465606] LustreError: 132407:0:(llite_lib.c:2047:ll_delete_inode()) ASSERTION( inode->i_data.nrpages == 0 ) failed: inode=[0x200018e83:0x1ba2c:0x0](ffff8aa85a298510) nrpages=1, see LU-118
      [11497.487939] LustreError: 132407:0:(llite_lib.c:2047:ll_delete_inode()) LBUG
      [11497.495730] Pid: 132407, comm: spades 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
      [11497.505939] Call Trace:
      [11497.508685]  [<ffffffffc09947cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [11497.516009]  [<ffffffffc099487c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [11497.522955]  [<ffffffffc0f25c87>] ll_delete_inode+0x1b7/0x1c0 [lustre]
      [11497.530291]  [<ffffffff8d43c504>] evict+0xb4/0x180
      [11497.535663]  [<ffffffff8d43ce0c>] iput+0xfc/0x190
      [11497.540940]  [<ffffffff8d43126e>] do_unlinkat+0x1ae/0x2d0
      [11497.546990]  [<ffffffff8d432326>] SyS_unlink+0x16/0x20
      [11497.552753]  [<ffffffff8d92579b>] system_call_fastpath+0x22/0x27
      [11497.559484]  [<ffffffffffffffff>] 0xffffffffffffffff
      [11497.565069] Kernel panic - not syncing: LBUG
      [11497.569837] CPU: 7 PID: 132407 Comm: spades Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
      [11497.582928] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.8.0 005/17/2018
      [11497.591379] Call Trace:
      [11497.594105]  [<ffffffff8d913754>] dump_stack+0x19/0x1b
      [11497.599845]  [<ffffffff8d90d29f>] panic+0xe8/0x21f
      [11497.605211]  [<ffffffffc09948cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [11497.612131]  [<ffffffffc0f25c87>] ll_delete_inode+0x1b7/0x1c0 [lustre]
      [11497.619421]  [<ffffffff8d43c504>] evict+0xb4/0x180
      [11497.624775]  [<ffffffff8d43ce0c>] iput+0xfc/0x190
      [11497.630033]  [<ffffffff8d43126e>] do_unlinkat+0x1ae/0x2d0
      [11497.636064]  [<ffffffff8d42175e>] ? ____fput+0xe/0x10
      [11497.641709]  [<ffffffff8d2bab90>] ? task_work_run+0xc0/0xe0
      [11497.647935]  [<ffffffff8d432326>] SyS_unlink+0x16/0x20
      [11497.653679]  [<ffffffff8d92579b>] system_call_fastpath+0x22/0x27
      

      Thanks,
      Stephane

          Activity

            [LU-11582] LBUG: ASSERTION( inode->i_data.nrpages == 0 ) failed

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33681/
            Subject: LU-11582 llite: protect reading inode->i_data.nrpages
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: d936fdca9e00e05cecd4b21c6e4bbbf7107dc9b4

            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33639/
            Subject: LU-11582 llite: protect reading inode->i_data.nrpages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 04c172b686763be0d42eb4c36532d5795166eb7c

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/33681
            Subject: LU-11582 llite: protect reading inode->i_data.nrpages
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 35396125e071ad1457257348be38721bc9ffdad5

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/33639
            Subject: LU-11582 llite: protect reading inode->i_data.nrpages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c9fb1cae878d4c3df24173e8b3c6b436fe983533

            bobijam Zhenyu Xu added a comment -

            I think the assertion reads nrpages without holding mapping->tree_lock, which is supposed to protect it. In addition, truncate_inode_pages() traverses the mapping's radix tree without tree_lock, so it can miss a page that is in the middle of being removed from the radix tree in __remove_mapping().

            truncate_inode_pages_final()
                    nrpages = mapping->nrpages;
                    smp_rmb();
                    nrexceptional = mapping->nrexceptional;
            
                    if (nrpages || nrexceptional) {
                            /*
                             * As truncation uses a lockless tree lookup, cycle
                             * the tree lock to make sure any ongoing tree
                             * modification that does not see AS_EXITING is
                             * completed before starting the final truncate.
                             */
                            spin_lock_irq(&mapping->tree_lock);
                            spin_unlock_irq(&mapping->tree_lock);
                            // race window: __remove_mapping() has removed the page from the radix tree,
                            // but nrpages has not been decremented yet.
                            truncate_inode_pages(mapping, 0);
                    }       
            

            And I think our truncate_inode_pages_final() in lustre/include/lustre_compat.h makes the calls in the right order:

            #ifndef HAVE_TRUNCATE_INODE_PAGES_FINAL
            static inline void truncate_inode_pages_final(struct address_space *map)
            {
                    truncate_inode_pages(map, 0);
                    /* Workaround for LU-118 */
                    if (map->nrpages) {
                            /* cycling tree_lock after the truncate avoids the race */
                            spin_lock_irq(&map->tree_lock);
                            spin_unlock_irq(&map->tree_lock);
                    } /* Workaround end */
            }
            #endif
            

            I think the fix could be to take tree_lock when checking nrpages in ll_delete_inode(), or simply to delete this assertion.
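
For illustration only, the race and the first proposed fix can be modelled in userspace C. This is a minimal sketch under stated assumptions, not Lustre or kernel code and not the merged patch: `struct model_mapping` and every `model_*` function are hypothetical names, a C11 `atomic_flag` spinlock stands in for `mapping->tree_lock`, and a single pointer slot stands in for the radix tree. The removal is split into two halves so that the intermediate state (page already gone from the tree, counter not yet decremented) can be observed deterministically.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct address_space (one page slot). */
struct model_mapping {
	atomic_flag tree_lock;   /* models mapping->tree_lock */
	atomic_ulong nrpages;    /* models mapping->nrpages */
	_Atomic(void *) slot;    /* models a single radix-tree entry */
};

static void model_lock(struct model_mapping *m)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&m->tree_lock,
						 memory_order_acquire))
		;
}

static void model_unlock(struct model_mapping *m)
{
	atomic_flag_clear_explicit(&m->tree_lock, memory_order_release);
}

static void model_init(struct model_mapping *m, void *page)
{
	atomic_flag_clear(&m->tree_lock);
	atomic_init(&m->slot, page);
	atomic_init(&m->nrpages, page ? 1UL : 0UL);
}

/* First half of a removal: take the lock and clear the tree slot. */
static void model_remove_begin(struct model_mapping *m)
{
	model_lock(m);
	atomic_store(&m->slot, (void *)NULL);
}

/* Second half of a removal: decrement the counter and drop the lock. */
static void model_remove_end(struct model_mapping *m)
{
	atomic_fetch_sub(&m->nrpages, 1UL);
	model_unlock(m);
}

/* Lockless lookup, as a truncate pass would do: is a page still in the tree? */
static bool model_lookup_lockless(struct model_mapping *m)
{
	return atomic_load(&m->slot) != NULL;
}

/* What the assertion effectively does: read the counter without the lock. */
static unsigned long model_nrpages_unlocked(struct model_mapping *m)
{
	return atomic_load(&m->nrpages);
}

/* Proposed fix in this model: read the counter only while holding the lock,
 * so any removal that is in flight has finished before the value is read. */
static unsigned long model_nrpages_locked(struct model_mapping *m)
{
	unsigned long n;

	model_lock(m);
	n = atomic_load(&m->nrpages);
	model_unlock(m);
	return n;
}
```

In this model, between `model_remove_begin()` and `model_remove_end()` a lockless lookup finds no page while `model_nrpages_unlocked()` still returns 1, which is the state the assertion tripped over. A caller using `model_nrpages_locked()` has to wait for the lock, so it can only observe the counter after the removal has completed.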

            pjones Peter Jones added a comment -

            Bobijam

            Could you please advise?

            Thanks

            Peter


            sthiell Stephane Thiell added a comment -

            vmcore uploaded to your ftp server, the file is vmcore-sh-112-03-2018-10-26-22-17-26_LU-11582

            kernel used is CentOS 7 3.10.0-862.14.4.el7.x86_64 ( http://debuginfo.centos.org/7/x86_64/ )

            Thanks!

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: sthiell Stephane Thiell