[LU-15114] ASSERTION( atomic_read(&d->opd_sync_changes) > 0 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.15.0
Affects Version/s: None
Labels:
None

Severity:
3

Description

A customer sees repeatable MDS crashes with ASSERTION( atomic_read(&d->opd_sync_changes) > 0 ). The vmcore shows that opd_sync_changes did overflow and wrapped to a negative number, which triggers the assertion failure.

crash> osp_device ffff9a6f5ff4e000
struct osp_device {
  opd_dt_dev = {
    dd_lu_dev = {
      ld_ref = {
        counter = 371136
      }, 
      ld_type = 0xffffffffc14d1620 <osp_device_type>, 
      ld_ops = 0xffffffffc14c74c0 <osp_lu_ops>, 
      ld_site = 0xffff9a6fac302138, 
...
crash> osp_device.opd_sync_changes ffff9a6f5ff4e000
  opd_sync_changes = {
    counter = -2147477073
  }
crash>
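
As a rough check of the dump above, the following sketch reinterprets the observed counter as a 32-bit two's-complement value and estimates how many net increments would produce it. It assumes the counter started at zero and wrapped exactly once; the vmcore does not confirm either assumption. The helper names are illustrative and are not Lustre code.

```python
import ctypes

INT_MAX = 2**31 - 1  # largest value a signed 32-bit counter can hold


def wrap_int32(value):
    """Reduce an arbitrary integer to a signed 32-bit two's-complement value."""
    return ctypes.c_int32(value).value


def net_increments_after_one_wrap(observed):
    """Net increments needed to reach a negative `observed` value,
    assuming the counter started at 0 and wrapped exactly once."""
    return observed + 2**32


observed = -2147477073  # opd_sync_changes.counter from the crash output
total = net_increments_after_one_wrap(observed)

print(total)            # 2147490223
print(total - INT_MAX)  # 6576 increments past INT_MAX
```

Under those assumptions, the counter went only a few thousand increments past INT_MAX before the assertion fired.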

The OSP sync mechanism, which adds unlink/setattr llog records to per-OST llog files (two-tier: an llog catalog plus plain llogs), has no way to stop the llog catalogs and files from growing, so the sync_changes counter can eventually overflow. The counter is a signed 32-bit integer, so exceeding 2^31 - 1 (about 2.1 billion) turns it into a negative number. The llog catalog and plain llog files also have a limited capacity for llog records (the maximum is approximately 64k * 64k).
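
To put the two limits mentioned above side by side, here is a small arithmetic sketch. It assumes "64k" means 65536 and that each queued llog record is counted once in sync_changes; the constant names are hypothetical and do not come from the Lustre source.

```python
INT_MAX = 2**31 - 1                  # upper bound of a signed 32-bit counter

CATALOG_SLOTS = 64 * 1024            # assumed: plain llogs per catalog
RECORDS_PER_PLAIN_LLOG = 64 * 1024   # assumed: records per plain llog

approx_max_records = CATALOG_SLOTS * RECORDS_PER_PLAIN_LLOG

print(approx_max_records)                  # 4294967296
print(approx_max_records > INT_MAX)        # True
print(approx_max_records / (INT_MAX + 1))  # 2.0
```

By this estimate the approximate record capacity is about twice what the counter can represent, so the counter could wrap before the catalog fills up.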

On a slow system, I can reproduce unbounded growth of sync_changes by running a simple program that changes the uid of an open file in an endless loop:

[root@cslmo2302 ~]# while sleep 10; do lctl get_param osp.*.sync_changes ; done
osp.testfs-OST0000-osc-MDT0000.sync_changes=106264
osp.testfs-OST0001-osc-MDT0000.sync_changes=0
osp.testfs-OST0000-osc-MDT0000.sync_changes=157168
osp.testfs-OST0001-osc-MDT0000.sync_changes=0
osp.testfs-OST0000-osc-MDT0000.sync_changes=206598
osp.testfs-OST0001-osc-MDT0000.sync_changes=0
osp.testfs-OST0000-osc-MDT0000.sync_changes=255955
osp.testfs-OST0001-osc-MDT0000.sync_changes=0
osp.testfs-OST0000-osc-MDT0000.sync_changes=305767
osp.testfs-OST0001-osc-MDT0000.sync_changes=0
...
osp.testfs-OST0001-osc-MDT0000.sync_changes=0
osp.testfs-OST0000-osc-MDT0000.sync_changes=1225362
osp.testfs-OST0001-osc-MDT0000.sync_changes=0
osp.testfs-OST0000-osc-MDT0000.sync_changes=1221266
The output above comes from running a single-threaded test program that issues chown(2) system calls, with the following setting:

lctl set_param osp.*.max_rpcs_in_progress=4096
Factors such as 1) a faster system, 2) some additional load on the OSTs, and 3) network problems make it even more likely to hit the assertion.
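
The test program itself is not attached to this ticket. A minimal sketch of such a reproducer is shown below, assuming it is enough to keep a file open and alternate its owner between two uids with chown(2). The function name, its parameters, and the path and uids in the usage comment are hypothetical.

```python
import os


def chown_loop(path, uid_a, uid_b, iterations=None):
    """Alternate the owner of `path` between two uids while holding it open.

    iterations=None loops forever, as in the endless loop described in the
    ticket. Returns the number of chown(2) calls made.
    """
    calls = 0
    fd = os.open(path, os.O_RDONLY)  # keep the file open for the whole run
    try:
        while iterations is None or calls < iterations:
            uid = uid_a if calls % 2 == 0 else uid_b
            os.chown(path, uid, -1)  # -1 leaves the group unchanged
            calls += 1
    finally:
        os.close(fd)
    return calls


# Example (run as root on a client; path and uids are placeholders):
# chown_loop("/mnt/testfs/file", 1000, 1001)
```

Changing to a different uid normally requires root, and whether each call actually queues a setattr record for the OST depends on the file having objects on that OST.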

Issue Links

is related to

LU-15577 Interop sanity test_831: osp changes throttling failed, 999>110 (Resolved)

is related to

LU-15030 Add /debugfs interface to monitor the sync progress between MDT and OST (Open)

LU-15115 ptlrpc resend on EINPROGRESS timeouts can be not correct (Resolved)

Activity

Peter Jones added a comment - 30/Nov/21 1:45 PM

Landed for 2.15

Gerrit Updater added a comment - 30/Nov/21 3:46 AM

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45265/
Subject: LU-15114 osp: changes queuing throttle
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c226e70007aab798c39ccd0fe13ddcba65f04f23

Gerrit Updater added a comment - 15/Oct/21 5:44 PM

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45265
Subject: LU-15114 osp: changes queuing throttle
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d7663f26374675f3670d72328feb97e101cf5ba0

People

Assignee: Alexander Zarochentsev

Reporter: Alexander Zarochentsev

Votes: 0

Watchers: 6

Dates

Created: 15/Oct/21 5:01 PM

Updated: 20/Feb/22 7:18 PM

Resolved: 30/Nov/21 1:45 PM