Lustre / LU-15368

Monitoring changelog_size is slowing down changelog operations


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: Lustre 2.12.7
    • Environment: CentOS 7.9
    • Severity: 3

    Description

      We have noticed that monitoring changelog_size on the MDS can significantly slow down changelog operations and, eventually, the MDS in general.

      We have a script that monitors changelog_size every minute to display the size per MDT with Grafana (we also have a Grafana alert when a changelog_size grows too much). The script basically does the following:

      [root@fir-md1-s3 ~]# lctl get_param mdd.*-MDT*.changelog_size
      mdd.fir-MDT0002.changelog_size=305538455808
      
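      For reference, a minimal sketch of what such a collector could look like, assuming a Graphite/Carbon endpoint behind Grafana (the host, port and metric prefix below are placeholders, not our exact script):

      #!/bin/bash
      # Hypothetical collector sketch: poll changelog_size for every MDT once a
      # minute and push one metric per MDT to a Graphite/Carbon endpoint scraped
      # by Grafana. Host, port and metric prefix are placeholders.
      GRAPHITE_HOST=graphite.example.com
      GRAPHITE_PORT=2003

      while true; do
          now=$(date +%s)
          # Each line looks like "mdd.fir-MDT0002.changelog_size=305538455808"
          lctl get_param "mdd.*-MDT*.changelog_size" 2>/dev/null |
          while IFS='=' read -r param size; do
              mdt=$(echo "$param" | cut -d. -f2)   # e.g. fir-MDT0002
              echo "lustre.changelog_size.${mdt} ${size} ${now}"
          done | nc -w 1 "$GRAPHITE_HOST" "$GRAPHITE_PORT"
          sleep 60
      done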

      Recently, a user launched massive jobs writing many small files at 30K+ creates/s for hours, resulting in a backlog of changelogs on MDT0002: changelog_size increased dramatically, and the monitoring script took longer and longer to fetch the value of changelog_size (more than 1 minute). This resulted in a slowdown of changelog commits as seen from Robinhood and, eventually, a generally high load on the MDS with slow operations (up to 5 s for a chownat).

      [root@fir-md1-s3 ~]# uptime
       09:35:21 up 25 days, 23:30,  2 users,  load average: 101.05, 97.57, 100.55
      

      At one point I even noticed a watchdog backtrace:

      Dec 13 08:49:02 fir-md1-s3 kernel: Pid: 24707, comm: mdt00_062 3.10.0-1160.45.1.el7_lustre.pl1.x86_64 #1 SMP Wed Nov 10 23:41:33 PST 2021
      Dec 13 08:49:02 fir-md1-s3 kernel: Call Trace:
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffa7798387>] call_rwsem_down_write_failed+0x17/0x30
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc0ecf68f>] llog_cat_id2handle+0x7f/0x620 [obdclass]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc0ed0778>] llog_cat_cancel_records+0x128/0x3d0 [obdclass]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc1b01328>] llog_changelog_cancel_cb+0x1d8/0x5b0 [mdd]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc0eca5af>] llog_process_thread+0x85f/0x1a70 [obdclass]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc0ecb87c>] llog_process_or_fork+0xbc/0x450 [obdclass]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc0ed0c59>] llog_cat_process_cb+0x239/0x250 [obdclass]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc0eca5af>] llog_process_thread+0x85f/0x1a70 [obdclass]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc0ecb87c>] llog_process_or_fork+0xbc/0x450 [obdclass]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc0ecd5e1>] llog_cat_process_or_fork+0x1e1/0x360 [obdclass]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc0ecd78e>] llog_cat_process+0x2e/0x30 [obdclass]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc1affa34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
      Dec 13 08:49:02 fir-md1-s3 kernel:  [<ffffffffc1b02110>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
      Dec 13 08:49:03 fir-md1-s3 kernel:  [<ffffffffc1b05f33>] mdd_changelog_clear+0x653/0x7d0 [mdd]
      Dec 13 08:49:03 fir-md1-s3 kernel:  [<ffffffffc1b08153>] mdd_iocontrol+0x163/0x540 [mdd]
      Dec 13 08:49:03 fir-md1-s3 kernel:  [<ffffffffc198684c>] mdt_iocontrol+0x5ec/0xb00 [mdt]
      Dec 13 08:49:03 fir-md1-s3 kernel:  [<ffffffffc19871e4>] mdt_set_info+0x484/0x490 [mdt]
      Dec 13 08:49:03 fir-md1-s3 kernel:  [<ffffffffc125b89a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Dec 13 08:49:03 fir-md1-s3 kernel:  [<ffffffffc120073b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Dec 13 08:49:03 fir-md1-s3 kernel:  [<ffffffffc12040a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Dec 13 08:49:03 fir-md1-s3 kernel:  [<ffffffffa74c5e61>] kthread+0xd1/0xe0
      Dec 13 08:49:03 fir-md1-s3 kernel:  [<ffffffffa7b95ddd>] ret_from_fork_nospec_begin+0x7/0x21
      Dec 13 08:49:03 fir-md1-s3 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      Dec 13 08:49:03 fir-md1-s3 kernel: LustreError: dumping log to /tmp/lustre-log.1639414143.24707
      Dec 13 08:49:11 fir-md1-s3 kernel: LNet: Service thread pid 24707 completed after 369.43s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      

      Simply stopping the monitoring script (i.e., no longer reading changelog_size on the MDS) resolved the issue: changelog commit rates immediately went back up and the server load dropped to a reasonable level.

      I wanted to raise this issue so it can be improved in the future. It is important for us to be able to monitor the size of the changelogs stored on each MDT, and it would be nice to have a more efficient way of doing so. Thanks!
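      For what it's worth, a possible workaround on the monitoring side (not a fix for the contention itself) might be to track the backlog in records rather than bytes via mdd.*.changelog_users, which does not appear to require walking the whole changelog catalog. The sketch below assumes the 2.12-style changelog_users output; the parsing may need adjusting:

      #!/bin/bash
      # Rough sketch (assumed 2.12-style changelog_users output): report the
      # backlog per MDT in records, computed as "current index" minus the
      # slowest registered consumer's index, instead of reading changelog_size.
      for param in $(lctl list_param "mdd.*-MDT*.changelog_users" 2>/dev/null); do
          mdt=$(echo "$param" | cut -d. -f2)        # e.g. fir-MDT0002
          out=$(lctl get_param -n "$param")
          # First line: "current index: 123456"
          current=$(echo "$out" | awk '/current index:/ {print $3}')
          # Consumer lines start with the changelog user id, e.g. "cl1 120000"
          oldest=$(echo "$out" | awk '$1 ~ /^cl[0-9]+/ {print $2}' | sort -n | head -1)
          [ -n "$current" ] && [ -n "$oldest" ] &&
              echo "${mdt} changelog_backlog_records=$((current - oldest))"
      done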


          People

            Assignee: pjones (Peter Jones)
            Reporter: sthiell (Stephane Thiell)
