Lustre / LU-15468

Lustre Client Lockup/OOM Conditions Under Buffered I/O


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
    • Environment: 5.4.0-1065-azure #68~18.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
    • Severity: 2

    Description

      Recently we've switched from ZFS to ldiskfs as the backing filesystem to work around some performance issues, and I'm finding that when I put the cluster under load (with as little as a single client) I can almost completely lock up the client. SSH (even existing sessions) stalls, and iostat, top, etc. all freeze for 20 to 200 seconds. This alleviates for small windows and recurs as long as I leave the I/O-generating process in existence. The client reports extremely high CPU and RAM usage, consumed almost exclusively by 'system'-tagged work. This is on 2.14.0, but I've reproduced it on more or less the HOL for master-next. If I do direct I/O, performance is fantastic and I have no such issues regarding CPU/memory pressure.
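      For concreteness, a minimal reproducer is along these lines (the mount point /mnt/lustre and file sizes are illustrative; any large sequential buffered write shows the behavior):

```shell
# Hypothetical reproducer; /mnt/lustre stands in for the client mount point.
# Buffered write: on the affected 16GB client this balloons page-cache
# usage until the whole machine stalls.
dd if=/dev/zero of=/mnt/lustre/ddtest bs=1M count=16384

# Direct I/O write: bypasses the client page cache and behaves well.
dd if=/dev/zero of=/mnt/lustre/ddtest bs=1M count=16384 oflag=direct
```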

      The test cluster I've been using is a single client with 16GB of RAM, one MDS/MGS with 64GB of RAM, and two OSSes with 64GB of RAM each. The issue occurs (but without the complete stall-out) with 32GB clients, and doesn't appear to occur with 64GB clients. At that size, memory usage maxes out at about 40GB, with enough free at all times that it never OOM-kills any userspace processes, hangs, or shows performance degradation.

      In dmesg I see consistent spew on the client about:

      [19548.601651] LustreError: 30918:0:(events.c:208:client_bulk_callback()) event type 1, status 5, desc 00000000b69b83b0
      [19548.662647] LustreError: 30917:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 000000009ef2fc22
      [19549.153590] Lustre: lustrefs-OST0000-osc-ffff8d52a9c52800: Connection to lustrefs-OST0000 (at 10.1.98.7@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [19549.153621] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1642535831/real 1642535833]  req@0000000002361e2d x1722317313374336/t0(0) o4->lustrefs-OST0001-osc-ffff8d52a9c52800@10.1.98.10@tcp:6/4 lens 488/448 e 0 to 1 dl 1642535883 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''

      [19549.153623] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 4 previous similar messages

      But I actually think this is a symptom of extreme memory pressure causing the client to time requests out, not a cause.

      Testing with obdfilter-survey (local) on the OSS side shows expected performance of the disk subsystem. Testing with lnet_selftest from client to OSS shows expected performance. In neither case do I see the high CPU or memory pressure issues.
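      For reference, the network baseline was gathered with a stock lnet_selftest session along these lines (group names are arbitrary, the client NID is a placeholder, and 10.1.98.7@tcp is OSS0 per the dmesg above; this requires a live cluster, so treat it as a sketch):

```shell
# lnet_selftest bulk read/write baseline between client and OSS.
modprobe lnet_selftest
export LST_SESSION=$$
lst new_session rw_baseline
lst add_group clients <client-NID>@tcp     # placeholder NID
lst add_group servers 10.1.98.7@tcp
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
lst run bulk
lst stat clients servers                   # Ctrl-C to stop sampling
lst stop bulk
lst end_session
```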

      Reducing a variety of lctl tunables that appear to govern memory allowances for Lustre clients does not improve the situation.  In almost all cases I'm using defaults for 2.14.0, but I've further reduced the following without any observed improvement:

      ~# lctl set_param llite.*.max_cached_mb=1024

      ~# lctl set_param osc.*.max_dirty_mb=512

      ~# lctl set_param osc.*.max_pages_per_rpc=128

      ~# lctl set_param osc.*.max_rpcs_in_flight=2
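      As a sanity check on why I don't think these tunables are at fault: back-of-the-envelope, with the two OSTs in this cluster (one OSC apiece on the client), those caps should bound Lustre's cached/dirty footprint to roughly 1-2GB, nowhere near the 16GB actually consumed. A quick sketch of that arithmetic (figures illustrative, not measured):

```shell
# Rough ceiling implied by the reduced tunables above, assuming 2 OSTs
# (one OSC per OST on the client).
osts=2
max_dirty_mb=512      # per-OSC dirty cap (osc.*.max_dirty_mb)
max_cached_mb=1024    # client-wide read cache cap (llite.*.max_cached_mb)
echo "dirty ceiling:  $((osts * max_dirty_mb)) MB"
echo "cached ceiling: ${max_cached_mb} MB"
```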

      I've also tried reducing client- and server-side sysctls to force things to flush out more frequently (since O_DIRECT seems to DTRT), but that doesn't help me either:

      sudo sysctl -w vm.dirty_ratio=5
      sudo sysctl -w vm.dirty_background_ratio=2
      sudo sysctl -w vm.dirty_writeback_centisecs=50
      sudo sysctl -w vm.dirty_expire_centisecs=100

      Even running "sync" in a bash loop doesn't improve the situation - something apparently really thinks it needs all of the memory on the machine until some magic event unplugs things and suddenly I get 8-10GB of memory back at once.
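      The sync loop in question was roughly the following (the one-second interval is arbitrary):

```shell
# Keep forcing writeback for the lifetime of the workload; did not help.
while true; do sync; sleep 1; done &
SYNC_PID=$!
# ... run the iozone/dd workload here ...
kill "$SYNC_PID"
```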

      By all appearances, a running iozone or even a simple dd process gradually (i.e., over a span of just 10 seconds or so) consumes all 16GB of RAM on the client I'm using. I've generated bcc profile graphs for both on- and off-CPU analysis, and they are utterly boring: they basically just reflect rampant calls to shrink_inactive_list resulting from page_cache_alloc in the presence of extreme memory pressure.
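      For anyone wanting to reproduce the graphs: they came from the stock bcc tools, roughly as below (install paths and the 60-second duration are assumptions; flamegraph.pl is Brendan Gregg's FlameGraph script, not part of bcc):

```shell
# On-CPU samples for 60s, emitted as folded stacks (-f):
/usr/share/bcc/tools/profile -f 60 > on-cpu.folded
# Off-CPU (blocked) time for 60s, same folded format:
/usr/share/bcc/tools/offcputime -f 60 > off-cpu.folded
# Render the attached SVGs:
flamegraph.pl on-cpu.folded > on-cpu.svg
flamegraph.pl --colors=io off-cpu.folded > off-cpu.svg
```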

      I have collected and attached output from issuing a sysrq during the "hang" point when there is no memory available, as well as a set of logs with memory-oriented logs enabled, and a set of logs with network-oriented logs enabled.  Both sets of logs overran my 1GB in-memory buffer, so the start MARK for iozone is not present, but you can find the end MARK.  At the time of the end MARK the used memory is still reported as extremely high - that doesn't tend to resolve until a while after iozone is complete.  I cannot determine with any accuracy what causes memory to free up suddenly, so that part may or may not be covered in the logs.

      I have also attached two SVG files with flamegraphs: one is on-CPU from the start of iozone through the recurring hangs, and the other is the corresponding off-CPU graph. Note that these are all from separate runs (5 in total: one for sysrq, two sets of logs, and two bcc runs), so I would avoid trying to correlate them exactingly, though they all overlap the same phenomenon.

      This is a major blocker for us, so any other information I can provide, experiments I can run, or further logs I can collect you need only ask.

      Attachments

        1. mem.tar.bz2
          50.73 MB
        2. net.tar.bz2
          48.11 MB
        3. off-cpu.svg
          637 kB
        4. on-cpu.svg
          1.08 MB
        5. syslog
          251 kB

        Activity

          People

            wc-triage WC Triage
            elliswilson Ellis Wilson
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated: