[LU-10348] replay-ost-single test 5 hangs during OST fail/recovery Created: 07/Dec/17  Updated: 12/Aug/22  Resolved: 12/Aug/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-ost-single test_5 hangs after starting iozone, failing the OST, and waiting for it to restart. Logs for this hang are at https://testing.hpdd.intel.com/test_sets/fd70c104-dae4-11e7-8027-52540065bddc

There is not much in the console or dmesg logs to indicate what the issue is.
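
For context, the failing path is roughly the following (a minimal sketch pieced together from the test log below and the standard test-framework.sh helpers; the exact body of test_5 in lustre/tests/replay-ost-single.sh may differ):

# run iozone against the Lustre mount while ost1 is failed over
iozone -i 0 -i 1 -s 64m -f $MOUNT/iozone.$$ &   # iozone arguments are illustrative
IOZONE_PID=$!
fail ost1           # test-framework.sh helper: unmount, "reboot" the facet, remount
wait $IOZONE_PID    # the hang reported here: iozone never completes after recovery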

The last thing seen in the client's test log after failing the OST is:

Failing ost1 on trevis-3vm3
CMD: trevis-3vm3 grep -c /mnt/lustre-ost1' ' /proc/mounts
Stopping /mnt/lustre-ost1 (opts:) on trevis-3vm3
CMD: trevis-3vm3 umount -d /mnt/lustre-ost1
CMD: trevis-3vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
reboot facets: ost1
Failover ost1 to trevis-3vm3
06:46:24 (1512542784) waiting for trevis-3vm3 network 900 secs ...
06:46:24 (1512542784) network interface is UP
CMD: trevis-3vm3 hostname
mount facets: ost1
CMD: trevis-3vm3 test -b /dev/lvm-Role_OSS/P1
CMD: trevis-3vm3 e2label /dev/lvm-Role_OSS/P1
Starting ost1:   /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
CMD: trevis-3vm3 mkdir -p /mnt/lustre-ost1; mount -t lustre   		                   /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
CMD: trevis-3vm3 /usr/sbin/lctl get_param -n health_check
CMD: trevis-3vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
trevis-3vm3: trevis-3vm3.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
CMD: trevis-3vm3 e2label /dev/lvm-Role_OSS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: trevis-3vm3 e2label /dev/lvm-Role_OSS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: trevis-3vm3 e2label /dev/lvm-Role_OSS/P1 2>/dev/null
Started lustre-OST0000
CMD: trevis-3vm1.trevis.hpdd.intel.com,trevis-3vm2 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid 
trevis-3vm1: trevis-3vm1.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
trevis-3vm2: trevis-3vm2.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
trevis-3vm1: CMD: trevis-3vm1.trevis.hpdd.intel.com lctl get_param -n at_max
trevis-3vm2: CMD: trevis-3vm2.trevis.hpdd.intel.com lctl get_param -n at_max
trevis-3vm2: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
trevis-3vm1: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
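
The wait_import_state_mount calls above poll the client-side OSC import until it reports FULL, i.e. until recovery with the restarted OST has completed from the clients' point of view. An equivalent manual check (illustrative; the parameter glob depends on the client's device naming) would be:

lctl get_param osc.lustre-OST0000-osc-*.ost_server_uuid
lctl get_param osc.lustre-OST0000-osc-*.import | grep state   # e.g. "state: FULL"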

Looking at the OST (vm3) console log, we see the following, followed by a node reboot:

[52386.276689] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-3vm1.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
[52386.308628] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-3vm2.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
[52386.442463] Lustre: DEBUG MARKER: trevis-3vm1.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
[52386.593174] Lustre: DEBUG MARKER: trevis-3vm2.trevis.hpdd.intel.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-ffff*.ost_server_uuid
[52388.284557] Lustre: lustre-OST0000: deleting orphan objects from 0x0:12940 to 0x0:12969
[52388.881925] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
[52389.064745] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
[52389.105603] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
[52392.356528] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 2 sec
[53031.378937] LNet: Service thread pid 16692 completed after 53.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[53033.710662] LNet: Service thread pid 18539 completed after 55.86s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[53074.558626] LNet: Service thread pid 1576 completed after 76.94s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[53290.176331] LNet: Service thread pid 16692 completed after 42.12s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[53584.271884] LNet: Service thread pid 18539 completed after 77.68s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[53808.647337] LNet: Service thread pid 1575 completed after 97.54s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[53864.707967] LNet: Service thread pid 18534 completed after 93.30s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[54237.094039] LNet: Service thread pid 1576 completed after 188.49s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[54335.118180] LNet: Service thread pid 18535 completed after 195.92s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[54527.722072] LNet: Service thread pid 18537 completed after 70.42s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[54527.728118] LNet: Skipped 1 previous similar message
[54994.845138] LNet: Service thread pid 20092 completed after 133.76s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[54994.851195] LNet: Skipped 2 previous similar messages
[55644.672185] LNet: Service thread pid 18535 completed after 87.13s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[55644.678120] LNet: Skipped 4 previous similar messages
[56366.669572] LNet: Service thread pid 18534 completed after 108.69s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[56366.675716] LNet: Skipped 8 previous similar messages
[57203.664122] LNet: Service thread pid 18534 completed after 87.49s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[57203.670055] LNet: Skipped 6 previous similar messages
[57422.974061] Lustre: 4713:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1512547818/real 1512547818]  req@ffff88000be18300 x1585999144680672/t0(0) o400->lustre-MDT0000-lwp-OST0001@10.9.4.19@tcp:12/10 lens 224/224 e 0 to 1 dl 1512547825 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[57422.983336] Lustre: 4713:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 35 previous similar messages
[57422.986340] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.4.19@tcp) was lost; in progress operations using this service will wait for recovery to complete
[57422.992053] Lustre: Skipped 6 previous similar messages
[57422.994697] LustreError: 166-1: MGC10.9.4.19@tcp: Connection to MGS (at 10.9.4.19@tcp) was lost; in progress operations using this service will fail

<ConMan> Console [trevis-3vm3] disconnected from <trevis-3:6002> at 12-06 08:10.

Note that some instances of this failure do not include the 'too many service threads' message in the OST logs.

This hang looks different from LU-9273.
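
For anyone trying to reproduce this outside autotest, the test should be runnable standalone with the usual test-framework invocation (environment setup is assumed; the tests directory is taken from the PATH shown in the log above):

ONLY=5 sh /usr/lib64/lustre/tests/replay-ost-single.sh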

Other logs for this test hang are at:
https://testing.hpdd.intel.com/test_sets/e12a8174-dacb-11e7-8027-52540065bddc
https://testing.hpdd.intel.com/test_sets/c7b476bc-d7b7-11e7-a066-52540065bddc



 Comments   
Comment by Sarah Liu [ 22/May/18 ]

+1 on b2_10 https://testing.hpdd.intel.com/test_sets/bff6adca-5d8e-11e8-b9d3-52540065bddc
