[LU-15735] sanity-sec test_19: OSS not responding after "force_sync=1" Created: 12/Apr/22  Updated: 12/Apr/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for eaujames <eaujames@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/bdc21484-cfd2-46d9-a0a9-a46d1a5682b2

test_19 failed with the following error:

test_19 returned 1

This failure appears in a review-ldiskfs-arm session, but the OSS was running on x86_64.

Client test:

CMD: onyx-124vm1 lctl set_param -n osd*.*OS*.force_sync=1
...
CMD: onyx-81vm3 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
CMD: onyx-81vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: onyx-81vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: onyx-81vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
...
Delete is not completed in 37 seconds
CMD: onyx-81vm3 /usr/sbin/lctl get_param osc.*MDT*.sync_*
osc.lustre-OST0000-osc-MDT0000.sync_changes=0
osc.lustre-OST0000-osc-MDT0000.sync_in_flight=0
osc.lustre-OST0000-osc-MDT0000.sync_in_progress=1
...
CMD: onyx-91vm5.onyx.whamcloud.com runas -u0 -g0 -G0 lfs quota -q /mnt/lustre
running as uid/gid/euid/egid 0/0/0/0, groups: 0
 [lfs] [quota] [-q] [/mnt/lustre]
/usr/lib64/lustre/tests/sanity-sec.sh: line 1293: [21448]: syntax error: operand expected (error token is "[21448]")
CMD: onyx-81vm3 /usr/sbin/lctl nodemap_del c0
CMD: onyx-81vm3 /usr/sbin/lctl nodemap_del c1
CMD: onyx-81vm3 /usr/sbin/lctl nodemap_modify --name default --property admin --value 0
CMD: onyx-81vm3 /usr/sbin/lctl get_param -n nodemap.active
CMD: onyx-81vm3 /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
On MGS 10.240.26.198, default.admin_nodemap = nodemap.default.admin_nodemap=0
CMD: onyx-124vm1 /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
onyx-124vm1: ssh: connect to host onyx-124vm1 port 22: No route to host
pdsh@onyx-91vm5: onyx-124vm1: ssh exited with exit code 255
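For context, the client-side sequence above is essentially a sync-and-poll pattern: force a sync on the MDT-side OSC devices, then repeatedly read the osc.*MDT*.sync_* counters until they all drop to zero or a timeout expires. A minimal sketch of that pattern (not the actual sanity-sec.sh code; MAX_WAIT and the variable names are assumptions, and the real test runs these commands remotely on the MDS):

    # Force MDT->OST sync, then wait for all pending sync work to drain.
    MAX_WAIT=40
    lctl set_param -n 'os[cd]*.*MD*.force_sync' 1
    elapsed=0
    pending=1
    while (( elapsed < MAX_WAIT )); do
            # Sum sync_changes + sync_in_flight + sync_in_progress over all OSC devices.
            pending=$(lctl get_param -n 'osc.*MDT*.sync_*' | awk '{s += $1} END {print s+0}')
            (( pending == 0 )) && break
            sleep 1
            elapsed=$((elapsed + 1))
    done
    (( pending == 0 )) || echo "Delete is not completed in $elapsed seconds"

The later bash error at sanity-sec.sh line 1293 looks like a secondary symptom of the same outage: lfs quota prints values in square brackets when some targets could not be queried, and an arithmetic test that expects a plain number then fails exactly as logged. A hypothetical two-line reproduction of that class of error (the variable name and limit are made up):

    used="[21448]"                    # bracketed value, as lfs quota reports when a target did not answer
    (( used > 1024 )) && echo "over"  # -> syntax error: operand expected (error token is "[21448]")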

MDS dmesg:

[18117.644399] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
[18118.434329] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
[18120.214180] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
....
[18133.582033] Lustre: 11600:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1649444731/real 1649444731]  req@000000009702997f x1729549387801536/t0(0) o13->lustre-OST0006-osc-MDT0000@10.240.30.32@tcp:7/4 lens 224/368 e 0 to 1 dl 1649444738 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-6-0.0'
[18133.582035] Lustre: 11599:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1649444731/real 1649444731]  req@00000000f3577720 x1729549387801600/t0(0) o13->lustre-OST0002-osc-MDT0000@10.240.30.32@tcp:7/4 lens 224/368 e 0 to 1 dl 1649444738 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-2-0.0'
[18133.582063] Lustre: lustre-OST0002-osc-MDT0000: Connection to lustre-OST0002 (at 10.240.30.32@tcp) was lost; in progress operations using this service will wait for recovery to complete

OST dmesg:

[18108.151196] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
[18108.941076] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap
[18115.611844] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1

No messages are logged on the OST after "force_sync=1", but no crash has been reported either; the OSS simply appears to have disappeared.
Hard reset? Misconfigured kdumpd? Network issue?
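If it helps triage, a few standard checks on the OSS node (once it is reachable again) would distinguish between those possibilities; the commands below are generic, and the assumption is a RHEL-style node with kexec-tools and journald:

    systemctl status kdump                 # was the kdump service running and the crash kernel armed?
    cat /sys/kernel/kexec_crash_loaded     # 1 = crash kernel loaded, 0 = kdump could not have captured a panic
    last -x reboot shutdown | head         # did the node reboot (hard reset) around the failure time?
    journalctl -k -b -1 | tail -n 100      # kernel messages from the previous boot, if journald kept them
    ls -l /var/crash/                      # any vmcore captured even though nothing reached the console?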

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-sec test_19 - test_19 returned 1

