CASMTRIAGE-9122: Flood of SQUASHFS errors seen on compute nodes and i…#6550

Open
aasha-hpe wants to merge 21 commits into release/1.7 from CASMTRIAGE-9122-WAR

@aasha-hpe

Description

This is the workaround (WAR) document for the iSCSI issue (CASMTRIAGE-9122) seen after a worker node rebuild during an upgrade.

  • If I added any command snippets, the steps they belong to follow the prompt conventions (see [example][1]).
  • If I added a new directory, I also updated .github/CODEOWNERS with the corresponding team in [Cray-HPE][2].
  • My commits or Pull-Request Title contain my JIRA information, or I do not have a JIRA.

which has automated steps to log out of the iSCSI session, run iscsiadm discovery,
and log in to the iSCSI session.
Comment on lines +39 to +40
CONFIG_FILE="/etc/iscsi/iscsid.conf"
#CONFIG_FILE="/root/Asha/iscsid.conf"

Remove debug artifact. Leftover development path should not ship.

Suggested change
-CONFIG_FILE="/etc/iscsi/iscsid.conf"
-#CONFIG_FILE="/root/Asha/iscsid.conf"
+CONFIG_FILE="/etc/iscsi/iscsid.conf"

cp "$CONFIG_FILE" "${CONFIG_FILE}.bak"

# Set iscsid.safe_logout value 'No'
sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)[Yy][Ee][Ss]/\1No/' "$CONFIG_FILE"

Variable brace consistency. Use ${CONFIG_FILE} everywhere — line 43 already uses ${CONFIG_FILE}.bak.

Suggested change
-sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)[Yy][Ee][Ss]/\1No/' "$CONFIG_FILE"
+sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)[Yy][Ee][Ss]/\1No/' "${CONFIG_FILE}"

Comment on lines +52 to +53
PORTAL=$(iscsiadm -m session | grep $NCN_WORKER | awk '{print $3}' | sed 's/3260.*/3260/')
IQN=$(iscsiadm -m session | grep $NCN_WORKER | awk '{print $4}')

Bug: PORTAL/IQN can be empty or multi-line. If the worker name doesn't match any session, both variables are empty and subsequent iscsiadm commands produce cryptic errors. Also, grep $NCN_WORKER matches worker10 when searching for worker1. Quote and brace all variables, and validate.

Suggested change
-PORTAL=$(iscsiadm -m session | grep $NCN_WORKER | awk '{print $3}' | sed 's/3260.*/3260/')
-IQN=$(iscsiadm -m session | grep $NCN_WORKER | awk '{print $4}')
+PORTAL=$(iscsiadm -m session | grep "${NCN_WORKER}" | awk '{print $3}' | sed 's/3260.*/3260/')
+IQN=$(iscsiadm -m session | grep "${NCN_WORKER}" | awk '{print $4}')
+if [[ -z "${PORTAL}" ]] || [[ -z "${IQN}" ]]; then
+    echo "Error: No iSCSI session found for worker ${NCN_WORKER}"
+    sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}"
+    exit 1
+fi
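
The suggestion above guards against an empty result, but `grep "${NCN_WORKER}"` still matches `worker10` when searching for `worker1`, and more than one matching session would still leave multi-line values. A minimal sketch of one way to close both gaps follows. It assumes `iscsiadm -m session` prints one session per line with the portal in field 3 and the IQN in field 4, as the original `awk` calls imply. `find_session` is a hypothetical helper name, and it reads the listing from standard input so it can be exercised without a live session. It also strips the portal at the first comma rather than at `3260`, which is an assumption about the listing format.

```shell
#!/bin/bash
# Hypothetical helper, not part of the PR: print "PORTAL IQN" for the single
# session line that mentions the given worker name as a whole word.
# Assumes `iscsiadm -m session`-style lines on stdin, with the portal in
# field 3 (followed by ",<tag>") and the IQN in field 4.
find_session() {
    local worker="$1"
    local matches count

    if [ -z "${worker}" ]; then
        echo "Error: worker name is empty" >&2
        return 1
    fi

    # -w requires a whole-word match, so "worker1" does not match "worker10";
    # -F treats the name as a fixed string rather than a regular expression.
    matches=$(grep -w -F -- "${worker}" | awk '{ sub(/,.*/, "", $3); print $3, $4 }')

    if [ -z "${matches}" ]; then
        echo "Error: No iSCSI session found for worker ${worker}" >&2
        return 1
    fi

    # More than one output line means more than one session matched.
    count=$(printf '%s\n' "${matches}" | grep -c '')
    if [ "${count}" -ne 1 ]; then
        echo "Error: ${count} iSCSI sessions matched worker ${worker}" >&2
        return 1
    fi

    printf '%s\n' "${matches}"
}
```

In the script this could be used as `session=$(iscsiadm -m session | find_session "${NCN_WORKER}")`, followed by `PORTAL=${session%% *}` and `IQN=${session#* }`. On failure the caller would still need to reset `iscsid.safe_logout` before exiting, as the suggested change does.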


# Logout the iSCSI session

iscsiadm -m node -T $IQN -p $PORTAL -u

Quote and brace variables to prevent word splitting and stay consistent.

Suggested change
-iscsiadm -m node -T $IQN -p $PORTAL -u
+iscsiadm -m node -T "${IQN}" -p "${PORTAL}" -u

Comment on lines +62 to +63
echo "Logging out of iSCSI session with $NCN_WORKER failed, so exiting by resetting iscsid.safe_logout to 'Yes' "
sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "$CONFIG_FILE"

Brace consistency in the error path.

Suggested change
-echo "Logging out of iSCSI session with $NCN_WORKER failed, so exiting by resetting iscsid.safe_logout to 'Yes' "
-sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "$CONFIG_FILE"
+echo "Logging out of iSCSI session with ${NCN_WORKER} failed, so exiting by resetting iscsid.safe_logout to 'Yes'"
+sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}"

Comment on lines +67 to +73
# Perform iscsiadm discovery

iscsiadm -m discovery -t sendtargets -p $PORTAL

# Login to iSCSI session

iscsiadm -m node -T $IQN -p $PORTAL -l

Bug: No error checking on discovery or login. If either fails (network issue, target not ready), the script exits 0 — the operator thinks it worked but the iSCSI session is broken.

Suggested change
-# Perform iscsiadm discovery
-iscsiadm -m discovery -t sendtargets -p $PORTAL
-# Login to iSCSI session
-iscsiadm -m node -T $IQN -p $PORTAL -l
+# Perform iscsiadm discovery
+iscsiadm -m discovery -t sendtargets -p "${PORTAL}"
+if [ $? -ne 0 ]; then
+    echo "Error: Discovery failed for portal ${PORTAL}"
+    sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}"
+    exit 1
+fi
+# Login to iSCSI session
+iscsiadm -m node -T "${IQN}" -p "${PORTAL}" -l
+if [ $? -ne 0 ]; then
+    echo "Error: Login failed for ${IQN} at ${PORTAL}"
+    sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}"
+    exit 1
+fi
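
The suggested change repeats the same `sed` restore in every failure branch, which is easy to miss when a new step is added. One alternative, sketched here as an assumption rather than what the PR does, is to register the restore once with `trap ... EXIT` so that it runs on every exit path. `restore_safe_logout` and `run_with_safe_logout_disabled` are hypothetical names; the `sed` expressions are the ones from the script under review.

```shell
#!/bin/bash
# Hypothetical sketch, not part of the PR: restore iscsid.safe_logout exactly
# once, however the wrapped steps end.
CONFIG_FILE="${CONFIG_FILE:-/etc/iscsi/iscsid.conf}"

restore_safe_logout() {
    sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}"
}

# Run the given command with safe_logout set to 'No'. The subshell keeps the
# EXIT trap local, so the setting is restored when the command finishes,
# fails, or calls exit, and the command's status is passed back to the caller.
run_with_safe_logout_disabled() {
    (
        trap restore_safe_logout EXIT
        sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)[Yy][Ee][Ss]/\1No/' "${CONFIG_FILE}" || exit 1
        "$@"
        exit "$?"
    )
}
```

With this shape, the logout, discovery, and login steps could live in one function that simply returns non-zero on the first failure, and the caller would invoke it through `run_with_safe_logout_disabled`.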


# Set back iscsid.safe_logout from 'No' to 'Yes'

sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "$CONFIG_FILE"

Brace consistency.

Suggested change
-sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "$CONFIG_FILE"
+sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}"

@github-actions

This pull-request has not had activity in over 20 days and is being marked as stale.

github-actions bot added the Stale label (Hasn't had activity in over 30 days) on May 13, 2026
github-actions bot removed the Stale label (Hasn't had activity in over 30 days) on Jun 1, 2026
[`iscsi_post_rollout.sh`](../../scripts/operations/iscsi_sbps/iscsi_post_rollout.sh) and it is required
to copy the script onto iSCSI initiator (Compute/UAN) nodes and run using `pdsh` command from the master
node:

Reword it as:

The solution is to re-establish the iSCSI connection after a worker node rebuild by logging out of the existing iSCSI session, rediscovering the LUNs, and logging back in using the iscsiadm command. These steps are automated in a script [iscsi_post_rollout.sh](../../scripts/operations/iscsi_sbps/iscsi_post_rollout.sh) and must be copied to the iSCSI initiator nodes (Compute/UAN) and executed from the master node using the pdsh command.

Contributor Author

How about this text:

The solution is to re-establish the iSCSI session after a worker node rebuild by logging out the existing
stale iSCSI session with rebuilt worker node, rediscover the LUNs and logging in back using the iscsiadm
command. The steps are automated the script iscsi_post_rollout.sh and must be copied to the iSCSI initiator nodes (Compute/UAN) and executed from the master node using
pdsh command.

ok

Please reword "The solution is to re-establish the iSCSI session after a worker node rebuild by logging out the existing stale iSCSI session with rebuilt worker node,...": using both "after a worker node rebuild" and "with rebuilt worker node" in the same sentence is confusing.

"The steps are automated in the script"

Please remove "and executed from the master node using pdsh command." as it is very confusing.

Reword the text below:

"Run the script scp_iscsi_scr.sh to copy iscsi_post_rollout.sh onto compute nodes. Then to run iscsi_post_rollout.sh on compute nodes:"

to something like:

Step 1: Run scp_iscsi_scr.sh
Note: <copies iscsi_post_rollout.sh to the compute nodes>
Step 2: Run iscsi_post_rollout.sh

Comment thread operations/iscsi_sbps/iscsi_steps_post_rollout.md Outdated
Comment thread scripts/operations/iscsi_sbps/iscsi_post_rollout.sh Outdated
iSCSI client nodes became `un-responsive` where most of the commands failed with
`Bus error`.

Example command:

Add the node type prefix, as per the standard used throughout the doc.


# This is the script to copy iscsi_post_rollout.sh to all compute nodes.

#!/bin/bash

Remove.

))

for alias in "${aliases[@]}"; do
echo "$alias-nmn"

Remove echo statement.


for alias in "${aliases[@]}"; do
echo "$alias-nmn"
scp iscsi_post_rollout.sh root@$alias-nmn:/root/

Please avoid copying to "/" or "/root"; instead use "/tmp" or a new directory that you create.

# shellcheck disable=SC2207

aliases=($(
sat status --filter role=compute \

As you have mentioned, please handle UAN nodes too.
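
Taken together, the last three comments ask the copy script to stop echoing each host, to copy somewhere other than `/root`, and to cover UAN nodes as well as computes. A minimal sketch under those assumptions follows. `copy_to_nodes` and the `SCP_CMD` override are hypothetical; the `-nmn` host suffix and the `root@` login come from the script under review. How the list of node aliases is built is left to the caller, because that part of the script is not shown in full.

```shell
#!/bin/bash
# Hypothetical sketch, not part of the PR: copy a script to the given nodes.
# Usage: copy_to_nodes SCRIPT DEST_DIR ALIAS...
# ALIAS values are node aliases without the "-nmn" suffix; compute and UAN
# aliases can be passed in the same call.
copy_to_nodes() {
    local script="$1"
    local dest_dir="$2"
    local alias failed=0
    shift 2

    for alias in "$@"; do
        # SCP_CMD exists only so the loop can be dry-run (e.g. SCP_CMD=echo).
        if ! "${SCP_CMD:-scp}" "${script}" "root@${alias}-nmn:${dest_dir}/"; then
            echo "Error: copy to ${alias}-nmn failed" >&2
            failed=1
        fi
    done

    # Non-zero if any copy failed, so the caller can stop before running
    # the script on the nodes.
    return "${failed}"
}
```

For example, `copy_to_nodes iscsi_post_rollout.sh /tmp "${aliases[@]}"` would reuse the `aliases` array the script already builds, with UAN aliases appended to it by whatever query the author chooses.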

4 participants