CASMTRIAGE-9122: Flood of SQUASHFS errors seen on compute nodes and i…#6550
aasha-hpe wants to merge 21 commits into
…SCSI multipathing is broken after worker node rebuild during upgrade
which has automated steps to log out of the iSCSI session, run iscsiadm discovery, and log in to the iSCSI session.
| CONFIG_FILE="/etc/iscsi/iscsid.conf" | ||
| #CONFIG_FILE="/root/Asha/iscsid.conf" |
Remove debug artifact. Leftover development path should not ship.
| CONFIG_FILE="/etc/iscsi/iscsid.conf" | |
| #CONFIG_FILE="/root/Asha/iscsid.conf" | |
| CONFIG_FILE="/etc/iscsi/iscsid.conf" |
| cp "$CONFIG_FILE" "${CONFIG_FILE}.bak" | ||
| # Set iscsid.safe_logout value 'No' | ||
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)[Yy][Ee][Ss]/\1No/' "$CONFIG_FILE" |
Variable brace consistency. Use ${CONFIG_FILE} everywhere; line 43 already uses ${CONFIG_FILE}.bak.
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)[Yy][Ee][Ss]/\1No/' "$CONFIG_FILE" | |
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)[Yy][Ee][Ss]/\1No/' "${CONFIG_FILE}" |
| PORTAL=$(iscsiadm -m session | grep $NCN_WORKER | awk '{print $3}' | sed 's/3260.*/3260/') | ||
| IQN=$(iscsiadm -m session | grep $NCN_WORKER | awk '{print $4}') |
Bug: PORTAL/IQN can be empty or multi-line. If the worker name doesn't match any session, both variables are empty and subsequent iscsiadm commands produce cryptic errors. Also, grep $NCN_WORKER matches worker10 when searching for worker1. Quote and brace all variables, and validate.
| PORTAL=$(iscsiadm -m session | grep $NCN_WORKER | awk '{print $3}' | sed 's/3260.*/3260/') | |
| IQN=$(iscsiadm -m session | grep $NCN_WORKER | awk '{print $4}') | |
| PORTAL=$(iscsiadm -m session | grep "${NCN_WORKER}" | awk '{print $3}' | sed 's/3260.*/3260/') | |
| IQN=$(iscsiadm -m session | grep "${NCN_WORKER}" | awk '{print $4}') | |
| if [[ -z "${PORTAL}" ]] || [[ -z "${IQN}" ]]; then | |
| echo "Error: No iSCSI session found for worker ${NCN_WORKER}" | |
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}" | |
| exit 1 | |
| fi |
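Note that the suggestion above still uses a substring `grep`, so the `worker1`/`worker10` collision remains, and a multi-line result is not rejected. A minimal sketch of an exact-match lookup follows. It assumes the usual `iscsiadm -m session` line layout (`tcp: [N] <portal>,<tpgt> <iqn> ...`) and, as an unverified assumption, that the worker hostname is the part of the IQN after the last colon. `find_worker_sessions` is a hypothetical helper name, not something in the PR.

```shell
#!/bin/bash

# Hypothetical helper: read `iscsiadm -m session` style lines on stdin and
# print "PORTAL IQN" for each session whose IQN ends in ":<worker>".
# ASSUMPTION: the worker hostname is the text after the last ":" in the IQN.
find_worker_sessions() {
    awk -v w="$1" '
        {
            # Compare the last ":"-separated piece of the IQN exactly,
            # so "ncn-w001" does not match "ncn-w0010".
            n = split($4, parts, ":")
            if (n > 1 && parts[n] == w) {
                portal = $3
                sub(/,.*/, "", portal)   # drop the ",<tpgt>" suffix
                print portal, $4
            }
        }'
}

# Possible use in the script (not executed here):
#   iscsiadm -m session | find_worker_sessions "${NCN_WORKER}"
```

An empty result means no session exists for the worker. More than one line means several portals are logged in, in which case the logout and login steps would probably need to loop over each line instead of using a single PORTAL/IQN pair.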
| # Logout the iSCSI session | ||
| iscsiadm -m node -T $IQN -p $PORTAL -u |
Quote and brace variables to prevent word splitting and stay consistent.
| iscsiadm -m node -T $IQN -p $PORTAL -u | |
| iscsiadm -m node -T "${IQN}" -p "${PORTAL}" -u |
| echo "Logging out of iSCSI session with $NCN_WORKER failed, so exiting by resetting iscsid.safe_logout to 'Yes' " | ||
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "$CONFIG_FILE" |
Brace consistency in the error path.
| echo "Logging out of iSCSI session with $NCN_WORKER failed, so exiting by resetting iscsid.safe_logout to 'Yes' " | |
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "$CONFIG_FILE" | |
| echo "Logging out of iSCSI session with ${NCN_WORKER} failed, so exiting by resetting iscsid.safe_logout to 'Yes'" | |
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}" |
| # Perform iscsiadm discovery | ||
| iscsiadm -m discovery -t sendtargets -p $PORTAL | ||
| # Login to iSCSI session | ||
| iscsiadm -m node -T $IQN -p $PORTAL -l |
Bug: No error checking on discovery or login. If either fails (network issue, target not ready), the script still exits 0, so the operator thinks it worked while the iSCSI session is broken.
| # Perform iscsiadm discovery | |
| iscsiadm -m discovery -t sendtargets -p $PORTAL | |
| # Login to iSCSI session | |
| iscsiadm -m node -T $IQN -p $PORTAL -l | |
| # Perform iscsiadm discovery | |
| iscsiadm -m discovery -t sendtargets -p "${PORTAL}" | |
| if [ $? -ne 0 ]; then | |
| echo "Error: Discovery failed for portal ${PORTAL}" | |
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}" | |
| exit 1 | |
| fi | |
| # Login to iSCSI session | |
| iscsiadm -m node -T "${IQN}" -p "${PORTAL}" -l | |
| if [ $? -ne 0 ]; then | |
| echo "Error: Login failed for ${IQN} at ${PORTAL}" | |
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}" | |
| exit 1 | |
| fi |
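With this suggestion the restore `sed` is repeated in every error path. One way to avoid that, sketched here rather than proposed as the final form, is to restore `iscsid.safe_logout` from an `EXIT` trap in a subshell. `set_safe_logout` and `with_safe_logout_disabled` are hypothetical names, and the regex is loosened to accept any current value, which differs from the script's explicit Yes/No patterns. It relies on GNU `sed` (`-i`, `\s`), as the script already does.

```shell
#!/bin/bash

# Hypothetical helper: set iscsid.safe_logout in file $1 to value $2 (Yes or No).
set_safe_logout() {
    local file="$1" value="$2"
    sed -i "s/^\(\s*iscsid\.safe_logout\s*=\s*\)[A-Za-z]*/\1${value}/" "${file}"
}

# Hypothetical wrapper: run the given command with safe_logout set to No.
# The subshell's EXIT trap sets the value back to Yes whether the command
# succeeds or fails, and the command's exit status is passed through.
with_safe_logout_disabled() {
    local file="$1" rc
    shift
    (
        trap 'set_safe_logout "${file}" Yes' EXIT
        set_safe_logout "${file}" No || exit 1
        "$@"
        rc=$?
        exit "${rc}"
    )
}

# Possible use (not executed here; relogin_worker is a placeholder for the
# logout, discovery, and login steps):
#   with_safe_logout_disabled /etc/iscsi/iscsid.conf relogin_worker
```

The logout, discovery, and login steps can then simply return non-zero on failure without each one having to remember the restore.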
| # Set back iscsid.safe_logout from 'No' to 'Yes' | ||
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "$CONFIG_FILE" |
Brace consistency.
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "$CONFIG_FILE" | |
| sed -i 's/^\(\s*iscsid\.safe_logout\s*=\s*\)No/\1Yes/' "${CONFIG_FILE}" |
This pull-request has not had activity in over 20 days and is being marked as stale.
| [`iscsi_post_rollout.sh`](../../scripts/operations/iscsi_sbps/iscsi_post_rollout.sh) and it is required | ||
| to copy the script onto iSCSI initiator (Compute/UAN) nodes and run using `pdsh` command from the master | ||
| node: | ||
Reword it as:
The solution is to re-establish the iSCSI connection after a worker node rebuild by logging out of the existing iSCSI session, rediscovering the LUNs, and logging back in using the iscsiadm command. These steps are automated in the script [iscsi_post_rollout.sh](../../scripts/operations/iscsi_sbps/iscsi_post_rollout.sh), which must be copied to the iSCSI initiator nodes (Compute/UAN) and executed from the master node using the pdsh command.
How about this text:
The solution is to re-establish the iSCSI session after a worker node rebuild by logging out the existing
stale iSCSI session with rebuilt worker node, rediscover the LUNs and logging in back using the iscsiadm
command. The steps are automated the script iscsi_post_rollout.sh and must be copied to the iSCSI initiator nodes (Compute/UAN) and executed from the master node using
pdsh command.
Please reword "The solution is to re-establish the iSCSI session after a worker node rebuild by logging out the existing stale iSCSI session with rebuilt worker node,..." because using both "after a worker node rebuild" and "with rebuilt worker node" in one sentence is confusing.
"The steps are automated in the script"
Please remove "and executed from the master node using pdsh command." as it is very confusing.
Reword below....
"Run the script scp_iscsi_scr.sh to copy iscsi_post_rollout.sh onto compute nodes. Then to run iscsi_post_rollout.sh on compute nodes:"
to something like below...
Step 1: Run *src.h
Note: <copies *rollout.sh script to computes>
Step 2: Run *rollout.sh
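For Step 2, the `pdsh` target list could be built from the same aliases the copy script uses. This is only a sketch: `pdsh_hosts` is a hypothetical helper, the node aliases in the example are made up, and the script location and any arguments it needs are assumptions.

```shell
#!/bin/bash

# Hypothetical helper: turn node aliases into a comma-separated pdsh target
# list, appending the -nmn suffix that the copy script also uses.
pdsh_hosts() {
    local hosts="" alias
    for alias in "$@"; do
        hosts="${hosts:+${hosts},}${alias}-nmn"
    done
    printf '%s\n' "${hosts}"
}

# Example (not executed here; path and node names are assumptions):
#   pdsh -w "$(pdsh_hosts nid000001 nid000002)" /root/iscsi_post_rollout.sh
```

Keeping the host list in one helper means Step 1 (copy) and Step 2 (run) address exactly the same set of nodes.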
| iSCSI client nodes became `un-responsive` where most of the commands failed with | ||
| `Bus error`. | ||
| Example command: |
Add the node type prefix, as per the standard used throughout the doc.
| # This is the script to copy iscsi_post_rollout.sh to all compute nodes. | ||
| #!/bin/bash |
| )) | ||
| for alias in "${aliases[@]}"; do | ||
| echo "$alias-nmn" |
Remove echo statement.
| for alias in "${aliases[@]}"; do | ||
| echo "$alias-nmn" | ||
| scp iscsi_post_rollout.sh root@$alias-nmn:/root/ |
Please avoid copying to "/" or "/root"; use "/tmp" or a newly created directory of your own instead.
| # shellcheck disable=SC2207 | ||
| aliases=($( | ||
| sat status --filter role=compute \ |
As you mentioned, please handle UAN nodes too.
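Taken together, the comments on the copy script (no `echo`, copy to `/tmp`, cover UAN as well) could look roughly like the sketch below. It is not the PR's code: `copy_script`, `DEST_DIR`, and `SCP_CMD` are hypothetical names, the alias list is passed in by the caller instead of being read from `sat status`, and the node names in the example are made up.

```shell
#!/bin/bash

# ASSUMPTION: /tmp as destination, following the review comment.
DEST_DIR="${DEST_DIR:-/tmp}"
# Copy command; overridable (for example SCP_CMD=echo) for a dry run.
SCP_CMD="${SCP_CMD:-scp}"

# Hypothetical helper: copy script $1 to every node alias that follows.
# Compute and UAN aliases are treated the same way. Returns non-zero if
# any copy fails, but keeps going so all failures are reported.
copy_script() {
    local script="$1" alias failed=0
    shift
    for alias in "$@"; do
        if ! "${SCP_CMD}" "${script}" "root@${alias}-nmn:${DEST_DIR}/"; then
            echo "Error: copy to ${alias}-nmn failed" >&2
            failed=1
        fi
    done
    return "${failed}"
}

# Example (not executed here; aliases are made up):
#   copy_script iscsi_post_rollout.sh nid000001 nid000002 uan01
```

Reporting failed copies instead of printing every alias keeps the output quiet on success while still telling the operator which nodes were missed.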
Description
This is the WAR document for the iSCSI issue (CASMTRIAGE-9122) seen after worker node rebuild during the upgrade.
.github/CODEOWNERS with the corresponding team in [Cray-HPE][2].