Common Tasks and Troubleshooting
Kubernetes is a robust and powerful system, but with that system comes a verbose vernacular. Likewise, the Kubectl command has many capabilities for interacting with a Kubernetes system. In response to that, this article subscribes to the Pareto Principle, also known as the “80-20 rule,” and provides the 20% of commands that should cover 80% of the work you will need to do with Kubectl.
We also have some conventions that we use in this article such as:
- Code blocks
- These should allow you to hover over them and reveal a clipboard copy icon to the right
- Example of a code block:
echo hello
- Use of square
[]
brackets in commands- These, and their contents, should be replaced unless otherwise noted
- Use of
resource-type
- In the context of this article, resource-type could take on the values of:
pod
,deployment
orjob
- In the context of this article, resource-type could take on the values of:
Common Tasks
Here is a short list of the common tasks that you might need to do when working with batch jobs and the Kubectl commands to accomplish them. For more detailed information and examples, please reference the official Kubectl Quick Reference.
- Note: Not all users will have access to all commands and resource types on the TIDE cluster due to security reasons
Setting Namespace
If you have access to more than one namespace, then you may need to switch between them periodically.
Instead of supplying the namespace with the -n
flag for each command, you can set the namespace one time for all subsequent commands and terminal sessions until you re-run this command:
kubectl config set-context nautilus --namespace=[your-namespace-here]
- Example command:
kubectl config set-context nautilus --namespace=csu-example
- Example output:
Context "nautilus" modified.
Targeting TIDE
TIDE hardware is both labeled and tainted to provide a way to target the hardware and reserve portions for CSU exclusive use.
In Kubernetes a “taint” is a way to prevent a job from scheduling on a node unless the job specifically “tolerates” the taint on the node.
You will need to add the following YAML to the spec
of pods, deployments and jobs to target TIDE-labeled hardware and tolerate the TIDE taint.
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nautilus.io/csu-tide
operator: Exists
tolerations:
- effect: NoSchedule
key: nautilus.io/csu-tide
operator: Exists
Scheduling Batch Jobs
At the start of a workflow, you will want to schedule batch jobs onto the cluster using a Kubernetes YAML file that defines your batch job:
kubectl apply -f [file-name].[yaml/yml]
- Example command:
kubectl apply -f hello-pod.yaml
- Example output:
pod/hello-pod created
Checking Batch Jobs
After you have scheduled a batch job, you will want to check to make sure it is showing a ready value of 1/1
and status value of Running
:
kubectl get [resource-type]
- Example command:
kubectl get pods
- Example output with no pods running:
No resources found in csu-example namespace.
- Example output with pods running:
NAME READY STATUS RESTARTS AGE hello-pod 1/1 Running 0 74s
Sometimes you may want to watch in real-time to make sure that your batch jobs get scheduled or deleted.
Add --watch
to see real-time updates on the status of batch jobs.
When you want to exit the watch command, hit ctrl + c
.
kubectl get [resource-type] --watch
- Example command:
kubectl get pods --watch
- Example output with no pods in the namespace:
[blank]
- Example output with pods scheduling:
NAME READY STATUS RESTARTS AGE hello-pod 0/1 Pending 0 0s hello-pod 0/1 Pending 0 0s hello-pod 0/1 ContainerCreating 0 0s hello-pod 0/1 ContainerCreating 0 0s hello-pod 1/1 Running 0 1s
- Example output with pods deleting:
NAME READY STATUS RESTARTS AGE hello-pod 1/1 Running 0 68s hello-pod 1/1 Terminating 0 74s hello-pod 1/1 Terminating 0 104s hello-pod 0/1 Terminating 0 105s hello-pod 0/1 Terminating 0 105s hello-pod 0/1 Terminating 0 105s hello-pod 0/1 Terminating 0 105s
Remoting into Batch Jobs
You can pull up a remote terminal session inside of a pod to execute linux commands:
kubectl exec -it [pod-name] -- [linux-command]
- Example command:
kubectl exec -it hello-pod -- bash
- Example output:
root@hello-pod:/usr/src/app#
- Note: The shell may change depending on the OS of the container you are running, but generally bash is very common
Port-forwarding into Batch Jobs
You can forward a local port on your machine into a pod running on the cluster to remotely access the port or service running on the port:
- Note: This is a more advanced use case, but can be useful for running interactive batch jobs like Jupyter Lab.
kubectl port-forward [pod-name] [local-port]:[remote-port]
- Example command:
kubectl port-forward jupyter-kkrick-40sdsu-2eedu 8888:8888
- Example output:
Forwarding from 127.0.0.1:8888 -> 8888 Forwarding from [::1]:8888 -> 8888
Cleaning up Batch Jobs
It is best to manually clean up batch jobs when you have finished with them as opposed to relying on an automatic shutdown.
In Kubectl we clean up resources using the delete
keyword.
Remember, once a batch job is deleted, so is any leftover data – make sure to transfer important data out of batch jobs before deleting them!
You can clean up batch jobs with the file that was used to schedule them:
kubectl delete -f [file-name].[yaml/yml]
- Example command:
kubectl delete -f hello-pod.yaml
- Example output:
pod "hello-pod" deleted
You can also clean up batch jobs with the resource type and the value of the name column:
kubectl delete [resource-type] [resource-name]
- Example command:
kubectl delete pod hello-pod
- Example output:
pod "hello-pod" deleted
Troubleshooting
Sometimes you may encounter an issue with a batch job. Issues may arise during scheduling, running or cleaning up batch jobs. This section offers some commands that may help you identify the various issues you may encounter.
- Note: When contacting the TIDE Support team for batch job assistance, we may ask you for output from one or more of these commands.
Typical Issues
Below is a non-exhuastive list of issues that you may encounter:
- Syntax errors
- Generally, Kubectl will warn you if you have a syntax issue in your YAML files
- FailedAttachVolume
- Your persistent storage from a PersistentVolumeClaim could not attach to your pod for one of several reasons (I.E. already attached to another pod with read-write once mode)
- FailedScheduling
- The specified resources may not be available (I.E. trying to schedule 4 NVIDIA A100 GPUs)
- The specified tolerations may be incorrect
Checking Description for Events
Kubernetes will log events on pods which can be very useful for describing the health of the pod.
You may see a wide range of output, but we are most interested in the Events
section at the bottom of the output from this command.
- Note: If you are using deployments or jobs, it is best to describe the individual pods in your batch job.
kubectl describe pod [pod-name]
- Example command:
kubectl describe pod hello-pod
- Example output of healthy pod:
[truncated] Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 9m default-scheduler Successfully assigned sdsu-kylekrick/hello-pod to rci-tide-gpu-14.sdsu.edu Normal Pulled 8m59s kubelet Container image "ghcr.io/csu-tide/hello-csu:main" already present on machine Normal Created 8m59s kubelet Created container hellopod Normal Started 8m59s kubelet Started container hellopod
- Example output of unhealthy pod:
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 27s default-scheduler Successfully assigned sdsu-rci-jh/volume-checker to rci-nrp-dtn-01.sdsu.edu Warning FailedAttachVolume 27s attachdetach-controller Multi-Attach error for volume "pvc-5c17755d-5b40-489b-85ff-c296e269f4da" Volume is already used by pod(s) jupyter-kkrick-40sdsu-2eedu
Checking Logs
Pods may be configured to log their output to linux standard output which can be read with the logs
command:
kubectl logs [pod-name]
- Example command:
kubectl logs hello-pod
- Example output with no logs:
[blank]
If your pod contains more than one container, then this will default to the first container in the YAML file, but you can specify a container name with the optional -c
flag:
kubectl logs [pod-name] -c [container-name]
- Example command:
kubectl logs ollama-84b4ccb96c-lfbx2 -c pod-ollama
- Example output with logs:
[truncated] ollama.service: [GIN] 2024/10/21 - 16:30:53 | 200 | 173.485485ms | 10.244.150.39 | POST "/v1/chat/completions" ollama.service: [GIN] 2024/10/21 - 16:30:55 | 200 | 135.23678ms | 10.244.150.39 | POST "/v1/chat/completions" ollama.service: [GIN] 2024/10/21 - 16:30:55 | 200 | 241.210909ms | 10.244.150.39 | POST "/v1/chat/completions" ollama.service: [GIN] 2024/10/21 - 16:30:55 | 200 | 140.942317ms | 10.244.150.39 | POST "/v1/chat/completions" ollama.service: [GIN] 2024/10/21 - 16:30:55 | 200 | 174.779168ms | 10.244.150.39 | POST "/v1/chat/completions"