CITE Lab Server User Guide

Original link: https://www.do1e.cn/posts/citelab/server-help


Tip

This document is updated from time to time; check back regularly for the latest information.
Use the table of contents on the right (scroll up on mobile, or use the small button in the lower-right corner) to jump to a topic of interest.


Connection and Login#

Connect to the servers via SSH, or install the Remote-SSH extension for VSCode; detailed instructions are easy to find online.

::: banner {warning}
Starting from 2024.08.11, password login is disabled on all servers. Please provide a public key when a new account is assigned:
send me the contents of your public key (ssh-xxx xxxxx).
:::

Create a key pair:

# Use a longer key length for stronger security
ssh-keygen -t rsa -b 8192
# Better yet, use a more modern algorithm
ssh-keygen -t ed25519

On Linux/Mac, keys are saved by default as ~/.ssh/id_rsa or ~/.ssh/id_ed25519 (private key) and ~/.ssh/id_rsa.pub or ~/.ssh/id_ed25519.pub (public key).
On Windows, they are saved by default in the C:\Users\[username]\.ssh folder, with the same names.
The public key is safe to share. On the server it goes into ~/.ssh/authorized_keys, one public key per line, each corresponding to the private key of a different machine.
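For reference, ~/.ssh/authorized_keys is a plain text file; the entries below are illustrative, with the key material abbreviated:

```
# ~/.ssh/authorized_keys — one public key per line; a trailing comment
# (e.g. a machine name) helps track which device each key belongs to
ssh-ed25519 AAAAC3Nza... laptop
ssh-rsa AAAAB3Nza... office-pc
```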

::: banner {warning}
Keep the private key safe and never disclose it. Using the same key pair on all of your machines is strongly discouraged!

Reference links:

  1. Using TPM for Secure SSH Key Authentication on Windows
  2. Password Management Service Bitwarden provided by Nanjing University
  3. SSH Key Hosting Provided by Bitwarden
:::

You can add the following to ~/.ssh/config on your own computer; then the command ssh s1 connects to the server directly, which is more convenient.

Host s1
  HostName s1.xxx.cn
  Port 22
  User xxx
  IdentityFile xxx/id_rsa

For detailed tutorials, see: VSCode Configuration for SSH Connection to Remote Server + Passwordless Connection Tutorial

Environment Configuration#

uv#

It is strongly recommended to use uv to manage project environments. Configure a project once and you can quickly reproduce the same environment anywhere; installation is also much faster than with conda or pip.

Related tutorial: UV: The Python Package Management Tool - 100 Times Faster than pip

curl -LsSf https://astral.sh/uv/install.sh | sh # Install uv
uv init # Initialize the current project
# This will generate five files: .gitignore, .python-version, main.py, pyproject.toml, README.md and execute git init
# Pay special attention to pyproject.toml, which records dependencies, the project name, etc.
# Do not edit .python-version by hand; delete or modify the other files as needed.
uv python pin 3.12    # Specify the Python version (updates .python-version)
uv add "torch==2.1.0" # Similar to pip install
# This will generate a very important file uv.lock, which contains all dependency information and their version numbers
# There will also be a .venv folder, which is the virtual environment for the current project
uv run xxx.py # Execute code

# Or
source .venv/bin/activate # Activate the virtual environment, similar to conda activate
python xxx.py

If you create another project that uses the same environment, or copy the code to another machine, just copy pyproject.toml and uv.lock, adjust pyproject.toml as needed, and run the following to reproduce the original environment exactly:

uv lock && uv sync
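For reference, a minimal pyproject.toml might look like this (the project name and dependency list are placeholders):

```toml
[project]
name = "my-experiment"            # hypothetical project name
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "torch==2.1.0",               # pinned by `uv add "torch==2.1.0"`
]
```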

conda#

If you find conda: command not found, execute the following command and restart the terminal:

/opt/anaconda3/bin/conda init

Since environments are saved in the ~/.conda directory, migrating to another server only requires copying that directory; no reconfiguration is needed. Alternatively, edit ~/.condarc as below and point envs_dirs and pkgs_dirs to /nasdata/[name]/.conda/envs and /nasdata/[name]/.conda/pkgs, so the environment lives on the NAS and can be used from multiple servers.

show_channel_urls: true
default_channels:
  - https://mirror.nju.edu.cn/anaconda/pkgs/main
  - https://mirror.nju.edu.cn/anaconda/pkgs/r
  - https://mirror.nju.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirror.nju.edu.cn/anaconda/cloud
  msys2: https://mirror.nju.edu.cn/anaconda/cloud
  bioconda: https://mirror.nju.edu.cn/anaconda/cloud
  menpo: https://mirror.nju.edu.cn/anaconda/cloud
  pytorch: https://mirror.nju.edu.cn/anaconda/cloud
  simpleitk: https://mirror.nju.edu.cn/anaconda/cloud
auto_activate_base: false
envs_dirs:
  - ~/.conda/envs
pkgs_dirs:
  - ~/.conda/pkgs
Also configure pip to use Nanjing University's mirror:

pip config set global.index-url https://mirror.nju.edu.cn/pypi/web/simple

After configuring an environment, running conda clean --all and rm -rf ~/.cache/pip clears a large amount of unneeded cache and alleviates space pressure.

docker#

If the system software cannot meet your needs, you can use Docker (tutorials are easy to find online). However, all Docker containers must be started under your regular user identity, or they will be removed. In the example below, lines 2-6 must be kept; the rest can be customized as needed.

docker container run --name pytorch-dpj \
  --gpus all \
  --user $(id -u ${USER}):$(id -g ${USER}) \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /etc/shadow:/etc/shadow:ro \
  -v /data1/peijie:/data/:rw \
  -v /home/peijie:/home/peijie:rw \
  -it fenghaox/pyt1.3cu10.1:v2 /bin/bash

Alleviating Home Space Issues#

  • conda clean --all: Delete conda cache
  • rm -rf ~/.cache/pip: Delete pip cache
  • uv cache clean: Delete uv cache
  • rmoldvs: Delete old versions of vscode-server (needs to be used in the vscode terminal)

Check GPU Usage Status#

https://nvtop.njucite.cn/ (recommended)
Log in with your email. Please submit your email to the administrator to be added to the whitelist for access.

Or use the nvtop command on each machine.

Use Specified GPU#

Without parallelism, PyTorch defaults to GPU 0; with parallelism enabled, it defaults to all GPUs.
Before running your code, set the CUDA_VISIBLE_DEVICES environment variable to specify which GPUs to use. For non-parallel use of GPU 1:

export CUDA_VISIBLE_DEVICES=1

Or for parallel use of GPUs 0-3:

export CUDA_VISIBLE_DEVICES=0,1,2,3
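The same selection can also be made from inside the script itself; a minimal sketch:

```python
import os

# Select GPUs from inside the script instead of exporting the variable in the
# shell. This must run before torch (or any CUDA library) is imported, because
# CUDA reads the device list once at initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # or "0,1,2,3" for parallel runs

# import torch  # import only after setting the variable
```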

Try learning the two multi-GPU parallelism methods: DataParallel (simple to implement, but it incurs extra memory overhead on the first GPU, lowering memory utilization) and DistributedDataParallel (harder to implement and debug, but more efficient; it is recommended to switch to it once your code is stable).

nvtop can be used to check GPU usage; coordinate with whoever is currently using or occupying the GPUs.

Networking Issues#

A proxy has been configured. If there are networking issues (e.g., with GitHub), add proxychains before the commands that require internet access, such as:

proxychains curl https://www.baidu.com

If you need to log in to p.nju.edu.cn, you can refer to this project:

# You may need to install uv first, refer to the previous environment configuration section
uvx NJUlogin -i # Then scan the code to log in to the campus network
uvx NJUlogin -i -l pwdLogin # Or log in with username and password
uvx NJUlogin -o # Log out

Run Code in the Background#

The server has tmux installed. To run code in the background (so it can continue running after exiting the terminal), you only need to use the most basic features.

Type tmux new in the terminal to create a new session, run your long-running command inside it, then press ctrl+B followed by D to detach. The code keeps running in the background.
Alternatively, use tmux new -s <name> to give the session a name; by default sessions are numbered starting from 0.

Use tmux ls to list the sessions running in the background.
Use tmux attach -t <name> to reattach to a session and check its status.

In a tmux session, press ctrl+B then [ to scroll up and down, and press q to exit scroll mode.

Data!!!#

Data Storage Location#

::: warning
The home directory has limited space; do not place data files in the home directory. Please place them in /data1.
:::

Infrequently used files can be placed in /nasdata, see the NAS description section below for details.

Data Backup#

::: warning
Ensure the safety of your data on public servers.
:::

The server has rclone installed, providing a convenient and scheduled backup method (sync important files from the server to NJUBox):

rclone config

n → custom configuration name (e.g., njubox) → 56 (seafile; the number may vary by rclone version) → https://box.nju.edu.cn → student ID → password (enter y first, then type the password twice) → 2FA (just press enter) → library name (press enter for all unencrypted libraries) → follow the prompts for the rest.

Common rclone Methods#

View Remote Files#

rclone ls [configuration name]:/[directory]

image

Sync#

The first run copies all files from the source to the remote (target).
Subsequent runs copy only changed and newly added files.

::: warning
Important Note: after each run, the target will be made exactly identical to the source. If files are deleted from the source, rclone sync also deletes them from the target (rclone copy does not delete target files).
:::

rclone sync -v [source directory] [configuration name]:/[target directory]

image

Scheduled Sync#

Put the sync command above into crontab to run it on a schedule. Details can be found online; there are many tutorials.
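For example, a crontab entry like the following (the path and remote name are illustrative) would run the sync every day at 3:00 AM; edit your crontab with crontab -e:

```
# m h dom mon dow  command
0 3 * * * rclone sync -v /data1/yourname/results njubox:/backup/results
```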

NAS Description#

::: banner {warning}
NAS is not 100% reliable either; for important data, please follow the 321 principle (three copies, two types of media, one offline backup).
:::

Download applications from Synology's official website: Enterprise Cloud | Synology Drive_Private Cloud_Access Data Anytime_Multi-Person Collaboration | Synology Inc.
Or access directly via the web: https://nas.njucite.cn:5001

IP/Domain: nas.njucite.cn

Logging in through the Drive application shows only the home directory, which is visible to you alone.
Logging in through the web also shows the share directory, a shared directory mounted on every server at /nasdata that can be used to transfer data between servers. Some servers (s4 and s5) have a 10G connection to the NAS; the others are 1G.

::: warning
Everyone has access to /nasdata. To guard against accidental deletion by others, it is recommended to back up important data with rclone; refer to the section Using rclone to Sync Local and NAS Files, and remember to replace the URL.
:::

Files can be moved between the two directories via the web interface.

image

WebDAV can also be mounted, with the WebDAV address: https://nas.njucite.cn:5006

Use iperf3 to test connection speed:

iperf3 -c nas.njucite.cn

image

Using rclone to Sync Local and NAS Files#

rclone config
e/n/d/r/c/s/q> n # Create a new configuration
name> nas # Configuration name is nas
Storage> 52 # WebDAV, the rclone version may vary
url> https://nas.njucite.cn:5006 # It is recommended to use the 10G network on the server: http://10.0.0.100:5005
vendor> 7 # Other site/service or software, the rclone version may vary
user> abcd # NAS username
y/g/n> y # Enter password
password: ... # Enter NAS password twice
# Press enter for the rest

After creating the configuration on your local computer as described above, you can use the previously introduced rclone copy or rclone sync commands for file synchronization (e.g., upload local files to NAS or download NAS files to local).

::: warning
As noted above, each run makes the target exactly match the source: files deleted from the source will also be deleted from the target by rclone sync (rclone copy does not delete target files).
:::

Advanced#

Automatically Fill in Previously Entered Commands#

You can use zsh as the default terminal and configure oh-my-zsh, powerlevel10k, zsh-autosuggestions, and zsh-syntax-highlighting.

zsh+oh-my-zsh+powerlevel10k terminal configuration_powerlevel10k configuration-CSDN Blog

Alternatively, use my configuration directly: decompress the following file into your home directory.
zshconfigs.tar.gz

GUI Programs#

Some commands may complain that no display is available. If you must use a GUI, refer to the following two methods. Method 1 runs commands from your own terminal but requires extra configuration; Method 2 must be run from MobaXterm but works out of the box.

Method 1#

Install MobaXterm on your local computer and open X server.

image

Hover the mouse over it to see [IP]:[x11port]. Choose an IP and port that are not behind router NAT (at Nanjing University, non-NAT IPs generally start with 114 or 172, while router-NAT IPs start with 192.168 or 10), then enter the following in the server terminal:

export DISPLAY=[IP]:[x11port]

Then enter commands related to GUI, and click "Yes" in the pop-up window on your local computer.

image

Method 2#

Directly use MobaXterm for SSH connection and execute GUI-related commands.

Copy with Progress Display#

Add the following to ~/.bashrc or ~/.zshrc:

function rcp(){
    local src=$1
    local dst=$2
    if [ -f "$src" ] && [ -d "$dst" ]; then
        dst="$dst/$(basename "$src")"
    fi
    mkdir -p "$(dirname "$dst")"
    rsync -ah --info=progress2 "$src" "$dst"
}

After that, use rcp instead of cp. The semantics are not exactly the same: the second argument dst should be the target directory, and unlike cp it cannot rename the file at the same time.

Send Email Notifications After Training Ends/Fails#

Add the following Python code at the end of your training script.

import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr
import socks

sender = "noreply@do1e.cn"             # Sending email address
sender_name = "s1"                     # Sender's display name; here, the server name
passwd = "xxxxxxx"                     # Email password (for QQ mail, the authorization code)
server = "smtphz.qiye.163.com"         # SMTP server, e.g. smtp.qq.com for QQ mail
port = 465                             # SMTP port, usually this one
receiver = "pjdiao@smail.nju.edu.cn"   # Receiving email address
receiver_name = "Peijie Diao"          # Receiver's name
subject = "train on s3"                # Email subject
message = "Training on s3 is finished" # Email body

# The server cannot reach the internet directly; route through a proxy that
# allows LAN connections
socks.set_default_proxy(socks.SOCKS5, "xxxx", 7891)
socks.wrapmodule(smtplib)

msg = MIMEText(message, 'plain', 'utf-8')
msg['From'] = formataddr((sender_name, sender))
msg['To'] = formataddr((receiver_name, receiver))
msg['Subject'] = subject

smtp = smtplib.SMTP_SSL(server, port)
smtp.login(sender, passwd)
smtp.sendmail(sender, [receiver], msg.as_string())
smtp.quit()
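To also be notified when training fails, you can wrap your training entry point. A minimal sketch, assuming the SMTP code above has been wrapped in a function send_mail(subject, message) of your own:

```python
import traceback

def run_with_notification(train_fn, send_fn, host="s3"):
    """Run train_fn and report success or failure via send_fn(subject, body)."""
    try:
        train_fn()
    except Exception:
        # On failure, mail the traceback so the error is visible in the email
        send_fn(f"train on {host} failed", traceback.format_exc())
        raise  # re-raise so the failure is still visible in the terminal
    else:
        send_fn(f"train on {host} finished", f"Training on {host} is finished")

# Usage: run_with_notification(main, send_mail)
```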

VPN Alternatives#

When the school's VPN server is unstable, consider the approach below, which is also relatively fast (provided a P2P connection is established successfully).

image

If capable, you can also consider building your own Zerotier or OpenVPN service.

Using Zerotier for P2P Connection with My Campus Server#

Refer to the client-configuration section of xubiaolin/docker-zerotier-planet to set up ZeroTier One.
The planet file and network ID can be found by logging into https://nvtop.njucite.cn, or by contacting me. After configuration, contact me with your address for authorization.

For example, the address is 15ffbcaa44.

> zerotier-cli info
200 info 15ffbcaa44 1.14.2 ONLINE

After authorization, restart the ZeroTier service. You should obtain an IP address in 10.128.3.0/24 and be able to access https://test.nju.do1e.cn/. If this step succeeds, proceed to the next one.

Routing#

The following commands require administrator (sudo) privileges.

Windows

First, run route print to find the number corresponding to the ZeroTier Virtual Port, as shown in the example below, which is 11.

> route print
Interface List
  5...xx xx xx xx xx xx ......Microsoft Wi-Fi Direct Virtual Adapter
  3...xx xx xx xx xx xx ......MediaTek Wi-Fi 6E MT7922 (RZ616) 160MHz Wireless LAN Card
  11...xx xx xx xx xx xx ......ZeroTier Virtual Port

Then run route add 114.212.0.0 mask 255.255.0.0 10.128.3.4 if {No} metric 1 (replace {No} with the number obtained earlier).

Linux

First, run ifconfig to check the interface corresponding to ZeroTier, which usually starts with zt.

Then run sudo ip route add 114.212.0.0/16 via 10.128.3.4 dev {Interface} metric 1 (replace {Interface} with the interface name obtained earlier).

MacOS

route add -net 114.212.0.0/16 10.128.3.4 -hopcount 1 (AI result, unverified).

At this point, you should be able to connect to the server and NAS off-campus and access https://nvtop.main.njucite.cn.

Note: The routing configuration above must be re-applied after each reboot. You can look for a way to make it permanent, but permanent configuration is not recommended on laptops.

Note: This only guarantees connectivity to the server and NAS.
