PISA is a batch system for Python scripts, providing a simple method for distributed computation without the need for a complicated cluster configuration.
More details, please!
Do you have a Python script that you wish you could run in multiple instances, in parallel, on many machines? Do you want to analyze multiple files using the same method? Do you have heavy computational tasks that are independent of each other but each take a long time to process? If setting up a computing cluster sounds daunting and you want to run your Python program at decent speed, PISA is the tool you are looking for!
But I don't have two dozen computers at home to build my own cluster.
Don't worry, you are not alone with this problem. The computer pool of the physics faculty at KIT consists of 34 PCs, and PISA was developed to distribute computational jobs among them. It uses SSH access to distribute jobs, and user home directories are synchronized through a network drive, so files do not need to be synchronized manually within the network. A cluster configuration file for the computing pool is provided on this website. With a suitable cluster configuration file, PISA can combine any set of homogeneous computers with SSH access and operate them as a batch system. PISA takes care of distributing the workload among the available machines, restarting tasks if a remote machine fails to respond, and collecting the output of all jobs.
Which rules do I have to follow when I want to use the computers?
When you registered for an account, you agreed to the computing rules, which can be found here. These rules allow you to combine the computers into a cluster, provided it is for purposes related to your studies. Illegal activities, such as mining for cryptocurrencies or downloading copyrighted media, will be detected and the responsible user will be held accountable. If you have any concerns, you can always ask the administrator or the "poolraum-hiwi" for advice.
Furthermore, we ask you to be considerate of other users. The computers are also used for tutorials and practical courses, which cannot take place if the machines are overloaded. Therefore, we ask that you run only a limited number of jobs on each computer at the same time, so that no single person monopolizes the computers. If you require more computational power, you can run your tasks with a higher resource allocation overnight when no users are present, but make sure everything is completed by morning. If this is still insufficient, you might need to consider other platforms to run your code, or perhaps revisit the idea of building your own cluster.
What kind of jobs are suitable for a batch system?
A batch system is designed for running many programs on other devices, independently of each other. If your program needs to be executed only once, has a long runtime, and cannot be divided into individual tasks, PISA cannot help you. Design your Python script so that the conditions that vary between runs can be passed as command-line arguments, because PISA (like batch systems in general) does not support user input during runtime. For an easy way to parse command-line arguments within your script, check out the Python argparse module. You will benefit most when your jobs are compute-bound, meaning they spend more time on computation than on waiting for input/output operations on files, network interfaces, or memory access. PISA is designed as a high-throughput system; high-performance systems have other requirements.
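As an illustration of a script that takes its per-run conditions from the command line, here is a minimal sketch using argparse. The option name -l, its default value, and the naive Fibonacci workload are assumptions chosen for illustration; they are not taken from the example program shipped with PISA.

```python
import argparse


def fib(n: int) -> int:
    """Naive recursive Fibonacci; deliberately slow to mimic a compute-bound job."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Compute one Fibonacci number.")
    # -l is the value that changes from run to run (hypothetical option name).
    parser.add_argument("-l", type=int, default=10,
                        help="index of the Fibonacci number to compute")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # Everything the job produces goes to standard output; no input() calls.
    print(fib(args.l))
```

Because the script never asks for interactive input, each run is fully described by its command line, which is what a batch system needs.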
Prerequisites
Before you can use PISA, you need to have a few things set up.
SSH
To use PISA, you first need to configure key-based SSH authentication between all devices. Without it, each submitted job would require a human to type in the login password to establish the connection; with a trusted SSH key, the key file takes the place of the password prompt. For more information about establishing an SSH connection, please refer to the Instructions for using SSH for the physics computer pool. Once you understand how to establish an SSH connection, set up passwordless login to the remote devices via an SSH key. If you are unsure how to proceed, open a terminal and call enable_ssh_key_pool.sh from anywhere; it should be installed on all computers in the pool. More details about this script can be found in the same instructions (linked above).
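To check whether passwordless login already works for a given machine, a small test like the following may help. It is only a sketch: it is not part of PISA, the host name in the usage note is a placeholder, and it assumes the standard OpenSSH client is installed. The BatchMode=yes option makes ssh fail instead of asking for a password.

```python
import subprocess


def ssh_check_command(host: str) -> list[str]:
    # BatchMode=yes disables password prompts, so a missing key causes a failure
    # instead of a hanging prompt. ConnectTimeout limits the wait for dead hosts.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def key_login_works(host: str) -> bool:
    """Return True if running `true` on the host succeeds without any prompt."""
    try:
        result = subprocess.run(ssh_check_command(host),
                                capture_output=True, timeout=15)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Calling key_login_works("some-pool-host") with one of your own host names should return True once the SSH key is set up.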
Virtual Environment
Your Python environment needs to be consistent across all devices where you want to run your programs. This is important to ensure that all Python packages required by your application are available. PISA relies on Python virtual environments (venv) to provide a uniform operational basis for all tasks executed remotely. If you are unfamiliar with virtual environments, it is highly recommended to become familiar with them. In this setup, your virtual environment is tied to the source code of your project. Assuming your Python source files are located in a directory, open a shell in that directory and run:
python -m venv venv/
This command creates a virtual environment. You will notice a new directory named "venv" (the last argument is the directory). To activate the virtual environment, run:
source ./venv/bin/activate
in the same directory where the venv was created. To deactivate a virtual environment, simply use the deactivate command in the shell. Within this venv, you can now install Python packages using pip install ..., and the packages will only be available inside the venv, with no impact on globally installed packages.
PISA requires you to specify a virtual environment even if your Python code has no dependencies.
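If you are unsure whether a script is actually running inside a virtual environment, you can ask the interpreter itself. The following sketch relies only on the standard library: inside a venv, sys.prefix points to the venv directory, while sys.base_prefix still points to the base Python installation.

```python
import sys


def in_virtualenv() -> bool:
    # The two prefixes differ only when the interpreter was started from a venv.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("running inside a venv:", in_virtualenv())
```

Running this once with the venv activated and once without shows the difference.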
How to use PISA?
Step 1: Preparation
Before you start using PISA, make sure that the requirements from the "Prerequisites" section are fulfilled: you should be able to establish an SSH connection to all the computers in the pool without having to type in a password, and your project should contain a virtual environment. To install PISA on your system, simply use:
pip install pisa-ssh
You can install PISA globally or within your project-specific venv; it just needs to be callable.
Step 2: Cluster configuration
PISA needs to know which computers it can connect to. The set of available machines is specified in a cluster configuration file when PISA is executed. There is a predefined configuration file for the computer pool at the physics faculty at KIT that can be downloaded from the command line:
Step 3: Job description
In this step, you tell PISA which jobs to execute. Typically, you want your script to be called with some command-line arguments that vary for each run. Imagine you want to execute python myscript.py -l <number>, where the parameter l is a number. The number should be 1 on the first run, 2 on the second run, and so on, up to 10 on the last run. For this, you need to provide PISA with the following information:
Which virtual environment should be used to execute the script.
Which script should be executed.
Where the output of the programs should be stored.
Any command line arguments that are the same for each run.
Any command line arguments that vary for each run and the values that should be passed to the program.
The output of all jobs is stored in files, and the assignment of each run to its corresponding command line arguments is stored in an assignment file, generated while PISA is running. To provide PISA with the necessary information, a job description file is passed to PISA when it is executed. An example job description file for a simple example program (a Fibonacci number calculator with poor runtime) is provided. This file needs to be adjusted for each batch of jobs that PISA should process.
To simplify the structure of the job description file, all file locations are specified relative to a working directory. PISA can handle any number of variable arguments for each run.
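To make the idea of a batch of runs concrete, the following sketch lists the command lines that the example above (python myscript.py -l <number> for the numbers 1 to 10) expands to. It does not use or reproduce PISA's job description format; the helper expand_runs is hypothetical and only illustrates how fixed and varying arguments combine into one command per run.

```python
import shlex


def expand_runs(script, fixed_args, flag, values):
    """Return one argument list per run, varying a single flag."""
    return [["python", script, *fixed_args, flag, str(v)] for v in values]


# Ten runs: -l 1, -l 2, ..., -l 10, with no arguments shared between runs.
runs = expand_runs("myscript.py", [], "-l", range(1, 11))

for index, command in enumerate(runs, start=1):
    # shlex.join quotes arguments so each line could be pasted into a shell.
    print(index, shlex.join(command))
```

Each printed line corresponds to one independent job; the numbering plays the same role as the assignment between runs and arguments described above.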
Step 4: Submit your jobs
Now that the prerequisites are met, the cluster configuration file is downloaded, and the job description file for the batch of tasks is prepared, you can start the distributed parallel processing of the jobs. To do so, run PISA using the following command:
PISA starts running the jobs on the other machines and stops when all jobs are finished. The directory for the output files and the assignment file are created during the run, allowing you to collect the results of your jobs. You can add the -l parameter when running PISA to enable detailed output of the actions currently being performed, which also gives you an overview of the tasks currently running on the remote machines. If you are unsure about what you are doing, use the -d parameter to perform a dry run, which only reports what PISA would do without executing any jobs. A dry run can also be used to reconstruct the assignment file in case you lost it and don't want to run all jobs again. For more information, check out:
pisa --help
Example program
The GitHub repository contains a directory with an example program, the corresponding task description file, and the cluster configuration file for the fphct computing pool. Check out these files to familiarize yourself with the use of PISA. To run the example, a virtual environment needs to be created, and the file locations should match the ones in the task description file.
PISA development
PISA is an open-source project. If you encounter any issues with PISA or have feature requests, you can submit an issue or implement the solution yourself and make a pull request.