RSH 15: Containers for science

Collaborative notes taken during the session

Icebreaker question

When have you struggled with software that was difficult to install? How did you solve it?

Just this/last week. I could not even understand the instructions and failed with pip3. I managed to create a virtual environment, but then the pip3 command did not seem to work; the options --upgrade and --user caused problems for me (https://pypi.org/project/crispy/). As a second option, I created a conda env and installed the Python modules. Still no improvement. Finally, I found that a .dmg is provided and will go for that on the next try. A friend of mine is doing a PhD in computer science and still sometimes struggles with installations.
When trying to install our own research software on older OSes (testing portability). The software relies on much newer versions of packages (everything from CMake to Python packages) than those OSes can access through their package managers. In some cases we managed to solve it by manually compiling everything, in other cases not. The installation instructions had to be updated with a lot of hacks for these cases.
When trying to install software on Windows that required a C/C++ build toolchain, the lack of instructions made it difficult. Switched to Windows Subsystem for Linux for now. Another instance was when the instructions were unclear and, to make matters worse, the setup.py was broken (not yet fixed).

Correction

A question came up whether Singularity images can be made executable. My answer after a quick test was wrong, and one of the watchers has sent a clarification, which is amazing:

Got puzzled by that executable Singularity container remark someone made.

Tried it out here and I'm amazed that it actually works:

# cat Singularity
BootStrap: docker
From: alpine:edge

%runscript
    /bin/echo "$@"

and after:

# singularity build singularity.img Singularity
...
# chmod +x singularity.img

# ./singularity.img Hello world
Hello world

Then something else I found while googling:

# cat Singularity
BootStrap: docker
From: alpine:edge

%post
    apk --no-cache add vim python3

%runscript
    /usr/bin/$SINGULARITY_NAME "$@"

# l
lrwxrwxrwx   1 root root       15 Nov 19 22:05 python3 -> singularity.img
-rw-r--r--   1 root root      120 Nov 19 22:05 Singularity
-rwxr-xr-x   1 root root 22192128 Nov 19 22:05 singularity.img
lrwxrwxrwx   1 root root       15 Nov 19 22:02 vim -> singularity.img

# ./python3 --version
Python 3.8.6

# ./vim --version | head -n1
VIM - Vi IMproved 8.2 (2019 Dec 12, compiled Nov 19 2020 05:07:05)

So $SINGULARITY_NAME becomes the name of the symlink (or the container).
This blew my mind: basically drop-in binary replacements.
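
The dispatch-on-invoked-name idea can also be tried without Singularity. The sketch below is only an analogy: a plain shell script inspects `$0` in the same way the `%runscript` above inspects `$SINGULARITY_NAME`, so each symlink behaves like a different command. The names `multicall.sh`, `hello` and `goodbye` are made up for this demo.

```shell
# Analogy only: one script, several command names via symlinks.
# multicall.sh, hello and goodbye are hypothetical names for this demo.
demo_dir=$(mktemp -d)

cat > "$demo_dir/multicall.sh" <<'EOF'
#!/bin/sh
# basename "$0" is the name of the symlink (or of the script itself).
case "$(basename "$0")" in
    hello)   echo "Hello, $*" ;;
    goodbye) echo "Goodbye, $*" ;;
    *)       echo "called as $(basename "$0"): no such tool" >&2; exit 1 ;;
esac
EOF
chmod +x "$demo_dir/multicall.sh"

# Both symlinks point at the same file.
ln -s multicall.sh "$demo_dir/hello"
ln -s multicall.sh "$demo_dir/goodbye"

"$demo_dir/hello" world
"$demo_dir/goodbye" world
```

Calling the script under its own name hits the fallback branch, which is roughly what would happen if the container were invoked under a name with no matching binary in `/usr/bin`.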

Questions/comments

Is it so difficult to write good installation instructions?
- apparently so, it's one of the most annoying parts of me supporting other researchers
  - So even you, as a more advanced RSE, have to troubleshoot and figure it out?
  - it's harder than I would like... I can usually figure it out, but we need everyone to be able to, too.
Ten simple rules for writing Dockerfiles for reproducible data science: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008316
Source code is to compiled code as a container recipe file is to the container image
Not really a question, but I like .dmg files on MacOSX. I unpack them and drag the app into any folder. Done. Or they come just with a little installation program, done.
- Sure, but you have .exe and .msi files for Windows.
- The main question here, I think, is how much is included in the installation? Does it depend on the OS much, so that it can only be installed on a few OS versions? MacOS is a very controlled environment so they tend to have fewer problems.
- Also https://flatpak.org/ and https://appimage.org/ both are sort of .dmg or .App for Linux
- yes!
Newbie question, but: which container technology should I use and why? Is there a handy comparison chart somewhere?
- We'll discuss this later. Most popular is Docker. On supercomputers, most popular seems to be Singularity.
To what extent are Docker containers / Kubernetes used in university HPC environments?
- we will come back to that and show something. Containers there are mostly run using Singularity, which can use/import Docker images.
What is the advantage of these containers? Is this software not available via conda, for example? Create a virtual env and install the modules.
- great question! we will discuss this: advantage is that if there is dependency on or "interference" with system libraries, then these are well defined
  - :+1:
- we will hopefully also discuss drawbacks later
Is running graphical user interfaces a good use of Docker? In general it seems like there's quite a lot of trickery to make it work reliably? More useful for running services, no?
- Yeah, I think that command-line stuff or services are more typical. But it's still quite useful for graphical stuff, if you can't do anything else!
Containers often bundle many things without consideration for licensing. Are there any recommendations to handle the explosion of different licenses you have to consider once you are bundling a large fraction of an operating system in a container? I guess similar question for other "virtualization solutions" and images (VMWare, VirtualBox)
- This is a good question, and I've seen some issues about this with nvidia libraries. I have a feeling it's sort of a wild west out there.
Anne, thank you for the encouragement! Thanks for telling us your first baby steps. It looks complicated. Does one have to know a lot of Linux?
On the topic of layers, what are the current recommendations to handle them? I recently read about a new solution to avoid pages of instructions terminating with \ or && but I don't understand this.
- I haven't heard of this, do you have a link to the recommendation?
  - I think it was Multi-stage builds
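
If multi-stage builds are indeed what was meant, a minimal sketch could look like the following. The build tools live in a first stage whose layers are not shipped, and only the result is copied into a small final image, which reduces the pressure to squeeze everything into one long `RUN ... && \` chain. The image tags (`gcc:10`, `alpine:3.12`), the file `hello.c` and the image name in the comments are assumptions chosen for illustration.

```shell
# Sketch: write a tiny build context with a multi-stage Dockerfile.
# Image tags, file names and the image name below are illustrative assumptions.
build_dir=$(mktemp -d)

cat > "$build_dir/hello.c" <<'EOF'
#include <stdio.h>
int main(void) { puts("hello from a multi-stage build"); return 0; }
EOF

cat > "$build_dir/Dockerfile" <<'EOF'
# Stage 1: full compiler toolchain; its layers are not part of the final image.
FROM gcc:10 AS builder
WORKDIR /src
COPY hello.c .
# Static linking so the binary does not depend on the builder's C library.
RUN gcc -static -O2 -o hello hello.c

# Stage 2: small runtime image that contains only the built binary.
FROM alpine:3.12
COPY --from=builder /src/hello /usr/local/bin/hello
CMD ["hello"]
EOF

echo "Build context written to $build_dir"
# To try it (requires Docker):
#   docker build -t hello-multistage "$build_dir"
#   docker run --rm hello-multistage
```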
Any experiences with using Ansible for the recipe part instead of Dockerfiles? From what I understand it can produce Docker images, while also having the ability to deploy via SSH and other options.
- Good question, I know it can be but I don't have any other extra advice on this.
- I think they are different technologies with similar purposes. With docker the container is the final result and the Dockerfile the way to get there. With ansible the recipe (roles?) is more about creating entire systems from scratch (e.g. in the cloud) and having an entire computer configured.
In the example on screen, won't line 8 apt-get update change very often?
- It caches each layer based on the text of the layer. So while the results may be different each time it runs, docker won't realize that, and will leave it cached until you change something else.
From what I understand, the recommendation is to not have apt update by itself because it will be cached? If we want it to update on each build it should be RUN apt update && apt install ... && apt clean to force the update.
- Yep, I've seen that more often. Also then the package information isn't saved in a layer.
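
The single-`RUN` pattern from this question might look like the sketch below. The base image tag and the package names (`curl`, `ca-certificates`) are placeholders, not anything shown in the session.

```shell
# Sketch: Dockerfile that updates, installs and cleans up in one layer.
# Base image tag and package names are placeholders.
build_dir=$(mktemp -d)

cat > "$build_dir/Dockerfile" <<'EOF'
FROM ubuntu:20.04

# One RUN step means one layer: editing the package list changes the text of
# the step, so the cache is invalidated and the index is refreshed as well.
# Removing the package lists in the same step keeps them out of the layer.
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*
EOF

echo "Dockerfile written to $build_dir"
```

Note that the combined step is still cached for as long as its text stays the same; `docker build --no-cache` rebuilds every layer when a fresh package index is really needed.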
Any advice for automatic cleanup after unsuccessful Docker builds or long unused images? Docker images fill up the hard drive really fast.
- docker image prune and docker container prune
- also the --rm option that was mentioned
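
A possible cleanup routine built from these commands is sketched below. `docker_cleanup` is a hypothetical helper name, and the default age of 240 hours is an arbitrary choice.

```shell
# Sketch of a periodic cleanup; docker_cleanup is a hypothetical helper name.
docker_cleanup() {
    # Show how much space images, containers and build cache use.
    docker system df
    # Remove stopped containers (-f skips the confirmation prompt).
    docker container prune -f
    # Remove dangling images, i.e. untagged leftovers from earlier builds.
    docker image prune -f
    # Remove unused images older than the given age (default: 240h = 10 days).
    docker image prune -a -f --filter "until=${1:-240h}"
}

# Example call (requires Docker):
#   docker_cleanup 48h
```

The last command deletes every image that no container uses and that is older than the cutoff, so it is worth reading the `docker system df` output before running it on a machine with images that are hard to rebuild.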
But the docker file does not strictly define the result. Those package repos added presumably change over time.
- yes!!!! Exactly, sort of related to two questions above.
- I wish I could pin PyPI/conda/apt to "install latest version available from this date", so I can pin to a certain time, without having to pin all the versions.
- This is an interesting point from a reproducibility aspect. Would you recommend long-term storage of the built binary Docker images (which include everything)?
Don't you mess up your system with such misbehaving Docker installations? As far as I understand, it would install things all over the system, not just in a controlled virtual environment.
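
One way to keep a built image independent of any registry is to export it to a tar file with `docker save` and store that file alongside the data. This is a sketch of one option, not a recommendation made in the session; `archive_image` and `restore_image` are hypothetical helper names.

```shell
# Sketch: archive a built image as a file and load it again later.
# archive_image and restore_image are hypothetical helper names.
archive_image() {
    # $1: image reference (e.g. myanalysis:1.0)   $2: output tar file
    docker save -o "$2" "$1"
}

restore_image() {
    # $1: tar file written by archive_image
    docker load -i "$1"
}

# Example (requires Docker and an existing image):
#   archive_image myanalysis:1.0 myanalysis_1.0.tar
#   restore_image myanalysis_1.0.tar
```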
- One of the points of docker is that it's strictly isolated from the system (everything is in the layers only), so there is little risk of that. But there could be security problems that allow a malicious container to escape (but that would be a bug - people assume containers stay contained)
  - Thank you, I did not know.
I arrived late, so maybe you already talked about this, but many sites disallow pure Docker (while supporting Singularity or Shifter). It's typically not much added hassle, though, since Singularity can import directly from Docker (Hub).
- True, we haven't gotten to different runtimes yet, but we should mention this.
Often the hurdle I see in using Docker containers is that the image size scales quite poorly. For example, a simple OpenCV-based image resizing application (4 lines of code) ends up as a ~1.2 GB image. The image grows even bigger if you want a common Linux distribution as the base image and then some libraries on top of it. So, if someone is developing and distributing a software library, Docker may not be a good choice for distributing it, as it may not make for a good dev environment for potential contributors. However, it can be a good option to consider if one is developing a script/black-box kind of software that the end user will only execute. Would like to hear your thoughts on this.
- Good point, we were discussing something like this before the show. We'll bring it up by the end, thanks for the idea.
- Great!, will stay tuned
- But I agree with the sentiment of what you said, I think it's a good analysis.
Are there any open-source registries as alternative to docker-hub? Like if I have sensitive data in images and don't want to host them in a public registry.
- There are some, we use Portus but I don't know much about it.
- If you could, I would try to separate the sensitive data from the image.
It's really easy to run your own registry as a Docker container, to be independent of DockerHub etc. :)
- good point :-) e.g. dockerized singularityhub
  - Got a link/reference?
    - this is what we are trying to set up for Norwegian high-perf computing users, need to find some example ...
  - I'm running Docker registry using a simple docker-compose file which is automatically picked up by Traefik to be served using HTTPS.
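
For reference, a local registry can be started from the official `registry:2` image roughly as sketched below. The helper names, the container name `registry` and port 5000 are assumptions for this example; the HTTPS setup with docker-compose and Traefik mentioned above is not shown.

```shell
# Sketch: run a private registry locally and push an image to it.
# Helper names, container name and port are assumptions for this example.
start_local_registry() {
    docker run -d -p 5000:5000 --restart=always --name registry registry:2
}

push_to_local_registry() {
    # $1: existing local image, e.g. myimage:1.0
    docker tag "$1" "localhost:5000/$1"
    docker push "localhost:5000/$1"
}

# Example (requires Docker):
#   start_local_registry
#   push_to_local_registry myimage:1.0
#   docker pull localhost:5000/myimage:1.0
```

Docker accepts plain HTTP for a registry on localhost; a registry reachable from other machines needs TLS, which is where a reverse proxy such as the one mentioned above comes in.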
Anyone who is paying to host your large files for free can stop doing so.
- yes! like Dockerhub just started doing. Biocontainers and quay.io?
- Which is why I found it a bit weird when people would recommend docker to make things reproducible, without thinking much about storage (other than dockerhub)
From a reproducibility point of view, since the underlying operating system is always changing, how do you handle this? How do you version such an image?
- Images can have tags, so you build on (FROM) a specific image tag, so it won't change too much. Of course, what if repositories or package versions change?
  - Yes... that apt-get update line... if I build from the same recipe what guarantee do I have someone else will be able to reproduce the same container?
alpine vs ubuntu vs centos as base images. alpine is popular for being small but uses a different C runtime (musl libc). Any experience with incompatibilities while changing or using different bases?
- I haven't had this hit me, but it certainly could be a problem. For a microservice, I think alpine makes sense, for a user environment I'd start with Ubuntu or similar.
You can use traditional software build systems (spack as an example) and put the result in a docker image
- neat, have you done this?
You mentioned miniconda within Docker containers. Any recommendations for using/activating environments in this setting? Can be a little cumbersome since conda activate requires a shell...
- I usually use conda activate b
- conda is almost a container technology already. By that I mean it's almost a complete linux distribution in itself (where the packages are like layers)
- Ran into these kinds of challenges
- conda activate doesn't require a login shell, but does require the shell to be set up for conda. Often this is done with conda init, but that makes permanent changes and requires a login shell. I recommend that people source $conda_path/bin/activate, even though that seems less recommended.
  - I don't think the latter works anymore with latest conda. "source activate ENVNAME" is an alternative
    - I see it still works if you install the conda in the environment you are sourcing. Then I think you don't have to go through the base environment first (can someone correct?)
  - Doing "conda init bash" has broken a gazillion things for people on our systems (many RT tickets...)
    - +1, it feels dirty (or at least risky) to me
    - conda config --set auto_activate_base false : this command disables automatic conda 'base' environment activation upon launching a terminal
    - +1
  - This was the solution I ended up with. The problem is conda init only affects interactive sessions like docker run -it ubuntu bash, but if there's an entrypoint it will not source .bashrc and hence conda things won't be available.
- The link seems like a good solution (I didn't know of conda run, that will solve a lot of problems!). It is indeed a difficult issue, I wish conda wasn't so focused on interactive work. I need to investigate more.
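
A sketch of the `conda run` approach in a Dockerfile follows. The environment name `analysis`, its contents and the base image are assumptions made for the example; `--no-capture-output` needs a reasonably recent conda.

```shell
# Sketch: use "conda run" as the entrypoint instead of "conda activate".
# Environment name, packages and base image are assumptions for this example.
build_dir=$(mktemp -d)

cat > "$build_dir/environment.yml" <<'EOF'
name: analysis
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy
EOF

cat > "$build_dir/Dockerfile" <<'EOF'
FROM continuumio/miniconda3
COPY environment.yml /tmp/environment.yml
RUN conda env create -f /tmp/environment.yml && conda clean --all --yes
# conda run executes the command inside the environment, so no shell
# initialisation (conda init, .bashrc) is needed for non-interactive use.
ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "analysis", "python"]
EOF

echo "Build context written to $build_dir"
# To try it (requires Docker):
#   docker build -t conda-run-demo "$build_dir"
#   docker run --rm conda-run-demo -c "import numpy; print(numpy.__version__)"
```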
can you chmod +x singularity.img && ./singularity.img ?
- We tested, and it seems like no
- Yeah, won't work, think of it as a filesystem image or compressed archive of files.
Where do you get help when you are stuck with Docker or Singularity? Is there something like stackoverflow for them?
- Docker forums is decently active
- I think this is a good case for Research Software Engineers to help people.
Drawbacks/problems with containers
- binaries do not always work the same way on different machines (breaking that ideal encapsulated behavior).
- containers can become stale (contain long-known problems), like security issues. See for example Red Hat's container health checks.
- you can have an old image that "works" but it's no longer possible to re-create it.
- re-creating a container would ideally make the same image as before but it's easy to make Dockerfiles that are (way too) dynamic
- There are, for example, OpenFOAM containers that require a very closely matching Open MPI outside of the container, which causes issues (MPI and containers can be painful)
- They can get large, very large!

Ideas for next week

Maybe some topics for beginners? Or would it be too boring for the more advanced users in this community? Edit: I don't know really. I am a learner. [Comment: I think the current level is good, understandable for beginners but still advanced enough to be useful] I disagree a bit, but never mind. I will just ask questions.
Powertools such as screen/tmux maybe? others?
Cloud or not cloud?
Kubernetes? Quantum computing?
What shell was Radovan using? colors!!!
- fish with customized oh-my-fish, customization: https://github.com/bast/config/tree/master/omf
Debugging with an IDE? Breakpoints? Code inspection? Still fear the segfault.
Debugging parallel software with DDT or totalview?
Debugging in general
- strategies
- printf style
- debugger features
- runtime checks and trace collection
How to speed up software? Data analysis in real time? The data arrives on a stream and needs to be analysed as fast as possible.

Decision for next week: debugging. Please send us stuff that doesn't work!

send stuff here: https://github.com/ResearchSoftwareHour/rsh-notes/issues