RSH 24: Computers for research 101
The essential course that everyone skipped
Video
Collaborative notes taken during the session
Computers for research 101: The essential course that everyone skipped
-
Website with a couple of topics planned: https://researchsoftwarehour.github.io/
-
It is good to count the episodes though maybe you can count seasons i.e. season 2, episode 1 ;)
- this is then S03E01 :-)
- wonderful! may many seasons to come!
-
converting line endings from Windows to Linux/Unix:
- dos2unix utility can do that
-
htop (a flavor of the "top" command)
-
BSD = berkeley systems distrbution, a form of unix
Our session notes
- We won't talk about theory of computation, hardware, or stuff like that.
- Instead, we talk about the most common misconceptions when you start doing more serious work:
Basics
Data and files
In many ways, how data works is more important than the computing
- A file is nothing but data - there is nothing special about it
- file extension is just a hint, nothing more
- Files are not tied to certain programs
- The more programs that can open a file, the more useful it is
- Some file formats are designed to be hard to use by other programs - bad for you
- If a file is simple enough that you can process it yourself
- Users and permissions
- On your own computer, you may never realize this
- On a multi-user computer, you need users and permissions
- Found with
ls -l
- In the background, they even exist on personal computers, phones, etc., but it is designed so that they never get in your way
- Size
- Obviously, files have a size
- What's an ideal file size, i.e. when is a file too big to handle and when are (many) small files too many to handle
- Number of files
- Number of files is rarely a concern on own computers
- but can be a concern on shared resources which typically impose quota
- For serious work, you might end up with lots of files (see: files you can control). This usually isn't great
- Files on different OSes
- File systems (NTFS, ext, FAT etc), how about on phones?
- Line endings (CRLF and LF)
- magic number
- MIME
Computing
"How computers work" usually focuses on this
- What is a processor
- Executes instructions
- We don't compete based on clock speeds anymore
- Newer ones tend to be more efficient (lower power, more instructions), thus faster
- Parallel: doing multiple things at once
- Parallel is tricky
- Usually you try to use existing frameworks for certain strategies: mapping, numerical libraries, etc.
- Just because you have more processors doesn't mean you can use them
- Concurrency vs parallelism
- Real scaling up: distributing strategies
- multi-threading
- multi-process
- MPI
- Requirements to use a batch system
- Amdahl's law
- related: as you scale the system/problem, new/unexpected bottlenecks may appear, "O(n2)"
Networking
- Schmidt's law: the network becomes the computer
- You need to be able to efficiently work across all computers on a network to take advantage of all tools available
- Server: provides functionality for clients that connect.
- Can A connect to B?
- Networked data storage
- absolutely transformative
- Makes it easy to transfer files
- Mounting a filesystem makes it available as local files: saves you from even having to transfer files
- The more places data is available, the more useful it is
- Common cases: shared home directories. shared project directories
Memory?
- swapping
- memory leaks
- what is a memory corruption?
- what is a segfault?
- what is a core dump?
- memory often shows up as bottleneck once we increase system size (example: list comprehension vs generator in Python)
Operating systems and interfaces
- Windows, Linux, Mac
- These days, the differences between them are decreasing
- Graphical vs command line interface
- Graphical: easy to get started or do new things, options presented to you
- Command line: much more powerful, but harder to find things
- web search +
man
to look up command options
Using software
Clearly, you need to tell the computer what to do.
- Consumer programs: someone provides it
- Installing things is hard!
- The fact that people can install software onto own computers has been slowly developed over many years
- But also means it is tightly controlled: not good for
- Installing is often possible for everybody
- EasyBuild/Spack
- Containers
- Often different on laptop and cluster
Programming
- How to read manuals
- Is the worst kind of error is something that appears to work, but is wrong? Be safe and "fail fast"
- It is very helpful to know how to make things not horribly inefficient. It is rarely important to know how to make things as fast as possible.
- making things more efficient (in time and space) often makes code more complicated (more difficult to understand for others)
- Why is programming so important?
- You often need to do things which no one else has done
- You need to do the same thing over and over again
- Maybe you can get by with less knowledge, but it is much less efficient.
- "Fortran is always fast, Python is always slow"
- "one language is better than another" - on choosing languages
- Can programming be creative?
- What do you like most about programming?
Development environments
- GUI vs console
- ssh
- text editor
Debugging
- Don't give up as soon as you see an error
- Read carefully from both top and bottom
- or from bottom to top sometimes
- Try to identify the key lines and do a web search
- If you aren't sure, just try it. if it breaks, you can try to fix it and then you get more
- Change one thing at a time
- When is a debugger useful?
- Try to halve/reduce the search space
- Go for a walk
Data management
- when you start doing serious work, don't organize by meeting/assignment: you need something that lasts a long time.
Security
Information security is a big topic, but is more than you think.
- Confidentiality
- Prevent access to information
- What you usually think of for "security"
- Integrity
- Is the information correct?
- Example: change hospital records to remove information on allergies to medicines
- Example: misinformation
- Availability
- Can you access information when you need it?
- Example: Lose the laptop that has your thesis on it.
- is this a good moment to talk about backup?
- This is perhaps the most common security problem in research! We have seen so many times people having horribla
Email
- Phishing: don't believe anything that comes to you, unless you have some particular reason to expect it. If someone asks you to do something, you can always just not.
How you learn
How do you learn all of this?
- try it, if you break then fix it and you learned two things
- most really dangerous things are clear
- trying to solve a real problem instead of a "hello world" example
- maybe more motivating
- once you get it to work you solved a real problem and you learned something new
- Reading documentation
- Follow or get involved with other projects