Would https://syncthing.net/ work for syncing home?
- Maybe. Syncthing handles each file individually so if you require folder consistency it's not the best option. As an example, it's not recommended to store git repositories in syncthing shared folders if they are being used from multiple locations simultaneously.
It makes a symlink out of the file? What does that mean for general use?
- It's like having multiple names for the same file
- another usecase: same file which is really large and you don't want to fill the disk with multiple copies of the same thing (see also hardlink vs symlink)
- yet another usecase: maybe your code always saves data to
data/, but on one of your computers/servers, you might want to save it somewhere else (without changing code), you can then symlink
data/ on that computer to point to an external drive or wherever
- -> Awesome, thanks for extensive answer and usage example !
If the files are not stored in git, are there locations were we can store the files? (for free?)
- we will discuss that but to my limited knowledge git-annex can connect also to storage in the cloud by any of the cloud providers
- :heart: any key-value store can be used to store the data
Is there any compatibility layer to use (for instance) storage provided by git-lfs via git-annex? (both GitHub and GitLab provide git-lfs integration)
- I don't know the answer but I will ask it in a moment ...
- I saw git-lfs mentioned in the special remotes list. So Yay!
So in the current example, if someone would clone the repo, he gets the metadata but could not use the files?
- right, unless the person also has access to the storage of the actual files
- So to upload the actual files I would use 'git add/commit/..' instead of git annex?
- In my understanding, no. But we will get back to that.
- You need to use git-annex commands. git commands will not touch git-annexed files. Git only sees the symlink.
- or store them somewhere else and link them?
- yay magic!
A word of caution,
git annex sync also syncronizes the history of all git remotes. If one of the remotes is a github/gitlab repository that reacts to special keywords in commits (e.g. fixes #issue-number) issues can be automatically closed by accident. True story on gitlab.com.
git annex web if you can get there.
- what does it do? (in case we run out of time)
- launches a web interface to manage remotes and synchronization
About encryption, if the repository is public, having the tokens/keys in the metadata would make little sense. Are there common approaches used for open projects with sensitive data?
Related to the previous is encryption a requirement or entirely optional?
- optional. it can be none, shared, or public key
great questions which we have discussed on coderefinery.zulipchat.com (chat stream #RSHour) just before this session. hopefully we will have time to discuss these here
- Ah I missed that one. Thanks!
can you share the instructions on how to set this up on allas? This is awesome! I did not know that this was possible
- we will share instructions on this and attach them to the session notes on https://researchsoftwarehour.github.io/
- So, yes, this was quite hard because CSC doesn't provide generic instructions for Allas, so I had to reverse-engineer it
- Get the allas_conf script: https://docs.csc.fi/data/Allas/accessing_allas/
- Install the other dependencies needed to make that work
- Run the script in
source allas_conf --mode=s3cmd
export the environment variables, so that you can source it in the next step.
source ~/.aws/credentials. This should set the environment variables
- (Wonder why the above steps are so hard... why doesn't CSC provide a direct way to get those tokens?)
- The git-annex S3 special remote then uses the
AWS_SECRET_ACCESS_KEY (this is very clear from the docs)
git annex initremote allas type=S3 encryption=shared host=a3s.fi bucket=ga-demo-2 protocol=https port=443 (this is clearly documented in git-annex here, but was very hard to figure out what the Allas parameters are)
- Now, you have the
allas special remote you can push/pull data from.
- Allas also has a Swift interface, which git-annex can also use.
- There are also differences in how S3 and swift manage their tokens and authentication, so you actually have to make some choices before following the above instructions. Consult with us and we can consider the possibilities.
Out of curiosity: Do you have to pay for this cloud space or how does it work?
- in a way its free for Finnish Researchers ( you can apply for Billing units as the PI of a research project)
- Thanks, not applicable to me. :-)
- not sure if you can apply for this when not from Finland. But I believe there is similar for many countries
does this use s3cmd in background? or swift or..?
- might use a haskell library for s3 stuff
In this case, only people that also have access to the Allas bucket can access the data?
What is the largest file or largest amount of files/data you have stored in a single git-annex repository?
- limit with number of files seems to be limited by git
- size probably limited by the actual storage
- git-annex can do chunking
If I already have lots of data in an Allas bucket, can I then also link it with my repository?
- Yes, via the "import tree" function. I'm not yet an full expert in it, but the idea is you can export a view to a special remote (files/S3/etc) and re-import it. It can somehow see what has changed.
- Commands that made it work:
git annex initremote tree importtree=yes exporttree=yes type=directory directory=/home/rkdarst/git/ga-demo-tree/ encryption=none
git annex import master --from tree
git merge tree/master
- The special remote has to be set up as an
exporttree=yes remote, then
git annex import and
git annex export and get a normal view of files from it.
Thank you very much. I am looking forward to more RSH sessions.