RSH 20: Data preparation and release
About data management plans, tidy data format, and FAIR data release
Video
Collaborative notes taken during the session
Part 1: data management plans
- Why DMPs?
- What is wrong with the historical process?
- I guess this came from the open science movement, pressuring you to
- Should a DMP be fully copy-pasted from best practices, sort of by default?
- What should you actually do?
- First, learn the best way to handle your data
- Then, write a plan for the funder
- Hopefully, use the plan. But as they say, plans are useless but planning is essential
- How do you typically do them (what do your university's instructions say?)
Part 2: tidy data
- I have 2 slides as an example of this; maybe I can manage to create a slide-less example
- Discuss how to store tabular data
- Excel
- standard formats: CSV, NumPy array, NetCDF
- pandas
- Discuss problems with non-tidy data storage
- Tidy data: H. Wickham, "Tidy Data"
- Advantages/disadvantages of tidy data
- Metadata
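The contrast between non-tidy and tidy tabular storage discussed above can be sketched with pandas. This is only an illustration: the table, its column names (`site`, `year`, `temperature`) and its values are made up, and `melt` is just one way to reshape a table.

```python
import pandas as pd

# Hypothetical "wide" (non-tidy) table: one row per site, one column per year.
# The year is a variable, but here it is hidden in the column headers.
wide = pd.DataFrame(
    {
        "site": ["A", "B"],
        "2019": [1.0, 3.0],
        "2020": [2.0, 4.0],
    }
)

# Tidy form: each variable is a column, each observation is a row.
tidy = wide.melt(id_vars="site", var_name="year", value_name="temperature")
tidy["year"] = tidy["year"].astype(int)

# Once the data is tidy, grouping by any variable needs no special handling.
mean_by_year = tidy.groupby("year")["temperature"].mean()
```

The tidy table has more rows than the wide one, which is one of the trade-offs: it is less compact to read by eye, but new years become new rows rather than new columns, so code that reads it does not need to change.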
Part 3: release
- Findable
- Does someone know your data exists only if they find your paper first? Or can they search directly for it?
- What's the minimum required to find it?
- General search engine? Domain-specific search that understands the concepts of the data?
- Accessible
- You can somehow request access to it
- It's in a place that won't disappear later
- In a good repository
- Or maybe, it's in some isolated permanent place, disconnected from a searchable repository
- re3data.org - find a repository
- Is your own website acceptable?
- Is GitHub acceptable?
- Interoperable
- If someone can get value by combining it with other data, can they?
- This is mostly covered in 'preparing data for release'
- If there are no existing standards, you can't make it interoperable
- There may be standards, but they may not be common enough to be useful. It depends on your particular case.
- Or it may not be worth putting that much effort.
- Reusable
- Someone has permission to reuse it
- Licensing
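As a rough sketch of how the points above translate into practice, the snippet below writes a small metadata file next to a dataset and checks that the fields needed for finding and reusing it are present. The field names, the `REQUIRED` list and the helper functions are assumptions made for illustration; they are not a repository's schema, and a real repository will usually ask for its own metadata format.

```python
import json
from pathlib import Path

# Hypothetical descriptive metadata; every value is a placeholder.
metadata = {
    "title": "Example dataset",
    "creators": ["A. Researcher"],
    "description": "What the data is and how it was collected.",
    "keywords": ["example", "placeholder"],  # helps people find the data
    "format": "text/csv",                    # helps people combine it with other data
    "license": "CC-BY-4.0",                  # states the permission to reuse
}

# Assumed minimum set of fields for this sketch, not an official requirement.
REQUIRED = ("title", "creators", "description", "license")


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [key for key in REQUIRED if not record.get(key)]


def write_sidecar(record, directory):
    """Write the record as metadata.json in the given directory and return its path."""
    path = Path(directory) / "metadata.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

Calling `missing_fields(metadata)` before a release gives a quick check that, for example, the license has not been forgotten.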
The main Finnish national funding agency decided to drop the DMP from grant applications. I reviewed grants from the NL and AT national grant agencies and there was no DMP to review.
Given the perseverance of (beginning) researchers in just using an external drive for their project, perhaps the questions on a DMP should be less about which data, and more:
- Where are you going to store the data?
- Who will have access to this drive?
- What will you do when the drive fails?
- What will you do when you accidentally delete something important?
- How will you find your data again in 5 years' time?
- How will you share the data?
Does the DMP take the place of the README files where I would normally put this information?
"Funny" anecdote from Twente University in the NL: all network drives were in a single building. The building manager felt unappreciated and set fire to the building with the intention to become the hero that put out the fire. He couldn't put out the fire. All emails, all researcher and student data, everything was gone, for the entire university. The backup was located in the same building. This was one year before I entered the university as a student :)
- wow
Tidy data
- https://en.wikipedia.org/wiki/Tidy_data
- http://vita.had.co.nz/papers/tidy-data.pdf
- Could you say that if you need a custom parser, it's not tidy data?
- I think it can still be tidy data, but if I need to adapt the parser every few weeks as new data comes in, then it was perhaps not in a "tidy" format
- Just FYI: a webinar by ELIXIR Norway was super useful. Of course, the content focuses very much on the life-science field and the Norwegian context (it was held before, and aimed at, the NCR funding application)
- https://www.re3data.org/
- Was FAIRSharing mentioned? It was recommended in the FAIR DS course by NeIC.
- "TOP 10 FAIR DATA & SOFTWARE THINGS: Research Software" also looks useful.
- Maybe relevant: https://elixir-europe.org/events/webinar-software-management-plans
- Large-scale facilities these days demand a DMP. The user has to provide a plan stating who is going to analyse the data, with what software, and where. Is the infrastructure available?
Next stream
What should we talk about in two weeks?
Possible topics: https://researchsoftwarehour.github.io/
- Python libraries, tricks in numpy, mpi4py. Data science maybe?
- please tell us about your fav python libraries that we can highlight/ test out