Are they using Git LFS, or did they build something else?
And what is their proposed value add over using git directly?
Edit: They say a little more about the large-file handling:
> DVC keeps metafiles in Git instead of Google Docs to describe and version control your data sets and models. DVC supports a variety of external storage types as a remote cache for large files.
So from what they said in the video and what I read on the page, this is probably a limited front-end to make using git easier for people who don't know git.
And in terms of the large-file handling, it sounds like they have implemented the equivalent of git-annex, or maybe they are even using it. I didn't look to see whether they wrote their own or used git-annex.
- Large binary files tend not to be very "deflatable"
- xdelta (used by Git to diff files) tries to load the entire content of a file into memory at once
This is why there are solutions like Git-LFS, where you keep your versions on a remote server / cloud storage and use git to track only the "metadata" (pointer) files.
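For reference, the Git-LFS flavour of that looks roughly like this (the tracked pattern and file name are just examples):

```
# tell LFS which paths are "large"; git then stores only small pointer
# files for them, and the real content goes to the LFS server
git lfs install
git lfs track "*.bin"
git add .gitattributes model.bin
git commit -m "track model weights via LFS pointers"
```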
DVC implemented its own solution in order to be SCM-agnostic and cloud-flexible (supporting different remote storages).
Here's more info comparing DVC to similar/related technologies: https://dvc.org/doc/dvc-philosophy/related-technologies
DVC basically symlinks those big files and checks in the links (small metafiles).
It can also download those files from GCS/S3, and it tracks which file came from where (e.g. if you generated output.data using input.data, then whenever input.data changes, DVC can detect that output.data needs to be regenerated as well).
That's my understanding.
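If that's right, the day-to-day usage would look roughly like this (file and script names are made up, and the exact flags/metafile names can differ between DVC versions):

```
# put the big file under DVC control: it moves into DVC's cache and is
# linked back into the workspace; only the small .dvc metafile goes to git
dvc add input.data
git add input.data.dvc .gitignore
git commit -m "track raw data with DVC"

# declare how output.data is produced, so DVC knows its dependencies
dvc run -d input.data -d process.py -o output.data python process.py

# later: if input.data or process.py changes, repro regenerates output.data
dvc repro output.data.dvc
```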
To my understanding you could do the same with Docker. E.g. if you COPY your input files into the image, rebuilding the image only re-runs that step if the input files changed (thanks to layer caching).
Also, Docker has an overhead: a copy of the data needs to be created, while DVC just saves links (symlinks, hard-links or reflinks) with minimal overhead. That is crucial when you work with multi-GB datasets.
> Docker can help only if there is a single step in your project. In ML projects you usually have many steps - Preprocess, Train. Each of the steps can be divided: extract Evaluate step from Train etc.
Yeah, this is something I've been struggling with. In a project I'm working on I use docker build to 1) set up the environment, 2) get canonical datasets, and 3) pre-process the datasets. However, I've left reproducing as manual instructions, e.g. run the container, call script 1 to repro experiment 1, call script 2 to repro experiment 2, etc. I think I could improve this by providing `--entrypoint` at docker run, or by providing a docker-compose file (wherein I could specify an entrypoint) for each experiment.
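Concretely I was thinking of something like this (image and script names are made up):

```
# one image with the environment, canonical data and pre-processing baked in
docker build -t myproject .

# then one command per experiment, overriding the entrypoint at run time
docker run --rm --entrypoint python myproject scripts/experiment1.py
docker run --rm --entrypoint python myproject scripts/experiment2.py
```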
What do you think are the generalizability pitfalls in this workflow? How could DVC help?
> Also, Docker has an overhead: a copy of the data needs to be created, while DVC just saves links (symlinks, hard-links or reflinks) with minimal overhead. That is crucial when you work with multi-GB datasets.
1 - git-based management (not storage) of data files used in ML experiments;
2 - lightweight pipelines integrated with git to allow reproducibility of outputs and intermediates;
3 - integrating git with experimentation.
If you've worked on teams building ML products, this is something you've at least half-built internally. So you can share outputs internally with tracked lineage showing how to repro them. Plus the pipeline management.
We've started using it as a replacement for Git LFS for various projects internally, not specifically for data science, and we're very happy with it. It works like a charm on Linux, Mac and Windows.
Also, we realised we don't really need word- or even line-level diffs; we just want to know which files have been modified (e.g. large binaries). So maybe we shouldn't have started with Git LFS in the first place.
DVC allowed us to keep everything in our monorepo, in sync, without requiring people to install Git LFS before they clone the repo. You don't have to pull the large files if you don't want to or don't need them for your own work. In general I think you're more flexible in terms of local and remote caching and sharing of these large files. If the network is an issue (technically or cost-wise) it's pretty useful.
I am sure there are more reasons for and against DVC, but it worked surprisingly well for us, the support on GitHub is super responsive, and so far we couldn't find a reason against it for our use case.
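To give a feel for the day-to-day usage (bucket and file names are made up, and the syntax may differ between DVC versions):

```
# point DVC at shared storage; we happen to use S3, but GCS/SSH/etc. work too
dvc remote add -d storage s3://our-bucket/dvc-cache
git add .dvc/config
git commit -m "configure DVC remote"

# upload the large file contents; git only ever sees the small .dvc metafiles
dvc push

# a colleague clones the repo and pulls only the artifacts they actually need
dvc pull assets/huge_model.bin.dvc
```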
You can navigate branches and download the data, models, and intermediate pipeline files from a shared team S3, GCS, HDFS, or plain NFS or SSH server, as they were at a specific commit.
You can also compare metrics across branches to compare different experiments, etc.
A team member can check out a branch, immediately get the relevant files which were already computed by someone else, modify the training code, reproduce the out-of-date parts of the pipeline using `dvc repro`, and then `git commit` the resulting metrics + `dvc push` the resulting model back to the shared team storage.
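Roughly (branch and file names are made up):

```
git checkout experiment-lstm    # switch to a teammate's experiment branch
dvc pull                        # fetch the data/models they already computed
# ... edit train.py ...
dvc repro train.dvc             # re-run only the stages whose deps changed
git add train.dvc metrics.json
git commit -m "tweak training"  # small metafiles + metrics go into git
dvc push                        # new model artifact goes to shared storage
```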
1. Code
2. Configuration (libraries etc.)
3. Input/training data
1 and 2 are easily solved with Git and Docker respectively, although you would need some tooling to keep track of the various versions in a given run. 3 doesn't have such an obvious answer.
According to the site, DVC uses object storage for the input data, but that leads to a few questions:
1. Why wouldn't I just use Docker and Git + Git LFS to do all of this? Is DVC just a wrapper for these tools?
2. Why wouldn't I just version control the query that created the data along with the code that creates the model?
3. What if I'm working on a large file and make a one byte change? I've never come across an object store that can send a diff, so surely you'd need to retransmit the whole file?
1. DVC does dependency tracking in addition to that. It is like a lightweight ML pipelines tool or an ML-specific Makefile (sketch below). Also, DVC is simply faster than LFS, which is critical in 10GB+ cases.
2. This is a great case. However, in some scenarios, you would prefer to store the query output along with the query and DVC helps with that.
3. Correct, there are no data diffs. DVC just stores blobs and you can GC the old ones - https://dvc.org/doc/commands-reference/gc
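To make points 1 and 3 a bit more concrete (made-up file names, and flags may vary by DVC version):

```
# each stage is like a Makefile rule: dependencies in, outputs out
dvc run -d raw.csv -d clean.py -o clean.csv python clean.py
dvc run -d clean.csv -d train.py -o model.pkl python train.py

# re-runs only the stages whose dependencies actually changed
dvc repro model.pkl.dvc

# drop cached blobs that are no longer referenced
dvc gc
```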
Have you looked into using content-defined chunking (a-la restic or borgbackup) so that you get deduplication without the need to send around diffs? This is related to a problem that I'm working on solving in OCI (container) images.
Also, can I combine DVC with a pipeline tool like Apache Airflow?
This is like assigning a random seed to a DB :)
Sure, some teams combine DVC with Airflow. It gives a clear separation between engineering (reliability) and data science (lightweight, quick iteration). A recent discussion about this: https://twitter.com/FullStackML/status/1091840829683990528
As I understand it, the idea is to:
- use a git branch for each experiment (change of hyperparameters etc.)
- define pipeline stages (preprocessing, train/test split, model training, model validation)
- after these steps you can change any part of the pipeline (say data preprocessing or model parameters) and run `dvc repro` to reproduce all stages whose dependencies changed, and track metrics for all branches (see the sketch below), which is pretty cool and reduces experiment logs in the wiki
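The metrics part looks something like this (made-up file names; the exact flags depend on the DVC version):

```
# declare a metrics file as part of the training stage
dvc run -d train.data -d train.py -o model.pkl -M metrics.json python train.py

# then compare the tracked metrics across all experiment branches at once
dvc metrics show -a
```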
I'm not sure I buy the pipeline and repro functionality as useful. I'd rather see nice integration with Docker since it can be used to define the environment as well as repro steps.
There is an ongoing discussion on DVC's GitHub about dataset tracking and tags (https://github.com/iterative/dvc/issues/1487) and some discussions in the DVC Discord channel: https://dvc.org/chat