Imagine you are a data scientist (as many of us are or have become). The systems you build typically require many data sources and many packages (machine learning/data mining, data management, and visualization) to run. Your working configuration consists of a set of packages, each at a particular version. You want to update some packages (software or data) to the most recent versions possible, but you also want your system to run after the upgrades, which may entail changing the versions of other packages. One approach is to hope that the latest versions of all packages work together. If that fails, the fallback is manual trial and error, which quickly ends in frustration. We advocate a provenance-style approach in which tools like ptrace enable us to identify version combinations of different packages. Package and version management tools such as pip, GitHub, and VirtualEnv then enable us to fetch particular versions of packages and try them in a sandbox-like environment. Because the space of version combinations to explore grows exponentially with the number of packages, we have developed a memoizing algorithm that avoids exponential search while still finding an optimum version combination. Heuristics combined with certain empirical facts about packages (e.g. local upward compatibility) improve performance further still. We present experimental results on well-known packages used in data science to illustrate the effectiveness of our approach.
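To make the search problem concrete, here is a minimal Python sketch of a memoized best-first search over version combinations. It is not the algorithm presented in the talk, and it can still test exponentially many combinations in the worst case; it only illustrates the idea of never testing the same combination twice while preferring newer versions. The function name `find_newest_working`, the `works` predicate (which in practice might install the combination in a sandbox and run the system's tests), the scoring rule (sum of version indices), and the package names `libA` and `libB` are all assumptions made for illustration.

```python
import heapq
from typing import Callable, Dict, List, Optional


def find_newest_working(
    versions: Dict[str, List[str]],
    works: Callable[[Dict[str, str]], bool],
) -> Optional[Dict[str, str]]:
    """Return a working combination with the highest total version index.

    `versions` maps each package to its versions, ordered oldest to newest.
    `works` is a hypothetical predicate that tests one full combination.
    """
    names = sorted(versions)
    # Start from the newest version of every package.
    start = tuple(len(versions[n]) - 1 for n in names)
    heap = [(-sum(start), start)]
    seen = {start}  # memo: combinations already queued or tested
    while heap:
        _, idx = heapq.heappop(heap)
        config = {n: versions[n][i] for n, i in zip(names, idx)}
        if works(config):
            return config
        # Otherwise downgrade one package by one step and queue the result.
        for pos, i in enumerate(idx):
            if i > 0:
                nxt = idx[:pos] + (i - 1,) + idx[pos + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-sum(nxt), nxt))
    return None


# Hypothetical example: libB 2.0 is assumed to break with libA 1.2.
example_versions = {"libA": ["1.0", "1.1", "1.2"], "libB": ["1.0", "2.0"]}


def example_works(config: Dict[str, str]) -> bool:
    return not (config["libA"] == "1.2" and config["libB"] == "2.0")


best = find_newest_working(example_versions, example_works)
print(best)
```

Because candidates are popped in order of decreasing total version index, the first working combination found is optimal under that (assumed) score, and the `seen` set guarantees each combination is tested at most once.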
Dennis Shasha is a professor of computer science at the Courant Institute of New York University and an Associate Director of NYU Wireless. He works with biologists on pattern discovery for network inference; with computational chemists on algorithms for protein design; with physicists and financial people on algorithms for time series; on clocked computation for DNA computing; and on computational reproducibility. Other areas of interest include database tuning as well as tree and graph matching. Because he likes to type, he has written six books of puzzles about a mathematical detective named Dr. Ecco, a biography about great computer scientists, and a book about the future of computing. He has also written five technical books about database tuning, biological pattern recognition, time series, DNA computing, resampling statistics, and causal inference in molecular networks. He has co-authored over eighty journal papers, seventy conference papers, and twenty-five patents. He has written the puzzle column for various publications including Scientific American, Dr. Dobb's Journal, and the Communications of the ACM. He is a fellow of the ACM and an INRIA International Chair.
Helena Isabel de Jesus Galhardas