---
title: GNU Guix in an HPC environment
date: 2015/04/17
tags: free software, bioinformatics, system administration, packaging, cluster
---

I spend my daytime hours as a system administrator at a research
institute in a heterogeneous computing environment. We have two big
compute clusters (one running CentOS, the other Ubuntu) with about 100
nodes each and dozens of custom GNU/Linux workstations. A common task
for me is to ensure that users can run their bioinformatics software,
both on their workstations and on the clusters. Only a few
bioinformatics tools and libraries are popular enough to have been
packaged for CentOS or Ubuntu, so usually some work has to be done to
build the applications and all of their dependencies for the target
platforms.

## How to waste time building and deploying software

In theory compiling software is not a very difficult thing to do.
Once all development headers have been installed on the build host,
compilation is usually a matter of configuring the build with a
configure script and running GNU make with various flags (this is an
assumption which is violated by bioinformatics software on a regular
basis, but let's not get into this now). However, there are practical
problems that become painfully obvious in a shared environment with a
large number of users.

### Naive compilation

Compiling software directly on the target machine is an option only in
the most trivial cases. With more complicated build systems or
complicated build-time dependencies there is a strong incentive for
system administrators to do the hard work of setting up a suitable
build environment for a particular piece of software only once. Most
people would agree that package management is a great step up from
naive compilation, as the build steps are formalised in some sort of
recipe that can be executed by build tools in a reproducible manner.
Updates to software only require tweaks to these recipes. Package
management is a good thing.

### System-dependence

Non-trivial software that was built and dynamically linked on one
machine with a particular set of libraries and header files at
particular versions can only really work on a system with the very
same libraries at compatible versions in place. Established package
managers allow packagers to specify hard dependencies and version
ranges, but the binaries that are produced on the build host will only
work under the constraints imposed on them at build time. To support
an environment in which software must run on, say, both CentOS 6.5 and
CentOS 7.1, the packages must be built in both environments and
binaries for both targets have to be provided.

There are ways to emulate a different build environment (e.g. Fedora's
`mock`), but we cannot get around the fact that dynamically
linked software built for one kind of system will only ever work on
that very kind of system. At runtime we can change which libraries
will be dynamically loaded, but this is a hack that pushes the problem
from package maintainers to users. Running software with
`LD_LIBRARY_PATH` set is not a solution, nor is static linking, which
amounts to copying chunks of libraries into the binary at build time.

### Version conflicts

Libraries and applications that come pre-installed or pre-packaged
with the system may not be the versions a user claims to need. Say, a
user wants the latest version of GCC to compile code using new
language features specified in C++11 (e.g. anonymous functions). Full
support for C++11 arrived in GCC 4.8.1, yet on CentOS 6.5 only version
4.4.7 is available through the repositories. The system administrator
may not necessarily be able to upgrade GCC system-wide. Or maybe
other users on a shared system do need version 4.4.7 to be available
(e.g. for bug-compatibility). There is no easy way to satisfy all
users, so a system administrator might give up and let users build
their own software in their home directories instead of solving the
problem.

However, compiling GCC is a daunting task for a user and they really
shouldn't have to do this at all. We already established that package
management is a good thing; why should we deny users the benefits of
package management? Traditional package management techniques are
ill-suited to the task of installing multiple versions of applications
or libraries into independent prefixes. RPM, for example, allows
users to maintain a local, independent package database, but `yum`
won't work with multiple package databases. Additionally, only *one*
package database can be used at once, so a user would have to
re-install system libraries into the local package database to satisfy
dependencies. As a result, users lose the important feature of
automatic dependency resolution.

### Interoperability

A system administrator who decides to package software as relocatable
RPMs, to install the applications to custom prefixes and to maintain a
separate repository has nothing to show for it when a user asks to have
the packaged software installed on an Ubuntu workstation. There are
ways to convert RPMs to DEB packages (with varying degrees of
success), but it seems silly to have to convert or rebuild stuff
repeatedly when the software, its dependencies and its mode of
deployment really didn't change at all.

What happens when a Slackware user comes along next? Or someone using
Arch Linux? Sure, as a system administrator you could refuse to
support any system other than CentOS 7.1, users be damned.
Traditionally, it seems that system administrators default to this
style for convenience and/or practical reasons, but I consider this
unhelpful and even somewhat oppressive.


## Functional package management with GNU Guix

Luckily I'm not the only person to consider traditional packaging
methods inadequate for a number of valid purposes. There are
different projects aiming to improve and simplify software deployment
and management, one of which I will focus on in this article. As a
functional programmer, Scheme aficionado and free software enthusiast,
I was intrigued to learn about
[GNU Guix](https://www.gnu.org/software/guix/), a functional package
manager written in
[Guile Scheme](https://www.gnu.org/software/guile/), the designated
extension language for the [GNU system](https://www.gnu.org/).

In purely functional programming languages a function will produce the
very same output when called repeatedly with the same input values.
This allows for interesting optimisations, but most importantly it
makes it *possible* and in some cases even *easy* to reason about the
behaviour of a function. A pure function is independent of global
state, has no side effects, and its outputs can be cached as they are
certain not to change as long as the inputs stay the same.

Functional package management lifts this concept to the realm of
software building and deployment. Global state in a system equates to
system-wide installations of software, libraries and development
headers. Side effects are changes to the global environment or global
system paths such as `/usr/bin/`. To reject global state means to
reject the common file system hierarchy for software deployment and to
use a minimal `chroot` for building software. The introduction of the
Guix manual describes the approach as follows:

> The term "functional" refers to a specific package management
> discipline. In Guix, the package build and installation process is
> seen as a function, in the mathematical sense. That function takes
> inputs, such as build scripts, a compiler, and libraries, and
> returns an installed package. As a pure function, its result
> depends solely on its inputs—for instance, it cannot refer to
> software or scripts that were not explicitly passed as inputs. A
> build function always produces the same result when passed a given
> set of inputs. It cannot alter the system’s environment in any way;
> for instance, it cannot create, modify, or delete files outside of
> its build and installation directories. This is achieved by running
> build processes in isolated environments (or "containers"), where
> only their explicit inputs are visible.

> The result of package build functions is "cached" in the file
> system, in a special directory called "the store". Each package is
> installed in a directory of its own, in the store—by default under
> ‘/gnu/store’. The directory name contains a hash of all the inputs
> used to build that package; thus, changing an input yields a
> different directory name.
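
To make the idea of a store directory named after its inputs more
concrete, here is a minimal Python sketch. It is an illustration only,
not how Guix computes store paths: the `store_path` helper, the JSON
serialisation, the truncated SHA-256 digest and the package versions
are all assumptions made up for this example.

```python
import hashlib
import json


def store_path(name, version, inputs, store="/gnu/store"):
    """Derive a directory name from everything that goes into a build.

    Hypothetical helper for illustration; Guix hashes differently.
    """
    # Serialise the inputs deterministically so equal inputs hash equally.
    payload = json.dumps(
        {"name": name, "version": version, "inputs": inputs},
        sort_keys=True,
    ).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:32]
    return f"{store}/{digest}-{name}-{version}"


# Made-up inputs: a build script, a compiler and a library.
inputs_a = {"builder": "gnu-build-system", "gcc": "4.8.4", "zlib": "1.2.7"}
inputs_b = dict(inputs_a, zlib="1.2.8")  # same package, one input changed

path_a = store_path("samtools", "1.2", inputs_a)
path_a_again = store_path("samtools", "1.2", dict(inputs_a))
path_b = store_path("samtools", "1.2", inputs_b)

print(path_a == path_a_again)  # same inputs, same directory
print(path_a == path_b)        # changed input, different directory
```

Running it prints `True` and then `False`: identical inputs map to the
same directory, so an existing result can be reused, while a change to
any input leads to a new directory next to the old one.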

### Isolated, yet shared

Note that the package outputs are still dynamically linked. Libraries
are referenced in the binaries with their full store paths using the
runpath feature. These package outputs are not self-contained,
monolithic application directories as you might know them from MacOS.

Any built software is cached in the store, which is shared by all users
system-wide. However, by default the software in the store has no
effect whatsoever on the users' environments. Building software and
having the results stored in `/gnu/store` does not alter any global
state; no files pollute `/usr/bin/` or `/usr/lib/`. Any effects are
restricted to the package's single output directory inside
`/gnu/store`.

Guix provides per-user profiles to map software from the store into a
user environment. The store provides deduplication as it serves as a
cache for packages that have already been built. A profile is little
more than a "forest" of symbolic links to items in the store. The
union of links to the outputs of all software packages the user
requested makes up the user's profile. By adding another layer of
symbolic link indirection, Guix allows users to seamlessly switch
among different generations of the same profile, going back in time.

User profiles are completely isolated from one another, making it
possible for different users to have different versions of GCC
installed. Even one and the same user could have multiple profiles
with different versions of GCC and switch between them as needed.
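
The two layers of symbolic links can be mimicked with a short Python
sketch that runs in a temporary directory on a POSIX system.
Everything in it is hypothetical, including the directory layout, the
`profile-N-link` naming and the fake `gcc` files; it only illustrates
the principle that a profile points to a generation, and a generation
points into the store.

```python
import os
import tempfile

root = tempfile.mkdtemp()
store = os.path.join(root, "store")


def add_store_item(name):
    """Create a fake store directory holding one stand-in 'gcc' file."""
    bindir = os.path.join(store, name, "bin")
    os.makedirs(bindir)
    with open(os.path.join(bindir, "gcc"), "w") as f:
        f.write(name)
    return os.path.join(store, name)


def make_generation(number, items):
    """Build one generation as a directory of symlinks into the store."""
    gen_bin = os.path.join(root, f"profile-{number}-link", "bin")
    os.makedirs(gen_bin)
    for item in items:
        for prog in os.listdir(os.path.join(item, "bin")):
            os.symlink(os.path.join(item, "bin", prog),
                       os.path.join(gen_bin, prog))
    return os.path.dirname(gen_bin)


def switch_to(generation_dir):
    """Repoint the 'profile' symlink at the given generation."""
    profile = os.path.join(root, "profile")
    tmp = profile + ".tmp"
    os.symlink(generation_dir, tmp)
    os.replace(tmp, profile)  # rename the new link over the old one
    return profile


def active_gcc(profile):
    """Read the stand-in 'gcc' reached through the profile's links."""
    with open(os.path.join(profile, "bin", "gcc")) as f:
        return f.read()


old = add_store_item("aaaa-gcc-4.4.7")
new = add_store_item("bbbb-gcc-4.8.1")

gen1 = make_generation(1, [old])
gen2 = make_generation(2, [new])

profile = switch_to(gen2)
print(active_gcc(profile))  # bbbb-gcc-4.8.1
profile = switch_to(gen1)   # roll back to the first generation
print(active_gcc(profile))  # aaaa-gcc-4.4.7
```

Nothing in the fake store is copied or deleted when switching; only
one link changes, which is why going back to an earlier generation is
cheap.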

Guix takes the functional packaging method seriously, so except for
the running kernel and the exposed machine hardware there are
virtually no dependencies on global state (i.e. system libraries or
headers). This also means that the Guix store is populated with the
complete dependency tree, down to the kernel headers and the C
library. As a result, software in the Guix store can run on very
different GNU/Linux distributions; a shared Guix store allows me to
use the very same software on my Fedora workstation, as well as on the
Ubuntu cluster, and on the CentOS 6.5 cluster.
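
What "the complete dependency tree" means for a single package can be
sketched as a transitive closure over a graph of references. The graph
below is a made-up, heavily simplified example, and the `closure`
function is a hypothetical helper rather than part of Guix.

```python
def closure(package, references):
    """Return the package plus everything it depends on, transitively."""
    seen = set()
    pending = [package]
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        # Queue the direct references of the item we just visited.
        pending.extend(references.get(current, []))
    return seen


# Hypothetical, heavily simplified dependency graph.
references = {
    "samtools": ["zlib", "ncurses", "glibc"],
    "zlib": ["glibc"],
    "ncurses": ["glibc"],
    "glibc": ["linux-libre-headers"],
    "linux-libre-headers": [],
}

print(sorted(closure("samtools", references)))
```

Because every member of such a closure lives in the store, nothing in
it has to be provided by the host distribution.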

This means that software only has to be packaged up once. Since
package recipes are written in a very declarative domain-specific
language on top of Scheme, packaging is surprisingly simple (and, to
this Schemer, rather enjoyable).

### User freedom

Guix liberates users from the software deployment decisions of their
system administrators by giving them the power to build software into
an isolated directory in the store using simple package recipes.
Administrators only need to configure and run the Guix daemon, the
core piece running as root. The daemon listens for requests issued by
the Guix command line tool, which can be run by users without root
permissions. The command line tool allows users to manage their
profiles, switch generations, and build and install software through
the Guix daemon. The daemon takes care of the store, of evaluating the
build expressions and "caching" build results, and it updates the
forest of symbolic links to update profile state.

Users are finally free to conveniently manage their own software,
something they could previously only do in a crude manner by compiling
manually.


## Using a shared Guix store

Guix is not designed to be run in a centralised manner. A Guix daemon
is supposed to run on each system as root, and it listens for RPCs from
local users only. In an environment with multiple clusters and
multiple workstations this approach requires considerable effort to
make it work correctly and securely.

Instead we opted to run the Guix daemon on a single dedicated server,
writing profile data and store items onto an NFS share. The cluster
nodes and workstations mount this share read-only. Although this
means that users lose the ability to manage their profiles directly on
their workstations and on the cluster nodes (because they have no
local installation of the Guix client or the Guix daemon, and because
they lack write access to the shared store), their software profiles
are now available wherever they are. To manage their profiles, users
log on to the Guix server, where they can install software into
their profiles, roll back to previous versions or send other queries
to the Guix daemon. (At some point I think it would make sense to
enhance Guix such that RPCs can be made over SSH, so that explicit
logging on to a management machine is no longer necessary.)


## Guix as a platform for scientific software

Since winter 2014 I have been packaging software for GNU Guix, which
has since accumulated quite a few common and obscure
[bioinformatics tools and libraries](https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/bioinformatics.scm).
A list of software (updated daily) available through Guix is
[available here](https://www.gnu.org/software/guix/package-list.html).
We also have common Python modules for scientific computing, as well
as programming languages such as R and Julia.

I think GNU Guix is a great platform for scientific software in
heterogeneous computing environments. The Guix project follows the
[Free System Distribution Guidelines](https://gnu.org/distros/free-system-distribution-guidelines.html),
which means that free software is welcome upstream. For software that
imposes additional usage or distribution restrictions (such as when
the original Artistic license is used instead of the Clarified
Artistic license, or when commercial use is prohibited by the license)
Guix allows the use of out-of-tree package modules through the
`GUIX_PACKAGE_PATH` variable. As Guix packages are just Scheme
variables in Scheme modules, it is trivial to extend the official GNU
Guix distribution with package modules by simply setting
`GUIX_PACKAGE_PATH`.

If you want to learn more about GNU Guix I recommend taking a look at
the excellent
[GNU Guix project page](https://www.gnu.org/software/guix/). Feel
free to contact me if you want to learn more about packaging
scientific software for Guix. It is not difficult and we can all
benefit from joining efforts in adopting this usable, dependable,
hackable, and liberating platform for scientific computing with free
software.

The Guix community is very friendly, supportive, responsive and
welcoming. I encourage you to visit the project's
[IRC channel #guix on Freenode](https://webchat.freenode.net?channels=#guix),
where I go by the handle "rekado".