---
title: GNU Guix in an HPC environment
date: 2015/04/17
tags: free software, bioinformatics, system administration, packaging, cluster
---

I spend my daytime hours as a system administrator at a research
institute in a heterogeneous computing environment.  We have two big
compute clusters (one on CentOS, the other on Ubuntu) with about 100
nodes each and dozens of custom GNU/Linux workstations.  A common task
for me is to ensure the users can run their bioinformatics software,
both on their workstations and on the clusters.  Only a few
bioinformatics tools and libraries are popular enough to have been
packaged for CentOS or Ubuntu, so usually some work has to be done to
build the applications and all of their dependencies for the target
platforms.

## How to waste time building and deploying software

In theory, compiling software is not a very difficult thing to do.
Once all development headers have been installed on the build host,
compilation is usually a matter of configuring the build with a
configure script and running GNU make with various flags (this is an
assumption which is violated by bioinformatics software on a regular
basis, but let's not get into this now).  However, there are practical
problems that become painfully obvious in a shared environment with a
large number of users.

### Naive compilation

Compiling software directly on the target machine is an option only in
the most trivial cases.  With more complicated build systems or
complicated build-time dependencies there is a strong incentive for
system administrators to do the hard work of setting up a suitable
build environment for a particular piece of software only once.  Most
people would agree that package management is a great step up from
naive compilation, as the build steps are formalised in some sort of
recipe that can be executed by build tools in a reproducible manner.
Updates to software only require tweaks to these recipes.  Package
management is a good thing.

### System-dependence

Non-trivial software that was built and dynamically linked on one
machine with a particular set of libraries and header files at
particular versions can only really work on a system with the very
same libraries at compatible versions in place.  Established package
managers allow packagers to specify hard dependencies and version
ranges, but the binaries that are produced on the build host will only
work under the constraints imposed on them at build time.  To support
an environment in which software must run on, say, both CentOS 6.5 and
CentOS 7.1, the packages must be built in both environments and
binaries for both targets have to be provided.

There are ways to emulate a different build environment (e.g. Fedora's
`mock`), but we cannot get around the fact that dynamically
linked software built for one kind of system will only ever work on
that very kind of system.  At runtime we can change what libraries
will be dynamically loaded, but this is a hack that pushes the problem
from package maintainers to users.  Running software with
`LD_LIBRARY_PATH` set is not a solution, nor is static linking, which
amounts to copying chunks of libraries into the binary at build time.

### Version conflicts

Libraries and applications that come pre-installed or pre-packaged
with the system may not be the versions a user claims to need.  Say, a
user wants the latest version of GCC to compile code using new
language features specified in C++11 (e.g. anonymous functions).  Full
support for C++11 arrived in GCC 4.8.1, yet on CentOS 6.5 only version
4.4.7 is available through the repositories.  The system administrator
may not necessarily be able to upgrade GCC system-wide.  Or maybe
other users on a shared system do need version 4.4.7 to be available
(e.g. for bug-compatibility).  There is no easy way to satisfy all
users, so a system administrator might give up and let users build
their own software in their home directories instead of solving the
problem.

However, compiling GCC is a daunting task for a user and they really
shouldn't have to do this at all.  We already established that package
management is a good thing; why should we deny users the benefits of
package management?  Traditional package management techniques are
ill-suited to the task of installing multiple versions of applications
or libraries into independent prefixes.  RPM, for example, allows
users to maintain a local, independent package database, but `yum`
won't work with multiple package databases.  Additionally, only *one*
package database can be used at once, so a user would have to
re-install system libraries into the local package database to satisfy
dependencies.  As a result, users lose the important feature of
automatic dependency resolution.

### Interoperability

A system administrator who decides to package software as relocatable
RPMs, to install the applications to custom prefixes and to maintain a
separate repository has nothing to show for it when a user asks to have
the packaged software installed on an Ubuntu workstation.  There are
ways to convert RPMs to DEB packages (with varying degrees of
success), but it seems silly to have to convert or rebuild stuff
repeatedly when the software, its dependencies and its mode of
deployment really didn't change at all.

What happens when a Slackware user comes along next?  Or someone using
Arch Linux?  Sure, as a system administrator you could refuse to
support any system other than CentOS 7.1, users be damned.
Traditionally, it seems that system administrators default to this
style for convenience and/or practical reasons, but I consider this
unhelpful and even somewhat oppressive.


## Functional package management with GNU Guix

Luckily I'm not the only person to consider traditional packaging
methods inadequate for a number of valid purposes.  There are
different projects aiming to improve and simplify software deployment
and management, one of which I will focus on in this article.  As a
functional programmer, Scheme aficionado and free software enthusiast,
I was intrigued to learn about GNU Guix, a functional package manager
written in Guile Scheme, the designated extension language for the GNU
system.

In purely functional programming languages a function will produce the
very same output when called repeatedly with the same input values.
This allows for interesting optimisations, but most importantly it
makes it *possible* and in some cases even *easy* to reason about the
behaviour of a function.  Such a function is independent of global
state, has no side effects, and its outputs can be cached as they are
certain not to change as long as the inputs stay the same.
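
To make the caching argument concrete, here is a minimal Python
sketch.  The `build` function and the package and compiler names are
made up for illustration; the only point is that a function whose
result depends solely on its arguments can be memoised safely.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def build(source: str, compiler: str) -> str:
    """Stand-in for a pure build step: the result depends only on the arguments."""
    return f"{source} built with {compiler}"


first = build("hello-2.10", "gcc-4.9")
second = build("hello-2.10", "gcc-4.9")  # same inputs: answered from the cache
other = build("hello-2.10", "gcc-4.8")   # different input: computed afresh

info = build.cache_info()  # named tuple with hits, misses, maxsize, currsize
```

The second call never runs the function body; `info.hits` counts such
cache hits and `info.misses` counts the calls that had to be computed.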

Functional package management lifts this concept to the realm of
software building and deployment.  Global state in a system equates to
system-wide installations of software, libraries and development
headers.  Side effects are changes to the global environment or global
system paths such as `/usr/bin/`.  To reject global state means to
reject the common file system hierarchy for software deployment and to
use a minimal `chroot` for building software.  The introduction of the
Guix manual describes the approach as follows:

> The term "functional" refers to a specific package management
> discipline.  In Guix, the package build and installation process is
> seen as a function, in the mathematical sense.  That function takes
> inputs, such as build scripts, a compiler, and libraries, and
> returns an installed package.  As a pure function, its result
> depends solely on its inputs—for instance, it cannot refer to
> software or scripts that were not explicitly passed as inputs.  A
> build function always produces the same result when passed a given
> set of inputs.  It cannot alter the system’s environment in any way;
> for instance, it cannot create, modify, or delete files outside of
> its build and installation directories.  This is achieved by running
> build processes in isolated environments (or "containers"), where
> only their explicit inputs are visible.

> The result of package build functions is "cached" in the file
> system, in a special directory called "the store".  Each package is
> installed in a directory of its own, in the store—by default under
> ‘/gnu/store’.  The directory name contains a hash of all the inputs
> used to build that package; thus, changing an input yields a
> different directory name.
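
The naming idea from the quote can be sketched in a few lines of
Python.  This is a toy model only: the hashing scheme, the truncation
to 32 characters and the package names are assumptions made for
illustration and are not how Guix actually computes store paths.

```python
import hashlib
from typing import Iterable


def store_path(name: str, version: str, inputs: Iterable[str],
               store: str = "/gnu/store") -> str:
    """Toy model: name a directory after a hash of everything fed into the build."""
    digest = hashlib.sha256()
    # Sort the inputs so that the order in which they are listed does not matter.
    for item in [name, version, *sorted(inputs)]:
        digest.update(item.encode("utf-8"))
        digest.update(b"\0")  # separator, so ("ab", "c") and ("a", "bc") differ
    return f"{store}/{digest.hexdigest()[:32]}-{name}-{version}"


# Hypothetical package and input names.
a = store_path("samtools", "1.2", ["gcc-4.9.2", "zlib-1.2.8"])
b = store_path("samtools", "1.2", ["zlib-1.2.8", "gcc-4.9.2"])  # same inputs
c = store_path("samtools", "1.2", ["gcc-4.9.2", "zlib-1.2.7"])  # one input changed
```

Here `a` and `b` name the same directory, while `c` ends up somewhere
else, so both variants of the package can sit side by side in the
store without conflict.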

### Isolated, yet shared

Note that the package outputs are still dynamically linked.  Libraries
are referenced in the binaries with their full store paths using the
runpath feature.  These package outputs are not self-contained,
monolithic application directories as you might know them from MacOS.

Any built software is cached in the store which is shared by all users
system-wide.  However, by default the software in the store has no
effect whatsoever on the users' environments.  Building software and
having the results stored in `/gnu/store` does not alter any global
state; no files pollute `/usr/bin/` or `/usr/lib/`.  Any effects are
restricted to the package's single output directory inside
`/gnu/store`.

Guix provides per-user profiles to map software from the store into a
user environment.  The store provides deduplication as it serves as a
cache for packages that have already been built.  A profile is little
more than a "forest" of symbolic links to items in the store.  The
union of links to the outputs of all software packages the user
requested makes up the user's profile.  By adding another layer of
symbolic link indirection, Guix allows users to seamlessly switch
among different generations of the same profile, going back in time.
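
The mechanism is simple enough to imitate with a few lines of Python
in a temporary directory.  This is a simplified sketch, not Guix's
implementation: the directory layout, the `profile-N-link` names and
the fake store items are all hypothetical.

```python
import os
import tempfile


def make_item(store, name, tool):
    """Create a fake store item with a single file under bin/."""
    bin_dir = os.path.join(store, name, "bin")
    os.makedirs(bin_dir)
    with open(os.path.join(bin_dir, tool), "w") as handle:
        handle.write(name + "\n")  # the content records which item this is
    return os.path.join(store, name)


def make_generation(profiles, number, items):
    """Build one generation as the union of symlinks into the given items."""
    gen = os.path.join(profiles, f"profile-{number}-link")
    for item in items:
        for sub in os.listdir(item):  # e.g. "bin"
            os.makedirs(os.path.join(gen, sub), exist_ok=True)
            for entry in os.listdir(os.path.join(item, sub)):
                os.symlink(os.path.join(item, sub, entry),
                           os.path.join(gen, sub, entry))
    return gen


def switch(profiles, generation):
    """Repoint the 'profile' link by renaming a fresh symlink over it."""
    link = os.path.join(profiles, "profile")
    os.symlink(generation, link + ".tmp")
    os.replace(link + ".tmp", link)
    return link


def which_gcc(profile):
    """Read the fake gcc through the profile link."""
    with open(os.path.join(profile, "bin", "gcc")) as handle:
        return handle.read().strip()


root = tempfile.mkdtemp()
store = os.path.join(root, "store")
profiles = os.path.join(root, "profiles")
os.makedirs(profiles)

gcc_old = make_item(store, "aaaa-gcc-4.8.4", "gcc")
gcc_new = make_item(store, "bbbb-gcc-4.9.2", "gcc")
gen1 = make_generation(profiles, 1, [gcc_old])
gen2 = make_generation(profiles, 2, [gcc_new])

profile = switch(profiles, gen2)
current = which_gcc(profile)      # the newer GCC is visible
profile = switch(profiles, gen1)  # roll back to the first generation
rolled_back = which_gcc(profile)  # the older GCC is visible again
```

Nothing in the fake store is modified or deleted when switching; only
the one `profile` link changes, which is why going back to an earlier
generation is cheap.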

User profiles are completely isolated from one another, making it
possible for different users to have different versions of GCC
installed.  Even one and the same user could have multiple profiles
with different versions of GCC and switch between them as needed.

Guix takes the functional packaging method seriously, so except for
the running kernel and the exposed machine hardware there are
virtually no dependencies on global state (i.e. system libraries or
headers).  This also means that the Guix store is populated with the
complete dependency tree, down to the kernel headers and the C
library.  As a result, software in the Guix store can run on very
different GNU/Linux distributions; a shared Guix store allows me to
use the very same software on my Fedora workstation, as well as on the
Ubuntu cluster, and on the CentOS 6.5 cluster.

This means that software only has to be packaged up once.  Since
package recipes are written in a very declarative domain-specific
language on top of Scheme, packaging is surprisingly simple (and, to
this Schemer, rather enjoyable).

### User freedom

Guix liberates users from the software deployment decisions of their
system administrators by giving them the power to build software into
an isolated directory in the store using simple package recipes.
Administrators only need to configure and run the Guix daemon, the
core piece running as root.  The daemon listens for requests issued by
the Guix command line tool, which can be run by users without root
permissions.  The command line tool allows users to manage their
profiles, switch generations, and build and install software through
the Guix daemon.  The daemon takes care of the store, of evaluating the
build expressions and "caching" build results, and it updates the
forest of symbolic links to update profile state.

Users are finally free to conveniently manage their own software,
something they could previously only do in a crude manner by compiling
manually.


## Using a shared Guix store

Guix is not designed to be run in a centralised manner.  A Guix daemon
is supposed to run on each system as root and it listens for RPCs from
local users only.  In an environment with multiple clusters and
multiple workstations this approach requires considerable effort to
make it work correctly and securely.

Instead we opted to run the Guix daemon on a single dedicated server,
writing profile data and store items onto an NFS share.  The cluster
nodes and workstations mount this share read-only.  Although this
means that users lose the ability to manage their profiles directly on
their workstations and on the cluster nodes (because they have no
local installation of the Guix client or the Guix daemon, and because
they lack write access to the shared store), their software profiles
are now available wherever they are.  To manage their profiles, users
would log on to the Guix server where they can install software into
their profiles, roll back to previous versions or send other queries
to the Guix daemon.  (At some point I think it would make sense to
enhance Guix such that RPCs can be made over SSH, so that explicit
logging on to a management machine is no longer necessary.)


## Guix as a platform for scientific software

Since winter 2014 I have been packaging software for GNU Guix, which
has since accumulated quite a few common and obscure
[bioinformatics tools and libraries](https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/bioinformatics.scm).
A list of software (updated daily) available through Guix is
[available here](https://www.gnu.org/software/guix/package-list.html).
We also have common Python modules for scientific computing, as well
as programming languages such as R and Julia.

I think GNU Guix is a great platform for scientific software in
heterogeneous computing environments.  The Guix project follows the
[Free System Distribution Guidelines](https://gnu.org/distros/free-system-distribution-guidelines.html),
which means that free software is welcome upstream.  For software that
imposes additional usage or distribution restrictions (such as when
the original Artistic license is used instead of the Clarified
Artistic license, or when commercial use is prohibited by the license)
Guix allows the use of out-of-tree package modules through the
`GUIX_PACKAGE_PATH` variable.  As Guix packages are just Scheme
variables in Scheme modules, it is trivial to extend the official GNU
Guix distribution with package modules by simply setting the
`GUIX_PACKAGE_PATH`.

If you want to learn more about GNU Guix I recommend taking a look at
the excellent
[GNU Guix project page](https://www.gnu.org/software/guix/).  Feel
free to contact me if you want to learn more about packaging
scientific software for Guix.  It is not difficult and we can all
benefit from joining efforts in adopting this usable, dependable,
hackable, and liberating platform for scientific computing with free
software.

The Guix community is very friendly, supportive, responsive and
welcoming.  I encourage you to visit the project's
[IRC channel #guix on Freenode](https://webchat.freenode.net?channels=#guix),
where I go by the handle "rekado".