(post
 :title "GNU Guix in an HPC environment"
 :date (string->date* "2015-04-17 00:00")
 :tags '("gnu"
         "planet-fsfe-en"
         "free software"
         "guix"
         "bioinformatics"
         "system administration"
         "packaging"
         "cluster")

 (p [I spend my daytime hours as a system administrator at a research
     institute in a heterogeneous computing environment.  We have two
     big compute clusters (one on CentOS, the other on Ubuntu) with
     about 100 nodes each and dozens of custom GNU/Linux workstations.
     A common task for me is to ensure the users can run their
     bioinformatics software, both on their workstation and on the
     clusters.  Only a few bioinformatics tools and libraries are
     popular enough to have been packaged for CentOS or Ubuntu, so
     usually some work has to be done to build the applications and
     all of their dependencies for the target platforms.])

 (h2 [How to waste time building and deploying software])

 (p [In theory, compiling software is not a very difficult thing to
     do.  Once all development headers have been installed on the
     build host, compilation is usually a matter of configuring the
     build with a configure script and running GNU make with various
     flags (this is an assumption which is violated by bioinformatics
     software on a regular basis, but let’s not get into this now).
     However, there are practical problems that become painfully
     obvious in a shared environment with a large number of users.])

 (h3 [Naive compilation])

 (p [Compiling software directly on the target machine is an option
     only in the most trivial cases.  With more complicated build
     systems or complicated build-time dependencies there is a strong
     incentive for system administrators to do the hard work of
     setting up a suitable build environment for a particular piece of
     software only once.  Most people would agree that package
     management is a great step up from naive compilation, as the
     build steps are formalised in some sort of recipe that can be
     executed by build tools in a reproducible manner.  Updates to
     software only require tweaks to these recipes.  Package
     management is a good thing.])

 (h3 [System-dependence])

 (p [Non-trivial software that was built and dynamically linked on one
     machine with a particular set of libraries and header files at
     particular versions can only really work on a system with the
     very same libraries at compatible versions in place.  Established
     package managers allow packagers to specify hard dependencies and
     version ranges, but the binaries that are produced on the build
     host will only work under the constraints imposed on them at
     build time.  To support an environment in which software must run
     on, say, both CentOS 6.5 and CentOS 7.1, the packages must be
     built in both environments and binaries for both targets have to
     be provided.])

 (p [There are ways to emulate a different build environment
     (e.g. Fedora’s ,(code [mock])), but we cannot get around the
     fact that dynamically linked software built for one kind of
     system will only ever work on that very kind of system.  At
     runtime we can change what libraries will be dynamically loaded,
     but this is a hack that pushes the problem from package
     maintainers to users.  Running software with ,(code
     [LD_LIBRARY_PATH]) set is not a solution, nor is static linking,
     which amounts to copying chunks of libraries into each binary at
     build time.])

 (h3 [Version conflicts])

 (p [Libraries and applications that come pre-installed or
     pre-packaged with the system may not be the versions a user
     claims to need.  Say, a user wants the latest version of GCC to
     compile code using new language features specified in C++11
     (e.g. anonymous functions).  Full support for C++11 arrived in
     GCC 4.8.1, yet on CentOS 6.5 only version 4.4.7 is available
     through the repositories.  The system administrator may not
     necessarily be able to upgrade GCC system-wide.  Or maybe other
     users on a shared system do need version 4.4.7 to be available
     (e.g. for bug-compatibility).  There is no easy way to satisfy
     all users, so a system administrator might give up and let users
     build their own software in their home directories instead of
     solving the problem.])

 (p [However, compiling GCC is a daunting task for a user and they
     really shouldn’t have to do this at all.  We already established
     that package management is a good thing; why should we deny users
     the benefits of package management?  Traditional package
     management techniques are ill-suited to the task of installing
     multiple versions of applications or libraries into independent
     prefixes.  RPM, for example, allows users to maintain a local,
     independent package database, but ,(code [yum]) won’t work with
     multiple package databases.  Additionally, only ,(em [one])
     package database can be used at once, so a user would have to
     re-install system libraries into the local package database to
     satisfy dependencies.  As a result, users lose the important
     feature of automatic dependency resolution.])

 (h3 [Interoperability])

 (p [A system administrator who decides to package software as
     relocatable RPMs, to install the applications to custom prefixes
     and to maintain a separate repository has nothing to offer
     when a user asks to have the packaged software installed on an
     Ubuntu workstation.  There are ways to convert RPMs to DEB
     packages (with varying degrees of success), but it seems silly to
     have to convert or rebuild stuff repeatedly when the software,
     its dependencies and its mode of deployment really didn’t change
     at all.])

 (p [What happens when a Slackware user comes along next?  Or someone
     using Arch Linux?  Sure, as a system administrator you could
     refuse to support any system other than CentOS 7.1, users be
     damned.  Traditionally, it seems that system administrators
     default to this style for convenience and/or practical reasons,
     but I consider this unhelpful and even somewhat oppressive.])


 (h2 [Functional package management with GNU Guix])

 (p [Luckily, I’m not the only person to consider traditional
     packaging methods inadequate for a number of valid purposes.
     There are different projects aiming to improve and simplify
     software deployment and management, one of which I will focus on
     in this article.  As a functional programmer, Scheme aficionado
     and free software enthusiast I was intrigued to learn about ,(ref
     "https://www.gnu.org/software/guix/" "GNU Guix"), a functional
     package manager written in ,(ref
     "https://www.gnu.org/software/guile/" "Guile Scheme"), the
     designated extension language for the ,(ref
     "https://www.gnu.org/" "GNU system").])

 (p [In purely functional programming languages a function will
     produce the very same output when called repeatedly with the same
     input values.  This allows for interesting optimisations, but most
     importantly it makes it ,(em [possible]) and in some cases even
     ,(em [easy]) to reason about the behaviour of a function.  It is
     independent from global state, has no side effects, and its
     outputs can be cached as they are certain not to change as long
     as the inputs stay the same.])

 (p [Functional package management lifts this concept to the realm of
     software building and deployment.  Global state in a system
     equates to system-wide installations of software, libraries and
     development headers.  Side effects are changes to the global
     environment or global system paths such as ,(code [/usr/bin/]).
     To reject global state means to reject the common file system
     hierarchy for software deployment and to use a minimal ,(code
     [chroot]) for building software.  The introduction of the Guix
     manual describes the approach as follows:])

 (blockquote
  (p [The term “functional” refers to a specific package management
      discipline.  In Guix, the package build and installation process
      is seen as a function, in the mathematical sense.  That function
      takes inputs, such as build scripts, a compiler, and libraries,
      and returns an installed package.  As a pure function, its
      result depends solely on its inputs—for instance, it cannot
      refer to software or scripts that were not explicitly passed as
      inputs.  A build function always produces the same result when
      passed a given set of inputs.  It cannot alter the system’s
      environment in any way; for instance, it cannot create, modify,
      or delete files outside of its build and installation
      directories.  This is achieved by running build processes in
      isolated environments (or “containers”), where only their
      explicit inputs are visible.])

  (p [The result of package build functions is “cached” in the file
      system, in a special directory called “the store”.  Each package
      is installed in a directory of its own, in the store—by default
      under ‘/gnu/store’.  The directory name contains a hash of all
      the inputs used to build that package; thus, changing an input
      yields a different directory name.]))
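
 (p [The naming scheme described in the quote can be illustrated with
     a toy model.  The sketch below is plain Guile and is not how Guix
     actually computes store file names: ,(code [toy-store-path]) is a
     hypothetical helper, Guile’s non-cryptographic ,(code
     [string-hash]) stands in for the real hash, and the package and
     input names are made up.  It only shows that the directory name
     is a function of the inputs and of nothing else.])

```scheme
;; Toy model of input-addressed store names (an illustration only).
;; Assumptions: `string-hash' stands in for the cryptographic hash,
;; and build inputs are represented as plain strings.

(define (toy-store-path name version inputs)
  "Return a store-like directory name derived from NAME, VERSION and the
list of INPUTS.  The order in which INPUTS are listed is irrelevant."
  (let* ((key    (string-join (cons* name version (sort inputs string<?))
                              "\n"))
         (digest (number->string (string-hash key) 16)))
    (string-append "/gnu/store/" digest "-" name "-" version)))

;; Same inputs, same name: an existing build result can be reused.
(define a (toy-store-path "samtools" "1.1" '("gcc-4.8.4" "zlib-1.2.7")))
(define b (toy-store-path "samtools" "1.1" '("zlib-1.2.7" "gcc-4.8.4")))

;; One input changed: the name differs (barring a hash collision),
;; so both builds can sit side by side in the store.
(define c (toy-store-path "samtools" "1.1" '("gcc-4.9.2" "zlib-1.2.7")))

(display (string=? a b)) (newline)
(display (string=? a c)) (newline)
```

 (p [Because a changed input leads to a new directory rather than to an
     overwritten one, several variants of the same package can coexist
     without interfering with each other.])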

 (p [The following diagram (taken from the ,(ref
     "https://www.gnu.org/software/guix/guix-fosdem-20150131.pdf"
     "slides for a talk by Ludovic Courtès")) illustrates how the
     build daemon handles the package build processes requested by a
     client via remote procedure calls:])

 (wide-img "2015/guix-build.png"
           "Software is built by the Guix daemon in isolation")

 (h3 [Isolated, yet shared])

 (p [Note that the package outputs are still dynamically linked.
     Libraries are referenced in the binaries with their full store
     paths using the runpath feature.  These package outputs are not
     self-contained, monolithic application directories as you might
     know them from macOS.])

 (p [Any built software is cached in the store which is shared by all
     users system-wide.  However, by default the software in the store
     has no effect whatsoever on the users’ environments.  Building
     software and storing the results in ,(code [/gnu/store]) does
     not alter any global state; no files pollute ,(code [/usr/bin/])
     or ,(code [/usr/lib/]).  Any effects are restricted to the
     package’s single output directory inside the ,(code
     [/gnu/store]).])

 (p [Guix provides per-user profiles to map software from the store
     into a user environment.  The store provides deduplication as it
     serves as a cache for packages that have already been built.  A
     profile is little more than a “forest” of symbolic links to items
     in the store.  The union of links to the outputs of all software
     packages the user requested makes up the user’s profile.  By
     adding another layer of symbolic link indirection, Guix allows
     users to seamlessly switch among different generations of the
     same profile, going back in time.])
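
 (p [The two layers of indirection can be modelled in a few lines of
     Guile.  This is a sketch of the idea and not Guix’s
     implementation: generations are kept in an association list
     instead of symbolic links on disk, ,(code [resolve]) and ,(code
     [roll-back]) are hypothetical helpers, and the store file names
     (including their hash parts) are placeholders.])

```scheme
;; Toy model of a profile with generations (an illustration only).
;; Each generation maps a package name to a placeholder store item.
(define generations
  '((1 . (("gcc" . "/gnu/store/aaaa-gcc-4.8.4")))
    (2 . (("gcc" . "/gnu/store/bbbb-gcc-4.9.2")
          ("samtools" . "/gnu/store/cccc-samtools-1.1")))))

(define (resolve generation-number package)
  "Return the store item that PACKAGE points to in the generation
numbered GENERATION-NUMBER, or #f if it is not part of that generation."
  (let ((generation (assv-ref generations generation-number)))
    (and generation (assoc-ref generation package))))

(define (roll-back generation-number)
  "Return the number of the previous generation; generation 1 is the oldest."
  (max 1 (- generation-number 1)))

;; The profile itself is just a pointer to the current generation.
(define current 2)

(display (resolve current "gcc")) (newline)
;; prints /gnu/store/bbbb-gcc-4.9.2
(display (resolve (roll-back current) "gcc")) (newline)
;; prints /gnu/store/aaaa-gcc-4.8.4
(display (resolve (roll-back current) "samtools")) (newline)
;; prints #f
```

 (p [Rolling back only moves the pointer.  Nothing in the store is
     modified, which is why switching between generations is cheap.])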

 (p [User profiles are completely isolated from one another, making
     it possible for different users to have different versions of GCC
     installed.  Even one and the same user could have multiple
     profiles with different versions of GCC and switch between them
     as needed.])

 (p [Guix takes the functional packaging method seriously, so except
     for the running kernel and the exposed machine hardware there are
     virtually no dependencies on global state (i.e. system libraries
     or headers).  This also means that the Guix store is populated
     with the complete dependency tree, down to the kernel headers and
     the C library.  As a result, software in the Guix store can run
     on very different GNU/Linux distributions; a shared Guix store
     allows me to use the very same software on my Fedora workstation,
     as well as on the Ubuntu cluster, and on the CentOS 6.5 cluster.])

 (p [This means that software only has to be packaged up once.  Since
     package recipes are written in a very declarative domain-specific
     language on top of Scheme, packaging is surprisingly simple (and,
     to this Schemer, rather enjoyable).])
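
 (p [To give an impression of what such a recipe looks like, here is a
     sketch modelled on the GNU Hello example from the Guix manual.
     The module name ,(code [(my-packages hello)]) and the variable
     name ,(code [my-hello]) are made up for illustration.  The source
     hash should be verified with ,(code [guix download]) before
     relying on it, and field details may differ between Guix
     versions.])

```scheme
;; A sketch of a package recipe, assuming a working Guix installation.
;; The module and variable names are hypothetical.
(define-module (my-packages hello)
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module (guix build-system gnu)
  #:use-module (guix licenses))

(define-public my-hello
  (package
    (name "my-hello")
    (version "2.10")
    ;; Where to fetch the source from, and the hash it must have.
    (source (origin
              (method url-fetch)
              (uri (string-append "mirror://gnu/hello/hello-"
                                  version ".tar.gz"))
              (sha256
               (base32
                "0ssi1wpaf7plaswqqjwigppsg5fyh99vdlb9kzl7c9lng89ndq1i"))))
    ;; The usual configure, make, make install sequence.
    (build-system gnu-build-system)
    (synopsis "Example package that prints a greeting")
    (description
     "GNU Hello prints a friendly greeting.  It serves as an example of
standard GNU coding practices.")
    (home-page "https://www.gnu.org/software/hello/")
    (license gpl3+)))
```

 (p [Everything the build may refer to is named in the recipe.  A
     package with dependencies would list them in an additional
     ,(code [inputs]) field, and those are the only libraries visible
     during the build.])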

 (h3 [User freedom])

 (p [Guix liberates users from the software deployment decisions of
     their system administrators by giving them the power to build
     software into an isolated directory in the store using simple
     package recipes.  Administrators only need to configure and run
     the Guix daemon, the core piece running as root.  The daemon
     listens for requests issued by the Guix command line tool, which
     can be run by users without root permissions.  The command line
     tool allows users to manage their profiles, switch generations,
     and build and install software through the Guix daemon.  The daemon
     takes care of the store, of evaluating the build expressions and
     “caching” build results, and it updates the forest of symbolic
     links to update profile state.])

 (p [Users are finally free to conveniently manage their own software,
     something they could previously only do in a crude manner by
     compiling manually.])


 (h2 [Using a shared Guix store])

 (p [Guix is not designed to be run in a centralised manner.  A Guix
     daemon is supposed to run as root on each system and accepts RPCs
     from local users only.  In an environment with multiple clusters
     and many workstations, running a daemon on every machine takes
     considerable effort to get right and to keep secure.])

 (wide-img "2015/guix-shared.svg"
           "Sharing Guix store and profiles")

 (p [Instead we opted to run the Guix daemon on a single dedicated
     server, writing profile data and store items onto an NFS share.
     The cluster nodes and workstations mount this share read-only.
     Although this means that users lose the ability to manage their
     profiles directly on their workstations and on the cluster nodes
     (because they have no local installation of the Guix client or
     the Guix daemon, and because they lack write access to the shared
     store), their software profiles are now available wherever they
     are.  To manage their profiles, users log on to the Guix
     server where they can install software into their profiles, roll
     back to previous versions or send other queries to the Guix
     daemon.  (At some point I think it would make sense to enhance
     Guix such that RPCs can be made over SSH, so that explicit
     logging on to a management machine is no longer necessary.)])


 (h2 [Guix as a platform for scientific software])

 (p [Since winter 2014 I have been packaging software for GNU Guix,
     which has since accumulated quite a few common and obscure
     ,(ref
     "http://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/bioinformatics.scm"
     "bioinformatics tools and libraries").  A list of software
     (updated daily) available through Guix is ,(ref
     "https://www.gnu.org/software/guix/package-list.html" "available
     here").  We also have common Python modules for scientific
     computing, as well as programming languages such as R and Julia.])

 (p [I think GNU Guix is a great platform for scientific software in
     heterogeneous computing environments.  The Guix project follows
     the ,(ref
     "https://gnu.org/distros/free-system-distribution-guidelines.html"
     "Free System Distribution Guidelines"), which means that free
     software is welcome upstream.  For software that imposes
     additional usage or distribution restrictions (such as when the
     original Artistic license is used instead of the Clarified
     Artistic license, or when commercial use is prohibited by the
     license) Guix allows the use of out-of-tree package modules
     through the ,(code [GUIX_PACKAGE_PATH]) variable.  As Guix
     packages are just Scheme variables in Scheme modules, it is
     trivial to extend the official GNU Guix distribution with package
     modules by simply setting the ,(code [GUIX_PACKAGE_PATH]).])

 (p [If you want to learn more about GNU Guix I recommend taking a
     look at the excellent ,(ref "https://www.gnu.org/software/guix/"
     "GNU Guix project page").  Feel free to contact me if you want to
     learn more about packaging scientific software for Guix.  It is
     not difficult, and we can all benefit from joining forces in
     adopting this usable, dependable, hackable, and liberating
     platform for scientific computing with free software.])

 (p [The Guix community is very friendly, supportive, responsive and
     welcoming.  I encourage you to visit the project’s ,(ref
     "https://webchat.freenode.net?channels=#guix" "IRC channel #guix
     on Freenode"), where I go by the handle “rekado”.]))