This book is an account of the practice of machine learning – that is, the project of configuring software, by means of a learning process, so that it can learn from data and, as a consequence, deal with further data in an intelligent way. However, some of these terms – “learn” and “intelligent” – cannot be unproblematically applied to machines; furthermore, the title of the book – Machine Learners – can be applied both to the machines which thus learn and to the humans who develop and configure the software (p. 6), an ambiguity which is acknowledged and persists throughout the book. In addition, the book is “not an ethnography, [but] has an ethnographic situation” (p. xi), and the author consequently uses terminology derived from ethnography and from other fields, some of it borrowed from machine learning itself. Thus, the author writes at one point that one of the standard books in the field, Elements of Statistical Learning,1 “has a matrix-like character that constantly superimposes and transposes elements across boundaries and barriers” (p. 34) – here the word “matrix” is used metaphorically, in the context of ethnography. A few lines further down, however, there is the subheading R: a matrix of transformations, where R is a standard statistical software package which makes heavy use of matrices – here the mathematical entities. Semantically, then, this book is richly textured and often deliberately ambiguous. One might therefore expect the book to be pretentious, and also difficult to read: neither is in fact the case, which says a great deal about the author’s skill as a writer.
It would be wrong, in any case, to dismiss this texture of writing merely as an affectation. One of the author’s central questions is “who is the machine learner subject?” (p. 179), and this question, it turns out, can be answered in several ways. Certainly one answer is that human machine learning specialists are machine learner subjects, and there are many descriptions of such subjects in the book (for example, on p. 179). But the machine learning software – neural nets and similar programs – can also be seen as subjects, or, in the author’s more precise description, can be assigned a subject position. This terminology comes from Foucault, who describes subjectivity in terms of relations between statements about the subject and operations which the subject performs (p. 197, citing Foucault2). This conception of the subject was important for Foucault, who thought that “there was no true self that could be deciphered and emancipated, but that the self was something that had been – and must be – created.”3
For our author, on the other hand, Foucault’s views on the subject were important because they allowed for the assignment of subjectivity to software entities such as neural nets: not, of course, the introspection-driven subjectivity that Descartes, Kant and Hegel talk of, but a sort of subjectivity which can be assigned to neural nets on the grounds that “the … operations of the machine learner support the production of statements, and these operations become a way of producing future statements to the extent that the subject of the operation is also the subject of the statement. The assignation of a subject position occurs in the forward and backward, feed-forwarding and back-propagating, motion between operation and statement.” (p. 197)
And so we see the point of the author’s distinctive texture of writing: if we describe these algorithms as “machine learners”, then we are attributing to them certain capacities which were, traditionally, attributed only to human beings or, at least, to cognitively well-endowed animals. Now the author is not claiming that these algorithms have exactly those capacities which humans and certain animals possess; it is, however, not misleading to say that the algorithms have something resembling those capacities. Thus, the writing strategy which the author employs exploits, quite deliberately, the similarities between the activities of the two sorts of machine learners, human and non-human.
The project outlined above is both intellectually satisfying and aesthetically attractive: it uses critical thought (mostly in the French tradition of Deleuze and Foucault) to illuminate the intellectual practice of machine learning. It is, on the whole, successful in what it attempts, and gives a great deal of insight into machine learning and into the French tradition of critical theory. However, as always, there are gaps: other intellectual projects which are relevant but have been ignored, and evaluative, often critical, perspectives on machine learning which have not been highlighted. Both of these gaps, I would claim, result from the sidelining of certain research communities in the standard account of machine learning, as, I hope, should become apparent in the following.
The project described is not the only instance of large-scale, computer-aided, intellectual enquiry in the modern software industry. There is also the project known as software verification: this is an attempt to use logical analysis to find out whether software actually does what it claims to do. This is a project which started out on very small pieces of programming4, but which has now become large and sophisticated, able to tackle the verification of large, real-world problems.5 This project is, of course, rather different from the Machine Learners project: the language and the methods are more rigid and less open to the sort of riffing that Mackenzie is fond of. However, there are important similarities: both projects operate at a large scale (in Mackenzie’s case, the scale of very large datasets; in the case of software verification, the scale of extremely large proofs, which can only be managed by computers). Both projects are community-based: software verification employs experts of various sorts – those developing the logic used in the proofs, those developing the software which checks the proofs, those who write and implement the algorithms that the proofs verify – while the machine learning community employs, again, a variety of different sorts of experts. Both of these projects, then, operate in a space where community relations, and communication between different sorts of experts, become very important, and where communications between experts often cross cultural boundaries.
Scale and Range
There is, in the machine learning community, a slogan which is often written N=∀X (ch. 4, p. 103): translated from the mathematics, this means that the sample with which the statistics deals is the entire population which the statistics studies. Computers have become powerful enough, and data gathering has become efficient enough, that nothing is left out. This is an impressive, but also a very frightening, claim.
It is, however, also rather exaggerated. Suppose that one starts trying to find out about language by collecting data from speakers: one gathers data, and gradually expands the set of people one gathers it from. One would hope that the larger one’s collection of data, the more accurate one’s results would be. However, the extent to which one can do this is limited, because of linguistic variation: if one’s collection gets too large, then one ends up gathering data from people who are no longer in the same speech community. This gets worse with historical data, because of the historical variation of language: take too large a time span, and the language one samples will have altered significantly. This is an argument that was used by Schleiermacher in his Hermeneutics6; see also White.7 Thus, there will, in general, be no precisely delimited totality in which one can find the data relevant to a particular issue: it will always be a matter of trading off the completeness of one’s data collection against the relevance of that data to the issue one seeks to resolve.
There is another, related, issue, which is this. Linguistic data has long-range dependencies: that is, there are significant correlations between the beginning of a text and its end. Questions raised at the beginning of a novel may not be resolved until its close; an utterance which is cryptic at the start of a narrative may only make sense in the light of what happens at the end. These dependencies can be very long indeed.
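The contrast between short- and long-range dependence can be made concrete with a small numerical sketch (a toy illustration of my own, not an example from the book): an independent sequence loses all correlation almost immediately, while a strongly persistent sequence – here a random walk, standing in for the kind of long memory that narrative text exhibits – remains correlated across hundreds of steps.

```python
import random

def autocorr(xs, lag):
    """Sample autocorrelation of the sequence xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var

random.seed(1)
n = 20_000

# Independent noise: correlations die out at once.
iid = [random.gauss(0, 1) for _ in range(n)]

# A random walk: each value depends on everything that came before it,
# so correlations persist over very long ranges.
walk, level = [], 0.0
for _ in range(n):
    level += random.gauss(0, 1)
    walk.append(level)

print(round(autocorr(iid, 500), 3))   # close to zero
print(round(autocorr(walk, 500), 3))  # still large at lag 500
```

Nothing here depends on the data being linguistic; the point is only that long memory is detectable, and that it marks off such data from the independent case which classical statistics most comfortably handles.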
However, standard statistical theory is based on what is called the normal distribution8 (sometimes known as the “bell curve”). We can ask how the normal distribution arises, and the standard answer is that it arises when we add together measurements of a large number of events, each of which is affected only by the events near to it (in technical terms, when there are no long-range dependencies, or long-range correlations). This result is known as the Central Limit Theorem: it applies in many cases, and it is the foundation of standard statistical theory.
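The theorem can be watched at work in a few lines of code (again a toy sketch of my own, not anything from the book): summing many independent draws produces totals that cluster in the bell shape, with roughly 68% of them lying within one standard deviation of the mean, as the normal distribution predicts.

```python
import random
import statistics

# Each total is the sum of 100 independent uniform draws; by the
# Central Limit Theorem the totals are approximately normally distributed.
random.seed(0)
totals = [sum(random.random() for _ in range(100)) for _ in range(10_000)]

mean = statistics.mean(totals)  # theory: 100 * 0.5 = 50
sd = statistics.stdev(totals)   # theory: sqrt(100 / 12), about 2.89

# For a normal distribution, about 68% of values lie within one
# standard deviation of the mean.
within_one_sd = sum(abs(t - mean) <= sd for t in totals) / len(totals)
print(round(mean, 1), round(sd, 2), round(within_one_sd, 2))
```

The crucial assumption is the independence of the draws: it is exactly this assumption that long-range dependence undermines.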
But, because linguistic data has long-range dependencies, there is no reason why linguistic data should be normally distributed, and, in fact, it is not: the distribution seems to be close to what is called Zipf’s law.9 There are two things to be said here. One is that, because of this, the grounds for using standard statistical methods on linguistic data seem to be quite tenuous. The other is that, in practice, nobody cares: the vast majority of the statistics-using community simply gets on with its tests of significance, and does not worry. Of course, there are people who do worry: they are mathematicians, however, and live in a different world from that of the majority of statistics-using researchers. Viewed in this light, things seem a good deal more ramshackle than they are generally taken to be.
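Zipf’s law says that a word’s frequency is roughly inversely proportional to its frequency rank, so that rank × frequency stays roughly constant down the frequency table. Here is a minimal sketch of that rank–frequency check, with the word frequencies of the toy “corpus” chosen by hand to follow the law exactly, purely for illustration – a real corpus fits it only approximately:

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, freq) for rank, (_, freq)
            in enumerate(counts.most_common(), start=1)]

# Toy corpus: frequencies 60, 30, 20, 15, 12, 10 = 60 / rank,
# i.e. an exactly Zipfian profile, constructed for illustration.
tokens = ("the " * 60 + "of " * 30 + "and " * 20 +
          "to " * 15 + "in " * 12 + "language " * 10).split()

for rank, freq in rank_frequency(tokens):
    print(rank, freq, rank * freq)  # rank * freq is 60 throughout
```

A distribution of this shape has a few very common words and a long tail of rare ones – quite unlike the bell curve that standard significance testing presupposes.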
The moral of this part of the story is as follows. There may be researchers who are relevant for the evaluation of a particular work, but who are not part of the community which produced that work. We see this in the case of the debate over Zipf’s law: much of the community which uses linguistic data either does not know about, or does not want to be bothered by, this debate. It is, despite that, relevant for what they do. It is to the author’s credit that he seems not unaware that considerations of this sort may indicate trouble (I have put it somewhat broadly, because the range of possible difficulties is broad, and not confined to the debate over Zipf’s law), and that he realises (p. 166) that the statistics of this data may not be entirely straightforward. Examining these issues, however, would take a book of an entirely different sort from the one with which this review is concerned.
- Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman, Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edition (New York: Springer, 2009). ↩
- Michel Foucault, The Archaeology of Knowledge and the Discourse on Language (New York: Pantheon Books, 1972), 94–95. ↩
- Gary Gutting and Johanna Oksala, “Michel Foucault” in Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy (Stanford University, Metaphysics Research Lab, spring 2019 edition, 2019). ↩
- Christopher Strachey, “On taking the square root of a complex number,” The Computer Journal, 2:89, 1959. ↩
- Samin S. Ishtiaq and Peter W. O’Hearn, “BI as an assertion language for mutable data structures” in Proceedings of the 28th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL’01 (New York: Association for Computing Machinery, 2001), 14–26. Viresh Paruthi, “Large-scale application of formal verification: From fiction to fact” in Roderick Bloem and Natasha Sharygina, editors, Proceedings of the 10th International Conference on Formal Methods in Computer-Aided Design, FMCAD 2010, Lugano, Switzerland, October 20–23 (IEEE, 2010), 175–180. ↩
- Friedrich Daniel Ernst Schleiermacher, Hermeneutik und Kritik, ed. Manfred Frank, volume 211 of Suhrkamp Taschenbuch Wissenschaft (Suhrkamp, 1977). See also Friedrich Daniel Ernst Schleiermacher, Hermeneutics and Criticism and Other Writings, ed. Andrew Bowie (Cambridge: Cambridge University Press, Cambridge Texts in the History of Philosophy, 1998). ↩
- G. Graham White. “Semantics, hermeneutics, statistics: Some reflections on the semantic web” in Proceedings of HCI 2011 (British Computer Society, 2011). ↩
- Springer Verlag GmbH, European Mathematical Society, “Normal distribution.” ↩
- Wikipedia contributors, “Zipf’s law,” Wikipedia, the free encyclopedia (online; accessed 25 April 2020). ↩