Databases make the world. They do so in at least two ways. The first and more trivial of these is that the world is increasingly made up of databases, as digital technologies continue to supplement, surround, or displace other forms of record keeping. The second and more consequential way in which databases make the world, though, is that this very spread of digital forms means that we increasingly understand, talk about, think about, and describe the world as the sort of thing that can be encoded and represented in a database. When the database is a tool for encoding aspects of the world, the world increasingly appears to us as a collection of opportunities for databasing; and as Castelle 1 notes, the forms of databasing in which we engage are ones that are entwined with the structures and patterns of organizational life.
This relationship between technology and practice is by no means unique to digital databases. Rather, representational forms and representational practices have always had this dual role, as both ways of encoding and ways of seeing. Jack Goody2, for example, has detailed the evolution of different ways of understanding and knowing about the world – in terms of hierarchies, collections, and relationships, for example – and how they are entwined with the emergence of written notations (trees, lists, and tables), arguing that these are bound together in a co-evolutionary cycle so that knowledge and representations arise as two aspects of a unified epistemological practice. Clanchy 3 has similarly documented the entwined production of written records and legal practice in medieval England. To the extent that database forms provide us with new capacities for encoding and acting upon representations of objects of interest, they are merely the latest site at which we can examine this conjoined practice.
In a widely influential analysis, media theorist Lev Manovich has argued that the database is the primary cultural form of the twenty-first century, in much the same way as the novel was in the nineteenth and the film in the twentieth.4 Manovich argues that the database as a cultural form emphasizes relationality and connection over narrative and sequence, and that this shift in attention characterizes new media forms such as hypertext, interactive digital media, and computer games. In exploring the database as a cultural form, Manovich is not simply considering database-driven media in relationship to others, but rather examining a sensibility and aesthetic that play across different media – film can adopt relational conventions, just as digital media may be filmic. In line with Bolter and Grusin’s 5 concept of ‘remediation’ – the way in which new media forms recontextualize and engender a re-evaluation of existing media – Manovich draws our attention to the consequences of thinking of and approaching the world with the database as a lens, and to the database as a technology of representation.
Manovich’s argument is compelling and useful, but, as he himself notes, it takes the database form largely as read. He examines the database as, broadly, an unordered collection of relational data items, linking one object to another – a hypertext link and its destination, for example, or a person’s name and an account number, or a GPS location with an air quality measurement. Because databases are unordered, associativity displaces sequentiality in database representations. (The unordered quality of database tables makes them unlike, say, mathematical tables.6 In a relational database, data items are added in a particular order, and might be presented in a particular order by sorting according to some criterion, but the database itself maintains no order amongst the items.) The broad argument within which Manovich pursues his analysis – that different cultural moments may be associated with different representational systems, and hence with different cultural and media experiences – does not, in itself, require a closer examination of just what kinds of properties and mechanisms constitute database-ness. Indeed, this is a question that Manovich explicitly lays to one side – and which I take up here.
Databases, after all, are specific material objects. They combine software, hardware, and data formatting components to produce particular kinds of effects that go beyond simply symbolic, discrete, and algorithmic aspects. Consequently, we might ask how the specificities of database technologies, as they manifest themselves in contemporary computational practice, might be relevant for the phenomena he identifies. As it happens, we find ourselves at an especially interesting techno-cultural moment with respect to just this inquiry: recent years have seen some radical reconsideration of the relationship between database functions, infrastructural arrangements, and informational practice, which makes it particularly appropriate to inquire further into how the notion of associative media might be grounded in specific notions of the database and the specific materialities of digital forms. Examining the materialities of the database form uncovers a range of historical specificities that help to ground broader arguments about digital media, and which also illuminate the significance of contemporary shifts, dislocations, disruptions and evolutions. (This approach is, in fact, in line with the forms of analysis that Manovich has subsequently developed, although he has focused his attention largely on the applications involved in the production of contemporary digital media rather than core infrastructures like database technology. 7)
There are overarching questions that we might ask of such an enterprise. The first is, What does such a project involve? First, it involves thinking of the database as, itself, a collection of historically specific forms, and so, a technological object that is continually evolving. Second, it involves reformulating our notion of the database as an assemblage of hardware, software, data representations, diagrams, algebras, business needs, spatial practices, principles, computer languages, and related elements, linked by conventions of use and rules of thumb. Third, it requires formulating an analysis on multiple scales, multiple levels of abstraction, and multiple domains of practice simultaneously. Fourth, it demands that we understand the database comparatively as an element within a framework of possibilities that reflect competing needs, interests, and accounts of both the present and the future. What can such a project provide? It enables an engagement with the materiality of contemporary information practices that might more adequately illuminate the politics of digital representation and, taking Manovich’s argument seriously, open up the database to scrutiny as a cultural object itself.
The second overarching question is, What is at stake in such a project? This is a question to which I will return at the close of the paper, but the most significant consideration might be how this contributes to an opening up of the different layers of analysis to which software systems and digital systems might be subjected. One might consider Kirschenbaum’s 8 examination of digital storage media, Mackenzie’s 9 account of the mathematical philosophy underlying databases, and Montford and Bogost’s 10 detailed tracing of the operation of a particular machine as each offering a way to slice through the complex whole of digital systems. My own focus here complements these by examining algorithms and implementations, particularly at the nexus of representation, encoding, and digital manifestation. The hope is that such a project can speak to the historically situated practices of information processing and provide a perspective from which some of the mutual influences of evolving technical practice and evolving material forms become visible.
This investigation is part of a broader project that investigates the materialities of information. As suggested above, this concern with materiality is one that regards materiality as a foundational property of the digital, rather than as an aspect only of particular kinds of systems. When construed this way, materiality does not arise as an alternative to digitality; no digital-material divide is invoked or supportable. Nor yet can we support a notion of the digital and material as coextensive (in terms of the intimate connections between a digital and material world, which of course is similarly committed to an ontological separation between digital and material). We cannot even support a digital/material parallelism, in which computational experience is simultaneously digital and material, and yet these two remain distinct as spheres of analysis and scopes of practice. Rather, what is of interest here is the materiality of the digital; the digital is always, inherently, and inescapably material, and it is its very materiality that is our topic. That is – a material account of digitality is not a new idea that is predicated on the emergence of tangible interaction 11 or ubiquitous computing 12, nor is a material account of digitality something that we need in response to new interests in computational materials 13 or physical computational platforms 14. Instead, a program of research around the materiality of digital information (as inherent rather than alternate, coextensive, or parallel) is one that examines the existing and consequential materialities of digital systems and digital representations, from the flowchart to the algorithm, the data structure, and the virtual machine – or, in this case, the database.
With that as a backdrop, my goal in this article is three-fold. First, I want to examine the specific materialities of the database form in the sense that Manovich and others have examined. That is, I want to show how the specific kinds of relationality that are expressed in database forms are historically and materially contingent. Second, I want to explore some contemporary developments in which the evolution of hardware platforms, database technologies, and media experiences are tightly entwined, partly in order to illustrate the relevance of an argument based in media materiality and partly in order to document a slightly different pattern emerging in the context of large-scale networked digital media of the sort increasingly familiar to us online. Third, I want to use these ideas to reflect back upon the idea of the materialities of information and what that broader project might achieve. While different elements of the argument are likely familiar to readers from different backgrounds, what is at stake here is the question of how a material account of digitality and digital information, rather than just a material account of digital infrastructures, reconfigures the agenda of software studies and might be taken up within an interdisciplinary inquiry into the cultural practices of information.
My starting point for this is the recent interest in the materiality of digital information, which sets a context then for a re-reading of the technologies of both conventional and alternative database forms.
Scholars from a number of different disciplines have increasingly become interested in what we might term the materialities of information. For those from information studies, this topic has arisen as a corrective to the notion of information as purely abstract15; for those from media and cultural studies, it has arisen as a turn towards the infrastructures that underwrite contemporary media practice 16; while for those from science and technology studies, it has arisen largely as a reframing of interests in materialist studies of technology and the production of scientific knowledge into the realm of the digital (e.g. Bowker 2000).17 For some, the term ‘software studies’ (e.g. Fuller 2008)18 has emerged as a unifying banner, although considerable debate still attends the use of that or any other label.
Kirschenbaum 19 attempts to reinvest digital media with a sense of their material foundations through a ‘forensic’ examination of the media objects themselves, such as the patterns of both digital and analogue information encoded on disks. Kirschenbaum suggests that the disk has been curiously absent from analyses of digital media, given its critical enabling role, and his turn to the disk as an object of examination undermines conventional accounts of the digital as inherently ephemeral and fleeting in comparison with the traditional inscriptions of literary analysis.
Blanchette 20 sets out to counter rhetorical oppositions of ‘atoms’ and ‘bits’ as distinct spheres of interest, offering instead an account that ranges from models of processor architecture to deliberations on Internet culture. His particular focus is computer science’s ‘layered model’ as both a technical and a practical device for organizing computer systems and their production. He argues that the modular arrangement of digital systems is driven as much by material as by formal mathematical considerations, and indeed can be seen as a response precisely to the materialities of those systems while at the same time creating institutional arrangements (separating and reifying different areas of digital system design) that enshrine particular materially-grounded architectural forms. Castelle 21, while not addressing himself specifically to the questions of materiality or software as such, instructively details the relationship between processing models and institutional forms in the same relational databases that occupy my attention here.
Montford and Bogost 22 provide a compelling example of the mutual entwining of platform and interactive experience in their analysis of the Atari 2600 Video Computer System (VCS). The VCS was a home game console that played a significant role in the development and spread of video games as a medium. The technical constraints of the hardware platform, including video display timing and memory considerations, significantly shaped the sorts of games that could be played on it while, in turn, the success of the console gave it a particularly significant role in shaping the emerging genre of games and game-based interaction techniques in that period. Montford and Bogost provide an extremely detailed account of the VCS hardware, its capacities and its constraints, one that illustrates the way that the visual, auditory, and interactional features of the platform’s games arise in response to particular platform considerations. In doing so, they argue for an approach to software studies that starts from, and is grounded in, the specific technical nature of different technology platforms.
A fascinating recent alternative approach is taken by Montford et al. 23, who take a single line of code written in the BASIC programming language for the Commodore 64 as a jumping-off point for a wide-ranging exploration of algorithmic art, graphical representations, digital representation, sequentiality, and randomness, amongst other topics. While this retains some of the platform studies focus, it expands its scope to focus more directly on the representational practices of software systems.
We can see already, then, that even amongst these allied (if not always closely aligned) accounts of digital materiality, many different interpretations are at work concerning which aspects of the digital may be subject to a material reading and the ways in which this materiality matters. Dourish and Mazmanian 24 identify five separate programs that arise around what the materialities of information might be, discussing a material culture reading of digital goods, a geographical account of the spatial practices of digital infrastructures, a historical materialist reading of the production of information systems, a consideration of the widening application of the digital as a model for different areas of human activity, and the issue of material forms and representational practices. It is this last consideration that they examine in more detail (and which occupies us here in the specific case of database technologies); they explore two example domains (digital photography and nuclear weapons testing) as sites at which the coevolution of technology, representation, and practice can be examined.
Dourish and Mazmanian argue that this represents a third approach to the study of software as a cultural form, which is simultaneously more empirically grounded than the metaphorical invocation of algorithmics as a counterpoint to literary inscription while being less narrowly focused than platform studies. This third approach focuses on the recurrent patterns of arrangements of software, hardware, and representational practice – what is often thought of in technology terms as ‘architecture’ rather than implementation – as the foundation for a study of digital cultural practice. These architectures link, on the one hand, specific arrangements of technical components with, on the other, conventions of use and the software models that support them. Here, I want to exemplify this approach through an examination of the architectures of database systems, focusing on the interplay between representational practice and technological design.
Doing so entails, first, identifying just what we might mean by database.
The term ‘database’ is often used ambiguously. This is not simply a complaint about, say, the black boxing of database in Manovich’s analysis, but is true in the technical domain too, where the term has different uses at different times. Purely within the technical domain, we can distinguish between three levels at which the term is sometimes used.
The first level is simply a collection of data. In informal parlance, the term ‘database’ is sometimes used simply to denote the amassing of data or information, perhaps synonymously with ‘data set.’ However, database, as a technical term, is not merely a quantitative measure of data.
A database, as a technical term in computer science and a second level of description, is a collection of data that is, critically, encoded and organized according to a common scheme. As we will discuss below, databases may employ different sorts of schemes or arrangements for their data, including record structures, tree structures, and network structures. Arranging data into a common format makes each data item amenable to a common set of operations – data items can be sorted, compared, collated, grouped, summarized, processed, transformed, and related because a common format is employed. At this level, then, the database comprises two components – both the data itself and some description of the organizational scheme by which it is formatted.
The third common use of the term ‘database’ is to refer to the software system that implements this relationship between data format and data. In computer science, these are more formally called database management systems (DBMS). Software systems like Oracle Database (often just called Oracle), FileMaker Pro, or the open-source MySQL system are database management systems – that is, they provide programmers and end users with the tools they need to describe data formats and to process data collections according to those formats.
In summary, a (3) database management system implements (2) a database, which organizes (1) a collection of data. Each of these three elements is sometimes called a database, although it is useful and important to distinguish between them. In particular, the common data formats, incorporated into databases and implemented in database management systems, are worthy of more attention.
1. Data Formats and the Relational Model
The fundamental challenge of database development is the identification of a common format for data items or objects. If a database is to encode information about people and their belongings, for example, how should we represent people, and how should we represent the objects they possess, and how should we represent the relationship between the two?
Aspects of this problem are specific to particular databases and applications; for instance, representing people and their belongings will probably be done differently by an insurance company than by a moving company. Other aspects of the data modeling problem, though, are more generic. Different database management systems and database technologies have offered different styles of representation. The challenge is to find an approach that is rich enough to handle all the different applications that might be developed and uniform enough to handle each of them the same way.
The most common approach in contemporary database systems, in widespread use since the 1970s, is the relational model 25. In a relational system, data is organized into tables. Each column of a table encodes information of a particular sort – for instance, a table might have one column for someone’s name, and a second column for their Social Security number. Each row of a table encodes a relationship or relation (hence the name ‘relational’) between data items; for instance, one row expresses the relationship between the name “John Smith” and the Social Security number 123-45-6789 while another expresses that between the name “Nancy Jones” and the Social Security number 246-89-7531. A relational database might comprise many different tables, which together relate different items of data to each other – within a single database, for example, one table might relate names to Social Security numbers, a second might relate Social Security numbers to years and tax returns, while a third relates years to total tax receipts. Such a model has a number of useful properties. First, it fulfills the requirement that it be flexible enough to accommodate a wide range of applications within a single, generic data model. Second, it allows a small number of operations to be defined that apply uniformly across all tables (insert a new row, remove a row, find all rows with a specific value, update a value in a row, etc.). Third, it corresponds to a mathematical description (the relational calculus) that can be used to analyze the properties of specific operations. Fourth, it supports a particular set of optimizations (database normalizations) that can be shown to improve system performance.
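The tabular arrangement described above can be sketched concretely with Python’s built-in sqlite3 module; this is a minimal illustration, not any particular production system, and the table name and sample rows are invented to echo the examples in the text.

```python
import sqlite3

# A minimal sketch of the relational model, using Python's built-in
# sqlite3 module; the table and sample rows are illustrative.
conn = sqlite3.connect(":memory:")

# Each column encodes one sort of information; each row encodes a
# relation between data items.
conn.execute("CREATE TABLE taxpayers (name TEXT, ssn TEXT)")
conn.execute("INSERT INTO taxpayers VALUES ('John Smith', '123-45-6789')")
conn.execute("INSERT INTO taxpayers VALUES ('Nancy Jones', '246-89-7531')")

# The same small set of generic operations (insert, delete, update,
# search) applies to any table, whatever it happens to describe.
row = conn.execute(
    "SELECT name FROM taxpayers WHERE ssn = ?", ("123-45-6789",)
).fetchone()
print(row[0])  # John Smith
```

Note that the query addresses rows by their values, not by their position: the table itself maintains no order amongst its rows, which is the associativity that Manovich’s analysis emphasizes.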
Since its introduction in the 1970s in IBM’s influential System R, the relational model has become the dominant model in commercial and independent database technologies; the SQL query language that was designed in concert with the model is, similarly, both a conventional and formal standard for database programming. The fact that it was the IBM of the 1970s that introduced the relational model will be important to our story here, given that IBM was simultaneously in the business of hardware design, software production, and bureau computer services. Before we return to that question, though, we should place the relational model in more context by noting that, although it became the dominant model, it was never the only one. Alternatives exist – both some that predated the relational model and some that have been developed since. Examining these alternatives briefly will provide some context for understanding the materiality of relational data processing.
2. Alternatives to the Relational Model
For comparative purposes, let’s briefly consider three alternatives to the relational data model. The goal here is not to be exhaustive, by any means, but to set out some alternatives as a basis for exploring the specific materialities of different forms as they arise in real systems. I will focus on three models in particular – hierarchical, network, and attribute-value approaches.
In a hierarchical model, the fundamental structure is not a table, as in the relational model, but a tree (and in particular, the inverted trees common in computer science, with a single object at the ‘top’ or ‘root’ of the tree.) Objects are connected together in tree structures, which look rather like family trees; each data object is potentially the ‘parent’ of a number of other objects, which can in turn be ‘parents’ themselves. A tree is a recursive data structure in that each subcomponent – that is, the collection of elements that are grouped under any given data node – is itself a tree. A tree structure supports a range of operations, including moving ‘up’ and ‘down’ the tree, moving and merging its ‘branches,’ and performing searches over subunits. Just as a relational data model does not specify any particular relationship between the objects that are placed in a table, so too the hierarchical model does not specify what relationship is implied by the ‘parent/child’ relationship. It might describe part/whole relationships (as in the phylogenetic tree that organizes organisms into species, genus, family, and order) or it might describe institutional relationships (such as a tree of academic advisors and graduate students).
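The hierarchical model can be sketched with nested Python dictionaries, where each value is itself a tree; the advisor/student names below are invented for illustration.

```python
# A minimal sketch of a hierarchical data model using nested Python
# dictionaries; the advisor/student names are invented for illustration.
org = {
    "Advisor": {
        "Student A": {},
        "Student B": {"Research Assistant": {}},
    },
}

def descendants(tree):
    """Walk 'down' the tree, collecting every node below the root."""
    found = []
    for child, subtree in tree.items():
        found.append(child)
        found.extend(descendants(subtree))  # each subtree is itself a tree
    return found

everyone = descendants(org["Advisor"])
print(everyone)  # ['Student A', 'Student B', 'Research Assistant']
```

The recursion in `descendants` reflects the recursive character of the structure itself: the operation that applies to the whole tree applies equally to any of its branches.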
A network model relaxes the constraints even further. In a network model, data objects are connected to each other in arbitrary structures by links, which might have a range of ‘types.’ One sort of link might, for instance, be an IS-A link, which connects an object representing something in the world with an object representing the kind of thing that it is. For instance, the statement “Clyde is an elephant” might be encoded in the database by an IS-A link which connects a Clyde object to an Elephant object, which might find itself similarly linked by objects representing other elephants while it also connects via a different sort of link, a HAS-A link, to objects representing Tail, Trunk, and Hide. The result is a network of interrelated objects that can be navigated via links that describe relationships between objects and classes of objects. As in the hierarchical model, links have no predefined meaning; just as database programmers create a series of objects that match the particular domains they are working with, so too do they develop a set of appropriate link types as part of their modeling exercise.
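One simple way to sketch the network model is as a collection of typed links, here stored as (source, link-type, target) triples following the Clyde example in the text; the representation is illustrative rather than any particular system’s.

```python
# A minimal sketch of a network data model: typed links between
# arbitrary objects, stored as (source, link-type, target) triples.
links = [
    ("Clyde", "IS-A", "Elephant"),
    ("Jumbo", "IS-A", "Elephant"),
    ("Elephant", "HAS-A", "Trunk"),
    ("Elephant", "HAS-A", "Tail"),
]

def follow(source, link_type):
    """Navigate outward from an object along links of a given type."""
    return [t for (s, lt, t) in links if s == source and lt == link_type]

# Navigating from Clyde: IS-A leads to Elephant, whose HAS-A links
# lead in turn to its parts.
parts = [p for kind in follow("Clyde", "IS-A") for p in follow(kind, "HAS-A")]
print(parts)  # ['Trunk', 'Tail']
```

Nothing in the structure itself fixes what “IS-A” or “HAS-A” mean; as the text notes, the link types are part of the programmer’s modeling exercise.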
If network models relax the linkage constraints of a hierarchical model, attribute-value systems relax them still further. An attribute-value system comprises an unordered and unstructured collection of data objects, each of which has a set of attributes (themselves potentially unordered and unstructured.) So for instance, a person might have attributes that describe nationality, date of birth, and city of residence, while a car might have attributes that describe color, make, and model. Clearly, this is similar to the way that the relational model uses a table to organize the attributes of objects, but in attribute-value systems, attributes are associated directly with specific objects, rather than being captured in a table that will store data for multiple objects. That is, in an attribute-value system, each person can have not just different attributes but a different set of attributes. This matches cases where different items might be used in different contexts; for instance, I might know a lot of people, but the sorts of things that I know about different people depend on the contexts in which I encounter them or relate to them, so I remember different sorts of attributes about a student, a professional colleague, a friend, and a service professional. The attribute-value approach is very broad, and in slightly different contexts it goes by different names; with minor variations, it is also known as entity-attribute-value and as key-value, although the core ideas are the same.
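The attribute-value approach can be sketched with a Python dictionary of dictionaries, where each object carries its own attributes and no common schema is imposed; the keys and entries are invented for illustration.

```python
# A minimal sketch of an attribute-value store: each object carries
# its own attributes, and different objects need not share a common
# set of them. The keys and entries are invented for illustration.
store = {
    "student:1": {"name": "Lee", "advisor": "Kim", "year": 3},
    "colleague:1": {"name": "Ana", "affiliation": "UCI"},
    "plumber:1": {"name": "Bob", "phone": "555-0100"},
}

# What we record about each person depends on the context in which we
# know them; absent attributes are simply absent, not schema errors.
advisor = store["student:1"].get("advisor")
phone = store["student:1"].get("phone")  # no phone recorded for this entry
```

Unlike a relational table, where every row shares the same columns, here each entry’s attributes are private to it; a missing attribute is an ordinary condition rather than a violation of any structure.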
These approaches to data representation and storage largely predate the relational model. Indeed, the development of the relational model was motivated not least by the desire to develop a unified approach, and one that provided a stronger separation between an abstract data model and a machine-specific representation, which was seen to be a problem of network models, for example. Further, Codd’s efforts in building his relational database systems on top of an algebraic foundation with provable properties made it attractive for high-performance and high-reliability settings. However, the alternative approaches persisted, particularly in settings like Artificial Intelligence (AI) research, where they were seen as more ‘natural’ approaches that could be used to model human knowledge practice. Indeed, hierarchical and networked models are more likely to be seen now in AI textbooks than database textbooks, while attribute-value mechanisms were seen as so foundational to AI research that they were incorporated into basic infrastructures like the Lisp programming language. 26
In the relationship between these alternative forms, though, we can find the first hint of the way that the specific materialities of representational models necessitate an elaboration of Manovich’s argument about databases as cultural forms. It makes clear that the fundamental element on which Manovich’s database analysis rests, the relationship between abstract descriptions and specific data items, is a feature of the dominant model of data representation, the relational model, but not an inherent feature of databases per se. It is more specifically grounded not just in the digital in general but in particular kinds of computational platforms – and as those platforms shift, the foundations of his analysis do too. Before we consider those shifts, though, let’s examine some of the specific materialities of the relational form.
3. The Materialities of Relational Data
One key feature of the relational model, in comparison to the others, is the separation it creates between the structure and content of a database. The structure of the database – technically known as a schema – defines the set of tables and the columns of the table, along with the types of data (such as numbers, dates, and text items) that each column will contain. The content of the database is the actual data that gets stored in each table – the data items that make up each record. Defining or constructing a database is the process of setting up the database schema. Subsequently using the database is the process of adding, removing, updating, and searching the data records that make up its content. The separation is both temporal and spatial. It is spatial because the database schema is represented separately from the data; it is temporal because the schema must be set up before any data can be entered (and because, in many cases, the schema is difficult or impossible to change once data entry has begun.)
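The spatial and temporal separation of structure from content can be sketched with sqlite3: the schema is declared first and held apart from the rows, and content that falls outside it is rejected. This is an illustrative sketch; the table and column names are invented.

```python
import sqlite3

# A sketch of the structure/content separation in a relational
# database, using Python's built-in sqlite3 module.
conn = sqlite3.connect(":memory:")

# Structure: the schema is set up first, and stored apart from the data.
conn.execute("CREATE TABLE people (name TEXT, birth_year INTEGER)")

# Content: rows are then added, updated, and removed within that
# fixed structure.
conn.execute("INSERT INTO people VALUES ('Ada', 1815)")

# Content that falls outside the schema is rejected outright: there is
# no 'height' column, so this insertion fails.
try:
    conn.execute("INSERT INTO people (name, height) VALUES ('Bob', 180)")
    rejected = False
except sqlite3.OperationalError:
    rejected = True
```

The failed insertion makes the temporal asymmetry visible: content must conform to a structure that was fixed before any content existed.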
In hierarchical databases and network databases, by contrast, there is no separation between structure and content. In these models, data items are linked directly to each other; in the relational model, the relationships between items are encoded indirectly, through the table structures defined by the database schema. Hierarchical and network databases – as well as the attribute-value systems that we will see later – do not (in general) reify structure and manage it as a separate concern.
Relational databases derive several advantages from the separation of structure and content. First, the separation means that database operations are made uniform, because they will all fit with a predefined structure. Second, it allows the system to optimize database performance; since it knows in advance what sort of data will be processed, efficient algorithms that are specialized to defined structures can be employed rather than more flexible algorithms that tend to be less efficient.
The significance of the separation between structure and content lies in the radical differences in their malleability. Content is infinitely malleable; indeed, that is the point of the database in the first place. The openness, flexibility, and extensibility of a database lies in the content that fits within the schema. That schema itself, however, is much more rigid. While most relational databases provide mechanisms for columns to be added to or removed from the database structure, such changes are limited and can invalidate existing data, limiting their use still further.
One of the features of the database as a generic form – as a tool, as a research product, or as a platform for collaboration – is its provisionality. Manovich celebrates the database’s resistance to narrative, in comparison to film and literary forms; the database, he argues, is open to many different narratives since the user can traverse it freely. In addition to this narrative provisionality, it also offers an archival provisionality, since the database can be extended and revised, its contents continually updated; indeed, we generally think of databases as dynamic and evolving rather than fixed and complete. However, narrative and archival provisionality are primarily associated with content. The structure of a relational database is much more resistant to change. So the database is provisional, but only with limits; it is open-ended, but only in the terms originally laid down in its structure. It invites new users in, as viewers, as editors, and as contributors, but the structure of the schema largely defines and constrains the conditions of viewing, editing, and contributing.
When we consider the materiality of database forms, then, we need to consider it in this light. The database is composed entirely of bits, but those bits are not equal. Some are more easily changed than others; some carry more significance than others; some exhibit more flexibility and others more rigidity. A database is not simply a collection of bits, any more than an airplane is simply a collection of atoms. If the database is malleable, extensible, or revisable, it is so not simply because it is represented as electrical signals in a computer or magnetic traces on a disk; malleability, extensibility, and revisability depend too on the maintenance of constraints that make this specific collection of electrical signals or magnetic traces work as a database; and within these constraints, new materialities need to be acknowledged.
The significance of the materialities of relational data does not lie solely within the domain of its structural forms. Those structural forms are entwined too with specific procedural demands; indeed, the ability to encode data and the ability to operate on those encodings are twinned aspects of the relational model. In fact, it is in the area of database processing that we find the seeds of contemporary shifts in database materiality, so I turn now to the question of processing relational data.
Processing Relational Data
Relational databases store their data in tabular form. Like a row in a printed table, a single entry in a table describes a relation amongst items of data. Basic operations on these tables include changing an item of data, inserting a new relation (that is, a row in the table), or deleting a relation. However, many of the actions that we might want to perform in an application using the database involve more than one database operation. For instance, transferring money from one bank account to another might involve changing the values of two different relations in a table (the relation describing the balance of the originating account and the relation describing the balance of the destination account.) These multi-operation actions are problematic because they create the opportunity for inconsistency, from the application’s point of view (e.g. a point at which the money exists in both accounts, or in neither account), if the action should be interrupted part-way through, or if another action should be carried out simultaneously. Accordingly, databases are designed to cluster fundamental actions together into so-called ‘transactions,’ which group basic operations into indivisible logical units that correspond to application actions.
In traditional relational databases, the execution of transactions is held to what are known as the ACID properties. ACID is an acronym for the four fundamental properties of transaction processing – atomicity, consistency, isolation, and durability. These are the properties that a processing system for relational database transactions must maintain:
- Atomicity means that transactions are carried out as an indivisible unit – either the entire transaction is performed, or none of it.
- Consistency means that the database is in a consistent state at the end of the execution of each transaction (and, hence, before the execution of any other).
- Isolation means that each transaction is executed as if it were the only transaction being performed on the system; in other words, the execution of any transaction is independent of, or isolated from, the concurrent execution of any other transaction.
- Durability means that once a transaction has been executed, its effects are permanent. A user may choose to change the data, but no internal action of the database will ‘undo’ a transaction once it is committed.
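The atomicity property, and the grouping of the two-step bank transfer into a single transaction, can be sketched with SQLite's transaction facilities; the accounts table and the simulated failure are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    # Two operations grouped into one indivisible transaction: either
    # both updates take effect, or neither does.
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        if fail_midway:
            raise RuntimeError("simulated crash between the two updates")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))
        conn.commit()
    except RuntimeError:
        conn.rollback()  # the partial update is undone

transfer(conn, "alice", "bob", 30, fail_midway=True)
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} -- no inconsistent half-state
```

A crash part-way through leaves the money in neither limbo state that the text describes: the rollback restores the database to its last consistent point.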
The ACID properties were first defined by Gray 27. They have become the fundamental touchstones for database implementation so that, as new execution models and new software architectures arise, maintaining the ACID properties ensures the consistency and effectiveness of applications designed to rely on database functionality. While ACID is not the only database consistency regime, it is the most widely adopted model, particularly for high-reliability applications.
The ACID properties describe the constraints that govern the execution of transactions within a relational database system. In doing so, they establish a link between the relational data model – which, for Codd, was characterized by both its universality and its mathematical foundations – and a processing model which pays more attention to algorithms and performance. Although the terms are used almost interchangeably to refer to conventional database processing (given that almost all relational databases are transactional, and almost all transactional databases relational), they draw attention to different aspects of the systems.
Castelle 28 has pointed to transactional processing as an important link between organizational practice and technological systems, compellingly suggesting that this is one important consideration in accounting for the success of the relational model. My immediate concern here with the transactional model though is in terms of the link that it provides from the relational model to specific execution contexts and their material arrangements.
The Materialities of Relational Data Processing
As I noted earlier, the relational model was developed by researchers at IBM, and first enshrined in an IBM software product, System R. System R was notable for many reasons; it introduced the first version of the programming system (SQL) that is most commonly used with relational database systems, and it demonstrated that the relational model could be implemented efficiently. In many ways, it is the progenitor of the vast majority of contemporary database systems.
IBM, however, was not only a software provider; it was also an influential hardware developer. IBM’s hardware was both sold and rented to clients, while at the same time, it also provided bureau computer services. Database services were a critical component of IBM’s business. With a new data model and transaction processing framework in hand, then, IBM was also in a position to develop its computer architectures to enhance performance in executing relational database transactions, and to define the benchmarks and measures by which database systems would be evaluated.
Indeed, we could reasonably argue that the identification and definition of the ACID properties enabled, or at least fuelled, the development of what we might call the ACID regime, which encompasses both the software and hardware developments that enable high transaction-throughput systems. The ACID properties provide a processing framework for relational data, and turn attention to transactions-per-second as a measure of system performance. Further, they set the context for the development of hardware features (cache lines, storage system interfaces, etc.) which are themselves assessed by their impact upon benchmarks. The database industry adopted as an initial measure of performance a benchmark (known as TP1) that had originated inside IBM, so that the relational data model, the ACID processing framework, the design of hardware to execute it efficiently, and the use of a standard model for evaluating performance in terms of transactions became mutually reinforcing.
In the ACID regime we find at work highly entwined sociomaterial configurations. High performance database architectures are designed to take advantage of the latest hardware and machine architecture developments, while simultaneously, hardware designs emerge in order to support the needs of database software implementations. Similarly, database systems are developed to meet the needs of business and commercial applications, while commercial applications develop around the capacities and possibilities of database systems. The material configurations of database technologies – the architectures of hardware platforms and the capabilities that they offer when twinned with appropriately designed software systems – can be seen to be woven into the very fabric of organizational life and commercial arrangements, if we see the cost of database communications and integration as a component of the ‘transaction costs’ that Coase 29, in his classic paper, outlines as fundamental to the shaping of organizational structure in a market economy. Coase argues that there are costs associated with the execution of transactions in a marketplace, beyond the basic price of goods and services. As an analogy, if I want to depend on an external agency to clean my house, I have to make my house ‘cleanable’ by a third party by, for instance, separating fragile items from regular household goods (something that takes time and hence costs money.) On this basis, he shows that the question of whether a particular function (say, document management in a law firm, catering facilities in a university, or IT support in a government agency) should be supported in-house or should be outsourced to the marketplace depends on the relationship between the costs of actually providing the function and the transaction costs involved in letting the market handle it.
In high tech operations, database interoperation is certainly a relevant factor; not only are there transaction costs involved in establishing the compatibility of two databases, but there may also be costs associated with the loss of, or monitoring of, the ACID properties when functions are divided across two or more distinct database systems. By this means we can trace a bidirectional relationship between, on one hand, computer system structure and, on the other, the organizations that employ them.
We have seen, then, that the notion of the database which has occupied the attention of theorists in the software studies movement needs to be unpacked. Databases abound; different data models have different constraints and are differently adapted to hardware platforms that are themselves evolving.
The current moment is a particularly interesting one to investigate this topic, precisely because a number of the elements are currently undergoing some reconsideration. Although there have always been alternatives to the dominant relational model, there has been a specific and coordinated shift in business models, application demands, database technologies, and hardware platforms that reframe the database in ways that highlight the relevance of a material reading of database practice. The shift can be seen primarily in Internet-based services.
In 2004, Google researchers Jeffrey Dean and Sanjay Ghemawat published a description of the programming model on which Google bases its software services 30. The mechanism, known as map-reduce, was not in itself startlingly novel, and yet, in context, it has proved to be remarkably influential. The context here includes contemporaneous descriptions of a compatible distributed file system 31, the highly visible success of Google itself, and the shift in hardware platforms associated with Google and Google-like services. Google’s approach to hardware is not to rely on high-performance mainframe systems, but rather to harness very large numbers of medium-powered computers, often stock PCs. This approach has been adopted by many other corporations. Map-reduce provides a programming model and infrastructure that allows computations to be distributed across and coordinated amongst these large and variable computing clusters. Arguably, at least for a range of applications, systems like map-reduce provide high performance computing platforms without the massive capital investments required for high-performance hardware. The devil lies in the details, though, and in particular here the details of the hedge “for a range of applications.”
The basis of the map-reduce mechanism lies in two operations – ‘map’ and ‘reduce.’ ‘Map’ transforms a single large computing task into a set of smaller tasks that can be conducted independently, each producing a single result; ‘reduce’ takes the results of those separate computations and combines them into a single solution. In the map-reduce programming model, the challenge for the programmer is to encode a problem in terms of map and reduce operations. So, for example, to find the most frequently-occurring words in a document, we might ‘map’ this into a set of parallel tasks, each of which will count up the instances of single words, producing an array of individual word counts, and then ‘reduce’ that array by selecting just those most-frequent words in which we are interested; or if I wanted to count up the number of nail salons in a city, I might break the city up into mile-square sections, count up the salons for each section concurrently, and then sum up the results. What we achieve through this mechanism is a degree of scalability; that is, our task can be executed on different-sized clusters of commodity computers depending on the size of the data set.
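The word-counting example can be sketched as a toy map-reduce in Python; a real system such as Hadoop would distribute the mapped tasks across a cluster, but the decomposition is the same:

```python
from collections import Counter
from functools import reduce

def map_task(chunk):
    # Each independent task counts the words in one chunk of the document.
    return Counter(chunk.split())

def reduce_task(a, b):
    # Combine partial counts into a single result.
    return a + b

# The "cluster" here is just a list; each chunk could be processed
# on a separate machine, since the map tasks are independent.
document_chunks = [
    "the quick brown fox",
    "the lazy dog and the fox",
]
partials = [map_task(c) for c in document_chunks]   # 'map' phase
totals = reduce(reduce_task, partials, Counter())   # 'reduce' phase
print(totals.most_common(2))  # [('the', 3), ('fox', 2)]
```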
This achievement comes at a cost, though. One component of this cost is the differential support for applications. Some tasks map more easily onto the map-reduce model than others, and so some aspects of application functionality may have to be sacrificed to fit this model. For instance, counting up the number of barbers in a city can be easily decomposed by dividing the city into smaller regions, counting up barbers in each region in parallel and combining the results. Counting up the number of distinct roads, though, cannot be broken down the same way because roads will cross the boundaries of smaller regions; the decisions are not independent when the decomposition is spatial. A second, related component is the degree of consistency achievable within the system – a system based on map-reduce and the independent but coordinated operation of many small computing systems may not be able to achieve the degrees of consistency achievable under the ACID regime. Here again the concern is that there is no communication between the processing units handling each mapped task until the reduce stage is performed.
Map-reduce has been influential in at least two ways. First, it is the fundamental mechanism that powers Google’s own services, including not just web search, but services such as Mail, Maps, Translate, and other services that many people rely on day to day. Second, it has strongly influenced others; for instance, an open-source implementation of the map-reduce model, called Hadoop, has been adopted as a core infrastructure by a wide range of Internet corporations, including eBay, Facebook, Hulu, and LinkedIn.
Like the ACID properties, map-reduce is a single component in a larger system, but one that encapsulates a larger regime of computational configuration. Map-reduce as a programming model is associated with particular styles of hardware provisioning (flexibly-sized clusters of medium-powered, interconnected computational devices) and, in turn, with a particular approach to data storage. In the cluster-computing or cloud-computing model associated with the map-reduce regime, traditional relational storage, as an application element, is often displaced by associative storage characterized by large, informally-organized collections of data objects linked together by broad associations. If bank accounts are the canonical example of data managed in the ACID regime, then Facebook’s collection of photographs is the canonical example of associative storage – a large-scale, informal collection of ‘tagged’ data objects in which consistency demands are low (if some temporary inconsistency arises between two users’ views of a collection of images of Paul, the consequences are not particularly problematic.) Again, like the ACID regime, we find an entwining of technological capacity and commercial use, mediated in this case not only by the development of hardware platforms by computer manufacturers, but also by the provision of commercial computational facilities by service providers like Amazon, Rackspace, and SoftLayer. Software like Hadoop makes feasible computational services deployed across scalable clusters of computational nodes; the combination of these systems, along with virtualization technology, allows scalable cluster resources to be sold as commodity services, simultaneously making new hardware configurations commercially productive; and the availability of these services enables the development of new sorts of cloud computing applications characterized by distributed access, low degrees of integration, and loose consistency.
Once again, these trends are mutually influential, and are linked too to the emergence of particular genres of user experience, as in the proliferation of social networking and informal-update notification services – the essential framework of what is popularly called ‘Web 2.0.’
What happens, then, to the database in a map-reduce world? As we have seen, the relational data model, and the ACID-based transactional execution model that it supports, were entwined with the development of mainframe computing platforms, but map-reduce is part of an alternative platform, based on heterogeneous clusters of commodity processors connected by networks. Formal ACID properties are difficult to maintain in these inherently decentralized systems, since the consistency that ACID requires and the coarse-grained parallelism of map-reduce cluster computing are largely incompatible. Two strategies emerge in response to this. The first is to build data storage platforms which replicate some of the traditional execution model but relax the constraints a little, as represented by systems like Spanner 32, to be discussed in more detail later. The second is to abandon the relational/transactional model altogether, an approach that has resulted in a range of alternative data management platforms like MongoDB, FlockDB, and FluidDB, as well as research products such as Silt 33 and Hyperdex 34. Collectively, systems of this sort are sometimes called ‘NoSQL’ databases 35, an indication of their explicit positioning as alternatives to the relational-transactional approach (since SQL, the Structured Query Language, is the canonical way of programming relational-transactional databases).
So, while the arrival of map-reduce, as a programming model for a new platform architecture, was a significant trigger for innovation and transformation in database technologies, this is not all about map-reduce by any means. These new database platforms lend themselves to effective and efficient implementation on a new class of hardware architectures, but they by no means require it. Rather, they implement an alternative to the relational model that can be implemented across a wide range of potential platforms. Rather than being tied to particular hardware arrangements, they embody instead a different set of trade-offs between expressiveness, efficiency, and consistency, and propose too a different set of criteria by which data management tools should be evaluated. By re-evaluating the trade-offs, they also support a different set of applications – ones in which responsiveness, distribution, informality and user interaction play a greater part.
What we see entwined here, then, are three separate concerns. The first is the emergence of an alternative architecture for large-scale computer hardware. The second is the development of an alternative approach to data management. The third is the rise of a new range of applications that take a different approach to the end-user aspects of data management. These three concerns develop in coordination with each other, in response to each other, but with none as a primary causal factor. A material reading of this constellation focuses on its specific properties as they shape representational practice and the enactment of digital data.
Data without Relations
The term ‘NoSQL database’ – a members’ term redolent with moral valence – does not refer to a single implementation or conceptual model, but rather to a range of designs that, in different ways, respond to perceived problems with the relational approach. These problems might be conceptual, technical, pragmatic, or economic, and different problems provoke different design responses. One should be careful, then, in generalizing across them. However, when we take them together, we can see some patterns in both the motivation for alternatives to the relational model and in the forms that those alternatives take. These patterns begin to point to the entwinings of representational technology and representational practice that were amongst our original motivations.
Key considerations that drive the development of NoSQL databases include the need to conveniently incorporate them within a map-reduce framework, implemented on loosely-coupled clusters of individual computers rather than on tightly-coupled or monolithic architectures. Developers have also sought data models that more closely conform to the conventions of contemporary programming languages, especially object-oriented languages, and processing models that relax consistency constraints in exchange for the higher performance that can be obtained when parallel streams of processing no longer need to be closely coordinated. These considerations are not independent; for instance, the processing costs of transactional consistency are higher in loosely-coupled distributed systems, while object-oriented data models and programming frameworks are regularly employed in cloud-based and Web 2.0-style programming systems (themselves frequently built on cluster-based servers rather than large mainframes.)
One of the most common approaches to NoSQL data management is an attribute-value system (also sometimes known as a key-value system.) Whereas the fundamental element in the relational model is the table, which describes a relationship amongst data items, attribute-value systems are built around data objects, which frequently directly represent elements in the domain being modeled (e.g. people, pages, apps, or photographs.) Associated with each object is an unstructured collection of data values, each of which is associated with an identifying term or ‘key.’ So, if I want to associate myself as the owner of a particular data object, I might link the value “Paul Dourish” to the object, using the key “owner.” The expected pattern of interaction is that programs will take an object and look up the value associated with the key (e.g. to see what value represents the “owner” of an object); finding whether a particular value is associated with an object without knowing the key is frequently inefficient or not even possible. The central database operations here, then, are to add a new key/value pair to an object, to update the value associated with a key for an object, or to retrieve the value associated with a key for an object.
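A minimal attribute-value store can be sketched in a few lines of Python; the class and method names are illustrative, not those of any particular system:

```python
class AttributeValueStore:
    # Each object carries its own unstructured collection of key/value
    # pairs; no schema governs which keys may appear on which objects.
    def __init__(self):
        self.objects = {}  # object id -> {key: value}

    def put(self, obj_id, key, value):
        # Add a new key/value pair to an object, or update an existing one.
        self.objects.setdefault(obj_id, {})[key] = value

    def get(self, obj_id, key, default=None):
        # Retrieve the value for a key; lookup *by key* is the fast path,
        # while finding a value without knowing its key is not supported.
        return self.objects.get(obj_id, {}).get(key, default)

store = AttributeValueStore()
store.put("photo-17", "owner", "Paul Dourish")
store.put("photo-17", "tag", "Paris")
print(store.get("photo-17", "owner"))  # Paul Dourish
```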
In some implementations, pairs of keys and values are associated directly with an object, as described above; in others, there may be no anchoring objects but just an unstructured collection of key/value pairs.
There may be many objects in such a database, but there is generally no formal structure that collects them together. Similarly, there is generally no formal structure that describes what keys are associated with what objects. While a particular program might employ convention (such as using an “owner” key, for example), keys and values are generally associated directly with objects rather than being defined in schemas to which objects must then conform. The data is ‘unstructured’ in the sense that no formal structure governs or limits the keys that can be associated with an object.
This arrangement has a number of properties that meet the particular needs outlined above. Since objects are relatively independent, they can be distributed across different nodes in a cluster of computers with relative ease; similarly, key/value pairs are independent of each other and can also be distributed. Likewise, operations on each object are independent and can be distributed across a range of machines with relatively little coordination. There are, then, different ways that an attribute-value system can be managed in a map-reduce framework fairly easily; operations can ‘map’ over objects, or they can ‘map’ over keys, and in both cases, operations can be performed that take advantage of the map-reduce approach with little overhead. The relative independence of objects and of keys from each other is also a convenient starting point for a replication strategy that takes advantage of distributed resources, especially in applications which read data more often than they write it. The Dynamo system, a distributed attribute-value system designed and used by Amazon.com in its Web services, was an early and highly influential example of a NoSQL database that accords with this model 36.
While attribute-value systems are a common approach to NoSQL databases, other approaches attempt to retain more of the relational model and its guarantees within evolving technical contexts. Google’s BigTable 37, for instance, is an internally-developed infrastructure providing database services to many of Google’s applications and is itself deployed on Google’s distributed infrastructure. In order to take advantage of this environment, and to provide high availability, BigTable databases are distributed across several servers. BigTable provides a tabular format, but does not provide the traditional relational database model. It provides no transaction facility for clustering database operations into larger blocks that maintain the ACID properties. Further, given the importance of questions of locality (that is, how data is clustered) and of storage (whether data is small enough to store in main memory for faster access, or must be stored on a disk) – both of which are considerations that traditional database systems explicitly hide from applications and application programmers – BigTable provides application programmers with mechanisms to control them.
The designers of a follow-on system, Spanner 38, turned explicit attention to attempting to recover those elements of traditional relational databases that had been lost in the development of BigTable, including transactional consistency and relational tables (or variants on the same). Similarly, one of the advantages that they list for Spanner over BigTable is its use of an SQL-like query language – i.e., one that is similar to the query language introduced in IBM’s System R and which, over time, evolved into a national (ANSI) and international (ISO) standard.
The desirability of transactional semantics and an SQL-based language reflect a different form of materiality that is beyond the scope of the paper here – the way that particular standards, expectations, and conceptual models are so deeply entwined with programming systems that, despite the vaunted flexibility and malleability of software, some interfaces emerge almost as fixed points due to their ubiquity. However, more relevant here are the alternative materialities of relational and key/value approaches to databases, and their consequences.
Materialities in NoSQL Databases
If the approach to materialities that I advocate here is one that attends to specific material properties that are entwined with representational practice, what properties are significant in the rise of NoSQL databases? I will deal with four here – granularity, associativity, multiplicity, and convergence.
Granularity concerns the size of the elements that are the objects of attention and action within the system – and, in turn, with the entities that they might represent. Relational databases, for instance, deal in terms of three different scales of action – data values, relations, and tables. Different commands in the SQL control language will act directly on entities of these different scales (the UPDATE instruction operates on data values, for example, while INSERT operates on relations and CREATE TABLE operates on tables.) These matter because representational strategies – that is, how a social or physical object or phenomenon in the world might be represented both in database notation and as a stream of bits – have consequences for what things might be simple or difficult to perform in a particular database. For instance, when the scalar arrangements at work make it easy to address streets and houses as individual objects in a system, then it is relatively easy to build systems that reason about, for instance, postal routes. If, on the other hand, a house must always be represented in terms of a collection of rooms, then it becomes easier to reason about heating systems but harder to reason about property rights.
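The three scales of action can be seen in miniature in SQLite (the table and data here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Table scale: CREATE TABLE acts on the table as a whole.
conn.execute("CREATE TABLE houses (street TEXT, number INTEGER, rooms INTEGER)")

# Relation scale: INSERT acts on an entire row.
conn.execute("INSERT INTO houses VALUES ('Elm St', 12, 5)")

# Value scale: UPDATE acts on individual data values within rows.
conn.execute("UPDATE houses SET rooms = 6 WHERE number = 12")

result = conn.execute("SELECT rooms FROM houses").fetchone()
print(result)  # (6,)
```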
These considerations are, in part, an aspect of the art of programming, and reflect the design choices of those who build databases, but they are constrained by and shaped in response to the representational possibilities and material consequences of underlying technologies. As a representational and notational issue, then, granularity is a consideration that directly relates specific aspects of database technology implementation (such as cache size, memory architecture, query plan generation, and programming language integration) with the representational ‘reach’ of databases (including the question of what sorts of objects in the world they might represent, what forms of consistency can be achieved, with what temporalities they can be tied to dynamic events, and how easily they can be updated.) Where relational databases focus primarily on relations – that is, on recurrent structural associations between data values – NoSQL databases tend to focus on objects; and where the relational form sets up a hierarchy of data values, relations, and tables, NoSQL databases tend to operate at a single level of granularity, or, where they provide clustering mechanisms, do so to give the programmer control over aspects of system performance rather than as a representational strategy.
Related to the material questions of granularity is the question of associativity. How do different data objects cluster and associate? What binds them together, and allows them to be manipulated as collections rather than individual objects? Here, again, we see a difference between the relational model and NoSQL approaches, a difference that has consequences for representation and manipulation in database-backed technologies. In relational systems, data objects are associated through patterns in data values. This takes two forms. Within a table, such associations happen by bringing objects together according to the values stored in one or more columns within a table. So for example, one might use the SELECT statement to identify all managers whose salaries are over $80000, or to find all the books published by MIT Press, or to look for all the stellar observations made between particular dates. In all of these cases, a table is searched and particular columns are examined to find which entries (relations) will be returned for further processing. The second form uses data values to cluster data across multiple tables. Here, distinguishing features are often used to bring relations together. For instance, if I have a table that relates people’s names to their social security number, and another table that relates vehicle license numbers to the social security numbers of the vehicle’s owners, then I can use the social security numbers to create a new, amalgamated table that relates people’s names to the license numbers of their cars. Or, again, if one table maps product UPC codes to prices, and another maps UPC codes to supermarket sales, I can use the UPC codes to link prices to sales and find the total take at the checkout. Here, patterns in data values are used to link and associate records across different tables.
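Both forms of value-based association can be sketched in SQLite, following the people-and-vehicles example from the text (the sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, ssn TEXT)")
conn.execute("CREATE TABLE vehicles (plate TEXT, owner_ssn TEXT)")
conn.execute("INSERT INTO people VALUES ('Ada', '111'), ('Grace', '222')")
conn.execute("INSERT INTO vehicles VALUES ('7ABC123', '222')")

# Within a table: bring relations together according to the values
# stored in a column.
a_names = conn.execute("SELECT name FROM people WHERE name LIKE 'A%'").fetchall()

# Across tables: shared values (here, social security numbers) link
# relations in one table to relations in another, producing an
# amalgamated table of names and license plates.
rows = conn.execute(
    "SELECT people.name, vehicles.plate "
    "FROM people JOIN vehicles ON people.ssn = vehicles.owner_ssn"
).fetchall()
print(a_names)  # [('Ada',)]
print(rows)     # [('Grace', '7ABC123')]
```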
The attribute-value systems of most NoSQL databases, though, operate in a largely different manner. Here, we find two quite different forms of association.
The first is the association of a value with a data object. This is similar to the association of values with relations in relational databases, but it differs in a critical way – here, values belong to particular objects directly, rather than being instances of an abstract relationship (which is how they manifest themselves in the relational model, which, as we have seen, separates data from its generic description in a schema.) Keys and values (which are perhaps similar to column names and data values in the relational model) are closely associated with each other, while different key-value pairs are only loosely associated, even when they belong to the same base object. This alternative arrangement results in a different pattern of data distribution and consistency (see below), and affects the way that data might be moved around in a particular implementation (including how it might be migrated between different nodes or even different data centers.)
The second form of association in the attribute-value system describes how a base object is linked to another. In some systems, the value associated with a key can be a pointer to another object. This allows objects to be ‘chained’ together – connected in patterns that relate a first object to a second, and then a third, and then a fourth, and so on. This is reminiscent of the networked data model that we encountered earlier. Indeed, attribute-value systems are regularly used to implement such networks because of their other key difference from relational systems – the absence of the schema as an abstract description of data structures. Network structures map naturally and obviously onto the interaction patterns of web-based applications, including so-called social networking applications; they provide poor support, though, for analytic applications which need to summarize, sort, and aggregate information across data items. Once again, as in other cases, the question is not whether particular operations are possible or impossible, but whether they are efficient or inefficient, easy or difficult, or natural or unnatural within particular programming frameworks. In questions of associativity, as in other material considerations, the structural and notational properties of data stores involve commitments to particular models of processing and contexts of information system development and deployment that make visible the relationship between technological infrastructure and social contexts.
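The ‘chaining’ of objects through pointer-like values can be sketched with plain Python dictionaries standing in for an attribute-value store (the object identifiers and keys below are invented for illustration, not drawn from any particular system):

```python
# A minimal stand-in for an attribute-value store: each base object is a
# set of key/value pairs, and a value may name another object, acting as
# a 'pointer' that chains objects together into a network.
store = {
    "user:ada": {"name": "Ada", "follows": "user:ben"},
    "user:ben": {"name": "Ben", "follows": "user:cam"},
    "user:cam": {"name": "Cam", "follows": None},
}

def chain(store, start, key):
    """Follow 'pointer' values from object to object, relating a first
    object to a second, then a third, and so on."""
    names, obj_id = [], start
    while obj_id is not None:
        obj = store[obj_id]
        names.append(obj["name"])
        obj_id = obj[key]          # hop to the next object in the chain
    return names

print(chain(store, "user:ada", "follows"))  # ['Ada', 'Ben', 'Cam']
```

Note that traversing such a chain is easy, while an aggregate query (say, counting all users by city) requires visiting every object individually, which is the asymmetry the text describes between network-shaped interaction patterns and analytic workloads.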
Mackenzie 39 examines the centrality of multiplicity as a concern that databases manage. Within the materialist frame adopted here, though, a slightly different issue arises, one that concerns not just the production of unities within a database form but also the potential for a single conceptual datum to be represented multiple times. As databases are used to coordinate tasks and activities that are distributed in the world, as performance becomes a key consideration, and as computer systems are increasingly constructed as assemblies of many processing elements, database designers have come to place a good deal of reliance on replication – the creation of replicas of data items, which can be made available more speedily by being present in different places at once. Replication is especially useful for data that is read more frequently than it is updated. Updating replicated data involves considerably more work than updating data that is stored centrally, but for data that is to be read frequently, being in two places at once can be extremely helpful.
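The asymmetry between reading and updating replicated data can be made concrete in a toy sketch (the three-replica setup and the cost accounting here are illustrative assumptions, not a description of any particular system): a read can be served by any single replica, while an update must touch every replica to keep them in agreement.

```python
# Three replicas of one data item, each a copy held 'in a different place'.
replicas = [{"balance": 100} for _ in range(3)]

def read(replicas, key):
    # Any single replica will do; in practice, the nearest or fastest one.
    return replicas[0][key]

def write(replicas, key, value):
    # Every replica must be updated to keep the copies consistent.
    for r in replicas:
        r[key] = value
    return len(replicas)           # units of work: one per replica

assert read(replicas, "balance") == 100   # cheap: one replica consulted
work = write(replicas, "balance", 120)    # costly: all replicas touched
print(work)  # 3
```

This is why replication pays off for read-heavy data: the cost of every write is multiplied by the number of copies, while each read remains a single-copy operation.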
As we have seen, data models (such as the relational model, the network model, and the hierarchical model) are notational strategies by which aspects of the everyday world are encoded in computer systems. Replication, as a way that database systems attempt to optimize their performance, arises within the structure offered by a data model. What can be replicated, then, is not “the information about my bank account” or “the files relating to my book” – what is replicated are tables and relations, or whatever data objects make sense to the database, rather than to its users.
In exploring the problems of granularity and associativity, then, we found ourselves exploring the way that data objects become ‘live’ inside the system. Fundamentally, the concern is that the database is both a representational form and an effective form; that is, databases provide a notation that people use to represent and reason about the world, which is simultaneously an engineering artifact that directs and structures the operation of computer systems. The data model points in two directions, which is precisely why the materialities matter.
Multiplicity extends our previous concern with granularity to consider the way that different representational forms lend themselves to different implementation strategies, including different strategies of replication, and hence to different approaches to system performance. Here again we find ourselves in territory where ‘the database’ must be more than simply a formal construct; it is also a specific object with specific manifestations that make it effective for specific uses in specific contexts. Multiplicity – the ability to act coherently upon multiple manifestations of objects that, from another perspective, are ‘the same’ – reflects this concern with the particular, and the different constraints associated with different approaches to database form 40.
In particular, approaches to multiplicity speak to the opportunities for partition – that is, for a system to operate as multiple independent entities, either briefly or for an extended period. A simple case might be a database that continues to operate even when your computer is temporarily disconnected from the network (implying that there may be, for some time at least, replicas of data both on your own computer and on a server – replicas that may need to be resolved later.) Users of cloud services are familiar with the problems of synchronization that these imply (to be discussed in more detail in a moment.) More generally, what this suggests is that the granularity with which data is modeled is also entwined with the granularity of access and even of the digital entities that I encounter (whether my laptop appears to me as a device independent of the server or cloud infrastructure to which it connects.)
Multiplicity and the management of replicas are typically seen as a purely ‘internal’ feature of the database – an implementation detail that is irrelevant to, and often to be hidden from, other aspects of the system, its interface, and its function. It is seen, too, as independent of (and hence indescribable in terms of) the notational system of representation implied by the underlying data model. However, these material considerations ‘bleed through’ in interaction.
Multiplicity opens up the question of consistency, and consistency opens up the question of convergence. In distributed databases, convergence refers to the way that temporary inconsistencies, such as those that arise when part of the database has been updated while other parts have not, are eventually resolved; that is, how different paths eventually converge on a single, consistent outcome. We can think of this as the stabilization or normalization of diversity and disagreement; databases that converge are those that eventually resolve partial inconsistencies into a consistent whole.
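One common convergence strategy, offered here only as an illustrative sketch since the text does not commit to any particular mechanism, is ‘last writer wins’: each write carries a timestamp, and divergent replicas converge by keeping, for each key, the most recently written value. The replica contents below are hypothetical.

```python
def merge(a, b):
    """Resolve two divergent replicas into a single consistent outcome.
    Each value is a (timestamp, data) pair; the newest write wins."""
    merged = {}
    for key in a.keys() | b.keys():
        candidates = [r[key] for r in (a, b) if key in r]
        merged[key] = max(candidates, key=lambda v: v[0])
    return merged

# Two replicas updated independently while temporarily inconsistent:
replica_a = {"name": (1, "Jon"), "city": (5, "Irvine")}
replica_b = {"name": (7, "John"), "city": (2, "LA")}

merged = merge(replica_a, replica_b)
# Both replicas can now adopt the merged state, so different paths
# converge on one consistent whole: the newer 'John' and 'Irvine' win.
```

The point of the sketch is the shape of the process, not the particular rule: convergence means that temporary disagreement between replicas is eventually resolved into a single stabilized state.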
With respect to a materialist analysis, two aspects of convergence are particularly worth examining. The first is the question of bounded consistency and the relationship between databases and the software systems that use them; the second is the question of the temporalities of convergence.
‘Bounded consistency’ refers to the way that a particular system might be able to accommodate a certain degree of inconsistency. For instance, in a banking system, absolute consistency is typically required for account balances, but minor deviations in the spelling of a customer’s name might be tolerable, at least for short periods. So, if a customer conducts a money transfer at an ATM, the details must be immediately and consistently available within the rest of the system; but if he talks to a teller to correct a spelling mistake, it may be sufficient that this information makes its way to the central system before the end of the business day.
A software system often incorporates both a database system for storage and some application-specific components that manage the particular details of that application. For example, a computer game might rely on a database that records the details of objects and places, and an application-specific component that implements the rules of the game. When we speak of bounded consistency, we are often talking about the relationship between these two components. To what extent can inconsistency in the database component be tolerated by the application logic? Where, then, does the boundary between those components fall, and how rigorous is the separation between them?
Traditional relational database systems, with their transactional semantics, embody one answer to these questions. Transactional semantics guarantee that no inconsistency is ever ‘visible’ to components outside the database (including to other pieces of software). Users or software clients request that the database perform some operation, and the database carries this operation out while guaranteeing observable consistency. In other words, the guarantee of consistency is also, inter alia, a commitment to a separation between the database and other components. The communication channel between the components is narrow. When we begin to allow bounded consistency, we also ‘open up’ aspects of the operation of the database to external scrutiny, and widen the channel of communication. This has implications for other aspects of the fabric of the software system, including choices of programming languages, production environments, and the physical distribution of software components across sites; the more tightly coupled are application logic and database function, the more constrained are the choices.
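The transactional guarantee can be illustrated in miniature with Python’s sqlite3 module (the accounts table and the simulated failure are hypothetical): either the whole transfer commits, or a rollback restores the prior state, so no intermediate inconsistency is ever visible outside the database.

```python
import sqlite3

# A hypothetical accounts table with two balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # Debit one account; the matching credit should follow, and on
    # success the pair would be sealed with conn.commit().
    conn.execute(
        "UPDATE accounts SET balance = balance - 50 WHERE owner = 'alice'")
    # Simulated crash between the debit and the credit:
    raise RuntimeError("failure mid-transfer")
except RuntimeError:
    conn.rollback()  # undo the half-finished transfer

# Observers see only the consistent, pre-transaction state.
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

The half-finished debit never escapes the database boundary; this is the narrow channel of communication that transactional semantics commit to.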
The temporalities of convergence – that is, the temporal rhythms and dynamics of data consistency – further illustrate the entwining of materiality and practice in databases. At the lowest level, we must pay attention to propagation times and network delays that are consequences of the distribution of data across multiple nodes in a data center, or between different data centers (or even between different components in an apparently ‘monolithic’ computer, which is always, of course, constructed from many components, including a hierarchy of storage systems from registers to disks, with different temporal characteristics). A little more abstractly, we must pay attention to the computational complexity of the algorithms that will resolve conflicts, which themselves depend on the ways that the data is structured. But perhaps more important than the simple question of “how long will it take?” is the question of “how do the temporalities of convergence relate to the temporalities of use?” For example, while applications that serve interactive users are typically ‘slow’ in comparison to computer performance (since most systems can perform thousands of operations in the literal blink of a human eye), nonetheless we can observe radically different temporal dynamics when a database is serving up tagged photographs in a social networking application and when it is serving up objects to be rendered in three dimensions at thirty frames per second in a multi-player game.
Further, the temporalities of database resolution mean that sometimes, from a user’s perspective, time flows backwards. That is, in order to achieve convergence, it may be necessary for a system to undo conflicting operations that, from a user’s perspective, have already been completed. In other words, convergence is not simply a question of resolving inconsistency, but of doing so in a way that makes apparent sense, and that sense might be subject to a temporal logic on the part of human users; similarly, the temporality of this process must be compatible with the dynamics of human experience or other forms of external action within the relevant domain.
Like the other properties that have come before, then, problems of convergence shape the choices available in the ways that different elements are assembled to form a system and highlight the constraints that structure them. Temporally and structurally, convergence highlights the nature of coupling between different components, including the coupling between digital systems and the world beyond. It casts doubt, then, upon any simple reading of “how digital systems operate” or “what databases do” by suggesting that particular manifestations of the digital, shaped by and giving rise to material arrangements, must be understood in their specificities.
We could discuss other properties. We could examine, for instance, the extent to which different data models support different applications and uses of the same database or data collection, or how committed they are to particular applications. We could examine the difference between static data and responsive data achieved through the use of triggers. We could examine the consequences of particular representational strategies for fault-tolerance and error-detection, for the transportability of data between different systems, and for interoperation with different technologies. However, the broader question is, what does it mean to look at these kinds of properties, or those that we have already examined, as material considerations?
Materialist conceptions of information have tended to focus on the ‘brute materiality’ of digital technologies, as manifest in the large-scale data centers that make possible the ineffable metaphor of the ‘cloud’ 41, the high-speed networks that shape not just flows of data but flows of capital and expertise 42, or the projection of digital infrastructures into urban spaces 43. The project here is somewhat different. It is guided by two criteria. First, it identifies material properties as those aspects of the fabric of information systems that constrain, shape, guide, and resist patterns of engagement and use. Second, it argues for the importance of material specificities, uncovered not least through a consideration of the relationship between design alternatives, including historical ones. So questions of, for instance, granularity or convergence matter because they are elements of the technological fabric of information processing that shape the emergence of particular forms of both technology and technological practice. In his account of design and design education, Donald Schön 44 talks of design as “a reflexive conversation with materials.” This examination of material properties focuses on those that are elements of that conversation. What we begin to glimpse in this exploration, then, is a sociomaterial account not just of computer systems but of computational infrastructures and their ramifications.
On the one hand, we can see a mutually constitutive relationship at work between hardware platforms and data representation systems. The essence of computational representations is that they are effective, that is, that they can be operated and operated upon so as to achieve particular results. This effectiveness depends upon the relationship between the representations and the material tools through which those representations are manipulated, that is, both computational engines (computers and processors) and storage engines (disks and memory architectures.) In turn, their structures are also materially realized; we need to pay attention to the material considerations in building computational resources, which include power, air conditioning, size, structure, and distribution. The so-called Von Neumann architecture – the basic design principle of computer systems since the 1950s – separates processing elements that operate upon data from storage elements that record it, in turn placing considerable emphasis on the linkages between them. Similarly, the question of ‘interconnects’ – how different elements of computer systems are linked together – has arguably become the central question of high-performance computer design over the last several decades. What we see here is that the considerations that shape these architectural elements – including such ‘brute material’ elements as the propagation of electrical signals over distance and the heat dissipation requirements of densely packed computational elements – are intimately entwined with the data representational schemes that can be achieved.
On the other hand, we also see how data representational schemes – with all their constraints – play a similarly mutually-constitutive role in the applications to which they provide services. The capacities of effective representations and data storage facilities suggest and support particular kinds of application functions, while at the same time, application scenarios contextualize and drive the development of infrastructures and the packaging of infrastructure services as toolkits and frameworks. In particular, the rise of Web 2.0 has been associated with a series of new forms of data storage and manipulation service that eschew the ACID properties of traditional high-performance relational databases in favor of alternatives that prioritize responsiveness over consistency in the face of alternative implementation infrastructures.
What does it mean to take the materialities of information seriously? At the outset, four considerations were laid out. It requires us, first, to examine specific information technologies as historically specific forms; second, to examine the entwining of hardware, software, representations, encodings, algorithms, programming practices, principles, and languages, and to incorporate too the practices that bring these together; third, to examine these topics at multiple scales and to consider how those scales are related; and fourth, to do so comparatively, within a landscape of possible alternatives, realized and unrealized. Taking our lead from Schön’s “reflexive conversation with materials,” we think of the materials of information as those things with which, as members of contemporary society, we find ourselves in this reflexive conversation. Again, this is an approach that rejects both an opposition between digital and material and a parallelism between them, seeking instead a duality that takes the digital to be always and already material and, perhaps more importantly, takes the practices of digitality to be material engagements.
Part of the struggle in this enterprise is to find an appropriate point of analytic purchase and in particular to be able to distinguish between the different levels at which one might operate. One might examine the mathematical foundations and the connection between formalism and practice, although that runs the risk of neglecting the very approximate ways in which mathematical descriptions and engineering objects are connected. A Turing Machine, after all, may be the underlying abstraction for computational devices, but contemporary computers are all both more than Turing Machines (they have input and output) and less than Turing Machines (no infinite storage). We might eschew the mathematics for engineering and trace the electrons and signals as they flow through the machine, although that runs the risk of losing sight of the cultural practices of digitality. My own approach here is not meant to displace either of these but to complement them by examining the material manifestation of programs, processes, and representations as my starting point.
It is with this in mind, then, that we have taken Manovich’s argument about the database as a cultural form and asked, what sort of database is that? The contemporary shifts in database infrastructures highlight the shifting materialities of information infrastructures and knowledge practice. A confluence of broader concerns – from data-driven ‘audit culture’ as a form of accountability-driven organizational management (Strathern 2000), to contemporary interests in ‘big data’ and the opportunities around statistical machine learning – place the database increasingly at the center of many areas of personal, organizational, and civic life. In much the same way as we should be wary of the claims to objectivity in the term ‘raw data’ 45, we should be alert to the implications of database infrastructures as representational tools. The materialities of database technologies shape the kinds of ‘databasing’ that can be done and imagined. If databases make the world, they do so not as ineffable abstractions but in material ways.
This paper forms part of a larger project with Melissa Mazmanian whose contributions are central. Many ideas here have developed from conversations with Geof Bowker, Jean-Francois Blanchette, and Lev Manovich, and with audiences at the University of Melbourne, UCLA, and the IT University of Copenhagen. I am grateful to the reviewers for Computational Culture whose thoughtful readings provided useful guidance for both content and form. This work has been supported in part by the National Science Foundation under awards 0917401, 0968616, and 1025761, and by the Intel Science and Technology Center for Social Computing.
Buechley, L., M. Eisenberg, J. Catchen, and A. Crockett. ‘The LilyPad Arduino: Using Computational Textiles to Investigate Engagement, Aesthetics, and Diversity in Computer Science Education.’ In Proceedings of the ACM Conference on Human factors in computing systems (New York, NY: ACM, 2008), 423-432.
Blanchette, J.-F. ‘A Material History of Bits.’ Journal of the American Society for Information Science and Technology 62, no. 6 (2011), 1024-1057.
Bolter, J. and R. Grusin. Remediation: Understanding new media. Cambridge, MA: MIT Press, 1999.
Bowker, G. ‘Biodiversity Datadiversity.’ Social Studies of Science 30, no. 5 (2000), 643-683.
Campbell-Kelly, R., M. Croarken, R. Flood, and E. Robson. The History of Mathematical Tables: From Sumer to Spreadsheets. Oxford, UK: Oxford University Press, 2003.
Castelle, M. ‘Relational and Non-Relational Models in the Entextualization of Bureaucracy.’ Computational Culture 3 (2013).
Cattell, R. ‘Scalable SQL and NoSQL Data Stores.’ ACM SIGMOD Record 39, no. 4 (2010), 12-27.
Chang, F., J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. ‘Bigtable: A Distributed Storage System for Structured Data.’ ACM Transactions on Computing Systems 26, no. 2 (2008), 4:1-4:26.
Clanchy, M. From Memory to Written Record: England, 1066-1307. London: Hodder and Stoughton, 1979.
Coase, R. ‘The Nature of the Firm.’ Economica 4, no. 16 (1937), 386-405.
Codd, E. ‘A Relational Model of Data for Large Shared Data Banks.’ Communications of the ACM 13, no. 6 (1970), 377-387.
Corbett, J., J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. ‘Spanner: Google’s Globally-Distributed Database.’ In Proceedings of the Conference on Operating System Design and Implementation (Berkeley, CA: USENIX, 2012), 251-264.
Dean, J. and S. Ghemawat, ‘Map-Reduce: Simplified Data Processing on Large Clusters.’ In Proceedings of the Conference on Operating Systems Design and Implementation (Berkeley, CA: USENIX, 2004), 137-149.
DeCandia, G., D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. ‘Dynamo: Amazon’s Highly-Available Key-Value Store.’ In Proceedings of the Symposium on Operating System Principles (New York, NY: ACM, 2007), 205-220.
Dourish, P. and G. Bell. Divining a Digital Future: Mess and Mythology in Ubiquitous Computing. Cambridge, MA: MIT Press, 2011.
Dourish, P. and M. Mazmanian. ‘Media as Material: Information Representations as Material Foundations for Organizational Practice.’ In How Matter Matters: Objects, Artifacts, and Materiality in Organization Studies, eds. P. Carlile, D. Nicolini, A. Langley, and H. Tsoukas. Oxford: Oxford University Press, 2013.
Escriva, R., B. Wong, and E. Sirer. ‘Hyperdex: a distributed, searchable key-value store.’ In Proceedings of the Conference on Applications, technologies, architectures, and protocols for computer communication (New York, NY: ACM, 2012), 25-36.
Farman, J. ‘The Materiality of Locative Media: On the Invisible Infrastructure of Mobile Networks.’ In The Routledge Handbook of Mobilities, eds. P. Adey, D. Bissell, K. Hannam, P. Merriman, and M. Sheller. New York, NY: Routledge, 2013.
Fuller, M. Software Studies: A Lexicon. Cambridge, MA: MIT Press, 2008.
Ghemawat, S., H. Gobioff, and S.-T. Leung. ‘The Google File System.’ In Proceedings of the Symposium on Operating System Principles (New York, NY: ACM, 2003), 29-43.
Gitelman, L. (ed). ‘Raw Data’ Is An Oxymoron. Cambridge, MA: MIT Press, 2013.
Goody, J. The Domestication of the Savage Mind. Cambridge: Cambridge University Press, 1977.
Graham, S. and S. Marvin. Splintering Urbanism: Networked Infrastructures, Technological Mobilities, and the Urban Condition. London: Routledge, 2001.
Gray, J. ‘Notes on data base operating systems.’ In Lecture Notes in Computer Science 60, eds. R. Bayer, R. N. Graham, and G. Seegmueller, Berlin: Springer-Verlag, 1978.
Gray, J. ‘The transaction concept: Virtues and limitations.’ In Proceedings of the 7th International Conference on Very Large Database Systems (New York, NY: ACM, 1981), 144-154.
Ishii, H. and B. Ullmer. ‘Tangible Bits: Towards Seamless Interfaces between People, Bits, and Atoms.’ In Proceedings of the Conference on Human Factors in Computing Systems (New York, NY: ACM, 1997), 234-241.
Ishii, H., D. Lakatos, L. Bonanni, and J.-B. Labrune. ‘Radical Atoms: Beyond Tangible Bits, Towards Transformable Materials.’ Interactions 19, no. 1 (2012), 38-51.
Kirschenbaum, M. Mechanisms: New Media and the Forensic Imagination. Cambridge, MA: MIT Press, 2008.
Leavitt, N. ‘Will NoSQL Databases Live Up to Their Promise?’ IEEE Computer 43, no. 2 (2010), 12-14.
Lim, H., B. Fan, D. Andersen, and M. Kaminsky. ‘Silt: a memory-efficient, high-performance key-value store.’ In Proceedings of the Symposium on Operating Systems Principles (New York, NY: ACM, 2011), 1-13.
Mackenzie, A. ‘More Parts than Elements: How Databases Multiply.’ Environment and Planning D: Society and Space 30 (2012), 335-350.
Manovich, L. The Language of New Media. Cambridge, MA: MIT Press, 2001.
Manovich, L. Software Takes Command. New York, NY: Bloomsbury, 2013.
Montfort, N. and I. Bogost. Racing the Beam: The Atari Video Computer System. Cambridge, MA: MIT Press, 2009.
Montfort, N., P. Baudoin, J. Bell, I. Bogost, J. Douglass, M. Marino, M. Mateas, C. Reas, M. Sample, and N. Vawter. 10 PRINT CHR$(205.5+RND(1)); : GOTO 10. Cambridge, MA: MIT Press, 2012.
Parks, L. Cultures in Orbit: Satellites and the Televisual. Durham, NC: Duke University Press, 2005.
Rosner, D. ‘The Material Practices of Collaboration.’ In Proceedings of the Conference on Computer-Supported Cooperative Work (New York, NY: ACM, 2012), 1155-1164.
Schön, D. The Reflective Practitioner: How Professionals Think In Action. New York, NY: Basic Books, 1984.
Starosielski, N. ‘Underwater Flow.’ Flow 15, no 1, 2011. Flowtv.org.
Steele, G. and R. Gabriel. ‘The Evolution of Lisp.’ In Proceedings of the Conference on the History of Programming Languages (New York, NY: ACM, 1993), 231-270.
Strathern, M. Audit Cultures: Anthropological Studies in Accountability, Ethics and the Academy. London: Routledge, 2000.
Stonebraker, M. ‘SQL Databases vs NoSQL Databases.’ Communications of the ACM 53, no. 4 (2010), 10-11.
- Castelle, M. ‘Relational and Non-Relational Models in the Entextualization of Bureaucracy.’ Computational Culture 3 (2013). ↩
- Goody, J. The Domestication of the Savage Mind. Cambridge: Cambridge University Press, 1977. ↩
- Clanchy, M. From Memory to Written Record: England, 1066-1307. London: Hodder and Stoughton, 1979. ↩
- Manovich, L. The Language of New Media. Cambridge, MA: MIT Press, 2001. ↩
- Bolter, J. and R. Grusin. Remediation: Understanding new media. Cambridge, MA: MIT Press, 1999. ↩
- Campbell-Kelly, R., M. Croarken, R. Flood, and E. Robson. The History of Mathematical Tables: From Sumer to Spreadsheets. Oxford, UK: Oxford University Press, 2003. ↩
- Manovich,L. Software Takes Command. New York, NY: Bloomsbury, 2013. ↩
- Kirschenbaum, M. Mechanisms: New Media and the Forensic Imagination. Cambridge, MA: MIT Press, 2008. ↩
- Mackenzie, A. ‘More Parts than Elements: How Databases Multiply.’ Environment and Planning D: Society and Space 30 (2012), 335-350. ↩
- Montford, N. and I. Bogost. Racing the Beam: The Atari Video Computer System. Cambridge, MA: MIT Press, 2009. ↩
- Ishii, H. and B. Ullmer. ‘Tangible Bits: Towards Seamless Interfaces between People, Bits, and Atoms.’ In Proceedings of the Conference on Human Factors in Computing Systems (New York, NY: ACM, 1997), 234-241. ↩
- Dourish, P. and G. Bell. Divining a Digital Future: Mess and Mythology in Ubiquitous Computing. Cambridge, MA: MIT Press, 2011. ↩
- Buechley, L., M. Eisenberg, J. Catchen, and A. Crockett. ‘The LilyPad Arduino: Using Computational Textiles to Investigate Engagement, Aesthetics, and Diversity in Computer Science Education.’ In Proceedings of the ACM Conference on Human factors in computing systems (New York, NY: ACM, 2008), 423-432. See also Rosner, D. ‘The Material Practices of Collaboration.’ In Proceedings of the Conference on Computer-Supported Cooperative Work (New York, NY: ACM, 2012), 1155-1164. ↩
- Ishii, H., D. Lakatos, L. Bonanni, and J.-B. Pabrune. ‘Radical Atoms: Beyond Tangible Bits, Towards Transformable Materials.’ Interactions 19, no. 1 (2012), 38-51. ↩
- See, for example. Dourish, P. and M. Mazmanian. ‘Media as Material: Information Representations as Material Foundations for Organizational Practice.’ In How Matter Matters: Objects, Artifacts, and Materiality in Organization Studies, eds. P., Carlile, D. Nicolini, A. Langley, and H. Tsoukas (eds), Oxford: Oxford University Press, 2013. ↩
- See, for example, Parks, L. Cultures in Orbit: Satellites and the Televisual. Durham, NC: Duke University Press, 2005. ↩
- See, for example, Bowker, G. ‘Biodiversity Datadiversity.’ Social Studies of Science 30, no. 5 (2000), 643-683. ↩
- See, for example, Fuller, M. Software Studies: A Lexicon. Cambridge, MA: MIT Press, 2008. ↩
- Kirschenbaum, Mechanisms. ↩
- Blanchette, J.-F. ‘A Material History of Bits.’ Journal of the American Society for Information Science and Technology 62, no. 6 (2011), 1024-1057. ↩
- Castelle, ‘Relational and Non-Relational Models.’ ↩
- Montford and Bogost. Racing the Beam. ↩
- Montford, N., P. Baudoin, J. Bell, I. Bogost, J. Douglass, M. Marino, M. Mateas, C. Reas, M. Sample, and N. Vawter. 10 PRINT CHR$(205.5+RND(1)); : GOTO 10. Cambridge, MA: MIT Press. 2012. ↩
- Dourish and Mazmanian. ‘Media as Material.’ ↩
- Codd, E. ‘A Relational Model of Data for Large Shared Data Banks.’ Communications of the ACM 13, no. 6 (1970), 377-387. ↩
- Steele, G. and R. Gabriel. ‘The Evolution of Lisp.’ In Proceedings of the Conference on the History of Programming Languages (New York, NY: ACM, 1993), 231-270. ↩
- Gray, J. ‘Notes on data base operating systems.’ In Lecture Notes on Computer Science 60, eds. R. Bayer, R. N. Graham, and G. Seegmueller, Berlin: Springer-Verlag, 1978. Also Gray, J. 1981. ‘The transaction concept: Virtues and limitations’. In Proceedings of the 7th International Conference on Very Large Database Systems (New York, NY: ACM, 1981), 144-154. ↩
- Castelle, ‘Relational and Non-Relational Models.’ ↩
- Coase, R. ‘The Nature of the Firm.’ Economica 4, no. 16 (1937), 386-405. ↩
- Dean, J. and S. Ghemawat. ‘MapReduce: Simplified Data Processing on Large Clusters.’ In Proceedings of the Conference on Operating Systems Design and Implementation (Berkeley, CA: USENIX, 2004), 137-149. ↩
- Ghemawat, S., H. Gobioff, and S.-T. Leung. ‘The Google File System.’ In Proceedings of the Symposium on Operating System Principles (New York, NY: ACM, 2003), 29-43. ↩
- Corbett, J., J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. ‘Spanner: Google’s Globally-Distributed Database.’ In Proceedings of the Conference on Operating System Design and Implementation (Berkeley, CA: USENIX, 2012), 251-264. ↩
- Lim, H., B. Fan, D. Andersen, and M. Kaminsky. ‘SILT: A Memory-Efficient, High-Performance Key-Value Store.’ In Proceedings of the Symposium on Operating Systems Principles (New York, NY: ACM, 2011), 1-13. ↩
- Escriva, R., B. Wong, and E. Sirer. ‘HyperDex: A Distributed, Searchable Key-Value Store.’ In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (New York, NY: ACM, 2012), 25-36. ↩
- See, for example: Cattell, R. ‘Scalable SQL and NoSQL Data Stores.’ ACM SIGMOD Record 39, no. 4 (2010), 12-27; Leavitt, N. ‘Will NoSQL Databases Live Up to Their Promise?’ IEEE Computer 43, no. 2 (2010), 12-14; and Stonebraker, M. ‘SQL Databases vs. NoSQL Databases.’ Communications of the ACM 53, no. 4 (2010), 10-11. ↩
- DeCandia, G., D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. ‘Dynamo: Amazon’s Highly-Available Key-Value Store.’ In Proceedings of the Symposium on Operating System Principles (New York, NY: ACM, 2007), 205-220. ↩
- Chang, F., J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. ‘Bigtable: A Distributed Storage System for Structured Data.’ ACM Transactions on Computer Systems 26, no. 2 (2008), 4:1-4:26. ↩
- Corbett et al., ‘Spanner: Google’s Globally-Distributed Database.’ ↩
- Mackenzie, ‘More Parts than Elements.’ ↩
- Ibid. ↩
- Farman, J. ‘The Materiality of Locative Media: On the Invisible Infrastructure of Mobile Networks.’ In The Routledge Handbook of Mobilities, eds. P. Adey, D. Bissell, K. Hannam, P. Merriman, and M. Sheller, New York, NY: Routledge, 2013. ↩
- Starosielski, N. ‘Underwater Flow.’ Flow 15, no. 1 (2011). Flowtv.org. ↩
- Graham, S. and S. Marvin. Splintering Urbanism: Networked Infrastructures, Technological Mobilities, and the Urban Condition. London: Routledge, 2001. ↩
- Schön, D. The Reflective Practitioner: How Professionals Think In Action. New York, NY: Basic Books, 1984. ↩
- Gitelman, L. (ed). ‘Raw Data’ Is An Oxymoron. Cambridge, MA: MIT Press, 2013. ↩