Amazon recently announced a new business model for ebooks. Essentially, they’re planning to sell books by the page. It’s an interesting business model, and I suspect that there’ll be much written about its effects on authors and author income. However, I’d like to spend some time thinking about their business model from the perspectives of systems design, privacy, and the general freedom to read and learn. This is more a reader-centric (or consumer-centric) point of view, but there are lots of digital consumers out there, so I think these views are worth discussing.
Let’s start with the basic business requirements. As I see it, there are two: (1) users will be charged for each page that they read (i.e., turning to a new page in a book is a chargeable event), and (2) users will be charged at most once per page. To fulfill the first requirement, Amazon needs to know what book you’re reading (you must have licensed it from Amazon), they need to know who you are (you needed an Amazon account in order to license the book), and they need to know whenever you turn the page. It’s easy to model this type of record keeping in the pen and paper world: picture a sheet of paper with your name written across the top, along with the title of the book you’re reading. Below your name and the book title are a series of numbered checkboxes, one for each page in the book. Each time you turn to a new page, check off the box with the corresponding page number. When you’re done reading, put the paper away for safe keeping, until you pick the book up again.
As an exercise, try doing this the next time you read a book or magazine. Write down every page you read, and to make it more interesting, write the date and time next to each page number, along with where you were at the time. Multiply this by millions of people and dozens of books, and you’ve got big data. At that scale you really need to store the data in electronic form, but that doesn’t change the basic nature of the data itself.
The digital version of this record keeping system involves database tables and records. The main record format probably looks like this:
(book_id, customer_id, page_number, timestamp)
book_id is something to identify the book you’re reading (think ISBN, or the moral equivalent thereof). It’s something you can use to (unambiguously) look up information about a book; for example, the title, author, and payment rate. customer_id identifies you (think Amazon’s equivalent of a driver’s license number). Amazon needs to know whose account to charge, and whose credit card to whack at the end of the month (or however often Amazon chooses to whack credit cards for per-page transactions). The other two fields should be self-explanatory.
Thus, each time you turn the virtual page, your Kindle sends a little bundle of information back to Amazon’s servers. On the receiving side, Amazon puts this information through an algorithm — a computer’s decision making process. The decision process probably looks something like this:
    receive (book_id, customer_id, page_number, timestamp) from Kindle user
    if received data is not valid
        discard data
    endif
    if (book_id, customer_id, page_number) combination has not been seen before
        record (book_id, customer_id, page_number, timestamp)
        charge customer's account for new page read
    endif
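For readers who prefer something they can run, here is a minimal Python sketch of the same decision process. Everything in it is my own invention for illustration: the names PageTurn and PageBilling, the in-memory set of licenses, and the per-book page counts are assumptions, not anything Amazon has described. I've also assumed that an invalid record is simply dropped and never reaches the charging step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageTurn:
    """One record sent by the device (hypothetical layout)."""
    book_id: str
    customer_id: str
    page_number: int
    timestamp: float  # seconds since the epoch


class PageBilling:
    def __init__(self, licenses, page_counts):
        # licenses: set of (customer_id, book_id) pairs that have been licensed
        # page_counts: dict mapping book_id -> number of pages in the book
        self.licenses = licenses
        self.page_counts = page_counts
        self.seen = {}     # (book_id, customer_id, page_number) -> first timestamp
        self.charges = {}  # customer_id -> number of pages billed so far

    def is_valid(self, rec):
        # Bare-minimum validation: known license, page within the book.
        if (rec.customer_id, rec.book_id) not in self.licenses:
            return False
        return 1 <= rec.page_number <= self.page_counts.get(rec.book_id, 0)

    def receive(self, rec):
        """Process one record; return True if it produced a charge."""
        if not self.is_valid(rec):
            return False  # discard data
        key = (rec.book_id, rec.customer_id, rec.page_number)
        if key in self.seen:
            return False  # page already read: no second charge
        self.seen[key] = rec.timestamp
        self.charges[rec.customer_id] = self.charges.get(rec.customer_id, 0) + 1
        return True
```

Feeding the same page to receive() twice charges the customer once; the second call returns False and leaves the charge count alone.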
This kind of notation is called pseudocode. It’s a convenient way to express program logic, without going into all the nitty-gritty programming details. There are two decision processes here: “is this data valid?”, and “is this a chargeable event?”. Let’s examine these decisions one at a time.
Data validation is the unglamorous, boring drudgery of software development (which may be why bad data is the source of so many software bugs). At the very least, Amazon needs to make sure that ‘book_id’ represents a valid book, that ‘customer_id’ represents a valid customer, and that the given customer has previously licensed the given book. That’s a bare minimum, but there are harder problems. We’ve all seen the New Yorker cartoon “On the Internet, nobody knows you’re a dog”. That applies here too: once you have a valid book_id and customer_id, how can you be sure that a ‘new page read’ event really came from the given customer (as opposed to, say, an unscrupulous author trying to inflate their book income)? Public key cryptography is one way to solve this problem: a Kindle could have a private key, and use that to sign each record it sends to Amazon; Amazon would then use the Kindle’s public key to verify that the data was authentic. I don’t know if Amazon uses this technique, but I’m hoping they do.
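To make the sign-then-verify flow concrete, here is a small Python sketch. One loud caveat: Python's standard library doesn't include public key signatures, so I've swapped in an HMAC, which uses a single secret shared by the device and the server. That is a weaker design than the public key scheme described above (the server holds the same key as the device), but the shape of the exchange is the same: the device attaches a signature to each record, and the server rejects any record whose signature doesn't check out. The function names and the JSON encoding are my own assumptions.

```python
import hashlib
import hmac
import json


def encode_record(book_id, customer_id, page_number, timestamp):
    # Canonical byte encoding, so device and server sign exactly the same bytes.
    return json.dumps(
        [book_id, customer_id, page_number, timestamp], separators=(",", ":")
    ).encode("utf-8")


def sign_record(device_key, record_bytes):
    # Device side: compute a keyed digest over the record.
    return hmac.new(device_key, record_bytes, hashlib.sha256).hexdigest()


def verify_record(device_key, record_bytes, signature):
    # Server side: recompute the digest and compare in constant time.
    expected = sign_record(device_key, record_bytes)
    return hmac.compare_digest(expected, signature)
```

A record that has been altered after signing (say, a different page number) fails verification, as does a record signed with some other device's key.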
Let’s look at another scenario: suppose Amazon receives a flurry of records — several per second — faster than anyone could actually read or skim through a book. They might need another validation rule that says “there must be at least N seconds between records from any individual customer_id”.
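A rule like that takes only a few lines. The sketch below is one way it might look; the class name MinIntervalRule and the choice to keep the last accepted timestamp in a dictionary are my assumptions, and the value of N is left to the caller.

```python
class MinIntervalRule:
    """Reject records that arrive less than min_seconds after the previous
    accepted record from the same customer."""

    def __init__(self, min_seconds):
        self.min_seconds = min_seconds
        self.last_accepted = {}  # customer_id -> timestamp of last accepted record

    def accept(self, customer_id, timestamp):
        last = self.last_accepted.get(customer_id)
        if last is not None and timestamp - last < self.min_seconds:
            return False  # too soon after the previous page turn
        self.last_accepted[customer_id] = timestamp
        return True
```

Note that a rejected record doesn't reset the clock, so a burst of page turns can't lock a customer out indefinitely. Whether that is the right policy is a judgment call.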
I could think of more validation routines, but I won’t go into them here. Like I said, validation is messy business.
Going back to the pseudocode, the goal behind the second condition is to ensure that customers are charged at most once per page. Unlike validation, this decision process is a piece of cake. Amazon keeps a record of every page that every customer reads, ever. When a new record arrives, either Amazon already has that (book_id, customer_id, page_number) combination (and the customer re-reads the page for free), or they don’t (and the customer’s account is charged).
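In database terms, "at most once per page" is just a uniqueness constraint. Here's a sketch using SQLite from Python's standard library; the table name page_reads and the function record_page are made up for illustration, and I have no idea what database Amazon actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE page_reads (
        book_id     TEXT NOT NULL,
        customer_id TEXT NOT NULL,
        page_number INTEGER NOT NULL,
        timestamp   REAL NOT NULL,
        PRIMARY KEY (book_id, customer_id, page_number)
    )
    """
)


def record_page(conn, book_id, customer_id, page_number, timestamp):
    """Insert the record; return True only if the page is new (chargeable)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO page_reads VALUES (?, ?, ?, ?)",
        (book_id, customer_id, page_number, timestamp),
    )
    conn.commit()
    # rowcount is 1 when a row was inserted, 0 when the primary key
    # already existed and the insert was ignored.
    return cur.rowcount == 1
```

The primary key does the work: a second insert for the same book, customer, and page is silently ignored, so the caller knows not to charge again. It also means the row has to stay in the table for as long as re-reads are meant to be free.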
This segues nicely into our next topics: data privacy, and the freedom to read. Let’s turn away from Amazon for a moment, and consider another institution that allows people to read books — your local library. Libraries have record-keeping needs, but they tend to be very different than Amazon’s, particularly in the area of data retention. My library needs to know what books I’ve checked out, and what fines I owe. A librarian friend tells me that this is the general rule of thumb for retaining borrowing information: once the book is returned and any fines have been paid, there’s no need to retain the borrowing information, and the library deletes it. Contrast this with Amazon’s new model, which essentially requires them to keep a record of every page of every book you read … forever.
When the USA PATRIOT Act was first introduced, the American Library Association (ALA) was one of its strongest critics. They were afraid that the business records provision (aka Section 215, aka the library provision) would require libraries to turn over patrons’ borrowing records. From the ALA’s point of view, everyone has the freedom to read, to learn, and to draw their own conclusions. In today’s world, I’m not sure we can take this right for granted. People have been questioned over Google searches, and have come under suspicion for their reading habits. Perhaps I’d like to read the Communist Manifesto, the Qur’an, The Anarchist Cookbook, A People’s History of the United States, or some George Orwell. Or (gasp) a couple of Kurt Vonnegut’s novels. Because hey, they’ve got pictures of beavers.
Of course, this would make Amazon’s database a gold mine for those who’d seek to limit what we can read, what ideas we can be exposed to, and what things we’re able to learn. The pay-per-page model lays out this information in incredible detail. In other words, one would not only know that you were looking at Kurt Vonnegut’s beavers; they’d know exactly how much time you spent looking at Kurt Vonnegut’s beavers. Amazon’s databases haven’t wound up on pastebin yet, but I won’t be surprised to see that happen eventually.
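To see how little work the "how much time" part takes, consider this sketch. It assumes someone has the (page_number, timestamp) pairs for one customer and one book, and it treats the gap between consecutive page turns as time spent on the earlier page. The function name time_per_page is hypothetical, and the result is only an upper bound, since a coffee break looks the same as careful study.

```python
def time_per_page(records):
    """records: iterable of (page_number, timestamp) for one customer and book.

    Returns {page_number: seconds until the next page turn}. The last page
    in the sequence has no following record, so it gets no entry of its own.
    """
    ordered = sorted(records, key=lambda r: r[1])  # order by timestamp
    dwell = {}
    for (page, start), (_, end) in zip(ordered, ordered[1:]):
        dwell[page] = dwell.get(page, 0) + (end - start)
    return dwell
```

A dozen lines of code, and the reading log becomes a profile of what held your attention.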
I’m using this article to have a little fun looking at Amazon’s new pay-per-page model, some of the technical requirements it implies, and a couple of related issues. I cannot say how Amazon plans to implement this new business model; rather, my goal is to give a high level description of how an implementation might be done, and provoke some discussion on what the pay-per-page model implies.