Jim Starkey

I sat down with Jim Starkey, inventor of the "BLOB" data type, which was used in databases in the late 80s and in the 90s to handle unstructured data. A BLOB field was a variable-length sequence of bytes. A database record that contained a field of the BLOB data type would also include other fields that would provide the application with more information about what was in the BLOB field.

For example, if the field contained a scanned document, other fields in the record would record certain information about the contents of the document to allow applications to search and retrieve the right document. BLOBs were used primarily for images and audio. The data type solved many of the problems we are still facing with unstructured data.

I remember using BLOBs in 1988 for document imaging, and I thought this new data type was quite a clever way of handling what we now refer to as unstructured data. How did you come up with the idea for the BLOB?
Jim Starkey: It came about when I was working for DEC around 1982 or 1983. The funny thing is, there was just an amazing amount of resistance to the idea inside DEC. You know, the standard arguments, like no customer ever asked for BLOBs, and they never lost a sale because we didn't have BLOBs, and none of our competitors have BLOBs. Most people were asking, "Why would anybody ever want to put a document in the database?"

But finally, after lots of discussion, it was decided that DEC database products would have this new data type, but they wouldn't be called BLOBs. Instead, DEC called them "segmented strings".

I left DEC and started with another company, and with that company we had our first deal with Apollo Computer. Apollo really loved BLOBs. They solved all sorts of problems, and they caught on very, very rapidly. Of course, at the time Apollo was a rising star, so inside of twelve months every database company announced either that they had BLOBs or that they were going to have them in their next release. This included DEC, who never quite figured out that the segmented string was actually a BLOB, and that they already had it, just under a different name. That's the way it goes.

I read that the name BLOB came from the bad Steve McQueen movie "The Blob" and that you thought the data in a BLOB field could get so big that it might eat Cincinnati, like the BLOB monster in the movie.
JS: I don't remember saying that, but that sounds like something I'm prone to saying. It didn't stand for anything, actually. But other people later retrofitted it as an acronym for either Binary Large Objects or Basic Large Objects.

For me, the BLOB was the first attempt to deal with unstructured data. Would you agree with that?
JS: Yes, I would. By the way, I also created an interface, which I built into the relational database, but it never really caught on. This interface had a really nice method where you could request a BLOB by type. There were system-defined types and there were user-defined types, and you could register BLOB filters. So if you asked for type X and the BLOB was stored as type Y, the system would find the filter that automatically converted from Y to X. It turned out to be very useful for printing out binary BLOBs in printable ASCII, but the rest of the industry never picked up this idea, which is kind of a shame.
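
(Illustration, not part of the interview.) The BLOB filter idea described above can be sketched as a small registry keyed by the stored type and the requested type. This is only a sketch under assumptions: the names `register_filter`, `read_blob`, and `binary_to_printable`, and the type labels "binary" and "ascii", are hypothetical and are not taken from any actual product's API.

```python
from typing import Callable, Dict, Tuple

# A filter converts the bytes of a BLOB from one type to another.
Filter = Callable[[bytes], bytes]

# Hypothetical registry: (stored_type, requested_type) -> conversion function.
_filters: Dict[Tuple[str, str], Filter] = {}


def register_filter(stored_type: str, requested_type: str, fn: Filter) -> None:
    """Register a conversion from stored_type to requested_type."""
    _filters[(stored_type, requested_type)] = fn


def read_blob(data: bytes, stored_type: str, requested_type: str) -> bytes:
    """Return the BLOB in the requested type, applying a filter if needed."""
    if stored_type == requested_type:
        return data
    try:
        fn = _filters[(stored_type, requested_type)]
    except KeyError:
        raise LookupError(
            f"no filter from {stored_type} to {requested_type}"
        ) from None
    return fn(data)


def binary_to_printable(data: bytes) -> bytes:
    """Render arbitrary bytes as space-separated hex digits (printable ASCII)."""
    return data.hex(" ").encode("ascii")


register_filter("binary", "ascii", binary_to_printable)
printable = read_blob(b"\x00\xffBLOB", "binary", "ascii")
print(printable.decode("ascii"))
```

The caller only names the type it wants; the lookup decides which conversion, if any, has to run.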

Then again, SQL committees have never had a lot of inspiration. The database world has a long history of not being able to deal with unstructured data, and it's pretty embarrassing.

The world hasn't caught on to two ideas: that you need to have data consistency, even when the data is unstructured, and that you need to be able to do unstructured searches against structured data. There's no reason you can't do these two things.

What exactly is an unstructured search?
JS: It's a search in which the application doesn't have to name the tables it's searching. It's also called "integrated search", because it's the database itself, rather than the layer on top, that figures out what tables you're searching. It doesn't fit the SQL model or language, because with SQL you always have to say what table you're retrieving from. But what if you don't know what table it is?

Actually, the product that eventually became the storage engine in MySQL had a completely integrated search: a multi-table, multi-column search, where any text field could be marked as searchable, in which case it was automatically indexed.
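
(Illustration, not part of the interview.) To make the idea of a search that names no tables concrete, here is a rough emulation using Python's built-in `sqlite3` module. It is only a sketch under assumptions: the function `search_everywhere` and the sample tables are hypothetical, and the scan runs in a layer on top of the database with `LIKE`, whereas the integrated search described above lives inside the database and uses indexes.

```python
import sqlite3


def search_everywhere(conn: sqlite3.Connection, term: str):
    """Search every text column of every table without the caller naming one."""
    hits = []
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    ]
    for table in tables:
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for column in columns:
            name, declared_type = column[1], (column[2] or "").upper()
            # Only look at columns declared as text-like.
            if "TEXT" not in declared_type and "CHAR" not in declared_type:
                continue
            query = (
                f'SELECT rowid, "{name}" FROM "{table}" '
                f'WHERE "{name}" LIKE ? ORDER BY rowid'
            )
            for rowid, value in conn.execute(query, (f"%{term}%",)).fetchall():
                hits.append((table, name, rowid, value))
    return hits


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, memo TEXT, amount REAL);
    INSERT INTO customers (name) VALUES ('Apollo'), ('DEC');
    INSERT INTO invoices (memo, amount) VALUES ('Apollo workstation order', 100.0);
    """
)
results = search_everywhere(conn, "Apollo")
for hit in results:
    print(hit)
```

The application passes only the search term; which tables and columns get consulted is worked out from the schema.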

I talk to a lot of CIOs, and all they're hearing about is Big Data and NoSQL. What do you think about NoSQL as a solution to managing large data sets that are unstructured and distributed?
JS: I don't like it much. The reason I don't like it is you can do much better than that. I believe very strongly in transactions, and you need consistency. If you have an application that is running against the database system, it has to see consistent data. If it doesn't see consistent data, then the application programmers may have to deal with a very complex environment - and there just aren't that many application programmers out there that have those kinds of skills.

The database system really should solve the problem of consistency, and when a transaction fails, the database system should be able to say that the transaction didn't work, so the application can back it out and try it again. None of this stuff is present in NoSQL. The NoSQL guys gave up and said: "This is the best we know how to do."
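
(Illustration, not part of the interview.) The fail, back out, and retry pattern described above can be sketched with Python's built-in `sqlite3` module. The helper `run_in_transaction`, the `accounts` table, and the simulated failure on the first attempt are all assumptions made for the example, not a description of any particular database product.

```python
import sqlite3


def run_in_transaction(conn: sqlite3.Connection, work, attempts: int = 3):
    """Run work(conn, attempt) atomically; roll back and retry on failure."""
    for attempt in range(1, attempts + 1):
        try:
            # The connection context manager commits on success and
            # rolls back if the block raises.
            with conn:
                return work(conn, attempt)
        except sqlite3.OperationalError:
            if attempt == attempts:
                raise


def transfer(conn: sqlite3.Connection, attempt: int) -> int:
    """Move 10 units from account 'a' to account 'b'."""
    conn.execute("UPDATE accounts SET balance = balance - 10 WHERE name = 'a'")
    if attempt == 1:
        # Simulated failure halfway through the first attempt.
        raise sqlite3.OperationalError("simulated conflict")
    conn.execute("UPDATE accounts SET balance = balance + 10 WHERE name = 'b'")
    return attempt


conn = sqlite3.connect(":memory:")
with conn:
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")

succeeded_on = run_in_transaction(conn, transfer)
balances = dict(conn.execute("SELECT name, balance FROM accounts").fetchall())
print(succeeded_on, balances)
```

Because the half-finished first attempt is rolled back by the database, the application never has to repair a partial update itself; it just tries again.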

My argument is that's not good enough.

The relational model has a lot going for it, but it's sometimes too rigid, which makes handling unstructured data more difficult. What I think is interesting looking forward, is to take those elements of the relational model that are very successful and start relaxing the rules.

In a database, every record should be self-describing. You can put some intelligence at a higher level, not at the level of the SQL language but in a higher-level language on top of that. I think that is going to be the next step. I think it's a much better approach than what the NoSQL guys are trying to do. But the industry is not quite there yet.

Besides, NoSQL is an umbrella term, isn't it? It's just a set of different approaches to solving the same problem by using tools other than SQL. There are several different competing models, right?
JS: That's right. It's kind of crazy. It's technology that's defined by what it isn't. My cell phone is NoSQL too.

Sometimes NoSQL solutions are described as providing what's called "eventual consistency," which means that, for some interval, the data may be inconsistent. What do you think about that?
JS: Virtually any adjective used with the word consistency means "not." So "eventual consistency" means inconsistent.

Do you think that eventually people will start to see the light and deal with Big Data with a more transactional approach?
JS: Yes, I do. But then again, my crystal ball has always been flawed.

Here's what I don't like about NoSQL. For any given database technology there's a division of intelligence between the client and the database system. The more the intelligence lies in the database system, the more you can optimize at the database level, and the more you free application developers to focus on business logic.

The more the database does, the lower the skill level of the application programmer has to be. I think we're living in a world where the expertise of programmers is now very heavily oriented towards mobile applications and fancy GUIs. So dealing with practical aspects of data management in the application just isn't going to happen.

I think the database world has to adapt and become more flexible. There needs to be a higher-level language, so you can do very efficient, very high-level database accesses in a single round trip. For example, consider the huge amount of data access required for a web page, which is the biggest driver for data access these days.

To generate a good web page, like amazon.com, you need a data access point that is sufficiently high level that you can send in a single query and get everything you need back at once. Otherwise, the application needs to make somewhere between two dozen and four dozen queries. Even if each query is very fast, all those round trips are completely unacceptable in terms of latency.

You should be able to get away with a single turnaround. That's where the world has to go. But that just isn't going to happen soon. It could happen on an SQL database, if somebody were to add a higher-level language on top.

But it certainly won't be solved by NoSQL. NoSQL requires all the intelligence to be on the client side, and that's the wrong place to put the intelligence. In NoSQL, data views are completely defined by the application program, so no two applications have the same view of the data. Imagine the nightmares that result when applications have different views of the data, and the data becomes inconsistent.

By and large, the database world is pretty dreary when it comes to innovation.