I haven't even switched from .xls to .xlsx for that reason.
Or maybe that means I'm making lw's problem worse. Anyway, everything goes into SAS in the end.
I think we've had the R vs SAS conversation here-- I mostly see people working in R, which is also what I use.
I am also curious about the fate of cloud computing, cloud data, and cloud code. Haven't done much with this myself. But access to and control over large volumes of data is a big deal, and having everything hidden makes me nervous.
Actually, a tangent to the distributed-archives-sharing-metadata post below-- I wonder about two issues. First, referential integrity: is the shared metadata in sync with the primary data? How would you test that? Does the metadata scheme support external references (i.e., My.document1 contains a reference to Your.document3)? And separately, hidden updates: either "oh, we put in placeholder primary data for those entries-- they'll be filled in later, possibly," or "yes, we'll explain on the telephone that we yanked the article that turned out to be libellous and replaced it with something else." Wikipedia avoids these problems via transparent revision history and central storage.
Anyway, everything goes into SAS in the end.
This reads to me like some kind of grim, fatalistic aphorism, but that's probably because I have no idea what "SAS" is.
I used to be pro-cloud, because the advantage of having professionals in charge of keeping your data safe and backed up is huge. But I'm increasingly skeptical, as google turns increasingly evil. The actual rights you have to anything in the cloud are shockingly limited, and since you have no control over cloud software, the data could suddenly become unusable at any point.
I work with excel as well as various cloud options all the time, including a great deal of moving data back and forth, both between workstations and between platforms. I therefore feel very well qualified to say that I find the whole topic extremely boring, and shouldn't be relied on to have anything useful, on topic, or non-vaguely annoying to say.
I am also curious about the fate of cloud computing, cloud data, and cloud code.
We now have cloud access, but I don't trust it for what may be get-off-my-lawn reasons. (Also, HIPAA reasons, but even when those don't apply, I'm nervous.) Mainly, there are certain data sets that I don't want anybody but me to be able to revise. Those get distributed, but the base copy stays with me. There are other sets where other people or teams hold similar roles. The cloud is for working with analyses, not the underlying core data.
With R, do you spend less time hammering array-shaped math into matrix-shaped holes? I should switch anyway, but this would provide incentive.
5 is why it's my view that Google (or whatever Google successor takes on a similar role) will eventually become basically a regulated utility like the gas company or a streetcar company circa 1925, as people start to realize the power that they have. On the OP it's all just a confusing bunch of letters and I have no idea what to say about it.
I don't know what array-shaped math means. Everything is matrices anyway.
It's also obviously unsustainable that the only way to solve major problems with google (e.g. they decide to kick you off your email) is to have a friend who works there.
10: Probably less time than MATLAB, yes.
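For anyone wondering what "array-shaped math" means in practice, here's a rough sketch in Python/numpy (standing in for R's n-dimensional arrays; the data and shapes are made up, purely illustrative):

    import numpy as np

    # Fake data: 4 subjects x 3 conditions x 100 timepoints.
    x = np.random.default_rng(0).normal(size=(4, 3, 100))

    # Array-shaped math: reduce along any axis of the 3-D array directly.
    means = x.mean(axis=2)  # a 4 x 3 table of per-cell means

    # The matrix-shaped hole: flatten to 2-D first, then undo the flattening.
    flat = x.reshape(4 * 3, 100)
    means_via_matrix = flat.mean(axis=1).reshape(4, 3)

    assert np.allclose(means, means_via_matrix)

In a strictly matrix-oriented setting you end up writing the second version everywhere.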
Also:
This reads to me like some kind of grim, fatalistic aphorism
Is it ever.
That reminds me. I should figure out how to archive gmail.
I wonder about two issues, referential integrity ... and separately, hidden updates
As I understand it, the current consensus among those working on distributed data is that it's impossible to achieve those goals, and one shouldn't try. RDF, for example, was intentionally designed to have no way even to describe referential integrity at the level of the schema, and although you can specify version information, there's no enforcement mechanism.
Things like the DPLA and Europeana work something like this. Participating institutions map their data to the metadata schema the aggregator is using, and then either transform the data themselves (via XSLTs, say) or provide the mapping to the aggregator, who is then responsible for the transformation. Typically the institution then provides an OAI-PMH target or similar method for the aggregator to harvest their data from. When you sign up, things like harvesting intervals, methods for pushing data (although the aggregator pulling it is more common, I think), and so on are agreed. Often the institutions may already have mappings or crosswalks to common standards, so may do nothing more than provide a link to an OAI target offering DC.
The aggregator will, I think, typically just harvest and chuck it all into a Solr index.
(Writing on phone so acronym links and explanations left out)
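To make the harvesting step concrete, a minimal Python sketch of pulling records from an OAI-PMH target (the endpoint URL is invented; a real harvester would add batching, error handling, and logging, but the protocol loop is just this):

    import requests
    import xml.etree.ElementTree as ET

    BASE = "https://archive.example.org/oai"  # hypothetical OAI-PMH endpoint
    NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        root = ET.fromstring(requests.get(BASE, params=params).content)
        for rec in root.iterfind(".//oai:record", NS):
            print(rec.findtext("oai:header/oai:identifier", namespaces=NS))
        # OAI-PMH pages large result sets via resumption tokens.
        token = root.find(".//oai:resumptionToken", NS)
        if token is None or not token.text:
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text}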
15. Excellent. So to describe a useful data summary generated by some version of a public program run against public data, what would the distributed archivist do?
I manage systems, rather than actually use them. But my take is that any system that you don't control is a system you can't count on. I use Gmail for throw-away address usage, and S3 as a tertiary-redundant backup, but all primary and secondary storage of data is handled locally. The reason is responsibility and liability. Not to mention, if a construction project kills your connectivity, you still have access to your data.
12: did that piece get linked here already? Or are there just tons of stories of that form?
That reminds me. I should figure out how to archive gmail.
Yeah, no kidding. I should also decide on a replacement for Google Reader, and save links to the stuff I access via iGoogle, and . . . oh, hell, I'm just going to go back to reading newspapers and sending letters through the mail.
19: I'm not sure what piece you're referring to, but there do seem to be lots of stories of that form, yeah.
We now have cloud access, but I don't trust it for what may be get-off-my-lawn reasons.
I've worked a bunch with data which is generated on systems that are (for very good reasons) heavily protected. The data itself isn't that confidential, but it's a pain in the neck getting approval for any process of getting it off of the original system.
It's kind of interesting, but every couple of months or so I'll have a conversation with somebody who will say, "why don't you do X, that would be simple," to which I reply, "I'd love to, but there's no way to get permission for a firewall exception for X."
For me, personally, it probably makes my life simpler on balance, because it constrains the range of options. But it's always nice in the cases where it's possible to just set up a read-only account and get information directly.
16 is great if all participants are well-intentioned and competent. Is there provision for what will happen when people who aren't ask to join? They might like to list 100 million records submitted to the archive on their homepage.
People who aren't me often suck at naming variables.
It's also obviously unsustainable that the only way to solve major problems with google (e.g. they decide to kick you off your email) is to have a friend who works there.
I recently learned that the (exceedingly unimportant, but nifty for me) feature I suggested to an acquaintance who works deep in the bowels of google has been implemented. Yay, faceless corporation!
(exceedingly unimportant, but nifty for me)
And now you can sort a google image search on t-shirt thinness and nipple erectness.
But my take is that any system that you don't control is a system you can't count on.
The flipside of this is that not everyone is competent (or has the time) to control the systems they count on. Lord knows that while I could run my own mailserver if I really needed to, I'm very, very happy I don't have to, and the ultimate reliability if I did would probably work out to about the same.
So to describe a useful data summary generated by some version of a public program run against public data, what would the distributed archivist do?
The official line of the Semantic Web crowd is that any consumer of such data has to be fault-tolerant, which specifically means making no assumptions at all about the availability of any resource not under your control. Essentially, having recognized that something like the traditional database system's model of data integrity isn't applicable in a distributed context, they decided to give up on data integrity completely.
In practice no one is really willing to work that way, and I'm sure that the arrangements ttaM describes in 16 include understandings about what each party can expect from the other in terms of availability, how data will be revised or versioned, etc.
What's lacking right now is any common and explicit framework for describing such guarantees. What that should look like I have no idea, but the absence of such is (I would bet) going to become a significant drag on the development of distributed information systems.
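To make the "no enforcement" point concrete, a tiny Python/rdflib sketch (all names invented): you can assert a triple whose object refers to a resource nobody anywhere has described, and nothing in the stack will ever flag it.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()

    g.add((EX.document1, EX.title, Literal("My document")))
    g.add((EX.document1, EX.cites, EX.document3))  # a perfectly legal dangling reference

    # No schema can require that EX.document3 exist, and no store will complain.
    print(g.serialize(format="turtle"))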
they decided to give up on data integrity completely
But if the goal of your movement is to help people access data, isn't this just a way of saying your movement has failed?
(Is the semantic web thing a 'movement'? If not, what is it?)
29. Right, I shouldn't have framed that as a direct question. I guess that my main points are that the technical underpinnings of data storage are intimately intertwined with the politics of publishing and distribution.
Basically, it is now much easier than before to make something transient that is widely useful, but, I fear, much more difficult to preserve the useful data.
Yeah. You need an actual archive with people to preserve data over long periods of time.
whoops, main point is. Also, IMO Lessig is great on these issues, should be widely read. Personally, I was sorry that he decided against a political run.
re: 29.1
Yes, I have colleagues who think that way, and colleagues who think the traditional 'library' way. I used to think the former people had a point, but the more I work in this area, the more I'm coming round to a more traditional point of view, as it's clear that you really can't 'fix it all in the search'.
On the plus side, even quite bad scholarly aggregation can drive hits towards proper scholarly resources which do have some level of data integrity.
FWIW, I'm building an internal aggregator (taking all of our existing digitised resources, migrating the metadata to a common standard, and then exposing it via various methods), it's bastard hard work to do well. I could just bang it all in an index without bothering to do anything with the metadata, and it'd be much quicker.
There are people who archive their data as .xls ?!?!?!
Text files exist for a reason. They're smaller, and can be converted more reliably.
Is the semantic web thing a 'movement'? If not, what is it?
It's an umbrella term covering a particular vision of how distributed data should work; the technologies that have been developed to realize that vision; and the international standards groups that are promulgating those technologies. But it's also a faddish movement.
But if the goal of your movement is to help people access data, isn't this just a way of saying your movement has failed?
Well, one response might be that the situation is analogous to the early development of the Web, where HTML/HTTP provided a minimal set of capabilities allowing sites to relate to each other and to end users, but no enforcement of particular ways of using those capabilities or guarantees about how others would use them. (I suspect a number of people at the time thought it a flaw that one had no control over who could link to one's site, for example.) And yet people sorted out workable norms and expectations, and added layers with more explicit guarantees where needed (e.g., secure connections for e-commerce).
The analogy to early development of the Web isn't adventitious, as Tim Berners-Lee is one of the driving forces behind this stuff.
re: 37
I get academics sending me textual data in .xls format, never mind data that is even remotely suited to the format. Microsoft, in their wisdom, then add shitloads of spurious apostrophes and fuck up the unicode when you try to convert it to comma or tab-delimited files.
Text files exist for a reason. They're smaller, and can be converted more reliably.
I never saw an archive of text files that wasn't accompanied by code to put them into SAS.
People archive their everything as .xls.
I'm always surprised to learn people use Excel for real work. Or use spreadsheets at all.
Yes, it is odd. Mostly they're not even numerical spreadsheets. They're just tables that have been shoehorned into a spreadsheet form.
I once saw a guy merge files in Excel. Or I saw the file afterward. I think he sorted the two files and then put them against each other in the same sheet where the ids matched. I redid it, but he was correct.
But essear, spreadsheets are declarative, reactive programming models! The future!
39,40 And there's lots of data that is not suitable to tab-delimited representation, leaving aside the technical issue of reversible and system-independent escaping of embedded tabs.
The spreadsheet HAS an internal XML representation. It's just that MS not only doesn't keep that representation stable, they intentionally choose to make export difficult and version-dependent, and so unreliable.
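(For the newer .xlsx format, at least, the XML is literally right there: an .xlsx file is just a zip archive of XML parts, which you can inspect directly. A Python sketch, with an invented filename; the old binary .xls is another story.)

    import zipfile

    with zipfile.ZipFile("workbook.xlsx") as z:          # hypothetical file
        print(z.namelist())                              # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml
        print(z.read("xl/worksheets/sheet1.xml")[:200])  # raw XML for the first sheet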
Out of curiosity, does the digital representation of your institution's records support cross-references to other records?
As per 39, I'll get things like bibliographic records, or transcriptions, or annotations delivered in .xls files. Often, just for added pleasure, formatted differently from each other. So you'll get a few thousand records, and then a week later, another few. Only this time the columns are in a different order and the header row is spelled differently.
47.1: I never use tab-delimited representation and rarely see it. If it is a txt file, it's fixed-width columns or csv.
Unless the 'cards' statement in SAS can be considered a form of tab-delimitation. I don't use that often, but it does come up.
Excel spreadsheets are fun. Especially if you have to convert thousands of them into millions of single-page TIFF files in order to produce them in response to an administrative subpoena. I won't forget that project in a hurry.
csv has the same problem, embedded comma escaping.
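The comma problem is solvable, though, provided writer and reader agree on quoting rules. A quick Python sketch with toy data; the csv module quotes embedded commas, quotes, and newlines, and reads them back intact:

    import csv, io

    rows = [["id", "note"],
            ["1", 'contains, a comma and a "quote"'],
            ["2", "even an\nembedded newline"]]

    buf = io.StringIO()
    csv.writer(buf).writerows(rows)       # quotes whatever needs quoting

    buf.seek(0)
    assert list(csv.reader(buf)) == rows  # round-trips exactly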
In any such fixed-record format, how do you represent a delimited set of attributes which each apply to a small fraction of the data? (I.e., 0.1% of records are radioactive, 0.1% are blue, 0.1% are flagged by an expensive global predictor.... No, there is no controlled set of attributes, and the list of attributes to be recorded will grow with time as the dataset is used.) Or hierarchical data-- please represent taxonomy as a tab-delimited table. (Yes, it can be done, but it takes a little thought, and I think it is not easy to do inside of Excel.)
Just get a whole bunch of dummy variables.
re: 47.last
Not the core catalogue, I don't think. That'll just be Marc21 (I think). But our more recent systems often will, yes. It's increasingly common to treat digital objections as aggregations of other digital objects. So, for example, an image of a manuscript page and its associated metadata is an object. The manuscript is another object, which is an aggregation of lots of others. Each will have a record in the system, with a unique identifier and a persistent URL. So you can build new objects out of others, or create novel aggregations, or aggregate a text and some images, or some images and annotations, and so on.
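A crude Python sketch of that object/aggregation shape (all names and fields invented), just to show how aggregation bottoms out in identifiers rather than copies:

    from dataclasses import dataclass, field

    @dataclass
    class DigitalObject:
        identifier: str                                  # unique and persistent
        url: str
        metadata: dict = field(default_factory=dict)
        aggregates: list = field(default_factory=list)   # identifiers of member objects

    page = DigitalObject("obj:ms-42-p1", "https://example.org/obj/ms-42-p1",
                         {"type": "image", "folio": "1r"})
    ms = DigitalObject("obj:ms-42", "https://example.org/obj/ms-42",
                       {"type": "manuscript"}, aggregates=[page.identifier])

    # Novel aggregations reuse existing objects without duplicating them.
    anthology = DigitalObject("obj:anth-7", "https://example.org/obj/anth-7",
                              {"type": "aggregation"},
                              aggregates=[ms.identifier, page.identifier])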
||
This is the second security gig this month where I'm sitting outside a company because the crazy ex of one of their female employees has threatened to kill them. You'd think a building that manufactures fentanyl would have armed security, but all they've got is a pudgy middle-aged woman with a flashlight.
|>
54. With that planning style, you must be in sales management.
Here, read this. Ignore the release dates, of course we're behind schedule.
57: No. I just turn everything into SAS after I get it.
I should note that the data deposition for that project will accept either Excel spreadsheets or .xml data submissions, but once we digest it the data is all stored in a proper relational database.
||
From my mixed-up former student on FB:
Best Advice in two Lines..
"Silence is the Best answer for all questions"
"Smiling is the best reaction in all situations"
(sure, why not.)
|>
53.1: Use a different delimiter, then. Double pipes ("||") work pretty well for most things. Fixed width works even better. Tab delimited should work for just about anything that Excel can handle.
53.2: One row per attribute per item. Don't bother converting to indicators until you're actually using the data.
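In pandas terms (toy data), the long form is cheap to keep and easy to widen on demand:

    import pandas as pd

    # Long form: one row per (item, attribute) pair; new attributes cost nothing.
    attrs = pd.DataFrame({
        "id":        [17, 17, 940, 2055],
        "attribute": ["radioactive", "blue", "blue", "flagged"],
    })

    # Widen to 0/1 indicator columns only when an analysis actually needs them.
    indicators = pd.crosstab(attrs["id"], attrs["attribute"]).clip(upper=1)
    print(indicators)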
61: Do you have allergies to any medications?
Best Advice in two Lines..
"Silence is the Best answer for all questions"
"Smiling is the best reaction in all situations"
This is how she passed all your tests, no doubt.
I assumed she was enjoying herself.
61: I think this dildo is way too big for anal, should we try anyways?
Best Advice in two Lines..
"Silence is the Best answer for all questions"
"Smiling is the best reaction in all situations"
Oh, god, there's so many situations where this could go wrong . . .
Wife: You've been staying late at the office a lot recently. You're not sleeping with your new coworker, are you?
Husband: [silence]
Wife: OH MY GOD, I KNEW IT, YOU'RE HAVING AN AFFAIR WITH HER!
Husband: [smiles]
"Waiter, could you tell me if there's meat in this soup? I'm a vegetarian."
"Thank you for calling our helpdesk. How may I direct your call?"
I think I like it best applied to exceedingly mundane, exchange of information situations.
"Is there anybody alive in there?"
"Do you mind if I play Pink Floyd on repeat all evening long?"
"Is it ok if I send you this 150,000 row dataset as an Excel file?"
Pause/play: I'm at a crazy Beverly Hills doctor's office for a physical (in the style of Entertainment 720, previously described here). Not only is there club music playing and thematic mood lighting over white leather chairs, but the male nurse just fist-bumped me and said "decent abs, bro. You're getting there."
Halford's doctors manage their data in ab-delimited files. Don't SAS me, bro!
India's caste system is based on karma-separated values.
M/tch's girlfriends are Kraab delimited.
Knecht tracks his business travel in Expensible Markup Language.
Speaking of holding hostage: Quicken is claiming it will soon stop updating stock prices unless I buy a new version. Any ideas or alternatives?
I've studied RDF quite a bit, but I can't come up with a use case where an RDF triplestore is more appropriate than Postgres for a production-grade project. I love the idea of RDF, but is it ever worth using?
common to treat digital objections as aggregations of other digital objects.
Injection, surjection, objection.
I am wondering whether to try to remora onto a project that's trying to be Europeana-ish for biology specimens. A *lot* of them. Bones! Cross-sections! Things for which 'temperature history' is an important variable!
re: 87
I've come across a possible project in the UK that's planning to do 3D scanning of specimen drawers. With some sort of crowd-sourcey/OCRish method of decoding the specimen labels. Not sure how far along it is, though.
Are we to understand that the male nurse had even better abs, or what? It sounds a bit condescending, as fist-bump validations go.
Oh, it could have been "I am a trained medical professional who sees lots and lots of men naked. In my professional opinion, your abs are smokin'. Also, you've been Iced!" Not condescending, but just an (incredibly tacky) compliment from someone in a position of expertise.
89 -- I didn't see his abs, but they were likely better. Tbh I don't really know what was going on.
You should go back there and demand to see his abs.
Did he also grab your ass? Because that could be an important clue.
Hey, LB, let's ease into it, here.
Let things evolve, you know? Keep some mystery.