I haven't even switched from .xls to .xlsx for that reason.
Or maybe that means I'm making lw's problem worse. Anyway, everything goes into SAS in the end.
I think we've had the R vs SAS conversation here-- I mostly see people working in R, which is also what I use.
I am also curious about the fate of cloud computing, cloud data, and cloud code. Haven't done much with this myself. But access to and control over large volumes of data is a big deal, and having everything hidden makes me nervous.
Actually, a tangent to the distributed-archives-sharing-metadata post below-- I wonder about two issues. First, referential integrity: is the shared metadata in sync with the primary data? How would you test that? Does the metadata scheme support external references (i.e., My.document1 contains a reference to Your.document3)? And separately, hidden updates: either "oh, we put in placeholder primary data for those entries-- they'll be filled in later, possibly," or "yes, we'll explain on the telephone that we yanked the article that turned out to be libellous and replaced it with something else." Wikipedia avoids these problems via transparent revision history and central storage.
Anyway, everything goes into SAS in the end.
This reads to me like some kind of grim, fatalistic aphorism, but that's probably because I have no idea what "SAS" is.
I used to be pro-cloud, because the advantage of having professionals in charge of keeping your data safe and backed up is huge. But I'm increasingly skeptical, as google turns increasingly evil. The actual rights you have to anything in the cloud are shockingly limited, and since you have no control over cloud software, the data could suddenly become unusable at any point.
I work with excel as well as various cloud options all the time, including a great deal of moving data back and forth, both between workstations and between platforms. I therefore feel very well qualified to say that I find the whole topic extremely boring, and shouldn't be relied on to have anything useful, on topic, or non-vaguely annoying to say.
I am also curious about the fate of cloud computing, cloud data, and cloud code.
We now have cloud access, but I don't trust it for what may be get-off-my-lawn reasons. (Also, HIPAA reasons, but even when those don't apply, I'm nervous.) Mainly, there are certain data sets that I don't want anybody but me to be able to revise. Those get distributed, but the base copy stays with me. There are other sets where other people or teams hold similar roles. The cloud is for working with analyses, not the underlying core data.
With R, do you spend less time hammering array-shaped math into matrix-shaped holes? I should switch anyway, but this would provide incentive.
5 is why it's my view that Google (or whatever Google successor takes on a similar role) will eventually become basically a regulated utility like the gas company or a streetcar company circa 1925, as people start to realize the power that they have. On the OP it's all just a confusing bunch of letters and I have no idea what to say about it.
I don't know what array-shaped math means. Everything is matrices anyway.
It's also obviously unsustainable that the only way to solve major problems with google (e.g. they decide to kick you off your email) is to have a friend who works there.
10: Probably less time than MATLAB, yes.
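For anyone wondering what "array-shaped math" means in practice, here's a rough sketch in Python/numpy (standing in for R's n-dimensional arrays; the data and shapes are made up, purely illustrative):

    import numpy as np

    # Fake data: 4 subjects x 3 conditions x 100 timepoints.
    x = np.random.default_rng(0).normal(size=(4, 3, 100))

    # Array-shaped math: reduce along any axis of the 3-D array directly.
    means = x.mean(axis=2)  # a 4 x 3 table of per-cell means

    # The matrix-shaped hole: flatten to 2-D first, then undo the flattening.
    flat = x.reshape(4 * 3, 100)
    means_via_matrix = flat.mean(axis=1).reshape(4, 3)

    assert np.allclose(means, means_via_matrix)

In a strictly matrix-oriented setting you end up writing the second version everywhere.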
Also:
This reads to me like some kind of grim, fatalistic aphorism
Is it ever.
That reminds me. I should figure out how to archive gmail.
I wonder about two issues, referential integrity ... and separately, hidden updates
As I understand it, the current consensus among those working on distributed data is that it's impossible to achieve those goals, and one shouldn't try. RDF, for example, was intentionally designed to have no way even to describe referential integrity at the level of the schema, and although you can specify version information, there's no enforcement mechanism.
Things like the DPLA and Europeana work something like this. Participating institutions map their data to the metadata schema the aggregator is using, and then either transform the data themselves (via XSLTs, say) or provide the mapping to the aggregator, who is then responsible for the transformation. Typically the institution then provides an OAI-PMH target or similar method for the aggregator to harvest their data from. When you sign up, things like harvesting intervals, methods for pushing data (although the aggregator pulling it is more common, I think), and so on are agreed. Often the institutions may already have mappings or crosswalks to common standards, so may do nothing more than provide a link to an OAI target offering DC.
The aggregator will, I think, typically just harvest and chuck it all into a Solr index.
(Writing on phone so acronym links and explanations left out)
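To make the harvesting step concrete, a minimal Python sketch of pulling records from an OAI-PMH target (the endpoint URL is invented; a real harvester would add batching, error handling, and logging, but the protocol loop is just this):

    import requests
    import xml.etree.ElementTree as ET

    BASE = "https://archive.example.org/oai"  # hypothetical OAI-PMH endpoint
    NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        root = ET.fromstring(requests.get(BASE, params=params).content)
        for rec in root.iterfind(".//oai:record", NS):
            print(rec.findtext("oai:header/oai:identifier", namespaces=NS))
        # OAI-PMH pages large result sets via resumption tokens.
        token = root.find(".//oai:resumptionToken", NS)
        if token is None or not token.text:
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text}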
15. Excellent. So to describe a useful data summary generated by some version of a public program run against public data, what would the distributed archivist do?
I manage systems, rather than actually use them. But my take is that any system that you don't control is a system you can't count on. I use Gmail for throw-away address usage, and S3 as a tertiary-redundant backup, but all primary and secondary storage of data is handled locally. The reason is responsibility and liability. Not to mention, if a construction project kills your connectivity, you still have access to your data.
12: did that piece get linked here already? Or are there just tons of stories of that form?
That reminds me. I should figure out how to archive gmail.
Yeah, no kidding. I should also decide on a replacement for Google Reader, and save links to the stuff I access via iGoogle, and . . . oh, hell, I'm just going to go back to reading newspapers and sending letters through the mail.
19: I'm not sure what piece you're referring to, but there do seem to be lots of stories of that form, yeah.
We now have cloud access, but I don't trust it for what may be get-off-my-lawn reasons.
I've worked a bunch with data which is generated on systems that are (for very good reasons) heavily protected. The data itself isn't that confidential, but it's a pain in the neck getting approval for any process of getting it off of the original system.
It's kind of interesting, but every couple of months or so I'll have a conversation with somebody who will say, "why don't you do X, that would be simple," to which I reply, "I'd love to, but there's no way to get permission for a firewall exception for X."
For me, personally, it probably makes my life simpler on balance, because it constrains the range of options. But it's always nice in the cases where it's possible to just set up a read-only account and get information directly.
16 is great if all participants are well-intentioned and competent. Is there provision for what will happen when people who aren't ask to join? They might like to list 100 million records submitted to the archive on their homepage.
People who aren't me often suck at naming variables.
It's also obviously unsustainable that the only way to solve major problems with google (e.g. they decide to kick you off your email) is to have a friend who works there.
I recently learned that the (exceedingly unimportant, but nifty for me) feature I suggested to an acquaintance who works deep in the bowels of google has been implemented. Yay, faceless corporation!
(exceedingly unimportant, but nifty for me)
And now you can sort a google image search on t-shirt thinness and nipple erectness.
But my take is that any system that you don't control is a system you can't count on.
The flipside of this is that not everyone is competent (or has the time) to control the systems they count on. Lord knows that while I could run my own mailserver if I really needed to, I'm very, very happy I don't have to, and the ultimate reliability if I did would probably work out to about the same.
So to describe a useful data summary generated by some version of a public program run against public data, what would the distributed archivist do?
The official line of the Semantic Web crowd is that any consumer of such data has to be fault-tolerant, which specifically means making no assumptions at all about the availability of any resource not under your control. Essentially, having recognized that something like the traditional database system's model of data integrity isn't applicable in a distributed context, they decided to give up on data integrity completely.
In practice no one is really willing to work that way, and I'm sure that the arrangements ttaM describes in 16 include understandings about what each party can expect from the other in terms of availability, how data will be revised or versioned, etc.
What's lacking right now is any common and explicit framework for describing such guarantees. What that should look like I have no idea, but the absence of such is (I would bet) going to become a significant drag on the development of distributed information systems.
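To make the "no enforcement" point concrete, a tiny Python/rdflib sketch (all names invented): you can assert a triple whose object refers to a resource nobody anywhere has described, and nothing in the stack will ever flag it.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()

    g.add((EX.document1, EX.title, Literal("My document")))
    g.add((EX.document1, EX.cites, EX.document3))  # a perfectly legal dangling reference

    # No schema can require that EX.document3 exist, and no store will complain.
    print(g.serialize(format="turtle"))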
they decided to give up on data integrity completely
But if the goal of your movement is to help people access data, isn't this just a way of saying your movement has failed?
(Is the semantic web thing a 'movement'? If not, what is it?)
29. Right, I shouldn't have framed that as a direct question. I guess that my main points are that the technical underpinnings of data storage are intimately intertwined with the politics of publishing and distribution.
Basically, it is now much easier than before to make something transient that is widely useful, but, I fear, much more difficult to preserve the useful data.
Yeah. You need an actual archive with people to preserve data over long periods of time.
whoops, main point is. Also, IMO Lessig is great on these issues, should be widely read. Personally, I was sorry that he decided against a political run.
re: 29.1
Yes, I have colleagues who think that way, and colleagues who think the traditional 'library' way. I used to think the former people had a point, but the more I work in this area, the more I'm coming round to a more traditional point of view, as it's clear that you really can't 'fix it all in the search'.
On the plus side, even quite bad scholarly aggregation can drive hits towards proper scholarly resources which do have some level of data integrity.
FWIW, I'm building an internal aggregator (taking all of our existing digitised resources, migrating the metadata to a common standard, and then exposing it via various methods), it's bastard hard work to do well. I could just bang it all in an index without bothering to do anything with the metadata, and it'd be much quicker.
There are people who archive their data as .xls ?!?!?!
Text files exist for a reason. They're smaller, and can be converted more reliably.
Is the semantic web thing a 'movement'? If not, what is it?
It's an umbrella term covering a particular vision of how distributed data should work; the technologies that have been developed to realize that vision; and the international standards groups that are promulgating those technologies. But it's also a faddish movement.
But if the goal of your movement is to help people access data, isn't this just a way of saying your movement has failed?
Well, one response might be that the situation is analogous to the early development of the Web, where HTML/HTTP provided a minimal set of capabilities allowing sites to relate to each other and to end users, but no enforcement of particular ways of using those capabilities or guarantees about how others would use them. (I suspect a number of people at the time thought it a flaw that one had no control over who could link to one's site, for example.) And yet people sorted out workable norms and expectations, and added layers with more explicit guarantees where needed (e.g., secure connections for e-commerce).
The analogy to early development of the Web isn't adventitious, as Tim Berners-Lee is one of the driving forces behind this stuff.
re: 37
I get academics sending me textual data in .xls format, never mind data that is even remotely suited to the format. Microsoft, in their wisdom, then add shitloads of spurious apostrophes and fuck up the unicode when you try to convert it to comma or tab-delimited files.
Text files exist for a reason. They're smaller, and can be converted more reliably.
I never saw an archive of text files that wasn't accompanied by code to put them into SAS.
People archive their everything as .xls.
I'm always surprised to learn people use Excel for real work. Or use spreadsheets at all.
Yes, it is odd. Mostly they're not even numerical spreadsheets. They're just tables that have been shoehorned into a spreadsheet form.
I once saw a guy merge files in Excel. Or I saw the file afterward. I think he sorted the two files and then put them against each other in the same sheet where the ids matched. I redid it, but he was correct.
But essear, spreadsheets are declarative, reactive programming models! The future!
39,40 And there's lots of data that is not suitable to tab-delimited representation, leaving aside the technical issue of reversible and system-independent escaping of embedded tabs.
The spreadsheet HAS an internal XML representation. It's just that MS not only doesn't keep that representation stable, they intentionally choose to make export difficult and version-dependent, and so unreliable.
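(For the newer .xlsx format, at least, the XML is literally right there: an .xlsx file is just a zip archive of XML parts, which you can inspect directly. A Python sketch, with an invented filename; the old binary .xls is another story.)

    import zipfile

    with zipfile.ZipFile("workbook.xlsx") as z:          # hypothetical file
        print(z.namelist())                              # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml
        print(z.read("xl/worksheets/sheet1.xml")[:200])  # raw XML for the first sheet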
Out of curiosity, does the digital representation of your institution's records support cross-references to other records?
As per 39, I'll get things like bibliographic records, or transcriptions, or annotations delivered in .xls files. Often, just for added pleasure, formatted differently from each other. So you'll get a few thousand records, and then a week later, another few. Only this time the columns are in a different order and the header row is spelled differently.
47.1: I never use tab-delimited representation and rarely see it. If it is a txt file, it's fixed-width columns or csv.
Unless the 'cards' statement in SAS can be considered a form of tab-delimitation. I don't use that often, but it does come up.
Excel spreadsheets are fun. Especially if you have to convert thousands of them into millions of single-page TIFF files in order to produce them in response to an administrative subpoena. I won't forget that project in a hurry.
csv has the same problem, embedded comma escaping.
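The comma problem is solvable, though, provided writer and reader agree on quoting rules. A quick Python sketch with toy data; the csv module quotes embedded commas, quotes, and newlines, and reads them back intact:

    import csv, io

    rows = [["id", "note"],
            ["1", 'contains, a comma and a "quote"'],
            ["2", "even an\nembedded newline"]]

    buf = io.StringIO()
    csv.writer(buf).writerows(rows)       # quotes whatever needs quoting

    buf.seek(0)
    assert list(csv.reader(buf)) == rows  # round-trips exactly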
In any such fixed-record format, how do you represent a delimited set of attributes which each apply to a small fraction of the data? (I.e., 0.1% of records are radioactive, 0.1% are blue, 0.1% are flagged by an expensive global predictor.... No, there is no controlled set of attributes, and the list of attributes to be recorded will grow with time as the dataset is used.) Or hierarchical data-- please represent taxonomy as a tab-delimited table. (Yes, it can be done, but it takes a little thought, and I think it is not easy to do inside of Excel.)
Just get a whole bunch of dummy variables.
re: 47.last
Not the core catalogue, I don't think. That'll just be Marc21 (I think). But our more recent systems often will, yes. It's increasingly common to treat digital objections as aggregations of other digital objects. So, for example, an image of a manuscript page and its associated metadata is an object. The manuscript is another object, which is an aggregation of lots of others. Each will have a record in the system, with a unique identifier and a persistent URL. So you can build new objects out of others, or create novel aggregations, or aggregate a text and some images, or some images and annotations, and so on.
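A crude Python sketch of that object/aggregation shape (all names and fields invented), just to show how aggregation bottoms out in identifiers rather than copies:

    from dataclasses import dataclass, field

    @dataclass
    class DigitalObject:
        identifier: str                                  # unique and persistent
        url: str
        metadata: dict = field(default_factory=dict)
        aggregates: list = field(default_factory=list)   # identifiers of member objects

    page = DigitalObject("obj:ms-42-p1", "https://example.org/obj/ms-42-p1",
                         {"type": "image", "folio": "1r"})
    ms = DigitalObject("obj:ms-42", "https://example.org/obj/ms-42",
                       {"type": "manuscript"}, aggregates=[page.identifier])

    # Novel aggregations reuse existing objects without duplicating them.
    anthology = DigitalObject("obj:anth-7", "https://example.org/obj/anth-7",
                              {"type": "aggregation"},
                              aggregates=[ms.identifier, page.identifier])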
||
This is the second security gig this month where I'm sitting outside a company because the crazy ex of one of their female employees has threatened to kill them. You'd think a building that manufactures fentanyl would have armed security, but all they've got is a pudgy middle-aged woman with a flashlight.
|>
54. With that planning style, you must be in sales management.
Here, read this. Ignore the release dates, of course we're behind schedule.
57: No. I just turn everything into SAS after I get it.
I should note that the data deposition for that project will accept either Excel spreadsheets or .xml data submissions, but once we digest it the data is all stored in a proper relational database.
||
From my mixed-up former student on FB:
Best Advice in two Lines..
"Silence is the Best answer for all questions"
"Smiling is the best reaction in all situations"
(sure, why not.)
|>
53.1: Use a different delimiter, then. Double pipes ("||") work pretty well for most things. Fixed width works even better. Tab delimited should work for just about anything that Excel can handle.
53.2: One row per attribute per item. Don't bother converting to indicators until you're actually using the data.
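In pandas terms (toy data), the long form is cheap to keep and easy to widen on demand:

    import pandas as pd

    # Long form: one row per (item, attribute) pair; new attributes cost nothing.
    attrs = pd.DataFrame({
        "id":        [17, 17, 940, 2055],
        "attribute": ["radioactive", "blue", "blue", "flagged"],
    })

    # Widen to 0/1 indicator columns only when an analysis actually needs them.
    indicators = pd.crosstab(attrs["id"], attrs["attribute"]).clip(upper=1)
    print(indicators)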
61: Do you have allergies to any medications?
Best Advice in two Lines..
"Silence is the Best answer for all questions"
"Smiling is the best reaction in all situations"
This is how she passed all your tests, no doubt.
I assumed she was enjoying herself.
61: I think this dildo is way too big for anal, should we try anyways?
Best Advice in two Lines..
"Silence is the Best answer for all questions"
"Smiling is the best reaction in all situations"
Oh, god, there's so many situations where this could go wrong . . .
Wife: You've been staying late at the office a lot recently. You're not sleeping with your new coworker, are you?
Husband: [silence]
Wife: OH MY GOD, I KNEW IT, YOU'RE HAVING AN AFFAIR WITH HER!
Husband: [smiles]
"Waiter, could you tell me if there's meat in this soup? I'm a vegetarian."
"Thank you for calling our helpdesk. How may I direct your call?"
I think I like it best applied to exceedingly mundane, exchange of information situations.
"Is there anybody alive in there?"
"Do you mind if I play Pink Floyd on repeat all evening long?"
"Is it ok if I send you this 150,000 row dataset as an Excel file?"
Pause/play: I'm at a crazy Beverly Hills doctor's office for a physical (in the style of Entertainment 720, previously described here). Not only is there club music playing and thematic mood lighting over white leather chairs, but the male nurse just fist-bumped me and said "decent abs, bro. You're getting there."
Halford's doctors manage their data in ab-delimited files. Don't SAS me, bro!
India's caste system is based on karma-separated values.
M/tch's girlfriends are Kraab delimited.
Knecht tracks his business travel in Expensible Markup Language.
Speaking of holding hostage: Quicken is claiming it will soon stop updating stock prices unless I buy a new version. Any ideas or alternatives?
I've studied RDF quite a bit, but I can't come up with a use case where an RDF triplestore is more appropriate than Postgres for a production-grade project. I love the idea of RDF, but is it ever worth using?
common to treat digital objections as aggregations of other digital objects.
Injection, surjection, objection.
I am wondering whether to try to remora onto a project that's trying to be Europeana-ish for biology specimens. A *lot* of them. Bones! Cross-sections! Things for which 'temperature history' is an important variable!
re: 87
I've come across a possible project in the UK that's planning to do 3D scanning of specimen drawers. With some sort of crowd-sourcey/OCRish method of decoding the specimen labels. Not sure how far along it is, though.
Are we to understand that the male nurse had even better abs, or what? It sounds a bit condescending, as fist-bump validations go.
Oh, it could have been "I am a trained medical professional who sees lots and lots of men naked. In my professional opinion, your abs are smokin'. Also, you've been Iced!" Not condescending, but just an (incredibly tacky) compliment from someone in a position of expertise.
89 -- I didn't see his abs, but they were likely better. Tbh I don't really know what was going on.
You should go back there and demand to see his abs.
Did he also grab your ass? Because that could be an important clue.
Hey, LB, let's ease into it, here.
Let things evolve, you know? Keep some mystery.