For not-want of an accent, the unfoggedbot was lost.
That's a pretty grave accent aigu error.
You could call it an acute condition.
Text parsing, Ben.
Wave of the future.
So the unfoggedbot couldn't be arsed with fancy characters.
HTML entities were good enough for Jesus and they're good enough for me, Sifu.
6: yeah look what happened to him, though.
I'm just surprised that it's ticket and not billet. Do the French favor the former, or do the two have subtly different meanings?
Google I feel lucky result on "Python HTML entity parser."
What you really want is the htmlentitydefs module, if the problem is entities. However, that comment wasn't written with entities. Had it been, there would have been no problems.
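For anyone following along at home, here is a rough sketch of the entity lookup being described. It assumes Python 3, where htmlentitydefs was renamed html.entities; the helper name entity_to_char and the sample string are made up for illustration and are not anything the bot actually uses.

```python
from html import unescape
from html.entities import name2codepoint


def entity_to_char(name):
    """Map an entity name such as 'eacute' to the character it stands for."""
    return chr(name2codepoint[name])


eacute = entity_to_char("eacute")  # code point 233
oelig = entity_to_char("oelig")    # code point 339

# html.unescape does the same lookup for a whole string at once.
decoded = unescape("caf&eacute; &oelig;uvre")
```

As the comment says, this only helps if the text arrives as entities in the first place.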
After all, all of the characters &, e, a, c, u, t, e, and ; (pretend those are mentioned and not used) are perfectly good ascii already, Sifu, so why would they cause problems? Do try to keep up.
14: that was a lovely time.
I still refuse to believe that basic ASCII text parsing is beyond the scope of Ben's prodigious gifts; if it's converting to ASCII then you're going to have discrete strings, and somebody must have run across this.
Even if nobody has run across this problem, of course, it would be a simple matter of taking the XHTML entity documentation and parsing that, such that one ended up with a formatted list of HTML entity strings, suitable for dumping into one's bot bot, for parsing.
God knows Ben is way ahead of me on this one.
Are you proposing codecs for text strings?? Almost as bad an idea as this.
The problem isn't that I don't know what to do with œ (that is, & o e l i g ;, no spaces); the problem is that it broke on œ (that is, LATIN SMALL LIGATURE OE).
>>> print u'\u0153'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0153' in position 0: ordinal not in range(256)
>>>
It will be interesting to see how this comment gets transmitted, containing the characters it does.
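A small sketch of the same failure, written for Python 3 rather than the Python 2 session pasted above. The helper try_latin1 is a hypothetical name; it just reports whether a single character survives the latin-1 codec.

```python
def try_latin1(ch):
    """Return the latin-1 bytes for ch, or None if the codec refuses it."""
    try:
        return ch.encode("latin-1")
    except UnicodeEncodeError:
        return None


e_result = try_latin1("\u00e9")   # e with acute accent, code point 233
oe_result = try_latin1("\u0153")  # oe ligature, code point 339
```

latin-1 covers exactly the code points 0 through 255, so the first call succeeds and the second does not.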
19: never prevent Ben from applying more complexity to a problem. No better way to make the poor fellow lose interest.
And here I was thinking that the answer to the titular question was that it had become sentient and gone on strike until Ben started paying it a living wage.
And, in fact, it got busted.
That is, the bot got busted trying to transmit it. The earlier solution with é worked; œ, no dice. The latter must be more esoteric.
You should probably hire the Pinkertons.
You should probably listen to Pinkerton. That was an underrated album.
Also, Ben: your 24 says 'no', but your 20's "Latin-1 codec" snippet says 'yes'. Maybe 19 should have been "proposing" s/b "using."
Pinkerton is possibly the least underrated album ever recorded. I have never seen a single music critic say anything negative about it. I have never heard anyone say it is anything less than 400,000% better than anything Weezer has done since then.
The latter must be more esoteric.
...and the former does in fact fall in [0,256).
28: I thought Sifu was proposing something that would decode "é" by, say, "e" (close enough!). I don't really understand the import of 19, I guess.
What do you mean, for instance, by "text string"?
8: Ah, answered my own question via Google. In case there are any other non-Francophones out there:
billet = train or plane ticket; ticket = métro or bus ticket. Source.
My east-coast privilege (which has me commenting, instead of sleeping, at 1:50am) allows me to refer to them as "text" strings instead of what I suppose to be the more proper, "character strings."
Where a character is what, an ascii character, a utf-8 character, a utf-16 character, or what? There is no bare "character". If you mean byte strings, then your 28 is off base. 339 does not fit in a byte. And while 339 is a character according to utf-16, it's not a character according to utf-8, but rather two: 0xC5, and 0x93.
(Anyway it doesn't fit in an 8-bit byte.)
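The byte arithmetic in that comment can be checked directly. A minimal sketch, assuming Python 3, where a str holds code points and bytes only appear once an encoding is chosen:

```python
oe = chr(339)  # LATIN SMALL LIGATURE OE, U+0153

# One code point, but a different byte sequence under each encoding.
utf8_bytes = oe.encode("utf-8")
utf16_bytes = oe.encode("utf-16-be")

code_point_count = len(oe)
utf8_hex = utf8_bytes.hex()
utf16_hex = utf16_bytes.hex()
```

Both encodings need two bytes here; they just disagree on which two.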
What, exactly, is the difference between ascii and utf-8? Wait, don't answer that.
By "character," Ben, I probably mean something closer to "glyph." It's the thing I see on the screen, which you can encode any which way.
I'm obviously not trying to put down the idea of encoding things as other things, in general. My 19 was meant somewhat as a joke, but mainly as an excuse to give that link to a page about Binary XML (which I once heard someone suggest would lead to a world where we had "codecs for XML," which strikes me as a uniquely awful idea). That is all.
& now I am curious, for what do you think a codec is appropriate?
38: compressed video formats, of course.
or audio, or still images ...
The anagram results for "codec ben w-lfs-n" include:
blown condo feces
snob concede wolf
So there's that.
simple anagram:
bén w-lfs-n -> né wolfsnob
Ah. But I can't properly be said to be proposing latin-1, for example. Such things are already with us.
(35 is wrong on some things, btw.)
By "character," Ben, I probably mean something closer to "glyph." It's the thing I see on the screen, which you can encode any which way.
I don't really see the conceptual difference between taking the bytes in a file and representing them as letters and taking the bytes in a file and representing them as a series of images. I mean, plainly there aren't actually glyphs, or even letters, in the file, any more than there are actually files on a hard drive. All of this is added by the faculty of the understanding, or rather the programs that interpret things for us. And if there are multiple ways of interpreting a sequence of bytes as glyphs, if you can mark that into the sequence, then you'll be able to translate them. (Of course, if some are supersets of others, and you deal with them in such a way that you don't just have sequences of bytes but also the information that here is a character, you'll get failures, like the failure the latin-1 codec had above: it only recognizes characters that are one byte long, and it's told that some two-byte sequence is a character. Well, that won't work. But it does explain why encoding the two-byte character as two one-byte utf-8 things allows it at least to be printed, albeit as nonsense: "Å". And that at least allows us to proceed with only a minor stumble.)
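The "printed, albeit as nonsense" effect is easy to reproduce. This is only a sketch of the general mechanism in Python 3, not a claim about what the bot or the database actually did:

```python
oe = "\u0153"

# Encode as UTF-8 (two bytes), then misread those bytes as latin-1,
# where every byte is its own character.
garbled = oe.encode("utf-8").decode("latin-1")

# The damage is reversible as long as nothing else touches the text.
restored = garbled.encode("latin-1").decode("utf-8")
```

The garbled string is two characters long: a capital A with ring, followed by an invisible control character, which is why only the first one shows up on screen.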
I wonder why the nonsense got doubled up when the bot reported 43. Something up in the database too? I wonder, but I don't really care.
Speaking as someone whose pathetic bits of python code are littered with unicode error catching sequences, most containing the word "Fuck" and its derivatives, you have my sympathy. Can't you send stuff as UTF-8 instead, though? That way there might not be a crisis when some smartarse sends a non-latin character ...
What I really meant, though, was: "I didn't mean to come off as pretentious, and thank you for hosting/fixing the bot, which is pretty much the greatest thing ever." I shouldn't have pushed on the whole UTF/codec thing, since obviously you're doing us all a favor by spending a long time fixing the damn thing. As a public service, no less.
I'm so sorry that my one delurk crashed your new toy.
I'm bursting with pride that my one delurk crashed your new toy.
Fixed.
Not Frog! Explainer of (not even ambassador for) Frogs. Big difference.
w-lfs-n, the last time I had to deal with a nightmarish jumble of characters -- processing a large volume of Atom and RSS feeds, many of which misidentified their character sets -- I ended up jamming everything through Tidy first. Python has a libtidy API, right?
"Explainer of Frogs" would be a bitchin' title to have on a business card.
"Frog Whisperer" could've been good, but I think the [x]-whisperer formula is getting a bit played out. Too bad.
Greengage has gone native over there and he doesn't even know it.
Greengage is a she. And she's just been asked to reorder her business cards, and is now thinking of all sorts of titles she could give herself. Thanks, Sifu.
You're not the person behind greenshade are you, Greengage?
I'm hoping 'Greengage' comes from Cold Comfort Farm, and you're planning to comment in a vaguely D.H. Lawrence fashion. But simply explaining frogs will be satisfactory.
I have reluctantly concluded that it will be easier to simply invade all of the other countries on earth and force them to use English than it will be to get UTF-8 support working properly in all the software that needs it. I'm not sure if this is an argument that Ann Coulter has made, but if not she should consider it.
Norba: If you check back into this thread, or notice the Tre Kroner signal we've shone on a low cloud bank the last few nights, can you help us find out whatever happened to Gunhild Larking?
8 and 33:
Your definition is probably more useful, but I was going to answer that a "billet" is big, with lots of information on it, and may be sort of floppy, while a "ticket" is small, cardboardy, and simple.
---Jackmormon's French-English Phenomenological Dictionary.
Oddly, there's no specific name for lady frogs. They're all just frogs.
A bullfrog is a species. There are lady bullfrogs.
63 - There are, and the lady wildcats kicked their asses.
61: huh? Is that me?
and, in re Ben, the thing you want is, I've just remembered, something like "sillystring.encode('latin-1','xmlcharrefreplace')" which will -- if you really want to use latin-1 -- produce something legible for all the extraneous characters. Now I will go off and search for ways to get MS word smart quotes into the feed and really scramble the bugger.
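A quick sketch of what that call does, in Python 3. The variable name sillystring comes from the comment; its contents here are invented for the example.

```python
sillystring = "caf\u00e9 f\u0153tus"

# Characters latin-1 can hold are encoded normally; anything else is
# replaced by a numeric character reference instead of raising an error.
encoded = sillystring.encode("latin-1", "xmlcharrefreplace")
```

The accented e goes through as a single latin-1 byte, while the ligature comes out as the ASCII text `&#339;`, which a browser would render correctly.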
All I can find about the luscious Gunhild is that she came from Jönköping (a town of otherwise remarkable dullness) and was fourth in something or other at the 1956 Melbourne Olympics. High jump, I think -- a score of 1m 67. She hasn't died since 2002: it's hard to get newspaper searches earlier than that.
Sorry about the name, I confused Nworb with Wuggie Norple or something. The town is something nobody else had, thanks.
OK. Further: a scholarly article from the university of Malmö (http://www.idrottsforum.org/articles/tolvhed/tolvhed.html) claims that she was described by a magazine at the time (Melbourne 1956) as "swinging her well-turned thighs with feminine grace over the bar" and later remembered as "the pin-up girl for the whole olympiad".
Sorry about weird charsÖ swedish keyboard layouts have punctuation where they shouldnät.
Also ++ a pdf report from, I think, the Swedish national sport board on "The sexualisation of public space in sport" (great cover, worth a click) http://www.rf.se/files/%7B7E36FB48-BC66-4966-828E-7A65BF267A27%7D.pdf
This is quite as humourless as anyone could hope. It finds a Swedish magazine that described her as a "blonde bombshell" and another which printed pictures that contrasted her favourably with a Russian shot-putter.
She is mentioned in an article on Swedish athletes on page 48 of the 1990 yearbook of her high school, but while the index is on the web, the text isn't. And now I had better do some work.
Oh, and if your workplace objects to nude statues with snow on them, don't click on the pdf link above.
This is quite as humourless as anyone could hope
Wow, that cover! Switching off Tre Kroner and turning on the Ragebunny signal now.
(Sending it as utf-8 is one of the things that causes libpurple clients to display a mess of chinese characters. Perhaps Frowner, M/tch and Emerson can get some use from that, I can't.)
Hey, and, IDP -- it's Tre Kronor. I suppose the shock of defeat prevents Canadians from noticing this detail.
56, 57, 59, 60: No, I'm not greenshade, but thanks for the link. And I'm not Cold Comfort Farm either, though it was one of my earliest favorite reads, long ago before Unfogged. (I read it when I was too young to know that a brassiere -- not doing that accent again -- was just a bra, and so when the woman who collects them is talking about finding a cool three-paneled one I thought perhaps it might be a kind of dresser.) I am a plum, though, and I ripen at this time of year. You've outed me.
Switching off Tre Kroner and turning on the Ragebunny signal now.
Wha? Huh? I got nuthin'. Aaargh.
[nods off again.]
There's a good Swedish restaurant called Tre Kroner in Chicago. It used to have fabulously cute waitresses, but that was ten or twelve years ago.
78: so they're probably still cute, but a little bit too old for you?
I never signed on for grappling with text encoding issues, dammit.
The hell you didn't.
Also, this is a good read, though you probably knew all that already.
There's a good Swedish restaurant called Tre Kroner in Chicago. It used to have fabulously cute waitresses, but that was ten or twelve years ago.
My wife's favorite; we had breakfast there, without the kids, this past Sunday. And the Gunhildicity of the staff is still apparent. But it is, really, spelled Tre Kronor as it should be.
Swedish restaurant called Tre Kroner
My mom loves this place; I like it too. The cute waitress is Slovak.
80: Not knowingly, anyway.
In other news, SUCCESS. One must first decode the deliverances of the database from utf-8, then encode them as utf-16 bigendianly.
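The two-step fix described there might look roughly like this in Python 3. The function name reencode and the sample bytes are assumptions for illustration; the thread doesn't show the bot's real code.

```python
def reencode(raw_from_db: bytes) -> bytes:
    """Decode UTF-8 bytes to text, then encode that text as big-endian UTF-16."""
    text = raw_from_db.decode("utf-8")
    return text.encode("utf-16-be")


# "foetus" spelled with the ligature, as UTF-8 bytes.
payload = reencode(b"f\xc5\x93tus")
```

The point of the intermediate decode is that the text exists as code points, not bytes, between the two encodings.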
OT: I love the terrifying skull with braces that replaced the light-giving dildo amongst the main page icons. Way better than the robot icon I sent you, ogged.
Gosh, that was a lot of Chinese.
I love the terrifying skull with braces
My morning: not wasted! Consider it a placeholder until someone is moved to create a better one.
It sure does make me feel dumb that the reason for the last ten minutes of mistakes was that, even though I knew that I had to use utf-16-be, I left out the "-be" part.
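For what it's worth, here is one visible difference between the two codec names in Python 3. Whether this is exactly what caused the ten minutes of mistakes is a guess; the comment only says the "-be" was missing.

```python
import codecs

oe = "\u0153"

# Explicit byte order: just the two bytes of the code unit.
plain_be = oe.encode("utf-16-be")

# Plain "utf-16": a byte order mark comes first, and the byte order
# itself depends on the machine doing the encoding.
with_bom = oe.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

A receiver expecting bare big-endian data would read that leading mark as an extra character, and on a little-endian machine every following code unit would be byte-swapped as well.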
You've got fœtus on your breath, Sifu.
I'm still not going to parse out the html entities, Sifu.
The cute waitress is Slovak.
This is now a universal truth across much of the UK.
As I discovered to my cost, when I made a joke in Czech and discovered the Czech words I was using were 'baby' Czech and really embarrassing for an adult man to be using.
Ah yes, the bot icon is teh cool. Nice job.
The bot icon is cool. Did ogged do that?
He claims he did. Why, do you know different?
My robot icon was better.
Not better than 100, though. That's genius.
85 worked for me if it was that o+e thingy.