Unfogged: Comment on Why did the bot stop working today?

1

For not-want of an accent, the unfoggedbot was lost.

That's a pretty grave accent aigu error.

Posted by: arthegall | Link to this comment | 09-18-07 10:57 PM

2

You could call it an acute condition.

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:08 PM

3

DIE-acritic!

Posted by: Stanley | Link to this comment | 09-18-07 11:09 PM

4

Text parsing, Ben.

Wave of the future.

Posted by: Beefo Meaty | Link to this comment | 09-18-07 11:12 PM

5

So the unfoggedbot couldn't be arsed with fancy characters.

Posted by: teofilo | Link to this comment | 09-18-07 11:12 PM

6

HTML entities were good enough for Jesus and they're good enough for me, Sifu.

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:13 PM

7

6: yeah look what happened to him, though.

Posted by: Beefo Meaty | Link to this comment | 09-18-07 11:14 PM

8

I'm just surprised that it's ticket and not billet. Do the French favor the former, or do the two have subtly different meanings?

Posted by: Otto von Bisquick | Link to this comment | 09-18-07 11:15 PM

9

Google I feel lucky result on "Python HTML entity parser."

Posted by: Beefo Meaty | Link to this comment | 09-18-07 11:16 PM

10

Second result.

Posted by: Beefo Meaty | Link to this comment | 09-18-07 11:17 PM

11

What you really want is the htmlentitydefs module, if the problem is entities. However, that comment wasn't written with entities. Had it been, there would have been no problems.

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:17 PM
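
For reference, a minimal sketch of what entity handling might look like, assuming Python 3, where htmlentitydefs lives on as html.entities. This is not the bot's code; the sample string is made up for illustration.

```python
import html
from html.entities import name2codepoint

# Look up one named entity by hand: "eacute" -> code point 233.
eacute = chr(name2codepoint["eacute"])

# Or let the standard library resolve every entity in a string at once.
# (The sample text here is hypothetical.)
decoded = html.unescape("un ticket de m&eacute;tro, un f&oelig;tus")

# ascii() keeps the output printable on any terminal encoding.
print(ascii(eacute), ascii(decoded))
```

As 11 says, this only helps when the comment actually contains entities; a raw non-ASCII character goes straight past it.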

12

Uh, second result.

Posted by: Beefo Meaty | Link to this comment | 09-18-07 11:17 PM

13

After all, all of the characters &, e, a, c, u, t, e, and ; (pretend those are mentioned and not used) are perfectly good ascii already, Sifu, so why would they cause problems? Do try to keep up.

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:18 PM

14

Welcome to four days ago, Tweety.

Posted by: Josh | Link to this comment | 09-18-07 11:19 PM

15

14: that was a lovely time.

I still refuse to believe that basic ASCII text parsing is beyond the scope of Ben's prodigious gifts; if it's converting to ASCII then you're going to have discrete strings, and somebody must have run across this.

Posted by: Beefo Meaty | Link to this comment | 09-18-07 11:21 PM

16

Even if nobody has run across this problem, of course, it would be a simple matter of taking the XHTML entity documentation and parsing that, such that one ended up with a formatted list of HTML entity strings, suitable for dumping into one's bot, for parsing.

God knows Ben is way ahead of me on this one.

Posted by: Beefo Meaty | Link to this comment | 09-18-07 11:25 PM

17

You mean like a codec, sifu?

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:25 PM

18

I suppose I do, sure.

Posted by: Beefo Meaty | Link to this comment | 09-18-07 11:26 PM

19

Are you proposing codecs for text strings?? Almost as bad an idea as this.

Posted by: arthegall | Link to this comment | 09-18-07 11:27 PM

20

The problem isn't that I don't know what to do with œ (that is, & o e l i g ;, no spaces); the problem is that it broke on œ (that is, LATIN SMALL LIGATURE OE).

>>> print u'\u0153'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0153' in position 0: ordinal not in range(256)
>>>

It will be interesting to see how this comment gets transmitted, containing the characters it does.

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:29 PM
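
The failure in 20 can be reproduced with an explicit encode. A rough sketch, assuming Python 3 (the session above is Python 2, where print encoded implicitly with the terminal's codec); the variable names are illustrative only.

```python
e_acute = "\u00e9"   # é, code point 233
oe_lig = "\u0153"    # œ, code point 339

# é sits inside latin-1's 0-255 range, so it encodes to a single byte.
print(e_acute.encode("latin-1"))

# œ does not, so the same call raises.
failure = None
try:
    oe_lig.encode("latin-1")
except UnicodeEncodeError as err:
    failure = err
    print("cannot encode:", err.reason)
```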

21

19: never prevent Ben from applying more complexity to a problem. No better way to make the poor fellow lose interest.

Posted by: Beefo Meaty | Link to this comment | 09-18-07 11:30 PM

22

And here I was thinking that the answer to the titular question was that it had become sentient and gone on strike until Ben started paying it a living wage.

Posted by: washerdreyer | Link to this comment | 09-18-07 11:30 PM

23

And, in fact, it got busted.

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:30 PM

24

19: no, I'm not.

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:32 PM

25

And, in fact, it got busted.

That is, the bot got busted trying to transmit it. The earlier solution with é worked; œ, no dice. The latter must be more esoteric.

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:34 PM

26

You should probably hire the Pinkertons.

Posted by: washerdreyer | Link to this comment | 09-18-07 11:35 PM

27

You should probably listen to Pinkerton. That was an underrated album.

Posted by: arthegall | Link to this comment | 09-18-07 11:36 PM

28

Also, Ben: your 24 says 'no', but your 20's "Latin-1 codec" snippet says 'yes'. Maybe in 19 "proposing" s/b "using."

Posted by: arthegall | Link to this comment | 09-18-07 11:38 PM

29

Pinkerton is possibly the least underrated album ever recorded. I have never seen a single music critic say anything negative about it. I have never heard anyone say it is anything less than 400,000% better than anything Weezer has done since then.

Posted by: Cryptic Ned | Link to this comment | 09-18-07 11:38 PM

30

The latter must be more esoteric.

...and the former does in fact fall in [0,256).

28: I thought Sifu was proposing something that would decode "é" by, say, "e" (close enough!). I don't really understand the import of 19, I guess.

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:42 PM

31

29: All Rivers come to an end.

Posted by: Stanley | Link to this comment | 09-18-07 11:43 PM

32

What do you mean, for instance, by "text string"?

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:44 PM

33

8: Ah, answered my own question via Google. In case there are any other non-Francophones out there:

billet = train or plane ticket; ticket = métro or bus ticket

Source.

Posted by: Otto von Bisquick | Link to this comment | 09-18-07 11:47 PM

34

My east-coast privilege (which has me commenting, instead of sleeping, at 1:50am) allows me to refer to them as "text" strings instead of what I suppose to be the more proper, "character strings."

Posted by: arthegall | Link to this comment | 09-18-07 11:53 PM

35

Where a character is what, an ascii character, a utf-8 character, a utf-16 character, or what? There is no bare "character". If you mean byte strings, then your 28 is off base. 339 does not fit in a byte. And while 339 is a character according to utf-16, it's not a character according to utf-8, but rather two: 0xC5, and 0x93.

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:57 PM
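
The numbers in 35 are easy to check. A small sketch, assuming Python 3. As far as I can tell, what comes out is one code point (339) that UTF-8 spells with two bytes and UTF-16 with one 16-bit unit, which may be part of what 43 means by "wrong on some things."

```python
oe_lig = "\u0153"  # LATIN SMALL LIGATURE OE

code_point = ord(oe_lig)             # 339, too large for one 8-bit byte
utf8 = oe_lig.encode("utf-8")        # two bytes: 0xC5 0x93
utf16 = oe_lig.encode("utf-16-be")   # one 16-bit unit: 0x01 0x53

print(code_point, utf8.hex(), utf16.hex())
```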

36

(Anyway it doesn't fit in an 8-bit byte.)

Posted by: ben w-lfs-n | Link to this comment | 09-18-07 11:58 PM

37

What, exactly, is the difference between ascii and utf-8? Wait, don't answer that.

By "character," Ben, I probably mean something closer to "glyph." It's the thing I see on the screen, which you can encode any which way.

I'm obviously not trying to put down the idea of encoding things as other things, in general. My 19 was meant somewhat as a joke, but mainly as an excuse to give that link to a page about Binary XML (which I once heard someone suggest would lead to a world where we had "codecs for XML," which strikes me as a uniquely awful idea). That is all.

Posted by: arthegall | Link to this comment | 09-19-07 12:06 AM

38

& now I am curious, for what do you think a codec is appropriate?

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 12:06 AM

39

38: compressed video formats, of course.

Posted by: arthegall | Link to this comment | 09-19-07 12:07 AM

40

or audio, or still images ...

Posted by: imposter syndrome | Link to this comment | 09-19-07 12:12 AM

41

The anagram results for "codec ben w-lfs-n" include:

blown condo feces

snob concede wolf

So there's that.

Posted by: Stanley | Link to this comment | 09-19-07 12:21 AM

42

simple anagram:

bén w-lfs-n -> né wolfsnob

Posted by: Cryptic Ned | Link to this comment | 09-19-07 12:31 AM

43

Ah. But I can't properly be said to be proposing latin-1, for example. Such things are already with us.

(35 is wrong on some things, btw.)

By "character," Ben, I probably mean something closer to "glyph." It's the thing I see on the screen, which you can encode any which way.

I don't really see the conceptual difference between taking the bytes in a file and representing them as letters and taking the bytes in a file and representing them as a series of images. I mean, plainly there aren't actually glyphs, or even letters, in the file, any more than there are actually files on a hard drive. All of this is added by the ~~faculty of the understanding~~programs that interpret things for us. And if there are multiple ways of interpreting a sequence of bytes as glyphs, if you can mark that into the sequence, then you'll be able to translate them. (Of course, if some are supersets of others, and you deal with them in such a way that you don't just have sequences of bytes but also the information that here is a character, you'll get failures, like the failure the latin-1 codec had above—it only recognizes characters that are one byte long, and it's told that some two-byte sequence is a character. Well, that won't work. But it does explain why encoding the two-byte character as two one-byte utf-8 things allows it at least to be printed, albeit as nonsense: "Å". And that at least allows us to proceed with only a minor stumble.)

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 12:37 AM
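
The "Å" nonsense in 43 is what you would expect when UTF-8 bytes get read back one byte at a time. A sketch, assuming Python 3; whether the second byte shows up as anything visible depends on which single-byte codec the reader uses, and the two tried here are only guesses at what the client did.

```python
utf8_bytes = "\u0153".encode("utf-8")      # b'\xc5\x93'

# latin-1 maps every byte to some character, so decoding never fails:
# 0xC5 becomes Å and 0x93 becomes an invisible control character.
as_latin1 = utf8_bytes.decode("latin-1")

# Windows-1252 reads 0x93 as a left curly quote instead.
as_cp1252 = utf8_bytes.decode("cp1252")

# ascii() keeps the output printable on any terminal encoding.
print(ascii(as_latin1), ascii(as_cp1252))
```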

44

I wonder why the nonsense got doubled up when the bot reported 43. Something up in the database too? I wonder, but I don't really care.

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 12:38 AM

45

Speaking as someone whose pathetic bits of python code are littered with unicode error catching sequences, most containing the word "Fuck" and its derivatives, you have my sympathy. Can't you send stuff as UTF-8 instead, though? That way there might not be a crisis when some smartarse sends a non-latin character ...

Posted by: Nworb Werdna | Link to this comment | 09-19-07 12:58 AM

46

What I really meant, though, was: "I didn't mean to come off as pretentious, and thank you for hosting/fixing the bot, which is pretty much the greatest thing ever." I shouldn't have pushed on the whole UTF/codec thing, since obviously you're doing us all a favor by spending a long time fixing the damn thing. As a public service, no less.

Posted by: arthegall | Link to this comment | 09-19-07 1:06 AM

47

I'm so sorry that my one delurk crashed your new toy.

Posted by: Greengage | Link to this comment | 09-19-07 6:04 AM

48

I'm ~~so sorry~~ bursting with pride that my one delurk crashed your new toy.

Fixed.

Posted by: DS | Link to this comment | 09-19-07 6:18 AM

49

Frog.

Posted by: John Emerson | Link to this comment | 09-19-07 6:19 AM

50

Not Frog! Explainer of (not even ambassador for) Frogs. Big difference.

Posted by: Greengage | Link to this comment | 09-19-07 6:52 AM

51

w-lfs-n, the last time I had to deal with a nightmarish jumble of characters -- processing a large volume of Atom and RSS feeds, many of which misidentified their character sets -- I ended up jamming everything through Tidy first. Python has a libtidy API, right?

Posted by: snarkout | Link to this comment | 09-19-07 7:11 AM

52

"Explainer of Frogs" would be a bitchin' title to have on a business card.

Posted by: Beefo Meaty | Link to this comment | 09-19-07 7:24 AM

53

"Frog Whisperer" could've been good, but I think the [x]-whisperer formula is getting a bit played out. Too bad.

Posted by: DS | Link to this comment | 09-19-07 7:26 AM

54

Greengage has gone native over there and he doesn't even know it.

Posted by: John Emerson | Link to this comment | 09-19-07 7:31 AM

55

Greengage is a she. And she's just been asked to reorder her business cards, and is now thinking of all sorts of titles she could give herself. Thanks, Sifu.

Posted by: Greengage | Link to this comment | 09-19-07 7:53 AM

56

You're not the person behind greenshade are you, Greengage?

Posted by: ogged | Link to this comment | 09-19-07 7:57 AM

57

I'm hoping 'Greengage' comes from Cold Comfort Farm, and you're planning to comment in a vaguely D.H. Lawrence fashion. But simply explaining frogs will be satisfactory.

Posted by: LizardBreath | Link to this comment | 09-19-07 8:10 AM

58

I have reluctantly concluded that it will be easier to simply invade all of the other countries on earth and force them to use English than it will be to get UTF-8 support working properly in all the software that needs it. I'm not sure if this is an argument that Ann Coulter has made, but if not she should consider it.

Posted by: Tom | Link to this comment | 09-19-07 8:22 AM

59

Greengage is a plum, too.

Posted by: A White Bear | Link to this comment | 09-19-07 8:23 AM

60

Isn't she, though!

Posted by: Beefo Meaty | Link to this comment | 09-19-07 8:24 AM

61

Norba: If you check back into this thread, or notice the Tre Kroner signal we've shone on a low cloud bank the last few nights, can you help us find out whatever happened to Gunhild Larking?

Posted by: I don't pay | Link to this comment | 09-19-07 8:31 AM

62

8 and 33:
Your definition is probably more useful, but I was going to answer that a "billet" is big, with lots of information on it, and may be sort of floppy, while a "ticket" is small, cardboardy, and simple.

---Jackmormon's French-English Phenomenological Dictionary.

Posted by: Jackmormon | Link to this comment | 09-19-07 8:38 AM

63

Oddly, there's no specific name for lady frogs. They're all just frogs.

A bullfrog is a species. There are lady bullfrogs.

Posted by: John Emerson | Link to this comment | 09-19-07 8:43 AM

64

63 - There are, and the lady wildcats kicked their asses.

Posted by: snarkout | Link to this comment | 09-19-07 8:51 AM

65

61: huh? Is that me?
and, in re Ben, the thing you want is, I've just remembered, something like "sillystring.encode('latin-1','xmlcharrefreplace')" which will -- if you really want to use latin-1 -- produce something legible for all the extraneous characters. Now I will go off and search for ways to get MS word smart quotes into the feed and really scramble the bugger.

Posted by: Nworb Werdna | Link to this comment | 09-19-07 9:38 AM
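
A quick sketch of the suggestion in 65, assuming Python 3; the input string is hypothetical. Characters that fit in latin-1 pass through as single bytes, and anything else becomes a numeric character reference instead of an exception.

```python
silly_string = "f\u0153tus \u00e0 la carte"   # hypothetical input

# à fits in latin-1 and stays one byte; œ becomes the reference &#339;
safe = silly_string.encode("latin-1", "xmlcharrefreplace")

print(safe)
```

The catch is that the receiving end has to render the references for the result to be legible.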

66

All I can find about the luscious Gunhild is that she came from Jönköping (a town of otherwise remarkable dullness) and was fourth in something or other at the 1956 Melbourne Olympics. High jump, I think -- a score of 1m 67. She hasn't died since 2002: it's hard to get newspaper searches earlier than that.

Posted by: Nworb Werdna | Link to this comment | 09-19-07 9:52 AM

67

Sorry about the name, I confused Nworb with Wuggie Norple or something. The town is something nobody else had, thanks.

Posted by: I don't pay | Link to this comment | 09-19-07 9:57 AM

68

Nworb is a genius. Thanks.

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 10:22 AM

69

OK. Further: a scholarly article from the university of Malmö (http://www.idrottsforum.org/articles/tolvhed/tolvhed.html) claims that she was described by a magazine at the time (Melbourne 1956) as "swinging her well-turned thighs with feminine grace over the bar" and later remembered as "the pin-up girl for the whole olympiad".

Sorry about weird charsÖ swedish keyboard layouts have punctuation where they shouldnät.

Also ++ a pdf report from, I think, the Swedish national sport board on "The sexualisation of public space in sport" (great cover, worth a click) http://www.rf.se/files/%7B7E36FB48-BC66-4966-828E-7A65BF267A27%7D.pdf

This is quite as humourless as anyone could hope. It finds a Swedish magazine that described her as a "blonde bombshell" and another which printed pictures that contrasted her favourably with a Russian shot-putter.

She is mentioned in an article on Swedish athletes on page 48 of the 1990 yearbook of her highschool, but while the index is on the web, the text isn't. And now I had better do some work.

Posted by: Nworb Werdna | Link to this comment | 09-19-07 10:23 AM

70

oh, and if your workplace objects to nude statues with snow on them, don't click on the pdf link above.

Posted by: Nworb Werdna | Link to this comment | 09-19-07 10:25 AM

71

This is quite as humourless as anyone could hope

Wow, that cover! Switching off Tre Kroner and turning on the Ragebunny signal now.

Posted by: I don't pay | Link to this comment | 09-19-07 10:29 AM

72

(Sending it as utf-8 is one of the things that causes libpurple clients to display a mess of chinese characters. Perhaps Frowner, M/tch and Emerson can get some use from that, I can't.)

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 10:30 AM

73

72: Racist.

Posted by: M/tch M/lls | Link to this comment | 09-19-07 10:32 AM

74

Hey, and, IDP -- it's Tre Kronor. I suppose the shock of defeat prevents Canadians from noticing this detail.

Posted by: Nworb Werdna | Link to this comment | 09-19-07 12:18 PM

75

Yeah, that's it.

Posted by: I don't pay | Link to this comment | 09-19-07 12:19 PM

76

56, 57, 59, 60: No, I'm not greenshade, but thanks for the link. And I'm not Cold Comfort Farm either, though it was one of my earliest favorite reads, long ago before Unfogged. (I read it when I was too young to know that a brassiere -- not doing that accent again -- was just a bra, and so when the woman who collects them is talking about finding a cool three-paneled one I thought perhaps it might be a kind of dresser.) I am a plum, though, and I ripen at this time of year. You've outed me.

Posted by: Greengage | Link to this comment | 09-19-07 12:24 PM

77

Switching off Tre Kroner and turning on the Ragebunny signal now.

Wha? Huh? I got nuthin'. Aaargh.
[nods off again.]

Posted by: mcmc | Link to this comment | 09-19-07 12:38 PM

78

There's a good Swedish restaurant called Tre Kroner in Chicago. It used to have fabulously cute waitresses, but that was ten or twelve years ago.

Posted by: ogged | Link to this comment | 09-19-07 12:50 PM

79

78: so they're probably still cute, but a little bit too old for you?

Posted by: Beefo Meaty | Link to this comment | 09-19-07 12:52 PM

80

I never signed on for grappling with text encoding issues, dammit.

The hell you didn't.

Also, this is a good read, though you probably knew all that already.

Posted by: Hamilton-Lovecraft | Link to this comment | 09-19-07 1:39 PM

81

There's a good Swedish restaurant called Tre Kroner in Chicago. It used to have fabulously cute waitresses, but that was ten or twelve years ago.

My wife's favorite; we had breakfast there, without the kids, this past Sunday. And the Gunhildicity of the staff is still apparent. But it is, really, spelled Tre Kronor as it should be.

Posted by: I don't pay | Link to this comment | 09-19-07 1:49 PM

82

Swedish restaurant called Tre Kroner

My mom loves this place; I like it too. The cute waitress is Slovak.

Posted by: lw | Link to this comment | 09-19-07 1:59 PM

83

80: Not knowingly, anyway.

In other news, SUCCESS. One must first decode the deliverances of the database from utf-8, then encode them as utf-16 bigendianly.

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 2:44 PM
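
The two-step fix in 83 might look roughly like this in Python 3. The function name and the sample row are invented for illustration; this is a sketch of the idea, not the bot's actual code.

```python
def transcode(db_bytes: bytes) -> bytes:
    """Decode UTF-8 bytes, then re-encode them as big-endian UTF-16."""
    text = db_bytes.decode("utf-8")
    return text.encode("utf-16-be")

from_db = b"F\xc5\x93tus!"   # hypothetical database row: "Fœtus!" in UTF-8
wire = transcode(from_db)

# Every character now takes two bytes, high byte first.
print(wire.hex())
```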

84

OT: I love the terrifying skull with braces that replaced the light-giving dildo amongst the main page icons. Way better than the robot icon I sent you, ogged.

Posted by: Beefo Meaty | Link to this comment | 09-19-07 2:46 PM

85

Test: œ.

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 2:46 PM

86

Hm, that didn't work.

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 2:46 PM

87

Gosh, that was a lot of Chinese.

Posted by: Nathan Williams | Link to this comment | 09-19-07 2:47 PM

88

You need more bigendæ, ben.

Posted by: Beefo Meaty | Link to this comment | 09-19-07 2:48 PM

89

I love the terrifying skull with braces

My morning: not wasted! Consider it a placeholder until someone is moved to create a better one.

Posted by: ogged | Link to this comment | 09-19-07 2:50 PM

90

Fœtus!

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 2:56 PM

91

Huzzah!

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 2:57 PM

92

It sure does make me feel dumb that the reason for the last ten minutes of mistakes was that, even though I knew that I had to use utf-16-be, I left out the "-be" part.

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 2:58 PM
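
The missing "-be" matters more than it looks. A sketch, assuming Python 3: plain "utf-16" prepends a byte-order mark and uses the machine's native byte order, while "utf-16-be" writes no mark and always puts the high byte first. One possible reading of the Chinese in 87 is that little-endian output read as big-endian lands in the CJK range, but that is a guess.

```python
import codecs

text = "\u0153"

with_bom = text.encode("utf-16")        # BOM first, then native byte order
big_endian = text.encode("utf-16-be")   # no BOM, always 0x01 0x53

# On a little-endian machine the first value starts with fffe.
print(with_bom.hex(), big_endian.hex())
```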

93

You've got fœtus on your breath, Sifu.

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 2:59 PM

94

I am the Ω of baby-eating.

Posted by: Beefo Meaty | Link to this comment | 09-19-07 3:04 PM

95

I'm still not going to parse out the html entities, Sifu.

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 3:05 PM

96

The cute waitress is Slovak.

This is now a universal truth across much of the UK.

As I discovered to my cost, when I made a joke in Czech and discovered the Czech words I was using were 'baby' Czech and really embarrassing for an adult man to be using.

Posted by: nattarGcM ttaM | Link to this comment | 09-19-07 3:06 PM

97

Ah yes, the bot icon is teh cool. Nice job.

Posted by: bitchphd | Link to this comment | 09-19-07 3:13 PM

98

The bot icon is cool. Did ogged do that?

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 3:14 PM

99

He claims he did. Why, do you know different?

Posted by: bitchphd | Link to this comment | 09-19-07 3:22 PM

100

I do not know anything.

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 3:24 PM

101

Best comment #100 ever.

Posted by: bitchphd | Link to this comment | 09-19-07 3:30 PM

102

Pah.

Posted by: ben w-lfs-n | Link to this comment | 09-19-07 3:31 PM

103

My robot icon was better.

Not better than 100, though. That's genius.

Posted by: Beefo Meaty | Link to this comment | 09-19-07 3:50 PM

104

85 worked for me if it was that o+e thingy.

Posted by: John Emerson | Link to this comment | 09-19-07 4:55 PM
