actually I love this hypothetical conlang that is English where the dative case is marked by putting an acute accent over the last letter of a word and adding -s
@jfriedhoff mastodon has always been like this (scroll back through my feed and look at literally any post about computer programming). not sure what to do about it. the main benefit is that on mastodon you only get a handful of toots about whatever topic you were talking about instead of threads that go on and on for days where the most important person in the world involved in that topic gets tagged and you get notifications about the whole thing weeks and weeks later
also lots of files (>1%?) that say "yeah I'm utf8 sure whatever" but are actually ISO-8859-1 (according to chardet at least)
lessons learned: (a) never trust someone's claim about the encoding of a text file (b) character encodings are bad and trying to digitize text in the first place was a bad idea
if you were going to download this today, maybe hold off—I found a frustrating bug where some utf8-encoded texts were being decoded incorrectly with a different encoding, leading to hilarious mojibake when they came out the other end—will post a fix in a few hrs
none of this would be a problem if the reported charset in the metadata was always the correct charset. but there are a lot of texts that report "us-ascii" when what they really mean is "ascii with occasional 8-bit chars just for fun!"
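for anyone hitting the same thing, a minimal sketch of a "don't trust the declared charset" decoder. this is not the actual fix from the project, just one common approach using only the standard library; the function name `decode_text` and the fallback order (strict utf-8, then the declared charset, then latin-1) are my assumptions for illustration

```python
def decode_text(raw, declared="utf-8"):
    """Return (text, encoding_used) for some bytes of uncertain encoding.

    Hypothetical helper: tries strict UTF-8 first, then the charset the
    metadata claims, then Latin-1 as a last resort.
    """
    for enc in ("utf-8", declared):
        try:
            # decode() is strict by default, so bytes that don't fit raise
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # LookupError covers metadata naming a charset Python doesn't know
            continue
    # Latin-1 maps every byte to a code point, so this cannot fail,
    # though it can still be the wrong guess for the file
    return raw.decode("latin-1"), "latin-1"


text, used = decode_text(b"caf\xe9", declared="us-ascii")
print(used, text)
```

trying utf-8 first is a judgment call: random 8-bit text is very unlikely to be valid utf-8 by accident, whereas decoding real utf-8 with a single-byte charset always "succeeds" and quietly produces mojibake.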
do you think there's a literal physical knob at twitter they can use to control how many "haha this tweet really blew up, please check out my other thing" tweets get posted every day
"Facebook, Twitter, and Instagram are now so large that they are considered 'unmoderatable' communities. We like to pretend this was a pure facet of their size, but it is inescapably a part of their ethos. They are platforms forged in the fires of troll culture, founded and operated by techno-libertarians who didn’t understand why they had to care about any of this." https://www.theverge.com/2018/7/12/17561768/dont-feed-the-trolls-online-harassment-abuse
remembering the late nineties when I had a "microblog" by which I mean "nested <ul>s on a web page that I manually added <li>s to on occasion" those were the days
realized you could make a sort of markov chain text generator using concatenated word vectors instead of the tokens themselves, which has the benefit of being able to cope pretty well with out-of-vocab strings. anyway, here's word-vector-markov Jane Austen elaborating on what the Internet is
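a toy sketch of the idea, not the code that produced the output: the state is the concatenated vectors of the previous n tokens, and generation looks up the nearest stored state instead of an exact token match. the 2-d vectors in `VECS`, the names `state`, `build` and `next_token`, and the nearest-neighbor lookup are all made up for illustration; real vectors would come from a pretrained model

```python
import math
import random

# made-up 2-d "word vectors" (assumption: stand-ins for real pretrained vectors)
VECS = {
    "the": (0.0, 1.0),
    "cat": (1.0, 0.0),
    "dog": (0.9, 0.1),  # never appears in the training text below
    "sat": (0.5, 0.5),
    "ran": (0.4, 0.6),
}


def state(tokens, vecs):
    """Concatenate the vectors of the context tokens into one flat tuple."""
    return tuple(x for t in tokens for x in vecs[t])


def build(tokens, vecs, n=2):
    """Return a list of (context state vector, next token) pairs."""
    return [(state(tokens[i:i + n], vecs), tokens[i + n])
            for i in range(len(tokens) - n)]


def next_token(context, model, vecs, rng):
    """Sample a continuation from the stored context(s) nearest to this one."""
    q = state(context, vecs)
    dists = [math.dist(q, s) for s, _ in model]
    best = min(dists)
    candidates = [tok for d, (_, tok) in zip(dists, model)
                  if math.isclose(d, best)]
    return rng.choice(candidates)


model = build("the cat sat the cat ran".split(), VECS)
# "the dog" was never seen, but its vector lands next to "the cat"
print(next_token(["the", "dog"], model, VECS, random.Random(0)))
```

the unseen context still gets a continuation because it is compared by distance rather than by exact match, which is the out-of-vocab benefit; a plain token-based chain would have nothing to look up here.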