Now you have [1-9]{1,9} problems.
February 3, 2020 4:09 PM

iHateRegex, a regex cheatsheet for the haters.

Visual explanation of some useful and some truly horrifying regex expressions, such as this abomination.
posted by signal (59 comments total) 58 users marked this as a favorite
 
abomination

It's not that much worse than the one for date format
posted by thelonius at 4:24 PM on February 3, 2020 [3 favorites]


The worst thing about those is that I understand them
posted by mystyk at 4:36 PM on February 3, 2020 [16 favorites]


Yeah, they're horrible, but there's a certain queasy pleasure in seeing how all the tiny cogs and gears fit together.
posted by thatwhichfalls at 4:47 PM on February 3, 2020 [1 favorite]


No. Stop. The way people try to pack matching for multiple formats into a single regex is just fucking maddening.

It's not 1972 anymore. We have plenty of memory and processing power. Are there ten formats that people use to enter phone numbers? Great! Write ten simple regexes, and run them in order of priority. I mean, fuuuuuuuuck, come the fuck oooonnnnnn.
posted by phooky at 5:06 PM on February 3, 2020 [44 favorites]
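
A minimal sketch of the "several simple regexes, tried in priority order" idea from the comment above. The four formats and the names PHONE_PATTERNS and match_phone are hypothetical, chosen only for illustration; a real list would depend on the numbers you actually expect to receive.

```python
import re

# Hypothetical formats, tried in order of priority; each pattern stays simple.
PHONE_PATTERNS = [
    ("international", re.compile(r"\+(\d{1,3})[ -]?(\d{3})[ -]?(\d{3})[ -]?(\d{4})")),
    ("us_parens", re.compile(r"\((\d{3})\) ?(\d{3})-(\d{4})")),
    ("us_dashes", re.compile(r"(\d{3})-(\d{3})-(\d{4})")),
    ("us_plain", re.compile(r"(\d{10})")),
]

def match_phone(text):
    """Return (format_name, digits) for the first pattern matching the whole input, else None."""
    candidate = text.strip()
    for name, pattern in PHONE_PATTERNS:
        m = pattern.fullmatch(candidate)
        if m:
            return name, "".join(m.groups())
    return None
```

Adding an eleventh format then means appending one line to the list, and a failing input can be debugged against one short pattern at a time.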


Abomination!

Using a regex to recognize dates is bad enough, but using a regex to validate dates is a crime against programming!

As for phone numbers, simply check for a leading plus sign, then discard all non-digits. The regex does more harm than good!
posted by monotreme at 5:18 PM on February 3, 2020 [11 favorites]
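
The "check for a leading plus sign, then discard all non-digits" approach from the comment above could look roughly like this. The function name normalize_phone is an assumption, and the sketch normalizes rather than validates: it says nothing about whether the resulting number is dialable.

```python
import re

def normalize_phone(raw):
    """Keep a leading '+' if present, then drop every non-digit character."""
    stripped = raw.strip()
    prefix = "+" if stripped.startswith("+") else ""
    return prefix + re.sub(r"\D", "", stripped)
```

Note that this also throws away characters such as # and *, so it would not suit input where those carry meaning.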


I got [9]{2,2} problems but a regex ain't one.
posted by benzenedream at 5:19 PM on February 3, 2020 [12 favorites]


We have plenty of memory and processing power. Are there ten formats that people use to enter phone numbers? Great! Write ten simple regexes, and run them in order of priority. I mean, fuuuuuuuuck, come the fuck oooonnnnnn.

Actually, I think this approach would execute faster and have a smaller memory footprint anyway. And that's before you even think about parallelizing the tests.
posted by tobascodagama at 5:19 PM on February 3, 2020 [7 favorites]


It's apparent that there's value to using a specification for a regular grammar to validate character sequences, but what I'm becoming increasingly skeptical of is the general utility of an ultra-terse specification language like regex as a means to specify those. It's my impression that many people find the FSMs that a regex represents to be fairly understandable when represented either visually or more verbosely, and so I wonder if it would be beneficial for readability and debuggability to be able to specify regular grammars explicitly in code in terms of states and transitions, with some backend capable of compiling those down into an efficient low-level representation.
posted by invitapriore at 5:30 PM on February 3, 2020 [6 favorites]


The worst way for matching text except for all the others.
posted by feloniousmonk at 5:41 PM on February 3, 2020 [3 favorites]


To get to the really bizarre stuff, you need to look at PCRE.

Bonus: many of those don't relate to, or compose with, theory about automata in any meaningful way.
posted by silentbicycle at 5:47 PM on February 3, 2020


I was first exposed to Zalgo by that StackOverflow post about why you should never try to parse HTML with regex.

I half love regex. (The other half is hate. This post is germane to my interests).
posted by aspersioncast at 5:52 PM on February 3, 2020 [6 favorites]


Similar, but less flowchart-y: regexr.com
posted by endquote at 5:52 PM on February 3, 2020 [1 favorite]


I think every company that employs developers should also hire a person whose title is Regular Expressionist. They would spend their days writing all the regexes the developers need and be paid very handsomely.
posted by bendy at 5:57 PM on February 3, 2020 [14 favorites]


I really like this kind of thing, I find it super useful because I'm a very visual learner and dealing with regular expressions is kind of like the intellectual equivalent of fingernails on a chalkboard for me.

Recently, I actually got so frustrated with trying to write a regex the old-fashioned way (i.e., changing stuff around randomly until it seems like it's working) that I gave up and started using Verbal Expressions. I mean, I'm sure it won't cover *your* use case, but it did wonders for *my* sanity.
posted by the painkiller at 6:03 PM on February 3, 2020 [5 favorites]


I always love to break out the classics!
posted by lkc at 6:22 PM on February 3, 2020 [2 favorites]


Oh I like this. I was first exposed to the horrors of regex about a month ago because I've started doing a lot of scripting work with JMP (statistical analysis and graphing software), and my first go was a complete failure. JMP has some simplified, regex-adjacent built in functions and I hacked my way through with those.
posted by MillMan at 6:23 PM on February 3, 2020 [1 favorite]


Everybody stand back, this is meta.
posted by signal at 6:31 PM on February 3, 2020


/* Additional note: this code is very heavily munged from Henry's version
* in places. In some spots I've traded clarity for efficiency, so don't
* blame Henry for some of the lack of readability.
*/
posted by lkc at 6:40 PM on February 3, 2020


As Churchill famously said: "Many forms of string search have been tried, and will be tried in this world of sin and woe. No one pretends that regex is perfect or all-wise. Indeed it has been said that regex is the worst form of text parsing, except for all those other forms that have been tried..."
posted by gwint at 6:42 PM on February 3, 2020 [3 favorites]


i see these and always wonder why grok isn't more widely used. it's another level of indirection, but when you have to deal with complicated regex, sometimes deeply nested, it's the ideal tool for the job.
posted by rye bread at 6:55 PM on February 3, 2020


To get to the really bizarre stuff, you need to look at PCRE.

PCRE is insane. With subroutines and recursion it's basically a programming language.

Bonus: many of those don't relate to, or compose with, theory about automata in any meaningful way.

Really, you've left the realm of regular languages as soon as you add back references—effectively, none of the widely used regular expression dialects are regular expressions in the original sense.

One possible alternative to this madness is the Rosie Pattern Language.
posted by thedward at 7:05 PM on February 3, 2020 [3 favorites]


Yes, it's a huge grab bag of extensions that probably sounded like a good idea at the time, but have become a big sprawling mess. I've implemented some of them for another regex engine, so I've spent some quality time with that documentation. I tried to find the operator that says, "if you get here, start over" but I'm on my phone. Edit: Found it! \K! Whyyyyyy?

Rosie is built on top of PEGs, which are nice, but have their own issues. Ordered choice can become really hard to reason about when combining individually simple PEG expressions.
posted by silentbicycle at 7:13 PM on February 3, 2020


Are there ten formats that people use to enter phone numbers? Great! Write ten simple regexes, and run them in order of priority.

It would indeed be madness to attempt to write a new giant regex to recognize, parse, or validate multiple formats of phone numbers.

On the other hand, if the giant regex in a library like this does what you need, and it's already more thoroughly tested and debugged than your own software is ever likely to be, maybe even actively maintained by someone -- would it really be rational to ignore those advantages and write a bunch of (likely enough buggy) little regexes of your own from scratch?
posted by Western Infidels at 7:15 PM on February 3, 2020 [2 favorites]


They would spend their days writing all the regexes the developers need and be paid very handsomely.

Writing regular expressions is easy. What I need is someone to read the ones written by other developers
posted by Maxwell's demon at 8:29 PM on February 3, 2020 [9 favorites]


Is it just me, or is “machine learning” (or at least the fundamental initial iterations of it), really just about regex, but with data that isn’t text?

Sorry, I’ve been down a rabbit hole of “machine learning” stuff lately, and sometimes it just feels like the whole point of it is ‘make the machine look at these millions of things and spit out an answer’.

I mean, fundamentally, the difference between an ascii character and a pixel/pattern/graph is just... numbers that humans use to represent something and don’t mean anything else to anything else?

Or am I missing the plot?
posted by daq at 8:46 PM on February 3, 2020


Here's a fun nerdsnipe: write a regex that recognizes numbers divisible by 3. (or any other finite number, but you do have to pick one.)

I do not know of a way to recognize true statements of the form

N mod M === 0

for all finite, integral N, M, but I would believe that such a thing exists.
posted by meaty shoe puppet at 8:50 PM on February 3, 2020 [2 favorites]


Unless your phone number regex handles "Pennsylvania 6-5000", “6060-842”, "Beechwood 4-5789", 555 (the fake movie prefix), 867-5309, and "one ringy dingy", it just ain't gunna sing to me.
posted by sammyo at 9:30 PM on February 3, 2020 [2 favorites]


Regex is about manually writing incomprehensible code to recognize patterns; machine learning is asking a computer to write the incomprehensible code for you.
posted by phooky at 9:38 PM on February 3, 2020 [6 favorites]


Another site for getting a flowchart diagram from regex: debuggex.com
posted by Pronoiac at 10:36 PM on February 3, 2020


Is it just me, or is “machine learning” (or at least the fundamental initial iterations of it), really just about regex, but with data that isn’t text?

Just you.

Machine Learning is "tools for those with (coding and/or) domain knowledge who slept through their stats classes".


...I gather.
posted by pompomtom at 10:37 PM on February 3, 2020 [1 favorite]


@phooky absolutely; it drives me nuts when people try to write their entire program in regex for some reason instead of splitting up the big string into smaller strings with one simple regex, then processing those strings with another simple regex, and so on.

(You don't have to do this! Why do you keep punching yourself in the face, and then complaining that regex is horrible?!)

I am personally guilty, in Python, of using a regex substitution with a function replacement and another regex substitution inside that function. Regexception! Still, I maintain, better than one giant regex.
posted by confluency at 11:18 PM on February 3, 2020
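
For readers who have not seen the nested-substitution trick described above, here is a small sketch. The task (collapsing whitespace to underscores, but only inside parentheses) and the name underscore_inside_parens are made up for illustration; the point is that re.sub accepts a function as the replacement, and that function can run a second, simpler regex on just the matched piece.

```python
import re

def underscore_inside_parens(text):
    """Outer regex isolates each parenthesised chunk; inner regex only sees that chunk."""
    def fix_chunk(match):
        # The inner substitution never has to know about the parentheses.
        inner = re.sub(r"\s+", "_", match.group(1).strip())
        return "(" + inner + ")"
    return re.sub(r"\(([^()]*)\)", fix_chunk, text)
```

Each of the two patterns is trivial on its own, which is the argument for splitting the work this way.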


Writing regular expressions is easy. What I need is someone to read the ones written by other developers.

This Regular Expressionist does it all, both ways.
posted by bendy at 11:40 PM on February 3, 2020 [1 favorite]


Here's a fun nerdsnipe: write a regex that recognizes numbers divisible by 3. (or any other finite number, but you do have to pick one.)

... in a chosen base b. This is essentially a classic exercise in automata theory (and not impossible if the students know modulo arithmetic).

If you want to recognise all numbers N with N mod M === 0, you build an NFA with one state for every remainder 0 to M-1. Appending a new digit d is the same as multiplying by b and then adding d, and all this is executed modulo M. (So, if you are in state i and append d, you go to the state j that has i*b+d mod M === j).

Make the state for 0 initial and accepting, and convert that thing into a regular expression using your favourite algorithm. Add an extra state if you want to exclude leading zeroes.

The resulting expression is not something that you would want to build by hand (and it is unreadable), but it does the job.
posted by erdferkel at 2:27 AM on February 4, 2020 [3 favorites]
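
The remainder automaton described above can be simulated directly, which is a handy way to check any regex derived from it. In this sketch the function name divisible is an assumption. The regular expression shown is the commonly quoted one for base 2 and M = 3, obtained by eliminating the states for remainders 2 and 1; like the automaton with state 0 as both initial and accepting, it also accepts the empty string and leading zeroes.

```python
import re

def divisible(digits, m, base=10):
    """Simulate the automaton: the state is the value read so far, modulo m."""
    state = 0  # state 0 is both the initial and the accepting state
    for ch in digits:
        # Appending digit d maps state i to (i * base + d) mod m.
        state = (state * base + int(ch, base)) % m
    return state == 0

# Base 2, M = 3: the automaton above converted to a regular expression.
BINARY_MULTIPLE_OF_3 = re.compile(r"(0|1(01*0)*1)*")

def binary_divisible_by_3(bits):
    return BINARY_MULTIPLE_OF_3.fullmatch(bits) is not None
```

For base 10 the same construction works, but the resulting expression is far longer, which matches the warning above about not building it by hand.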


I occasionally have to use regexes for stuff at work, but I try to keep it to a minimum because the supplies I need for the requisite ritual self-cleansing come out of my salary.
posted by Mr. Bad Example at 3:07 AM on February 4, 2020 [2 favorites]


It's funny that nobody has mentioned the regex in the title. It won't match the zero in numbers like "10" or "8675309". (Impossible to know for certain if that was intended or not.)

Regular expressions are only as hard as the problem you're trying to solve with them. People will still screw up even simple things. Switching to something else probably won't help.

I really like that I only needed to learn one pattern matching language decades ago and it is still relevant, across all major platforms and programming languages, and (not coincidentally) probably will be for decades to come. Such a thing is rare in a field where technologies go in and out of fashion, and most knowledge has a very short useful life.
posted by swr at 4:40 AM on February 4, 2020 [1 favorite]
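
The point about the title regex is easy to confirm in Python. The variable name title is just a label for this sketch.

```python
import re

title = re.compile(r"[1-9]{1,9}")

print(title.fullmatch("99"))     # a match object: every character is in 1-9
print(title.fullmatch("10"))     # None: '0' is outside the class [1-9]
print(title.findall("10"))       # ['1']
print(title.findall("8675309"))  # ['86753', '9']: the zero splits the match
```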


swr: "Impossible to know for certain if that was intended or not."

Of course it was…

What I'm surprised nobody griped about is the Redundant Acronym Phrase, i.e.: "regex expressions".
posted by signal at 6:04 AM on February 4, 2020 [1 favorite]


A few decades ago I wrote a regex to pull citations and references from APA formatted articles. So much trial and so much error and it was never quite perfect but it was pretty good. It took forever to run but I was still quite pleased with it. Then several weeks later I ran into references with special characters and had to try and get it to work with umlauts and such. I put on my debugging hat and stared in horror at the incomprehensible gibberish I had written and immediately quit.
posted by srboisvert at 6:17 AM on February 4, 2020 [2 favorites]


Went looking for the wackiest (something something) but just have to leave this here:

Optimizing RegEx for Emoji
posted by sammyo at 6:32 AM on February 4, 2020 [2 favorites]


As for phone numbers, simply check for a leading plus sign, then discard all non-digits. The regex does more harm than good!

There are contexts where a comma, semi-colon, 'p', or 'w' are part of dialing strings (although not strictly phone numbers) ditto for # and *.
posted by atrazine at 7:05 AM on February 4, 2020 [1 favorite]


Is there a regex that will validate regexes? Almost.
posted by hypnogogue at 7:27 AM on February 4, 2020 [1 favorite]


Writing regular expressions is easy. What I need is someone to read the ones written by other developers

...in which set I include the developer known as "me as of yesterday or before".

One way I make things easier for myself is building regexes piecewise. I think of regexes as machine language programs for a particular kind of finite state machine, and like any other machine language I find an assembly language version easier to comprehend. So for example if I had an actual need for that IPv6 monster I'd spend a bit of time on breaking it down after pasting it in, like this:

ipv6='(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))'
ipv6_flags='gm'

Too much line noise, can't grok. Hmm. Lots of repeated pieces there. How about this?

hex='[0-9a-fA-F]'
hex04="$hex{0,4}"
hex14="$hex{1,4}"
ipv6="(($hex14:){7,7}$hex14|($hex14:){1,7}:|($hex14:){1,6}:$hex14|($hex14:){1,5}(:$hex14){1,2}|($hex14:){1,4}(:$hex14){1,3}|($hex14:){1,3}(:$hex14){1,4}|($hex14:){1,2}(:$hex14){1,5}|$hex14:((:$hex14){1,6})|:((:$hex14){1,7}|:)|fe80:(:$hex04){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|($hex14:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))"
ipv6_flags='gm'

That's a bit better but still hella confusing. Can't see any whitespace being matched, so what if I throw in a few newlines and tabs and get rid of them later?
hex='[0-9a-fA-F]'
hex04="$hex{0,4}"
hex14="$hex{1,4}"
ipv6="
(($hex14:){7,7}$hex14
|($hex14:){1,7}:
|($hex14:){1,6}:$hex14
|($hex14:){1,5}(:$hex14){1,2}
|($hex14:){1,4}(:$hex14){1,3}
|($hex14:){1,3}(:$hex14){1,4}
|($hex14:){1,2}(:$hex14){1,5}
|$hex14:((:$hex14){1,6})
|:((:$hex14){1,7}|:)
|fe80:(:$hex04){0,4}%[0-9a-zA-Z]{1,}
|::(ffff(:0{1,4}){0,1}:){0,1}
	((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])
|($hex14:){1,4}:
	((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])
)"
ipv6=${ipv6//[$'\t\n']}
ipv6_flags='gm'
OK, I can see a bit of structure emerging there. But what's that long ugly thing happening twice near the end? Could that be an IPv4 address in dotted-decimal format? (checks IPv6 spec) Why yes, it could. OK, it gets its own name. And so does that other weird thing I just found out you can do in IPv6 at the same time.
byte='(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])'
ipv4="($byte\.){3,3}$byte"

hex='[0-9a-fA-F]'
hex04="$hex{0,4}"
hex14="$hex{1,4}"

zone='[0-9a-zA-Z]{1,}'

ipv6="
(($hex14:){7,7}$hex14
|($hex14:){1,7}:
|($hex14:){1,6}:$hex14
|($hex14:){1,5}(:$hex14){1,2}
|($hex14:){1,4}(:$hex14){1,3}
|($hex14:){1,3}(:$hex14){1,4}
|($hex14:){1,2}(:$hex14){1,5}
|$hex14:((:$hex14){1,6})
|:((:$hex14){1,7}|:)
|fe80:(:$hex04){0,4}%$zone
|::(ffff(:0{1,4}){0,1}:){0,1}$ipv4
|($hex14:){1,4}:$ipv4
)"

ipv6=${ipv6//[$' \t\n']} #strip whitespace
ipv6_flags='gm'
That's quite tractable. Tractable enough, in fact, to spot a bug: everything is allowed to be case-insensitive except for the fe80 and ffff special values. Can't think of a good reason for that restriction and it's probably going to cause trouble. So let's get rid of all the explicit uppercase match values and just use a flag to make the whole thing case-insensitive. Also {n,n} is the same as {n}, {0,1} is the same as ? and {1,} is the same as + so tidy those too.
byte='(25[0-5]|(2[0-4]|1?[0-9])?[0-9])'
ipv4="($byte\.){3}$byte"

hex='[0-9a-f]'
hex04="$hex{0,4}"
hex14="$hex{1,4}"

zone='[0-9a-z]+'

ipv6="
(($hex14:){7}$hex14
|($hex14:){1,7}:
|($hex14:){1,6}:$hex14
|($hex14:){1,5}(:$hex14){1,2}
|($hex14:){1,4}(:$hex14){1,3}
|($hex14:){1,3}(:$hex14){1,4}
|($hex14:){1,2}(:$hex14){1,5}
|$hex14:((:$hex14){1,6})
|:((:$hex14){1,7}|:)
|fe80:(:$hex04){0,4}%$zone
|::(ffff(:0{1,4})?:)?$ipv4
|($hex14:){1,4}:$ipv4
)"

ipv6=${ipv6//[$' \t\n']} #strip whitespace
ipv6_flags='gim'
Now that, I'll be able to read tomorrow. Probably even debug and/or loosen and/or tighten tomorrow.

Using a language like bash or perl that allows for easy string interpolation and multi-line string literals makes this kind of thing much less fiddly, but it's worth doing even if you have to do the whole 'this' + 'that' concatenation dance. I've done it in VBScript and been glad I had.
posted by flabdablet at 7:34 AM on February 4, 2020 [9 favorites]
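
The same piecewise style carries over to Python with plain string concatenation. This sketch reuses only the byte and ipv4 pieces from the tidied version above; the names ipv4_matcher and is_dotted_decimal are assumptions, and it deliberately leaves the full IPv6 pattern alone.

```python
import re

# Named pieces, copied from the tidied bash version above.
byte = r"(25[0-5]|(2[0-4]|1?[0-9])?[0-9])"
ipv4 = "(" + byte + r"\.){3}" + byte

ipv4_matcher = re.compile(ipv4)

def is_dotted_decimal(text):
    """True when the whole string is four dot-separated values, each at most 255."""
    return ipv4_matcher.fullmatch(text) is not None
```

Concatenation avoids having to escape the {3} braces, which an f-string would require. Note that the byte piece accepts a leading zero, as in 01, so tighten it if that matters for your input.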


This: (?=.*?[#?!@$ %^&*-])

I want every site that validates password rules to gaze upon that expression, commit it to memory, and then KILL IT WITH FIRE. Nuke it from orbit.

"Include one special character" ≠ "Include one of the 10 characters we bothered to add into this regex". If I add a password that includes one of ∂∞∆ß≈嵬™ instead, then your goddamn regex should NOT be telling me that the password is "not complex enough". YOU are not complex enough.
posted by caution live frogs at 7:48 AM on February 4, 2020 [12 favorites]
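
One way to avoid a hard-coded list of "special" characters is to define special as "anything that is not a letter or a digit". The sketch below does that with str.isalnum; the function name is hypothetical and the definition is an assumption, not a recommendation for any particular password policy.

```python
def has_special_character(password):
    """Treat any character that is not a letter or digit as special."""
    return any(not ch.isalnum() for ch in password)
```

Under this definition ∂ counts as special, while ß and 嵬 do not, because Python classifies them as letters. Spaces count as special too.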


I want every site that validates password rules to gaze upon that expression, commit it to memory, and then KILL IT WITH FIRE. Nuke it from orbit.

I want every site that does password complexity checks to gaze upon that expression, commit it to memory, then KILL THE WHOLE THING with fire and use zxcvbn instead.

It is risible that Apple doesn't allow dpkai.lnijr.dzlnj.hkfei.pwnis for an Apple ID password but still* rates the strength of Apple-01 as "moderate".

*checked again just before posting this
posted by flabdablet at 8:04 AM on February 4, 2020 [6 favorites]


there's a certain queasy pleasure in seeing how all the tiny cogs and gears fit together

For those who struggle to experience that pleasure but still need to find the busted cog before lunch, there's regex101.
posted by flabdablet at 8:44 AM on February 4, 2020 [1 favorite]


flabdablet, if you really want a nice way to write maintainable regexps then Python's tools are really nice. Verbose regex lets you add whitespace and comments to a regex. And named capture groups let you refer to bits you are extracting with the regex symbolically. An example from here, for breaking apart ISO dates. It's a trivial regex but you get the idea.
import re

iso_date_matcher = re.compile(r'''
  (?P<date>           # the whole date
    (?P<year> \d{4})  # YYYY year
    - 
    (?P<month>\d{2})  # MM month
    -
    (?P<day>  \d{2})  # DD day
  )
''', re.VERBOSE)
I've been writing ugly regexps for 10 years this way and I can still go back to an old one and understand it well enough to modify it.
posted by Nelson at 9:17 AM on February 4, 2020 [4 favorites]
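
To show what the named groups buy you, here is the matcher above in use. The pattern is repeated so the snippet runs on its own; the sample sentence and the variable names m and parts are made up for illustration.

```python
import re

iso_date_matcher = re.compile(r'''
  (?P<date>           # the whole date
    (?P<year> \d{4})  # YYYY year
    -
    (?P<month>\d{2})  # MM month
    -
    (?P<day>  \d{2})  # DD day
  )
''', re.VERBOSE)

m = iso_date_matcher.search("posted on 2020-02-04 at 9:17 AM")
parts = m.groupdict()   # every named group, keyed by name
print(m.group("year"), m.group("month"), m.group("day"))
```

Because the groups are addressed by name, adding or reordering groups later does not break the code that reads them.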


∂∞∆ß≈嵬™
my new sockpuppet
although the genius creator never did, to my knowledge, take "Yukon Sid, or me, your young apprentice" and that is really tempting
posted by thelonius at 9:25 AM on February 4, 2020 [1 favorite]


I, uh, I kind of love regular expressions. They're like a fun puzzle. And they have saved me SO much time.
posted by Anonymous at 10:10 AM on February 4, 2020


The biggest issue for me is when I write a really long one and leave it alone for a while and then have to go back and figure out what the hell I did.
posted by Anonymous at 10:13 AM on February 4, 2020


I'm a translator and technical writer. Although my regex-fu is weak, I use it all the time. I can't imagine doing my job without it. I was flabbergasted when I learned my supervisor has never even heard of it.
posted by adamrice at 11:18 AM on February 4, 2020 [1 favorite]


I use regex semi regularly. Probably the ugliest one I ever made was early in my coding career, where I was working with local political candidates, building websites and helping them compile mailing lists. It was meant to cope with really badly formatted voter data from a small but wealthy bay area city. I think I may still have it, but it had a couple of hacks that would reveal a bit of voter data, and I am both a big believer in data privacy and not entirely sure of the legality of sharing any data about california voters, so I'll just describe it.

Basically it was comma delimited data (a csv file) that should ideally have been importable as a spreadsheet, but they had allowed voters to include commas and quotes in their addresses and other data, and didn't strip those or comment it correctly >_<. I think it was probably a database dump that they didn't think about carefully.

So I figured out a number of places where voters were likely to include commas (i.e., between city and state), regexed it to remove bad characters before importing it, saw I missed some because the columns were off, added another place, and checked to make sure the last entry had the correct columns. Which didn't work 100% because somehow some of the data was still off, so I am not ashamed to admit I manually edited the last few bad lines.

All that was before the actual data processing, where I was combining last names so that mailers could go to addresses designated for the whole family, rather than individual ones for each family member. Detecting the same last name and combining required its own hacky regex and detection of when an address had apartments... basically hacky regexes all the way down.

I used perl which a lot of people think is cool to dislike, but I have no regrets because it worked and I got paid.
posted by gryftir at 1:13 PM on February 4, 2020 [2 favorites]


I know just enough regex to be dangerous. Recently I found this Quick-Start: Regex Cheat Sheet.

It groups families of short, useful features - and gives simple examples - without getting into flavors or syntax complications. It let me quickly find my 'not' solution: [^m] .
posted by Twang at 5:58 PM on February 4, 2020 [2 favorites]


Nontrivial "not" solutions are among the fiddliest things to express in regex. Single-character is easily done with a negated character class as you point out, but a regex to match "any word except hippopotamus" is pretty horrendous unless the regex implementation you're using has explicit syntax for that case. If it doesn't, something like this is required:

\b(.{1,11}|.{13,}|[^h].{11}|.[^i].{10}|..[^p].{9}|...[^p].{8}|.{4}[^o].{7}|.{5}[^p].{6}|.{6}[^o].{5}|.{7}[^t].{4}|.{8}[^a]...|.{9}[^m]..|.{10}[^u].|.{11}[^s])\b

If you find yourself apparently in need of a regex that embeds beasts of that nature, perhaps because you're trying to parse text where reserved words are being used as content boundary markers, one approach that can walk the sheer fiddliness back from don't-even-go-there is choosing some set of single characters that can't occur in the original text, and doing some preliminary cleanup and substitution passes to replace reserved words with single-character tokens chosen from that set. Then you can use something like [^\x01]+ to match a range of text that doesn't contain hippopotamus tokens.

But in general this kind of thing is Regex Code Smell and a strong indication that it would pay you to invest a bit of time in learning a more general (though perhaps also less performant) parsing library.
posted by flabdablet at 10:50 PM on February 4, 2020 [1 favorite]
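
Python's re module is one implementation that does have explicit syntax for this case: a negative lookahead. A minimal sketch of "any word except hippopotamus", with the name NOT_HIPPO chosen only for illustration:

```python
import re

# At a word boundary, refuse to start if the whole word is "hippopotamus".
NOT_HIPPO = re.compile(r"\b(?!hippopotamus\b)\w+\b")

words = NOT_HIPPO.findall("a hippopotamus and two hippopotamuses")
print(words)  # ['a', 'and', 'two', 'hippopotamuses']
```

The trailing \b inside the lookahead is what lets longer words such as hippopotamuses through. Engines without lookahead support still need one of the workarounds described above.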


You people are killing me.
posted by bendy at 12:03 AM on February 5, 2020


...which only goes to show that hippopotamuses with bugs are dangerous. This version works.

There are similar bugs in the IPv6 matcher above that I don't propose to fix. If you were thinking of putting that into production, don't.
posted by flabdablet at 7:02 AM on February 5, 2020


A less clumsy hippopotamus filter
posted by flabdablet at 7:53 AM on February 5, 2020



Sorry, I’ve been down a rabbit hole of “machine learning” stuff lately, and sometimes it just feels like the whole point of it is ‘make the machine look at these millions of things and spit out an answer’.

I mean, fundamentally, the difference between an ascii character and a pixel/pattern/graph is just... numbers that humans use to represent something and don’t mean anything else to anything else?

Or am I missing the plot?


Kinda. ML has a lot going on, and to make it sexy, companies make it really easy to do sexy things. I'm not an expert by any means but I did study stats (warning, ML is not just stats). You have "cheating" ML that gets people going. Someone at some point identified 5,000 pictures of dogs and said "these are dogs", ran them through to figure out what dogs look like, then looked at pictures that are not dogs and said "these are not dogs". Then, using recognition techniques, you can give it a random picture and it will say whether it is a dog or not a dog. People are impressed by this, and facial/object recognition generally requires a large training set where someone at some point has good pictures of a lot of dogs or whatever. A lot of ML really is optimizing, which is why Snapchat funny faces work well: you have a high-quality camera looking directly at you in good lighting. Have a crowd of 20 people at low resolution and suddenly performance goes down, as does quality of recognition.

Again, a lot of what goes into this is compression of the image so you can process it in near real time, and also that someone has classified an object beforehand, which itself is a big deal. Not to undermine that work but that's done.

What's interesting to me is unsupervised machine learning with something like facial recognition. I do not care who you are other than that you are you. You are in a restaurant, I take your picture. You show up again, then I have to go through all pictures of all people and figure out if maybe the match is you. Come in enough times and suddenly I have customer XYZ1B2. I don't care about your name but I'm effectively going through the entire database, clustering pictures and trying to identify if you are customer XYZ1B2. That's resource intensive. But say I know XYZ1B2 comes in on X date of the week, orders an appetizer, and if that happens orders two drinks. If it is the end of the month maybe they don't order the appetizer because rent is due or whatever, but I have another machine learning dataset that doesn't care why and just knows that if an appetizer is ordered, two drinks at higher profit margins are consumed. So "in near real time" the system decides to give the appetizer for free or half off or whatever, knowing the profit lost on that is made up on the higher-profit drinks, because there's a 90% chance XYZ1B2 will stay.

Now a good bartender/waiter will know this for regulars, and there's a lot of psychology there of feeling special, that they know your habits. But that's easy for humans to remember for regulars; if you come in once every two months, maybe not so much. Or maybe there are non-intuitive patterns that induce higher spending. Google has demonstrated this with regulating data center power consumption; it is not at all far off to see this sort of thing work at Applebee's when the cost of all this comes down.

So yeah it comes down to a million iterations and finding patterns but right now most of what you see is "cheating" in a way where I think the real value comes from when you don't know the classification and the unsupervised learning does the work for you.
posted by geoff. at 4:29 AM on February 6, 2020 [1 favorite]




> Jamie Zawinski, 1997.
Paraphrasing “If you have a problem and you think awk(1) is the solution, then you have two problems.” – David Tilbrook, 1989

or possibly “The solution of every problem is another problem.” – Johann Wolfgang von Goethe, 1821
posted by farlukar at 12:06 PM on February 7, 2020 [2 favorites]


On second look, that was said by Friedrich von Müller, only written down by Goethe.
posted by farlukar at 12:19 PM on February 8, 2020



