It starts off really strong
January 27, 2025 1:58 AM Subscribe
2025 seems to be the year of Agents™️, even though no one really agrees on what they are. With all the hype around agents, it's hard to know what these things actually can and can't do. This is where 'The Password Game' comes in. from I had different agents play 'The Password Game' - they didn't do so well
The Password Game, previously
it's just a way to get users to buy more tokens
posted by AlbertCalavicci at 2:26 AM on January 27 [2 favorites]
"Claude 3.5 Sonnet - Fools itself into thinking it solved the captcha, and then terminates the task early."
awww, they're just like us!
posted by taz at 3:50 AM on January 27 [10 favorites]
"Agents" are just software applications that coordinate multiple LLM sessions and do branching logic based on the results. So the success or failure of an agent to achieve a task is as much a reflection of the underlying application architecture as it is the underlying models. And that application is most likely written by a person.
I tried using cursor.ai to refactor an application written in Node.js to Spring Boot using its agent. It was pretty good up to a point, after which it kind of forgot where things were located and started making bad decisions. I don't fault Claude 3.5 Sonnet for that - it was clearly the agent application not properly keeping track of its progress.
This is to say that LLM models could stop progressing right now and, with "agents" - aka software that orchestrates a bunch of LLM sessions and also has its own datastore and can talk to the internet and make API calls - stuff can get a whole lot more useful and targeted. But that requires thoughtful application design which also takes into account the non-deterministic nature of LLM responses - not an easy or simple task.
I'm not sure how these off-the-shelf agents are built, but if they aren't designed for a specific purpose, they are likely to be a disappointment when applied to specific scenarios.
posted by grumpybear69 at 4:53 AM on January 27 [6 favorites]
I'd be interested to know how well o1 or r1 does with this, they are meant to be better at puzzles.
posted by BungaDunga at 6:05 AM on January 27 [2 favorites]
"Agents" are just software applications that coordinate multiple LLM sessions and do branching logic based on the results.
The first sub-clause is not wrong, as far as it goes. But agents long predate LLMs. Agents predate most object oriented programming languages, even, though they share the basic idea of compartmentalized functionality.
posted by eviemath at 9:28 AM on January 27 [5 favorites]
Update: R1 does okay!
It goes fine until rule 5 ("Your password must include one of our sponsors", followed by logos), which it can't solve because it's not multimodal. But that's basically okay; if you give it to ChatGPT instead, it understands the image fine. The next one it gets stuck on is 10, which has a CAPTCHA, which it slightly misread (it's not a multimodal model, so it's running text recognition first) and therefore also failed. ChatGPT succeeds on the CAPTCHA though, so let's count that as an AI success and keep going.
Then it hits "Rule 11: Your password must include today's Wordle answer" which... yeah, obviously it's not going to know that, and it's not plugged into a browser so it can't even try to play the game. But let's act as a prosthesis and see if it will solve the Wordle with me: "You give initial guesses and I will give you feedback." It gets it in 4 tries. (Later I realized I could turn on Search and ask "what's today's wordle answer," which ought to surface the correct answer on Google, but it times out for me. ChatGPT's Search tool comes up with the word for the wrong day entirely.)
Rule 13 (current phase of the moon) gives it some grief until I remember to turn on the "Search" option, which lets it google for the current phase of the moon, which it determines correctly in one shot.
Rule 14: "Your password must include the name of this country." and then a Google Street View image. R1 can't do this, and I'm out of free ChatGPT invocations, but I try Claude and it thinks it's a Nordic country, either Finland, Sweden, or Norway, so I just try them in that order and Sweden is correct.
Rule 15: "Your password must include a leap year." It comes up with a password that should work, but it turns out the leap year needs to be separated from the other digits. I told R1 that it "didn't work" and it correctly guessed that this might be the problem and gave one that works.
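As an aside, verifying a candidate year by hand is a one-liner; this is just the standard Gregorian leap-year rule, nothing game-specific:

```python
def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
```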
Rule 16: "Your password must include the best move in algebraic chess notation. [chess board]" R1 can't interpret the image. Claude can, but generates moves that the game says are illegal. I am awful at chess, and I don't have the time to work out how to input this position into a chess engine, so this is the end of the line for now.
posted by BungaDunga at 4:57 PM on January 27 [1 favorite]
The Rabbit R1 was an early attempt to use LLMs to make a personal assistant "agent". Besides not working very well, one of the criticisms of it was that it was a security nightmare as it would need access to all your logins in order to do various tasks on your behalf. I was curious about how this new generation of agents handles this, and apparently OpenAI's Operator agent hands back control to you when it needs to log into your account so you can type in your password, and it pinky-promises not to take screenshots (or capture keypresses presumably) while you do. Of course it can still access all your personal data that was protected behind the login once you log in, or it wouldn't be able to do its job.
posted by L.P. Hatecraft at 7:01 PM on January 28
posted by mayoarchitect at 2:04 AM on January 27 [3 favorites]