Inbar's Substack

Inbar's Substack

5 Claude Code Hacks That Helped us Win đŸ„ˆ at a Hackathon

How we used Claude Code as a system - not a chatbot - to trade tokens for outcomes 🧠

Inbar Shirizly's avatar
Inbar Shirizly
Dec 14, 2025

Recently our Torq team took 2nd place at a cyber hackathon.

Not because we wrote more code.

But because we changed how we worked with Claude Code.

We had six hours. Roughly 100k Sysmon events. And a scoring rule that punished token waste.

Some teams went deep on model and pipeline optimization.

We went in a different direction.

We treated Claude like a system - a team we could conduct.

That shift let us trade tokens for outcomes on purpose:

  • more findings with roughly the same token budget

  • or the same findings with far fewer tokens

And we shipped in time.


The 5 hacks (TL;DR)

  1. Ultrathink - force real trade-off reasoning before you commit to an architecture.

  2. feature-dev - multi-agent planning that explores your codebase before it codes.

  3. Feedback loops - measure every iteration instead of trusting vibes.

  4. Parallel Claude instances - split work across terminals like a human team.

  5. AGENTS.md / CLAUDE.md - bake constraints into every session automatically.

Let me unpack each.


Hack #1: Ultrathink mode for architecture decisions

hack #1 - ultrathink in Claude Code

Most people skip this. We didn’t - because we knew our real bottleneck wasn’t coding, it was choosing the right architecture under a hard token constraint.

What’s not obvious (and barely documented) is that Claude Code maps specific phrases to increasing levels of internal thinking budget:

“think” < “think hard” < “think harder” < “ultrathink”

Each step explicitly allocates more reasoning time before Claude responds. ultrathink is the highest tier - it tells the system to slow down and spend real compute on trade-offs instead of jumping to an answer.

In practice, ultrathink makes Claude spend ~30–60 seconds actually reasoning before responding. Not just generating text, but weighing options, constraints, and second-order effects.

Without ultrathink: You get generic advice. “Use ML.” “Batch it.” “Try caching.”

With ultrathink: You get a real architectural breakdown:

  • 2–3 plausible approaches

  • trade-offs

  • token cost estimates

  • implementation plan

Pay ~2k tokens for planning. Save ~80k tokens by not guessing. And yes I know it feels voodoo, but hey it works like hell!

When we use it:

  • architecture decisions (basically always)

  • algorithm design

  • perf/token optimization

  • when I feel “slightly unsure but already coding”

When we don’t:

  • tiny bug fixes

  • “add a button”

  • when I already know exactly what I want


Hack #2: feature-dev = multi-agent planning system

hack #2 - feature dev

Everyone uses Claude Code to generate code. feature-dev is different: it’s really a planning orchestra.

We use it whenever we don’t trust our own context yet. Under hackathon pressure, that’s basically always.

What actually happens

Phase 1–2: Discovery + Exploration
Claude launches 2–3 explorer agents in parallel:

  • find similar features

  • map architecture + abstractions

  • spot existing patterns Then it reads the key files they found.

Phase 3: Clarifying questions (critical)
It asks about edge cases, integration points, perf requirements - and waits. No “AI guessing what I meant.”

Phase 4: Architecture options
You get 2–3 designs:

  • minimal change

  • clean separation

  • pragmatic middle ground with concrete trade-offs for your codebase.

Phase 5–7: Implement → Review → Summary
Only after you approve the plan.

Example:

/plugin marketplace add anthropics/claude-code
/plugin install feature-dev@claude-code-plugins

-- reset claude

/feature-dev:feature-dev Add network analysis hunter

It comes back with:

  • “here are the key files”

  • “here are 3 ways to implement it”

  • “pick one and I’ll code it”

That saved us from:

  • duplicating existing code

  • choosing the wrong abstraction

  • missing edge cases

Mini takeaway: If a feature touches more than ~2 files, let feature-dev plan it first.


Hack #3: Build your own feedback loop

Ok, this is actually were I need you to focus


hack #3 - feedback loop as part of the agent session

Here’s the default AI-coding loop most of us fall into:

  1. Ask Claude to build something

  2. Run it

  3. Squint at the output

  4. Repeat 🙃

Once Claude started producing options fast, we needed a way to stop arguing with vibes - so we forced ourselves into a data loop.

What we asked Claude to build

Two tiny scripts Claude generated for me as part of the workflow:

  • analyze_tokens.py - per-component token breakdown

  • compare_approaches.py - side-by-side comparison on quality + efficiency

Template you can steal:

# compare_approaches.py
def compare_versions(v1_id, v2_id):
    “”“Compare quality + efficiency metrics.”“”
    return {
        “token_savings”: calculate_token_diff(v1_id, v2_id),
        “quality_delta”: compare_outputs(v1_id, v2_id),
        “false_positives”: count_fps(v1_id, v2_id),
    }

Run:

python compare_approaches.py v1 v2

Hackathon example outcomes:

  • v2 vs v1:
    +23 detections
    +12k tokens
    → NO (efficiency mattered)

  • v3 vs v1:
    +28 detections
    –87k tokens
    → YES (ship it)

How I actually used the loop

The key wasn’t just writing scripts - it was how we ran Claude against them.

For each hunter/agent, we let Claude iterate on the implementation until the metrics got better than the previous version. If an iteration didn’t improve results after a few tries, I ditched that addition and moved on.

So the cycle looked like this:

Each agent kept looping until the numbers stopped improving. If it didn’t converge after a few iterations, we dropped it and moved on.

Generalize this:

  • building an API? measure latency + errors

  • refactoring? track bundle size + tests

  • adding a feature? measure impact + complexity

Mini takeaway: Don’t accept the first draft. Make Claude compete with itself - and ship only what measurably improves the outcome.


Hack #4: Run multiple Claude instances in parallel

hack #4 - running multiple instances of claude using warp

This felt like cheating the first time we tried it. 😄

If your architecture is modular, you can have multiple Claude sessions working at once, each on a separate component.

During the hackathon I opened 3 terminals and gave each Claude instance a different hunter.

Rule we used: Each Claude session owns a file/folder. If two sessions touch the same file, you lose the speedup to merge drama. 📁

Result: what would’ve been ~30 minutes of sequential additions became ~10 minutes in parallel (closer to ~15 once I included the orchestration + feedback overhead).

How to set it up:

  1. design modular pieces first (ultrathink helps)

  2. identify truly independent components

  3. open multiple terminals

  4. run Claude Code separately in each

  5. merge, test, compare, ship

Doesn’t work great for:

  • tightly coupled codebases

  • tiny tasks

  • when architecture is still fuzzy

Mini takeaway: Treat Claude like a team. Give each “teammate” a clean slice of ownership.


Hack #5: Strategic AGENTS.md (not just documentation)

We used to treat AGENTS.md like documentation. In this hackathon it became a constraint engine.

AGENTS.md is not working by default with CC, we need to configure it with a rule or work directly with CLAUDE.md (but for working with other code agents - we need to AGENTS.md file

Our coding agents auto-load this file every session. So instead of re-prompting constraints 50 times, we wrote them once - and every parallel agent stayed aligned.

What we put in it

  • hard constraints (token budget, max tokens, scoring rules)

  • optimization rules of thumb

  • communication style

  • critical files: DO NOT TOUCH

Minimal skeleton:

# AGENTS.md

## Project Overview
AI security agent analyzing Sysmon logs (~100k events). Scores on token efficiency.

## Non-Negotiable Constraints
- 100k token max - fewer = higher score
- All LLM calls via TracedLLM wrapper

## Optimization Preferences
- Aggregate events before LLM calls
- Use structured output for parsing
- Keep prompts short - no redundant context

## Communication Style
- “Be extremely concise, sacrifice grammar for concision”
- Paragraphs → few words, headings, lists

## Plans
- End with unresolved questions (concise)

## Critical Files - DO NOT MODIFY
- `src/traced_llm.py` - token tracking (competition rules)

Credit / inspiration: Some of the ideas around treating agents as first-class collaborators (and baking constraints into a shared context file) were inspired by short-form examples like this YouTube Short by Matt Pocock:

Mini takeaway: Set constraints once. Benefit everywhere.


The hackathon story (what we built, baseline vs our approach)

Task: Build an AI security agent that can process ~100k Windows Sysmon events and surface meaningful detections. Scoring rewarded detection quality/coverage and token efficiency, with a hard cap of 100k tokens per run.

Baseline approach

Classic single-agent pipeline:

  • preprocess logs

  • send large chunks to the LLM

  • generate detections/report

It worked
 but cost ~103k tokens and missed coverage.

Our approach

Our Torq team created a multi-hunter architecture:

  1. Preprocessing + metadata caching

  2. 5 parallel specialist hunters (fileless, sequence, process-tree, network, rare-event)

  3. Merge findings + final report

Parallel specialists meant:

  • smaller, tighter prompts

  • less repeated context

  • better domain focus

  • much lower token burn

Diagram: baseline vs our approach

baseline approach - flow continues in the link
multi-hunter architecture - flow continues in the link

full diagram

Results

Token efficiency

  • baseline: 103,419 tokens

  • our approach: 16,373 tokens

  • savings: 87,046 tokens (84%)

Quality

  • 28 IOCs detected (vs 23 baseline)

  • 12 MITRE techniques mapped (vs 8)

  • lower false positives

Time

  • full system built in ~6 hours

  • 2nd place overall

want to see the full hackathon story? check out this link


The meta-lesson

We didn’t win by typing faster. We won by designing a workflow aligned with constraints:

  • Claude as a thinking partner (ultrathink)

  • planning before coding (feature-dev)

  • measuring instead of guessing (feedback loops)

  • parallelizing like a team (multi-instances)

  • baking constraints into context (AGENTS.md)

Hackathon meta: stop coding. start conducting.


Takeaways (what you should steal)

Start doing today

  1. Add ultrathink before decisions that matter

  2. Measure at least one real metric per iteration

  3. Create a strategic AGENTS.md with true constraints

Try this week 4. Use feature-dev for non-trivial changes

Try next big project 5. Design for parallel Claude sessions


Thanks for reading Inbar's Substack! This post is public so feel free to share it.

Share

© 2026 Inbar Shirizly · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture