DEV Community: Boris Kl

Your page loads fast but still feels slow? It's INP, not load time

Boris Kl — Tue, 16 Jun 2026 10:27:25 +0000

Your Lighthouse report is mostly green. LCP is fine, CLS is fine, the page loads fast. Then the real-world score drops and you can't see why. Nine times out of ten the culprit is INP — and it's the one metric a quick Lighthouse run barely shows you.

What INP actually measures

INP, short for Interaction to Next Paint, replaced FID as a Core Web Vital in March 2024. FID only looked at the delay before your first interaction. INP looks at all of them, the whole time someone uses the page, and reports close to the worst one.

So it's not a loading metric. It's a responsiveness metric. It answers a different question: when I tap, click, or type, how long until the screen actually changes? Google's buckets are simple: 200ms or under is good, over 500ms is poor, and anything in between needs improvement.
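To make "close to the worst one" concrete, here is a rough Python sketch of how an INP value could be picked from a list of interaction latencies and then bucketed. It assumes the commonly documented rule of skipping one outlier per 50 interactions; pick_inp and rate_inp are illustrative names, not part of any library.

```python
def pick_inp(latencies_ms):
    """Approximate INP: the worst interaction, skipping one outlier per 50 interactions."""
    if not latencies_ms:
        return None
    ranked = sorted(latencies_ms, reverse=True)
    # Never skip past the end of the list on very short sessions.
    skip = min(len(ranked) // 50, len(ranked) - 1)
    return ranked[skip]


def rate_inp(inp_ms):
    """Bucket an INP value using the 200ms / 500ms thresholds."""
    if inp_ms <= 200:
        return "good"
    if inp_ms <= 500:
        return "needs improvement"
    return "poor"
```

A session with two interactions at 100ms and 900ms reports 900ms, which lands in "poor" no matter how fast the page loaded.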

That's why a site can load in a second and still fail. Loading fast and responding fast are two different jobs, done by two different things.

Why it's invisible in a normal audit

LCP and CLS happen during load, so a lab tool catches them every run. INP only happens when a human interacts. Lighthouse doesn't tap your buttons, so the best a standard run can give you is a proxy such as Total Blocking Time. You can have a green lab report and a red field score at the same time, and that gap is exactly where people get stuck.

To see the real number, measure interactions as they happen:

import { onINP } from 'web-vitals';
onINP(function (metric) {
  console.log('INP', metric.value, metric.entries);
});

That logs the actual slow interaction and the element behind it. Now you're fixing a real thing instead of guessing.

What's really slow

INP is almost always one thing: the main thread was busy when the user acted. The browser can't paint the response until the current JavaScript task finishes, so a long task blocks the interaction.

The usual sources:

A heavy event handler doing real work on every click or keystroke.
Third-party scripts like chat widgets, analytics, and tag managers, running long tasks at the wrong moment.
Layout thrash: reading and writing the DOM in a loop so the browser recalculates over and over.
Framework hydration waking the whole page up at once.

The fixes that move it

Break up long tasks. If a handler does a lot, let the browser breathe partway through instead of holding the thread:

async function onClick() {
  doUrgentPart();          // update the UI first
  await yieldToMain();     // give the browser a turn to paint
  doExpensivePart();       // the rest can wait a tick
}

yieldToMain is a one-line helper around scheduler.yield() where it's supported, or a setTimeout(0) fallback. The trick is to paint the response before the slow work, not after.

Beyond that: defer scripts the page doesn't need to react, audit third-party widgets for the ones that run long tasks, debounce expensive handlers, and batch your DOM reads and writes so the browser isn't recalculating layout on every line.

The honest part

I won't promise you a magic number — INP depends on your scripts, your theme, and what your users actually click. But it's measurable, and the field data shows the difference plainly once the long tasks are gone.

I keep my own WordPress sites in the green on Core Web Vitals, and INP is the one I watch most now, because it's the one that quietly fails while everything else looks fine. If your lab report is green but the real score isn't, stop staring at LCP. Go measure an interaction.

Your Telegram bot replies twice? It's timing, not a logic bug

Boris Kl — Mon, 15 Jun 2026 13:23:49 +0000

A Telegram bot replies to the same message twice. An n8n flow processes an order, then processes it again ten seconds later. The owner reads the handler code, finds nothing wrong, and assumes the logic is broken.

It usually isn't. These bugs are almost always about timing, not logic — and once you know the three places timing bites, they stop being mysterious.

1. The webhook you never answered

Telegram, like most webhook senders, waits for an HTTP 200. If your endpoint does the work first and answers afterward, a slow database call or a third-party API can push you past the timeout. The sender assumes delivery failed and sends the same update again. Now your "double reply" isn't a logic bug. It's the same event arriving twice because you were too slow to say "got it."

The fix is to acknowledge first, process second:

@app.post("/webhook")
async def webhook(request: Request):   # assuming a FastAPI/Starlette-style app
    update = await request.json()
    queue.put_nowait(update)           # hand off to a background worker
    return Response(status_code=200)   # answer immediately

Return 200 the moment you've safely stored the update. Do the real work in a background task or a worker. The sender stops retrying, and the duplicates dry up.
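The other half of "acknowledge first, process second" is the worker that drains the queue. A minimal sketch using only asyncio, with handle as a placeholder for the slow work and an in-memory queue as an assumption (it loses pending updates on restart, so a durable store is safer in production):

```python
import asyncio

processed: list[int] = []


async def handle(update: dict) -> None:
    """Placeholder for the slow part: database writes, third-party API calls."""
    await asyncio.sleep(0.01)
    processed.append(update["update_id"])


async def worker(queue: asyncio.Queue) -> None:
    """Drain the queue forever; one failed update must not kill the loop."""
    while True:
        update = await queue.get()
        try:
            await handle(update)
        except Exception as exc:
            print("handler failed:", exc)
        finally:
            queue.task_done()


async def demo() -> list[int]:
    processed.clear()
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(worker(queue))
    for update_id in (1, 2, 3):
        queue.put_nowait({"update_id": update_id})  # what the webhook does
    await queue.join()  # wait until the worker has handled everything
    task.cancel()
    return list(processed)
```

The webhook only ever touches put_nowait, so its response time no longer depends on how slow handle is.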

2. No dedup, so retries become real work

Answering fast helps, but retries still happen — network blips, restarts, a sender that's feeling anxious. The honest assumption is: every event can arrive more than once. So make handling it twice harmless.

Every Telegram update has an update_id. Every message has a message_id. Most webhook payloads have some stable id. Key on it:

if await seen.exists(update_id):
    return            # already handled, do nothing
await seen.add(update_id, ttl=86400)
await handle(update)

seen can be Redis, a unique column in your database, anything that's shared across workers. The point is that "process this order" runs once even if the event shows up three times. People call this idempotency; it just means doing it again changes nothing.
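One detail worth sketching: check-then-add leaves a small gap where two workers can both see "not seen yet". A single "add if new" step closes it. The class below is an in-memory stand-in for a shared store, good for one process only; SeenStore, add_if_new and process_once are made-up names for illustration. With Redis, the same idea is usually a single SET with the NX and EX options.

```python
import time


class SeenStore:
    """In-memory stand-in for a shared store such as Redis (single process only)."""

    def __init__(self, clock=time.monotonic):
        self._expires = {}
        self._clock = clock

    def add_if_new(self, key, ttl: float) -> bool:
        """Record key and return True, or return False if it is still remembered."""
        now = self._clock()
        expiry = self._expires.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires[key] = now + ttl
        return True


def process_once(seen: SeenStore, update: dict, handle) -> bool:
    """Run handle(update) only the first time this update_id shows up."""
    if not seen.add_if_new(update["update_id"], ttl=86400):
        return False  # duplicate delivery, do nothing
    handle(update)
    return True
```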

3. Two messages, one piece of state, no lock

This is the one that looks the most like a logic bug and isn't. A user double-taps a button. Two updates arrive almost together. Both handlers read "balance: 100", both subtract 30, both write "70". You charged once for two actions, or booked the same slot twice.

Nothing in the logic is wrong. The two runs just overlapped. The fix is to stop them from overlapping on the same state:

async with lock(f"user:{user_id}"):
    balance = await get_balance(user_id)
    await set_balance(user_id, balance - 30)

A per-user lock (Redis SET NX, a database row lock, whatever you have) means update B waits for update A to finish before it touches the same row. In n8n the same idea shows up as a queue or a "wait for previous execution" step instead of letting every webhook fire its own parallel run.
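The race is easy to reproduce in a few lines. This is a toy sketch, assuming an in-memory balance and a per-user asyncio.Lock; asyncio.sleep(0) stands in for the await on a real database, and charge and double_tap are illustrative names.

```python
import asyncio
from collections import defaultdict

balances = {"u1": 100}
locks = defaultdict(asyncio.Lock)  # one lock per user id, this process only


async def charge(user_id: str, amount: int, use_lock: bool) -> None:
    async def read_modify_write():
        balance = balances[user_id]
        await asyncio.sleep(0)  # the other handler gets to run here
        balances[user_id] = balance - amount

    if use_lock:
        async with locks[user_id]:
            await read_modify_write()
    else:
        await read_modify_write()


async def double_tap(use_lock: bool) -> int:
    """Simulate two updates for the same user arriving almost together."""
    balances["u1"] = 100
    locks.clear()
    await asyncio.gather(
        charge("u1", 30, use_lock),
        charge("u1", 30, use_lock),
    )
    return balances["u1"]
```

Without the lock both taps read 100 and the balance ends at 70; with it the second tap sees 70 and the balance ends at 40. An in-process asyncio.Lock only protects a single worker, so several workers still need a shared lock like the Redis or row-lock options above.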

The part that saves you next time

Most of these never get diagnosed because they're invisible. The handler "works" when you test it by hand — you can't tap fast enough to cause the race, and your local webhook answers instantly. It only breaks under real traffic, at 3am, where you're not looking.

So log the timing, not just the errors. Log the update_id on the way in and the way out. Log when a lock is contended. The first time you see the same update_id logged twice, the whole thing stops being a mystery and becomes a one-line fix.
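A small context manager is enough to get the "in" and "out" lines. This is a sketch using the standard logging module; the logger name and the traced helper are made up for the example.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("bot.timing")


@contextmanager
def traced(update_id: int):
    """Log when an update starts and finishes, with the elapsed time."""
    start = time.perf_counter()
    log.info("in  update_id=%s", update_id)
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("out update_id=%s took=%.0fms", update_id, elapsed_ms)
```

Wrap the handler body in `with traced(update_id):` and a duplicate delivery shows up as two "in" lines with the same id.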

I run Telegram bots and n8n in production every day, and I've hit all three of these. None of them were in the logic. They were in the gaps between events — and that's almost always where to look first.

Sending Telegram Bot Conversions to Meta? Don't Reach for business_messaging

Boris Kl — Sun, 14 Jun 2026 13:45:21 +0000

A bot was firing Subscribe and Purchase events from Telegram straight to Meta's Conversions API, and every call came back with a 400:

"error_user_title": "Missing Messaging Channel Parameter",
"error_user_msg": "A messaging channel parameter is required when provided
                   action source is business_messaging. Valid value could be
                   messenger, whatsapp and instagram."

The payload looked fine — event_name, event_time, a hashed external_id, and action_source: 'business_messaging'. So why the 400?

The gotcha

business_messaging is not a generic "it happened in a chat" source. Meta ties it to its own messaging products, and it demands a companion messaging_channel whose only valid values are messenger, whatsapp, instagram. Telegram isn't on that list — there's no channel you can hand it — so the request can never validate.

The instinct is to try app next. Don't. app drags in a required app_data block: the extinfo array, advertiser tracking flags, the whole mobile-SDK surface. You don't have that from a bot, and you don't want to fake it.

The fix

For a self-hosted Telegram bot, the right source is plain other:

"action_source": "other"

other has no extra mandatory fields. You need event_name, event_time, action_source, and a user_data with at least one identifier. A SHA-256 hashed Telegram user id as external_id is enough to clear the 400. One-line change.
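For reference, a minimal sketch of such an event built in Python. It assumes events go in the request body's data array and that external_id is sent as a list of hashes; hash_id and build_event are illustrative helpers, and the user id and timestamp are invented.

```python
import hashlib
import time


def hash_id(raw) -> str:
    """SHA-256 hex digest of the identifier, trimmed and lowercased first."""
    return hashlib.sha256(str(raw).strip().lower().encode("utf-8")).hexdigest()


def build_event(event_name: str, telegram_user_id: int, event_time=None) -> dict:
    """Smallest event that uses action_source 'other' with one identifier."""
    return {
        "event_name": event_name,
        "event_time": int(event_time if event_time is not None else time.time()),
        "action_source": "other",
        "user_data": {"external_id": [hash_id(telegram_user_id)]},
    }


# Invented user id and timestamp, for illustration only.
payload = {"data": [build_event("Subscribe", 123456789, event_time=1750000000)]}
```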

Making it actually attribute

Not crashing is the low bar. To tie a Subscribe or Purchase back to the ad that caused it, you need the click id:

Capture fbc. Your ad sends people to a deep link: t.me/yourbot?start=.... Meta appends fbclid to that destination. Pack the fbclid into the start payload, read it on /start, and build fbc = fb.1.[unix_time_ms].[fbclid], where the timestamp is the click time in milliseconds. Send it in user_data next to external_id. This is the single biggest lever for matching.
Purchase needs money. Add custom_data with value and currency, or there's no ROAS to compute later.
Verify before you trust it. Events Manager has a Test Events tab — send with a test_event_code and watch the events land and match before you point real traffic at it.
Dedupe if a web pixel fires the same events: same event_id on both sides and Meta collapses them.
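The fbc, custom_data and event_id steps above can be sketched like this. It assumes the fbc timestamp is in milliseconds, which is how Meta describes the _fbc cookie value, so check the current reference. build_fbc and enrich_purchase are illustrative names.

```python
import time


def build_fbc(fbclid: str, clicked_at=None) -> str:
    """Format a click id as fb.1.<unix time in ms>.<fbclid>."""
    seconds = clicked_at if clicked_at is not None else time.time()
    return f"fb.1.{int(seconds * 1000)}.{fbclid}"


def enrich_purchase(event: dict, fbclid, value: float,
                    currency: str, event_id: str) -> dict:
    """Return a copy of the event with click id, money and a dedup key added."""
    enriched = dict(event)
    user_data = dict(event.get("user_data", {}))
    if fbclid:  # only present when the user arrived from an ad click
        user_data["fbc"] = build_fbc(fbclid)
    enriched["user_data"] = user_data
    enriched["custom_data"] = {"value": value, "currency": currency}
    enriched["event_id"] = event_id  # must match the pixel's event_id to dedupe
    return enriched
```

One caveat on the start payload: Telegram limits it to 64 characters, and fbclid values can run longer than that, so you may need to store the fbclid server-side and pass a short key instead.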

The 400 is a five-second fix. The attribution is the part that actually pays for itself.

Lighthouse Gave My Site 100/100. The Site Was Down.

Boris Kl — Thu, 11 Jun 2026 18:56:58 +0000

Yesterday I ran PageSpeed Insights on a site I manage. Performance: 100/100. Green circle, confetti, the works.

One problem: the screenshot in the report showed a Cloudflare block page — "Sorry, you have been blocked."

Lighthouse didn't measure my site. It measured the error page my WAF served to Google's crawler. And error pages are, of course, blazing fast.

How this happens

If you put Cloudflare in front of a site and turn the security dial up (Bot Fight Mode, aggressive WAF rules, country blocks), you'll eventually block more than bots:

PageSpeed Insights / Lighthouse — measures a block page, reports nonsense
Uptime monitors — see HTTP 403 with a 200-ish body, or vice versa, and lie to you either way
Google's crawler itself — and that one quietly costs you rankings

The nasty part is the silence. Nothing looks broken from your own browser, because you're whitelisted by your own cookies, IP reputation, or login session. The tools just start telling you fairy tales.

The five-minute audit

Open Cloudflare → Security → Events. Filter the last 7 days. Look at what's actually being challenged or blocked — you'll usually find a legit service in there within a minute.
Check the user agents: Chrome-Lighthouse, GoogleOther, Googlebot, your uptime checker. If they show up here, that traffic never reached your site.
Verify bots properly: Cloudflare has a "Verified Bots" category — allow it instead of hand-maintaining user-agent allowlists (user agents are trivially faked; verified-bot checks aren't).
Re-run your measurement and look at the rendered screenshot, not just the score. The screenshot is the only part of a Lighthouse report that can't lie to you.

Rules I now follow

Never trust a perfect score. 100/100 on a real WordPress/commerce site is a smell, not an achievement. Real sites have real images and real JavaScript.
Check the screenshot first, score second.
After every WAF change, re-test from outside: different network, curl with a Googlebot UA, or just PageSpeed Insights — and read the Events log after.
Monitoring that runs behind your own allowlist isn't monitoring. It's a mirror.
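The "re-test from outside" step is easy to script. Here is a rough sketch using only the standard library. The marker strings other than "you have been blocked" are my guesses at typical challenge-page text, and fetch and looks_blocked are made-up names. Run it from a network that isn't on your own allowlist.

```python
import urllib.error
import urllib.request

# The first marker is from the block page quoted above; the rest are guesses.
BLOCK_MARKERS = ("you have been blocked", "attention required", "just a moment")


def fetch(url: str, user_agent: str, timeout: float = 10.0):
    """Return (status, body) for a URL, even when the server answers 4xx/5xx."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


def looks_blocked(status: int, body: str) -> bool:
    """True if the response looks like a WAF block page rather than the site."""
    if status == 403:
        return True
    text = body.lower()
    return any(marker in text for marker in BLOCK_MARKERS)
```

Calling looks_blocked(*fetch(url, ua)) for each user agent you care about gives a quick yes/no, and it checks the body too, so a block page served with a 200 still gets caught.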

Cloudflare is still the best free thing that ever happened to small sites — I run my own production behind it and it has eaten real attack waves for breakfast. But a security layer you configured and never audited is just a random traffic filter with good branding.

Five minutes in the Events log. That's the whole tip.

I set up Claude Code for a real production project. Here's what actually earned its keep

Boris Kl — Sat, 06 Jun 2026 13:53:19 +0000

Everyone's got a "10 AI coding tricks" post. This isn't that. This is what's left after three weeks of running Claude Code on a real project — a bilingual booking bot for a beauty salon (Telegram + WhatsApp, Postgres, Google Calendar) — once the novelty wore off and only the useful parts survived.

Out of the box, Claude Code is a very smart intern with amnesia. Every session it shows up brilliant and clueless. The whole game is fixing the clueless part. Four things did that for me: a CLAUDE.md file, two custom agents, one skill, and two hooks. Everything else I tried, I deleted.

CLAUDE.md: the file that pays rent every single day

CLAUDE.md sits in your repo root and gets read at the start of every session. Mine started as three lines. It grew every time the assistant did something I had to undo.

That's the trick, honestly. Don't write CLAUDE.md upfront — grow it from failures. Mine now includes things like:

## Architecture rules
- Business logic lives in src/core/ and must not know about
  Telegram or WhatsApp. Channel code lives in src/adapters/.
- All times stored in UTC; convert only for display.
- Booking creation must stay double-booking-safe — never remove
  locks or constraints around it.

## Working agreements
- Before "done": run typecheck && lint && test and show the result.
- Schema changes go through a migration file. Always.
- Prefer the smallest diff that does the job.

Each of those lines exists because the assistant once did the opposite. It put Telegram-specific code in core logic — new rule. It "fixed" a timezone bug by converting at storage time — new rule. It reported "done" with failing types — new rule.

Three weeks in, I almost never repeat an instruction. That file is the difference between an assistant and a goldfish.

Custom agents: the reviewer I argue with

Custom agents live in .claude/agents/ as markdown files with a system prompt. You invoke them for a specific job, they do it with their own instructions and tool limits, and they don't pollute your main session's context.

The one that earns its keep daily is a code reviewer:

---
name: code-reviewer
description: "Reviews changes for bugs and security issues before they are committed."
tools: Read, Grep, Glob, Bash
---

You are a strict but practical code reviewer.
Check, in this order: correctness (timezone boundaries,
double-booking windows), security (unvalidated webhook input,
SQL built by concatenation, missing signature checks),
project rules from CLAUDE.md, and whether behavior changed
without a test changing.
Report findings ordered by severity, with file:line and a
concrete fix. If something is fine, don't pad the review.

The point isn't that it catches everything. The point is that it's a different context with one job. The main session wrote the code and is biased toward liking it. The reviewer agent reads it cold. It regularly catches things the main session waved through — a webhook handler that trusted message_id without checking the signature, a slot calculation that broke across midnight.

It found the midnight bug before my client's customers did. That one agent paid for the whole setup.

A skill that stops me from skipping steps

Skills are reusable workflows — a SKILL.md file describing a procedure the assistant follows when you invoke it. I have exactly one that matters, /add-feature:

Restate what we're building, confirm.
List files that will change and why. Smallest possible diff.
Implement, following CLAUDE.md.
Write tests for the changed units.
Run the code-reviewer agent on the diff. Fix what it finds.
Summarize: what changed, how to try it, what I must do manually.

Nothing clever. It's a checklist. But here's the thing about checklists — they work precisely because on the fifth feature of the day, I would skip the review step. The skill doesn't get tired at 11pm. Pilots figured this out decades ago; we're just catching up.

Hooks: the two-line insurance policy

Hooks run shell commands on events. I only need two.

The first blocks any edit to secrets files. The assistant has no business touching .env, ever, and now it physically can't — a PreToolUse hook checks the file path and exits with an error if it looks like secrets. Cost me five minutes to write. Worth it the first time a refactor tried to "helpfully" update an env var.

The second runs the typecheck after every file edit and pipes problems straight back into the session. The assistant sees its own type errors immediately instead of discovering them at the end, which means it fixes them while the context is hot. This one change cut my "it said done but nothing compiles" rate to roughly zero.

{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{ "type": "command",
        "command": "bash .claude/hooks/protect-secrets.sh" }]
    }],
    "PostToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{ "type": "command",
        "command": "npm run --silent typecheck" }]
    }]
  }
}
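The secrets check itself is tiny. The hook above is a bash script whose contents aren't shown, so this is a rough Python equivalent rather than the original. It assumes the hook receives a JSON event on stdin with the target path under tool_input.file_path, and that exit code 2 signals "block this"; both are worth confirming against the current hooks reference. SECRET_NAMES is a made-up list.

```python
import json
import sys
from pathlib import PurePosixPath

SECRET_NAMES = {".env", "secrets.json", "credentials.json"}  # hypothetical list


def is_secret_path(path: str) -> bool:
    """True for .env, .env.<anything> and the other names listed above."""
    name = PurePosixPath(path.replace("\\", "/")).name
    return name in SECRET_NAMES or name.startswith(".env.")


def main(stdin=sys.stdin) -> int:
    """Read the hook event and return the exit code the hook should use."""
    event = json.load(stdin)
    path = event.get("tool_input", {}).get("file_path", "")
    if path and is_secret_path(path):
        print(f"Blocked: {path} looks like a secrets file", file=sys.stderr)
        return 2  # assumed to mean "block the tool call"
    return 0

# As a hook script this would end with: sys.exit(main())
```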

What I tried and deleted

For honesty's sake: I also built an agent for writing commit messages (the main session does this fine), a skill for deployments (too risky to automate, I want my hands on that), and a hook that auto-ran the full test suite on every edit (made everything crawl — the typecheck is the right granularity; full tests run at review time).

If a piece of setup doesn't save you something every day, it's not configuration, it's clutter.

The honest summary

Claude Code without setup is a talented freelancer on their first day, every day. With a grown-from-failures CLAUDE.md, one cold-eyed reviewer agent, one checklist skill and two hooks, it's closer to a colleague who's been on the project for a month.

The setup took me about two hours total, spread over days, mostly as reactions to things that annoyed me. The payback is that I now ship features for a production bot — payments, reminders, a wait-list — in evenings, alone, without the quality dropping.

Start with CLAUDE.md. Add a reviewer agent the first time you catch a bug you should've caught. Grow the rest from your own failures — they're better teachers than my list anyway.

One year of self-hosted n8n on a $6 Hetzner VPS

Boris Kl — Wed, 27 May 2026 11:49:40 +0000

Twelve months ago I moved my workflow automation off Zapier and onto a single Hetzner CX22 — €4.51/mo, 2 vCPU, 4 GB RAM, 40 GB disk. One Docker host, one n8n container, one Postgres, one Caddy reverse proxy. It's run four production workflows continuously since then, with one outage I'll get to below.

This post is not a "n8n vs Zapier" pitch. It's a year of operating notes — what stayed cheap, what broke, what I'd do differently.

The actual setup

Hetzner Cloud CX22 (Falkenstein)
├── Docker
│   ├── n8n (latest stable)
│   ├── postgres:15
│   └── caddy (with automatic TLS)
├── UFW (22, 80, 443 only)
└── borgbackup → Hetzner Storage Box (€3.81/mo)

The Caddy bit matters more than people think. n8n's built-in HTTP is fine for localhost, but webhook receivers need real TLS, and Caddy gives you ACME, HTTP→HTTPS redirect, and per-domain certificates with zero config. Caddyfile is six lines. You don't have to think about it again.

What's running

Four workflows. None of them invented; all real:

Telegram bot dispatcher. Inbound webhook → routing logic → either a Postgres write or a downstream service call. About 40 events/day average, occasional 200-event spikes.
RSS aggregator → Telegram channel. Polls 12 feeds every 15 min, dedupes by URL hash in Postgres, posts new items to a private channel. ~30 posts/day.
Form submission → CRM-lite. A few WordPress sites hit a webhook on form submit; n8n writes to Postgres, sends an email confirmation, and logs to a Discord channel for me.
Daily reporting cron. Pulls metrics from three internal APIs at 06:00, builds a markdown digest, emails it, also posts it to Slack.

None of these need millisecond latency. All of them benefit from being one config-pull away from changing.

The cost breakdown (12 months)

Item	Monthly	Annual
Hetzner CX22	€4.51	€54.12
Storage Box (backup)	€3.81	€45.72
Domain (.dev)	—	€12
Total	~€9.30	~€112

Equivalent Zapier seat for the same task volume would have been ~$30-50/month depending on the plan, so we're looking at roughly €350-500 saved over the year. Not life-changing. The real win is something else, which I'll get to.

What broke (the one outage)

Month four. n8n upgraded from v1.x to a major release. I'd been running docker compose pull weekly without pinning, because "it's been fine." The upgrade introduced a breaking change to how credentials were stored. Container started; UI loaded; every workflow showed "credentials missing" and refused to execute.

Root cause: I had no version-pin and no upgrade test. The backup was fine (borg snapshots intact), but the restore-and-investigate took me a Saturday afternoon.

What I changed:

Pinned n8n image to a specific minor version (n8nio/n8n:1.45.x).
Added a staging instance on a second Hetzner VPS (€3/mo CX21) that gets the upgrade first.
Subscribed to the n8n releases RSS feed so I see breaking changes before I pull.

In hindsight: a SaaS would have done the upgrade for me and either Bigger Things would have broken (multi-tenant blast radius) or none of this would have ever happened. Pick your trade.

The actual win (it's not the money)

The €350/year doesn't matter. What matters is that workflows live in git-tracked JSON exports I own, on infrastructure I own.

When a workflow changes, I commit the n8n export. When something breaks, I can diff yesterday's export against today's and see what shifted. When the credentials database gets weird, I open psql and look at the rows. When the webhook target changes, I write the new URL in a Caddyfile and reload — no support ticket, no rate limit on changes, no "this requires an upgrade to the Team plan."
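The "diff yesterday against today" step can be a few lines of Python. This sketch assumes the exports are the JSON files n8n's export produces, and it normalizes key order first so only real changes show up. normalize and diff_exports are illustrative names.

```python
import difflib
import json


def normalize(export_text: str) -> list[str]:
    """Re-serialize an export with sorted keys so formatting noise disappears."""
    data = json.loads(export_text)
    return json.dumps(data, indent=2, sort_keys=True).splitlines()


def diff_exports(old_text: str, new_text: str) -> list[str]:
    """Unified diff between two workflow exports, as a list of lines."""
    return list(difflib.unified_diff(
        normalize(old_text),
        normalize(new_text),
        fromfile="yesterday",
        tofile="today",
        lineterm="",
    ))
```

An empty list means nothing changed; otherwise the lines starting with - and + point straight at the node or parameter that shifted.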

On Zapier, the same change graph is a black box. Some changes are free, some require the next plan tier, and you don't always know which until the click. With n8n on a box you control, the question "can I do this?" reduces to "is it physically possible?" — and the answer is almost always yes.

Things I'd do differently if starting today

Pin the image from day one. Whatever the cost in "missing the new shiny feature for a week" is dwarfed by the cost of an unscheduled Saturday.
Use external Postgres, not the docker-compose one. Hetzner offers managed Postgres now. €11/mo, automatic backups, no "my container restarted and ate the WAL" risk. I'd take the €11 hit gladly.
Don't put auth on the webhook receivers via n8n itself. Put it at Caddy or a separate gateway. n8n's auth model exists, but you can't reuse it for non-n8n endpoints, and you'll regret the coupling.
Write the runbook first, not after the first outage. "How do I restore from borg," "how do I roll the credentials key," "where are the env files" — five minutes to write, an hour to rediscover when stressed.
Don't put more than 10 workflows on one box. Memory usage scales with concurrent execution, and a runaway loop in one workflow will starve the others. If you go past 10, split them across two n8n instances.

When NOT to self-host

This setup works because the four workflows are mine, the data is mine, and downtime measured in hours (not minutes) is acceptable. If any of those three change, the calculus changes.

If a client depends on the webhook receiver having 99.95% uptime, this single-box setup is wrong. Use n8n Cloud or a multi-node deployment.
If the workflows touch regulated data (HIPAA, PCI, GDPR's stricter applications), don't reach for the cheapest box. Use a vendor who'll sign a DPA and an audit-ready hosting tier.
If you're a team of more than three and people need fine-grained access, n8n self-host's RBAC is workable but not great. The Cloud tier handles teams better.
If your time is worth more than €30/month, and the workflows are simple enough that Zapier or Make.com handles them without ceremony, the savings aren't worth the operating load. Pay for the SaaS.

The five-line take

Self-hosted n8n on a cheap VPS is one of those rare cases where the "boring" answer is also the cheap one and also the powerful one. Run it for a year before you decide it's not for you. Pin your versions. Write the runbook. Don't put it on the same box as anything else important.

— Boris (@lamastoma)

A Production Python Telegram Bot Was Crashing Every 2 Hours. The Fix Was 18 Lines.

Boris Kl — Wed, 20 May 2026 13:28:23 +0000

"If you see cascading errors, find the first thing that fails and stop reading the log there. Everything after the first failure is the system reacting to the first failure."

A production Python Telegram bot I was looking after started crashing every 2-3 hours. The traceback was a horror show — TelegramRetryAfter, then asyncio.TimeoutError, then sqlite3.OperationalError: database is locked, then 47 leaked sessions, then the process got OOM-killed, then systemd restarted it. Then it happened again, 140 minutes later, like clockwork.

The temptation when you see this kind of cascade is to throw the whole architecture out. "SQLite can't handle our scale, let's move to Postgres." "Bare asyncio is too low-level, let's add a queue." "Let's rewrite it in Go."

I didn't do any of those things. The fix was 18 lines of code in one middleware file. The bot has been up for weeks since.

Here's the diagnosis, the fix, and the takeaway. The code is real (anonymized of any client specifics) and the numbers are real.

The symptoms

Stack: Python 3.12, aiogram 3.x, SQLite for user state, asyncio everywhere. Volume: about 4,000 daily incoming messages. Not high-throughput.

The log every 140 minutes looked like this:

[14:22:01] ERROR  aiogram.TelegramRetryAfter: flood control, retry in 28s
[14:22:03] ERROR  asyncio.TimeoutError in update handler
[14:22:05] WARNING bot.session not closed (47 active)
[14:22:08] ERROR  sqlite3.OperationalError: database is locked
[14:22:14] ERROR  ...same pattern, multiplying...
[14:22:20] ERROR  process killed by OOM
[14:22:21] INFO   systemd: restarted

Process up ~140 minutes. Then the cascade. Then restart. Repeat.

What looked plausible (and was wrong)

When I started looking, the first hypothesis was "SQLite is the bottleneck — it can't handle the concurrency." That's the most obvious thing to say when you see database is locked in a log.

It was wrong. Here's why I dropped it after 30 minutes:

4,000 messages a day is nothing for SQLite. SQLite handles tens of thousands of writes per second on modest hardware. If we were hitting a SQLite ceiling, we'd be hitting it under steady load, not in sudden bursts. The 140-minute interval was the giveaway — something was accumulating, not saturating.

The second hypothesis was "We're hitting Telegram API rate limits." That's what TelegramRetryAfter literally says. But again, 4,000 messages a day = roughly 1 message every 20 seconds on average. Telegram's per-bot rate limit is 30 messages per second. We weren't even in the same order of magnitude.

So whatever was happening was bursty, not steady-state. And the bot was somehow turning a steady stream of inbound updates into a burst of outbound API calls.

The actual root cause

Here's what was happening, step by step:

A user sends a message. aiogram receives it as an update.
The handler runs, does some work, and sends a reply to Telegram.
Normally: that reply goes out, the handler returns, the asyncio task ends, the bot.session HTTP connection is released.
What actually happened: no throttle middleware existed. If 5-10 users happened to message in the same second (which happens during peak hours), the bot fired 5-10 outbound sendMessage API calls concurrently.
Five or ten outbound requests inside one second pushed us past Telegram's per-second rate limit. Telegram answered with 429 Too Many Requests and a retry_after value in the response parameters.
aiogram raised TelegramRetryAfter. But the handler that raised it was waiting on the API response — it couldn't release its HTTP session until the retry window closed (28 seconds in the log above).
While that handler was waiting, the next inbound update hit the same handler code. Another async task spawned. Another bot.session connection opened. Another wait.
Now we have two stuck tasks, each holding a connection, each blocked on retry_after. Both tasks also need to update the user's row in SQLite. SQLite doesn't lock individual rows; it locks the whole database for the first writer. The second writer waits, and if it waits too long it fails with database is locked. Deadlock potential.
Multiply this by 10 minutes of bursty traffic. Now you have 47 leaked sessions, an SQLite deadlock, and a Python process eating memory because tasks aren't completing.
OOM killer hits. Systemd restarts. Cycle resets.

The cascade had one cause: no rate limit on the bot's inbound side. Everything downstream was just the system reacting to the upstream pressure.

The fix — 18 lines

A throttle middleware. Drop incoming updates from a user if they already had a message in the last second. That's it.

# middleware.py
from aiogram import BaseMiddleware
from aiogram.types import Update
from cachetools import TTLCache


class ThrottleMiddleware(BaseMiddleware):
    """Drop second-message-within-N-seconds per user.

    Without this, bursty inbound traffic translates 1:1 into bursty
    outbound API calls and trips Telegram's flood control.
    """

    def __init__(self, rate_limit: float = 1.0):
        self.cache = TTLCache(maxsize=10_000, ttl=rate_limit)

    async def __call__(self, handler, event: Update, data):
        user_id = event.message.from_user.id if event.message else None
        if user_id and user_id in self.cache:
            return  # silently drop — user is over their rate limit
        if user_id:
            self.cache[user_id] = True
        return await handler(event, data)

And wire it up plus a clean shutdown:

# main.py
from aiogram import Bot, Dispatcher

bot = Bot(token=BOT_TOKEN)
dp = Dispatcher()

dp.update.middleware(ThrottleMiddleware(rate_limit=1.0))


async def on_shutdown():
    """Close the bot session explicitly. Otherwise sessions leak
    on graceful shutdown and the next start hits a connection pool
    in a weird state.
    """
    await bot.session.close()


dp.shutdown.register(on_shutdown)

That's 18 lines of production code plus one test:

# test_middleware.py
import pytest
from middleware import ThrottleMiddleware


@pytest.mark.asyncio
async def test_throttle_drops_rapid_second_message(mocker):
    middleware = ThrottleMiddleware(rate_limit=1.0)
    handler = mocker.AsyncMock(return_value="processed")

    event = make_event(user_id=123)  # helper to build a fake aiogram Update

    # First message — goes through
    result1 = await middleware(handler, event, {})
    assert result1 == "processed"

    # Second message same user, same second — dropped
    result2 = await middleware(handler, event, {})
    assert result2 is None

    handler.assert_called_once()

That's the whole patch.
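The test leans on a make_event helper that isn't shown. One possible stand-in, assuming the middleware only ever reads event.message.from_user.id, is a few nested SimpleNamespace objects instead of a real aiogram Update. The with_message flag is my addition, there to cover the "no message" branch.

```python
from types import SimpleNamespace


def make_event(user_id, with_message=True):
    """Fake update exposing only the attributes the throttle middleware reads."""
    if not with_message:
        return SimpleNamespace(message=None)
    sender = SimpleNamespace(id=user_id)
    return SimpleNamespace(message=SimpleNamespace(from_user=sender))
```

A duck-typed fake keeps the test independent of aiogram's model validation; the trade-off is that it won't notice if the real Update shape changes.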

Why this works

The fix doesn't make SQLite faster. It doesn't add a queue. It doesn't change anything about how the handlers process messages. It just stops the upstream pressure before it cascades downstream.

Once incoming updates are rate-limited per-user at 1 per second, the bot never has 10 concurrent outbound API calls. It has at most 1-2. Telegram never gets angry. TelegramRetryAfter never fires. Handlers never get stuck waiting. Sessions never leak. SQLite never sees writers piling up behind a stuck one.

The cascade isn't a chain. It's a tree, and the throttle cuts the tree at the root.

The result

Numbers (real, from production):

First 4 hours after deploy: zero TelegramRetryAfter. Zero TimeoutError. Session count stable at 1-2 (vs. climbing past 40 every two hours before).
First 24 hours: zero errors of any kind in the log.
First 7 days: zero crashes. Zero systemd restarts.

The bot has been up continuously since deploy. Same SQLite. Same asyncio. Same handlers. The only thing that changed is the throttle middleware.

What I'd tell a junior on the team

A few generic takeaways that apply far beyond this specific bug:

1. Find the first failure in the log and stop reading. When you see cascading errors, everything after the first failure is the system reacting to the first failure. Don't try to "fix" the downstream errors. Find the upstream cause.

2. Unchecked upstream pressure is the cause about 80% of the time, in my experience, when you see async-Python cascades. When the downstream component (SQLite, HTTP client, worker pool) looks stuck, it's almost always drowning in work the upstream is sending too fast. Rate-limit the upstream first.

3. The temptation to rewrite is almost always wrong early in diagnosis. "Rewrite in Go" / "switch to Postgres" / "add a queue" are valid responses to real scale problems. They're not valid responses to "I haven't figured out the bug yet." Spend an hour with the actual logs first.

4. Volume matters less than burstiness. A system averaging 4k messages/day can absolutely fall over from 10 messages in one second. The metric you care about is peak concurrency, not total throughput.

5. Test the throttle as a unit, not as an integration. The fix above has one test (12 lines). It doesn't try to spin up a real bot. It just verifies the middleware behavior in isolation. That's enough — the actual production behavior is downstream of this contract holding.
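
The burstiness point in takeaway 4 is easy to put numbers on. The sketch below is illustrative only: the function names are made up for this example, and the timestamps are synthetic, not production data. It compares the daily average rate with the worst single second:

```python
from collections import Counter

SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(messages_per_day: int) -> float:
    """Average messages per second over a full day."""
    return messages_per_day / SECONDS_PER_DAY


def peak_per_second(timestamps) -> int:
    """Largest number of messages that landed in the same whole second."""
    buckets = Counter(int(ts) for ts in timestamps)
    return max(buckets.values(), default=0)


# Synthetic example: a 10-message burst plus two isolated messages
burst = [100.0 + i * 0.05 for i in range(10)]
timestamps = burst + [500.0, 900.0]

avg = average_rate(4000)            # about 0.046 messages per second
peak = peak_per_second(timestamps)  # 10 messages in the worst second
```

At 4k messages/day the average is roughly one message every 22 seconds, so 10 messages in one second is about 216 times the average rate. A dashboard that only shows daily totals will never surface that.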

Code

The middleware and the test are public:

→ github.com/lamas51/claude-code-templates (case studies folder)

The same project also has the Claude Code agent/skill/hook templates I deploy across Go, Python, and WordPress projects. Feel free to fork.

About me

I'm Boris, an IT pro since 1999. I run production code across Go, Python, and React, mostly for small and mid-size businesses. For the last 18 months I've been heavy on Claude Code workflows.

If you have a production Python service throwing similar cascades and want help diagnosing it, I take this kind of work through Fiverr (clean scope, escrow, no off-platform contact):

→ fiverr.com/lamastoma — Python / n8n / Telegram bot bug fixing in 24 hours

Open to questions in the comments — happy to dig into specifics if you're seeing something similar.

Anonymized: no client data, but the diagnosis flow and final patch are the actual ones I shipped.
