DEV Community: Coded Parts

Reading a Paginated API Without Holding the Whole Thing in Memory

Parthipan Natkunam — Sat, 13 Jun 2026 15:36:53 +0000

Your API hands out 50 records at a time across 400 pages. You need all of them. You do not need them all at once.

Here's a very familiar situation that shows up constantly on the backend. Some API returns data in pages, 50 or 100 records at a time, and you need to walk every page: sync them to your database, export them to a file, run a report. The endpoint gives you a cursor or a page number and you keep asking until there's nothing left.
The way most of us write it the first time looks like this:

async function getAllRecords() {
  const all = [];
  let cursor = 0;
  while (cursor !== null) {
    const { records, nextCursor } = await fetchPage(cursor);
    all.push(...records);
    cursor = nextCursor;
  }
  return all;
}

const everything = await getAllRecords();
for (const record of everything) {
  process(record);
}

It works. At four hundred records it's fine. The trouble starts when the dataset grows, and it has three separate problems hiding in it.

It holds the entire dataset in memory before you touch a single record. It's all or nothing: if page 380 fails, you've thrown away the 19,000 records you already fetched. And it's eager. You can't start processing record one until the very last page has landed, even if all you wanted was the first ten.

There's a shape in JavaScript built for exactly this, and if you read the first two posts in this series you already have both halves of it.

Two ideas you've already seen

In the CSV post, we pulled rows out of a huge file one at a time with a generator, so the file never fully loaded into memory. Lazy. Pull-based. You ask for the next row, you get the next row, nothing more.

In the async/await post, we saw that a generator can pause at a yield and resume later. A generator can hold its place across an asynchronous gap.

Put those together. A generator that pulls data lazily, and can pause to await something between pulls. That's an async generator, and it's the natural tool for walking a paginated API. You pull records one at a time, and behind each pull it quietly fetches the next page only when you've run out of the current one.

The async generator

Here it is. Notice it's async function*, with the star, and that it both awaits and yields.

async function* allRecords() {
  let cursor = 0;
  while (cursor !== null) {
    const { records, nextCursor } = await fetchPage(cursor);
    for (const record of records) {
      yield record;          // hand out one record at a time
    }
    cursor = nextCursor;     // remember where we are for next time
  }
}

Here's what it does: It fetches a page, awaiting it like any async function. Then it yields each record in that page one by one.
The function pauses at every yield and sits there, holding its cursor, until someone asks for the next record.
You consume it with for await...of, which is a normal for loop that knows how to wait:

for await (const record of allRecords()) {
  process(record);
}

That reads almost exactly like the eager version's final loop. The difference is what's happening underneath. Each turn of this loop might quietly trigger a network fetch, or might just hand you the next record already sitting in the current page. The loop doesn't care. You write straight-line code and the paging disappears.

I ran this against a fake API holding 20,000 records in pages of 50. It read all of them, in order, no gaps, across exactly 400 fetches. Which is the boring, correct result. The interesting result is what happens when you don't want all of them.
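If you want to reproduce that kind of run, a fake API along those lines can be sketched in a few lines. The makeFakeApi helper, the { records, nextCursor } shape it returns, and the fetch counter are assumptions made for illustration, not the harness behind the numbers in this post. Here allRecords takes the api as a parameter so the sketch stays self-contained.

```javascript
// Hypothetical in-memory stand-in for a paginated API.
function makeFakeApi(totalRecords, pageSize) {
  const api = {
    fetches: 0,
    async fetchPage(cursor) {
      api.fetches += 1;
      const end = Math.min(cursor + pageSize, totalRecords);
      const records = [];
      for (let id = cursor; id < end; id++) records.push({ id });
      // A null cursor signals "no more pages".
      return { records, nextCursor: end < totalRecords ? end : null };
    },
  };
  return api;
}

// Same generator as above, with the API passed in.
async function* allRecords(api) {
  let cursor = 0;
  while (cursor !== null) {
    const { records, nextCursor } = await api.fetchPage(cursor);
    for (const record of records) yield record;
    cursor = nextCursor;
  }
}

// Walk every record, checking order and counting fetches.
async function walkEverything() {
  const api = makeFakeApi(20000, 50);
  let count = 0;
  let inOrder = true;
  for await (const record of allRecords(api)) {
    if (record.id !== count) inOrder = false;
    count += 1;
  }
  return { count, inOrder, fetches: api.fetches };
}

walkEverything().then((summary) => console.log(summary));
// { count: 20000, inOrder: true, fetches: 400 }
```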

The payoff: you can stop

Here's the thing the eager version can never do. Say you only want the first ten records.

const firstTen = [];
for await (const record of allRecords()) {
  firstTen.push(record.id);
  if (firstTen.length === 10) break;
}

With the collect-everything approach, getting ten records still costs you all 400 page fetches, because it loads the whole dataset before you see record one. With the async generator, I counted the fetches:

pages_fetched=1

One fetch. Not 400. When you break, the generator is paused at a yield, and breaking out of the loop means nobody ever asks it for record eleven. So it never runs the loop body again. It never fetches page two. The laziness here is the entire advantage: you only use the compute for the pages you actually process.

What this does to memory

The eager version's real cost is that it keeps every record alive at once. The streaming version holds about one page at a time. To show the gap more accurately, I measured peak heap growth for both, at three dataset sizes, with the same chunky records:

dataset       collect-all peak     stream peak
10,000 rows        4.0 MB             3.7 MB
100,000 rows      36.1 MB            12.4 MB
500,000 rows     161.1 MB            15.8 MB

Collect-all grows with the dataset: ten times the rows, roughly ten times the memory. The streaming version barely moves, because at any moment it's holding one page and one record, not half a million of them. At ten thousand rows the difference hardly matters. At half a million it's the difference between a job that runs and a job that gets killed.
That's the same lesson as the CSV post, now pointed at the network instead of the disk.

It composes, and stays lazy

The eager array has one more hidden tax. Every transform you bolt on, a filter, a map, walks the whole array again and builds another whole array. With async generators you pipe one into the next and the laziness survives the whole chain.

async function* onlyEven(source) {
  for await (const record of source) {
    if (record.id % 2 === 0) yield record;
  }
}

const evens = [];
for await (const record of onlyEven(allRecords())) {
  evens.push(record.id);
  if (evens.length === 5) break;   // first 5 even records, then stop
}

I asked this pipeline for the first five even records and broke. It fetched one page. The filter pulls from allRecords one record at a time, and allRecords fetches one page at a time, and nothing runs ahead of what you've actually consumed. You can stack filters and maps like this and the chain still only does the work you draw out of the end of it.
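Stacking can go one step further by turning "stop after n" into a stage of its own. The sketch below is one way to do that. The take helper and the makeFakeApi stand-in are hypothetical names introduced here, and allRecords takes the api as a parameter so the example runs on its own.

```javascript
// Hypothetical fake API that counts how many pages were requested.
function makeFakeApi(totalRecords, pageSize) {
  const api = {
    fetches: 0,
    async fetchPage(cursor) {
      api.fetches += 1;
      const end = Math.min(cursor + pageSize, totalRecords);
      const records = [];
      for (let id = cursor; id < end; id++) records.push({ id });
      return { records, nextCursor: end < totalRecords ? end : null };
    },
  };
  return api;
}

async function* allRecords(api) {
  let cursor = 0;
  while (cursor !== null) {
    const { records, nextCursor } = await api.fetchPage(cursor);
    for (const record of records) yield record;
    cursor = nextCursor;
  }
}

async function* onlyEven(source) {
  for await (const record of source) {
    if (record.id % 2 === 0) yield record;
  }
}

// Hypothetical helper: pass through at most n items, then stop.
// Returning from inside for await...of also closes the upstream generator.
async function* take(source, n) {
  if (n <= 0) return;
  let taken = 0;
  for await (const item of source) {
    yield item;
    taken += 1;
    if (taken === n) return;
  }
}

async function firstFiveEven() {
  const api = makeFakeApi(20000, 50);
  const ids = [];
  for await (const record of take(onlyEven(allRecords(api)), 5)) {
    ids.push(record.id);
  }
  return { ids, fetches: api.fetches };
}

firstFiveEven().then((result) => console.log(result));
// { ids: [ 0, 2, 4, 6, 8 ], fetches: 1 }
```

The consumer loop no longer needs its own counter or break. The limit lives in the pipeline, and the chain still fetches a single page.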

The part that bites people: cleanup

Now the most overlooked gap, because this is where streaming code leaks in production.

Let's say your generator doesn't fetch from a stateless API. Say it opens a database cursor or a file handle and reads from it. If the consumer breaks early, like in the first-ten example, the generator is left paused forever. Does the handle ever close?

It does, if you write it right. When you break out of a for await...of loop, the loop calls .return() on the generator under the hood. That resumes the paused generator just long enough to run any finally block before it shuts down. So you put cleanup in finally and it fires even on early exit:

async function* withCleanup() {
  let i = 0;
  try {
    while (true) yield i++;
  } finally {
    // prints even when you break
    console.log('finally ran: connection closed');
  }
}

The important point here is: any resource an async generator holds open goes in a try, and its release goes in the matching finally. Skip that and an early break will quietly leak connections and will probably wake you up at 2 AM through PagerDuty alerts.
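To see the rule hold under an early break, here is a small runnable sketch. The openFakeCursor object is a made-up stand-in for a database cursor or file handle, not a real driver API. It only records whether close() was called.

```javascript
// Hypothetical resource: stands in for a DB cursor or file handle.
function openFakeCursor() {
  return {
    closed: false,
    position: 0,
    async readNext() {
      return this.position++;
    },
    async close() {
      this.closed = true;
    },
  };
}

async function* rowsFrom(cursor) {
  try {
    while (true) {
      yield await cursor.readNext();
    }
  } finally {
    // Runs on a thrown error and on an early break by the consumer.
    await cursor.close();
  }
}

async function readThreeThenStop() {
  const cursor = openFakeCursor();
  const seen = [];
  for await (const row of rowsFrom(cursor)) {
    seen.push(row);
    if (seen.length === 3) break;   // triggers .return() on the generator
  }
  // By the time the loop has exited, the finally block has finished.
  return { seen, closed: cursor.closed };
}

readThreeThenStop().then((result) => console.log(result));
// { seen: [ 0, 1, 2 ], closed: true }
```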

Where async generators are the wrong tool

They're built for I/O that arrives in sequence, so they're sequential by default. allRecords fetches page two only after you've finished page one, so you pay the network latency of every page back to back.

If your API can serve pages in parallel and you need throughput more than you need simple code, a plain Promise.all over known page numbers will beat this, and async generators won't parallelize for free.
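For comparison, a batched Promise.all version might look like the sketch below. It assumes an API addressed by page number with a page count known up front. The fetchPageByNumber function, the simulated latency, and the batch size are all made up for the example. Note that it goes back to collecting everything in an array, which is the trade you are making for throughput.

```javascript
// Hypothetical page-number API with a little simulated latency.
function makeNumberedApi(pageSize) {
  return {
    async fetchPageByNumber(page) {
      await new Promise((resolve) => setTimeout(resolve, 5));
      const start = page * pageSize;
      return Array.from({ length: pageSize }, (_, i) => ({ id: start + i }));
    },
  };
}

// Fetch pages in batches of `concurrency`, preserving page order.
async function fetchAllParallel(api, totalPages, concurrency) {
  const all = [];
  for (let first = 0; first < totalPages; first += concurrency) {
    const last = Math.min(first + concurrency, totalPages);
    const batch = [];
    for (let page = first; page < last; page++) {
      batch.push(api.fetchPageByNumber(page));
    }
    // Promise.all resolves in input order, so records stay sorted by page.
    const pages = await Promise.all(batch);
    for (const records of pages) all.push(...records);
  }
  return all;
}

fetchAllParallel(makeNumberedApi(50), 8, 3).then((records) => {
  console.log(records.length, records[0].id, records[records.length - 1].id);
  // 400 0 399
});
```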

Error handling is your job. One thrown page ends the loop, same as the eager version. If you want retries or skip-and-continue, you wrap the fetch inside the generator yourself.
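One way to wrap the fetch is a small retry helper called from inside the generator. This is a sketch under assumptions: fetchWithRetry, the attempt count, and the linear delay are illustrative choices, and makeFlakyFetchPage is a fake that fails the first time each page is requested.

```javascript
// Hypothetical retry wrapper: tries fetchPage up to `attempts` times.
async function fetchWithRetry(fetchPage, cursor, attempts, delayMs) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fetchPage(cursor);
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        // Simple linear backoff before the next try.
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  throw lastError;   // out of attempts: let the consumer's loop see the error
}

async function* allRecordsWithRetry(fetchPage, attempts = 3, delayMs = 10) {
  let cursor = 0;
  while (cursor !== null) {
    const { records, nextCursor } = await fetchWithRetry(
      fetchPage, cursor, attempts, delayMs,
    );
    for (const record of records) yield record;
    cursor = nextCursor;
  }
}

// Fake fetchPage that fails once per page, then succeeds.
function makeFlakyFetchPage(totalRecords, pageSize) {
  const failedOnce = new Set();
  return async function fetchPage(cursor) {
    if (!failedOnce.has(cursor)) {
      failedOnce.add(cursor);
      throw new Error(`transient failure at cursor ${cursor}`);
    }
    const end = Math.min(cursor + pageSize, totalRecords);
    const records = [];
    for (let id = cursor; id < end; id++) records.push({ id });
    return { records, nextCursor: end < totalRecords ? end : null };
  };
}

async function demoRetry() {
  const ids = [];
  const source = allRecordsWithRetry(makeFlakyFetchPage(120, 50), 3, 1);
  for await (const record of source) ids.push(record.id);
  return ids;
}

demoRetry().then((ids) => console.log(ids.length));   // 120
```

The consumer's loop is unchanged. Only the generator knows that pages sometimes need a second try.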

And per item, pulling through a generator is slower than indexing into an array. For network-bound work that overhead vanishes next to the latency. For a tight CPU-bound loop over data already in memory, reach for the plain array.

The whole arc, in one mental model

Step back and the three posts in this series are the same idea three times:

Pull local data lazily so a file never fully loads.
Pause and await so async code reads like sync code.
And now, pull remote data lazily so an API never fully lands in memory.

Underneath all of it is one trick: a function that can stop in the middle and pick up later when you ask for more.

The protocol that makes for await...of work, the Symbol.asyncIterator it looks for, the way .return() drives that finally cleanup, the patterns for adding controlled parallelism back in: I pulled the whole async iteration layer apart in a short free book on generators. If this series made the mechanism click and you want the full picture in one place, it's here:
Get Your Free Copy

The next time an API hands you data 50 rows at a time, you don't have to choose between holding all of it and writing a tangle of cursor bookkeeping. You write a loop that looks eager and runs lazy. The paging hides itself, and you only ever pay for the pages you actually walk through.

Cheers :)

async/await is a Generator in Disguise. Let's Build It From Scratch

Parthipan Natkunam — Sun, 07 Jun 2026 00:31:39 +0000

You write await a dozen times before lunch. Fetch a row, await it. Call a service, await that. It works, you move on, and you never have to think about what the word is doing. Then one day someone asks you to explain it. Maybe it's an interviewer. "But what does await actually do?" And you open your mouth and what comes out is "it, uh, waits for the promise." Which is true, and also explains nothing.

We can build the async/await mechanism from scratch using generators as a learning exercise. It requires a pause button wired to a small loop that waits on a promise and then presses play again. You already know one half of that machinery if you read the previous post in this series. The other half is a trick generators have that we glossed over. Put the two together and you can build a working version of async/await yourself, by hand, and watch it behave exactly like the real thing.
Let's do that.

The shape of the problem

Strip await down to what it has to accomplish and you get two requirements:

First, a function has to be able to stop in the middle. Right at the await, freeze everything, the local variables, the spot in the loop, all of it, and hand control back to whoever called it. Normal functions can't do this. They run start to finish and that's the deal.

Second, something on the outside has to wait for the promise to settle and then nudge the frozen function back to life, handing it the resolved value as if the await expression had simply evaluated to it.

That's the whole job. A function that pauses, and a driver that resumes it when a promise is ready. Hold that picture, because the rest of this is just filling in those two pieces with things JavaScript already gives you.

The half you've seen: pausing

A generator function, the function* kind, can pause itself with yield and resume later from the exact same spot. We leaned on that hard in the CSV piece to pull rows through a pipeline one at a time. A line came in, got yielded, and the generator sat frozen until someone asked for the next value.

So pausing is solved. A generator pauses at every yield. If we squint, yield and await start to look like the same gesture: stop here, give something to the outside, wait.

But there's a gap. With the CSV pipeline, values only flowed one way. The generator yielded lines outward and the consumer took them. For await to work, the flow has to go both ways. The function yields a promise outward, and then the resolved value has to come back in and become the result of the expression. const user = await getUser() means the generator needs to receive user at the spot where it paused.
Generators can do this. We just never used it in the CSV piece, because we didn't need it there.

The half you probably haven't: talking back

Here is the trick. When you call .next() on a generator, you can pass it an argument, and that argument becomes the value the paused yield expression evaluates to.

The yield doesn't only push a value out. It also waits to receive one back, and whatever you hand to the next .next(value) call is what it gets.
A tiny demo makes it concrete:

function* echo() {
  const first = yield 'pause-1';
  console.log('received:', first);
  const second = yield 'pause-2';
  console.log('received:', second);
  return 'done';
}

const g = echo();
console.log(g.next().value);     // pause-1   (runs up to the first yield)
console.log(g.next('A').value);  // received: A,  then  pause-2
console.log(g.next('B').value);  // received: B,  then  done

Look at what happened. The first .next() runs the generator until it hits yield 'pause-1' and stops. The value 'pause-1' comes out. The generator is now frozen on that line.

When we call .next('A'), the 'A' gets injected as the result of that first yield, so first becomes 'A', the log fires, and the generator runs on to the second yield. Two way communication. The generator speaks, and it also listens.

Now line the two halves up. yield pauses and emits a value. .next(value) resumes and injects a value. If the thing a generator yields is a promise, an outside driver could wait for that promise, take the result, and pass it straight back in through .next(). The generator would never know it had paused at all. From inside, it would look exactly like the value had been sitting there waiting.
That driver is the only piece we're missing.

Building the driver

Here's the runner. This is the heart of the whole post, and it's shorter than most of the functions you wrote this week:

function run(genFn) {
  return new Promise((resolve, reject) => {
    const gen = genFn();

    function step(method, arg) {
      let result;
      try {
        result = gen[method](arg);   // gen.next(value) or gen.throw(error)
      } catch (err) {
        return reject(err);          // generator threw and nothing caught it
      }

      const { value, done } = result;
      if (done) {
        return resolve(value);       // generator returned: settle the outer promise
      }

      // Treat whatever was yielded as a promise. Wait, then resume.
      Promise.resolve(value).then(
        (v) => step('next', v),      // resolved: feed the value back in
        (e) => step('throw', e),     // rejected: throw it at the yield point
      );
    }

    step('next', undefined);         // kick it off
  });
}

run takes a generator function and returns a promise. That promise stands in for the whole async operation, the same way calling an async function hands you a promise.
Inside, step is the engine. It calls the generator (gen.next(arg) to resume normally, gen.throw(arg) to inject an error, and we'll get to why that matters).
The generator hands back { value, done }. If done is true, the generator has returned, so we resolve the outer promise with whatever it returned.
If it isn't done, then value is whatever got yielded, which we are choosing to treat as a promise. We wrap it in Promise.resolve so plain values work too, wait for it with .then, and when it settles we call step again to wake the generator up. A resolved promise resumes with .next(theValue). A rejected one resumes with .throw(theError).
Then step('next', undefined) starts the machine. Everything after that is the generator and the promises bouncing control back and forth until done.
Here is what using it looks like next to the native version:

// native
async function nativeSequential() {
  const a = await wait(10, 2);
  const b = await wait(10, 3);
  return a + b;
}

// our version: function* and yield instead of async and await
function genSequential() {
  return run(function* () {
    const a = yield wait(10, 2);
    const b = yield wait(10, 3);
    return a + b;
  });
}

Swap async for function* wrapped in run, swap await for yield, and the two functions are the same shape. That's not a coincidence. We'll get to why in a minute.
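Before that, the gen.throw branch deserves a quick demonstration, because it is what lets an ordinary try/catch inside the generator handle a rejected promise. The sketch below repeats the runner in condensed form so it runs on its own. The wait helper is an assumption (the post does not show its definition), written here as "resolve with a value after some milliseconds".

```javascript
// Assumed helper: resolves with `value` after `ms` milliseconds.
function wait(ms, value) {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// The runner from above, condensed.
function run(genFn) {
  return new Promise((resolve, reject) => {
    const gen = genFn();
    function step(method, arg) {
      let result;
      try {
        result = gen[method](arg);
      } catch (err) {
        return reject(err);
      }
      if (result.done) return resolve(result.value);
      Promise.resolve(result.value).then(
        (v) => step('next', v),
        (e) => step('throw', e),
      );
    }
    step('next', undefined);
  });
}

async function nativeVersion() {
  const a = await wait(10, 2);
  try {
    await Promise.reject(new Error('boom'));
  } catch (err) {
    return `${a} then recovered from ${err.message}`;
  }
}

function generatorVersion() {
  return run(function* () {
    const a = yield wait(10, 2);
    try {
      yield Promise.reject(new Error('boom'));
    } catch (err) {
      // gen.throw(err) made the rejection surface right here.
      return `${a} then recovered from ${err.message}`;
    }
  });
}

Promise.all([nativeVersion(), generatorVersion()]).then((results) => {
  console.log(results);
  // [ '2 then recovered from boom', '2 then recovered from boom' ]
});
```

Without the 'throw' branch in step, the rejection would have nowhere to go inside the generator, and the catch block would never run.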

Why this works

The runner you just read is not a clever approximation of async/await. It is, give or take some edge-case handling, how async/await actually shipped.

When async functions were proposed for JavaScript, the reference implementation compiled them down to generators driven by a runner, using a tool called regenerator. The proposal itself was built on top of generators and promises, because those two features together already had everything async functions needed. The pause came from generators, the waiting came from promises, and a small driver glued them.

It went further than a proposal. For years, if you wrote async/await and compiled it with TypeScript or Babel to run on older browsers, the output was a generator and a helper function. TypeScript's helper is called __awaiter, and if you read its source, it is the same code you just walked through: a new Promise, a step function, generator.next(value) when a promise resolves, generator.throw(value) when one rejects, resolve when the generator is done. Before the keywords even existed, libraries like co and Bluebird's coroutine handed people this exact pattern so they could write flat, sequential-looking async code using yield.
So the twenty lines above aren't a model of async/await. They're closer to a fossil of it. You rebuilt the thing the feature was made from.

Where the analogy stops

It would be factually inaccurate to say that run is a drop-in replacement for the real keyword, and the honest gaps are worth knowing:

Modern engines don't ship your generator runner. V8 has native async functions now, with their own optimized handling of the microtask queue, so the exact scheduling of when continuations fire is tuned in ways a hand-written .then loop doesn't perfectly reproduce. In ordinary code you won't see a difference, but if you're reasoning about precise microtask ordering across many interleaved tasks, the native version is the source of truth, not this.

The runner is also missing the rough edges a real implementation handles.
The right takeaway is that the implementation we developed is not code to ship, but a model that turns a word you used on faith into a thing you can reason about.

The bit underneath

The two-way communication that makes this work, the .next(value) injection, is one of the most underused features in the language, and it powers more than this. The same back-and-forth drives yield* delegation, lets generators model state machines, and is the foundation the whole async story was built on. I pulled that full layer apart, the bidirectional protocol, yield*, and the runner that grew into async/await, in a short free book on generators. If this post made the mechanism click and you want the complete mental model beneath it, grab it:

Get Your Free Copy

Cheers :)

An Introduction to Alternate Data Streams (ADS)

Parthipan Natkunam — Wed, 03 Jun 2026 23:01:28 +0000

A Hidden Layer of New Technology File System (NTFS)

Alternate Data Streams (ADS) is a New Technology File System (NTFS) feature that allows data to be associated with a file or directory without modifying its primary data or attributes.

Although introduced to provide enhanced functionality, ADS has also sparked debates due to its potential misuse in cybersecurity. This article explores the technical nuances of ADS: its design, use cases, and challenges.

What are Alternate Data Streams?

In NTFS, a file's contents are stored in one or more data streams. By default, the file’s primary data is stored in the unnamed main data stream, also known as the default data stream.

ADS allows developers to attach additional data streams to a file, offering a way to embed metadata or supplementary content without altering the original file’s content.

For instance, a file on an NTFS filesystem can have a primary stream (main stream) for the main content and one or more alternate streams for additional metadata.

Syntax Overview

The syntax for working with ADS is pretty straightforward. You can associate an alternate data stream with a file using a colon (:) as a separator:

filename:streamname

For example:

echo "This is an alternate data stream" > document.txt:hiddenstream

Here, document.txt is the primary file, and hiddenstream is the alternate data stream associated with it.

These alternate streams could be anything, for instance, an executable, a script, a log file, etc.

Practical Use Cases of ADS

ADS was designed with legitimate use cases in mind. Some of its primary applications are:

1. Storing Metadata

Alternate Data Streams can store metadata about files without cluttering the primary file content.

For instance, a text editor might save configuration settings or user preferences in an ADS.

2. Attaching Hidden Data

Applications can use ADS to store additional data related to a file, such as thumbnails or indexing information, without exposing it in the file’s primary content.

3. Enhanced File Management

Developers can utilize ADS for logging, tagging, or embedding instructions within files.

For example, a backup application might use ADS to store backup timestamps.

Cybersecurity Challenges with ADS

1. Data Hiding

Attackers can embed malicious code or payloads within ADS to evade detection.

For example, a file might appear benign while carrying a hidden executable within an alternate data stream.

2. Bypassing Security Tools

Many antivirus and security scanners do not thoroughly inspect alternate data streams, making them an effective tool for malware authors to obfuscate threats.

3. Persistence Mechanism

Threat actors can leverage ADS to maintain persistence on a compromised system.

For instance, they might store configuration files, encryption keys, or secondary payloads in ADS.

Detecting and Managing Alternate Data Streams

Understanding how to detect and manage ADS is critical given the potential risks. Here are some tools and techniques:

1. Using Built-in Commands

The dir command with the /R flag can reveal alternate data streams:

dir /R

2. PowerShell Scripts

Custom PowerShell scripts can be used to enumerate ADS.

For example:

# List all alternate data streams in the current directory
Get-ChildItem -Recurse | ForEach-Object { 
    $file = $_
    Get-Item $file.FullName -Stream * | Where-Object Stream -ne ':$Data' | ForEach-Object {
        [PSCustomObject]@{
            FileName = $file.Name
            Stream = $_.Stream
            Length = $_.Length
        }
    }
} | Format-Table -AutoSize

The explanation of the above script is as follows:
The Get-ChildItem -Recurse command retrieves all the files and subdirectories under the current working directory. We pipe its output to a ForEach-Object loop that iterates through each item.

The Get-Item $file.FullName -Stream * command retrieves all streams associated with the item currently being processed by the loop. Its output is, in turn, passed to Where-Object Stream -ne ':$Data', which filters out the main stream, identified by the name :$Data (this stream holds the main content of the file).

Finally, we pipe the filtered list from above into another loop that iterates through the identified alternate data streams and creates a custom object for each entry found during the process.

We use the Format-Table -AutoSize command to display the final output in tabular form.

The output of the above script, in our case, will reveal the alternate data stream hiddenstream that we created in the earlier section.

3. Third-Party Tools

Specialized tools like Sysinternals' Streams can identify and analyze ADS on a system.

Mitigating Risks of ADS

To balance the utility of ADS with security, organizations and developers can adopt the following practices:

1. Monitor and Audit: Regularly audit systems for unauthorized ADS usage.
2. Restrict Privileges: Limit file system privileges to reduce the risk of ADS exploitation.
3. Educate Users: Train users and administrators on identifying and mitigating ADS risks.
4. Enhance Security Scans: Ensure antivirus and security tools are configured to detect and scan ADS.

Processing a 2GB CSV in Node Without Running Out of Memory

Parthipan Natkunam — Sat, 30 May 2026 05:08:13 +0000

Why the obvious approach crashes, and how a few generator functions keep memory flat no matter how big the file gets.

Here's a task that looks trivial on paper: Read a CSV export, filter the rows you care about, sum one column, write a small report. The kind of thing you bang out in ten minutes. Now say the file is around 2GB.

The first version is four lines. It works great on a 5MB sample. Then you point it at the real export and Node falls over with JavaScript heap out of memory. The reflex is to do what most of us do first, bump --max-old-space-size, give it more heap, run it again. It gets further and dies again. That's the moment to stop fighting the symptom and look at what the code is actually asking the machine to do.

Here is the thing worth internalizing: the size of your data does not have to dictate the size of your memory footprint. You can process a file bigger than your RAM. The trick is to never hold the whole thing at once, and generators give you a clean way to write code that does exactly that without turning into a mess of callbacks and manual state.

Let's build up to it properly.

The version that dies

Here's roughly what the first attempt looked like:

const fs = require('fs');

const rows = fs.readFileSync('export.csv', 'utf8').split('\n');

let total = 0;
for (const row of rows) {
  const amount = Number(row.split(',')[2]);
  if (!Number.isNaN(amount)) total += amount;
}

console.log('total:', total);

Read the file. Split on newlines. Loop. Sum. Clean and readable, and on a small file it's perfect.

The problem is hiding in the first line, and it's actually two problems stacked on top of each other.

fs.readFileSync pulls the entire file into memory as one big buffer before you do anything with it. A 2GB file is a 2GB allocation, minimum. Then .split('\n') takes that buffer and produces an array with one string per line. For a file with millions of rows, that's millions of string objects, each with its own overhead, all alive at the same time. So now you're holding the raw file and a fully expanded array of every line. You've roughly doubled the cost of the thing that was already too big.

I wanted to see how bad it actually is, so I ran it. I generated a CSV with 2 million rows (id,name,amount), which came out to about 45MB. Modest. Not even close to 2GB. Here's what the load-everything approach did to memory:

naive sum: 999000000 | peak RSS MB: 238

238MB of resident memory to process a 45MB file. That's more than five times the file size sitting in RAM at peak. Now scale that ratio up. A 2GB file with the same shape would want somewhere north of 10GB, and your container almost certainly does not have that. Hence the crash.

What we actually want

Step back from the code for a second.

To sum a column, do you ever genuinely need every row in memory simultaneously? No. You need one row at a time. Read a line, pull out the number, add it to a running total, throw the line away, move on. At no point does row 1,400,000 need to coexist with row 3.

That's the whole insight. The work is sequential and one-pass, so the memory should be too. We want to pull rows through the program one at a time, like water through a pipe, instead of trying to fit an entire ocean into a bucket.

Node has had streams forever, and streams do exactly this. But raw streams are awkward to compose. The moment you want to chain "read lines" into "parse them" into "filter them" into "sum them," you're wiring up event handlers and managing backpressure by hand, and the readable four-line version turns into something you don't want to look at.

This is where generators earn their place.

Generators, the one-paragraph version

A normal function runs start to finish and returns once. A generator function (the function* syntax) can pause itself partway through, hand a value back to whoever called it, and then resume from exactly where it left off the next time you ask for a value. It does this with yield.
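If you haven't driven a generator by hand before, this tiny sketch shows the pause and resume directly. The countTo function is a made-up example, not part of the CSV pipeline.

```javascript
function* countTo(limit) {
  for (let n = 1; n <= limit; n++) {
    yield n;   // pause here and hand n to the caller
  }
  return 'finished';
}

const counter = countTo(2);

// Each .next() resumes the generator until its next yield (or its return).
const steps = [
  counter.next(),   // { value: 1, done: false }
  counter.next(),   // { value: 2, done: false }
  counter.next(),   // { value: 'finished', done: true }
  counter.next(),   // { value: undefined, done: true }
];

console.log(steps);
```

Nothing inside countTo runs until the first .next() call, and nothing runs between calls. That is the property the rest of this post leans on.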

For reading files we want the async flavor, async function*, because reading from disk is asynchronous. The consuming side uses for await...of instead of a plain for...of. Same idea, just async.

Building the pipeline

Let's write the big-file version as a set of small generators, each doing one job.

First, a generator that yields the file one line at a time. Node's readline module already reads a stream line by line, so we wrap it:

const fs = require('fs');
const readline = require('readline');

async function* readLines(path) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });
  for await (const line of rl) {
    yield line;
  }
}

createReadStream reads the file in small chunks rather than all at once. readline hands us complete lines off those chunks. We yield each line as it arrives. Crucially, nothing is accumulating here. A line comes in, goes out, and is gone.

Next, a generator that turns raw lines into parsed objects:

async function* parse(lines) {
  for await (const line of lines) {
    const [id, name, amount] = line.split(',');
    if (id === 'id') continue; // skip the header row
    yield { id, name, amount: Number(amount) };
  }
}

Notice it takes a source of lines as its input and yields objects. It doesn't know or care whether those lines came from a file, a network socket, or an array in a test. It just transforms what flows through it.

Now a filter, because in this scenario, I only wanted rows above a threshold:

async function* onlyAbove(rows, min) {
  for await (const row of rows) {
    if (row.amount >= min) {
      yield row;
    }
  }
}

And finally we connect them and consume the result:

(async () => {
  const lines = readLines('export.csv');
  const parsed = parse(lines);
  const filtered = onlyAbove(parsed, 0);

  let total = 0;
  let count = 0;
  for await (const row of filtered) {
    total += row.amount;
    count++;
  }

  console.log('total:', total, 'count:', count);
})();

Read it from the inside out. readLines produces lines, parse consumes those and produces objects, onlyAbove consumes those and produces a filtered subset, and the for await loop at the bottom pulls the whole chain. Each stage is maybe five lines. Each one does a single thing. You can test them in isolation, reorder them, drop one in or out, all without touching the others.
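Testing a stage in isolation can be as simple as feeding it an array, since for await...of also accepts ordinary iterables. A minimal sketch follows. The sample lines and the collect helper are made up for the example, and parse and onlyAbove are repeated so the block runs on its own.

```javascript
async function* parse(lines) {
  for await (const line of lines) {
    const [id, name, amount] = line.split(',');
    if (id === 'id') continue; // skip the header row
    yield { id, name, amount: Number(amount) };
  }
}

async function* onlyAbove(rows, min) {
  for await (const row of rows) {
    if (row.amount >= min) {
      yield row;
    }
  }
}

// Hypothetical test helper: drain any async or sync iterable into an array.
async function collect(source) {
  const out = [];
  for await (const item of source) out.push(item);
  return out;
}

// No file involved: a plain array stands in for readLines.
const sampleLines = ['id,name,amount', '1,alice,10', '2,bob,-3', '3,carol,7'];

collect(onlyAbove(parse(sampleLines), 0)).then((rows) => console.log(rows));
// [ { id: '1', name: 'alice', amount: 10 }, { id: '3', name: 'carol', amount: 7 } ]
```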

Here's the part that matters. I ran this exact pipeline against the same 2 million row file:

pipeline sum: 999000000 count: 2000000 | peak RSS MB: 89

Same answer, 999000000, down to the last digit. But peak memory went from 238MB to 89MB. And that 89MB is not really "memory for the data." It's Node's baseline plus the read buffer plus a couple of objects in flight. The data itself is barely there because we only ever hold one row at a time. Throw a 2GB file at this and the number stays flat. That's the whole game.

Why this composes when streams alone don't

You might be thinking, fine, but Node streams could do this too, and you'd be right. So why the generators?

Pull versus push. A raw readable stream pushes data at you through events; you react to 'data' and 'end' and you manage the timing yourself. When you chain several transformations, you're coordinating several event emitters and making sure none of them races ahead of a slow consumer. Backpressure, in the jargon.

Generators flip it to pull. The consumer at the bottom of the loop asks for the next value, and that request travels back up the chain. onlyAbove asks parse for a row, parse asks readLines for a line, readLines asks the file for a chunk. Nothing is produced until something downstream wants it. Backpressure isn't something you configure; it's just how yield works. The producer literally cannot get ahead because it's paused until you call for the next value.

That's why the four small functions above read almost like the naive version, but behave like a carefully tuned stream. You get the readability of the simple loop and the memory profile of hand-written streaming, without choosing between them.

Where this bites you

I'd be lying if I said this is free.

The big one: you get one pass. A generator is exhausted once you've iterated it. If you need to loop over the data twice, say, sum a column and then also find the max in a separate pass, you can't just iterate the same pipeline again. It's empty the second time. You either compute both in a single pass, or you re-create the pipeline from the source, or, if the result genuinely fits in memory, you collect it into an array (const arr = []; for await (const x of pipe) arr.push(x);) and accept the cost. The streaming approach is for when the dataset doesn't fit, so collecting it usually defeats the point.
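
Here is a sketch of the single-pass option, along with what the second pass actually looks like. fakeRows is a hypothetical stand-in for the real pipeline, and sumAndMax is a name made up for this example.

```javascript
// Hypothetical stand-in for the real pipeline: yields rows with an `amount` field.
async function* fakeRows() {
  yield { amount: 10 };
  yield { amount: 45 };
  yield { amount: 20 };
}

// Compute every aggregate you need while the row is in hand.
async function sumAndMax(rows) {
  let total = 0;
  let max = -Infinity;
  for await (const row of rows) {
    total += row.amount;
    if (row.amount > max) max = row.amount;
  }
  return { total, max };
}

async function singlePassDemo() {
  const rows = fakeRows();
  const first = await sumAndMax(rows);   // sees all three rows
  const second = await sumAndMax(rows);  // same generator object, already exhausted
  return { first, second };
}

singlePassDemo().then(console.log);
```

The first call reports a total of 75 and a max of 45. The second call gets no rows, so it returns the untouched starting values: a total of 0 and a max of -Infinity. No error is thrown, which is exactly why this mistake is easy to miss.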

The other one is debugging. With an array you can console.log the whole thing and see your data. With a lazy pipeline there's nothing to log until you pull a value through, and a console.log inside a generator only fires when that value is actually demanded. The execution order can surprise you the first few times. It clicks, but there's an adjustment period.
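
One way to get some visibility back is a pass-through stage that observes each value and yields it unchanged. This is a sketch, not something from the demo repository; tap, letters, and tapDemo are hypothetical names.

```javascript
// Hypothetical pass-through stage: calls observe(value), then yields the value unchanged.
async function* tap(source, observe) {
  for await (const value of source) {
    observe(value);
    yield value;
  }
}

// Hypothetical tiny source for the demo.
async function* letters() {
  yield 'a';
  yield 'b';
}

async function tapDemo() {
  const seen = [];
  const tapped = tap(letters(), (v) => seen.push(`tap ${v}`));
  seen.push('before loop');             // the observer has not fired yet
  for await (const v of tapped) {
    seen.push(`loop ${v}`);
  }
  return seen;
}

tapDemo().then((seen) => console.log(seen.join('\n')));
```

The recorded order is before loop, tap a, loop a, tap b, loop b. The observer fires only as each value is pulled, interleaved with the consumer, so you can slot tap(stage, console.log) between any two stages and see exactly what crosses that boundary and when.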

And async generators do carry some per-iteration overhead compared to a tight synchronous loop over an array. If your data comfortably fits in memory and you care about raw speed, the array might genuinely be faster. This technique is about not dying on data that doesn't fit, not about winning microbenchmarks on data that does.

The bit underneath

What I find quietly interesting is that the for await...of loop driving this whole thing is doing something generators were partly built to enable. The pause-and-resume machinery that lets a generator give up control and pick back up later is the same machinery that async/await is built on top of. When you await a promise, your function is effectively yielding control and waiting to be resumed, exactly like a generator yielding a value. async/await is, more or less, a generator and a runner that feeds it resolved promises. Once you've written a few generators by hand, a lot of the async behavior you've been taking on faith stops being magic.
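
To make "a generator and a runner" concrete, here is a minimal sketch of such a runner. It is an illustration of the idea, not how engines implement async/await and not code taken from the book; run, later, and addLater are names I made up.

```javascript
// Minimal runner sketch: drives a generator that yields promises,
// feeding each settled result back in through next() or throw().
function run(generatorFn, ...args) {
  return new Promise((resolve, reject) => {
    const gen = generatorFn(...args);

    function step(method, input) {
      let result;
      try {
        result = gen[method](input);    // resume the generator
      } catch (err) {
        reject(err);                    // the generator threw and did not catch it
        return;
      }
      if (result.done) {
        resolve(result.value);          // the generator returned
        return;
      }
      // Wait for the yielded value, then resume with the result or the error.
      Promise.resolve(result.value).then(
        (value) => step('next', value),
        (err) => step('throw', err)
      );
    }

    step('next', undefined);
  });
}

// Hypothetical delay helper standing in for real async work.
const later = (value, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

// Reads like an async function, with yield where await would go.
function* addLater(a, b) {
  const x = yield later(a, 10);
  const y = yield later(b, 10);
  return x + y;
}

run(addLater, 2, 3).then((sum) => console.log('sum:', sum)); // logs "sum: 5"
```

Swap function* for async function and yield for await, delete run, and addLater behaves the same way. The difference is that the language now supplies the runner.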

I dug into that whole layer, the two-way communication, yield* composition, the async runner that became async/await, in a short book on generators. It's free. If the pipeline pattern here was useful and you want the full mental model under it, grab it:

Get Your Free Copy

The next time Node tells you it's out of memory, before you reach for a bigger heap, ask whether you ever needed all that data at once in the first place. Usually you didn't.

A working demo of the ideas discussed in this post can be found in this GitHub repository:

coded-parts / generator-data-processing-demo

A demo PoC on memory optimization using generators in JS

Verification environment

Reproducible checks for every measurable claim in the article "Processing a 2GB CSV in Node Without Running Out of Memory."

Nothing here is mocked. It generates a real CSV, runs both the naive load-everything approach and the generator pipeline in separate processes, measures peak memory, and asserts the totals are correct against an independently computed expected sum.

Requirements

Node.js 18 or newer (tested on Node 22). No npm install, zero dependencies.

Quick start

# 1. Run the full check at the article's size (2,000,000 rows / ~45 MB)
node verify.js

# 2. Prove the headline claim: naive dies, pipeline survives the same heap cap
./stress.sh

That's it. verify.js exits 0 if all checks pass. stress.sh exits 0 if the naive approach crashes while the pipeline succeeds.

What each claim maps to

Claim in the article	How it's verified	File
Both approaches produce the same total	Both totals

…


Cheers :)