Scaling a Fault-Tolerant LLM Translation Pipeline on Zero Budget

At BabyCloud, localization was our greatest growth driver—and our greatest budget threat. To localize over 100,000 developmental activities, bedtime stories, and pediatric recipes into regional Indian languages (Hindi, Marathi, Telugu, Tamil, and Bengali), commercial translation API quotes came in at over $15,000. For a pre-seed startup, that was out of the question.

We decided to build an automated translation engine using the Gemini API's free tier. But free tiers come with a catch: a strict limit of 15 Requests Per Minute (RPM) per project, and unpredictable truncation errors where the model cuts off text mid-response.

To overcome these constraints, we scaled the system to run 23 parallel API pipelines by rotating keys across 23 Google Cloud projects, and wrote a custom state-machine JSON-repair parser to catch and rebuild corrupted outputs.

Here is how we built it.

Sharding Concurrency Across 23 GCP Projects

If you try to hit a single free-tier Gemini API key with parallel worker threads, you will hit rate limits within seconds. To work around this, we set up 23 separate GCP projects, generated a unique API key for each, and implemented a sliding-window lock manager.

This lock manager rotates the active key in real time, tracking the timestamp of every request to ensure no single key exceeds the 15 RPM limit.

LLM Translation and Key Rotation System Schema

Architecture Flow: Zero-Budget Translation Pipeline

interface RequestLog {
  timestamp: number;
}

export class KeyRateLimiter {
  private requestHistory: Map<string, RequestLog[]> = new Map();
  private maxRequests: number = 15;
  private windowSizeMs: number = 60000; // 1 minute sliding window

  public async acquireLock(apiKey: string): Promise<boolean> {
    const now = Date.now();
    const logs = this.requestHistory.get(apiKey) || [];
    
    // Filter out requests older than the sliding window
    const activeLogs = logs.filter(log => now - log.timestamp < this.windowSizeMs);
    this.requestHistory.set(apiKey, activeLogs);

    if (activeLogs.length >= this.maxRequests) {
      return false; // Limit reached for this key
    }

    activeLogs.push({ timestamp: now });
    return true;
  }
}

Rebuilding Truncated JSON Payloads

Because translating thousands of items at once is expensive, we batched requests. But large batches often triggered response truncation when the LLM hit token output limits mid-generation. When this happened, the API returned broken JSON—missing closing objects, arrays, and quotes—which crashed the default JSON.parse() parser.

Instead of discarding these partial translations and wasting tokens, we wrote a character-by-character state-machine parser. This parser tracks whether the cursor is inside a string block or escaping a character, collects unclosed brackets using a LIFO stack, and dynamically appends the correct closing tokens to recover the valid portion of the JSON payload.

Bracket Lexer Stream Analysis

export function repairTruncatedJSON(rawJson: string): string {
  let cleaned = rawJson.trim();
  
  // Remove markdown codeblock wrappers if present
  cleaned = cleaned.replace(/^```json\s*/i, '').replace(/```$/, '').trim();

  const bracketStack: string[] = [];
  let inString = false;
  let escapeNext = false;
  let repaired = '';

  for (let i = 0; i < cleaned.length; i++) {
    const char = cleaned[i];
    
    if (escapeNext) {
      repaired += char;
      escapeNext = false;
      continue;
    }

    if (char === '\\') {
      repaired += char;
      escapeNext = true;
      continue;
    }

    if (char === '"') {
      inString = !inString;
      repaired += char;
      continue;
    }

    if (!inString) {
      if (char === '{' || char === '[') {
        bracketStack.push(char);
      } else if (char === '}') {
        if (bracketStack[bracketStack.length - 1] === '{') {
          bracketStack.pop();
        }
      } else if (char === ']') {
        if (bracketStack[bracketStack.length - 1] === '[') {
          bracketStack.pop();
        }
      }
    }

    repaired += char;
  }

  // If the cursor was cut off inside a string value, close the quote
  if (inString) {
    repaired += '"';
  }

  // Pop and append closing brackets in reverse order (LIFO)
  while (bracketStack.length > 0) {
    const unclosed = bracketStack.pop();
    repaired = repaired.trim();
    if (repaired.endsWith(',')) {
      repaired = repaired.slice(0, -1);
    }
    
    if (unclosed === '{') {
      repaired += '}';
    } else if (unclosed === '[') {
      repaired += ']';
    }
  }

  return repaired;
}

Token Optimization Practices

We also needed to keep the context sizes small to stay within free-tier limits. Before sending nested payloads to the LLM, we stripped out metadata keys (like ObjectIds, timestamps, and database index flags) which did not need translation. We wrote a recursive utility to filter out these fields:

export function pruneTranslateTree(obj: any): any {
  if (obj === null || typeof obj !== 'object') {
    return obj;
  }

  if (Array.isArray(obj)) {
    return obj.map(pruneTranslateTree);
  }

  const pruned: Record<string, any> = {};
  
  // Define metadata keys to strip before parsing
  const blacklistedKeys = ['_id', 'createdAt', 'updatedAt', 'userId', 'status', '__v'];

  for (const [key, value] of Object.entries(obj)) {
    if (blacklistedKeys.includes(key)) {
      continue;
    }
    
    // Only translate string fields or recurse into nested structures
    if (typeof value === 'string' || typeof value === 'object') {
      pruned[key] = pruneTranslateTree(value);
    }
  }

  return pruned;
}

Prompt Engineering for JSON Output Reliability

To ensure that the Gemini model returns a parseable structure, we use a system prompt that explicitly restricts output formats. Forcing the model to return structured data is key to preventing parsing errors:

System Prompt Directive: You are a translation microservice. Translate the text fields of the provided JSON payload into the target language. Preserve all key structures. Output ONLY a valid JSON payload. Do not include markdown codeblocks, notes, or descriptions.

Even with this directive, safety filters or network connection lag can still interrupt generations mid-stream. Combining the prompt instructions with our character-by-character recovery stack ensures our pipeline maintains a high success rate even under heavy processing loads.