iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
🍣

Unified Attachment Processing and Salesforce Integration with n8n and AWS Lambda

に公開

Introduction

In my daily work, I frequently have to register information from resumes and skill sheets sent by clients into Salesforce. However, these files come in a variety of formats (PDF, Excel, Word), and manually opening, reading, and transcribing them one by one quickly consumes a significant amount of time.

In this article, I will share my experience building an automation pipeline using n8n. The process involves extracting text from any file format into a unified format, passing it to an LLM for structuring, and then registering it into a Salesforce custom object. During construction, I encountered some constraints of n8n Cloud and specific quirks of the Word file format, which were resolved through integration with AWS Lambda and careful observation of the file structures.


Objective

To build a pipeline that automatically extracts and structures text from any resume-related file format and registers it in Salesforce, thereby reducing the man-hours required for manual transcription to nearly zero.


1. Background and Challenges

1.1 Current Situation

  • Files are received via email or forms from sales representatives or partner companies.
  • File formats include PDF / Excel (.xlsx, .xls) / Word (.docx, .doc).
  • Traditionally, staff manually open each file, read the content, and transcribe it into Salesforce.

1.2 Challenges Faced

When I actually started building, I faced the following challenges:

# Challenge Impact
1 n8n's "Extract From File" does not support Word Cannot process .docx / .doc files
2 n8n Cloud's Code node prohibits external modules Cannot use jszip or zlib, and cannot extract .docx via JS
3 Schema of extraction results differs by file format Cannot pass unified input to the LLM node

2. Solution Strategy (Overview)

The final workflow is as follows:

Webhook

Switch (Branch by file extension)
  ├── .pdf  → Extract From File (PDF)    → Format_PDF Code  ─┐
  ├── .xlsx → Extract From File (XLSX)   → Format_XLSX Code ─┤
  ├── .xls  → Extract From File (XLS)    → Format_XLS Code  ─┼→ Merge → LLM → Salesforce
  ├── .docx → Extract From File (base64) → AWS Lambda → Code─┤
  └── .doc  → Extract From File (base64) → Code (UTF-16LE ) ─┘

The core design concept is to normalize all five paths into the same schema: { extractedText: "..." }. After merging the nodes, all downstream LLM prompts and Salesforce field mappings can be shared.


3. Detailed Processing of Each Path

3.1 PDF Path

n8n's "Extract From File" natively supports PDF, and the body text is output in the text field. To align field names with other paths, I inserted a single Code node downstream to normalize the output.

// PDF path - normalize output to { extractedText }
// The Extract From File node returns the body text in the `text` field,
// along with metadata like `info` and `metadata`. We only need the text.

const text = $input.first().json.text;

return [{
  json: {
    extractedText: text
  }
}];

3.2 XLSX / XLS Path

"Extract From File" parses Excel files as multi-row JSON objects, returning cell values with auto-generated field names like __EMPTY_1 or __EMPTY_2. A 56-row Excel file will be output as 56 separate items. To make this readable by an LLM, I concatenate all rows and columns into a single text block.

// XLSX/XLS path - flatten multi-row JSON output into a single text block
// Extract From File returns one item per row, with auto-generated field names
// like __EMPTY_1, __EMPTY_2. We concatenate all non-empty values per row,
// then join all rows with newlines.

const rows = $input.all();
const lines = [];

for (const item of rows) {
  const obj = item.json;

  // Collect all non-empty values from this row
  const values = Object.values(obj)
    .filter(v => v !== null && v !== undefined && v !== '')
    .map(v => String(v).trim())
    .filter(v => v.length > 0);

  // Join cells with " | " as a column separator
  if (values.length > 0) {
    lines.push(values.join(' | '));
  }
}

// Join rows with newlines to form a single text block
const extractedText = lines.join('\n');

return [{
  json: {
    extractedText: extractedText
  }
}];

Since XLSX and XLS share the same output structure from "Extract From File," the same code can be reused for both paths.

3.3 DOCX Path

DOCX was the most time-consuming part of the entire workflow.

n8n's "Extract From File" does not support .docx. I attempted to extract the file using JSZip within a Code node, but n8n Cloud prohibits external modules, resulting in an error. zlib is also prohibited. I considered writing a ZIP/Deflate decoder from scratch in pure JavaScript, but abandoned the idea due to concerns about code volume and performance.

Ultimately, I decided to offload DOCX processing to AWS Lambda. The overall configuration is as follows:

n8n: Extract From File (Move file to base64 string)
  ↓ (base64 string in word_base64 field)
n8n: HTTP Request (POST → Lambda Function URL)

AWS Lambda (Python 3.12)
  ↓ (Returns { "extractedText": "..." })
n8n: Code node (Parse Lambda response)

{ extractedText: "..." }

Lambda Configuration

Item Value
Runtime Python 3.12
Region ap-northeast-1 (Tokyo)
Memory 256 MB
Timeout 30 seconds
Auth Type NONE (AWS_IAM recommended for production)
Layer None (Standard Python libraries only)

Initially, I implemented this using python-docx + lxml layer, but I could not resolve the error cannot import name 'etree' from 'lxml'. Rebuilding the layer with --platform manylinux2014_x86_64 --only-binary=:all: did not help. C extensions like lxml are difficult to operate normally in the Lambda environment, so I eventually switched to standard Python libraries (zipfile + xml.etree).

Lambda Function Code

# AWS Lambda function: extract plain text from a .docx file
# Input:  JSON body with a base64-encoded .docx file in `word_base64`
# Output: JSON body with extracted plain text in `extractedText`
#
# Uses only Python standard library (zipfile + xml.etree.ElementTree).
# This avoids the lxml C extension import issues that occur on Lambda.

import json
import base64
import zipfile
import xml.etree.ElementTree as ET
from io import BytesIO

def lambda_handler(event, context):
    try:
        # 1. Parse the incoming request body.
        #    Function URL passes the body as a string, while direct invoke
        #    passes the parsed object. Handle both cases.
        if isinstance(event.get('body'), str):
            body = json.loads(event['body'])
        else:
            body = event

        # 2. Extract the base64-encoded file content
        base64_data = body.get('word_base64')
        if not base64_data:
            return {
                'statusCode': 400,
                'headers': {'Content-Type': 'application/json'},
                'body': json.dumps({'error': 'word_base64 field is missing'})
            }

        # 3. Decode base64 -> raw bytes
        file_bytes = base64.b64decode(base64_data)

        # 4. A .docx file is essentially a ZIP archive.
        #    Open it as a ZIP and read the main document XML.
        zip_file = zipfile.ZipFile(BytesIO(file_bytes))
        doc_xml = zip_file.read('word/document.xml')

        # 5. Parse the XML and extract text from all <w:t> elements
        #    inside <w:p> (paragraph) tags.
        root = ET.fromstring(doc_xml)
        ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

        texts = []
        for paragraph in root.iter(f'{{{ns}}}p'):
            para_texts = []
            for run in paragraph.iter(f'{{{ns}}}t'):
                if run.text:
                    para_texts.append(run.text)
            if para_texts:
                texts.append(''.join(para_texts))

        # 6. Join all paragraphs with newlines
        all_text = '\n'.join(texts)

        # 7. Return as JSON. ensure_ascii=False to preserve Japanese characters.
        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps(
                {'extractedText': all_text},
                ensure_ascii=False
            )
        }

    except Exception as e:
        return {
            'statusCode': 500,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({'error': str(e)})
        }

Lambda Permission Settings

Since October 2025, you must grant two types of permissions to use the Function URL. I initially only added one, which caused a 403 error for a while.

# Permission 1: allow public invocation via Function URL
aws lambda add-permission \
  --function-name docx-text-extractor \
  --statement-id FunctionURLAllowPublicAccess \
  --action lambda:InvokeFunctionUrl \
  --principal "*" \
  --function-url-auth-type NONE

# Permission 2: allow the actual Invoke action
# IMPORTANT: do NOT pass --function-url-auth-type here, it will fail
aws lambda add-permission \
  --function-name docx-text-extractor \
  --statement-id FunctionURLAllowInvoke \
  --action lambda:InvokeFunction \
  --principal "*"

You can name --function-name and --statement-id as you like (as long as the statement-id is unique within the function). Adjust --principal according to your security requirements (using "*" is not recommended in production). Other parameters are basically fixed.

n8n HTTP Request Node Configuration

Item Value
Method POST
URL Lambda Function URL
Body Content Type JSON
Specify Body Using JSON
JSON Mode Expression
JSON Content {"word_base64": "{{ $json.word_base64 }}"}

Code Node to Parse Lambda Response

// Parse the Lambda response and normalize to { extractedText }
// Lambda Function URL returns the body either as a JSON string
// (when invoked via HTTP) or as a parsed object (in some n8n versions).
// We handle both cases defensively.

const response = $input.first().json;

const body = typeof response.body === 'string'
  ? JSON.parse(response.body)
  : response;

return [{
  json: {
    extractedText: body.extractedText
  }
}];

Lambda Costs

Since Lambda only runs when requested, it is completely free while idle. AWS's permanent free tier includes 1 million requests and 400,000 GB-seconds of compute time per month. Estimating 256 MB memory and ~1 second per request, a volume of tens of thousands of processes per month is virtually free, and even at 100,000 processes, it only costs a few dozen to a few hundred yen. In scenarios like this resume processing, it can be operated at near-zero cost.

3.4 DOC Path

.doc files are even trickier than .docx. Since .doc is not a ZIP, but an OLE binary format by Microsoft, the .docx approach is useless. To process .doc files in AWS, you would need to install antiword or libreoffice and deploy via a Lambda Docker image, which increases costs.

I also tried reading them as XML format + Shift_JIS using n8n's "Extract From File," but the output was garbled and unusable.

During testing, I realized a fact: text within .doc files is stored as raw UTF-16LE byte sequences. This means it does not need to be expanded as a ZIP; you can simply scan the binary two bytes at a time, decode them as UTF-16LE, and filter for readable characters to extract the text.

Code Node for DOC Path

// DOC path - extract Japanese text by scanning UTF-16LE byte sequences
// .doc is Microsoft's OLE binary format (NOT a ZIP). However, the actual
// document text is stored as UTF-16LE strings in the binary stream.
// We can scan the bytes 2-at-a-time, decode as UTF-16LE, and keep only
// chunks that look like readable text containing Japanese characters.

const base64Data = $input.first().json.word_base64;
const buffer = Buffer.from(base64Data, 'base64');

const chunks = [];
let currentChunk = '';

// Walk the buffer 2 bytes at a time (UTF-16LE = 2 bytes per character)
for (let i = 0; i < buffer.length - 1; i += 2) {
  const code = buffer.readUInt16LE(i);

  // Define what counts as a "readable" character:
  //   - ASCII printable          (0x0020 - 0x007E)
  //   - CJK Unified Ideographs   (0x3000 - 0x9FFF) -- Japanese kanji + punctuation
  //   - Halfwidth/Fullwidth      (0xFF00 - 0xFFEF) -- Japanese fullwidth chars
  //   - Newlines                 (CR / LF)
  const isReadable = (
    (code >= 0x0020 && code <= 0x007E) ||
    (code >= 0x3000 && code <= 0x9FFF) ||
    (code >= 0xFF00 && code <= 0xFFEF) ||
    (code === 0x000D) || (code === 0x000A)
  );

  if (isReadable) {
    if (code === 0x000D || code === 0x000A) {
      currentChunk += '\n';
    } else {
      currentChunk += String.fromCharCode(code);
    }
  } else {
    // Hit a non-readable byte: flush the current chunk if it looks like
    // real Japanese text (must contain at least one CJK char and be long
    // enough to not be noise).
    if (currentChunk.length >= 6 &&
        /[\u3000-\u9FFF\uFF00-\uFFEF]/.test(currentChunk)) {
      chunks.push(currentChunk.trim());
    }
    currentChunk = '';
  }
}

// Flush the final chunk if any
if (currentChunk.length >= 6 &&
    /[\u3000-\u9FFF\uFF00-\uFFEF]/.test(currentChunk)) {
  chunks.push(currentChunk.trim());
}

// Deduplicate and join (Word often stores the same text multiple times
// in different runs, so dedup helps a lot)
let text = [...new Set(chunks)].join('\n');

// Strip Word internal style metadata noise. Word stores style names like
// "段落フォント", "見出し 1 (文字)" etc. at the tail of the binary stream.
// Cutting from the first occurrence of "段落フォント" removes all of it.
const noiseStart = text.indexOf('段落フォント');
if (noiseStart !== -1) {
  text = text.substring(0, noiseStart);
}

// Collapse 3+ consecutive newlines into 2
text = text.replace(/\n{3,}/g, '\n\n').trim();

return [{
  json: {
    extractedText: text
  }
}];

Internal verification has confirmed that this method extracts the content of attachments without issues.

3.5 Merging Paths

Since all five paths now output { extractedText: "..." }, I merge them using a Merge node.

Note that the Mode of the Merge node must be set to Append. The Combine mode waits for data from all inputs, which would cause this scenario (where only one path flows via Switch) to hang indefinitely. The Append mode allows data from any path to pass through downstream.


4. LLM Node and Salesforce Registration

Because the input is unified, only one type of LLM node (using OpenAI in this case) prompt is needed.

Extract the following items from the resume/CV below in JSON format:
- Name
- Age
- Skills
- Work history (Project name, Duration, Role)
- Qualifications

Document Body:
{{ $json.extractedText }}

Map the JSON returned by the LLM to the Salesforce node and create a record in the custom object for candidate management. The overall data flow is: "Source File (PDF / Excel / Word) → extractedText (plain text) → Structured JSON via LLM → Salesforce Custom Object."


5. Results

Operating this workflow yielded the following results:

  • All steps from Webhook reception to Salesforce registration were fully automated, regardless of file format.
  • Manual work that took about 10–15 minutes per document was reduced to 5–10 seconds (DOCX is slightly slower due to Lambda overhead).
  • Man-hours for dozens of documents per month have become near zero.
  • The structure is highly extensible; if new formats like RTF or ODT are required in the future, it can be handled simply by adding a path to the Switch.

Summary

In this article, I shared the record of building a pipeline that processes multiple file formats uniformly and links them with Salesforce using n8n and AWS Lambda. Key design points are summarized below:

Point Content
Unification of Intermediate Representations Place a Code node at the end of each path to normalize output to { extractedText: "..." }
Handling n8n Cloud Constraints Offload complex processing to AWS Lambda when external modules are prohibited
Lambda Library Selection Prioritize Python standard libraries (zipfile + xml.etree) and avoid lxml C extensions
Handling .doc Files Utilize the fact that text is stored as UTF-16LE raw bytes to extract without external dependencies
Merge Node Mode Always choose Append for Switch branch scenarios to avoid stalls from Combine
Lambda Permission Settings Grant both InvokeFunctionUrl and InvokeFunction permissions to the Function URL

While n8n is a very convenient tool, when you hit the limitations of the Cloud version, collaborating with external services like AWS Lambda greatly expands your capabilities. Also, for legacy formats like .doc, you can sometimes process them without external dependencies if you understand the file structure itself. I hope this article is helpful for those working on business automation with Salesforce × n8n × AI.

Discussion