
[M4 Mac mini / 16GB] AI Validation Log #05: Implementing Knowledge Pipelines

Published

Introduction

Hello, I'm dadu.
In the previous article (#04), I built a RAG environment for the fictional anime 'Nexus Core' using the basic Knowledge feature.

This time, I'm taking on the "Knowledge Pipeline". You might wonder, "What's the difference?", but this is a powerful feature that allows you to visualize and customize the entire process—from data ingestion to cleaning and chunking—as a single "flow."
(That said, it might be hard to grasp, so I think trying it out is the fastest way to understand it.)

However, when trying to use it in a local environment (OrbStack / Docker), it was a bit troublesome as I encountered "communication errors" and other issues, so I'd like to share those as well.

1. Why Adopt a Knowledge Pipeline?

If regular knowledge registration is like "uploading files," a pipeline is "building an information processing line."
Put simply, it automates the manual preparation of documents such as PDFs or Word files (for example, converting them to Markdown) so they are easier for the AI to ingest before registration.

The flow looks like this:

*Note: Dify Extractor and General Chunker need to be installed from Tools after adding a block to the flow.

Extractor (Dify Extractor)

This component is responsible for extracting "raw text" that can be processed by AI from uploaded files (PDF, Word, Excel, Markdown, etc.).

  • Replacing manual labor: Previously, I had to copy and paste PDF content and save it as a .txt file, but this automates the entire process.
  • Cleanup: It removes unnecessary control codes and metadata, extracting only the pure body text to pass to the next stage.

Configuration Example:
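To make the "cleanup" step concrete, here is a minimal Python sketch of the kind of work an extractor does after pulling raw text out of a file. This is my own illustration, not Dify Extractor's actual implementation, and the function name is hypothetical:

```python
import re

def extract_clean_text(raw: str) -> str:
    """Strip control codes and collapse stray whitespace, keeping only
    the body text (a rough stand-in for an extractor's cleanup pass)."""
    # Remove non-printable control characters, but keep newlines and tabs
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", raw)
    # Collapse the runs of blank lines that page breaks leave behind
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim trailing spaces at the end of each line
    text = re.sub(r"[ \t]+\n", "\n", text)
    return text.strip()
```

Calling `extract_clean_text("Nexus Core\x0c notes\n\n\n\nEpisode 1")` would return the same text with the form-feed character removed and the blank lines collapsed to a single empty line.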

Chunker (General Chunker)

This component's role is to "slice" the extracted long text into appropriate sizes that an AI (such as Llama 3.2) can process at once.

  • Maintaining context: Instead of cutting blindly, it recognizes delimiters like periods (.) or \n (line breaks) to divide the text.
  • Overlap (Margin): By slightly overlapping the end of one chunk with the beginning of the next, it ensures that context is not lost even if the sentence is split.
  • Benefits for 16GB Mac mini: By keeping each chunk at an appropriate size (around 800 characters), we can maximize search accuracy while keeping memory consumption low.

Configuration Example:
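The chunking behavior described above (delimiter-aware splitting plus overlap) can be sketched in a few lines of Python. Again, this is a conceptual stand-in, not Dify's General Chunker; the function name and the 800/80 defaults are my own illustration of the settings discussed above:

```python
def chunk_text(text: str, max_chars: int = 800, overlap: int = 80) -> list[str]:
    """Split text into chunks of at most max_chars, preferring to cut at
    a delimiter (line break or period), carrying `overlap` characters
    from the end of one chunk into the start of the next."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last delimiter inside the current window
            window = text[start:end]
            cut = max(window.rfind("\n"), window.rfind(". "))
            if cut > 0:
                end = start + cut + 1  # keep the delimiter with the chunk
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # step back to create overlap
    return chunks
```

With `max_chars=4` and `overlap=1`, `chunk_text("abcdef")` yields `["abcd", "def"]`: the character `d` appears at the end of the first chunk and the start of the second, which is exactly the "margin" that keeps a split sentence searchable.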

2. Encountered Error: The Rebellion of Dify Extractor

Immediately after setting up and running the pipeline with high hopes, it stopped at the Dify Extractor node with the following error.

ValueError: Invalid file URL ... missing an 'http://' or 'https://' protocol.

It complained about a missing protocol even though the input variables were connected correctly. Dify apparently failed to resolve the URL of the temporary file storage location it generates internally.
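To see why a missing base URL produces this exact complaint, here is a sketch of the kind of validation that fails. This is not Dify's actual code; it just shows that when the file-URL base is unset, the generated "URL" is only a relative path with no scheme, so validation raises before the file is ever fetched:

```python
from urllib.parse import urlparse

def validate_file_url(url: str) -> str:
    """Reject URLs that lack an http/https scheme (illustrative only)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(
            f"Invalid file URL {url!r}: missing an 'http://' or 'https://' protocol."
        )
    return url
```

A full URL such as `http://api:5001/files/...` passes, while a bare path like `/files/tmp/doc.pdf` (what you effectively get when the base URL is unconfigured) raises the ValueError.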

3. [Important] Network Settings in OrbStack Environment

The cause was that Dify inside the Docker container could not correctly recognize its own "location (URL)." Therefore, the .env file needs to be modified to tell Dify the addresses for both outside the container (Mac host) and inside (Dify internal).

Solution: Rewriting .env

Open the hidden file .env in the docker folder, modify it as follows, and restart the container (docker compose down && docker compose up -d).

# docker/.env

# Address visible from the browser
FILES_URL=http://localhost:5001

# Address for Docker containers to communicate via internal network
INTERNAL_FILES_URL=http://api:5001

4. Test Run

After fixing the .env file, I ran it again (apparently called a "test run"), and it worked fine.

Clicking the publish button displays the following message.

A pipeline knowledge icon will then appear on the Knowledge screen. Register the knowledge as you would with a regular one (note that data processed during the test run is not saved).

5. Integrating into Chat Flow and Verifying Operation

I will replace the knowledge in the chat flow from the previous verification with the pipeline one.
(Since the pipeline is only for creating the knowledge, there shouldn't be any issues beyond that point.)

This time it gave a quite lengthy explanation, but it seems to be working fine.

It correctly provides the citations at the end of the text, so it looks good. (I should have checked this last time as well.)

6. Summary

In this article, I challenged myself with Dify's new feature, "Knowledge Pipeline."
Although there were some network issues specific to the environment setup, I successfully achieved an "advanced RAG" in a local environment.

The insights gained from this verification are as follows:

  • Pipelines are an investment in "future automation"
    Since manual file processing (such as converting to .txt) can be automated, this system will become increasingly effective as the volume of documents handled grows.
  • Local environment "walls" can be overcome through configuration
    Errors caused by .env misconfigurations can be discouraging, but by correctly informing the Docker containers of their "addresses (URLs)," it operates stably even on a 16GB Mac mini.
  • Providing citations ensures RAG reliability
    By clearly displaying citations in chat responses, it becomes easier to spot AI "hallucinations," increasing peace of mind for practical use.

The current status of the 16GB Mac mini

While I felt the resource limits with "Visual Language Models (VLM)," the M4 Mac mini (16GB) delivers excellent performance when used as a "text processing base (RAG)." Swapping is minimized, and memory pressure remains in a healthy state.

Future Outlook

With this, I have been able to build a chatbot utilizing RAG even in a local environment. Towards practical application, further fine-tuning of parameters and re-selection of models may be necessary, but I plan to continue refining them.

As the next step, I aim to evolve this into an "AI Agent." Specifically, I'm thinking of combining free MCPs (like using Filesystem to manipulate local files or getting the latest information via the Google Search API free tier) to experiment with an "AI that can act" beyond just providing answers.
