Category: AI Development

  • Two Kinds of “The Operator Cannot See Your Prompt”

    A map of private inference in 2026

    Darkbloom launched this week, and the response on Hacker News — 470 points, hundreds of comments — is a clean signal: developers want cheaper inference, and they want it without a hyperscaler reading their prompts. The technical pitch is striking. Idle Apple Silicon Macs serve inference. Requests are end-to-end encrypted. Debuggers are denied at the kernel level. An operator with full physical custody of the machine cannot read what flows through it.

    That last claim is worth pausing on. “The operator cannot see your prompt” is true of Darkbloom in roughly the same sense that it is true of Apple’s Private Cloud Compute and NVIDIA Confidential Computing. It is also true, in a completely different way, of systems built on fully homomorphic encryption, multi-party computation, and zero-knowledge proofs.

    These are not the same technology. They are not even the same category of technology. They defend against different adversaries, rely on different trust anchors, and fail in different ways. Treating them interchangeably is how buyers end up deploying the wrong tool for their actual threat model.

    This post is a map. It is not an argument that one approach is correct and another is wrong. Both are real. Both are shipping. Both have uses the other cannot cover. The goal is to give you the distinctions you need to read a “private AI” claim, whether it is your own or someone else’s, without getting fooled.

    “Operator cannot see the prompt” has two meanings

    The two meanings are worth stating plainly before anything else.

    The TEE-based meaning. The data is decrypted inside a hardware-isolated execution environment: a hardened process anchored in the Apple Secure Enclave, an Intel TDX trust domain, an AMD SEV-SNP guest, or an NVIDIA H100 or Blackwell GPU running in confidential-compute mode. Inside that environment, the data is in the clear. Computation happens on plaintext, at full hardware speed. What prevents the operator from seeing it is a combination of hardware isolation, memory encryption at the bus level, cryptographic attestation of the software stack, and policy choices like disabling debuggers and logging. The guarantee is: an attacker who controls everything except the silicon root of trust cannot observe the data.

    The cryptographic meaning. The data is never decrypted at all during computation. It remains ciphertext end-to-end. The server performs arithmetic on encrypted values and returns encrypted results. Only the key holder can read the output. What prevents the operator from seeing the data is mathematics — specifically, the hardness of lattice problems underlying schemes like CKKS, BFV, and TFHE. The guarantee is: an attacker who controls everything, including the silicon, cannot observe the data, because no component of the system ever holds it in plaintext.

    These are different guarantees with different costs. The first is fast and practical for realistic workloads but requires you to trust the hardware vendor. The second is slow and narrow but requires you to trust no one in particular.

    The TEE family, examined

    Start with Darkbloom’s concrete design, because it is a good representative of the current state of TEE-based AI privacy. The provider process runs in-process with the inference engine — no subprocess, no local server, no IPC. PT_DENY_ATTACH blocks debuggers at the kernel level. Memory-reading APIs are denied. The coordinator encrypts each request with the provider’s X25519 key before forwarding; only the hardened provider process decrypts. Attestation data is publicly verifiable. The trust anchor is the Apple Secure Enclave.

    Apple’s own Private Cloud Compute is a close cousin of this architecture, deployed at hyperscaler scale. PCC uses custom Apple Silicon servers, a hardened OS, and cryptographic attestation of every software image running in the data center. Requests are routed through an anonymizing relay so that Apple cannot link a request to a user. Crucially, and this is explicit in Apple’s threat model, PCC does not encrypt data during runtime on the node. Data is decrypted inside the trusted environment and processed in the clear. What PCC provides is a hardened path to that environment and a cryptographic guarantee about what code will run once the data arrives.
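
    The attestation step can be pictured as a gate on the client side: the prompt is released only if the node reports a software measurement the client already trusts. The sketch below is a deliberately simplified illustration and is not Apple's or Darkbloom's actual protocol. The names `TRUSTED_MEASUREMENTS`, `measure_image`, and `should_release_prompt` are hypothetical, and a real verifier would also validate a hardware-signed attestation chain rather than compare a bare hash.

```python
import hashlib
import hmac


def measure_image(image_bytes: bytes) -> str:
    """Return a SHA-256 digest standing in for a software measurement."""
    return hashlib.sha256(image_bytes).hexdigest()


# Hypothetical allowlist of published, audited image measurements.
TRUSTED_MEASUREMENTS = {measure_image(b"audited-inference-image-v1")}


def should_release_prompt(reported_measurement: str) -> bool:
    """Release data only if the reported measurement is on the allowlist."""
    return any(
        hmac.compare_digest(reported_measurement, trusted)
        for trusted in TRUSTED_MEASUREMENTS
    )


good = measure_image(b"audited-inference-image-v1")
bad = measure_image(b"image-with-logging-enabled")
print(should_release_prompt(good))  # True
print(should_release_prompt(bad))   # False
```

    The point of the sketch is the ordering: verification happens before any plaintext leaves the client, which is what turns "what code will run" into something the client can check rather than take on faith.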

    NVIDIA’s Confidential Computing on H100 and Blackwell GPUs extends the same pattern to GPU workloads. The GPU has an on-die hardware root of trust, encrypted memory, and an encrypted bounce buffer between CPU and GPU. In confidential-compute mode, data stays encrypted on the bus and in GPU memory until it is inside the TEE boundary. Blackwell adds TEE-I/O, which extends the protected path over NVLink, so multi-GPU workloads can stay confidential across the interconnect. Published benchmarks put Blackwell’s confidential mode at nearly the same throughput as unencrypted — a dramatically different cost curve than the FHE world.

    What all three share is the trust model. You are trusting:

    1. That the hardware vendor designed the root of trust correctly.
    2. That the hardware vendor did not insert a backdoor, whether deliberately or under government compulsion.
    3. That the attestation chain has no exploitable flaw between the hardware measurement and the code running inside.
    4. That side channels — timing, power, electromagnetic, speculative-execution — do not leak enough information to defeat the isolation.
    5. That the supply chain delivered the actual chip the vendor designed, without tampering.

    These are not trivial assumptions. They are routinely challenged by academic research, including recent in-depth analyses of NVIDIA’s GPU confidential-computing architecture. But for most commercial threat models — “don’t let the cloud provider’s engineers read my prompts,” “don’t let a compromised host OS steal my model weights” — TEEs are a perfectly reasonable answer, and they run at production speeds.

    The cryptographic family, examined

    FHE, MPC, and ZKP are not one technology but three closely related ones, each with different primitives and different trade-offs. They share a structural property: the adversary is assumed to be unbounded in their access to the system, and the security guarantee follows from mathematics rather than from hardware.

    Fully homomorphic encryption allows arbitrary arithmetic on ciphertext. Modern schemes — CKKS for approximate arithmetic, BFV/BGV for exact integer arithmetic, TFHE for boolean circuits — encode a vector of plaintext values into a ring-element ciphertext and support ciphertext addition and multiplication with noise that grows with circuit depth. Bootstrapping refreshes the noise but is expensive. The security reduction is to the Ring Learning With Errors problem, which is believed hard against both classical and quantum adversaries.
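
    To make "arithmetic on ciphertext" and "noise that grows" concrete, here is a toy LWE-style scheme that supports ciphertext addition only. It is a teaching sketch under loud assumptions: the parameters (`N`, `Q`, `T`, `NOISE`) are far too small to be secure, it is not CKKS, BFV, or TFHE, and it omits multiplication and bootstrapping entirely.

```python
import random

N = 16          # secret dimension (toy size, NOT secure)
Q = 1 << 15     # ciphertext modulus
T = 16          # plaintext modulus
DELTA = Q // T  # scaling factor that lifts the message above the noise
NOISE = 4       # fresh noise is drawn from [-NOISE, NOISE]


def keygen(rng):
    """Sample a binary secret key."""
    return [rng.randrange(2) for _ in range(N)]


def encrypt(s, m, rng):
    """Encrypt m in [0, T) as (a, <a, s> + e + DELTA * m mod Q)."""
    a = [rng.randrange(Q) for _ in range(N)]
    e = rng.randint(-NOISE, NOISE)
    b = (sum(ai * si for ai, si in zip(a, s)) + e + DELTA * m) % Q
    return a, b


def add(c1, c2):
    """Add two ciphertexts component-wise; the noise terms add up too."""
    a = [(x + y) % Q for x, y in zip(c1[0], c2[0])]
    return a, (c1[1] + c2[1]) % Q


def decrypt(s, c):
    """Strip the mask <a, s>, then round away the noise."""
    a, b = c
    phase = (b - sum(ai * si for ai, si in zip(a, s))) % Q
    return round(phase / DELTA) % T


rng = random.Random(0)
secret = keygen(rng)
total = add(encrypt(secret, 5, rng), encrypt(secret, 9, rng))
print(decrypt(secret, total))  # 14
```

    The server-side step is `add`, which never touches the key. Each addition accumulates noise, and with these toy parameters decryption stays correct only while the accumulated noise is below `DELTA / 2`, which is the budget that bootstrapping exists to refresh in real schemes.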

    Multi-party computation splits data across several parties such that no single party sees the plaintext; computation proceeds through interaction between the parties, and the result is correct as long as some threshold of parties remains honest. Threshold FHE is the natural fusion: the decryption key itself is secret-shared across parties, so no single party can decrypt at all.
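
    The simplest MPC building block is additive secret sharing, sketched below for a private sum. This is an illustration under stated assumptions rather than a production protocol: the modulus `P`, the three-party setup, and the input figures are made up, the parties are assumed to follow the protocol honestly, and privacy holds only as long as at least one compute party does not collude with the others.

```python
import secrets

P = 2**61 - 1  # prime modulus; all arithmetic is mod P


def share(value: int, n_parties: int) -> list[int]:
    """Split value into n random-looking shares that sum to value mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recombine shares; any proper subset reveals nothing about the value."""
    return sum(shares) % P


# Three data owners each split a private figure among three compute parties.
private_inputs = [120, 45, 300]
all_shares = [share(v, 3) for v in private_inputs]

# Party j adds the j-th share from every owner and sees only random values.
partial_sums = [sum(s[j] for s in all_shares) % P for j in range(3)]

print(reconstruct(partial_sums))  # 465
```

    Only the final aggregate is ever reconstructed. No compute party, and no owner, learns another owner's individual input.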

    Zero-knowledge proofs let a prover convince a verifier that a statement is true without revealing anything beyond the fact of its truth. For private inference, this matters because you often want not just the answer but a proof that the answer was computed correctly from the encrypted input.

    The honest story about FHE performance in 2026 is that it is improving fast and is still very slow compared to plaintext. Recent surveys put FHE overhead at roughly 10^5× slower than cleartext for realistic deep learning. GPU-accelerated CKKS implementations have brought CNN inference on CIFAR-10 down from thousands of seconds to a few seconds per image. For LLMs, the state of the art is something like GPT-2 small with LoRA, reporting on the order of 1.6 seconds per token under carefully engineered parameter choices. Recent ICLR work on FHE-based transformer inference reports single-digit-hours per prefill for small models. This is not a technology you plug into your Claude replacement and expect interactive chat.

    Where FHE genuinely shines is in computations with modest arithmetic depth applied to data from mutually distrusting parties. Private set intersection. Encrypted database queries. Summing encrypted supplier-level CO₂ emissions across a supply chain so that aggregate Scope 3 reporting becomes possible without any supplier revealing raw data to any other. Matching encrypted medical records across hospitals — an organ-transplant problem, for instance — where Threshold FHE removes the question of “who holds the decryption key” by ensuring nobody does. These are not LLM inference workloads. They are workloads where the privacy requirement is structural, where the parties have legal or competitive reasons to distrust one another, and where latency budgets are measured in minutes or hours rather than milliseconds.

    A threat model comparison

    Laying the two families side by side against concrete adversaries clarifies where each fits.

    A cloud operator’s curious engineer. TEEs defeat this attacker decisively — data is encrypted in transit, attested on arrival, processed only by audited code. FHE defeats this attacker too, but you paid 10^5× in compute to defeat an adversary a TEE would have beaten for free. The engineer is the TEE’s home turf.

    A malicious host OS or hypervisor. TEEs handle this — that is precisely what confidential VMs and confidential containers are designed for. FHE handles it trivially, because the host never sees plaintext at all. Either works.

    A sophisticated physical attacker with a bus analyzer and a DRAM cooling attack. TEEs mitigate this at considerable effort — Apple explicitly includes this attacker in PCC’s threat model; NVIDIA’s published threat model for Hopper and Blackwell confidential mode addresses PCIe bus probing with in-line encryption. Whether the mitigation is sufficient depends on the specific attack and the specific hardware generation. FHE is indifferent to this attacker by construction. The bus carries only ciphertext.

    The hardware vendor itself, or a state actor compelling the hardware vendor. TEEs cannot defend against this — the root of trust is the vendor. This is not a flaw in the technology; it is a definition. FHE defends against this, because no hardware component is assumed trustworthy. If your threat model includes “what if Apple or NVIDIA is compromised,” only the cryptographic family applies.

    A future adversary with a cryptographically relevant quantum computer, decrypting harvested ciphertext from 2026. This is the “harvest now, decrypt later” concern. Most TEE-based systems today use elliptic-curve key exchange (X25519 is the common choice, including in Darkbloom), which Shor’s algorithm would break on such a machine. The TEE’s data-in-use guarantee rests on hardware isolation rather than on a cryptographic hardness assumption, so a quantum computer does not weaken it, but recorded transport traffic would become readable. FHE based on lattice assumptions (RLWE) is believed post-quantum secure for the data itself. Mature designs in either family are beginning to adopt ML-KEM for key exchange; check the specific system.

    Side channels — timing, cache, power, EM. Both families are vulnerable, in different ways. TEEs have a substantial published literature of side-channel breaks; mitigating them is an ongoing effort. FHE implementations have their own side-channel issues, particularly around bootstrapping and noise management. Neither is a silver bullet.

    No single technology dominates across this table. That is the whole point.

    Which one should you use

    The honest guidance is that these are complementary, not competing, for most realistic deployments.

    For interactive LLM inference at scale, TEEs are the only practical answer in 2026. The cost curve simply does not support FHE-based chat completion. Darkbloom, Apple PCC, NVIDIA confidential GPU instances, Azure confidential containers with H100 — this is where the industry is, and it is a reasonable place to be for most commercial privacy requirements.

    For fixed-depth arithmetic over encrypted data from mutually distrusting sources, the cryptographic family is frequently the correct choice and sometimes the only choice. Aggregating Scope 3 emissions across a supply chain where each supplier’s raw data is competitively sensitive. Matching medical records across hospitals where legal constraints forbid any party from holding plaintext from another party. Financial settlement calculations where the regulator, the counterparties, and the platform operator are all potential adversaries to each other. Voting systems where verifiability and ballot secrecy must hold simultaneously. These are not LLM problems. They are problems where the structure of distrust is irreducible, and trying to solve them with a TEE means picking which participant gets to be the trusted party — which is often the problem you were hired to eliminate.

    For systems where the threat model genuinely includes the hardware vendor, only the cryptographic family is responsive. This is a smaller market than the previous two, but it exists, and it is where certain strands of post-quantum cryptographic infrastructure are being built.

    A useful heuristic: if you can name the party who holds the decryption key, you are in TEE territory and should probably just use a good TEE. If the honest answer is “nobody holds the key, and that is the point,” you are in cryptographic territory and should not try to reduce it to a hardware problem.

    The 2026 picture

    What I see when I look at this landscape is not a competition but a division of labor that is still being worked out in public.

    TEE-based private inference is having a commercial moment. Darkbloom’s Apple-Silicon-on-idle-Macs architecture, Apple’s data-center PCC deployment, and NVIDIA’s Blackwell confidential GPUs are all maturing at the same time, and they collectively make “private by construction” a realistic default for AI workloads rather than a research curiosity. The remaining questions are not technical so much as governance-shaped: how is attestation verified, who audits the hardened OS images, how are side-channel disclosures handled, how does the supply chain prove itself.

    Cryptographic private computation is having a different kind of moment. GPU-accelerated CKKS has crossed the threshold where small CNN inference is genuinely practical. Threshold FHE is being deployed in real multi-party workflows. Zero-knowledge systems are standardizing. The workloads being unlocked are not “chat with a model” — they are the structural-distrust workloads that TEEs cannot cleanly serve, and there are more of those than the LLM-centric discourse usually admits.

    The mistake to avoid in either direction is conflation. If you read “operator cannot see your prompt” and do not ask which guarantee is being offered, you will eventually end up with the wrong one. If you read “privacy-preserving AI” and do not ask whether the trust root is silicon or mathematics, you cannot evaluate whether the claim matches your threat model.

    Both of these families are real technologies solving real problems. The point is to know which one is in front of you when somebody says the word “private.”


    The author works on cryptographic infrastructure for supply-chain and healthcare applications, including post-quantum key management (hyde), GPU-accelerated CKKS (plat), and multi-party organ-matching (Niobi).

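  • GitHub Agentic Workflows and Cursor Parallel Agents: The AI Dev-Tool Front Line in February 2026

    February 2026 brought two major moves in AI development tools. GitHub released “Agentic Workflows” as a technical preview, and Cursor added support for running AI agents in parallel VMs. Both mark the shift from “AI assistant” to “AI agent.”

    GitHub Agentic Workflows: Writing CI/CD in Markdown

    On February 13, GitHub released Agentic Workflows as a technical preview. Its defining feature is that CI/CD automation is written in Markdown rather than YAML.

    How it works

    • Place Markdown files under .github/workflows/
    • The gh aw CLI command converts the Markdown into GitHub Actions
    • Multiple AI engines are supported: Copilot CLI, Claude Code, OpenAI Codex
    • Permissions are read-only by default. PRs cannot be merged automatically
    • Fully open source under the MIT license

    For example, if you write the instruction “when a PR is opened, review the code, and if the tests pass, deploy to staging” in natural language, an AI agent runs it as an Actions workflow.

    Points of note

    The multi-engine support stands out. You can choose Claude Code or OpenAI Codex as well as Copilot CLI, which shows GitHub trying to avoid vendor lock-in. The security design is also cautious: read-only by default, with automatic PR merging prohibited.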

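    Cursor: AI Agents Running in Parallel VMs

    On February 24, the AI code editor Cursor announced a major update. AI agents can now run in parallel on dedicated virtual machines (VMs).

    What changed

    • Parallel execution: multiple AI agents each run on an independent VM and consume no resources on the local machine
    • Self-testing: agents test their own changes and record the results as video or screenshots
    • Plugin system: integrations with Amplitude, AWS, Figma, Stripe, and others
    • Cross-platform sandbox: reduces developer interruptions by 40%
    • About 35% of Cursor’s PRs are generated by agents running on VMs

    Paradigm shift

    This is a shift from “code completion in a single file” to “10-20 parallel agents working on tasks at the same time.” Handing a bug fix to one agent, new tests to another, and documentation updates to a third is now a practical workflow.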

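    Trends in AI development tools

    The trends visible in these two moves are clear.

    1. From “assistant” to “agent”: carrying out whole tasks autonomously rather than completing code
    2. Sandboxing and security: managing agent permissions becomes a hard requirement
    3. Multi-engine: designs that do not depend on a specific AI model
    4. Parallelism: architectures that assume multiple agents running at once

    2026 looks set to be the year that “AI writes code” evolves into “an AI team runs the project.” GitHub Agentic Workflows is open source, so it is worth trying out.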

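  • OLMo 3: What Is a “Glass-Box” LLM Whose Reasoning Can Be Traced Back to Its Training Data?

    With LLMs such as ChatGPT and Claude, you cannot tell why the model gave a particular answer. OLMo 3, released by the Allen Institute for AI (AI2), is a “glass-box” model whose reasoning can be traced back to its training data. As a new step in AI transparency, it is a project worth watching in Japan as well.

    What is OLMo 3

    OLMo 3, released by AI2, is a family of open-source reasoning models at 7B and 32B parameters. Its defining feature is full transparency.

    • Training data: Dolma 3 (about 9.3 trillion tokens) is fully public
    • Training code: entirely public
    • Post-training recipe: the RLHF and fine-tuning methods are also public
    • Model weights: open weights

    In other words, every stage from the model’s inputs to its outputs can be inspected.

    OlmoTrace: tracing the basis of an answer

    The most innovative feature of OLMo 3 is OlmoTrace. For a given model answer, it lets you:

    1. Visualize the intermediate reasoning steps (the chain of thought)
    2. Identify which training data each step is based on
    3. Look up the original text of that training data

    For example, if the model answers “Tokyo’s population is about 14 million,” you can trace back which Dolma 3 documents that figure was learned from. This leads directly to pinpointing the causes of hallucinations.

    OLMo 3-Think: the strongest open reasoning model at the 32B scale

    OLMo 3-Think (32B) achieves the best performance among fully open-source reasoning models at the 32B-parameter scale. As the name “Think” suggests, it performs step-by-step reasoning (chain of thought) and can break complex problems down to solve them.

    Why the “glass box” matters

    A major problem with today’s LLMs is that they are black boxes.

    • Regulation: rules demanding AI transparency, starting with the EU AI Act, are tightening worldwide
    • Trust: transparency is essential for using AI in fields that need justification, such as medicine, law, and finance
    • Research: understanding and improving model behavior requires visibility into the internals
    • Copyright: it becomes possible to check whether generated output is based on copyrighted works in the training data

    OLMo 3’s approach is a response to all of these issues.

    How to try it

    OLMo 3 is fully published on GitHub. You can download the models from Hugging Face and run them locally. A quantized version of the 32B model runs in roughly 16 GB of GPU memory.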
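
    The 16 GB figure can be sanity-checked with simple arithmetic: weight memory is the parameter count times the bits per weight, divided by eight. The helper below is a back-of-the-envelope sketch. `weight_memory_gb` is a hypothetical name, it counts weights only, and the KV cache, activations, and runtime overhead come on top, so treat the result as a lower bound.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(32e9, bits):.1f} GB")
# 16-bit: 64.0 GB
# 8-bit: 32.0 GB
# 4-bit: 16.0 GB
```

    So the figure corresponds to 4-bit quantization of the 32B model, with little headroom left on a 16 GB card once the context grows.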

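    An OlmoTrace demo is also available on AI2’s site, so you can try it from a browser. Seeing for yourself why the AI answered the way it did should deepen your understanding of LLMs.

    Summary

    OLMo 3 is not the most capable model. It is, however, the most transparent one. As AI becomes part of society’s infrastructure, the value of being able to trace why an answer came out the way it did is hard to overstate. As long as commercial LLMs remain black boxes, projects like OLMo 3 matter for the AI industry as a whole.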

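  • Introduction to Building an AI Chatbot with the Claude API and Python

    This post explains how to build an AI chatbot in Python using Anthropic’s Claude API. Since 2024, Claude has drawn attention as a high-performance LLM on par with GPT-4o.

    What is the Claude API

    The Claude API is the API for the large language models (LLMs) provided by Anthropic. Like the OpenAI API behind ChatGPT, it lets you converse with the model over HTTP requests. Claude 3.5 Sonnet in particular performs well at coding assistance and long-document processing.

    Environment setup

    First obtain an API key from Anthropic’s official site, then prepare your Python environment.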

    pip install anthropic
    export ANTHROPIC_API_KEY="your-api-key-here"

    Implementing a basic chatbot

    Below is the simplest chatbot implementation using the Claude API.

    import anthropic
    
    client = anthropic.Anthropic()
    
    def chat(user_message: str) -> str:
        message = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[
                {"role": "user", "content": user_message}
            ]
        )
        return message.content[0].text
    
    # Conversation loop
    while True:
        user_input = input("You: ")
        if user_input.lower() in ["quit", "exit"]:
            break
        response = chat(user_input)
        print(f"Claude: {response}")

    Keeping conversation history

    A practical chatbot needs to retain the context of the conversation. You can do this by accumulating past exchanges in the messages list.

    import anthropic

    class ChatBot:
        def __init__(self, system_prompt="You are a helpful assistant."):
            self.client = anthropic.Anthropic()
            self.system = system_prompt
            self.messages = []  # full history, resent on every request
    
        def send(self, user_message: str) -> str:
            self.messages.append({"role": "user", "content": user_message})
            response = self.client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=2048,
                system=self.system,
                messages=self.messages
            )
            assistant_msg = response.content[0].text
            # Store the reply so the next turn sees it as context
            self.messages.append({"role": "assistant", "content": assistant_msg})
            return assistant_msg
    
    bot = ChatBot("You are an expert in Python programming.")
    print(bot.send("Please explain list comprehensions."))
    print(bot.send("Can you show me a few more concrete examples?"))
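
    Because the messages list is resent on every request, it grows without bound, and token cost grows with it. One simple mitigation is to keep only the most recent turns. The helper below is a minimal sketch: `trim_history` and `max_turns` are hypothetical names, and it assumes the API wants the list to begin with a user message, so it drops any leading assistant message left over after slicing.

```python
def trim_history(messages: list[dict], max_turns: int) -> list[dict]:
    """Keep at most the last `max_turns` user/assistant exchanges."""
    if max_turns <= 0:
        return []
    trimmed = messages[-2 * max_turns:]
    # Drop a leading assistant message so the list still starts with a user turn.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


history = [
    {"role": "user", "content": "u1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"},
]
print([m["content"] for m in trim_history(history, max_turns=2)])
# ['u2', 'a2', 'u3']
```

    In a class like the one above, this could be applied to the history just before each request. Summarizing older turns instead of dropping them is a common alternative when early context matters.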

    Error handling

    In production you need to plan for rate limits and network errors. The anthropic library has built-in automatic retries, but explicit error handling is still important.

    import anthropic
    from anthropic import RateLimitError, APIConnectionError
    
    client = anthropic.Anthropic()
    
    try:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": "Hello"}]
        )
    except RateLimitError:
        print("Rate limit reached. Wait a moment and try again.")
    except APIConnectionError:
        print("API connection error. Check your network connection.")
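
    When printing a message is not enough, a common pattern is to retry with exponential backoff and jitter. The wrapper below is a library-agnostic sketch using only the standard library. `with_backoff` and its parameters are hypothetical names, and the demo simulates failures with `ConnectionError` instead of calling the API.

```python
import random
import time


def with_backoff(call, retryable=(Exception,), max_attempts=4,
                 base_delay=1.0, sleep=time.sleep):
    """Run call(); on a retryable error, wait 1x, 2x, 4x... base_delay and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Jitter spreads out clients that failed at the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


attempts = {"n": 0}


def flaky():
    """Fail twice, then succeed, to simulate a transient outage."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = with_backoff(flaky, retryable=(ConnectionError,), sleep=lambda _: None)
print(result, attempts["n"])  # ok 3
```

    With the code above, the same wrapper could be used by passing a lambda that calls client.messages.create(...) and setting retryable=(RateLimitError, APIConnectionError).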

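    Summary

    The Claude API has an intuitive interface and lets you build an AI chatbot with very little code. Next time we will show how to build a chatbot that uses Function Calling to work with an external database.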

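  • An Implementation Guide to OpenAI Function Calling: Letting the AI Use External Tools

    With OpenAI’s Function Calling feature, the AI can work with external APIs and databases to fetch real-time information. This post walks through practical use cases such as weather lookups, product search, and database queries.

    What is Function Calling

    With Function Calling, you define the functions that GPT-4 or GPT-4o is allowed to use, and the model picks the appropriate one for the user’s question. The AI does not execute the function itself. It returns an instruction saying, in effect, “call this function with these arguments.”

    Basic implementation

    Let’s implement product search as an example.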

    from openai import OpenAI
    import json
    
    client = OpenAI()
    
    # Define the functions the model may call
    tools = [
        {
            "type": "function",
            "function": {
                "name": "search_products",
                "description": "Search the product database by keyword",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "keyword": {
                            "type": "string",
                            "description": "Search keyword (e.g. red T-shirt)"
                        },
                        "max_price": {
                            "type": "integer",
                            "description": "Maximum price (yen)"
                        }
                    },
                    "required": ["keyword"]
                }
            }
        }
    ]
    
    # The actual search function
    def search_products(keyword, max_price=None):
        # In production this would run a DB query
        products = [
            {"name": "Red T-shirt", "price": 2980},
            {"name": "Blue T-shirt", "price": 3480},
        ]
        # Compare against None so that a max_price of 0 is still honored
        if max_price is not None:
            products = [p for p in products if p["price"] <= max_price]
        # Case-insensitive substring match on the product name
        return [p for p in products if keyword.lower() in p["name"].lower()]

    The conversation flow with the AI

    A Function Calling conversation proceeds in three steps.

    question = "Do you have any red T-shirts for 3000 yen or less?"
    
    # Step 1: send the user's question to the model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        tools=tools,
    )
    
    # Step 2: the model requests a function call
    message = response.choices[0].message
    if not message.tool_calls:
        # The model answered directly without requesting a tool
        print(message.content)
        raise SystemExit
    tool_call = message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    # e.g. args = {"keyword": "red T-shirt", "max_price": 3000}
    
    # Step 3: run the function and return the result to the model
    result = search_products(**args)
    final_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": question},
            message,
            {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result, ensure_ascii=False)}
        ],
        tools=tools,
    )
    print(final_response.choices[0].message.content)

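    Defining multiple functions

    Real applications define several functions and let the AI choose among them as the situation requires. In addition to product search, you can add order-status checks, stock checks, and so on. The AI selects the most suitable function from the content of the question.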
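
    With several functions defined, the application side needs a way to route each requested call to the right Python function. A small registry keyed by function name is one way to do that. The sketch below is illustrative only: `TOOL_REGISTRY`, `dispatch`, and `check_stock` are hypothetical names, and both functions are stubs standing in for real database access.

```python
import json


def search_products(keyword, max_price=None):
    """Stub: a real implementation would query the product database."""
    return [{"name": "Red T-shirt", "price": 2980}]


def check_stock(product_name):
    """Stub: hypothetical stock lookup."""
    return {"product": product_name, "in_stock": True}


# Map the tool names declared to the model onto local callables.
TOOL_REGISTRY = {
    "search_products": search_products,
    "check_stock": check_stock,
}


def dispatch(name: str, arguments_json: str) -> str:
    """Run the named tool and return a JSON string for the 'tool' message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(func(**args), ensure_ascii=False)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the signature
        return json.dumps({"error": str(exc)})


print(dispatch("check_stock", '{"product_name": "Red T-shirt"}'))
print(dispatch("cancel_order", "{}"))
```

    Returning errors as JSON rather than raising lets the model see what went wrong and respond to the user accordingly. Catching TypeError this broadly also hides bugs inside the tool itself, so a real system would want to log those separately.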

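    Caveats and best practices

    • Write detailed function descriptions. The AI chooses functions based on them
    • Including concrete examples in parameter descriptions also improves accuracy
    • The tool_choice parameter lets you keep the AI from making unnecessary function calls
    • Returning function results as structured JSON improves the quality of the AI’s responses

    Summary

    With Function Calling, the AI goes beyond plain text generation and becomes a capable assistant that works with real data and services. The range of applications is wide: product search on e-commerce sites, customer support, automation of internal tools, and more.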

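  • Implementing RAG (Retrieval-Augmented Generation) from Scratch: Internal Document Search with LangChain and FAISS

    RAG (Retrieval-Augmented Generation) is a technique that improves answer accuracy by giving an LLM external knowledge. Internal documents and FAQs are stored in a vector database, the information relevant to a question is retrieved, and the LLM then answers using it.

    How RAG works

    RAG operates in two broad phases.

    1. Indexing: split documents into chunks → embed them as vectors → store them in a vector DB
    2. Retrieval + generation: embed the user’s question → search for similar chunks → pass the retrieved chunks plus the question to the LLM to generate an answer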
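
    Before bringing in LangChain, the two phases can be sketched with the standard library alone. The toy below swaps in a word-count vector and cosine similarity where a real system would use a learned embedding model and a vector store, so it is an illustration of the data flow only. The sample chunks, `embed`, `cosine`, and `retrieve` are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Phase 1: indexing (chunk -> vector -> store)
chunks = [
    "Paid leave requests are submitted through the HR portal.",
    "Expense reports are due by the fifth business day of each month.",
    "The office Wi-Fi password is rotated every quarter.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


# Phase 2: retrieval (question -> vector -> top-k similar chunks)
def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


question = "How do I submit a paid leave request?"
context = retrieve(question, k=1)

# Generation would send this prompt to an LLM.
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

    Note that "request" and "requests" do not match in this toy, which is exactly the kind of gap that dense embeddings are meant to close.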

    Environment setup

    pip install langchain langchain-community langchain-openai faiss-cpu tiktoken

    Loading documents and splitting them into chunks

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import TextLoader
    
    # Load the document
    loader = TextLoader("company_faq.txt", encoding="utf-8")
    documents = loader.load()
    
    # Split into chunks (500 characters each, 100-character overlap)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100,
        # Try paragraph breaks first, then line breaks, then Japanese
        # sentence and clause punctuation, then spaces
        separators=["\n\n", "\n", "。", "、", " "]
    )
    chunks = splitter.split_documents(documents)
    print(f"Number of chunks: {len(chunks)}")

    Building the vector DB

    from langchain_openai import OpenAIEmbeddings
    from langchain_community.vectorstores import FAISS
    
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    
    # Create the FAISS index
    vectorstore = FAISS.from_documents(chunks, embeddings)
    
    # Save locally (persistence)
    vectorstore.save_local("faiss_index")
    
    # Load it back. The index is pickled, so only load files you created yourself
    vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

    Building the RAG chain

    from langchain_openai import ChatOpenAI
    from langchain.chains import RetrievalQA
    
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
    
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
    )
    
    result = qa_chain.invoke({"query": "How do I apply for paid leave?"})
    print(result["result"])
    print("---Source documents---")
    for doc in result["source_documents"]:
        print(f"  - {doc.page_content[:100]}...")

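    Tips for improving accuracy

    • Chunk size: too small and context is lost; too large and noise increases. 300-500 characters is a rough guide
    • Overlap: prevents information loss at chunk boundaries. Around 20% of the chunk size
    • Number of results (k): too many crowds the context window. 3-5 is recommended
    • Reranking: having an LLM re-score the search results before using them improves accuracy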
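
    The interaction between chunk size and overlap is easiest to see in a fixed-width splitter. The function below is a simplified sketch, not the algorithm RecursiveCharacterTextSplitter uses (that one prefers to cut at separators). `chunk_text` is a hypothetical name, and the defaults mirror the 500/100 setting used earlier.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-width chunks whose edges overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far the window advances each time
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


sample = "".join(chr(65 + i % 26) for i in range(1200))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])  # 3 [500, 500, 400]
print(pieces[0][-100:] == pieces[1][:100])    # True
```

    With a 20% overlap the window advances by 80% of the chunk size, so a sentence cut at one boundary still appears whole in the neighboring chunk as long as it is shorter than the overlap.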

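    Summary

    With RAG, an LLM can answer from recent information it was not trained on and from knowledge specific to your organization. FAISS is a capable vector store that is free to use, and it performs well enough for small to medium document collections.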
