{
  "query": "How can I call and run inference on the open-source language model I want from Hugging Face?",
  "gold_information_need": "How do I run inference by calling an open-source language model I want from Hugging Face? - Please write a step-by-step tutorial that I can follow to load an open-source LLM (e.g., Llama) from Hugging Face in a Jupyter Notebook, input prompts directly, and run inference.\n\n(1) Preparation: \n- Set up a virtual environment and check CUDA.\n- Set your Hugging Face token.\n- Criteria for model selection (including recommended VRAM).\n- Installation list: transformers, torch, accelerate, huggingface_hub, sentencepiece, and optionally bitsandbytes.\n\n(2) Basic Inference (Transformers): \n- Load Llama using AutoTokenizer/AutoModelForCausalLM or pipeline.\n- Provide examples for single and batch prompts.\n- Key parameters: device_map, dtype, max_new_tokens, temperature, top_p.\n- Show minimal code cells up to output decoding.\n\n(3) vLLM Acceleration: \n- After pip install vllm, present two methods:\n  (a) Python API within the notebook,\n  (b) Running an OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server --model …) and example client calls.\n- Demonstrate KV-cache, paged attention, and concurrent request/batch settings.\n\n(4) Optimization & Troubleshooting: \n- 4/8-bit loading options,\n- Setting max context/tokens,\n- Handling OOM, CUDA, and tokenizer mismatch issues,\n- Include a simple comparison table of inference speed/memory between Transformers and vLLM.",
  "response_list": [
    {
      "contents": "## Hugging Face Model API Calls: Comparative Analysis of Inference API and Inference Endpoint\n\nWhen implementing inference features using Hugging Face models in a web application, there are two main ways to call models via API without managing infrastructure yourself: the **Inference API** and the **Inference Endpoint**. Each method has distinct pros and cons in terms of cost, latency, concurrency, and more, so it’s important to choose the one that best fits your application’s needs.\n\n### 1. Hugging Face Inference API: Ideal for Rapid Prototyping and Occasional Use\n\nThe **Inference API** is a free or pay-as-you-go service that lets you instantly test and use a wide range of models hosted on Hugging Face. You can easily access inference capabilities with just an API call, without any infrastructure setup, making it especially useful for early-stage prototyping or when inference is needed only occasionally.\n\n**Cost:** The Inference API offers a free tier for testing CPU-based inference. However, using GPUs or higher volumes incurs pay-as-you-go charges. In production environments with continuous, large-scale requests, costs can rise significantly.\n\n**Latency:** Since the Inference API operates on shared resources, response times can vary depending on other users’ activity. If you call a model that isn’t used frequently, you may experience additional delays due to ‘cold start’ as the model loads.\n\n**Concurrency:** The free tier may have limitations on concurrent requests and is not suitable for handling high traffic. For stable, production-level concurrency, another solution is recommended.\n\n### 2. Hugging Face Inference Endpoint: Reliable and Scalable for Production\n\nThe **Inference Endpoint** is a service that deploys a specific model on dedicated infrastructure. This ensures predictable performance and reliability, meeting the demanding requirements of production environments.\n\n**Cost:** Inference Endpoints are billed hourly. 
You can choose the type and size of CPU or GPU instances, which affects the cost. For handling large volumes of inference requests continuously, this can be more cost-effective than the pay-as-you-go Inference API.\n\n**Latency:** With dedicated infrastructure, there are no cold starts, and you get consistent, low latency. This is crucial for services that require real-time, fast responses. According to one study, Inference Endpoints can be more than twice as fast as traditional container-based deployments.\n\n**Concurrency:** Auto-scaling is supported, automatically adjusting the number of instances based on traffic. This allows for stable handling of sudden traffic spikes and maintains high concurrency.\n\n### Summary and Recommendations\n\n| **Category** | **Inference API** | **Inference Endpoint** |\n| --- | --- | --- |\n| **Best Use Cases** | Prototyping, occasional use, education, testing | Production, continuous large-scale requests, low-latency needs |\n| **Cost Structure** | Free tier & pay-as-you-go | Hourly billing (varies by instance type) |\n| **Latency** | Variable, possible cold starts | Low and consistent, no cold starts |\n| **Concurrency** | Limited | High, flexible with auto-scaling |\n| **Infrastructure Management** | Not required | Managed by Hugging Face |\n\n**In conclusion,** if you want to quickly experiment with various models in the early stages of development or don’t need frequent inference, the **Inference API** is a reasonable choice. On the other hand, if you’re building a production web application that requires stable performance and low latency, adopting the **Inference Endpoint**—despite the initial cost—will be a wiser long-term decision. The Inference Endpoint is a powerful tool that lets developers focus on core business logic without the burden of infrastructure management.",
      "need_alignment": {
        "score": "1",
        "feedback": "This isn’t the answer I wanted at all. You’re only giving a high-level service comparison. I asked for a step-by-step, code-first notebook guide to run modern open-source chat LLMs locally (Transformers + vLLM, chat prompts, single/batch, key params, HF token). Instead, you focused on Inference API/Endpoints with no runnable workflow."
      },
      "content_depth": {
        "score": "1",
        "feedback": "The content is too high-level, and there’s nothing in-depth or runnable. No environment/CUDA setup, no installs, no code, no parameters, no troubleshooting—just a product overview. I can’t reproduce anything from this."
      },
      "tone": {
        "score": "2",
        "feedback": "Your tone reads a bit promotional (“Ideal for,” “In conclusion,” “powerful tool,” “wiser long-term decision”). I want it dry and strictly technical—no marketing phrasing, just concise, actionable guidance."
      },
      "explanation_style": {
        "score": "3",
        "feedback": "The structure is readable, but it’s mostly narrative plus a table. I prefer step-by-step bullets and code-first cells (chat template, single/batch, decoding) with minimal prose."
      }
    },
    {
      "contents": "### Research Tutorial Using Hugging Face Open-Source LLMs\n\n_From data loading to inference, prompt engineering, and error handling_\n\nThis tutorial guides you through the entire process of using open-source LLMs from the Hugging Face ecosystem for research purposes. You’ll practice each step, from loading real datasets to generating answers to questions.\n\n#### **1. Project Environment Setup**\n\nFirst, install the necessary libraries for LLM research and complete authentication for API usage.\n\n##### **1.1. Install Required Libraries**\n\nActivate your virtual environment (conda, venv, etc.), then install the following libraries:\n\n```\n# Core LangChain library and Hugging Face integration module\npip install langchain langchain-huggingface\n\n# Libraries for Hugging Face models and datasets\npip install huggingface_hub datasets transformers\n\n# Libraries for GPU computation and optimization\n# Check https://pytorch.org/get-started/locally/ for the correct PyTorch version for your environment.\npip install torch accelerate bitsandbytes\n```\n\n*   `transformers`: The core library for loading Hugging Face models and building pipelines.\n*   `datasets`: Makes it easy to download and process datasets from the Hugging Face Hub.\n*   `accelerate`, `bitsandbytes`: Needed for distributing models across multiple GPUs or for quantization to reduce memory usage.\n\n##### **1.2. Setting Up Hugging Face API Token**\n\nAn API token is required to access gated models (those requiring agreement before use, e.g., Llama 3), and it is recommended otherwise to increase API call limits.\n\n1.  **Get a Token:**  \n    Issue a token with at least the `read` role from [Hugging Face Settings](https://huggingface.co/settings/tokens). Read access is sufficient for downloading gated models; `write` is only needed to push to the Hub.\n2.  
**Set as Environment Variable:**  \n    Run the code below to register the token as an environment variable.\n    ```\n    import os\n    import getpass\n    # For password-like input in Colab or Jupyter Notebook\n    if \"HUGGINGFACEHUB_API_TOKEN\" not in os.environ:\n        os.environ[\"HUGGINGFACEHUB_API_TOKEN\"] = getpass.getpass(\"Enter your Hugging Face API token:\")\n    ```\n\n#### **2. Loading and Preparing Research Datasets**\n\nUse the Hugging Face `datasets` library to load data for research. Here, we use SQuAD, a standard question answering dataset, as an example.\n\n```\nfrom datasets import load_dataset\n\n# Load the 'train' split of the \"squad\" dataset.\n# streaming=True streams data as needed, saving memory.\ndataset = load_dataset(\"squad\", split=\"train\", streaming=True)\n\n# Get a single sample (row) from the dataset.\nsample = next(iter(dataset))\n\n# Check the structure and content of the sample.\nprint(sample)\n# Example output:\n# {'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame',\n#  'context': 'Architecturally, the school has a Catholic character. ...',\n#  'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',\n#  'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}\n\n# Save the context and question for use in inference.\ncontext = sample[\"context\"]\nquestion = sample[\"question\"]\n```\n\n#### **3. Advanced Prompt Engineering**\n\nLLM performance is highly dependent on prompt quality. Learn how to craft sophisticated prompts using various templates provided by `LangChain`.\n\n##### **3.1. 
Basic: `PromptTemplate`**\n\nThe most basic way to dynamically generate prompts by injecting variables.\n\n```\nfrom langchain.prompts import PromptTemplate\n\n# Define a template that takes context and question as input.\ntemplate_str = \"\"\"\nBased on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n\"\"\"\n\nprompt_template = PromptTemplate(\n    template=template_str,\n    input_variables=[\"context\", \"question\"]\n)\n\n# Fill the prompt with the sample extracted from the dataset.\nformatted_prompt = prompt_template.format(context=context, question=question)\nprint(formatted_prompt)\n```\n\n##### **3.2. Conversational: `ChatPromptTemplate` and `MessagesPlaceholder`**\n\nConversational models (Chat Models) take a list of messages with roles (`system`, `human`, `ai`). `MessagesPlaceholder` is useful for dynamically inserting previous conversation history.\n\n```\nfrom langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\nfrom langchain_core.messages import SystemMessage, HumanMessage, AIMessage\n\n# Create a conversational prompt template.\nchat_template = ChatPromptTemplate.from_messages(\n    [\n        # System message: defines the model's role and behavior.\n        SystemMessage(content=\"You are a helpful AI assistant that answers questions based on the given context.\"),\n        # MessagesPlaceholder: previous conversation history (list) is inserted here.\n        MessagesPlaceholder(variable_name=\"chat_history\"),\n        # The user's current question. A (role, template) tuple is used so that\n        # {question} and {context} are substituted; a HumanMessage instance is inserted verbatim.\n        (\"human\", \"{question}\\n\\nContext:\\n{context}\"),\n    ]\n)\n\n# Example previous conversation history\nchat_history = [\n    HumanMessage(content=\"What is the main characteristic of the school's architecture?\"),\n    AIMessage(content=\"The school's architecture has a Catholic character.\")\n]\n\n# Fill the template with variables to generate the final message list.\nformatted_messages = 
chat_template.format_messages(\n    context=context,\n    question=question,\n    chat_history=chat_history\n)\n\nprint(formatted_messages)\n```\n\n##### **3.3. Providing Context: `Few-Shot` Prompting**\n\nA technique where you provide a few examples (shots) to help the model learn the desired answer format or style. This is very effective for guiding consistent model outputs.\n\n```\nfrom langchain.prompts import FewShotPromptTemplate, PromptTemplate\n\n# Define the format for each example (shot)\nexample_prompt = PromptTemplate(\n    input_variables=[\"context\", \"question\", \"answer\"],\n    template=\"Context: {context}\\nQuestion: {question}\\nAnswer: {answer}\"\n)\n\n# Example data to provide\nexamples = [\n    {\n        \"context\": \"The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.\",\n        \"question\": \"Where is the Eiffel Tower located?\",\n        \"answer\": \"The Eiffel Tower is located on the Champ de Mars in Paris, France.\"\n    },\n    {\n        \"context\": \"The Amazon rainforest is a moist broadleaf tropical rainforest in the Amazon biome that covers most of the Amazon basin of South America.\",\n        \"question\": \"What type of rainforest is the Amazon?\",\n        \"answer\": \"The Amazon is a moist broadleaf tropical rainforest.\"\n    }\n]\n\n# Create the Few-Shot prompt template\nfew_shot_prompt = FewShotPromptTemplate(\n    examples=examples,                # Provided examples\n    example_prompt=example_prompt,    # Example format\n    prefix=\"Here are some examples of how to answer questions based on a context.\", # Prompt start\n    suffix=\"Now, answer the following question based on the new context.\\n\\nContext: {context}\\nQuestion: {question}\\nAnswer:\", # Prompt end (actual question)\n    input_variables=[\"context\", \"question\"]\n)\n\n# Fill the prompt with actual data.\nformatted_few_shot_prompt = few_shot_prompt.format(context=context, 
question=question)\nprint(formatted_few_shot_prompt)\n```\n\n#### **4. Running LLM Inference: Method Guide**\n\nNow, use the constructed prompt to run LLM inference. Choose the method that best fits your situation.\n\n##### **4.1. Method 1: Hugging Face Hub Remote Inference (Quick Validation)**\n\n*   **Concept:** Run the model on Hugging Face servers without a separate GPU. Suitable for simple tests or rapid prototyping.\n*   **Pros:** Very easy to set up.\n*   **Cons:** Slow, and large models may cause timeout errors.\n    \n```\nfrom langchain_huggingface import HuggingFaceEndpoint\n\n# Specify the repo_id of the model to use. Here, Mistral-7B Instruct is used as an example.\n# This model works well with Hugging Face's free Inference API.\nrepo_id = \"mistralai/Mistral-7B-Instruct-v0.2\"\n\n# Create a HuggingFaceEndpoint object.\n# max_new_tokens: maximum number of tokens to generate\n# temperature: lower values produce more deterministic answers (0.1 ~ 1.0)\nllm_endpoint = HuggingFaceEndpoint(\n    repo_id=repo_id, max_new_tokens=256, temperature=0.1\n)\n\n# Use LangChain Expression Language (LCEL) to chain the prompt and model.\nchain = prompt_template | llm_endpoint\n\n# Run the chain to get the result.\nresult = chain.invoke({\"context\": context, \"question\": question})\nprint(result)\n```\n\n##### **4.2. Method 2: Local Inference (For Full-Scale Research)**\n\n*   **Concept:** Download the model and run it on your local GPU. Offers speed and control, suitable for serious research.\n*   **GPU Memory Optimization:** Large models may cause VRAM shortage errors. 
4-bit quantization with `bitsandbytes` greatly reduces memory usage with minimal quality loss.\n    \n```\nfrom langchain_huggingface import HuggingFacePipeline\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\nimport torch\n\n# Example of a small Korean model (Polyglot-Ko 5.8B)\nmodel_id = \"EleutherAI/polyglot-ko-5.8b\"\n\n# Settings for 4-bit quantized loading\n# bnb_4bit_compute_dtype=torch.bfloat16 uses bfloat16 for computation, improving speed and stability.\nmodel_kwargs = {\n    \"load_in_4bit\": True,\n    \"bnb_4bit_quant_type\": \"nf4\",\n    \"bnb_4bit_compute_dtype\": torch.bfloat16\n}\n\n# Create a local model pipeline using HuggingFacePipeline\nllm_pipeline = HuggingFacePipeline.from_model_id(\n    model_id=model_id,\n    task=\"text-generation\",\n    device_map=\"auto\",  # Automatically allocate model to available GPUs\n    pipeline_kwargs={\"max_new_tokens\": 256, \"eos_token_id\": 3, \"pad_token_id\": 3}, # Set eos, pad token id\n    model_kwargs=model_kwargs,\n)\n\n# Chain the prompt and local pipeline\nchain_local = prompt_template | llm_pipeline\n\n# Run inference locally\n# Example Korean data\nko_context = \"대한민국의 수도는 서울이다. 서울은 대한민국의 정치, 경제, 문화의 중심지이다.\"\nko_question = \"대한민국의 수도는 어디인가?\"\nresult_local = chain_local.invoke({\"context\": ko_context, \"question\": ko_question})\n\nprint(result_local)\n```\n\n##### **4.3. Method 3: High-Performance Inference with vLLM (Large-Scale Experiments & Service)**\n\n*   **Concept:** `vLLM` is a library that maximizes LLM serving speed using technologies like PagedAttention.\n*   **Use Case:** When you need to handle many requests simultaneously or require the fastest response times.\n    \n1.  
**Run vLLM Server:** Start an OpenAI-compatible API server in the terminal with the command below.\n    ```\n    # Install vLLM: pip install vllm\n    # Start the API server with the Mistral-7B model.\n    python -m vllm.entrypoints.openai.api_server \\\n        --model mistralai/Mistral-7B-Instruct-v0.2 \\\n        --host 0.0.0.0\n    ```\n2.  **Call from Client:** Use `langchain-openai` to send requests to the vLLM server.\n    ```\n    from langchain_openai import ChatOpenAI\n    # Create a ChatOpenAI object to connect to the vLLM server\n    llm_vllm = ChatOpenAI(\n        model=\"mistralai/Mistral-7B-Instruct-v0.2\", # Model ID running on the server\n        openai_api_key=\"EMPTY\", # No API key needed\n        openai_api_base=\"http://localhost:8000/v1\", # vLLM server address\n        max_tokens=256,\n        temperature=0.1\n    )\n    # Chain the prompt and vLLM model\n    chain_vllm = prompt_template | llm_vllm\n    # Run inference\n    result_vllm = chain_vllm.invoke({\"context\": context, \"question\": question})\n    print(result_vllm.content)\n    ```\n\n#### **5. Common Errors and Troubleshooting**\n\n*   **GPU Out-of-Memory (OOM) Error**\n    *   **Cause:** Most common when the model size exceeds GPU VRAM.\n    *   **Solutions:**\n        1.  **Quantization:** Load the model with `load_in_4bit=True` or `load_in_8bit=True`. (See example 4.2 above)\n        2.  **Use a Smaller Model:** Choose a smaller model (e.g., 7B instead of 13B, 3B instead of 7B) suitable for your research.\n        3.  Use `device_map=\"auto\"`: The `accelerate` library can distribute model layers across VRAM and system RAM to alleviate memory shortages.\n*   **Hugging Face Hub Timeout Error**\n    *   **Cause:** Free remote inference servers have limited resources; long model loading or inference can cause disconnections.\n    *   **Solutions:**\n        1.  Try using a smaller model.\n        2.  Switch to local inference methods (4.2, 4.3).\n        3.  
Consider using Hugging Face's paid [Inference Endpoints](https://huggingface.co/inference-endpoints).\n*   **Gated Model Access Error (Authorization Error)**\n    *   **Cause:** Some models (e.g., Llama 3, Gemma) require license agreement on Hugging Face before use. Access is denied if you have not accepted the license or if your token is missing or invalid; a token with `read` permission is sufficient.\n    *   **Solutions:**\n        1.  Visit the model’s Hugging Face page and agree to the license.\n        2.  Make sure you have issued an API token with at least `read` permission and set it correctly as an environment variable (see 1.2).\n*   **Dependency and Compatibility Issues**\n    *   **Cause:** Occurs when versions of `transformers`, `torch`, `CUDA`, etc. are incompatible.\n    *   **Solutions:**\n        1.  Always use a Python virtual environment (conda, venv) to isolate libraries per project.\n        2.  Read error messages carefully and adjust specific library versions (e.g., downgrade `protobuf`) as needed.\n        3.  Always install PyTorch according to the official guide for your CUDA version.",
      "need_alignment": {
        "score": "4",
        "feedback": "Okay—mostly aligned; keep it up. I like the dataset-driven flow, HF token handling, local inference, and the vLLM server path. I still need the vLLM Python API usage (not just the OpenAI server), a Transformers-native path without LangChain (AutoTokenizer/AutoModel or pipeline) with a batch example, and key gen params like top_p to exactly match what I asked."
      },
      "content_depth": {
        "score": "3",
        "feedback": "Okay, but not fully there yet. I’m missing: explicit venv/CUDA checks and VRAM guidance/model selection, sentencepiece in the install list, a pure Transformers load (with dtype/device_map and a chat template) plus a batch example, vLLM Python API with concurrency/KV-cache notes, a quick speed/memory comparison, and troubleshooting for tokenizer mismatch/max context/timeouts. Add these and it’s fully reproducible for my workflow."
      },
      "tone": {
        "score": "5",
        "feedback": "Your tone is perfect right now. Please keep it that way."
      },
      "explanation_style": {
        "score": "4",
        "feedback": "Alright. Keep it up. Good structure and code-first flow, but to hit perfect I want clearer step-by-step (numbered) cells for setup/inference, at least one plain Transformers chat example (no LangChain), and a short bullet comparison of Transformers vs vLLM."
      }
    }
  ],
  "query-id": 121
}