{
  "query": "How can I call and run inference on the open-source language model I want from Hugging Face?",
  "gold_information_need": "How do I run inference by calling an open-source language model I want from Hugging Face? - Please write a step-by-step tutorial that I can follow to load an open-source LLM (e.g., Llama) from Hugging Face in a Jupyter Notebook, input prompts directly, and run inference.\n\n(1) Preparation: \n- Set up a virtual environment and check CUDA.\n- Set your Hugging Face token.\n- Criteria for model selection (including recommended VRAM).\n- Installation list: transformers, torch, accelerate, huggingface_hub, sentencepiece, and optionally bitsandbytes.\n\n(2) Basic Inference (Transformers): \n- Load Llama using AutoTokenizer/AutoModelForCausalLM or pipeline.\n- Provide examples for single and batch prompts.\n- Key parameters: device_map, dtype, max_new_tokens, temperature, top_p.\n- Show minimal code cells up to output decoding.\n\n(3) vLLM Acceleration: \n- After pip install vllm, present two methods:\n  (a) Python API within the notebook,\n  (b) Running an OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server --model …) and example client calls.\n- Demonstrate KV-cache, paged attention, and concurrent request/batch settings.\n\n(4) Optimization & Troubleshooting: \n- 4/8-bit loading options,\n- Setting max context/tokens,\n- Handling OOM, CUDA, and tokenizer mismatch issues,\n- Include a simple comparison table of inference speed/memory between Transformers and vLLM.",
  "response_list": [
    {
      "contents": "## Choosing an AI Model on Hugging Face Hub: A Guide for Successful Product Development\n\nYou’re at the starting point of developing new AI features—a journey much like setting sail in search of a treasure island across a vast ocean. The Hugging Face Hub is that ocean, dotted with countless open-source model “islands,” and it can feel overwhelming to decide where to plant your flag. But don’t worry—with the right compass and map, you’ll be able to discover the perfect treasure for your product: the best AI model.\n\n### The Lighthouse of Performance: Using Benchmarks\n\nJust as a lighthouse guides ships through the night, benchmarks like **HELM (Holistic Evaluation of Language Models)** and **MMLU (Massive Multitask Language Understanding)** are crucial indicators for assessing model performance.\n\n*   **HELM** is a dynamic benchmark that comprehensively evaluates language models across various scenarios and metrics, helping you understand a model’s overall capabilities from a broad perspective.\n*   **MMLU** measures a model’s knowledge and reasoning skills through multiple-choice questions spanning 57 subjects, including math, history, and computer science. It’s useful for gauging a model’s depth in specific fields.\n\nThese benchmark scores are an excellent starting point for objectively comparing models, but they don’t tell the whole story—just as a lighthouse reveals the presence of an island but not everything about it.\n\n### The Winds of Speed and the Waves of Hardware: Considering Practical Constraints\n\nBefore setting sail, you need to consider your ship’s speed and size—likewise, you should carefully review a model’s **inference speed** and **hardware requirements**.\n\n*   **Inference speed** determines how quickly and smoothly users can experience your AI features. No matter how smart a model is, if it responds too slowly, users will turn away.\n*   **Hardware requirements** directly impact operational costs. 
High-end hardware like GPUs offers faster speeds but comes with higher expenses.\n\nOptimization techniques such as **quantization** and **pruning** can help reduce hardware demands while maintaining satisfactory performance—like a small boat catching a favorable wind to move swiftly.\n\n### The License: Your Permit for Commercial Voyages\n\nEven if you find a treasure island, it’s meaningless without the rights to claim it. Similarly, you must carefully check the **license terms** before using an AI model in a commercial product.\n\n*   **Apache 2.0 License** is widely adopted in open-source projects and generally imposes few restrictions on commercial use. You’re not required to disclose modified source code, and it includes explicit patent permissions.\n*   **Llama 2 License** applies to models released by Meta AI. While it allows commercial use, there may be specific restrictions, so review it carefully.\n\nLicenses are essential safeguards that protect your business from legal risks—think of them as your sailing permit. Always check and comply with them.\n\n### Writing Your Own Captain’s Log\n\nChoosing the best AI model on Hugging Face Hub is more than a technical decision—it’s a crucial process that shapes the future of your product and business. Pay attention to performance benchmarks as your lighthouse, consider inference speed and hardware as the winds and waves, and don’t forget the license as your sailing permit. By thoughtfully weighing all these factors and charting your course carefully, your AI features will reach the treasure island of success.",
      "need_alignment": {
        "score": "1",
        "feedback": "You have no idea what I want. What I want is code that lets me take an open-source model, actually try out prompts with it, and then use a library like vllm to run inference—along with a detailed explanation of that code. Instead, this is a conceptual model-selection write-up with metaphors and no runnable steps: no HF token setup, no Transformers (AutoTokenizer/AutoModel or pipeline), no vLLM (Python API or OpenAI server), no chat-style prompting, and no single/batch inference examples."
      },
      "content_depth": {
        "score": "1",
        "feedback": "You're just listing things like a high-level, unengaging blog post. The content needs to be much deeper and more low-level.  \nIt's good that you provided examples of benchmarks actually used in research, but there's nothing concrete about how to use them, what examples I could try, or how I should conduct experiments. There’s no environment setup (venv/conda, CUDA/PyTorch check), no package installs, no key params (device_map, dtype, max_new_tokens, temperature, top_p), and zero runnable code."
      },
      "tone": {
        "score": "1",
        "feedback": "Stop using metaphors and don’t drag out unnecessary concept explanations. Just deliver the explanation in a dry tone. The “sailing/treasure” narrative makes this feel like marketing copy, not technical guidance."
      },
      "explanation_style": {
        "score": "1",
        "feedback": "The explanation style is story-like and metaphor-driven with headings, but it isn’t actionable: no step-by-step flow, no code cells, no chat templates, and no decoding examples. I can’t easily extract steps or copy-paste anything to run, so this doesn’t work as a tutorial."
      }
    },
    {
      "contents": "Running a large language model (LLM) like Llama 3 on your own GPU server is like building your own race car. It’s not just about driving—it’s about fine-tuning every part for peak performance on the track. Model quantization, Flash Attention, and choosing the optimal batch size are like the engine, aerodynamics, and tires of your race car. Let me explain, with a bit of humor and analogy, how each of these elements can work together to get your model across the finish line as fast as possible.\n\n### Model Quantization: Putting Your Model on a Diet\n\nModel quantization is like a diet program for your LLM. It reduces a heavy 16-bit or 32-bit model down to 8-bit or even 4-bit—like swapping out heavy armor for lightweight sportswear. Naturally, a lighter body can run faster!\n\n**Quantization with bitsandbytes**\n\nThe `bitsandbytes` library is an excellent trainer for your model’s “4-bit diet.” Here’s how to hire this trainer:\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\n\nmodel_id = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n\n# Load the model in 4-bit mode.\nmodel = AutoModelForCausalLM.from_pretrained(\n model_id,\n torch_dtype=torch.bfloat16,\n device_map=\"auto\",\n quantization_config=dict(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)\n)\ntokenizer = AutoTokenizer.from_pretrained(model_id)\n```\n\nThis magic code shrinks your model, reducing memory usage and speeding up inference. But just like an extreme diet can make you dizzy, heavy quantization can slightly reduce your model’s performance—so keep that in mind.\n\n### Flash Attention: Lightning-Fast Focus\n\nFlash Attention is a technology that boosts your LLM’s focus. It’s like being able to pick out the important information even when many people are talking at once. 
Traditional attention mechanisms take time to check every piece of information, but Flash Attention zeroes in on what matters, dramatically improving computation speed.\n\n**Applying Flash Attention 2**\n\nWith Hugging Face’s `transformers` library, enabling Flash Attention is simple:\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\n\nmodel_id = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n\n# Load the model with Flash Attention 2.\nmodel = AutoModelForCausalLM.from_pretrained(\n model_id,\n torch_dtype=torch.bfloat16,\n device_map=\"auto\",\n attn_implementation=\"flash_attention_2\"\n)\ntokenizer = AutoTokenizer.from_pretrained(model_id)\n```\n\nWith Flash Attention, your model can process longer contexts much faster—like a superhero with special powers!\n\n### Choosing the Optimal Batch Size: Deciding How Many Guests to Invite to the Party\n\nBatch size refers to how many requests you process at once. It’s like deciding how many guests to invite to a party. Too few, and the party feels empty; too many, and your house gets overwhelmed.\n\n**The Tug-of-War Between Throughput and Latency**\n\n* **Small batch size:** Like focusing on one guest at a time. Responses are quick (low latency), but you can’t serve many people at once (low throughput).\n* **Large batch size:** Like serving many guests at once. 
You can handle more people overall (high throughput), but each guest may wait longer for their food (higher latency).\n\n**Finding the Optimal Balance**\n\nThe best batch size depends on your goals:\n* **For real-time services like chatbots:** Low latency is key, so a small batch size is best.\n* **For offline data processing:** High throughput matters, so a larger batch size is better.\n\nExperiment with different batch sizes to find the “party size” that best fits your GPU server.\n\n### Bringing It All Together: Building the Ultimate LLM Race Car\n\nNow it’s time to assemble all the parts and build the ultimate LLM race car.\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\n\nmodel_id = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n\n# Apply both 4-bit quantization and Flash Attention 2.\nmodel = AutoModelForCausalLM.from_pretrained(\n model_id,\n torch_dtype=torch.bfloat16,\n device_map=\"auto\",\n quantization_config=dict(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),\n attn_implementation=\"flash_attention_2\"\n)\ntokenizer = AutoTokenizer.from_pretrained(model_id)\n\n# The Llama 3 tokenizer defines no pad token; reuse EOS and pad on the left\n# so that prompts of different lengths can be batched for generation.\ntokenizer.pad_token = tokenizer.eos_token\ntokenizer.padding_side = \"left\"\n\n# Run batched inference on two prompts.\ninputs = tokenizer(\n    [\"What’s the best way to optimize an LLM?\", \"I want to make Llama 3 faster.\"],\n    padding=True,  # pad to the longest prompt so the batch forms one tensor\n    return_tensors=\"pt\"\n).to(\"cuda\")\noutputs = model.generate(**inputs, max_new_tokens=100)\nprint(tokenizer.batch_decode(outputs, skip_special_tokens=True))\n```\n\nBy slimming down your model with quantization, boosting focus with Flash Attention, and maximizing throughput with the right batch size, your GPU server becomes a high-performance racing machine. Now, hit the track and enjoy the ride!",
      "need_alignment": {
        "score": "2",
        "feedback": "This isn’t the topic I wanted, and I have no interest in a long quantization/Flash Attention analogy. I asked for a step-by-step, code-first way to actually run open-source chat LLMs from Hugging Face: both Transformers and vLLM (Python API + OpenAI-compatible server), chat-style prompts (system/user/assistant), single and batch examples, HF token handling for gated models, and key gen params. None of that shows up here, so I can’t just run prompts and iterate."
      },
      "content_depth": {
        "score": "2",
        "feedback": "It skips the actual setup and reproducible details I need (venv/conda, CUDA check, install commands, VRAM guidance, HF token). No vLLM at all, no chat template, no single vs batch chat examples, no temperature/top_p, no dataset-driven examples, and little troubleshooting (OOM, tokenizer mismatch, gating). It’s mostly prose with a small Transformers snippet, so I can’t reproduce a research-usable workflow from this."
      },
      "tone": {
        "score": "1",
        "feedback": "There are too many metaphors and unnecessary explanations. Be concise and focus only on the main points. Go into detail only when necessary, such as when explaining how code works or discussing comments."
      },
      "explanation_style": {
        "score": "3",
        "feedback": "The structure is understandable, but this isn’t the step-by-step, code-first flow I asked for. Too much narrative/analogy, not enough ordered cells with inline comments and clear decoding. No chat template is shown. Some bullets/headings help, but the key steps are missing, so it’s harder to follow and apply."
      }
    },
    {
      "contents": "## An Enchanting First Step into the World of AI Language Models: Build Your Own Text Generator with GPT-2\n\nWelcome to the vast universe of Natural Language Processing (NLP) and Hugging Face! Are you ready to experience the magic of turning your ideas into text with just a few lines of code? In this guide, we’ll walk you step by step through the fascinating process of running a powerful text generation model like GPT-2 right on your own computer. Starting with installing Hugging Face’s `transformers` library, we’ll harness the power of pre-trained models and generate original text based on your prompts. We’ll dive into what each code snippet does, so if you’re ready, let’s get started!\n\n### The Journey Begins: Installing the `transformers` Library\n\nOur first mission is to install the `transformers` library, the Swiss Army knife of NLP. This magical toolbox lets you easily download and use state-of-the-art AI models. Simply enter the following command in your terminal or command prompt:\n\n```bash\npip install transformers\n```\n\nWith this single command, you unlock access to countless cutting-edge models. Since `transformers` works with deep learning frameworks like PyTorch or TensorFlow, it’s a good idea to install one if you haven’t already. For example, to install with PyTorch, use:\n\n```bash\npip install transformers[torch]\n```\n\nNow you’re all set. Let’s dive into the world of text generation!\n\n### Build Your Own Text Generator in 3 Steps\n\nNow it’s time to work some magic with Python code. The whole process can be broken down into three main steps.\n\n**Step 1: Summon the Magic Wand, `pipeline`**\n\nThe `pipeline` in the `transformers` library is an amazing tool that simplifies complex processes. It lets you perform tasks like text generation, translation, and summarization with a single line of code. 
The `pipeline` handles everything from model loading to text preprocessing and postprocessing.\n\n```python\nfrom transformers import pipeline, set_seed\n\n# Create a pipeline for text generation.\n# \"text-generation\" specifies the task we want to perform.\ngenerator = pipeline('text-generation', model='gpt2')\n\n# Set a seed for reproducible results.\nset_seed(42)\n```\n\nHere, `pipeline('text-generation', model='gpt2')` automatically downloads and sets up the pre-trained GPT-2 model for text generation. Now you have a powerful tool called `generator` at your fingertips.\n\n**Step 2: Plant the Seed of Creativity with a Prompt**\n\nNow it’s time to plant the seed of creativity in your `generator`. Provide a phrase or question—the ‘prompt’—to start generating text.\n\n```python\n# Enter a prompt to request text generation.\n# max_length sets the maximum length of the generated text, num_return_sequences sets the number of outputs.\ngenerated_text = generator(\"In a shocking finding, scientist discovered a herd of unicorns living\", max_length=50, num_return_sequences=1)\n```\n\nHere, `max_length` determines the maximum number of tokens (units similar to words) in the output, and `num_return_sequences` sets how many different versions to generate.\n\n**Step 3: Check the Magic Results**\n\nNow it’s time to see what magic your prompt has conjured.\n\n```python\n# Print the generated text.\nprint(generated_text[0]['generated_text'])\n```\n\nYou’ll likely see an intriguing story unfold, such as:\n\n> In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\n### Unveiling the Magic: The Roles of `AutoTokenizer` and `AutoModel`\n\nWhile `pipeline` is convenient, understanding what happens under the hood lets you use it more deeply. 
The `pipeline` is actually built on two key components: `AutoTokenizer` and `AutoModel`.\n\n*   **`AutoTokenizer`**: Computers can’t understand human language directly. So, text must be converted into numerical ‘tokens’—a process called tokenization. `AutoTokenizer` converts your text into the right sequence of numbers for the chosen model (e.g., ‘gpt2’). Think of it as a translator that turns your words into a language the model understands.\n\n*   **`AutoModel`**: `AutoModel` loads pre-trained models from the Hugging Face Hub. It automatically loads the right architecture for tasks like text generation, classification, or translation. For example, `AutoModelForCausalLM` is used for causal language modeling, like GPT-2, which predicts the next word in a sequence.\n\nHere’s an example of generating text without using `pipeline`, by directly using `AutoTokenizer` and `AutoModelForCausalLM`:\n\n```python\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\n# Load the tokenizer and model for GPT-2.\ntokenizer = AutoTokenizer.from_pretrained('gpt2')\nmodel = AutoModelForCausalLM.from_pretrained('gpt2')\n\n# Tokenize the input prompt.\ninputs = tokenizer(\"The future of AI is\", return_tensors=\"pt\")\n\n# Generate text using the model.\n# You can adjust various generation options in addition to max_length and num_return_sequences.\noutputs = model.generate(**inputs, max_length=50, num_return_sequences=1, no_repeat_ngram_size=2)\n\n# Decode the generated token sequence back into text.\ndecoded_text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n\n# Print the generated text.\nprint(decoded_text)\n```\n\nBy using `AutoTokenizer` and `AutoModel` directly, you can control the text generation process in more detail. 
For example, you can adjust parameters like `temperature` or `top_k` in `model.generate()` to control the creativity and diversity of the output.\n\nNow you have the basic knowledge and skills to build your own text generator using the Hugging Face `transformers` library. This exciting journey has just begun. Explore different models, tweak generation options, and let your ideas unfold into limitless text!",
      "need_alignment": {
        "score": "3",
        "feedback": "The overall topic seems similar, but what I want is more specific: I want to use the latest open-source LLMs, format prompts in a chat template style, and have code that performs inference based on that, including single and batch examples. Also, you left out information related to vllm (both Python API and OpenAI-compatible server) and HF token handling for gated models. Please respond again, taking these points into consideration."
      },
      "content_depth": {
        "score": "2",
        "feedback": "I’m not a beginner in this field at all, so I’d like it to be tailored to a level where I can actually use it for my research. GPT-2 is too outdated and not practical since it’s not a recent model. It also lacks essential pieces: environment/conda + CUDA checks, VRAM guidance, HF token handling (for gated models), vLLM usage, chat-style prompts (system/user/assistant), single vs batch examples, key generation params (device_map, dtype, max_new_tokens, temperature, top_p), and targeted troubleshooting (OOM, tokenizer mismatch, timeouts)."
      },
      "tone": {
        "score": "1",
        "feedback": "I am not a child. Please refrain from using a tone as if you're explaining things to a kid, and instead function as an assistant who supports my research."
      },
      "explanation_style": {
        "score": "3",
        "feedback": "The explanation style itself seems fine. The ratio of titles to text, the code snippets, and the comments included are all appropriate. However, I wish the comments were a bit more detailed, and I think the overall explanation could use a bit more structure, with more bullet points added. Please shift to a clear step-by-step, code-first flow with inline comments, include explicit chat templates (system/user/assistant) and decoding, and organize sections for Transformers vs vLLM with single vs batch examples."
      }
    },
    {
      "contents": "Dear User,\n\nHere is the standard procedure for running inference using your fine-tuned BERT model uploaded to the Hugging Face Hub. This process includes loading the tokenizer and model correctly, generating predictions for new data, and interpreting the results.\n\n### **Step-by-Step Guide to Loading a Fine-Tuned Model**\n\n**Step 1: Import Required Libraries**\n\nBefore starting inference, import the necessary classes from the `transformers` library.\n\n```python\nfrom transformers import AutoTokenizer, AutoModelForSequenceClassification\nimport torch\n```\n\n**Step 2: Load the Tokenizer and Model**\n\nModels stored on the Hugging Face Hub can be easily loaded using the `from_pretrained()` method. Pass the model repository ID (e.g., `username/model-name`) to this method. `AutoTokenizer` automatically loads the appropriate tokenizer, and `AutoModelForSequenceClassification` loads the fine-tuned model for text classification.\n\n```python\n# Specify the path to your model on the Hub.\nmodel_name = \"your-username/your-fine-tuned-bert-model\"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForSequenceClassification.from_pretrained(model_name)\n```\n\n**Step 3: Prepare New Data for Inference**\n\nPrepare the new text data you want to run inference on. This can be in the form of a list.\n\n```python\nnew_texts = [\"This movie was fantastic!\", \"I did not enjoy this film at all.\"]\n```\n\n**Step 4: Tokenize the Data**\n\nConvert the prepared text data into numerical input that the model can understand. Use the `tokenizer` to tokenize the text and convert it into PyTorch tensors.\n\n```python\ninputs = tokenizer(new_texts, padding=True, truncation=True, return_tensors=\"pt\")\n```\n\n**Step 5: Run Inference**\n\nPass the tokenized inputs to the model to run inference. 
The model will return outputs containing logits.\n\n```python\nwith torch.no_grad():\n    outputs = model(**inputs)\n\nlogits = outputs.logits\n```\n\n**Step 6: Interpret Logits and Get Predicted Labels**\n\nLogits are raw, unnormalized prediction scores for each class. The class with the highest logit value is the model’s prediction. Use the `torch.argmax()` function to find the index with the highest value.\n\n```python\npredictions = torch.argmax(logits, dim=-1)\n```\n\n**Step 7: Map Predicted Labels**\n\nThe model’s configuration file (`config.json`) contains the mapping between labels and indices. Use this information to convert the predicted indices to actual label names.\n\n```python\n# Get the id2label mapping from the model's config.\nid2label = model.config.id2label\npredicted_labels = [id2label[p.item()] for p in predictions]\n\nfor text, label in zip(new_texts, predicted_labels):\n    print(f\"Text: '{text}' ==> Predicted label: '{label}'\")\n```\n\nBy following this structured approach, you can effectively load and utilize your fine-tuned BERT model to perform classification predictions on new data.",
      "need_alignment": {
        "score": "2",
        "feedback": "We were almost on topic, then suddenly, why bring up BERT? I want approaches like decoder-only models such as Llama, with corresponding prompt input and model inference output structures. It would be great to actively use libraries that speed up inference like vllm. I also need chat-style formatting (system/user/assistant), single and batch examples, key generation params, HF token handling for gated models, and both Transformers and vLLM modes (Python API + OpenAI-compatible server)."
      },
      "content_depth": {
        "score": "3",
        "feedback": "I think the step-by-step is clear, but for my use it’s missing critical pieces: env/CUDA checks, VRAM guidance, HF token setup, LLM loading/inference with Transformers, vLLM (Python API and OpenAI server), single vs batch chat prompts and key gen params, and targeted troubleshooting (OOM, tokenizer mismatch, gating). Also it’s on BERT classification, not the chat LLM workflow I asked for."
      },
      "tone": {
        "score": "4",
        "feedback": "I like your tone—dry and direct—but please drop the “Dear User” salutation; I prefer instructions without greetings."
      },
      "explanation_style": {
        "score": "4",
        "feedback": "An explanatory style is fine, too. The step-by-step with code is easy to follow, but I also want chat templates and output decoding in the examples; a few more inline comments would help."
      }
    },
    {
      "contents": "Here’s an answer on how to perform inference using open-source language models available on Hugging Face.\n\nThere are three main methods:\n\n1.  **Remote Inference via Hugging Face Hub**: The simplest way, using Hugging Face’s servers for inference.\n2.  **Local Inference**: Downloading the model and running inference on your own machine (PC or server).\n3.  **High-Performance Local Inference with vLLM**: An advanced method using specialized libraries to maximize local inference performance.\n\n* * *\n\n### 1. Remote Inference via Hugging Face Hub (using LangChain)\n\nThis method runs the model on Hugging Face’s servers and only returns the results. It’s easy to test without a GPU server, but may be slow or encounter timeout errors.\n\n#### Steps:\n\n1.  **Set up your environment**: Install required libraries and set your Hugging Face API token.\n    *   **Install libraries**:\n        ```\n        pip install langchain huggingface_hub\n        ```\n    *   **Set API token**: \n        Get a token with `READ` permission from [Hugging Face settings](https://huggingface.co/settings/tokens) and set it as an environment variable.\n        ```\n        import os\n        os.environ['HUGGINGFACEHUB_API_TOKEN'] = 'YOUR_TOKEN_HERE'\n        ```\n2.  **Run the code**: Use `langchain.llms.HuggingFaceHub` to call your chosen model.\n    *   **Choose a model**: Check the Repository ID on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) (e.g., `mistralai/Mistral-7B-v0.1`).\n    *   **Inference code**:\n        ```\n        from langchain.prompts import PromptTemplate\n        from langchain.llms import HuggingFaceHub\n        from langchain.chains import LLMChain\n        # 1. Set the Repository ID\n        repo_id = 'mistralai/Mistral-7B-v0.1'\n        # 2. 
Define the prompt template\n        question = \"Who is Son Heung Min?\"\n        template = \"\"\"Question: {question}\n        Answer: \"\"\"\n        prompt = PromptTemplate(template=template, input_variables=[\"question\"])\n        # 3. Create HuggingFaceHub object\n        # Adjust parameters like temperature, max_length in model_kwargs\n        llm = HuggingFaceHub(\n            repo_id=repo_id, \n            model_kwargs={\"temperature\": 0.2, \"max_length\": 128}\n        )\n        # 4. Run with LLMChain\n        llm_chain = LLMChain(prompt=prompt, llm=llm)\n        print(llm_chain.run(question=question))\n        ```\n\n* * *\n\n### 2. Local Inference (using LangChain)\n\nIf you have a server or PC with a GPU, you can download the model and run inference much faster.\n\n#### Steps:\n\n1.  **Set up your environment**: You’ll need additional libraries like `transformers`.\n    ```\n    pip install langchain huggingface_hub transformers\n    # For GPU support, install torch\n    # pip install torch\n    ```\n2.  **Run the code**: Use `langchain.llms.HuggingFacePipeline` to load the model locally.\n    *   **Choose a model**: Find the model ID on Hugging Face (e.g., `beomi/llama-2-ko-7b`).\n    *   **Inference code**:\n        ```\n        from langchain.prompts import PromptTemplate\n        from langchain.llms import HuggingFacePipeline\n        from langchain.chains import LLMChain\n        # 1. Set the model ID\n        model_id = 'beomi/llama-2-ko-7b' \n        # 2. Create HuggingFacePipeline object\n        llm = HuggingFacePipeline.from_model_id(\n            model_id=model_id,\n            device=0,  # Use 0 or higher for GPU, -1 for CPU\n            task=\"text-generation\",\n            model_kwargs={\"temperature\": 0.1, \"max_length\": 64}\n        )\n        # 3. Define the prompt template (Korean model example)\n        template = \"\"\"질문: {question}\n        답변: \"\"\"\n        prompt = PromptTemplate.from_template(template)\n        # 4. 
Run with LLMChain\n        llm_chain = LLMChain(prompt=prompt, llm=llm)\n        question = \"대한민국의 수도는 어디야?\"\n        print(llm_chain.run(question=question))\n        ```\n\n* * *\n\n### 3. High-Performance Local Inference with vLLM (Advanced)\n\n`vLLM` is a library that uses optimizations like PagedAttention to maximize LLM inference performance (throughput, latency) on local machines. It’s ideal for large-scale services or when fast responses are needed.\n\n#### Advantages of vLLM:\n\n*   **Excellent performance**: High throughput and low latency.\n*   **Maximized GPU utilization**: Uses optimized CUDA kernels for best GPU performance.\n\n#### How to use:\n\nYou can use `vLLM` directly via its Python API or run an OpenAI-compatible API server.\n\n##### Method A: Direct use with Python API\n\n```\nfrom vllm import LLM\n\n# 1. Load the model with vLLM\nllm = LLM(model=\"facebook/opt-350m\")\n\n# 2. Prepare prompts\nprompts = [\n    \"Hello, my name is\",\n    \"The capital of France is\",\n]\n\n# 3. Run inference\noutputs = llm.generate(prompts)\n\n# Print results\nfor output in outputs:\n    prompt = output.prompt\n    generated_text = output.outputs[0].text\n    print(f\"Prompt: {prompt!r}, Generated text: {generated_text!r}\")\n```\n\n##### Method B: Run as an OpenAI-compatible API server\n\nThis method runs a local API server, allowing you to send requests via the `openai` library or `curl`.\n\n1.  **Start the vLLM server**: In your terminal, run:\n    ```\n    # Start the server on port 8000 with the Qwen/Qwen2.5-1.5B-Instruct model\n    vllm serve Qwen/Qwen2.5-1.5B-Instruct\n    ```\n2.  
**Call from a Python client**: Use the `openai` library (version 1.0 or later) to send requests.\n    ```\n    from openai import OpenAI\n    # Point the client at the local vLLM server.\n    # The API key is not checked by default, so any value works.\n    client = OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"EMPTY\")\n    completion = client.completions.create(\n        model=\"Qwen/Qwen2.5-1.5B-Instruct\",  # Model name used when starting the server\n        prompt=\"The capital of Germany is\",\n        max_tokens=50\n    )\n    print(completion.choices[0].text)\n    ```\n\n* * *\n\n### Summary & Selection Guide\n\n| Method | Pros | Cons | Recommended for |\n| --- | --- | --- | --- |\n| **1. Hugging Face Hub** | - Easy setup<br>- No hardware needed | - Slow<br>- Possible timeouts | Simple testing and quick prototyping |\n| **2. Local Inference (HuggingFacePipeline)** | - Fast<br>- Full control | - Requires high-spec hardware (GPU)<br>- More complex setup | Full-scale development, research, offline use |\n| **3. vLLM** | - **Very high performance** (throughput, speed)<br>- OpenAI-compatible API | - More complex setup<br>- GPU required | High-performance LLM serving, production services |\n\nAdditionally, when sending questions to the model, it’s common to use `PromptTemplate` or `ChatPromptTemplate` to structure your input. This helps ensure consistent output formatting and makes it easy to inject variables dynamically.\n\nWould you like a more detailed explanation or an example for a specific method or model?",
      "need_alignment": {
        "score": "3",
        "feedback": "The topic is generally aligned, but it would be even better if you could include the following points. First, I’d like the overall content to be written in the style of a code tutorial that I can directly use for my research, using Hugging Face Transformers directly (AutoTokenizer/AutoModelForCausalLM or pipeline) rather than relying on LangChain wrappers and the remote Hub path. It’s good to have detailed explanations as you’ve done, but I think the formatting, such as proper chat-style prompting (system/user/assistant) and batch examples in Transformers, should be more detailed. Also, it would be helpful to explain in detail how to handle HF tokens for gated models (e.g., Llama 3), give model/VRAM selection guidance, and load datasets from the datasets library for quick experiments. Additionally, it would be great if you could discuss vLLM-specific settings (KV-cache/paged attention and concurrency), avoid outdated model examples (e.g., OPT-350M), and add practical troubleshooting (OOM, tokenizer mismatch, CUDA issues)."
      },
      "content_depth": {
        "score": "3",
        "feedback": "The overall level is understandable, but for research-usable reproducibility I need more. Please add a precise setup (venv/conda, CUDA check, correct torch install), HF token login, and VRAM guidance. Include the full install list (transformers, torch, accelerate, huggingface_hub, sentencepiece, optionally bitsandbytes) and show direct Transformers usage with key params (device_map, dtype, max_new_tokens, temperature, top_p), single and batch, with chat formatting. For vLLM, cover both the Python API and the OpenAI-compatible server with concurrency/KV-cache/paged-attention settings. Also include optimization/troubleshooting (4/8‑bit loading, max context/tokens, OOM/CUDA errors, tokenizer mismatch, gated model access). As written, it’s missing several critical pieces and requires extra effort to adapt."
      },
      "tone": {
        "score": "5",
        "feedback": "Your tone is perfect right now. Please keep it that way."
      },
      "explanation_style": {
        "score": "4",
        "feedback": "The explanation style is fine as it is. However, I’d like each section to be a bit richer in content, and it would be better if the code snippets included more detailed comments. Please also add explicit chat templates and decoding, and switch the comparison from a table to a short bullet list to keep it scannable."
      }
    },
    {
      "contents": "Step-by-step notebook: Run open-source LLMs (Llama/Mistral/Qwen) from Hugging Face\n\n0) What you will get\n- A clean local environment with CUDA-verified PyTorch.\n- Direct Transformers inference (single + batch) with chat-style prompts.\n- vLLM acceleration two ways: Python API and an OpenAI-compatible server.\n- Practical optimization and troubleshooting for VRAM/OOM, tokens, and tokenizer issues.\n\n1) Preparation\n1.1 Create and activate a virtual environment (choose one)\n- Conda\n  conda create -n hf-llm python=3.10 -y\n  conda activate hf-llm\n- venv (Linux/macOS)\n  python -m venv .venv\n  source .venv/bin/activate\n- venv (Windows PowerShell)\n  python -m venv .venv\n  .\\.venv\\Scripts\\Activate.ps1\n\n1.2 Install PyTorch with the right CUDA build\n- Find the correct command for your system at https://pytorch.org/get-started/locally/\n- Example (CUDA 12.1 wheels):\n  pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio\n\n1.3 Install libraries\n- Core\n  pip install -U transformers accelerate huggingface_hub sentencepiece datasets\n- Optional (quantization and speed)\n  pip install -U bitsandbytes  # 4/8-bit loading (Linux/NVIDIA)\n  pip install -U vllm openai   # accelerated serving + OpenAI client\n  # optional extra speed: FlashAttention 2 (GPU-dependent)\n  # pip install -U flash-attn --no-build-isolation\n\n1.4 Verify CUDA and GPU\n\n```python\nimport torch, subprocess, sys\nprint('torch:', torch.__version__)\nprint('cuda available:', torch.cuda.is_available())\nif torch.cuda.is_available():\n    print('device:', torch.cuda.get_device_name(0))\n    print('capability:', torch.cuda.get_device_capability(0))\n    # Show VRAM\n    try:\n        subprocess.run(['nvidia-smi'], check=False)\n    except FileNotFoundError:\n        print('nvidia-smi not found (OK on some systems)')\nelse:\n    print('Running on CPU; large LLMs will be very slow or fail due to memory.')\n```\n\n1.5 Authenticate to 
Hugging Face (for gated models like Llama 3)\n\n```python\nimport os, getpass\nfrom huggingface_hub import login\n# One-time: accept the model license on its HF page (e.g., Meta Llama 3).\nif not os.getenv('HUGGINGFACE_HUB_TOKEN'):\n    os.environ['HUGGINGFACE_HUB_TOKEN'] = getpass.getpass('Enter your HF token: ')\nlogin(token=os.environ['HUGGINGFACE_HUB_TOKEN'])\n```\n\n1.6 Model selection and VRAM guidance (rough)\n- Prefer instruct/chat variants: e.g., meta-llama/Meta-Llama-3-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.2, Qwen/Qwen2.5-7B-Instruct.\n- VRAM (bf16/fp16, single GPU; add KV-cache for long outputs):\n  - 7–8B: ~16–20 GB; with 4-bit: ~7–10 GB.\n  - 13B: ~28–34 GB; with 4-bit: ~14–18 GB.\n  - 70B: multi-GPU or server required.\n- If VRAM is tight, start with 4-bit quantization (bitsandbytes) or use vLLM with smaller max context.\n\n2) Basic inference with Transformers (no LangChain)\n2.1 Load a chat LLM (Llama/Mistral/Qwen) with AutoTokenizer/AutoModelForCausalLM\n\n```python\nimport os, torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\n# Choose one chat model (ensure license accepted if gated):\n# model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'   # gated\nmodel_id = 'mistralai/Mistral-7B-Instruct-v0.2'       # ungated reference\n\n# Dtype selection (bf16 if supported; else fp16)\nif torch.cuda.is_available():\n    major_cap = torch.cuda.get_device_capability(0)[0]\n    torch_dtype = torch.bfloat16 if major_cap >= 8 else torch.float16\nelse:\n    torch_dtype = torch.float32\n\nhf_token = os.getenv('HUGGINGFACE_HUB_TOKEN')\n\ntokenizer = AutoTokenizer.from_pretrained(\n    model_id, use_fast=False, token=hf_token\n)\nif tokenizer.pad_token_id is None:\n    tokenizer.pad_token_id = tokenizer.eos_token_id\n\n# Load model to GPU(s) automatically; trust_remote_code for some families (e.g., Qwen)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id,\n    torch_dtype=torch_dtype,\n    device_map='auto',           # accelerate 
will shard across GPUs if present\n    trust_remote_code=True,\n)\nmodel.eval()\n```\n\n2.2 Single chat prompt (system/user/assistant)\n\n```python\n# Compose chat messages; apply the model's chat template for correct special tokens\n# Note: a few chat templates (e.g., Mistral-7B-Instruct-v0.2) reject the system role;\n# if apply_chat_template raises, merge the system text into the user message.\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a concise, helpful research assistant.\"},\n    {\"role\": \"user\", \"content\": \"In 2 sentences, explain retrieval-augmented generation.\"}\n]\n\nprompt = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,  # adds assistant header according to the template\n)\n\n# The chat template already inserts special tokens (e.g., BOS), so do not add them again\ninputs = tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to(model.device)\n\ngen_kwargs = dict(\n    max_new_tokens=256,\n    do_sample=True,\n    temperature=0.7,   # lower -> more deterministic\n    top_p=0.9,         # nucleus sampling\n    # top_k=50,        # optional\n    eos_token_id=tokenizer.eos_token_id,\n    pad_token_id=tokenizer.eos_token_id,\n)\n\nwith torch.no_grad():\n    output_ids = model.generate(**inputs, **gen_kwargs)\n\n# Strip the prompt and decode only newly generated tokens\ngen_text = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)\nprint(gen_text)\n```\n\n2.3 Batch prompts (dataset-driven)\n\n```python\nfrom datasets import load_dataset\n\n# Small batch from SQuAD validation\nds = load_dataset('squad', split='validation[:4]')\n\ndef to_messages(example):\n    ctx = example['context']\n    q = example['question']\n    return [\n        {\"role\": \"system\", \"content\": \"Answer using only the provided context.\"},\n        {\"role\": \"user\", \"content\": f\"Context:\\n{ctx}\\n\\nQuestion: {q}\\nAnswer:\"}\n    ]\n\nprompts = [\n    tokenizer.apply_chat_template(to_messages(ex), tokenize=False, add_generation_prompt=True)\n    for ex in ds\n]\n\n# Decoder-only models should be left-padded for batched generation,\n# so that new tokens continue directly from the end of each prompt\ntokenizer.padding_side = 'left'\n\nbatch_inputs = tokenizer(\n    prompts,\n    padding=True,\n    truncation=True,\n    add_special_tokens=False,  # the chat template already added them\n    return_tensors='pt'\n).to(model.device)\n\nwith torch.no_grad():\n    
batch_output_ids = model.generate(**batch_inputs, **gen_kwargs)\n\n# Decode only the newly generated parts per item.\n# generate() returns the padded prompt followed by the new tokens,\n# so every row shares the same prompt width.\nprompt_width = batch_inputs['input_ids'].shape[1]\noutputs = []\nfor out_ids in batch_output_ids:\n    gen_only = out_ids[prompt_width:]\n    outputs.append(tokenizer.decode(gen_only, skip_special_tokens=True))\n\nfor i, text in enumerate(outputs):\n    print(f'--- Example {i} ---')\n    print(text)\n```\n\n2.4 (Optional) Minimal pipeline variant\n\n```python\nfrom transformers import pipeline\n\n# Reuse the model and tokenizer loaded above; the model is already placed\n# on the GPU(s), so device_map and dtype are not passed again here\npipe = pipeline(\n    'text-generation',\n    model=model,\n    tokenizer=tokenizer,\n)\nresult = pipe(\n    prompt,\n    max_new_tokens=128,\n    do_sample=True,          # required for temperature/top_p to take effect\n    temperature=0.7,\n    top_p=0.9,\n    return_full_text=False,  # return only the generated continuation\n)[0]['generated_text']\nprint(result)\n```\n\n3) vLLM acceleration\n3.1 Install\n  pip install -U vllm openai\n\nNote: vLLM downloads weights through huggingface_hub, which reads your token from the HF_TOKEN environment variable.\n\n3.2 vLLM Python API (in-notebook)\n\n```python\nimport os, torch\nfrom vllm import LLM, SamplingParams\nfrom transformers import AutoTokenizer\n\nmodel_id = 'meta-llama/Meta-Llama-3-8B-Instruct'  # or any instruct model\n# Copy the token stored earlier in this notebook into HF_TOKEN\nos.environ.setdefault('HF_TOKEN', os.getenv('HUGGINGFACE_HUB_TOKEN', ''))\n\n# Prepare chat prompts using the model's chat template\ntok = AutoTokenizer.from_pretrained(model_id, use_fast=False, token=os.getenv('HF_TOKEN'))\nmsgs = [\n    [\n        {\"role\": \"system\", \"content\": \"You are a concise, helpful research assistant.\"},\n        {\"role\": \"user\", \"content\": \"Summarize contrastive learning in one paragraph.\"}\n    ],\n    [\n        {\"role\": \"system\", \"content\": \"You are a concise, helpful research assistant.\"},\n        {\"role\": \"user\", \"content\": \"List 3 pitfalls when evaluating LLMs with BLEU.\"}\n    ],\n]\nprompts = [tok.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in msgs]\n\nllm = LLM(\n    
model=model_id,\n    dtype='auto',\n    tensor_parallel_size=(torch.cuda.device_count() or 1),\n    gpu_memory_utilization=0.92,\n    max_model_len=8192,\n    enable_prefix_caching=True,  # reuse shared prefixes across requests\n    trust_remote_code=True,\n)\n\nsampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)\noutputs = llm.generate(prompts, sampling_params=sampling)\n\nfor i, out in enumerate(outputs):\n    print(f'--- Response {i} ---')\n    print(out.outputs[0].text)\n```\n\nNotes\n- vLLM automatically uses PagedAttention and an efficient KV-cache.\n- Concurrency/throughput knobs (Python API): set max_num_seqs and/or max_num_batched_tokens in LLM(...), e.g., LLM(..., max_num_seqs=64, max_num_batched_tokens=8192). Increase for higher parallelism if VRAM allows.\n\n3.3 vLLM OpenAI-compatible server\n- Start the server (terminal):\n\n  export HF_TOKEN=$HUGGINGFACE_HUB_TOKEN\n  python -m vllm.entrypoints.openai.api_server \\\n      --model meta-llama/Meta-Llama-3-8B-Instruct \\\n      --dtype auto \\\n      --host 0.0.0.0 --port 8000 \\\n      --max-model-len 8192 \\\n      --gpu-memory-utilization 0.92 \\\n      --tensor-parallel-size 1 \\\n      --enable-prefix-caching \\\n      --max-num-seqs 64\n\n- Minimal client (Python, Chat Completions):\n\n```python\nfrom openai import OpenAI\nclient = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')\n\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a concise, helpful research assistant.\"},\n    {\"role\": \"user\", \"content\": \"Give 3 bulleted tips for prompt evaluation.\"},\n]\nresp = client.chat.completions.create(\n    model='meta-llama/Meta-Llama-3-8B-Instruct',\n    messages=messages,\n    max_tokens=200,\n    temperature=0.7,\n    top_p=0.9,\n)\nprint(resp.choices[0].message.content)\n```\n\n- Concurrent requests (async):\n\n```python\nimport asyncio\nfrom openai import AsyncOpenAI\naclient = AsyncOpenAI(base_url='http://localhost:8000/v1', 
api_key='EMPTY')\n\nasync def ask(q):\n    return await aclient.chat.completions.create(\n        model='meta-llama/Meta-Llama-3-8B-Instruct',\n        messages=[{\"role\": \"system\", \"content\": \"You are concise.\"}, {\"role\": \"user\", \"content\": q}],\n        max_tokens=128,\n        temperature=0.7,\n        top_p=0.9,\n    )\n\nasync def main():\n    qs = [f'Question {i}: one-line summary of LoRA.' for i in range(8)]\n    rs = await asyncio.gather(*[ask(q) for q in qs])\n    for i, r in enumerate(rs):\n        print(i, r.choices[0].message.content)\n\n# Jupyter already runs an event loop, so await the coroutine directly.\n# In a standalone script, call asyncio.run(main()) instead.\nawait main()\n```\n\n4) Optimization and troubleshooting\n4.1 4/8-bit loading (Transformers)\n\n```python\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n\nmodel_id = 'meta-llama/Meta-Llama-3-8B-Instruct'\nbnb = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_quant_type='nf4',\n    bnb_4bit_compute_dtype=torch.bfloat16,\n)\n\ntokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)\nif tokenizer.pad_token_id is None:\n    tokenizer.pad_token_id = tokenizer.eos_token_id\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id,\n    quantization_config=bnb,\n    device_map='auto',\n    trust_remote_code=True,\n)\n```\n- If 4-bit is unstable on your GPU, try load_in_8bit=True or set bnb_4bit_compute_dtype=torch.float16.\n\n4.2 Max context and tokens\n- Control output length with max_new_tokens.\n- Control input context via tokenizer.model_max_length or model.config.max_position_embeddings; truncate long prompts.\n- Some models support rope scaling or long-context variants; otherwise keep total tokens (prompt + generated) under the model's context window (max_model_len in vLLM).\n\n4.3 Common errors\n- CUDA OOM:\n  - Use 4/8-bit quantization.\n  - Reduce max_new_tokens and/or batch size.\n  - Use device_map='auto' (Accelerate) to CPU-offload some layers; optionally set max_memory per device.\n  - Close other GPU processes; restart kernel; 
torch.cuda.empty_cache().\n- CUDA/driver mismatch:\n  - PyTorch wheels bundle their own CUDA runtime (see torch.version.cuda); your NVIDIA driver must be new enough to support that version. If it is not, update the driver or reinstall Torch with wheels built for an older CUDA version.\n  - Verify that nvidia-smi runs and lists your GPU.\n- Tokenizer mismatch (garbled outputs or missing chat template):\n  - Ensure tokenizer and model come from the same repo_id.\n  - Use AutoTokenizer.from_pretrained(..., use_fast=False) and trust_remote_code=True for families like Qwen.\n  - Prefer tokenizer.apply_chat_template(..., add_generation_prompt=True) over manual string prompts.\n  - Set tokenizer.pad_token_id = tokenizer.eos_token_id if missing.\n- Gated model access errors:\n  - Accept the license on the model page.\n  - Ensure your token is set (this notebook stores it in HUGGINGFACE_HUB_TOKEN; huggingface_hub itself reads HF_TOKEN) and that you ran huggingface_hub.login().\n- Timeouts (serving):\n  - vLLM: increase --max-num-seqs, reduce max_model_len, or use a half-precision dtype; ensure GPU is not oversubscribed.\n  - Client: increase request timeout; keep prompts concise.\n\n4.4 Quick comparison (indicative; varies by GPU/model)\n- Transformers (bf16/fp16, single stream):\n  - Simpler to embed in research code; good single-request latency.\n  - Throughput modest; VRAM ≈ weights + KV-cache (e.g., ~16–20 GB for 7–8B at bf16).\n- Transformers with 4-bit:\n  - Much lower VRAM (e.g., ~7–10 GB for 7–8B); slower decoding; small quality drop.\n- vLLM (bf16/fp16):\n  - High throughput via PagedAttention and shared KV-cache; strong concurrency with dynamic batching.\n  - Similar weight VRAM to Transformers; better KV-cache utilization; best for multi-request/batch workloads.\n\nNotes\n- Swap models freely by changing model_id (Llama 3, Mistral, Qwen). 
Always pick the -Instruct or -Chat variant for conversations.\n- Optional RAG: use datasets or your corpus to build the context string and place it in the user message; the chat template remains the same.\n\nYou can now load any open-source chat LLM from the Hub, feed chat-formatted prompts, and run fast inference locally with either Transformers or vLLM.",
      "need_alignment": {
        "score": "5",
        "feedback": "Direct, code-first tutorial for modern chat LLMs from Hugging Face (Llama/Mistral/Qwen). Covers Transformers (AutoTokenizer/AutoModel), single and batch chat prompts with apply_chat_template, key gen params, HF token handling for gated models, and both vLLM modes (Python API + OpenAI server) with concurrency/KV-cache notes. Includes a dataset-based batch example and practical troubleshooting. No outdated models or unrelated tasks."
      },
      "content_depth": {
        "score": "5",
        "feedback": "Research-usable and reproducible: virtual env and CUDA checks, proper PyTorch install, full dependency list (transformers/torch/accelerate/huggingface_hub/sentencepiece + optional bitsandbytes/vllm), VRAM guidance/model selection, clean chat-format examples (single + batch), vLLM Python API and server with concurrency/prefix caching, plus targeted optimization/troubleshooting (4/8-bit, max context/tokens, OOM/CUDA/tokenizer/gating/timeouts)."
      },
      "tone": {
        "score": "5",
        "feedback": "Dry, concise, and professional throughout; no fluff, salutations, or marketing language. Clear technical wording aligned with a research workflow."
      },
      "explanation_style": {
        "score": "5",
        "feedback": "Step-by-step numbered flow with short bullets and code-first cells. Includes explicit chat templates and output decoding, and a brief bullet comparison of Transformers vs vLLM. Minimal narrative overhead; easy to scan and execute."
      }
    }
  ],
  "query-id": 121
}