{
  "query": "How can I call and run inference on the open-source language model I want from Hugging Face?",
  "query-id": 121,
  "gold_response": "Step-by-step notebook: Run open-source LLMs (Llama/Mistral/Qwen) from Hugging Face\n\n0) What you will get\n- A clean local environment with CUDA-verified PyTorch.\n- Direct Transformers inference (single + batch) with chat-style prompts.\n- vLLM acceleration two ways: Python API and an OpenAI-compatible server.\n- Practical optimization and troubleshooting for VRAM/OOM, tokens, and tokenizer issues.\n\n1) Preparation\n1.1 Create and activate a virtual environment (choose one)\n- Conda\n  conda create -n hf-llm python=3.10 -y\n  conda activate hf-llm\n- venv (Linux/macOS)\n  python -m venv .venv\n  source .venv/bin/activate\n- venv (Windows PowerShell)\n  python -m venv .venv\n  .\\.venv\\Scripts\\Activate.ps1\n\n1.2 Install PyTorch with the right CUDA build\n- Find the correct command for your system at https://pytorch.org/get-started/locally/\n- Example (CUDA 12.1 wheels):\n  pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio\n\n1.3 Install libraries\n- Core\n  pip install -U transformers accelerate huggingface_hub sentencepiece datasets\n- Optional (quantization and speed)\n  pip install -U bitsandbytes  # 4/8-bit loading (Linux/NVIDIA)\n  pip install -U vllm openai   # accelerated serving + OpenAI client\n  # optional extra speed: FlashAttention 2 (GPU-dependent)\n  # pip install -U flash-attn --no-build-isolation\n\n1.4 Verify CUDA and GPU\n\n```python\nimport torch, subprocess, sys\nprint('torch:', torch.__version__)\nprint('cuda available:', torch.cuda.is_available())\nif torch.cuda.is_available():\n    print('device:', torch.cuda.get_device_name(0))\n    print('capability:', torch.cuda.get_device_capability(0))\n    # Show VRAM\n    try:\n        subprocess.run(['nvidia-smi'], check=False)\n    except FileNotFoundError:\n        print('nvidia-smi not found (OK on some systems)')\nelse:\n    print('Running on CPU; large LLMs will be very slow or fail due to memory.')\n```\n\n1.5 Authenticate to 
Hugging Face (for gated models like Llama 3)\n\n```python\nimport os, getpass\nfrom huggingface_hub import login\n# One-time: accept the model license on its HF page (e.g., Meta Llama 3).\nif not os.getenv('HUGGINGFACE_HUB_TOKEN'):\n    os.environ['HUGGINGFACE_HUB_TOKEN'] = getpass.getpass('Enter your HF token: ')\nlogin(token=os.environ['HUGGINGFACE_HUB_TOKEN'])\n```\n\n1.6 Model selection and VRAM guidance (rough)\n- Prefer instruct/chat variants: e.g., meta-llama/Meta-Llama-3-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.2, Qwen/Qwen2.5-7B-Instruct.\n- VRAM (bf16/fp16, single GPU; add KV-cache for long outputs):\n  - 7–8B: ~16–20 GB; with 4-bit: ~7–10 GB.\n  - 13B: ~28–34 GB; with 4-bit: ~14–18 GB.\n  - 70B: multi-GPU or server required.\n- If VRAM is tight, start with 4-bit quantization (bitsandbytes) or use vLLM with smaller max context.\n\n2) Basic inference with Transformers (no LangChain)\n2.1 Load a chat LLM (Llama/Mistral/Qwen) with AutoTokenizer/AutoModelForCausalLM\n\n```python\nimport os, torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\n# Choose one chat model (ensure license accepted if gated):\n# model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'   # gated\nmodel_id = 'mistralai/Mistral-7B-Instruct-v0.2'       # ungated reference\n\n# Dtype selection (bf16 if supported; else fp16)\nif torch.cuda.is_available():\n    major_cap = torch.cuda.get_device_capability(0)[0]\n    torch_dtype = torch.bfloat16 if major_cap >= 8 else torch.float16\nelse:\n    torch_dtype = torch.float32\n\nhf_token = os.getenv('HUGGINGFACE_HUB_TOKEN')\n\ntokenizer = AutoTokenizer.from_pretrained(\n    model_id, use_fast=False, token=hf_token\n)\nif tokenizer.pad_token_id is None:\n    tokenizer.pad_token_id = tokenizer.eos_token_id\n\n# Load model to GPU(s) automatically; trust_remote_code for some families (e.g., Qwen)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id,\n    torch_dtype=torch_dtype,\n    device_map='auto',           # accelerate 
will shard across GPUs if present\n    trust_remote_code=True,\n)\nmodel.eval()\n```\n\n2.2 Single chat prompt (system/user/assistant)\n\n```python\n# Compose chat messages; apply the model's chat template for correct special tokens\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a concise, helpful research assistant.\"},\n    {\"role\": \"user\", \"content\": \"In 2 sentences, explain retrieval-augmented generation.\"}\n]\n\nprompt = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,  # adds assistant header according to the template\n)\n\ninputs = tokenizer(prompt, return_tensors='pt').to(model.device)\n\ngen_kwargs = dict(\n    max_new_tokens=256,\n    do_sample=True,\n    temperature=0.7,   # lower -> more deterministic\n    top_p=0.9,         # nucleus sampling\n    # top_k=50,        # optional\n    eos_token_id=tokenizer.eos_token_id,\n    pad_token_id=tokenizer.eos_token_id,\n)\n\nwith torch.no_grad():\n    output_ids = model.generate(**inputs, **gen_kwargs)\n\n# Strip the prompt and decode only newly generated tokens\ngen_text = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)\nprint(gen_text)\n```\n\n2.3 Batch prompts (dataset-driven)\n\n```python\nfrom datasets import load_dataset\n\n# Small batch from SQuAD validation\nds = load_dataset('squad', split='validation[:4]')\n\ndef to_messages(example):\n    ctx = example['context']\n    q = example['question']\n    return [\n        {\"role\": \"system\", \"content\": \"Answer using only the provided context.\"},\n        {\"role\": \"user\", \"content\": f\"Context:\\n{ctx}\\n\\nQuestion: {q}\\nAnswer:\"}\n    ]\n\nprompts = [\n    tokenizer.apply_chat_template(to_messages(ex), tokenize=False, add_generation_prompt=True)\n    for ex in ds\n]\n\nbatch_inputs = tokenizer(\n    prompts,\n    padding=True,\n    truncation=True,\n    return_tensors='pt'\n).to(model.device)\n\nwith torch.no_grad():\n    
batch_output_ids = model.generate(**batch_inputs, **gen_kwargs)\n\n# Decode only the newly generated part of each item. generate() returns the padded\n# prompt followed by the new tokens, so slice at the padded prompt width, not at each\n# prompt's unpadded length. For decoder-only models, set tokenizer.padding_side = 'left'\n# before tokenizing the batch so generation continues directly from the prompt.\nprompt_width = batch_inputs['input_ids'].shape[1]\noutputs = []\nfor out_ids in batch_output_ids:\n    gen_only = out_ids[prompt_width:]\n    outputs.append(tokenizer.decode(gen_only, skip_special_tokens=True))\n\nfor i, text in enumerate(outputs):\n    print(f'--- Example {i} ---')\n    print(text)\n```\n\n2.4 (Optional) Minimal pipeline variant\n\n```python\nfrom transformers import pipeline\n\n# Reuse the model and tokenizer loaded in 2.1; they are already placed on the device(s)\npipe = pipeline(\n    'text-generation',\n    model=model,\n    tokenizer=tokenizer,\n)\nresult = pipe(\n    prompt,\n    max_new_tokens=128,\n    do_sample=True,          # required for temperature/top_p to take effect\n    temperature=0.7,\n    top_p=0.9,\n    return_full_text=False,  # return only the completion, not the prompt\n)[0]['generated_text']\nprint(result)\n```\n\n3) vLLM acceleration\n3.1 Install\n  pip install -U vllm openai\n\nNote: vLLM downloads models through huggingface_hub, which reads your token from the HF_TOKEN environment variable.\n\n3.2 vLLM Python API (in-notebook)\n\n```python\nimport os, torch\nfrom vllm import LLM, SamplingParams\nfrom transformers import AutoTokenizer\n\nmodel_id = 'meta-llama/Meta-Llama-3-8B-Instruct'  # or any instruct model\nos.environ.setdefault('HF_TOKEN', os.getenv('HUGGINGFACE_HUB_TOKEN', ''))\n\n# Prepare chat prompts using the model's chat template\ntok = AutoTokenizer.from_pretrained(model_id, use_fast=False, token=os.getenv('HF_TOKEN'))\nmsgs = [\n    [\n        {\"role\": \"system\", \"content\": \"You are a concise, helpful research assistant.\"},\n        {\"role\": \"user\", \"content\": \"Summarize contrastive learning in one paragraph.\"}\n    ],\n    [\n        {\"role\": \"system\", \"content\": \"You are a concise, helpful research assistant.\"},\n        {\"role\": \"user\", \"content\": \"List 3 pitfalls when evaluating LLMs with BLEU.\"}\n    ],\n]\nprompts = [tok.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in msgs]\n\nllm = LLM(\n    
model=model_id,\n    dtype='auto',\n    tensor_parallel_size=(torch.cuda.device_count() or 1),\n    gpu_memory_utilization=0.92,\n    max_model_len=8192,\n    enable_prefix_caching=True,  # reuse shared prefixes across requests\n    trust_remote_code=True,\n)\n\nsampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)\noutputs = llm.generate(prompts, sampling_params=sampling)\n\nfor i, out in enumerate(outputs):\n    print(f'--- Response {i} ---')\n    print(out.outputs[0].text)\n```\n\nNotes\n- vLLM automatically uses PagedAttention and an efficient KV-cache.\n- Concurrency/throughput knobs (Python API): set max_num_seqs and/or max_num_batched_tokens in LLM(...), e.g., LLM(..., max_num_seqs=64, max_num_batched_tokens=8192). Increase for higher parallelism if VRAM allows.\n\n3.3 vLLM OpenAI-compatible server\n- Start the server (terminal):\n\n  export HF_TOKEN=$HUGGINGFACE_HUB_TOKEN\n  python -m vllm.entrypoints.openai.api_server \\\n      --model meta-llama/Meta-Llama-3-8B-Instruct \\\n      --dtype auto \\\n      --host 0.0.0.0 --port 8000 \\\n      --max-model-len 8192 \\\n      --gpu-memory-utilization 0.92 \\\n      --tensor-parallel-size 1 \\\n      --enable-prefix-caching \\\n      --max-num-seqs 64\n\n- Minimal client (Python, Chat Completions):\n\n```python\nfrom openai import OpenAI\nclient = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')\n\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a concise, helpful research assistant.\"},\n    {\"role\": \"user\", \"content\": \"Give 3 bulleted tips for prompt evaluation.\"},\n]\nresp = client.chat.completions.create(\n    model='meta-llama/Meta-Llama-3-8B-Instruct',\n    messages=messages,\n    max_tokens=200,\n    temperature=0.7,\n    top_p=0.9,\n)\nprint(resp.choices[0].message.content)\n```\n\n- Concurrent requests (async):\n\n```python\nimport asyncio\nfrom openai import AsyncOpenAI\naclient = AsyncOpenAI(base_url='http://localhost:8000/v1', 
api_key='EMPTY')\n\nasync def ask(q):\n    return await aclient.chat.completions.create(\n        model='meta-llama/Meta-Llama-3-8B-Instruct',\n        messages=[{\"role\": \"system\", \"content\": \"You are concise.\"}, {\"role\": \"user\", \"content\": q}],\n        max_tokens=128,\n        temperature=0.7,\n        top_p=0.9,\n    )\n\nasync def main():\n    qs = [f'Question {i}: one-line summary of LoRA.' for i in range(8)]\n    rs = await asyncio.gather(*[ask(q) for q in qs])\n    for i, r in enumerate(rs):\n        print(i, r.choices[0].message.content)\n\n# In a Jupyter notebook an event loop is already running, so use `await main()` instead.\nasyncio.run(main())  # for a plain .py script\n```\n\n4) Optimization and troubleshooting\n4.1 4/8-bit loading (Transformers)\n\n```python\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n\nmodel_id = 'meta-llama/Meta-Llama-3-8B-Instruct'\nbnb = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_quant_type='nf4',\n    bnb_4bit_compute_dtype=torch.bfloat16,\n)\n\ntokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)\nif tokenizer.pad_token_id is None:\n    tokenizer.pad_token_id = tokenizer.eos_token_id\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id,\n    quantization_config=bnb,\n    device_map='auto',\n    trust_remote_code=True,\n)\n```\n- If 4-bit is unstable on your GPU, try load_in_8bit=True or set bnb_4bit_compute_dtype=torch.float16.\n\n4.2 Max context and tokens\n- Control output length with max_new_tokens.\n- Check the input limit via tokenizer.model_max_length or model.config.max_position_embeddings; truncate prompts that exceed it.\n- Some models support rope scaling or long-context variants; otherwise keep total tokens (prompt + generated) under the model's context limit (max_model_len in vLLM).\n\n4.3 Common errors\n- CUDA OOM:\n  - Use 4/8-bit quantization.\n  - Reduce max_new_tokens and/or batch size.\n  - Use device_map='auto' (Accelerate) to CPU-offload some layers; optionally set max_memory per device.\n  - Close other GPU processes; restart kernel; 
torch.cuda.empty_cache().\n- CUDA/driver mismatch:\n  - The CUDA version of your PyTorch build (torch.version.cuda) must be supported by your NVIDIA driver. The pip wheels bundle their own CUDA runtime, so a mismatch usually means the driver is too old or the wrong wheel was installed; update the driver or reinstall Torch with the correct CUDA wheels.\n  - Verify that nvidia-smi runs and the driver is up to date.\n- Tokenizer mismatch (garbled outputs or missing chat template):\n  - Ensure tokenizer and model come from the same repo_id.\n  - Use AutoTokenizer.from_pretrained(..., use_fast=False) and trust_remote_code=True for families like Qwen.\n  - Prefer tokenizer.apply_chat_template(..., add_generation_prompt=True) over manual string prompts.\n  - Set tokenizer.pad_token_id = tokenizer.eos_token_id if missing.\n- Gated model access errors:\n  - Accept the license on the model page.\n  - Ensure HUGGINGFACE_HUB_TOKEN is set and you ran huggingface_hub.login().\n- Timeouts (serving):\n  - vLLM: tune --max-num-seqs, reduce --max-model-len, or use a half-precision dtype; ensure the GPU is not oversubscribed.\n  - Client: increase request timeout; keep prompts concise.\n\n4.4 Quick comparison (indicative; varies by GPU/model)\n- Transformers (bf16/fp16, single stream):\n  - Simpler to embed in research code; good single-request latency.\n  - Throughput modest; VRAM ≈ weights + KV-cache (e.g., ~16–20 GB for 7–8B at bf16).\n- Transformers with 4-bit:\n  - Much lower VRAM (e.g., ~7–10 GB for 7–8B); slower decoding; small quality drop.\n- vLLM (bf16/fp16):\n  - High throughput via PagedAttention and shared KV-cache; strong concurrency with dynamic batching.\n  - Similar weight VRAM to Transformers; better KV-cache utilization; best for multi-request/batch workloads.\n\nNotes\n- Swap models freely by changing model_id (Llama 3, Mistral, Qwen). 
Always pick the -Instruct or -Chat variant for conversations.\n- Optional RAG: use datasets or your corpus to build the context string and place it in the user message; the chat template remains the same.\n\nYou can now load any open-source chat LLM from the Hub, feed chat-formatted prompts, and run fast inference locally with either Transformers or vLLM.",
  "gold_information": [
    "Create and activate an isolated virtual environment before installing dependencies.",
    "Install a GPU‑enabled deep‑learning framework that matches your driver and toolkit.",
    "Verify GPU availability, compute capability, and memory before loading models.",
    "Install text‑generation and model‑hub client libraries to download and run models.",
    "Authenticate with the model hub and accept licenses to access gated models.",
    "Choose an instruct or chat variant of the model for conversational use.",
    "Use half precision or bfloat16 on supported GPUs to reduce memory and improve speed.",
    "Load the tokenizer and causal language model with automatic device mapping to available hardware.",
    "Set the pad token to the end‑of‑sequence token if it is missing.",
    "Format prompts with a chat template using system and user roles.",
    "Configure generation parameters such as max_new_tokens, temperature, and top_p.",
    "Decode only the newly generated tokens to avoid duplicating the prompt.",
    "Batch multiple prompts with padding and truncation for efficient inference.",
    "A high‑level pipeline API can perform text generation in a single call.",
    "Use a high‑throughput inference engine to accelerate serving and batch workloads.",
    "Expose a chat‑completions HTTP API that is compatible with common clients.",
    "Enable prefix caching and dynamic batching to improve throughput.",
    "Adjust sequence and token batch limits to increase parallelism within available memory.",
    "Provide the model‑hub token as an environment variable for authenticated downloads.",
    "Quantize model weights to 4‑bit or 8‑bit to reduce VRAM at some cost to speed and quality.",
    "If 4‑bit quantization is unstable, switch to 8‑bit or change the compute data type.",
    "Control output length with max_new_tokens and keep total tokens under the context limit.",
    "Truncate or summarize long inputs to fit within the model’s context window.",
    "Use long‑context extensions only if they are supported by the model.",
    "On out‑of‑memory errors, enable quantization, reduce batch size, or lower max_new_tokens.",
    "Use CPU or disk offloading and automatic device placement to fit larger models.",
    "Resolve tokenizer issues by pairing the tokenizer and model from the same repository and using the chat template.",
    "For access errors, confirm license acceptance and that the authentication token is configured.",
    "For serving timeouts, reduce context length, tune concurrency settings, or increase client timeouts.",
    "Models with about 7–8B parameters typically need roughly 16–20 GB of VRAM in half precision.",
    "The same models in 4‑bit quantization can fit in roughly 7–10 GB of VRAM.",
    "Models around 13B parameters may need roughly 28–34 GB of VRAM in half precision or 14–18 GB in 4‑bit.",
    "Very large models often require multiple GPUs or a dedicated server.",
    "A dataset loader can drive batch prompts for evaluation or data processing.",
    "Swap models by changing the model identifier in the loading call.",
    "For retrieval‑augmented generation, place retrieved context into the user message.",
    "The standard framework offers simpler integration but lower throughput than an accelerated engine."
  ]
}