{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image generation", "api_name": "stabilityai/stable-diffusion-2-base", "api_call": "StableDiffusionPipeline.from_pretrained('stabilityai/stable-diffusion-2-base', scheduler=EulerDiscreteScheduler.from_pretrained('stabilityai/stable-diffusion-2-base', subfolder='scheduler'), torch_dtype=torch.float16)", "performance": {"dataset": "COCO2017 validation set", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion v2-base is a diffusion-based text-to-image generation model trained on a subset of the LAION-5B dataset. It can be used to generate and modify images based on text prompts. The model uses a fixed, pretrained text encoder (OpenCLIP-ViT/H) and is intended for research purposes only.", "model_name": "stabilityai/stable-diffusion-2-base"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221122-014502", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221122-014502')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.3476, "Mae": 0.2763, "Rmse": 0.4088, "Abs Rel": 0.3308, "Log Mae": 0.1161, "Log Rmse": 0.17, "Delta1": 0.5682, "Delta2": 0.8301, "Delta3": 0.9279}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset. It achieves depth estimation with various performance metrics.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221122-014502"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "emotion", "api_name": "bhadresh-savani/distilbert-base-uncased-emotion", "api_call": "pipeline('text-classification', model='bhadresh-savani/distilbert-base-uncased-emotion', return_all_scores=True)", "performance": {"dataset": "Twitter-Sentiment-Analysis", "accuracy": 0.938}, "description": "DistilBERT is created with knowledge distillation during the pre-training phase, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding. It is smaller and faster than BERT and any other BERT-based model. This checkpoint is distilbert-base-uncased fine-tuned on the emotion dataset using the Hugging Face Trainer.", "model_name": "bhadresh-savani/distilbert-base-uncased-emotion"}
{"domain": "Audio Text-to-Speech", "framework": "speechbrain", "functionality": "Text-to-Speech", "api_name": "padmalcom/tts-tacotron2-german", "api_call": "Tacotron2.from_hparams(source='padmalcom/tts-tacotron2-german')", "performance": {"dataset": "custom German dataset", "accuracy": "Not provided"}, "description": "Text-to-Speech (TTS) with Tacotron2 trained on a custom German dataset with 12 days of voice data using SpeechBrain. It was trained for 39 epochs (English SpeechBrain models are trained for 750 epochs), so there is room for improvement and the model is most likely to be updated soon. Fortunately, the HiFi-GAN vocoder can be used language-independently.", "model_name": "padmalcom/tts-tacotron2-german"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "MFawad/sd-class-butterflies-32", "api_call": "DDPMPipeline.from_pretrained('MFawad/sd-class-butterflies-32')", "performance": {"dataset": "", "accuracy": ""}, "description": "This model is a diffusion model for unconditional image generation of cute \ud83e\udd8b.", "model_name": "MFawad/sd-class-butterflies-32"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_es_css10", "api_call": "unit.TTS.from_pretrained('facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_es_css10')", "performance": {"dataset": "covost2", "accuracy": null}, "description": "A text-to-speech model trained on multiple datasets including mtedx, covost2, europarl_st, and voxpopuli. Supports English, Spanish, French, and Italian languages.", "model_name": "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_es_css10"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "layoutlm-invoices", "api_call": "AutoModelForDocumentQuestionAnswering.from_pretrained('impira/layoutlm-invoices')", "performance": {"dataset": "proprietary dataset of invoices, SQuAD2.0, and DocVQA", "accuracy": "Not provided"}, "description": "A fine-tuned version of the multi-modal LayoutLM model for the task of question answering on invoices and other documents. It has been fine-tuned on a proprietary dataset of invoices as well as both SQuAD2.0 and DocVQA for general comprehension. Unlike other QA models, which can only extract consecutive tokens, this model can predict longer-range, non-consecutive sequences with an additional classifier head.", "model_name": "impira/layoutlm-invoices"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "azwierzc/vilt-b32-finetuned-vqa-pl", "api_call": "pipeline('visual-question-answering', model='azwierzc/vilt-b32-finetuned-vqa-pl')", "performance": {"dataset": "", "accuracy": ""}, "description": "A Visual Question Answering model fine-tuned on the Polish language.", "model_name": "azwierzc/vilt-b32-finetuned-vqa-pl"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/maskformer-swin-base-ade", "api_call": "MaskFormerForInstanceSegmentation.from_pretrained('facebook/maskformer-swin-base-ade')", "performance": {"dataset": "ADE20k", "accuracy": "Not provided"}, "description": "MaskFormer model trained on ADE20k semantic segmentation (base-sized version, Swin backbone). It was introduced in the paper Per-Pixel Classification is Not All You Need for Semantic Segmentation and first released in this repository. This model addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation.", "model_name": "facebook/maskformer-swin-base-ade"}
{"domain": "Tabular Tabular Classification", "framework": "Hugging Face Transformers", "functionality": "Tabular Classification", "api_name": "datadmg/autotrain-test-news-44534112235", "api_call": "AutoModel.from_pretrained('datadmg/autotrain-test-news-44534112235')", "performance": {"dataset": "datadmg/autotrain-data-test-news", "accuracy": 0.333}, "description": "This model is trained for Multi-class Classification on CO2 Emissions dataset. It uses the Hugging Face Transformers framework and is based on the extra_trees algorithm. The model is trained with AutoTrain and has a tabular classification functionality.", "model_name": "datadmg/autotrain-test-news-44534112235"}
{"domain": "Audio Audio-to-Audio", "framework": "Fairseq", "functionality": "speech-to-speech-translation", "api_name": "facebook/xm_transformer_sm_all-en", "api_call": "pipeline('translation', model='facebook/xm_transformer_sm_all-en')", "performance": {"dataset": "", "accuracy": ""}, "description": "A speech-to-speech translation model that can be loaded on the Inference API on-demand.", "model_name": "facebook/xm_transformer_sm_all-en"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Text Classification", "api_name": "lvwerra/distilbert-imdb", "api_call": "pipeline('sentiment-analysis', model='lvwerra/distilbert-imdb')", "performance": {"dataset": "imdb", "accuracy": 0.928}, "description": "This model is a fine-tuned version of distilbert-base-uncased on the imdb dataset. It is used for sentiment analysis on movie reviews and achieves an accuracy of 0.928 on the evaluation set.", "model_name": "lvwerra/distilbert-imdb"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-ViT-B-16-laion2B-s34B-b88K", "api_call": "pipeline('zero-shot-image-classification', model='laion/CLIP-ViT-B-16-laion2B-s34B-b88K')", "performance": {"dataset": "ImageNet-1k", "accuracy": "70.2%"}, "description": "A CLIP ViT-B/16 model trained with the LAION-2B English subset of LAION-5B using OpenCLIP. This model is intended for research purposes and can be used for zero-shot image classification, image and text retrieval, and other related tasks.", "model_name": "laion/CLIP-ViT-B-16-laion2B-s34B-b88K"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Image-to-Image", "api_name": "GreeneryScenery/SheepsControlV3", "api_call": "pipeline('image-to-image', model='GreeneryScenery/SheepsControlV3')", "performance": {"dataset": "GreeneryScenery/SheepsControlV3", "accuracy": "Not provided"}, "description": "GreeneryScenery/SheepsControlV3 is a model for image-to-image tasks. It can be used to generate images based on the input image and optional text guidance. The model has some limitations, such as the conditioning image not affecting the output image much. Improvements can be made by training for more epochs, using better prompts, and preprocessing the data.", "model_name": "GreeneryScenery/SheepsControlV3"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "seungwon12/layoutlmv2-base-uncased_finetuned_docvqa", "api_call": "pipeline('question-answering', model='seungwon12/layoutlmv2-base-uncased_finetuned_docvqa', tokenizer='seungwon12/layoutlmv2-base-uncased_finetuned_docvqa')", "performance": {"dataset": "DocVQA", "accuracy": ""}, "description": "A document question answering model finetuned on the DocVQA dataset using LayoutLMv2-base-uncased.", "model_name": "seungwon12/layoutlmv2-base-uncased_finetuned_docvqa"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "MCG-NJU/videomae-base", "api_call": "VideoMAEForPreTraining.from_pretrained('MCG-NJU/videomae-base')", "performance": {"dataset": "Kinetics-400", "accuracy": "To be provided"}, "description": "VideoMAE is an extension of Masked Autoencoders (MAE) to video. The architecture of the model is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches.", "model_name": "MCG-NJU/videomae-base"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "distilbert-base-uncased", "api_call": "pipeline('fill-mask', model='distilbert-base-uncased')", "performance": {"dataset": "GLUE", "accuracy": {"MNLI": 82.2, "QQP": 88.5, "QNLI": 89.2, "SST-2": 91.3, "CoLA": 51.3, "STS-B": 85.8, "MRPC": 87.5, "RTE": 59.9}}, "description": "DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. It was pretrained with three objectives: Distillation loss, Masked language modeling (MLM), and Cosine embedding loss. This model is uncased and can be used for masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.", "model_name": "distilbert-base-uncased"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "clip-vit-base-patch32-ko", "api_call": "pipeline('zero-shot-image-classification', model='Bingsu/clip-vit-base-patch32-ko')", "performance": {"dataset": "AIHUB", "accuracy": "Not provided"}, "description": "Korean CLIP model trained by Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. It is a zero-shot image classification model that can be used to classify images without any training data.", "model_name": "Bingsu/clip-vit-base-patch32-ko"}
{"domain": "Natural Language Processing Token Classification", "framework": "Flair", "functionality": "Part-of-Speech Tagging", "api_name": "flair/upos-english", "api_call": "SequenceTagger.load('flair/upos-english')", "performance": {"dataset": "ontonotes", "accuracy": 98.6}, "description": "This is the standard universal part-of-speech tagging model for English that ships with Flair. It predicts universal POS tags such as ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, and X. The model is based on Flair embeddings and LSTM-CRF.", "model_name": "flair/upos-english"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Human Pose Estimation", "api_name": "lllyasviel/sd-controlnet-openpose", "api_call": "ControlNetModel.from_pretrained('lllyasviel/sd-controlnet-openpose')", "performance": {"dataset": "200k pose-image, caption pairs", "accuracy": "Not specified"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on Human Pose Estimation. It can be used in combination with Stable Diffusion.", "model_name": "lllyasviel/sd-controlnet-openpose"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "kazusam/kt", "api_call": "./run.sh --skip_data_prep false --skip_train true --download_model mio/amadeus", "performance": {"dataset": "amadeus", "accuracy": "Not provided"}, "description": "An ESPnet2 TTS model trained by mio using amadeus recipe in espnet.", "model_name": "mio/amadeus"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "De-identification", "api_name": "StanfordAIMI/stanford-deidentifier-base", "api_call": "pipeline('ner', model='StanfordAIMI/stanford-deidentifier-base')", "performance": {"dataset": "radreports", "accuracy": {"known_institution_F1": 97.9, "new_institution_F1": 99.6, "i2b2_2006_F1": 99.5, "i2b2_2014_F1": 98.9}}, "description": "Stanford de-identifier was trained on a variety of radiology and biomedical documents with the goal of automatising the de-identification process while reaching satisfactory accuracy for use in production.", "model_name": "StanfordAIMI/stanford-deidentifier-base"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "LunarLander-v2", "api_name": "araffin/dqn-LunarLander-v2", "api_call": "DQN.load(load_from_hub('araffin/dqn-LunarLander-v2', 'dqn-LunarLander-v2.zip'), **kwargs)", "performance": {"dataset": "LunarLander-v2", "accuracy": "280.22 +/- 13.03"}, "description": "This is a trained model of a DQN agent playing LunarLander-v2 using the stable-baselines3 library.", "model_name": "araffin/dqn-LunarLander-v2"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Multilingual Question Answering", "api_name": "mrm8488/bert-multi-cased-finetuned-xquadv1", "api_call": "pipeline('question-answering', model='mrm8488/bert-multi-cased-finetuned-xquadv1', tokenizer='mrm8488/bert-multi-cased-finetuned-xquadv1')", "performance": {"dataset": "XQuAD", "accuracy": "Not provided"}, "description": "This model is a BERT (base-multilingual-cased) fine-tuned for multilingual Question Answering on 11 different languages using the XQuAD dataset and additional data augmentation techniques.", "model_name": "mrm8488/bert-multi-cased-finetuned-xquadv1"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Named Entity Recognition", "api_name": "flair/ner-english", "api_call": "SequenceTagger.load('flair/ner-english')", "performance": {"dataset": "conll2003", "accuracy": "93.06"}, "description": "This is the standard 4-class NER model for English that ships with Flair. It predicts 4 tags: PER (person name), LOC (location name), ORG (organization name), and MISC (other name). The model is based on Flair embeddings and LSTM-CRF.", "model_name": "flair/ner-english"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "shibing624/text2vec-base-chinese", "api_call": "SentenceModel('shibing624/text2vec-base-chinese')", "performance": {"dataset": [{"name": "ATEC", "accuracy": "31.93"}, {"name": "BQ", "accuracy": "42.67"}, {"name": "LCQMC", "accuracy": "70.16"}, {"name": "PAWSX", "accuracy": "17.21"}, {"name": "STS-B", "accuracy": "79.30"}]}, "description": "This is a CoSENT (Cosine Sentence) model: shibing624/text2vec-base-chinese. It maps sentences to a 768-dimensional dense vector space and can be used for tasks like sentence embeddings, text matching, or semantic search.", "model_name": "shibing624/text2vec-base-chinese"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "Linaqruf/anything-v3.0", "api_call": "Text2ImagePipeline(model='Linaqruf/anything-v3.0')", "performance": {"dataset": "", "accuracy": ""}, "description": "A text-to-image model that generates images from text descriptions.", "model_name": "Linaqruf/anything-v3.0"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "MountainCar-v0", "api_name": "sb3/dqn-MountainCar-v0", "api_call": "load_from_hub(repo_id='sb3/dqn-MountainCar-v0',filename='{MODEL FILENAME}.zip',)", "performance": {"dataset": "MountainCar-v0", "accuracy": "-103.40 +/- 7.49"}, "description": "This is a trained model of a DQN agent playing MountainCar-v0 using the stable-baselines3 library and the RL Zoo. The RL Zoo is a training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.", "model_name": "sb3/dqn-MountainCar-v0"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Conversational", "api_name": "af1tang/personaGPT", "api_call": "AutoModelForCausalLM.from_pretrained('af1tang/personaGPT')", "performance": {"dataset": "Persona-Chat", "accuracy": "Not provided"}, "description": "PersonaGPT is an open-domain conversational agent designed to do 2 tasks: decoding personalized responses based on input personality facts (the persona profile of the bot) and incorporating turn-level goals into its responses through action codes (e.g., talk about work, ask about favorite music). It builds on the DialoGPT-medium pretrained model based on the GPT-2 architecture.", "model_name": "af1tang/personaGPT"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "lmazzon70/videomae-large-finetuned-kinetics-finetuned-rwf2000-epochs8-batch8-kl-torch2", "api_call": "AutoModelForVideoClassification.from_pretrained('lmazzon70/videomae-large-finetuned-kinetics-finetuned-rwf2000-epochs8-batch8-kl-torch2')", "performance": {"dataset": "unknown", "accuracy": 0.7212}, "description": "This model is a fine-tuned version of MCG-NJU/videomae-large-finetuned-kinetics on an unknown dataset.", "model_name": "lmazzon70/videomae-large-finetuned-kinetics-finetuned-rwf2000-epochs8-batch8-kl-torch2"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/tapex-base", "api_call": "BartForConditionalGeneration.from_pretrained('microsoft/tapex-base')", "performance": {"dataset": "arxiv:2107.07653", "accuracy": "Not provided"}, "description": "TAPEX (Table Pre-training via Execution) is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries.", "model_name": "microsoft/tapex-base"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup", "api_call": "pipeline('zero-shot-image-classification', model='laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup')", "performance": {"dataset": "ImageNet-1k", "accuracy": "79.1-79.4"}, "description": "A series of CLIP ConvNeXt-XXLarge models trained on LAION-2B (English), a subset of LAION-5B, using OpenCLIP. These models achieve between 79.1 and 79.4 top-1 zero-shot accuracy on ImageNet-1k.", "model_name": "laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Automatic Speech Recognition and Speech Translation", "api_name": "openai/whisper-large-v2", "api_call": "WhisperForConditionalGeneration.from_pretrained('openai/whisper-large-v2')", "performance": {"dataset": "LibriSpeech test-clean", "accuracy": 3.000358308}, "description": "Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning.", "model_name": "openai/whisper-large-v2"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "keremberke/yolov8s-building-segmentation", "api_call": "YOLO('keremberke/yolov8s-building-segmentation')", "performance": {"dataset": "satellite-building-segmentation", "accuracy": {"mAP@0.5(box)": 0.661, "mAP@0.5(mask)": 0.651}}, "description": "A YOLOv8 model for building segmentation in satellite images. Trained on the satellite-building-segmentation dataset, it can detect and segment buildings with high accuracy.", "model_name": "keremberke/yolov8s-building-segmentation"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "google/vit-base-patch16-224", "api_call": "ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')", "performance": {"dataset": "imagenet-1k", "accuracy": "Not provided"}, "description": "Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al.", "model_name": "google/vit-base-patch16-224"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "facebook/detr-resnet-50-panoptic", "api_call": "DetrForSegmentation.from_pretrained('facebook/detr-resnet-50-panoptic')", "performance": {"dataset": "COCO 2017 validation", "accuracy": {"box_AP": 38.8, "segmentation_AP": 31.1, "PQ": 43.4}}, "description": "DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 panoptic (118k annotated images). It was introduced in the paper End-to-End Object Detection with Transformers by Carion et al. and first released in this repository.", "model_name": "facebook/detr-resnet-50-panoptic"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling, Next Sentence Prediction", "api_name": "bert-base-uncased", "api_call": "pipeline('fill-mask', model='bert-base-uncased')", "performance": {"dataset": "GLUE", "accuracy": 79.6}, "description": "BERT base model (uncased) is a transformer model pretrained on a large corpus of English data using a masked language modeling (MLM) objective. It can be used for masked language modeling, next sentence prediction, and fine-tuning on downstream tasks such as sequence classification, token classification, or question answering.", "model_name": "bert-base-uncased"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "LayoutLMX_pt_question_answer_ocrazure_correct_V18_08_04_2023", "api_call": "AutoModelForDocumentQuestionAnswering.from_pretrained('L-oenai/LayoutLMX_pt_question_answer_ocrazure_correct_V18_08_04_2023')", "performance": {"dataset": "", "accuracy": ""}, "description": "A LayoutLM model for document question answering.", "model_name": "L-oenai/LayoutLMX_pt_question_answer_ocrazure_correct_V18_08_04_2023"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "hf-tiny-model-private/tiny-random-GLPNForDepthEstimation", "api_call": "AutoModel.from_pretrained('hf-tiny-model-private/tiny-random-GLPNForDepthEstimation')", "performance": {"dataset": "", "accuracy": ""}, "description": "A tiny random GLPN model for depth estimation using the Hugging Face Transformers library.", "model_name": "hf-tiny-model-private/tiny-random-GLPNForDepthEstimation"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text-to-Text Transfer Transformer", "api_name": "google/mt5-base", "api_call": "MT5ForConditionalGeneration.from_pretrained('google/mt5-base')", "performance": {"dataset": "mc4", "accuracy": "Not provided"}, "description": "mT5 is a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. It leverages a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of multilingual NLP tasks.", "model_name": "google/mt5-base"}
{"domain": "Computer Vision Object Detection", "framework": "Transformers", "functionality": "Object Detection", "api_name": "fcakyon/yolov5s-v7.0", "api_call": "yolov5.load('fcakyon/yolov5s-v7.0')", "performance": {"dataset": "detection-datasets/coco", "accuracy": null}, "description": "Yolov5s-v7.0 is an object detection model trained on the COCO dataset. It can detect objects in images and return their bounding boxes, scores, and categories.", "model_name": "fcakyon/yolov5s-v7.0"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text2Text Generation", "api_name": "DialogLED-base-16384", "api_call": "LEDForConditionalGeneration.from_pretrained('MingZhong/DialogLED-base-16384')", "performance": {"dataset": "arxiv:2109.02492", "accuracy": "Not provided"}, "description": "DialogLED is a pre-trained model for long dialogue understanding and summarization. It builds on the Longformer-Encoder-Decoder (LED) architecture and is further trained on a large amount of long dialogue data, using window-based denoising as the pre-training task. This is the base version of DialogLED; its input length is limited to 16,384 in the pre-training phase.", "model_name": "MingZhong/DialogLED-base-16384"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "unit_hifigan_HK_layer12.km2500_frame_TAT-TTS", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS')", "performance": {"dataset": "TAT-TTS", "accuracy": "Not provided"}, "description": "Hokkien unit HiFiGAN-based vocoder from fairseq. Trained on TAT-TTS data with 4 speakers in a Taiwanese Hokkien accent.", "model_name": "facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "openai/clip-vit-base-patch32", "api_call": "CLIPModel.from_pretrained('openai/clip-vit-base-patch32')", "performance": {"dataset": ["Food101", "CIFAR10", "CIFAR100", "Birdsnap", "SUN397", "Stanford Cars", "FGVC Aircraft", "VOC2007", "DTD", "Oxford-IIIT Pet dataset", "Caltech101", "Flowers102", "MNIST", "SVHN", "IIIT5K", "Hateful Memes", "SST-2", "UCF101", "Kinetics700", "Country211", "CLEVR Counting", "KITTI Distance", "STL-10", "RareAct", "Flickr30", "MSCOCO", "ImageNet", "ImageNet-A", "ImageNet-R", "ImageNet Sketch", "ObjectNet (ImageNet Overlap)", "Youtube-BB", "ImageNet-Vid"], "accuracy": "varies"}, "description": "The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner.", "model_name": "openai/clip-vit-base-patch32"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "mywateriswet/ShuanBot", "api_call": "pipeline('conversational', model='mywateriswet/ShuanBot')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "ShuanBot is a conversational chatbot model based on the GPT-2 architecture. It can be used for generating human-like responses in a chat context.", "model_name": "mywateriswet/ShuanBot"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "glpn-kitti", "api_call": "GLPNForDepthEstimation.from_pretrained('vinvino02/glpn-kitti')", "performance": {"dataset": "KITTI", "accuracy": "Not provided"}, "description": "Global-Local Path Networks (GLPN) model trained on KITTI for monocular depth estimation. It was introduced in the paper Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth by Kim et al. and first released in this repository.", "model_name": "vinvino02/glpn-kitti"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Abstractive Text Summarization", "api_name": "plguillou/t5-base-fr-sum-cnndm", "api_call": "T5ForConditionalGeneration.from_pretrained('plguillou/t5-base-fr-sum-cnndm')", "performance": {"dataset": "cnn_dailymail", "ROUGE-1": 44.5252, "ROUGE-2": 22.652, "ROUGE-L": 29.8866}, "description": "This model is a T5 Transformers model (JDBN/t5-base-fr-qg-fquad) that was fine-tuned in French for abstractive text summarization.", "model_name": "plguillou/t5-base-fr-sum-cnndm"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "tiny-random-DPTForDepthEstimation", "api_call": "DPTForDepthEstimation.from_pretrained('hf-tiny-model-private/tiny-random-DPTForDepthEstimation')", "performance": {"dataset": "", "accuracy": ""}, "description": "A tiny random DPT model for depth estimation using Hugging Face Transformers library.", "model_name": "hf-tiny-model-private/tiny-random-DPTForDepthEstimation"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "MCG-NJU/videomae-large-finetuned-kinetics", "api_call": "VideoMAEForVideoClassification.from_pretrained('MCG-NJU/videomae-large-finetuned-kinetics')", "performance": {"dataset": "Kinetics-400", "accuracy": {"top-1": 84.7, "top-5": 96.5}}, "description": "VideoMAE model pre-trained for 1600 epochs in a self-supervised way and fine-tuned in a supervised way on Kinetics-400. It was introduced in the paper VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Tong et al. and first released in this repository.", "model_name": "MCG-NJU/videomae-large-finetuned-kinetics"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Translation, Summarization, Question Answering, Sentiment Analysis", "api_name": "t5-3b", "api_call": "T5ForConditionalGeneration.from_pretrained('t5-3b')", "performance": {"dataset": "c4", "accuracy": "See research paper, Table 14"}, "description": "T5-3B is a Text-To-Text Transfer Transformer (T5) model with 3 billion parameters. It is designed for various NLP tasks such as translation, summarization, question answering, and sentiment analysis. The model is pre-trained on the Colossal Clean Crawled Corpus (C4) and fine-tuned on multiple supervised and unsupervised tasks.", "model_name": "t5-3b"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "google/ddpm-church-256", "api_call": "DDPMPipeline.from_pretrained('google/ddpm-church-256')", "performance": {"dataset": "CIFAR10", "accuracy": {"Inception_score": 9.46, "FID_score": 3.17}}, "description": "Denoising Diffusion Probabilistic Models (DDPM) for high-quality image synthesis. Trained on the unconditional CIFAR10 dataset and 256x256 LSUN. Supports different noise schedulers like scheduling_ddpm, scheduling_ddim, and scheduling_pndm for inference.", "model_name": "google/ddpm-church-256"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli", "api_call": "Wav2Vec2ForCTC.from_pretrained('jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli')", "performance": {"dataset": "librispeech validation set", "accuracy": "4.45%"}, "description": "This checkpoint is a wav2vec2-large model that is useful for generating transcriptions with punctuation. It is intended for use in building transcriptions for TTS models, where punctuation is very important for prosody. This model was created by fine-tuning the facebook/wav2vec2-large-robust-ft-libri-960h checkpoint on the libritts and voxpopuli datasets with a new vocabulary that includes punctuation.", "model_name": "jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "ddpm-cifar10-32", "api_call": "DDPMPipeline.from_pretrained('google/ddpm-cifar10-32')", "performance": {"dataset": "CIFAR10", "accuracy": {"Inception score": 9.46, "FID score": 3.17}}, "description": "Denoising Diffusion Probabilistic Models (DDPM) for high quality image synthesis. Trained on the unconditional CIFAR10 dataset. Supports various discrete noise schedulers such as scheduling_ddpm, scheduling_ddim, and scheduling_pndm.", "model_name": "google/ddpm-cifar10-32"}
{"domain": "Natural Language Processing Text Generation", "framework": "Transformers", "functionality": "Text Generation", "api_name": "gpt2-large", "api_call": "pipeline('text-generation', model='gpt2-large')", "performance": {"dataset": {"LAMBADA": {"PPL": 10.87}, "CBT-CN": {"ACC": 93.45}, "CBT-NE": {"ACC": 88.0}, "WikiText2": {"PPL": 19.93}, "PTB": {"PPL": 40.31}, "enwiki8": {"BPB": 0.97}, "text8": {"BPC": 1.02}, "WikiText103": {"PPL": 22.05}, "1BW": {"PPL": 44.575}}}, "description": "GPT-2 Large is the 774M parameter version of GPT-2, a transformer-based language model created and released by OpenAI. The model is a pretrained model on English language using a causal language modeling (CLM) objective.", "model_name": "gpt2-large"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Natural Language Inference", "api_name": "cross-encoder/nli-MiniLM2-L6-H768", "api_call": "AutoModelForSequenceClassification.from_pretrained('cross-encoder/nli-MiniLM2-L6-H768')", "performance": {"dataset": "SNLI and MultiNLI", "accuracy": "See SBERT.net - Pretrained Cross-Encoder for evaluation results"}, "description": "This model was trained using SentenceTransformers Cross-Encoder class on the SNLI and MultiNLI datasets. For a given sentence pair, it will output three scores corresponding to the labels: contradiction, entailment, neutral.", "model_name": "cross-encoder/nli-MiniLM2-L6-H768"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Transformers", "functionality": "Zero-Shot Classification", "api_name": "cross-encoder/nli-roberta-base", "api_call": "CrossEncoder('cross-encoder/nli-roberta-base')", "performance": {"dataset": ["SNLI", "MultiNLI"], "accuracy": "See SBERT.net - Pretrained Cross-Encoder"}, "description": "Cross-Encoder for Natural Language Inference trained on the SNLI and MultiNLI datasets. Outputs three scores corresponding to the labels: contradiction, entailment, neutral.", "model_name": "cross-encoder/nli-roberta-base"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Transcription", "api_name": "openai/whisper-tiny.en", "api_call": "WhisperForConditionalGeneration.from_pretrained('openai/whisper-tiny.en')", "performance": {"dataset": "LibriSpeech (clean)", "accuracy": 8.437}, "description": "Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.", "model_name": "openai/whisper-tiny.en"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/tapex-base-finetuned-wtq", "api_call": "BartForConditionalGeneration.from_pretrained('microsoft/tapex-base-finetuned-wtq')", "performance": {"dataset": "wikitablequestions", "accuracy": "Not provided"}, "description": "TAPEX (Table Pre-training via Execution) is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries.", "model_name": "microsoft/tapex-base-finetuned-wtq"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Action Recognition", "api_name": "videomae-base-finetuned-ucf101", "api_call": "VideoMAEForVideoClassification.from_pretrained('nateraw/videomae-base-finetuned-ucf101')", "performance": {"dataset": "UCF101", "accuracy": 0.758209765}, "description": "VideoMAE Base model fine tuned on UCF101 for Video Action Recognition", "model_name": "nateraw/videomae-base-finetuned-ucf101"}
{"domain": "Tabular Tabular Regression", "framework": "Scikit-learn", "functionality": "baseline-trainer", "api_name": "merve/tips9y0jvt5q-tip-regression", "api_call": "pipeline('tabular-regression', model='merve/tips9y0jvt5q-tip-regression')", "performance": {"dataset": "tips9y0jvt5q", "accuracy": {"r2": 0.41524000000000005, "neg_mean_squared_error": -1.098792}}, "description": "Baseline Model trained on tips9y0jvt5q to apply regression on tip. The model uses Ridge(alpha=10) and is trained with dabl library as a baseline. For better results, use AutoTrain.", "model_name": "merve/tips9y0jvt5q-tip-regression"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Text Summarization", "api_name": "facebook/bart-large-cnn", "api_call": "pipeline('summarization', model='facebook/bart-large-cnn')", "performance": {"dataset": "cnn_dailymail", "accuracy": {"ROUGE-1": 42.949, "ROUGE-2": 20.815, "ROUGE-L": 30.619, "ROUGE-LSUM": 40.038}}, "description": "BART (large-sized model), fine-tuned on CNN Daily Mail. BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering). This particular checkpoint has been fine-tuned on CNN Daily Mail, a large collection of text-summary pairs.", "model_name": "facebook/bart-large-cnn"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/multi-qa-MiniLM-L6-cos-v1", "api_call": "SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')", "performance": {"dataset": [{"name": "WikiAnswers", "accuracy": "77,427,422"}, {"name": "PAQ", "accuracy": "64,371,441"}, {"name": "Stack Exchange", "accuracy": "25,316,456"}, {"name": "MS MARCO", "accuracy": "17,579,773"}, {"name": "GOOAQ", "accuracy": "3,012,496"}, {"name": "Amazon-QA", "accuracy": "2,448,839"}, {"name": "Yahoo Answers", "accuracy": "1,198,260"}, {"name": "SearchQA", "accuracy": "582,261"}, {"name": "ELI5", "accuracy": "325,475"}, {"name": "Quora", "accuracy": "103,663"}, {"name": "Natural Questions (NQ)", "accuracy": "100,231"}, {"name": "SQuAD2.0", "accuracy": "87,599"}, {"name": "TriviaQA", "accuracy": "73,346"}]}, "description": "This is a sentence-transformers model that maps sentences & paragraphs to a 384-dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.", "model_name": "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"}
{"domain": "Audio Voice Activity Detection", "framework": "Hugging Face Transformers", "functionality": "Voice Activity Detection", "api_name": "julien-c/voice-activity-detection", "api_call": "Inference('julien-c/voice-activity-detection', device='cuda')", "performance": {"dataset": "dihard", "accuracy": "Not provided"}, "description": "Example pyannote-audio Voice Activity Detection model using PyanNet. Imported from https://github.com/pyannote/pyannote-audio-hub and trained by @hbredin.", "model_name": "julien-c/voice-activity-detection"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Sentiment Analysis", "api_name": "cardiffnlp/twitter-roberta-base-sentiment-latest", "api_call": "pipeline('sentiment-analysis', model=AutoModel.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment-latest'), tokenizer=AutoTokenizer.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment-latest'))", "performance": {"dataset": "tweet_eval", "accuracy": "Not provided"}, "description": "This is a RoBERTa-base model trained on ~124M tweets from January 2018 to December 2021, and finetuned for sentiment analysis with the TweetEval benchmark. The model is suitable for English.", "model_name": "cardiffnlp/twitter-roberta-base-sentiment-latest"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Entity Extraction", "api_name": "904029577", "api_call": "AutoModelForTokenClassification.from_pretrained('ismail-lucifer011/autotrain-name_all-904029577', use_auth_token=True)", "performance": {"dataset": "ismail-lucifer011/autotrain-data-name_all", "accuracy": 0.9989316041}, "description": "This model is trained using AutoTrain for entity extraction. It is based on the DistilBert architecture and has CO2 emissions of 0.8375653425894861 grams.", "model_name": "ismail-lucifer011/autotrain-name_all-904029577"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Text Classification", "api_name": "distilbert-base-uncased-finetuned-sst-2-english", "api_call": "DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')", "performance": {"dataset": "glue", "accuracy": 0.911}, "description": "This model is a fine-tune checkpoint of DistilBERT-base-uncased, fine-tuned on SST-2. It reaches an accuracy of 91.3 on the dev set (for comparison, Bert bert-base-uncased version reaches an accuracy of 92.7). This model can be used for topic classification.", "model_name": "distilbert-base-uncased-finetuned-sst-2-english"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "flax-community/clip-rsicd-v2", "api_call": "CLIPModel.from_pretrained('flax-community/clip-rsicd-v2')", "performance": {"dataset": {"RSICD": {"original CLIP": {"k=1": 0.5720000000000001, "k=3": 0.745, "k=5": 0.837, "k=10": 0.9390000000000001}, "clip-rsicd-v2 (this model)": {"k=1": 0.883, "k=3": 0.968, "k=5": 0.982, "k=10": 0.998}}}}, "description": "This model is a fine-tuned CLIP by OpenAI. It is designed with an aim to improve zero-shot image classification, text-to-image and image-to-image retrieval specifically on remote sensing images.", "model_name": "flax-community/clip-rsicd-v2"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "blip2-flan-t5-xxl", "api_call": "Blip2ForConditionalGeneration.from_pretrained('Salesforce/blip2-flan-t5-xxl')", "performance": {"dataset": "LAION", "accuracy": "Not provided"}, "description": "BLIP-2 model, leveraging Flan T5-xxl (a large language model). It was introduced in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Li et al. and first released in this repository. The model is used for tasks like image captioning, visual question answering (VQA), and chat-like conversations by feeding the image and the previous conversation as prompt to the model.", "model_name": "Salesforce/blip2-flan-t5-xxl"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-base-finetuned-sqa", "api_call": "TapasTokenizer.from_pretrained('google/tapas-base-finetuned-sqa')", "performance": {"dataset": "msr_sqa", "accuracy": 0.6874}, "description": "TAPAS base model fine-tuned on Sequential Question Answering (SQA). It is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia and fine-tuned on SQA. It can be used for answering questions related to a table in a conversational set-up.", "model_name": "google/tapas-base-finetuned-sqa"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Image Segmentation", "api_name": "lllyasviel/sd-controlnet-seg", "api_call": "ControlNetModel.from_pretrained('lllyasviel/sd-controlnet-seg')", "performance": {"dataset": "ADE20K", "accuracy": "Trained on 164K segmentation-image, caption pairs"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on Image Segmentation. It can be used in combination with Stable Diffusion.", "model_name": "lllyasviel/sd-controlnet-seg"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "tinkoff-ai/ruDialoGPT-medium", "api_call": "AutoModelWithLMHead.from_pretrained('tinkoff-ai/ruDialoGPT-medium')", "performance": {"dataset": "Private Validation Set", "sensibleness": 0.78, "specificity": 0.6900000000000001, "SSA": 0.735}, "description": "This generation model is based on sberbank-ai/rugpt3medium_based_on_gpt2. It is trained on a large corpus of dialog data and can be used for building generative conversational agents. The model was trained with context size 3.", "model_name": "tinkoff-ai/ruDialoGPT-medium"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Image Captioning", "api_name": "nlpconnect/vit-gpt2-image-captioning", "api_call": "VisionEncoderDecoderModel.from_pretrained('nlpconnect/vit-gpt2-image-captioning')", "performance": {"dataset": "Not provided", "accuracy": "Not provided"}, "description": "An image captioning model that uses transformers to generate captions for input images. The model is based on the Illustrated Image Captioning using transformers approach.", "model_name": "nlpconnect/vit-gpt2-image-captioning"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling", "api_name": "roberta-large", "api_call": "pipeline('fill-mask', model='roberta-large')", "performance": {"dataset": "GLUE", "accuracy": {"MNLI": 90.2, "QQP": 92.2, "QNLI": 94.7, "SST-2": 96.4, "CoLA": 68.0, "STS-B": 96.4, "MRPC": 90.9, "RTE": 86.6}}, "description": "RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion using the Masked language modeling (MLM) objective. It can be fine-tuned on a downstream task, such as sequence classification, token classification, or question answering.", "model_name": "roberta-large"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "valhalla/distilbart-mnli-12-1", "api_call": "pipeline('zero-shot-classification', model='valhalla/distilbart-mnli-12-1')", "performance": {"dataset": "MNLI", "matched_accuracy": 87.08, "mismatched_accuracy": 87.5}, "description": "distilbart-mnli is the distilled version of bart-large-mnli created using the No Teacher Distillation technique proposed for BART summarisation by Huggingface. It is designed for zero-shot classification tasks.", "model_name": "valhalla/distilbart-mnli-12-1"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "tiennvcs/layoutlmv2-large-uncased-finetuned-vi-infovqa", "api_call": "pipeline('question-answering', model='tiennvcs/layoutlmv2-large-uncased-finetuned-vi-infovqa')", "performance": {"dataset": "unknown", "accuracy": {"Loss": 8.5806}}, "description": "This model is a fine-tuned version of microsoft/layoutlmv2-large-uncased on an unknown dataset.", "model_name": "tiennvcs/layoutlmv2-large-uncased-finetuned-vi-infovqa"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Image-to-Image", "api_name": "lllyasviel/control_v11p_sd15_canny", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_canny')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "Controlnet v1.1 is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on Canny edges. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5.", "model_name": "lllyasviel/control_v11p_sd15_canny"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "GanjinZero/UMLSBert_ENG", "api_call": "AutoModel.from_pretrained('GanjinZero/UMLSBert_ENG')", "performance": {"dataset": "", "accuracy": ""}, "description": "CODER: Knowledge infused cross-lingual medical term embedding for term normalization. English Version. Old name. This model is not UMLSBert! Github Link: https://github.com/GanjinZero/CODER", "model_name": "GanjinZero/UMLSBert_ENG"}
{"domain": "Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "mazkooleg/0-9up-hubert-base-ls960-ft", "api_call": "pipeline('audio-classification', model='mazkooleg/0-9up-hubert-base-ls960-ft')", "performance": {"dataset": "mazkooleg/0-9up_google_speech_commands_augmented_raw", "accuracy": 0.9973000000000001}, "description": "This model is a fine-tuned version of facebook/hubert-base-ls960 on the mazkooleg/0-9up_google_speech_commands_augmented_raw dataset. It achieves an accuracy of 0.9973 on the evaluation set.", "model_name": "mazkooleg/0-9up-hubert-base-ls960-ft"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Image generation and modification based on text prompts", "api_name": "stabilityai/stable-diffusion-x4-upscaler", "api_call": "StableDiffusionUpscalePipeline.from_pretrained('stabilityai/stable-diffusion-x4-upscaler', torch_dtype=torch.float16)", "performance": {"dataset": "COCO2017 validation set", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion x4 upscaler is a latent diffusion model trained on a 10M subset of LAION containing images >2048x2048. It can be used to generate and modify images based on text prompts. The model receives a noise_level as an input parameter, which can be used to add noise to the low-resolution input according to a predefined diffusion schedule. The model is trained with English captions and might not work well with other languages.", "model_name": "stabilityai/stable-diffusion-x4-upscaler"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Visual Question Answering", "api_name": "blip-vqa-base", "api_call": "BlipForQuestionAnswering.from_pretrained('Salesforce/blip-vqa-base')", "performance": {"dataset": "VQA", "accuracy": "+1.6% in VQA score"}, "description": "BLIP is a Vision-Language Pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks. It effectively utilizes noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. This model is trained on visual question answering with a base architecture (using ViT base backbone).", "model_name": "Salesforce/blip-vqa-base"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "lanwuwei/BERTOverflow_stackoverflow_github", "api_call": "AutoModelForTokenClassification.from_pretrained('lanwuwei/BERTOverflow_stackoverflow_github')", "performance": {"dataset": "StackOverflow's 10 year archive", "accuracy": "Not provided"}, "description": "BERT-base model pre-trained on 152 million sentences from the StackOverflow's 10 year archive. It can be used for code and named entity recognition in StackOverflow.", "model_name": "lanwuwei/BERTOverflow_stackoverflow_github"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/tapex-large-sql-execution", "api_call": "BartForConditionalGeneration.from_pretrained('microsoft/tapex-large-sql-execution')", "performance": {"dataset": "synthetic corpus", "accuracy": "not specified"}, "description": "TAPEX (Table Pre-training via Execution) is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries. TAPEX is based on the BART architecture, the transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder.", "model_name": "microsoft/tapex-large-sql-execution"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Image-to-Image", "api_name": "lllyasviel/sd-controlnet-canny", "api_call": "ControlNetModel.from_pretrained('lllyasviel/sd-controlnet-canny')", "performance": {"dataset": "3M edge-image, caption pairs", "accuracy": "600 GPU-hours with Nvidia A100 80G"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on Canny edges. It can be used in combination with Stable Diffusion.", "model_name": "lllyasviel/sd-controlnet-canny"}
{"domain": "Audio Voice Activity Detection", "framework": "Hugging Face", "functionality": "Voice Activity Detection", "api_name": "Eklavya/ZFF_VAD", "api_call": "pipeline('voice-activity-detection', model='Eklavya/ZFF_VAD')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "A Voice Activity Detection model by Eklavya, using the Hugging Face framework.", "model_name": "Eklavya/ZFF_VAD"}
{"domain": "Audio Voice Activity Detection", "framework": "Hugging Face", "functionality": "Voice Activity Detection", "api_name": "FSMN-VAD", "api_call": "pipeline('voice-activity-detection', model='funasr/FSMN-VAD')", "performance": {"dataset": "", "accuracy": ""}, "description": "FSMN-VAD model for Voice Activity Detection using Hugging Face Transformers library.", "model_name": "funasr/FSMN-VAD"}
{"domain": "Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Language Identification", "api_name": "sanchit-gandhi/whisper-medium-fleurs-lang-id", "api_call": "AutoModelForSpeechClassification.from_pretrained('sanchit-gandhi/whisper-medium-fleurs-lang-id')", "performance": {"dataset": "google/xtreme_s", "accuracy": 0.8805000000000001}, "description": "This model is a fine-tuned version of openai/whisper-medium on the FLEURS subset of the google/xtreme_s dataset. It is used for language identification in audio classification tasks.", "model_name": "sanchit-gandhi/whisper-medium-fleurs-lang-id"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Summarization", "api_name": "google/pegasus-xsum", "api_call": "pipeline('summarization', model='google/pegasus-xsum')", "performance": {"dataset": [{"name": "xsum", "accuracy": {"ROUGE-1": 46.862, "ROUGE-2": 24.453, "ROUGE-L": 39.055, "ROUGE-LSUM": 39.099}}, {"name": "cnn_dailymail", "accuracy": {"ROUGE-1": 22.206, "ROUGE-2": 7.67, "ROUGE-L": 15.405, "ROUGE-LSUM": 19.218}}, {"name": "samsum", "accuracy": {"ROUGE-1": 21.81, "ROUGE-2": 4.253, "ROUGE-L": 17.447, "ROUGE-LSUM": 18.891}}]}, "description": "PEGASUS is a pre-trained model for abstractive summarization, developed by Google. It is based on the Transformer architecture and trained on both C4 and HugeNews datasets. The model is designed to extract gap sentences and generate summaries by stochastically sampling important sentences.", "model_name": "google/pegasus-xsum"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Named Entity Recognition", "api_name": "Davlan/bert-base-multilingual-cased-ner-hrl", "api_call": "AutoModelForTokenClassification.from_pretrained('Davlan/bert-base-multilingual-cased-ner-hrl')", "performance": {"dataset": {"Arabic": "ANERcorp", "German": "conll 2003", "English": "conll 2003", "Spanish": "conll 2002", "French": "Europeana Newspapers", "Italian": "Italian I-CAB", "Latvian": "Latvian NER", "Dutch": "conll 2002", "Portuguese": "Paramopama + Second Harem", "Chinese": "MSRA"}, "accuracy": "Not provided"}, "description": "bert-base-multilingual-cased-ner-hrl is a Named Entity Recognition model for 10 high resourced languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese) based on a fine-tuned mBERT base model. It has been trained to recognize three types of entities: location (LOC), organizations (ORG), and person (PER).", "model_name": "Davlan/bert-base-multilingual-cased-ner-hrl"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "madhurjindal/autonlp-Gibberish-Detector-492513457", "api_call": "AutoModelForSequenceClassification.from_pretrained('madhurjindal/autonlp-Gibberish-Detector-492513457')", "performance": {"dataset": "madhurjindal/autonlp-data-Gibberish-Detector", "accuracy": 0.9735624587}, "description": "A multi-class text classification model for detecting gibberish text. Trained using AutoNLP and DistilBERT.", "model_name": "madhurjindal/autonlp-Gibberish-Detector-492513457"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Text Summarization", "api_name": "Randeng-Pegasus-238M-Summary-Chinese", "api_call": "PegasusForConditionalGeneration.from_pretrained('IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese')", "performance": {"dataset": "LCSTS", "accuracy": {"rouge-1": 43.46, "rouge-2": 29.59, "rouge-L": 39.76}}, "description": "Randeng-Pegasus-238M-Summary-Chinese is a Chinese text summarization model based on Pegasus. It is fine-tuned on 7 Chinese text summarization datasets including education, new2016zh, nlpcc, shence, sohu, thucnews, and weibo. The model can be used to generate summaries for Chinese text inputs.", "model_name": "IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Monocular Depth Estimation", "api_name": "Intel/dpt-large", "api_call": "DPTForDepthEstimation.from_pretrained('Intel/dpt-large')", "performance": {"dataset": "MIX 6", "accuracy": "10.82"}, "description": "Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation. Introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021). DPT uses the Vision Transformer (ViT) as backbone and adds a neck + head on top for monocular depth estimation.", "model_name": "Intel/dpt-large"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "naver-clova-ix/donut-base", "api_call": "AutoModel.from_pretrained('naver-clova-ix/donut-base')", "performance": {"dataset": "arxiv:2111.15664", "accuracy": "Not provided"}, "description": "Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a tensor of embeddings (of shape batch_size, seq_len, hidden_size), after which the decoder autoregressively generates text, conditioned on the encoding of the encoder.", "model_name": "naver-clova-ix/donut-base"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Table Extraction", "api_name": "keremberke/yolov8s-table-extraction", "api_call": "YOLO('keremberke/yolov8s-table-extraction')", "performance": {"dataset": "table-extraction", "accuracy": 0.984}, "description": "A YOLOv8 model for table extraction in documents, capable of detecting bordered and borderless tables. Trained on the table-extraction dataset, the model achieves a mAP@0.5 of 0.984 on the validation set.", "model_name": "keremberke/yolov8s-table-extraction"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "microsoft/resnet-18", "api_call": "ResNetForImageClassification.from_pretrained('microsoft/resnet-18')", "performance": {"dataset": "imagenet-1k"}, "description": "ResNet model trained on imagenet-1k. It was introduced in the paper Deep Residual Learning for Image Recognition and first released in this repository. ResNet introduced residual connections, which make it possible to train networks with a previously unseen number of layers (up to 1000). ResNet won the 2015 ILSVRC & COCO competition, an important milestone in deep computer vision.", "model_name": "microsoft/resnet-18"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "facebook/opt-350m", "api_call": "pipeline('text-generation', model='facebook/opt-350m')", "performance": {"dataset": "BookCorpus, CC-Stories, The Pile, Pushshift.io Reddit, CCNewsV2", "accuracy": "Roughly matches GPT-3 performance"}, "description": "OPT (Open Pre-trained Transformer) is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, developed by Meta AI. It is designed to enable reproducible and responsible research at scale and bring more voices to the table in studying the impact of large language models. The pretrained-only model can be used for prompting for evaluation of downstream tasks as well as text generation. It can also be fine-tuned on a downstream task using the CLM example.", "model_name": "facebook/opt-350m"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "Acrobot-v1", "api_name": "sb3/dqn-Acrobot-v1", "api_call": "load_from_hub(repo_id='sb3/dqn-Acrobot-v1',filename='{MODEL FILENAME}.zip',)", "performance": {"dataset": "Acrobot-v1", "accuracy": "-72.10 +/- 6.44"}, "description": "This is a trained model of a DQN agent playing Acrobot-v1 using the stable-baselines3 library and the RL Zoo. The RL Zoo is a training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.", "model_name": "sb3/dqn-Acrobot-v1"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-Video", "api_name": "chavinlo/TempoFunk", "api_call": "pipeline('text-to-video', model='chavinlo/TempoFunk')", "performance": {"dataset": "", "accuracy": ""}, "description": "A Text-to-Video model using Hugging Face Transformers library. Model is capable of generating video content based on the input text.", "model_name": "chavinlo/TempoFunk"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "glpn-nyu-finetuned-diode-221116-054332", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221116-054332')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.6028, "Rmse": "nan"}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221116-054332"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "ruDialoGpt3-medium-finetuned-telegram", "api_call": "AutoModelForCausalLM.from_pretrained('ruDialoGpt3-medium-finetuned-telegram')", "performance": {"dataset": "Russian forums and Telegram chat", "accuracy": "Not available"}, "description": "DialoGPT trained on Russian language and fine tuned on my telegram chat. This model was created by sberbank-ai and trained on Russian forums. It has been fine-tuned on a 30mb json file of exported telegram chat data.", "model_name": "ruDialoGpt3-medium-finetuned-telegram"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "microsoft/swin-tiny-patch4-window7-224", "api_call": "SwinForImageClassification.from_pretrained('microsoft/swin-tiny-patch4-window7-224')", "performance": {"dataset": "imagenet-1k", "accuracy": "Not specified"}, "description": "Swin Transformer model trained on ImageNet-1k at resolution 224x224. It was introduced in the paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Liu et al. and first released in this repository. The Swin Transformer is a type of Vision Transformer. It builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks.", "model_name": "microsoft/swin-tiny-patch4-window7-224"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "financial-sentiment-analysis", "api_name": "ProsusAI/finbert", "api_call": "AutoModelForSequenceClassification.from_pretrained('ProsusAI/finbert')", "performance": {"dataset": "Financial PhraseBank", "accuracy": "Not provided"}, "description": "FinBERT is a pre-trained NLP model to analyze sentiment of financial text. It is built by further training the BERT language model in the finance domain, using a large financial corpus and thereby fine-tuning it for financial sentiment classification. Financial PhraseBank by Malo et al. (2014) is used for fine-tuning.", "model_name": "ProsusAI/finbert"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "deformable-detr", "api_call": "DeformableDetrForObjectDetection.from_pretrained('SenseTime/deformable-detr')", "performance": {"dataset": "COCO 2017", "accuracy": "Not provided"}, "description": "Deformable DETR model with ResNet-50 backbone trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper Deformable DETR: Deformable Transformers for End-to-End Object Detection by Zhu et al. and first released in this repository.", "model_name": "SenseTime/deformable-detr"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "ControlNet", "api_name": "lllyasviel/control_v11p_sd15_lineart", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_lineart')", "performance": {"dataset": "ControlNet-1-1-preview", "accuracy": "Not provided"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on lineart images.", "model_name": "lllyasviel/control_v11p_sd15_lineart"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Conversational", "api_name": "ingen51/DialoGPT-medium-GPT4", "api_call": "pipeline('conversational', model='ingen51/DialoGPT-medium-GPT4')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "A GPT-4 model for generating conversational responses in a dialogue setting.", "model_name": "ingen51/DialoGPT-medium-GPT4"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "swin-tiny-patch4-window7-224-bottom_cleaned_data", "api_call": "AutoModelForImageClassification.from_pretrained('microsoft/swin-tiny-patch4-window7-224-bottom_cleaned_data')", "performance": {"dataset": "imagefolder", "accuracy": 0.9726}, "description": "This model is a fine-tuned version of microsoft/swin-tiny-patch4-window7-224 on the imagefolder dataset.", "model_name": "microsoft/swin-tiny-patch4-window7-224-bottom_cleaned_data"}
{"domain": "Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "mazkooleg/0-9up-data2vec-audio-base-960h-ft", "api_call": "pipeline('audio-classification', model='mazkooleg/0-9up-data2vec-audio-base-960h-ft')", "performance": {"dataset": "None", "accuracy": 0.9967}, "description": "This model is a fine-tuned version of facebook/data2vec-audio-base-960h on the None dataset.", "model_name": "mazkooleg/0-9up-data2vec-audio-base-960h-ft"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "vision-encoder-decoder", "api_name": "naver-clova-ix/donut-base-finetuned-docvqa", "api_call": "pipeline('document-question-answering', model='donut-base-finetuned-docvqa')", "performance": {"dataset": "DocVQA", "accuracy": "Not provided"}, "description": "Donut model fine-tuned on DocVQA. It was introduced in the paper OCR-free Document Understanding Transformer by Geewok et al. and first released in this repository. Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a tensor of embeddings (of shape batch_size, seq_len, hidden_size), after which the decoder autoregressively generates text, conditioned on the encoding of the encoder.", "model_name": "donut-base-finetuned-docvqa"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/LaBSE", "api_call": "SentenceTransformer('sentence-transformers/LaBSE')", "performance": {"dataset": "Sentence Embeddings Benchmark", "accuracy": "https://seb.sbert.net"}, "description": "This is a port of the LaBSE model to PyTorch. It can be used to map 109 languages to a shared vector space.", "model_name": "sentence-transformers/LaBSE"}
{"domain": "Natural Language Processing Summarization", "framework": "Transformers", "functionality": "Code Documentation Generation", "api_name": "code_trans_t5_base_code_documentation_generation_python", "api_call": "AutoModelWithLMHead.from_pretrained('SEBIS/code_trans_t5_base_code_documentation_generation_python')", "performance": {"dataset": "CodeSearchNet Corpus python dataset", "accuracy": "20.26 BLEU score"}, "description": "This CodeTrans model is based on the t5-base model and is trained on tokenized python code functions. It can be used to generate descriptions for python functions or be fine-tuned on other python code tasks. The model works best with tokenized python functions but can also be used on unparsed and untokenized python code.", "model_name": "SEBIS/code_trans_t5_base_code_documentation_generation_python"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "lakahaga/novel_reading_tts", "api_call": "AutoModelForTTS.from_pretrained('lakahaga/novel_reading_tts')", "performance": {"dataset": "novelspeech", "accuracy": null}, "description": "This model was trained by lakahaga using novelspeech recipe in espnet. It is designed for Korean text-to-speech tasks.", "model_name": "lakahaga/novel_reading_tts"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Table Question Answering", "api_name": "lysandre/tapas-temporary-repo", "api_call": "TapasForQuestionAnswering.from_pretrained('lysandre/tapas-temporary-repo')", "performance": {"dataset": "SQA", "accuracy": "Not provided"}, "description": "TAPAS base model fine-tuned on Sequential Question Answering (SQA). This model is pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion and can be used for answering questions related to a table in a conversational set-up.", "model_name": "lysandre/tapas-temporary-repo"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-small-finetuned-wikisql-supervised", "api_call": "TapasForQuestionAnswering.from_pretrained('google/tapas-small-finetuned-wikisql-supervised')", "performance": {"dataset": "wikisql", "accuracy": "Not specified"}, "description": "TAPAS is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion. This model is fine-tuned on WikiSQL and can be used for answering questions related to a table.", "model_name": "google/tapas-small-finetuned-wikisql-supervised"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "glpn-nyu-finetuned-diode-221116-062619", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221116-062619')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.548, "Rmse": "nan"}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221116-062619"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Embeddings", "api_name": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", "api_call": "SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Automated evaluation"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "functionality": "Fill-Mask", "api_name": "microsoft/deberta-base", "api_call": "DebertaModel.from_pretrained('microsoft/deberta-base')", "performance": {"dataset": {"SQuAD 1.1": "93.1/87.2", "SQuAD 2.0": "86.2/83.1", "MNLI-m": "88.8"}}, "description": "DeBERTa improves the BERT and RoBERTa models using disentangled attention and enhanced mask decoder. It outperforms BERT and RoBERTa on majority of NLU tasks with 80GB training data.", "model_name": "microsoft/deberta-base"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "git-large-textcaps", "api_call": "AutoModelForCausalLM.from_pretrained('microsoft/git-large-textcaps')", "performance": {"dataset": "TextCaps", "accuracy": "Refer to the paper"}, "description": "GIT (short for GenerativeImage2Text) model, large-sized version, fine-tuned on TextCaps. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. The model is trained using 'teacher forcing' on a lot of (image, text) pairs. The goal for the model is simply to predict the next text token, giving the image tokens and previous text tokens. This allows the model to be used for tasks like image and video captioning, visual question answering (VQA) on images and videos, and even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).", "model_name": "microsoft/git-large-textcaps"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "fastspeech2-en-ljspeech", "api_call": "'TTSHubInterface.get_prediction('facebook/fastspeech2-en-ljspeech')'", "performance": {"dataset": "LJSpeech", "accuracy": "N/A"}, "description": "FastSpeech 2 text-to-speech model from fairseq S^2. English single-speaker female voice trained on LJSpeech.", "model_name": "facebook/fastspeech2-en-ljspeech"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Sentiment Analysis", "api_name": "finiteautomata/beto-sentiment-analysis", "api_call": "pipeline('sentiment-analysis', model='finiteautomata/beto-sentiment-analysis')", "performance": {"dataset": "TASS 2020 corpus", "accuracy": ""}, "description": "Model trained with TASS 2020 corpus (around ~5k tweets) of several dialects of Spanish. Base model is BETO, a BERT model trained in Spanish. Uses POS, NEG, NEU labels.", "model_name": "finiteautomata/beto-sentiment-analysis"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Audio Classification", "api_name": "ast-finetuned-audioset-10-10-0.4593", "api_call": "pipeline('audio-classification', model='MIT/ast-finetuned-audioset-10-10-0.4593')", "performance": {"dataset": "AudioSet", "accuracy": ""}, "description": "Audio Spectrogram Transformer (AST) model fine-tuned on AudioSet. It was introduced in the paper AST: Audio Spectrogram Transformer by Gong et al. and first released in this repository. The Audio Spectrogram Transformer is equivalent to ViT, but applied on audio. Audio is first turned into an image (as a spectrogram), after which a Vision Transformer is applied. The model gets state-of-the-art results on several audio classification benchmarks.", "model_name": "MIT/ast-finetuned-audioset-10-10-0.4593"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Grammar Synthesis", "api_name": "pszemraj/flan-t5-large-grammar-synthesis", "api_call": "pipeline('text2text-generation', 'pszemraj/flan-t5-large-grammar-synthesis')", "performance": {"dataset": "jfleg", "accuracy": "Not provided"}, "description": "A fine-tuned version of google/flan-t5-large for grammar correction on an expanded version of the JFLEG dataset.", "model_name": "pszemraj/flan-t5-large-grammar-synthesis"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "prithivida/parrot_adequacy_model", "api_call": "pipeline('text-classification', model='prithivida/parrot_adequacy_model')", "performance": {"dataset": "", "accuracy": ""}, "description": "Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate training NLU models. This model is an ancillary model for Parrot paraphraser.", "model_name": "prithivida/parrot_adequacy_model"}
{"domain": "Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Language Identification", "api_name": "lang-id-voxlingua107-ecapa", "api_call": "EncoderClassifier.from_hparams(source='speechbrain/lang-id-voxlingua107-ecapa', savedir='/tmp')", "performance": {"dataset": "VoxLingua107 development dataset", "accuracy": "93.3%"}, "description": "This is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain. The model uses the ECAPA-TDNN architecture that has previously been used for speaker recognition. It covers 107 different languages.", "model_name": "speechbrain/lang-id-voxlingua107-ecapa"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "Recognai/bert-base-spanish-wwm-cased-xnli", "api_call": "AutoModelForSequenceClassification.from_pretrained('Recognai/bert-base-spanish-wwm-cased-xnli')", "performance": {"dataset": "XNLI-es", "accuracy": "79.9%"}, "description": "This model is a fine-tuned version of the spanish BERT model with the Spanish portion of the XNLI dataset. You can have a look at the training script for details of the training.", "model_name": "Recognai/bert-base-spanish-wwm-cased-xnli"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Text-to-Text Generation", "api_name": "philschmid/bart-large-cnn-samsum", "api_call": "pipeline('summarization', model='philschmid/bart-large-cnn-samsum')", "performance": {"dataset": "samsum", "accuracy": {"eval_rouge1": 42.621, "eval_rouge2": 21.9825, "eval_rougeL": 33.034, "eval_rougeLsum": 39.6783, "test_rouge1": 41.3174, "test_rouge2": 20.8716, "test_rougeL": 32.1337, "test_rougeLsum": 38.4149}}, "description": "philschmid/bart-large-cnn-samsum is a BART-based model trained for text summarization on the SAMSum dataset. It can be used to generate abstractive summaries of conversations.", "model_name": "philschmid/bart-large-cnn-samsum"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "keremberke/yolov8m-hard-hat-detection", "api_call": "YOLO('keremberke/yolov8m-hard-hat-detection')", "performance": {"dataset": "hard-hat-detection", "accuracy": 0.811}, "description": "A YOLOv8 model for detecting hard hats in images. The model can distinguish between 'Hardhat' and 'NO-Hardhat' classes. It can be used to ensure safety compliance in construction sites or other industrial environments where hard hats are required.", "model_name": "keremberke/yolov8m-hard-hat-detection"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "facebook/convnext-tiny-224", "api_call": "ConvNextForImageClassification.from_pretrained('facebook/convnext-tiny-224')", "performance": {"dataset": "imagenet-1k", "accuracy": "Not specified"}, "description": "ConvNeXT is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, that claims to outperform them. It is trained on ImageNet-1k at resolution 224x224 and can be used for image classification.", "model_name": "facebook/convnext-tiny-224"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "git-large-textvqa", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('microsoft/git-large-textvqa')", "performance": {"dataset": "TextVQA", "accuracy": "See table 11 in the paper for more details."}, "description": "GIT (short for GenerativeImage2Text) model, large-sized version, fine-tuned on TextVQA. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. The model is trained using 'teacher forcing' on a lot of (image, text) pairs. The goal for the model is simply to predict the next text token, giving the image tokens and previous text tokens. This allows the model to be used for tasks like: image and video captioning, visual question answering (VQA) on images and videos, and even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).", "model_name": "microsoft/git-large-textvqa"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "MCG-NJU/videomae-base-finetuned-ssv2", "api_call": "VideoMAEForVideoClassification.from_pretrained('MCG-NJU/videomae-base-finetuned-ssv2')", "performance": {"dataset": "Something-Something-v2", "accuracy": {"top-1": 70.6, "top-5": 92.6}}, "description": "VideoMAE model pre-trained for 2400 epochs in a self-supervised way and fine-tuned in a supervised way on Something-Something-v2. It was introduced in the paper VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Tong et al. and first released in this repository.", "model_name": "MCG-NJU/videomae-base-finetuned-ssv2"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-en-ROMANCE", "api_call": "pipeline('translation_en_to_ROMANCE', model='Helsinki-NLP/opus-mt-en-ROMANCE')", "performance": {"dataset": "opus", "accuracy": {"BLEU": 50.1, "chr-F": 0.6930000000000001}}, "description": "A translation model trained on the OPUS dataset that supports translation between English and various Romance languages. It uses a transformer architecture and requires a sentence initial language token in the form of >>id<< (id = valid target language ID).", "model_name": "Helsinki-NLP/opus-mt-en-ROMANCE"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Diffusion-based text-to-image generation", "api_name": "lllyasviel/control_v11p_sd15_seg", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_seg')", "performance": {"dataset": "COCO", "accuracy": "Not specified"}, "description": "ControlNet v1.1 is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on seg images. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5.", "model_name": "lllyasviel/control_v11p_sd15_seg"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "michellejieli/emotion_text_classifier", "api_call": "pipeline('sentiment-analysis', model='michellejieli/emotion_text_classifier')", "performance": {"dataset": ["Crowdflower (2016)", "Emotion Dataset, Elvis et al. (2018)", "GoEmotions, Demszky et al. (2020)", "ISEAR, Vikash (2018)", "MELD, Poria et al. (2019)", "SemEval-2018, EI-reg, Mohammad et al. (2018)", "Emotion Lines (Friends)"], "accuracy": "Not provided"}, "description": "DistilRoBERTa-base is a transformer model that performs sentiment analysis. I fine-tuned the model on transcripts from the Friends show with the goal of classifying emotions from text data, specifically dialogue from Netflix shows or movies. The model predicts 6 Ekman emotions and a neutral class. These emotions include anger, disgust, fear, joy, neutrality, sadness, and surprise.", "model_name": "michellejieli/emotion_text_classifier"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "glpn-nyu-finetuned-diode", "api_call": "pipeline('depth-estimation', model='sayakpaul/glpn-nyu-finetuned-diode')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.4359, "Rmse": 0.42760000000000004}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode"}
{"domain": "Multimodal Graph Machine Learning", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "graphormer-base-pcqm4mv1", "api_call": "AutoModel.from_pretrained('graphormer-base-pcqm4mv1')", "performance": {"dataset": "PCQM4M-LSC", "accuracy": "1st place on the KDD CUP 2021 (quantum prediction track)"}, "description": "The Graphormer is a graph Transformer model, pretrained on PCQM4M-LSC, and which got 1st place on the KDD CUP 2021 (quantum prediction track). Developed by Microsoft, this model should be used for graph classification tasks or graph representation tasks; the most likely associated task is molecule modeling. It can either be used as such, or finetuned on downstream tasks.", "model_name": "graphormer-base-pcqm4mv1"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Summarization", "api_name": "Einmalumdiewelt/T5-Base_GNAD", "api_call": "pipeline('summarization', model='Einmalumdiewelt/T5-Base_GNAD')", "performance": {"dataset": "unknown", "accuracy": {"Loss": 2.1025, "Rouge1": 27.5357, "Rouge2": 8.5623, "Rougel": 19.1508, "Rougelsum": 23.9029, "Gen Len": 52.7253}}, "description": "This model is a fine-tuned version of Einmalumdiewelt/T5-Base_GNAD on an unknown dataset. It is intended for German text summarization.", "model_name": "Einmalumdiewelt/T5-Base_GNAD"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "lmazzon70/videomae-base-finetuned-kinetics-finetuned-rwf2000-epochs8-batch8-kb", "api_call": "AutoModelForVideoClassification.from_pretrained('lmazzon70/videomae-base-finetuned-kinetics-finetuned-rwf2000-epochs8-batch8-kb')", "performance": {"dataset": "unknown", "accuracy": 0.7298}, "description": "This model is a fine-tuned version of MCG-NJU/videomae-base-finetuned-kinetics on an unknown dataset. It achieves the following results on the evaluation set: Loss: 0.5482, Accuracy: 0.7298.", "model_name": "lmazzon70/videomae-base-finetuned-kinetics-finetuned-rwf2000-epochs8-batch8-kb"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-en-fr", "api_call": "translate('input_text', model='Helsinki-NLP/opus-mt-en-fr')", "performance": {"dataset": "opus", "accuracy": {"BLEU": {"newsdiscussdev2015-enfr.en.fr": 33.8, "newsdiscusstest2015-enfr.en.fr": 40.0, "newssyscomb2009.en.fr": 29.8, "news-test2008.en.fr": 27.5, "newstest2009.en.fr": 29.4, "newstest2010.en.fr": 32.7, "newstest2011.en.fr": 34.3, "newstest2012.en.fr": 31.8, "newstest2013.en.fr": 33.2, "Tatoeba.en.fr": 50.5}}}, "description": "Helsinki-NLP/opus-mt-en-fr is a translation model that translates English text to French using the Hugging Face Transformers library. It is based on the OPUS dataset and uses a transformer-align architecture with normalization and SentencePiece pre-processing.", "model_name": "Helsinki-NLP/opus-mt-en-fr"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Speech Recognition", "api_name": "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese", "api_call": "SpeechRecognitionModel('jonatasgrosman/wav2vec2-large-xlsr-53-portuguese')", "performance": {"dataset": "mozilla-foundation/common_voice_6_0", "accuracy": {"Test WER": 11.31, "Test CER": 3.74, "Test WER (+LM)": 9.01, "Test CER (+LM)": 3.21}}, "description": "Fine-tuned facebook/wav2vec2-large-xlsr-53 on Portuguese using the train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16kHz.", "model_name": "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "keremberke/yolov8s-pcb-defect-segmentation", "api_call": "YOLO('keremberke/yolov8s-pcb-defect-segmentation')", "performance": {"dataset": "pcb-defect-segmentation", "accuracy": {"mAP@0.5(box)": 0.515, "mAP@0.5(mask)": 0.491}}, "description": "YOLOv8s model for PCB defect segmentation. The model is trained to detect and segment PCB defects such as Dry_joint, Incorrect_installation, PCB_damage, and Short_circuit.", "model_name": "keremberke/yolov8s-pcb-defect-segmentation"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Detect Bordered and Borderless tables in documents", "api_name": "TahaDouaji/detr-doc-table-detection", "api_call": "DetrForObjectDetection.from_pretrained('TahaDouaji/detr-doc-table-detection')", "performance": {"dataset": "ICDAR2019 Table Dataset", "accuracy": "Not provided"}, "description": "detr-doc-table-detection is a model trained to detect both Bordered and Borderless tables in documents, based on facebook/detr-resnet-50.", "model_name": "TahaDouaji/detr-doc-table-detection"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/blenderbot-90M", "api_call": "AutoModelForCausalLM.from_pretrained('facebook/blenderbot-90M')", "performance": {"dataset": "blended_skill_talk", "accuracy": "Not provided"}, "description": "BlenderBot-90M is a conversational AI model developed by Facebook AI. It is trained on the Blended Skill Talk dataset and aims to provide engaging and human-like responses in a multi-turn dialogue setting. The model is deprecated, and it is recommended to use the identical model https://huggingface.co/facebook/blenderbot_small-90M instead.", "model_name": "facebook/blenderbot-90M"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Program Synthesis", "api_name": "Salesforce/codegen-2B-multi", "api_call": "AutoModelForCausalLM.from_pretrained('Salesforce/codegen-2B-multi')", "performance": {"dataset": "HumanEval, MTPB"}, "description": "CodeGen is a family of autoregressive language models for program synthesis. The models are originally released in this repository, under 3 pre-training data variants (NL, Multi, Mono) and 4 model size variants (350M, 2B, 6B, 16B). The checkpoint included in this repository is denoted as CodeGen-Multi 2B, where Multi means the model is initialized with CodeGen-NL 2B and further pre-trained on a dataset of multiple programming languages, and 2B refers to the number of trainable parameters.", "model_name": "Salesforce/codegen-2B-multi"}
{"domain": "Reinforcement Learning", "framework": "Unity ML-Agents Library", "functionality": "Train and play SoccerTwos", "api_name": "poca-SoccerTwosv2", "api_call": "mlagents-load-from-hf --repo-id='Raiden-1001/poca-SoccerTwosv2' --local-dir='./downloads'", "performance": {"dataset": "SoccerTwos", "accuracy": "Not provided"}, "description": "A trained model of a poca agent playing SoccerTwos using the Unity ML-Agents Library.", "model_name": "Raiden-1001/poca-SoccerTwosv2"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "LayoutLMX_pt_question_answer_ocrazure_correct_V15_30_03_2023", "api_call": "AutoModelForDocumentQuestionAnswering.from_pretrained('L-oenai/LayoutLMX_pt_question_answer_ocrazure_correct_V15_30_03_2023')", "performance": {"dataset": {}, "accuracy": {}}, "description": "A document question answering model based on LayoutLMv2, which can be used to extract answers from images with text and layout information.", "model_name": "L-oenai/LayoutLMX_pt_question_answer_ocrazure_correct_V15_30_03_2023"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "kan-bayashi_ljspeech_joint_finetune_conformer_fastspeech2_hifigan", "api_call": "Text2Speech.from_pretrained('espnet/kan-bayashi_ljspeech_joint_finetune_conformer_fastspeech2_hifigan')", "performance": {"dataset": "LJSpeech", "accuracy": ""}, "description": "A pretrained Text-to-Speech model based on the ESPnet framework, fine-tuned on the LJSpeech dataset. This model is capable of converting text input into synthesized speech.", "model_name": "espnet/kan-bayashi_ljspeech_joint_finetune_conformer_fastspeech2_hifigan"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Denoising Diffusion Probabilistic Models (DDPM)", "api_name": "google/ddpm-ema-cat-256", "api_call": "DDPMPipeline.from_pretrained('google/ddpm-ema-cat-256')", "performance": {"dataset": "CIFAR10", "accuracy": {"Inception_score": 9.46, "FID_score": 3.17}}, "description": "Denoising Diffusion Probabilistic Models (DDPM) is a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. It can generate high-quality images, and supports different noise schedulers such as scheduling_ddpm, scheduling_ddim, and scheduling_pndm. On the unconditional CIFAR10 dataset, it achieves an Inception score of 9.46 and a state-of-the-art FID score of 3.17.", "model_name": "google/ddpm-ema-cat-256"}
{"domain": "Audio Automatic Speech Recognition", "framework": "PyTorch Transformers", "functionality": "Automatic Speech Recognition", "api_name": "data2vec-audio-base-960h", "api_call": "Data2VecForCTC.from_pretrained('facebook/data2vec-audio-base-960h')", "performance": {"dataset": "librispeech_asr", "accuracy": {"clean": 2.77, "other": 7.08}}, "description": "Facebook's Data2Vec-Audio-Base-960h model is an Automatic Speech Recognition model pretrained and fine-tuned on 960 hours of Librispeech on 16kHz sampled speech audio. It can be used for transcribing audio files and achieves competitive performance on major benchmarks of speech recognition. The model is based on the Data2Vec framework which uses the same learning method for either speech, NLP, or computer vision.", "model_name": "facebook/data2vec-audio-base-960h"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "keremberke/yolov8n-csgo-player-detection", "api_call": "YOLO('keremberke/yolov8n-csgo-player-detection')", "performance": {"dataset": "csgo-object-detection", "accuracy": 0.844}, "description": "A YOLOv8 model for detecting Counter-Strike: Global Offensive (CS:GO) players with supported labels: ['ct', 'cthead', 't', 'thead'].", "model_name": "keremberke/yolov8n-csgo-player-detection"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Named Entity Recognition", "api_name": "dslim/bert-base-NER", "api_call": "AutoModelForTokenClassification.from_pretrained('dslim/bert-base-NER')", "performance": {"dataset": "conll2003", "accuracy": {"f1": 91.3, "precision": 90.7, "recall": 91.9}}, "description": "bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance for the NER task. It has been trained to recognize four types of entities: location (LOC), organizations (ORG), person (PER) and Miscellaneous (MISC). Specifically, this model is a bert-base-cased model that was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.", "model_name": "dslim/bert-base-NER"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "keremberke/yolov8n-pcb-defect-segmentation", "api_call": "YOLO('keremberke/yolov8n-pcb-defect-segmentation')", "performance": {"dataset": "pcb-defect-segmentation", "accuracy": {"mAP@0.5(box)": 0.512, "mAP@0.5(mask)": 0.517}}, "description": "A YOLOv8 model for detecting and segmenting PCB defects such as Dry_joint, Incorrect_installation, PCB_damage, and Short_circuit.", "model_name": "keremberke/yolov8n-pcb-defect-segmentation"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "google/ncsnpp-celebahq-256", "api_call": "DiffusionPipeline.from_pretrained('google/ncsnpp-celebahq-256')", "performance": {"dataset": "CIFAR-10", "accuracy": {"Inception_score": 9.89, "FID": 2.2, "likelihood": 2.99}}, "description": "Score-Based Generative Modeling through Stochastic Differential Equations (SDE) for unconditional image generation. This model achieves record-breaking performance on CIFAR-10 and demonstrates high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.", "model_name": "google/ncsnpp-celebahq-256"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "typeform/mobilebert-uncased-mnli", "api_call": "AutoModelForSequenceClassification.from_pretrained('typeform/mobilebert-uncased-mnli')", "performance": {"dataset": "multi_nli", "accuracy": "More information needed"}, "description": "This model is the Multi-Genre Natural Language Inference (MNLI) fine-tuned version of the uncased MobileBERT model. It can be used for the task of zero-shot classification.", "model_name": "typeform/mobilebert-uncased-mnli"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Emotion Classification", "api_name": "j-hartmann/emotion-english-distilroberta-base", "api_call": "pipeline('text-classification', model='j-hartmann/emotion-english-distilroberta-base', return_all_scores=True)", "performance": {"dataset": "Balanced subset from 6 diverse datasets", "accuracy": "66%"}, "description": "This model classifies emotions in English text data. It predicts Ekman's 6 basic emotions, plus a neutral class: anger, disgust, fear, joy, neutral, sadness, and surprise. The model is a fine-tuned checkpoint of DistilRoBERTa-base.", "model_name": "j-hartmann/emotion-english-distilroberta-base"}
{"domain": "Tabular Tabular Classification", "framework": "Hugging Face", "functionality": "Binary Classification", "api_name": "harithapliyal/autotrain-tatanic-survival-51030121311", "api_call": "AutoModel.from_pretrained('harithapliyal/autotrain-tatanic-survival-51030121311')", "performance": {"dataset": "harithapliyal/autotrain-data-tatanic-survival", "accuracy": 0.872}, "description": "A tabular classification model trained on the Titanic survival dataset using Hugging Face AutoTrain. The model predicts whether a passenger survived or not based on features such as age, gender, and passenger class.", "model_name": "harithapliyal/autotrain-tatanic-survival-51030121311"}
{"domain": "Audio Audio Classification", "framework": "SpeechBrain", "functionality": "Emotion Recognition", "api_name": "speechbrain/emotion-recognition-wav2vec2-IEMOCAP", "api_call": "foreign_class(source='speechbrain/emotion-recognition-wav2vec2-IEMOCAP', pymodule_file='custom_interface.py', classname='CustomEncoderWav2vec2Classifier')", "performance": {"dataset": "IEMOCAP", "accuracy": "78.7%"}, "description": "This repository provides all the necessary tools to perform emotion recognition with a fine-tuned wav2vec2 (base) model using SpeechBrain. It is trained on IEMOCAP training data.", "model_name": "speechbrain/emotion-recognition-wav2vec2-IEMOCAP"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Sentiment Analysis", "api_name": "finiteautomata/bertweet-base-sentiment-analysis", "api_call": "pipeline('text-classification', model='finiteautomata/bertweet-base-sentiment-analysis')", "performance": {"dataset": "SemEval 2017", "accuracy": null}, "description": "Model trained with SemEval 2017 corpus (around 40k tweets). Base model is BERTweet, a RoBERTa model trained on English tweets. Uses POS, NEG, NEU labels.", "model_name": "finiteautomata/bertweet-base-sentiment-analysis"}
{"domain": "Natural Language Processing Text Generation", "framework": "Transformers", "functionality": "Text Generation", "api_name": "EleutherAI/gpt-j-6B", "api_call": "AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-j-6B')", "performance": {"dataset": "the_pile", "accuracy": {"LAMBADA_PPL": 3.99, "LAMBADA_Acc": "69.7%", "Winogrande": "65.3%", "Hellaswag": "66.1%", "PIQA": "76.5%"}}, "description": "GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. It consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3. GPT-J 6B was trained on the Pile, a large-scale curated dataset created by EleutherAI.", "model_name": "EleutherAI/gpt-j-6B"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Embeddings", "api_name": "sentence-transformers/paraphrase-MiniLM-L3-v2", "api_call": "SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L3-v2')", "performance": {"dataset": "snli, multi_nli, ms_marco", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/paraphrase-MiniLM-L3-v2"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "keremberke/yolov8n-pothole-segmentation", "api_call": "YOLO('keremberke/yolov8n-pothole-segmentation')", "performance": {"dataset": "pothole-segmentation", "accuracy": {"mAP@0.5(box)": 0.995, "mAP@0.5(mask)": 0.995}}, "description": "A YOLOv8 model for pothole segmentation in images. The model is trained on the pothole-segmentation dataset and achieves high accuracy in detecting potholes.", "model_name": "keremberke/yolov8n-pothole-segmentation"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "glpn-nyu-finetuned-diode-230103-091356", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-230103-091356')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.436, "Mae": 0.42510000000000003, "Rmse": 0.6169, "Abs Rel": 0.45, "Log Mae": 0.1721, "Log Rmse": 0.22690000000000002, "Delta1": 0.38280000000000003, "Delta2": 0.6326, "Delta3": 0.8051}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset. It is used for depth estimation in computer vision tasks.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-230103-091356"}
{"domain": "Reinforcement Learning", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "edbeeching/decision-transformer-gym-walker2d-expert", "api_call": "AutoModel.from_pretrained('edbeeching/decision-transformer-gym-walker2d-expert')", "performance": {"dataset": "Gym Walker2d environment", "accuracy": "Not provided"}, "description": "Decision Transformer model trained on expert trajectories sampled from the Gym Walker2d environment.", "model_name": "edbeeching/decision-transformer-gym-walker2d-expert"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "MCG-NJU/videomae-base-short", "api_call": "VideoMAEForPreTraining.from_pretrained('MCG-NJU/videomae-base-short')", "performance": {"dataset": "Kinetics-400", "accuracy": "Not provided"}, "description": "VideoMAE is an extension of Masked Autoencoders (MAE) to video. The architecture of the model is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches. Videos are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds fixed sine/cosine position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of videos that can then be used to extract features useful for downstream tasks.", "model_name": "MCG-NJU/videomae-base-short"}
{"domain": "Tabular Tabular Classification", "framework": "Scikit-learn", "functionality": "Wine Quality classification", "api_name": "julien-c/wine-quality", "api_call": "joblib.load(cached_download(hf_hub_url('julien-c/wine-quality', 'sklearn_model.joblib')))", "performance": {"dataset": "julien-c/wine-quality", "accuracy": 0.6616635397}, "description": "A Simple Example of Scikit-learn Pipeline for Wine Quality classification. Inspired by https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976 by Saptashwa Bhattacharyya.", "model_name": "julien-c/wine-quality"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Sentiment Analysis", "api_name": "bert-base-multilingual-uncased-sentiment", "api_call": "pipeline('sentiment-analysis', model='nlptown/bert-base-multilingual-uncased-sentiment')", "performance": {"dataset": [{"language": "English", "accuracy": {"exact": "67%", "off-by-1": "95%"}}, {"language": "Dutch", "accuracy": {"exact": "57%", "off-by-1": "93%"}}, {"language": "German", "accuracy": {"exact": "61%", "off-by-1": "94%"}}, {"language": "French", "accuracy": {"exact": "59%", "off-by-1": "94%"}}, {"language": "Italian", "accuracy": {"exact": "59%", "off-by-1": "95%"}}, {"language": "Spanish", "accuracy": {"exact": "58%", "off-by-1": "95%"}}]}, "description": "This is a bert-base-multilingual-uncased model finetuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5).", "model_name": "nlptown/bert-base-multilingual-uncased-sentiment"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "facebook/convnext-large-224", "api_call": "ConvNextForImageClassification.from_pretrained('facebook/convnext-large-224')", "performance": {"dataset": "imagenet-1k", "accuracy": "Not specified"}, "description": "ConvNeXT is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, that claims to outperform them. The authors started from a ResNet and 'modernized' its design by taking the Swin Transformer as inspiration.", "model_name": "facebook/convnext-large-224"}
{"domain": "Audio Voice Activity Detection", "framework": "Hugging Face Transformers", "functionality": "Voice Activity Detection", "api_name": "popcornell/pyannote-segmentation-chime6-mixer6", "api_call": "Model.from_pretrained('popcornell/pyannote-segmentation-chime6-mixer6')", "performance": {"dataset": "ami", "accuracy": "N/A"}, "description": "Pyannote Segmentation model fine-tuned on data from CHiME-7 DASR Challenge. Used to perform diarization in the CHiME-7 DASR diarization baseline.", "model_name": "popcornell/pyannote-segmentation-chime6-mixer6"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/table-transformer-structure-recognition", "api_call": "pipeline('object-detection', model='microsoft/table-transformer-structure-recognition')", "performance": {"dataset": "PubTables1M", "accuracy": ""}, "description": "Table Transformer (DETR) model trained on PubTables1M for detecting the structure (like rows, columns) in tables.", "model_name": "microsoft/table-transformer-structure-recognition"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Part-of-Speech Tagging", "api_name": "flair/pos-english", "api_call": "SequenceTagger.load('flair/pos-english')", "performance": {"dataset": "Ontonotes", "accuracy": "98.19"}, "description": "This is the standard part-of-speech tagging model for English that ships with Flair. It predicts fine-grained POS tags based on Flair embeddings and LSTM-CRF.", "model_name": "flair/pos-english"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "bigscience/test-bloomd-6b3", "api_call": "pipeline('text-generation', model='bigscience/test-bloomd-6b3')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "A text generation model from Hugging Face, using the bigscience/test-bloomd-6b3 architecture. It can be used for generating text based on a given input.", "model_name": "bigscience/test-bloomd-6b3"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "chinese-clip-vit-large-patch14", "api_call": "ChineseCLIPModel.from_pretrained('OFA-Sys/chinese-clip-vit-large-patch14')", "performance": {"dataset": "MUGE Text-to-Image Retrieval, Flickr30K-CN Retrieval, COCO-CN Retrieval, CIFAR10, CIFAR100, DTD, EuroSAT, FER, FGV, KITTI, MNIST, PASCAL VOC", "accuracy": "Varies depending on the dataset"}, "description": "Chinese-CLIP-ViT-Large-Patch14 is a large version of the Chinese CLIP model, with ViT-L/14 as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. It is designed for zero-shot image classification tasks.", "model_name": "OFA-Sys/chinese-clip-vit-large-patch14"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Detect GPT-2 generated text", "api_name": "roberta-base-openai-detector", "api_call": "pipeline('text-classification', model='roberta-base-openai-detector')", "performance": {"dataset": "WebText", "accuracy": "95%"}, "description": "RoBERTa base OpenAI Detector is the GPT-2 output detector model, obtained by fine-tuning a RoBERTa base model with the outputs of the 1.5B-parameter GPT-2 model. The model can be used to predict if text was generated by a GPT-2 model.", "model_name": "roberta-base-openai-detector"}
{"domain": "Natural Language Processing Text Generation", "framework": "Transformers", "functionality": "Text Generation", "api_name": "facebook/opt-6.7b", "api_call": "AutoModelForCausalLM.from_pretrained('facebook/opt-6.7b', torch_dtype=torch.float16)", "performance": {"dataset": {"BookCorpus": "unknown", "CC-Stories": "unknown", "The Pile": "unknown", "Pushshift.io Reddit": "unknown", "CCNewsV2": "unknown"}, "accuracy": "unknown"}, "description": "OPT (Open Pre-trained Transformer Language Models) is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. It was trained on a large corpus of text, predominantly in English, using a causal language modeling (CLM) objective. The model can be used for prompting for evaluation of downstream tasks, text generation, and fine-tuning on a downstream task using the CLM example.", "model_name": "facebook/opt-6.7b"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "functionality": "Masked Language Modeling", "api_name": "roberta-base", "api_call": "pipeline('fill-mask', model='roberta-base')", "performance": {"dataset": [{"name": "MNLI", "accuracy": 87.6}, {"name": "QQP", "accuracy": 91.9}, {"name": "QNLI", "accuracy": 92.8}, {"name": "SST-2", "accuracy": 94.8}, {"name": "CoLA", "accuracy": 63.6}, {"name": "STS-B", "accuracy": 91.2}, {"name": "MRPC", "accuracy": 90.2}, {"name": "RTE", "accuracy": 78.7}]}, "description": "RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion using the Masked language modeling (MLM) objective. This model is case-sensitive and can be fine-tuned on a downstream task.", "model_name": "roberta-base"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "sentence-transformers/distilbert-base-nli-mean-tokens", "api_call": "SentenceTransformer('sentence-transformers/distilbert-base-nli-mean-tokens')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/distilbert-base-nli-mean-tokens"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "git-large-coco", "api_call": "GenerativeImage2TextModel.from_pretrained('microsoft/git-large-coco')", "performance": {"dataset": "COCO", "accuracy": "See table 11 in the paper for more details."}, "description": "GIT (short for GenerativeImage2Text) model, large-sized version, fine-tuned on COCO. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository.", "model_name": "microsoft/git-large-coco"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Diffusers", "api_name": "myunus1/diffmodels_galaxies_scratchbook", "api_call": "DDPMPipeline.from_pretrained('myunus1/diffmodels_galaxies_scratchbook')", "performance": {"dataset": "Not provided", "accuracy": "Not provided"}, "description": "This model is a diffusion model for unconditional image generation of cute \ud83e\udd8b.", "model_name": "myunus1/diffmodels_galaxies_scratchbook"}
{"domain": "Natural Language Processing Text Generation", "framework": "Transformers", "functionality": "Text Generation", "api_name": "distilgpt2", "api_call": "pipeline('text-generation', model='distilgpt2')", "performance": {"dataset": "WikiText-103", "accuracy": "21.100"}, "description": "DistilGPT2 is an English-language model pre-trained with the supervision of the 124 million parameter version of GPT-2. With 82 million parameters, it was developed using knowledge distillation and designed to be a faster, lighter version of GPT-2. It can be used for text generation, writing assistance, creative writing, entertainment, and more.", "model_name": "distilgpt2"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "microsoft/deberta-v2-xlarge", "api_call": "DebertaModel.from_pretrained('microsoft/deberta-v2-xlarge')", "performance": {"dataset": [{"name": "SQuAD 1.1", "accuracy": "95.8/90.8"}, {"name": "SQuAD 2.0", "accuracy": "91.4/88.9"}, {"name": "MNLI-m/mm", "accuracy": "91.7/91.6"}, {"name": "SST-2", "accuracy": "97.5"}, {"name": "QNLI", "accuracy": "95.8"}, {"name": "CoLA", "accuracy": "71.1"}, {"name": "RTE", "accuracy": "93.9"}, {"name": "MRPC", "accuracy": "92.0/94.2"}, {"name": "QQP", "accuracy": "92.3/89.8"}, {"name": "STS-B", "accuracy": "92.9/92.9"}]}, "description": "DeBERTa improves the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. It outperforms BERT and RoBERTa on the majority of NLU tasks with 80GB training data. This is the DeBERTa V2 xlarge model with 24 layers, 1536 hidden size. The total parameters are 900M and it is trained with 160GB raw data.", "model_name": "microsoft/deberta-v2-xlarge"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "google/vit-base-patch16-384", "api_call": "ViTForImageClassification.from_pretrained('google/vit-base-patch16-384')", "performance": {"dataset": "ImageNet", "accuracy": "Refer to tables 2 and 5 of the original paper"}, "description": "Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.", "model_name": "google/vit-base-patch16-384"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "hf-tiny-model-private/tiny-random-ViltForQuestionAnswering", "api_call": "ViltForQuestionAnswering.from_pretrained('hf-tiny-model-private/tiny-random-ViltForQuestionAnswering')", "performance": {"dataset": "", "accuracy": ""}, "description": "A tiny random model for Visual Question Answering using the VILT framework.", "model_name": "hf-tiny-model-private/tiny-random-ViltForQuestionAnswering"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "layoutlmv2-base-uncased-finetuned-infovqa", "api_call": "AutoModelForDocumentQuestionAnswering.from_pretrained('tiennvcs/layoutlmv2-base-uncased-finetuned-infovqa')", "performance": {"dataset": "unknown", "accuracy": {"Loss": 2.087}}, "description": "This model is a fine-tuned version of microsoft/layoutlmv2-base-uncased on an unknown dataset.", "model_name": "tiennvcs/layoutlmv2-base-uncased-finetuned-infovqa"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "google/pix2struct-base", "api_call": "Pix2StructForConditionalGeneration.from_pretrained('google/pix2struct-base')", "performance": {"dataset": [{"name": "Documents", "accuracy": "N/A"}, {"name": "Illustrations", "accuracy": "N/A"}, {"name": "User Interfaces", "accuracy": "N/A"}, {"name": "Natural Images", "accuracy": "N/A"}]}, "description": "Pix2Struct is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The model is pretrained by learning to parse masked screenshots of web pages into simplified HTML. It can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.", "model_name": "google/pix2struct-base"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "kredor/punctuate-all", "api_call": "pipeline('token-classification', model='kredor/punctuate-all')", "performance": {"dataset": "multilingual", "accuracy": 0.98}, "description": "A finetuned xlm-roberta-base model for punctuation prediction on twelve languages: English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian.", "model_name": "kredor/punctuate-all"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/multi-qa-mpnet-base-dot-v1", "api_call": "SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')", "performance": {"dataset": [{"name": "WikiAnswers", "accuracy": 77427422}, {"name": "PAQ", "accuracy": 64371441}, {"name": "Stack Exchange", "accuracy": 25316456}]}, "description": "This is a sentence-transformers model that maps sentences & paragraphs to a 768 dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.", "model_name": "sentence-transformers/multi-qa-mpnet-base-dot-v1"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "facebook/dino-vitb16", "api_call": "ViTModel.from_pretrained('facebook/dino-vitb16')", "performance": {"dataset": "imagenet-1k", "accuracy": "Not provided"}, "description": "Vision Transformer (ViT) model trained using the DINO method. The model is pretrained on a large collection of images in a self-supervised fashion, namely ImageNet-1k, at a resolution of 224x224 pixels. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. Note that this model does not include any fine-tuned heads.", "model_name": "facebook/dino-vitb16"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Feature Extraction", "api_name": "facebook/dpr-question_encoder-single-nq-base", "api_call": "DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')", "performance": {"dataset": [{"name": "NQ", "accuracy": {"top_20": 78.4, "top_100": 85.4}}, {"name": "TriviaQA", "accuracy": {"top_20": 79.4, "top_100": 85.0}}, {"name": "WQ", "accuracy": {"top_20": 73.2, "top_100": 81.4}}, {"name": "TREC", "accuracy": {"top_20": 79.8, "top_100": 89.1}}, {"name": "SQuAD", "accuracy": {"top_20": 63.2, "top_100": 77.2}}]}, "description": "Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. dpr-question_encoder-single-nq-base is the question encoder trained using the Natural Questions (NQ) dataset (Lee et al., 2019; Kwiatkowski et al., 2019).", "model_name": "facebook/dpr-question_encoder-single-nq-base"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Language Detection", "api_name": "papluca/xlm-roberta-base-language-detection", "api_call": "pipeline('text-classification', model='papluca/xlm-roberta-base-language-detection')", "performance": {"dataset": "Language Identification", "accuracy": 0.996}, "description": "This model is a fine-tuned version of xlm-roberta-base on the Language Identification dataset. It is an XLM-RoBERTa transformer model with a classification head on top, and can be used as a language detector for sequence classification tasks. It supports 20 languages including Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, and Chinese.", "model_name": "papluca/xlm-roberta-base-language-detection"}
{"domain": "Reinforcement Learning Robotics", "framework": "Hugging Face", "functionality": "Inference API", "api_name": "Antheia/Hanna", "api_call": "pipeline('robotics', model='Antheia/Hanna')", "performance": {"dataset": "openai/webgpt_comparisons", "accuracy": ""}, "description": "Antheia/Hanna is a reinforcement learning model for robotics tasks, trained on the openai/webgpt_comparisons dataset.", "model_name": "Antheia/Hanna"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "princeton-nlp/unsup-simcse-roberta-base", "api_call": "AutoModel.from_pretrained('princeton-nlp/unsup-simcse-roberta-base')", "performance": {"dataset": null, "accuracy": null}, "description": "An unsupervised sentence embedding model trained using the SimCSE approach with a Roberta base architecture.", "model_name": "princeton-nlp/unsup-simcse-roberta-base"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Embeddings", "api_name": "sentence-transformers/paraphrase-mpnet-base-v2", "api_call": "SentenceTransformer('sentence-transformers/paraphrase-mpnet-base-v2')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Automated evaluation"}, "description": "This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/paraphrase-mpnet-base-v2"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Image generation and modification based on text prompts", "api_name": "stabilityai/stable-diffusion-2-inpainting", "api_call": "StableDiffusionInpaintPipeline.from_pretrained('stabilityai/stable-diffusion-2-inpainting', torch_dtype=torch.float16)", "performance": {"dataset": "COCO2017 validation set", "accuracy": "Not optimized for FID scores"}, "description": "A Latent Diffusion Model that uses a fixed, pretrained text encoder (OpenCLIP-ViT/H) to generate and modify images based on text prompts.", "model_name": "stabilityai/stable-diffusion-2-inpainting"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-sv-en", "api_call": "AutoModel.from_pretrained('Helsinki-NLP/opus-mt-sv-en')", "performance": {"dataset": "Tatoeba.sv.en", "accuracy": "BLEU: 64.5, chr-F: 0.763"}, "description": "A Swedish to English translation model trained on the OPUS dataset using the transformer-align architecture. Input text is pre-processed with normalization and SentencePiece.", "model_name": "Helsinki-NLP/opus-mt-sv-en"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "ocariz/butterfly_200", "api_call": "DDPMPipeline.from_pretrained('ocariz/butterfly_200')", "performance": {"dataset": "", "accuracy": ""}, "description": "This model is a diffusion model for unconditional image generation of cute butterflies trained for 200 epochs.", "model_name": "ocariz/butterfly_200"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/multi-qa-mpnet-base-cos-v1", "api_call": "SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')", "performance": {"dataset": "215M (question, answer) pairs from diverse sources", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model that maps sentences and paragraphs to a 768 dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.", "model_name": "sentence-transformers/multi-qa-mpnet-base-cos-v1"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Denoising Diffusion Probabilistic Models (DDPM)", "api_name": "google/ddpm-ema-bedroom-256", "api_call": "DDPMPipeline.from_pretrained('google/ddpm-ema-bedroom-256')", "performance": {"dataset": "CIFAR10", "accuracy": {"Inception_score": 9.46, "FID_score": 3.17}}, "description": "Denoising Diffusion Probabilistic Models (DDPM) is a class of latent variable models inspired by nonequilibrium thermodynamics, capable of producing high-quality image synthesis results. The model can use discrete noise schedulers such as scheduling_ddpm, scheduling_ddim, and scheduling_pndm for inference. It obtains an Inception score of 9.46 and a state-of-the-art FID score of 3.17 on the unconditional CIFAR10 dataset.", "model_name": "google/ddpm-ema-bedroom-256"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221121-063504", "api_call": "AutoModelForImageClassification.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221121-063504')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.3533, "Mae": 0.2668, "Rmse": 0.3716, "Abs Rel": 0.3427, "Log Mae": 0.1167, "Log Rmse": 0.1703, "Delta1": 0.5522, "Delta2": 0.8362, "Delta3": 0.9382}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset for depth estimation.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221121-063504"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "dandelin/vilt-b32-finetuned-vqa", "api_call": "ViltForQuestionAnswering.from_pretrained('dandelin/vilt-b32-finetuned-vqa')", "performance": {"dataset": "VQAv2", "accuracy": "to do"}, "description": "Vision-and-Language Transformer (ViLT) model fine-tuned on VQAv2. It was introduced in the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Kim et al. and first released in this repository.", "model_name": "dandelin/vilt-b32-finetuned-vqa"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Blood Cell Detection", "api_name": "keremberke/yolov8n-blood-cell-detection", "api_call": "YOLO('keremberke/yolov8n-blood-cell-detection')", "performance": {"dataset": "blood-cell-object-detection", "accuracy": 0.893}, "description": "This model detects blood cells in images, specifically Platelets, RBC, and WBC. It is based on the YOLOv8 architecture and trained on the blood-cell-object-detection dataset.", "model_name": "keremberke/yolov8n-blood-cell-detection"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "EleutherAI/gpt-neo-2.7B", "api_call": "pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')", "performance": {"dataset": "the_pile", "accuracy": {"Lambada_Acc": "62.22%", "Winogrande": "56.50%", "Hellaswag": "42.73%"}}, "description": "GPT-Neo 2.7B is a transformer model designed using EleutherAI's replication of the GPT-3 architecture. It was trained on the Pile, a large scale curated dataset created by EleutherAI for the purpose of training this model. This model is best suited for generating texts from a prompt and can be used directly with a pipeline for text generation.", "model_name": "EleutherAI/gpt-neo-2.7B"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Speech Enhancement", "api_name": "speechbrain/sepformer-whamr-enhancement", "api_call": "separator.from_hparams(source='speechbrain/sepformer-whamr-enhancement', savedir='pretrained_models/sepformer-whamr-enhancement')", "performance": {"dataset": "WHAMR!", "accuracy": "10.59 dB SI-SNR"}, "description": "This repository provides all the necessary tools to perform speech enhancement (denoising + dereverberation) with a SepFormer model, implemented with SpeechBrain, and pretrained on WHAMR! dataset with 8k sampling frequency, which is basically a version of WSJ0-Mix dataset with environmental noise and reverberation in 8k.", "model_name": "speechbrain/sepformer-whamr-enhancement"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "moussaKam/barthez-orangesum-abstract", "api_call": "BarthezModel.from_pretrained('moussaKam/barthez-orangesum-abstract')", "performance": {"dataset": "orangeSum", "accuracy": ""}, "description": "BARThez model fine-tuned on OrangeSum for abstract generation in French.", "model_name": "moussaKam/barthez-orangesum-abstract"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221122-030603", "api_call": "pipeline('depth-estimation', model='sayakpaul/glpn-nyu-finetuned-diode-221122-030603')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.3597, "Mae": 0.3054, "Rmse": 0.4481, "Abs Rel": 0.3462, "Log Mae": 0.1256, "Log Rmse": 0.1798, "Delta1": 0.5278, "Delta2": 0.8055, "Delta3": 0.9191}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221122-030603"}
{"domain": "Audio Audio-to-Audio", "framework": "Fairseq", "functionality": "speech-to-speech-translation", "api_name": "facebook/textless_sm_en_fr", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/textless_sm_en_fr')", "performance": {"dataset": "", "accuracy": ""}, "description": "This model is a speech-to-speech translation model trained by Facebook. It is designed for translating English speech to French speech.", "model_name": "facebook/textless_sm_en_fr"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text2Text Generation", "api_name": "google/byt5-small", "api_call": "T5ForConditionalGeneration.from_pretrained('google/byt5-small')", "performance": {"dataset": "mc4", "accuracy": "Not provided"}, "description": "ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5. ByT5 was only pre-trained on mC4 excluding any supervised training with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task. ByT5 works especially well on noisy text data, e.g., google/byt5-small significantly outperforms mt5-small on TweetQA.", "model_name": "google/byt5-small"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "AICVTG_What_if_a_machine_could_create_captions_automatically", "api_call": "VisionEncoderDecoderModel.from_pretrained('facebook/mmt-en-de')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "This is an image captioning model trained by Zayn.", "model_name": "facebook/mmt-en-de"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "keremberke/yolov8m-building-segmentation", "api_call": "YOLO('keremberke/yolov8m-building-segmentation')", "performance": {"dataset": "satellite-building-segmentation", "accuracy": {"mAP@0.5(box)": 0.623, "mAP@0.5(mask)": 0.613}}, "description": "A YOLOv8 model for building segmentation in satellite images. It can detect and segment buildings in the input images.", "model_name": "keremberke/yolov8m-building-segmentation"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "CartPole-v1", "api_name": "dqn-CartPole-v1", "api_call": "load_from_hub(repo_id='sb3/dqn-CartPole-v1',filename='{MODEL FILENAME}.zip',)", "performance": {"dataset": "CartPole-v1", "accuracy": "500.00 +/- 0.00"}, "description": "This is a trained model of a DQN agent playing CartPole-v1 using the stable-baselines3 library and the RL Zoo. The RL Zoo is a training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.", "model_name": "sb3/dqn-CartPole-v1"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "johnowhitaker/sd-class-wikiart-from-bedrooms", "api_call": "DDPMPipeline.from_pretrained('johnowhitaker/sd-class-wikiart-from-bedrooms')", "performance": {"dataset": "https://huggingface.co/datasets/huggan/wikiart", "accuracy": "Not provided"}, "description": "This model is a diffusion model initialized from https://huggingface.co/google/ddpm-bedroom-256 and trained for 5000 steps on https://huggingface.co/datasets/huggan/wikiart.", "model_name": "johnowhitaker/sd-class-wikiart-from-bedrooms"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/maskformer-swin-tiny-coco", "api_call": "MaskFormerForInstanceSegmentation.from_pretrained('facebook/maskformer-swin-tiny-coco')", "performance": {"dataset": "COCO panoptic segmentation", "accuracy": "Not provided"}, "description": "MaskFormer model trained on COCO panoptic segmentation (tiny-sized version, Swin backbone). It was introduced in the paper Per-Pixel Classification is Not All You Need for Semantic Segmentation and first released in this repository.", "model_name": "facebook/maskformer-swin-tiny-coco"}
{"domain": "Tabular Tabular Regression", "framework": "Scikit-learn", "functionality": "skops", "api_name": "rajistics/MAPIE-TS-Electricity", "api_call": "RandomForestRegressor(max_depth=10, n_estimators=50, random_state=59)", "performance": {"dataset": "", "accuracy": ""}, "description": "A RandomForestRegressor model for electricity consumption prediction.", "model_name": "RandomForestRegressor(max_depth=10, n_estimators=50, random_state=59)"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "prompthero/openjourney", "api_call": "StableDiffusionPipeline.from_pretrained('prompthero/openjourney', torch_dtype=torch.float16)", "performance": {"dataset": "Midjourney images", "accuracy": "Not specified"}, "description": "Openjourney is an open source Stable Diffusion fine-tuned model on Midjourney images, by PromptHero. It can be used for generating AI art based on text prompts.", "model_name": "prompthero/openjourney"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "facebook/bart-large", "api_call": "BartModel.from_pretrained('facebook/bart-large')", "performance": {"dataset": "arxiv", "accuracy": "Not provided"}, "description": "BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering).", "model_name": "facebook/bart-large"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "convnextv2_huge.fcmae_ft_in1k", "api_call": "timm.create_model('convnextv2_huge.fcmae_ft_in1k', pretrained=True)", "performance": {"dataset": "imagenet-1k", "accuracy": 86.256}, "description": "A ConvNeXt-V2 image classification model. Pretrained with a fully convolutional masked autoencoder framework (FCMAE) and fine-tuned on ImageNet-1k.", "model_name": "timm/convnextv2_huge.fcmae_ft_in1k"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Transformers", "functionality": "Summarization", "api_name": "pszemraj/long-t5-tglobal-base-16384-book-summary", "api_call": "T5ForConditionalGeneration.from_pretrained('pszemraj/long-t5-tglobal-base-16384-book-summary')", "performance": {"dataset": "kmfoda/booksum", "accuracy": {"ROUGE-1": 36.408, "ROUGE-2": 6.065, "ROUGE-L": 16.721, "ROUGE-LSUM": 33.34}}, "description": "A fine-tuned version of google/long-t5-tglobal-base on the kmfoda/booksum dataset, which can be used to summarize long text and generate SparkNotes-esque summaries of arbitrary topics. The model generalizes reasonably well to academic and narrative text.", "model_name": "pszemraj/long-t5-tglobal-base-16384-book-summary"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "dreamlike-art/dreamlike-diffusion-1.0", "api_call": "StableDiffusionPipeline.from_pretrained('dreamlike-art/dreamlike-diffusion-1.0', torch_dtype=torch.float16)", "performance": {"dataset": "high quality art", "accuracy": "not provided"}, "description": "Dreamlike Diffusion 1.0 is SD 1.5 fine tuned on high quality art, made by dreamlike.art.", "model_name": "dreamlike-art/dreamlike-diffusion-1.0"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Transformers", "functionality": "Image Super-Resolution", "api_name": "caidas/swin2SR-classical-sr-x2-64", "api_call": "Swin2SRForImageSuperResolution.from_pretrained('caidas/swin2sr-classical-sr-x2-64')", "performance": {"dataset": "arxiv: 2209.11345", "accuracy": "Not provided"}, "description": "Swin2SR model that upscales images x2. It was introduced in the paper Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration by Conde et al. and first released in this repository.", "model_name": "caidas/swin2sr-classical-sr-x2-64"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "videomae-small-finetuned-kinetics", "api_call": "VideoMAEForVideoClassification.from_pretrained('MCG-NJU/videomae-small-finetuned-kinetics')", "performance": {"dataset": "Kinetics-400", "accuracy": {"top-1": 79.0, "top-5": 93.8}}, "description": "VideoMAE model pre-trained for 1600 epochs in a self-supervised way and fine-tuned in a supervised way on Kinetics-400. It was introduced in the paper VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Tong et al. and first released in this repository.", "model_name": "MCG-NJU/videomae-small-finetuned-kinetics"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "lmazzon70/videomae-base-finetuned-kinetics-finetuned-rwf2000mp4-epochs8-batch8-kb", "api_call": "AutoModelForVideoClassification.from_pretrained('lmazzon70/videomae-base-finetuned-kinetics-finetuned-rwf2000mp4-epochs8-batch8-kb')", "performance": {"dataset": "unknown", "accuracy": 0.7453}, "description": "This model is a fine-tuned version of MCG-NJU/videomae-base-finetuned-kinetics on an unknown dataset.", "model_name": "lmazzon70/videomae-base-finetuned-kinetics-finetuned-rwf2000mp4-epochs8-batch8-kb"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/paraphrase-distilroberta-base-v2", "api_call": "SentenceTransformer('sentence-transformers/paraphrase-distilroberta-base-v2')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Automated evaluation"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/paraphrase-distilroberta-base-v2"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Named Entity Recognition", "api_name": "flair/ner-english-ontonotes-fast", "api_call": "SequenceTagger.load('flair/ner-english-ontonotes-fast')", "performance": {"dataset": "Ontonotes", "accuracy": "F1-Score: 89.3"}, "description": "This is the fast version of the 18-class NER model for English that ships with Flair. It predicts 18 tags such as cardinal value, date value, event name, building name, geo-political entity, language name, law name, location name, money name, affiliation, ordinal value, organization name, percent value, person name, product name, quantity value, time value, and name of work of art. The model is based on Flair embeddings and LSTM-CRF.", "model_name": "flair/ner-english-ontonotes-fast"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "google/mobilenet_v2_1.0_224", "api_call": "AutoModelForImageClassification.from_pretrained('google/mobilenet_v2_1.0_224')", "performance": {"dataset": "imagenet-1k", "accuracy": "Not specified"}, "description": "MobileNet V2 model pre-trained on ImageNet-1k at resolution 224x224. It was introduced in MobileNetV2: Inverted Residuals and Linear Bottlenecks by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases. They can be built upon for classification, detection, embeddings and segmentation similar to how other popular large scale models, such as Inception, are used. MobileNets can be run efficiently on mobile devices.", "model_name": "google/mobilenet_v2_1.0_224"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling Prediction", "api_name": "CodeBERTa-small-v1", "api_call": "pipeline('fill-mask', model='huggingface/CodeBERTa-small-v1')", "performance": {"dataset": "code_search_net", "accuracy": null}, "description": "CodeBERTa is a RoBERTa-like model trained on the CodeSearchNet dataset from GitHub. It supports languages like Go, Java, JavaScript, PHP, Python, and Ruby. The tokenizer is a Byte-level BPE tokenizer trained on the corpus using Hugging Face tokenizers. The small model is a 6-layer, 84M parameters, RoBERTa-like Transformer model.", "model_name": "huggingface/CodeBERTa-small-v1"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "deep-reinforcement-learning", "api_name": "td3-Ant-v3", "api_call": "load_from_hub(repo_id='sb3/td3-Ant-v3',filename='{MODEL FILENAME}.zip',)", "performance": {"dataset": "Ant-v3", "accuracy": "5822.96 +/- 93.33"}, "description": "This is a trained model of a TD3 agent playing Ant-v3 using the stable-baselines3 library and the RL Zoo. The RL Zoo is a training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.", "model_name": "sb3/td3-Ant-v3"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "sshleifer/tiny-marian-en-de", "api_call": "pipeline('translation_en_to_de', model='sshleifer/tiny-marian-en-de')", "performance": {"dataset": "", "accuracy": ""}, "description": "A tiny English to German translation model using the Marian framework in Hugging Face Transformers.", "model_name": "sshleifer/tiny-marian-en-de"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-zh-en", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-zh-en')", "performance": {"dataset": "opus", "accuracy": {"BLEU": 36.1, "chr-F": 0.548}}, "description": "A Chinese to English translation model developed by the Language Technology Research Group at the University of Helsinki. It is based on the Marian NMT framework and trained on the OPUS dataset.", "model_name": "Helsinki-NLP/opus-mt-zh-en"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "tiny-random-VideoMAEForVideoClassification", "api_call": "VideoClassificationPipeline(model='hf-tiny-model-private/tiny-random-VideoMAEForVideoClassification')", "performance": {"dataset": "", "accuracy": ""}, "description": "A tiny random VideoMAE model for video classification.", "model_name": "hf-tiny-model-private/tiny-random-VideoMAEForVideoClassification"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "MCG-NJU/videomae-base-short-finetuned-kinetics", "api_call": "VideoMAEForVideoClassification.from_pretrained('MCG-NJU/videomae-base-short-finetuned-kinetics')", "performance": {"dataset": "Kinetics-400", "accuracy": {"top-1": 79.4, "top-5": 94.1}}, "description": "VideoMAE model pre-trained for 800 epochs in a self-supervised way and fine-tuned in a supervised way on Kinetics-400. It was introduced in the paper VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Tong et al. and first released in this repository.", "model_name": "MCG-NJU/videomae-base-short-finetuned-kinetics"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Diffusers", "api_name": "google/ddpm-ema-church-256", "api_call": "DDPMPipeline.from_pretrained('google/ddpm-ema-church-256')", "performance": {"dataset": "CIFAR10", "accuracy": {"Inception score": 9.46, "FID score": 3.17}}, "description": "Denoising Diffusion Probabilistic Models (DDPM) is a class of latent variable models inspired by nonequilibrium thermodynamics. It is used for high-quality image synthesis. DDPM models can use discrete noise schedulers such as scheduling_ddpm, scheduling_ddim, and scheduling_pndm for inference. The model can be used with different pipelines for faster inference and better trade-off between quality and speed.", "model_name": "google/ddpm-ema-church-256"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/tapex-large", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('microsoft/tapex-large')", "performance": {"dataset": "", "accuracy": ""}, "description": "TAPEX (Table Pre-training via Execution) is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries. TAPEX is based on the BART architecture, a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder.", "model_name": "microsoft/tapex-large"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Named Entity Recognition", "api_name": "flair/ner-english-ontonotes", "api_call": "SequenceTagger.load('flair/ner-english-ontonotes')", "performance": {"dataset": "Ontonotes", "accuracy": "89.27"}, "description": "This is the 18-class NER model for English that ships with Flair. It predicts 18 tags such as cardinal value, date value, event name, building name, geo-political entity, language name, law name, location name, money name, affiliation, ordinal value, organization name, percent value, person name, product name, quantity value, time value, and name of work of art. Based on Flair embeddings and LSTM-CRF.", "model_name": "flair/ner-english-ontonotes"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "OFA-Sys/chinese-clip-vit-large-patch14-336px", "api_call": "ChineseCLIPModel.from_pretrained('OFA-Sys/chinese-clip-vit-large-patch14-336px')", "performance": {"dataset": {"CIFAR10": 96.0, "CIFAR100": 79.75, "DTD": 51.2, "EuroSAT": 52.0, "FER": 55.1, "FGVC": 26.2, "KITTI": 49.9, "MNIST": 79.4, "PC": 63.5, "VOC": 84.9}, "accuracy": "various"}, "description": "Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. It uses ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder.", "model_name": "OFA-Sys/chinese-clip-vit-large-patch14-336px"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221122-044810", "api_call": "pipeline('depth-estimation', model='sayakpaul/glpn-nyu-finetuned-diode-221122-044810')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.369, "Mae": 0.2909, "Rmse": 0.4208, "Abs Rel": 0.3635, "Log Mae": 0.1224, "Log Rmse": 0.1793, "Delta1": 0.5323, "Delta2": 0.8179, "Delta3": 0.9258}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221122-044810"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Grammar Correction", "api_name": "vennify/t5-base-grammar-correction", "api_call": "HappyTextToText('T5', 'vennify/t5-base-grammar-correction')", "performance": {"dataset": "jfleg", "accuracy": "Not provided"}, "description": "This model generates a revised version of inputted text with the goal of containing fewer grammatical errors. It was trained with Happy Transformer using a dataset called JFLEG.", "model_name": "vennify/t5-base-grammar-correction"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Visual Question Answering", "api_name": "temp_vilt_vqa", "api_call": "pipeline('visual-question-answering', model='Bingsu/temp_vilt_vqa', tokenizer='Bingsu/temp_vilt_vqa')", "performance": {"dataset": "", "accuracy": ""}, "description": "A visual question answering model for answering questions related to images using the Hugging Face Transformers library.", "model_name": "Bingsu/temp_vilt_vqa"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-large-finetuned-wtq", "api_call": "pipeline('table-question-answering', model='google/tapas-large-finetuned-wtq')", "performance": {"dataset": "wikitablequestions", "accuracy": 0.5097}, "description": "TAPAS large model fine-tuned on WikiTable Questions (WTQ). This model was pre-trained on MLM and an additional step which the authors call intermediate pre-training, and then fine-tuned in a chain on SQA, WikiSQL and finally WTQ. It uses relative position embeddings (i.e. resetting the position index at every cell of the table).", "model_name": "google/tapas-large-finetuned-wtq"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "mattmdjaga/segformer_b2_clothes", "api_call": "SegformerForSemanticSegmentation.from_pretrained('mattmdjaga/segformer_b2_clothes')", "performance": {"dataset": "mattmdjaga/human_parsing_dataset", "accuracy": "Not provided"}, "description": "SegFormer model fine-tuned on ATR dataset for clothes segmentation.", "model_name": "mattmdjaga/segformer_b2_clothes"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "Kirili4ik/mbart_ruDialogSum", "api_call": "MBartForConditionalGeneration.from_pretrained('Kirili4ik/mbart_ruDialogSum')", "performance": {"dataset": [{"name": "SAMSum Corpus (translated to Russian)", "accuracy": {"Validation ROUGE-1": 34.5, "Validation ROUGE-L": 33, "Test ROUGE-1": 31, "Test ROUGE-L": 28}}]}, "description": "MBart for Russian summarization, fine-tuned for dialogue summarization. This model was first fine-tuned by Ilya Gusev on the Gazeta dataset. We then fine-tuned that model on the SAMSum dataset translated to Russian using GoogleTranslateAPI. We have also implemented a Telegram bot, @summarization_bot, that runs inference with this model. Add it to a chat to get summaries instead of dozens of spam messages!", "model_name": "Kirili4ik/mbart_ruDialogSum"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/mask2former-swin-base-coco-panoptic", "api_call": "Mask2FormerForUniversalSegmentation.from_pretrained('facebook/mask2former-swin-base-coco-panoptic')", "performance": {"dataset": "COCO panoptic segmentation", "accuracy": null}, "description": "Mask2Former model trained on COCO panoptic segmentation (base-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, in terms of both performance and efficiency.", "model_name": "facebook/mask2former-swin-base-coco-panoptic"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-Video", "api_name": "duncan93/video", "api_call": "BaseModel.from_pretrained('duncan93/video')", "performance": {"dataset": "OpenAssistant/oasst1", "accuracy": ""}, "description": "A text-to-video model trained on OpenAssistant/oasst1 dataset.", "model_name": "duncan93/video"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "xlnet-base-cased", "api_call": "XLNetModel.from_pretrained('xlnet-base-cased')", "performance": {"dataset": "bookcorpus, wikipedia", "accuracy": "state-of-the-art (SOTA) results on various downstream language tasks"}, "description": "XLNet model pre-trained on English language. It was introduced in the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Yang et al. and first released in this repository. XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context.", "model_name": "xlnet-base-cased"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Diffusion Models", "api_name": "lllyasviel/control_v11p_sd15_mlsd", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_mlsd')", "performance": {"dataset": "MLSD", "accuracy": "Not provided"}, "description": "Controlnet v1.1 is a neural network structure to control diffusion models by adding extra conditions. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5. This checkpoint corresponds to the ControlNet conditioned on MLSD images.", "model_name": "lllyasviel/control_v11p_sd15_mlsd"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Sentiment Analysis", "api_name": "michellejieli/NSFW_text_classifier", "api_call": "pipeline('sentiment-analysis', model='michellejieli/NSFW_text_classifier')", "performance": {"dataset": "Reddit posts", "accuracy": "Not specified"}, "description": "DistilBERT is a transformer model that performs sentiment analysis. I fine-tuned the model on Reddit posts with the purpose of classifying not safe for work (NSFW) content, specifically text that is considered inappropriate and unprofessional. The model predicts 2 classes, which are NSFW or safe for work (SFW). The model is a fine-tuned version of DistilBERT. It was fine-tuned on 14317 Reddit posts pulled from the Reddit API.", "model_name": "michellejieli/NSFW_text_classifier"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Named Entity Recognition", "api_name": "dslim/bert-base-NER-uncased", "api_call": "pipeline('ner', model='dslim/bert-base-NER-uncased')", "performance": {"dataset": "", "accuracy": ""}, "description": "A pretrained BERT model for Named Entity Recognition (NER) on uncased text. It can be used to extract entities such as person names, locations, and organizations from text.", "model_name": "dslim/bert-base-NER-uncased"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "bigwiz83/sapbert-from-pubmedbert-squad2", "api_call": "pipeline('question-answering', model='bigwiz83/sapbert-from-pubmedbert-squad2')", "performance": {"dataset": "squad_v2", "accuracy": "1.2582"}, "description": "This model is a fine-tuned version of cambridgeltl/SapBERT-from-PubMedBERT-fulltext on the squad_v2 dataset.", "model_name": "bigwiz83/sapbert-from-pubmedbert-squad2"}
{"domain": "Reinforcement Learning", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "edbeeching/decision-transformer-gym-halfcheetah-expert", "api_call": "AutoModel.from_pretrained('edbeeching/decision-transformer-gym-halfcheetah-expert')", "performance": {"dataset": "Gym HalfCheetah environment", "accuracy": "Not specified"}, "description": "Decision Transformer model trained on expert trajectories sampled from the Gym HalfCheetah environment", "model_name": "edbeeching/decision-transformer-gym-halfcheetah-expert"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/git-base-textvqa", "api_call": "AutoModel.from_pretrained('microsoft/git-base-textvqa')", "performance": {"dataset": "TextVQA", "accuracy": "Refer to the paper"}, "description": "GIT (GenerativeImage2Text), base-sized, fine-tuned on TextVQA. It is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using 'teacher forcing' on a lot of (image, text) pairs. The goal for the model is to predict the next text token, giving the image tokens and previous text tokens. It can be used for tasks like image and video captioning, visual question answering (VQA) on images and videos, and even image classification.", "model_name": "microsoft/git-base-textvqa"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Normal Map Estimation", "api_name": "lllyasviel/sd-controlnet-normal", "api_call": "ControlNetModel.from_pretrained('lllyasviel/sd-controlnet-normal')", "performance": {"dataset": "DIODE", "accuracy": "Not provided"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on Normal Map Estimation. It can be used in combination with Stable Diffusion.", "model_name": "lllyasviel/sd-controlnet-normal"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Diffusers", "api_name": "google/ddpm-celebahq-256", "api_call": "DDPMPipeline.from_pretrained('google/ddpm-celebahq-256')", "performance": {"dataset": "CIFAR10", "accuracy": {"Inception_score": 9.46, "FID_score": 3.17}}, "description": "Denoising Diffusion Probabilistic Models (DDPM) for high quality image synthesis. Trained on the unconditional CIFAR10 dataset and 256x256 LSUN, obtaining state-of-the-art FID score of 3.17 and Inception score of 9.46.", "model_name": "google/ddpm-celebahq-256"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "lysandre/tiny-tapas-random-sqa", "api_call": "TapasForQuestionAnswering.from_pretrained('lysandre/tiny-tapas-random-sqa')", "performance": {"dataset": null, "accuracy": null}, "description": "A tiny TAPAS model for table question answering tasks.", "model_name": "lysandre/tiny-tapas-random-sqa"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "nvidia/segformer-b5-finetuned-ade-640-640", "api_call": "SegformerForSemanticSegmentation.from_pretrained('nvidia/segformer-b5-finetuned-ade-640-640')", "performance": {"dataset": "ADE20K", "accuracy": "Not provided"}, "description": "SegFormer model fine-tuned on ADE20k at resolution 640x640. It was introduced in the paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Xie et al. and first released in this repository.", "model_name": "nvidia/segformer-b5-finetuned-ade-640-640"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "distilbert-base-uncased-distilled-squad", "api_call": "pipeline('question-answering', model='distilbert-base-uncased-distilled-squad')", "performance": {"dataset": "SQuAD v1.1", "accuracy": "86.9 F1 score"}, "description": "DistilBERT base uncased distilled SQuAD is a fine-tuned version of DistilBERT-base-uncased for the task of question answering. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.", "model_name": "distilbert-base-uncased-distilled-squad"}
{"domain": "Tabular Tabular Regression", "framework": "Keras", "functionality": "anomaly-detection", "api_name": "keras-io/timeseries-anomaly-detection", "api_call": "TFAutoModelForSequenceClassification.from_pretrained('keras-io/timeseries-anomaly-detection')", "performance": {"dataset": "Numenta Anomaly Benchmark(NAB)", "accuracy": {"Train Loss": 0.006, "Validation Loss": 0.008}}, "description": "This script demonstrates how you can use a reconstruction convolutional autoencoder model to detect anomalies in timeseries data. We will use the Numenta Anomaly Benchmark(NAB) dataset. It provides artificial timeseries data containing labeled anomalous periods of behavior. Data are ordered, timestamped, single-valued metrics.", "model_name": "keras-io/timeseries-anomaly-detection"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "functionality": "Fill-Mask", "api_name": "cl-tohoku/bert-base-japanese-whole-word-masking", "api_call": "AutoModelForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')", "performance": {"dataset": "Japanese Wikipedia", "accuracy": "Not provided"}, "description": "This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization. Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective.", "model_name": "cl-tohoku/bert-base-japanese-whole-word-masking"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "DeepPavlov/rubert-base-cased", "api_call": "AutoModel.from_pretrained('DeepPavlov/rubert-base-cased')", "performance": {"dataset": "Russian part of Wikipedia and news data", "accuracy": ""}, "description": "RuBERT (Russian, cased, 12\u2011layer, 768\u2011hidden, 12\u2011heads, 180M parameters) was trained on the Russian part of Wikipedia and news data. We used this training data to build a vocabulary of Russian subtokens and took a multilingual version of BERT\u2011base as an initialization for RuBERT[1].", "model_name": "DeepPavlov/rubert-base-cased"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Many-to-Many multilingual translation", "api_name": "facebook/m2m100_1.2B", "api_call": "M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_1.2B')", "performance": {"dataset": "M2M100", "accuracy": "Not specified"}, "description": "M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. It can directly translate between the 9,900 directions of 100 languages. To translate into a target language, the target language id is forced as the first generated token.", "model_name": "facebook/m2m100_1.2B"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "superb/hubert-base-superb-ks", "api_call": "pipeline('audio-classification', model='superb/hubert-base-superb-ks')", "performance": {"dataset": "Speech Commands dataset v1.0", "accuracy": 0.9672000000000001}, "description": "This is a ported version of S3PRL's Hubert for the SUPERB Keyword Spotting task. The base model is hubert-base-ls960, which is pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. For more information refer to SUPERB: Speech processing Universal PERformance Benchmark.", "model_name": "superb/hubert-base-superb-ks"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "andite/anything-v4.0", "api_call": "StableDiffusionPipeline.from_pretrained('andite/anything-v4.0', torch_dtype=torch.float16)", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "Anything V4 is a latent diffusion model for generating high-quality, highly detailed anime-style images with just a few prompts. It supports danbooru tags to generate images and can be used just like any other Stable Diffusion model.", "model_name": "andite/anything-v4.0"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "hakurei/waifu-diffusion", "api_call": "StableDiffusionPipeline.from_pretrained('hakurei/waifu-diffusion', torch_dtype=torch.float32)", "performance": {"dataset": "high-quality anime images", "accuracy": "not available"}, "description": "waifu-diffusion is a latent text-to-image diffusion model that has been conditioned on high-quality anime images through fine-tuning.", "model_name": "hakurei/waifu-diffusion"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "functionality": "Fill-Mask", "api_name": "camembert-base", "api_call": "pipeline('fill-mask', model='camembert-base', tokenizer='camembert-base')", "performance": {"dataset": "oscar", "accuracy": "N/A"}, "description": "CamemBERT is a state-of-the-art language model for French based on the RoBERTa model. It is available on Hugging Face in 6 different versions with varying number of parameters, amount of pretraining data, and pretraining data source domains. It can be used for Fill-Mask tasks.", "model_name": "camembert-base"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Speech Enhancement", "api_name": "sepformer-wham-enhancement", "api_call": "separator.from_hparams(source='speechbrain/sepformer-wham-enhancement', savedir='pretrained_models/sepformer-wham-enhancement')", "performance": {"dataset": "WHAM!", "accuracy": "14.35 dB SI-SNR"}, "description": "This repository provides all the necessary tools to perform speech enhancement (denoising) with a SepFormer model, implemented with SpeechBrain, and pretrained on WHAM! dataset with 8k sampling frequency, which is basically a version of WSJ0-Mix dataset with environmental noise and reverberation in 8k.", "model_name": "speechbrain/sepformer-wham-enhancement"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Sentiment Inferencing for stock-related comments", "api_name": "zhayunduo/roberta-base-stocktwits-finetuned", "api_call": "RobertaForSequenceClassification.from_pretrained('zhayunduo/roberta-base-stocktwits-finetuned')", "performance": {"dataset": "stocktwits", "accuracy": 0.9343}, "description": "This model is fine-tuned with roberta-base model on 3,200,000 comments from stocktwits, with the user-labeled tags 'Bullish' or 'Bearish'.", "model_name": "zhayunduo/roberta-base-stocktwits-finetuned"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "LayoutLMX_pt_question_answer_ocrazure_correct_V16_07_04_2023", "api_call": "AutoModelForDocumentQuestionAnswering.from_pretrained('L-oenai/LayoutLMX_pt_question_answer_ocrazure_correct_V15_30_03_2023')", "performance": {"dataset": "", "accuracy": ""}, "description": "A LayoutLMv2 model for document question answering.", "model_name": "L-oenai/LayoutLMX_pt_question_answer_ocrazure_correct_V15_30_03_2023"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "superb/hubert-large-superb-sid", "api_call": "pipeline('audio-classification', model='superb/hubert-large-superb-sid')", "performance": {"dataset": "VoxCeleb1", "accuracy": 0.9035000000000001}, "description": "Hubert-Large for Speaker Identification. This model is pretrained on 16kHz sampled speech audio and should be used with speech input also sampled at 16kHz. It is used for the SUPERB Speaker Identification task and can classify each utterance for its speaker identity as a multi-class classification.", "model_name": "superb/hubert-large-superb-sid"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "microsoft/DialoGPT-medium", "api_call": "AutoModelForCausalLM.from_pretrained('microsoft/DialoGPT-medium')", "performance": {"dataset": "Reddit", "accuracy": "Comparable to human response quality under a single-turn conversation Turing test"}, "description": "DialoGPT is a SOTA large-scale pretrained dialogue response generation model for multi-turn conversations. The model is trained on 147M multi-turn dialogues from Reddit discussion threads.", "model_name": "microsoft/DialoGPT-medium"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/maskformer-swin-large-ade", "api_call": "MaskFormerForInstanceSegmentation.from_pretrained('facebook/maskformer-swin-large-ade')", "performance": {"dataset": "ADE20k", "accuracy": "Not provided"}, "description": "MaskFormer model trained on ADE20k semantic segmentation (large-sized version, Swin backbone). It was introduced in the paper Per-Pixel Classification is Not All You Need for Semantic Segmentation and first released in this repository. This model addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation.", "model_name": "facebook/maskformer-swin-large-ade"}
{"domain": "Tabular Tabular Classification", "framework": "Scikit-learn", "functionality": "Binary Classification", "api_name": "danupurnomo/dummy-titanic", "api_call": "load_model(cached_download(hf_hub_url('danupurnomo/dummy-titanic', 'titanic_model.h5')))", "performance": {"dataset": "Titanic", "accuracy": "Not provided"}, "description": "This model is a binary classifier for predicting whether a passenger on the Titanic survived or not, based on features such as passenger class, age, sex, fare, and more.", "model_name": "danupurnomo/dummy-titanic"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "openai/clip-vit-base-patch16", "api_call": "CLIPModel.from_pretrained('openai/clip-vit-base-patch16')", "performance": {"dataset": ["Food101", "CIFAR10", "CIFAR100", "Birdsnap", "SUN397", "Stanford Cars", "FGVC Aircraft", "VOC2007", "DTD", "Oxford-IIIT Pet dataset", "Caltech101", "Flowers102", "MNIST", "SVHN", "IIIT5K", "Hateful Memes", "SST-2", "UCF101", "Kinetics700", "Country211", "CLEVR Counting", "KITTI Distance", "STL-10", "RareAct", "Flickr30", "MSCOCO", "ImageNet", "ImageNet-A", "ImageNet-R", "ImageNet Sketch", "ObjectNet (ImageNet Overlap)", "Youtube-BB", "ImageNet-Vid"], "accuracy": "varies depending on the dataset"}, "description": "The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner.", "model_name": "openai/clip-vit-base-patch16"}
{"domain": "Natural Language Processing Text Generation", "framework": "Transformers", "functionality": "Text Generation", "api_name": "facebook/opt-1.3b", "api_call": "pipeline('text-generation', model='facebook/opt-1.3b')", "performance": {"dataset": "BookCorpus, CC-Stories, The Pile, Pushshift.io Reddit, CCNewsV2", "accuracy": "Not provided"}, "description": "OPT (Open Pre-trained Transformers) is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, trained to roughly match the performance and sizes of the GPT-3 class of models. It can be used for prompting for evaluation of downstream tasks as well as text generation.", "model_name": "facebook/opt-1.3b"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "mazkooleg/0-9up-wavlm-base-plus-ft", "api_call": "pipeline('audio-classification', model='mazkooleg/0-9up-wavlm-base-plus-ft')", "performance": {"dataset": "mazkooleg/0-9up_google_speech_commands_augmented_raw", "accuracy": 0.9973000000000001}, "description": "This model is a fine-tuned version of microsoft/wavlm-base-plus on the mazkooleg/0-9up_google_speech_commands_augmented_raw dataset. It achieves the following results on the evaluation set: Loss: 0.0093, Accuracy: 0.9973.", "model_name": "mazkooleg/0-9up-wavlm-base-plus-ft"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Speech Recognition", "api_name": "jonatasgrosman/wav2vec2-large-xlsr-53-dutch", "api_call": "SpeechRecognitionModel('jonatasgrosman/wav2vec2-large-xlsr-53-dutch')", "performance": {"dataset": "Common Voice nl", "accuracy": {"Test WER": 15.72, "Test CER": 5.35, "Test WER (+LM)": 12.84, "Test CER (+LM)": 4.64}}, "description": "Fine-tuned XLSR-53 large model for speech recognition in Dutch. Fine-tuned on Dutch using the train and validation splits of Common Voice 6.1 and CSS10.", "model_name": "jonatasgrosman/wav2vec2-large-xlsr-53-dutch"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-medium-finetuned-wtq", "api_call": "pipeline('table-question-answering', model='google/tapas-medium-finetuned-wtq')", "performance": {"dataset": "wikitablequestions", "accuracy": 0.4324}, "description": "TAPAS medium model fine-tuned on WikiTable Questions (WTQ). This model is pretrained on a large corpus of English data from Wikipedia and is used for answering questions related to a table.", "model_name": "google/tapas-medium-finetuned-wtq"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "hyunwoongko/blenderbot-9B", "api_call": "pipeline('conversational', model='hyunwoongko/blenderbot-9B')", "performance": {"dataset": "blended_skill_talk", "accuracy": "Not provided"}, "description": "Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, both asking and answering questions, and displaying knowledge, empathy and personality appropriately, depending on the situation. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter neural models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.", "model_name": "hyunwoongko/blenderbot-9B"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "CompVis/ldm-celebahq-256", "api_call": "DiffusionPipeline.from_pretrained('CompVis/ldm-celebahq-256')", "performance": {"dataset": "CelebA-HQ", "accuracy": "N/A"}, "description": "Latent Diffusion Models (LDMs) achieve state-of-the-art synthesis results on image data and beyond by decomposing the image formation process into a sequential application of denoising autoencoders. LDMs enable high-resolution synthesis, semantic scene synthesis, super-resolution, and image inpainting while significantly reducing computational requirements compared to pixel-based DMs.", "model_name": "CompVis/ldm-celebahq-256"}
{"domain": "Audio Audio-to-Audio", "framework": "Fairseq", "functionality": "speech-to-speech-translation", "api_name": "xm_transformer_s2ut_en-hk", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/xm_transformer_s2ut_en-hk')", "performance": {"dataset": "MuST-C", "accuracy": "Not specified"}, "description": "Speech-to-speech translation model with single-pass decoder (S2UT) from fairseq: English-Hokkien. Trained with supervised data in TED domain, and weakly supervised data in TED and Audiobook domain.", "model_name": "facebook/xm_transformer_s2ut_en-hk"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Generation", "api_name": "stabilityai/stable-diffusion-2-1-base", "api_call": "StableDiffusionPipeline.from_pretrained('stabilityai/stable-diffusion-2-1-base', scheduler=EulerDiscreteScheduler.from_pretrained('stabilityai/stable-diffusion-2-1-base', subfolder='scheduler'), torch_dtype=torch.float16)", "performance": {"dataset": "COCO2017 validation set", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion v2-1-base is a diffusion-based text-to-image generation model that can generate and modify images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (OpenCLIP-ViT/H). It is intended for research purposes only and can be used in areas such as safe deployment of models, understanding limitations and biases of generative models, generation of artworks, and research on generative models.", "model_name": "stabilityai/stable-diffusion-2-1-base"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "Narsil/deberta-large-mnli-zero-cls", "api_call": "DebertaModel.from_pretrained('Narsil/deberta-large-mnli-zero-cls')", "performance": {"dataset": {"SQuAD 1.1": {"F1": 95.5, "EM": 90.1}, "SQuAD 2.0": {"F1": 90.7, "EM": 88.0}, "MNLI-m/mm": {"Accuracy": 91.3}, "SST-2": {"Accuracy": 96.5}, "QNLI": {"Accuracy": 95.3}, "CoLA": {"MCC": 69.5}, "RTE": {"Accuracy": 91.0}, "MRPC": {"Accuracy": 92.6}, "QQP": {}, "STS-B": {"P/S": 92.8}}}, "description": "DeBERTa improves the BERT and RoBERTa models using disentangled attention and enhanced mask decoder. It outperforms BERT and RoBERTa on the majority of NLU tasks with 80GB training data. This is the DeBERTa large model fine-tuned with MNLI task.", "model_name": "Narsil/deberta-large-mnli-zero-cls"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "PyTorch Transformers", "functionality": "Feature Extraction", "api_name": "kobart-base-v2", "api_call": "BartModel.from_pretrained('gogamza/kobart-base-v2')", "performance": {"dataset": "NSMC", "accuracy": 0.901}, "description": "KoBART is a Korean encoder-decoder language model trained on over 40GB of Korean text using the BART architecture. It can be used for feature extraction and has been trained on a variety of data sources, including Korean Wiki, news, books, and more.", "model_name": "gogamza/kobart-base-v2"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "facebook/timesformer-hr-finetuned-k600", "api_call": "TimesformerForVideoClassification.from_pretrained('facebook/timesformer-hr-finetuned-k600')", "performance": {"dataset": "Kinetics-600", "accuracy": "Not provided"}, "description": "TimeSformer model pre-trained on Kinetics-600. It was introduced in the paper TimeSformer: Is Space-Time Attention All You Need for Video Understanding? by Bertasius et al. and first released in this repository. The model can be used for video classification into one of the 600 possible Kinetics-600 labels.", "model_name": "facebook/timesformer-hr-finetuned-k600"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft", "api_call": "pipeline('image-classification', model='laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft')", "performance": {"dataset": "ImageNet-1k", "accuracy": "75.9-76.9%"}, "description": "A series of CLIP ConvNeXt-Large models trained on the LAION-2B (english) subset of LAION-5B using OpenCLIP. The models achieve between 75.9 and 76.9 top-1 zero-shot accuracy on ImageNet-1k.", "model_name": "laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "facebook/opt-13b", "api_call": "AutoModelForCausalLM.from_pretrained('facebook/opt-13b')", "performance": {"dataset": "GPT-3", "accuracy": "roughly match the performance and sizes of the GPT-3 class of models"}, "description": "OPT (Open Pre-trained Transformers) is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. The models are trained to match the performance and sizes of the GPT-3 class of models. The primary goal is to enable reproducible and responsible research at scale and to bring more voices to the table in studying the impact of large language models. OPT-13B is a 13-billion-parameter model trained predominantly with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective.", "model_name": "facebook/opt-13b"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "facebook/tts_transformer-zh-cv7_css10", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/tts_transformer-zh-cv7_css10', arg_overrides={'vocoder': 'hifigan', 'fp16': False})", "performance": {"dataset": "common_voice", "accuracy": "Not provided"}, "description": "Transformer text-to-speech model from fairseq S^2. Simplified Chinese, Single-speaker female voice, Pre-trained on Common Voice v7, fine-tuned on CSS10.", "model_name": "facebook/tts_transformer-zh-cv7_css10"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Summarization", "api_name": "it5-base-news-summarization", "api_call": "pipeline('summarization', model='it5/it5-base-news-summarization')", "performance": {"dataset": "NewsSum-IT", "accuracy": {"Rouge1": 0.339, "Rouge2": 0.16, "RougeL": 0.263}}, "description": "IT5 Base model fine-tuned on news summarization on the Fanpage and Il Post corpora for Italian Language Understanding and Generation.", "model_name": "it5/it5-base-news-summarization"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "keremberke/yolov8s-csgo-player-detection", "api_call": "YOLO('keremberke/yolov8s-csgo-player-detection')", "performance": {"dataset": "csgo-object-detection", "accuracy": 0.886}, "description": "A YOLOv8 model for detecting Counter-Strike: Global Offensive (CS:GO) players. Supports the labels ['ct', 'cthead', 't', 'thead'].", "model_name": "keremberke/yolov8s-csgo-player-detection"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Inference", "api_name": "google/ncsnpp-ffhq-1024", "api_call": "DiffusionPipeline.from_pretrained('google/ncsnpp-ffhq-1024')", "performance": {"dataset": "CIFAR-10", "accuracy": {"Inception_score": 9.89, "FID": 2.2, "likelihood": 2.99}}, "description": "Score-Based Generative Modeling through Stochastic Differential Equations (SDE) for unconditional image generation. Achieves record-breaking performance on CIFAR-10 and demonstrates high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.", "model_name": "google/ncsnpp-ffhq-1024"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "mgp-str", "api_call": "MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')", "performance": {"dataset": "MJSynth and SynthText", "accuracy": null}, "description": "MGP-STR is a pure vision Scene Text Recognition (STR) model, consisting of ViT and specially designed A^3 modules. It is trained on MJSynth and SynthText datasets and can be used for optical character recognition (OCR) on text images.", "model_name": "alibaba-damo/mgp-str-base"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "mazkooleg/0-9up-ast-ft", "api_call": "pipeline('audio-classification', model='MIT/ast-finetuned-speech-commands-v2')", "performance": {"dataset": "mazkooleg/0-9up_google_speech_commands_augmented_raw", "accuracy": 0.9979}, "description": "This model is a fine-tuned version of MIT/ast-finetuned-speech-commands-v2 on the None dataset. It achieves the following results on the evaluation set: Loss: 0.0210, Accuracy: 0.9979", "model_name": "MIT/ast-finetuned-speech-commands-v2"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Diffusers", "api_name": "WiNE-iNEFF/Minecraft-Skin-Diffusion-V2", "api_call": "DDPMPipeline.from_pretrained('WiNE-iNEFF/Minecraft-Skin-Diffusion-V2')", "performance": {"dataset": null, "accuracy": null}, "description": "An unconditional image generation model for generating Minecraft skin images using the diffusion model.", "model_name": "WiNE-iNEFF/Minecraft-Skin-Diffusion-V2"}
{"domain": "Tabular Tabular Classification", "framework": "Scikit-learn", "functionality": "Joblib", "api_name": "julien-c/skops-digits", "api_call": "load('path_to_folder/sklearn_model.joblib')", "performance": {"dataset": null, "accuracy": null}, "description": "A tabular classification model using the Scikit-learn framework and Joblib functionality. The model is trained with various hyperparameters and can be used for classification tasks.", "model_name": "path_to_folder/sklearn_model.joblib"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "keremberke/yolov8m-plane-detection", "api_call": "YOLO('keremberke/yolov8m-plane-detection')", "performance": {"dataset": "plane-detection", "accuracy": "0.995"}, "description": "A YOLOv8 model for plane detection trained on the keremberke/plane-detection dataset. The model is capable of detecting planes in images with high accuracy.", "model_name": "keremberke/yolov8m-plane-detection"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "openai-gpt", "api_call": "pipeline('text-generation', model='openai-gpt')", "performance": {"dataset": [{"name": "SNLI", "accuracy": 89.9}, {"name": "MNLI Matched", "accuracy": 82.1}, {"name": "MNLI Mismatched", "accuracy": 81.4}, {"name": "SciTail", "accuracy": 88.3}, {"name": "QNLI", "accuracy": 88.1}, {"name": "RTE", "accuracy": 56.0}, {"name": "STS-B", "accuracy": 82.0}, {"name": "QQP", "accuracy": 70.3}, {"name": "MRPC", "accuracy": 82.3}, {"name": "RACE", "accuracy": 59.0}, {"name": "ROCStories", "accuracy": 86.5}, {"name": "COPA", "accuracy": 78.6}, {"name": "SST-2", "accuracy": 91.3}, {"name": "CoLA", "accuracy": 45.4}, {"name": "GLUE", "accuracy": 72.8}]}, "description": "openai-gpt is a transformer-based language model created and released by OpenAI. The model is a causal (unidirectional) transformer pre-trained using language modeling on a large corpus with long-range dependencies.", "model_name": "openai-gpt"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "openai/clip-vit-large-patch14", "api_call": "CLIPModel.from_pretrained('openai/clip-vit-large-patch14')", "performance": {"dataset": ["Food101", "CIFAR10", "CIFAR100", "Birdsnap", "SUN397", "Stanford Cars", "FGVC Aircraft", "VOC2007", "DTD", "Oxford-IIIT Pet dataset", "Caltech101", "Flowers102", "MNIST", "SVHN", "IIIT5K", "Hateful Memes", "SST-2", "UCF101", "Kinetics700", "Country211", "CLEVR Counting", "KITTI Distance", "STL-10", "RareAct", "Flickr30", "MSCOCO", "ImageNet", "ImageNet-A", "ImageNet-R", "ImageNet Sketch", "ObjectNet (ImageNet Overlap)", "Youtube-BB", "ImageNet-Vid"], "accuracy": "varies depending on the dataset"}, "description": "The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner.", "model_name": "openai/clip-vit-large-patch14"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Age Classification", "api_name": "nateraw/vit-age-classifier", "api_call": "ViTForImageClassification.from_pretrained('nateraw/vit-age-classifier')", "performance": {"dataset": "fairface", "accuracy": null}, "description": "A vision transformer finetuned to classify the age of a given person's face.", "model_name": "nateraw/vit-age-classifier"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Automatic Speech Recognition and Speech Translation", "api_name": "openai/whisper-large", "api_call": "WhisperForConditionalGeneration.from_pretrained('openai/whisper-large')", "performance": {"dataset": [{"name": "LibriSpeech (clean)", "accuracy": 3.0}, {"name": "LibriSpeech (other)", "accuracy": 5.4}, {"name": "Common Voice 11.0", "accuracy": 54.8}]}, "description": "Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.", "model_name": "openai/whisper-large"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face", "functionality": "Visual Question Answering", "api_name": "JosephusCheung/GuanacoVQA", "api_call": "pipeline('visual-question-answering', model='GuanacoVQA')", "performance": {"dataset": "JosephusCheung/GuanacoVQADataset", "accuracy": "N/A"}, "description": "A multilingual Visual Question Answering model supporting English, Chinese, Japanese, and German languages. It requires the combined use of the Guanaco 7B LLM model and is based on the implementation of MiniGPT-4.", "model_name": "GuanacoVQA"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Image Variations", "api_name": "lambdalabs/sd-image-variations-diffusers", "api_call": "StableDiffusionImageVariationPipeline.from_pretrained('lambdalabs/sd-image-variations-diffusers', revision='v2.0')", "performance": {"dataset": "ChristophSchuhmann/improved_aesthetics_6plus", "accuracy": "N/A"}, "description": "This version of Stable Diffusion has been fine tuned from CompVis/stable-diffusion-v1-4-original to accept CLIP image embedding rather than text embeddings. This allows the creation of image variations similar to DALLE-2 using Stable Diffusion.", "model_name": "lambdalabs/sd-image-variations-diffusers"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Token Classification", "api_name": "vblagoje/bert-english-uncased-finetuned-pos", "api_call": "AutoModelForTokenClassification.from_pretrained('vblagoje/bert-english-uncased-finetuned-pos')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "A BERT model fine-tuned for Part-of-Speech (POS) tagging in English text.", "model_name": "vblagoje/bert-english-uncased-finetuned-pos"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/dragon-plus-context-encoder", "api_call": "AutoModel.from_pretrained('facebook/dragon-plus-context-encoder')", "performance": {"dataset": "MS MARCO", "accuracy": 39.0}, "description": "DRAGON+ is a BERT-base sized dense retriever initialized from RetroMAE and further trained on the data augmented from MS MARCO corpus, following the approach described in How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval. The associated GitHub repository is available here https://github.com/facebookresearch/dpr-scale/tree/main/dragon. We use asymmetric dual encoder, with two distinctly parameterized encoders.", "model_name": "facebook/dragon-plus-context-encoder"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Paraphrase-based utterance augmentation", "api_name": "prithivida/parrot_fluency_model", "api_call": "pipeline('text-classification', model='prithivida/parrot_fluency_model')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.", "model_name": "prithivida/parrot_fluency_model"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221121-113853", "api_call": "pipeline('depth-estimation', model='sayakpaul/glpn-nyu-finetuned-diode-221121-113853')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.33840000000000003, "Mae": 0.27390000000000003, "Rmse": 0.39590000000000003, "Abs Rel": 0.323, "Log Mae": 0.1148, "Log Rmse": 0.1651, "Delta1": 0.5576, "Delta2": 0.8345, "Delta3": 0.9398000000000001}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221121-113853"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Conversational", "api_name": "DialoGPT-medium-PALPATINE2", "api_call": "pipeline('text-generation', model='Filosofas/DialoGPT-medium-PALPATINE2')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "A DialoGPT model trained for generating human-like conversational responses.", "model_name": "Filosofas/DialoGPT-medium-PALPATINE2"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-Video", "api_name": "ImRma/Brucelee", "api_call": "pipeline('text-to-video', model='ImRma/Brucelee')", "performance": {"dataset": "", "accuracy": ""}, "description": "A Hugging Face model for converting Persian and English text into video.", "model_name": "ImRma/Brucelee"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/xclip-base-patch32", "api_call": "XCLIPModel.from_pretrained('microsoft/xclip-base-patch32')", "performance": {"dataset": "Kinetics 400", "accuracy": {"top-1": 80.4, "top-5": 95.0}}, "description": "X-CLIP is a minimal extension of CLIP for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs. This allows the model to be used for tasks like zero-shot, few-shot or fully supervised video classification and video-text retrieval.", "model_name": "microsoft/xclip-base-patch32"}
{"domain": "Reinforcement Learning Robotics", "framework": "Hugging Face", "functionality": "6D grasping", "api_name": "camusean/grasp_diffusion", "api_call": "AutoModel.from_pretrained('camusean/grasp_diffusion')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "Trained Models for Grasp SE(3) DiffusionFields. Check SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp and motion optimization through diffusion for additional details.", "model_name": "camusean/grasp_diffusion"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "gsdf/Counterfeit-V2.5", "api_call": "pipeline('text-to-image', model='gsdf/Counterfeit-V2.5')", "performance": {"dataset": "EasyNegative", "accuracy": "Not provided"}, "description": "Counterfeit-V2.5 is a text-to-image model that generates anime-style images based on text prompts. It has been updated for ease of use and can be used with negative prompts to create high-quality images.", "model_name": "gsdf/Counterfeit-V2.5"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "dreamlike-art/dreamlike-anime-1.0", "api_call": "StableDiffusionPipeline.from_pretrained('dreamlike-art/dreamlike-anime-1.0', torch_dtype=torch.float16)(prompt, negative_prompt=negative_prompt)", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "Dreamlike Anime 1.0 is a high quality anime model, made by dreamlike.art. It can be used to generate anime-style images based on text prompts. The model is trained on 768x768px images and works best with prompts that include 'photo anime, masterpiece, high quality, absurdres'. It can be used with the Stable Diffusion Pipeline from the diffusers library.", "model_name": "dreamlike-art/dreamlike-anime-1.0"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "sayakpaul/videomae-base-finetuned-kinetics-finetuned-ucf101-subset", "api_call": "AutoModelForVideoClassification.from_pretrained('sayakpaul/videomae-base-finetuned-kinetics-finetuned-ucf101-subset')", "performance": {"dataset": "unknown", "accuracy": 1.0}, "description": "This model is a fine-tuned version of MCG-NJU/videomae-base-finetuned-kinetics on an unknown dataset.", "model_name": "sayakpaul/videomae-base-finetuned-kinetics-finetuned-ucf101-subset"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Transformers", "functionality": "Conversational", "api_name": "facebook/blenderbot-400M-distill", "api_call": "BlenderbotForConditionalGeneration.from_pretrained('facebook/blenderbot-400M-distill')", "performance": {"dataset": "blended_skill_talk", "accuracy": "Not specified"}, "description": "BlenderBot-400M-distill is a distilled version of the BlenderBot model, trained on the Blended Skill Talk dataset. It is designed for open-domain chatbot tasks and can generate text-to-text responses in a conversational manner. The model is based on the Transformers library and can be used with PyTorch, TensorFlow, and JAX.", "model_name": "facebook/blenderbot-400M-distill"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Image-to-Text", "api_name": "ydshieh/vit-gpt2-coco-en", "api_call": "VisionEncoderDecoderModel.from_pretrained('ydshieh/vit-gpt2-coco-en')", "performance": {"dataset": "COCO", "accuracy": "Not specified"}, "description": "A proof-of-concept model for the Hugging Face FlaxVisionEncoderDecoder Framework that produces reasonable image captioning results.", "model_name": "ydshieh/vit-gpt2-coco-en"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "dreamlike-art/dreamlike-photoreal-2.0", "api_call": "StableDiffusionPipeline.from_pretrained('dreamlike-art/dreamlike-photoreal-2.0', torch_dtype=torch.float16)(prompt).images[0]", "performance": {"dataset": "Stable Diffusion 1.5", "accuracy": "Not specified"}, "description": "Dreamlike Photoreal 2.0 is a photorealistic model based on Stable Diffusion 1.5, made by dreamlike.art. It can be used to generate photorealistic images from text prompts.", "model_name": "dreamlike-art/dreamlike-photoreal-2.0"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "Lykon/DreamShaper", "api_call": "pipeline('text-to-image', model='Lykon/DreamShaper')", "performance": {"dataset": "", "accuracy": ""}, "description": "Dream Shaper is a text-to-image model that generates artistic images based on the given input text. Read more about this model here: https://civitai.com/models/4384/dreamshaper", "model_name": "Lykon/DreamShaper"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "bert-large-uncased-whole-word-masking-squad2", "api_call": "pipeline('question-answering', model=AutoModel.from_pretrained('deepset/bert-large-uncased-whole-word-masking-squad2'), tokenizer=AutoTokenizer.from_pretrained('deepset/bert-large-uncased-whole-word-masking-squad2'))", "performance": {"dataset": "squad_v2", "accuracy": {"Exact Match": 80.885, "F1": 83.876}}, "description": "This is a bert-large model, fine-tuned using the SQuAD2.0 dataset for the task of question answering. It is designed for extractive question answering and supports English language.", "model_name": "deepset/bert-large-uncased-whole-word-masking-squad2"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Asteroid", "api_name": "DCUNet_Libri1Mix_enhsingle_16k", "api_call": "BaseModel.from_pretrained('JorisCos/DCUNet_Libri1Mix_enhsingle_16k')", "performance": {"dataset": "Libri1Mix", "accuracy": {"si_sdr": 13.1540353916, "si_sdr_imp": 9.7042540858, "sdr": 13.5680588731, "sdr_imp": 10.0653960739, "sar": 13.5680588731, "sar_imp": 10.0653960739, "stoi": 0.9199373340000001, "stoi_imp": 0.12401751050000001}}, "description": "This model was trained by Joris Cosentino using the librimix recipe in Asteroid. It was trained on the enh_single task of the Libri1Mix dataset.", "model_name": "JorisCos/DCUNet_Libri1Mix_enhsingle_16k"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Summarization", "api_name": "lidiya/bart-large-xsum-samsum", "api_call": "pipeline('summarization', model='lidiya/bart-large-xsum-samsum')", "performance": {"dataset": "SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization", "accuracy": {"rouge1": 53.306, "rouge2": 28.355, "rougeL": 44.095}}, "description": "This model was obtained by fine-tuning facebook/bart-large-xsum on Samsum dataset.", "model_name": "lidiya/bart-large-xsum-samsum"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "videomae-base-short-ssv2", "api_call": "VideoMAEForPreTraining.from_pretrained('MCG-NJU/videomae-base-short-ssv2')", "performance": {"dataset": "Something-Something-v2", "accuracy": "N/A"}, "description": "VideoMAE is an extension of Masked Autoencoders (MAE) to video. The architecture of the model is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches. Videos are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds fixed sine/cosine position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of videos that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled videos for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire video.", "model_name": "MCG-NJU/videomae-base-short-ssv2"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Speech to Text", "api_name": "facebook/s2t-medium-librispeech-asr", "api_call": "Speech2TextForConditionalGeneration.from_pretrained('facebook/s2t-medium-librispeech-asr')", "performance": {"dataset": "LibriSpeech", "accuracy": {"clean": 3.5, "other": 7.8}}, "description": "s2t-medium-librispeech-asr is a Speech to Text Transformer (S2T) model trained for automatic speech recognition (ASR). The S2T model was proposed in this paper and released in this repository.", "model_name": "facebook/s2t-medium-librispeech-asr"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "kykim/bertshared-kor-base", "api_call": "EncoderDecoderModel.from_pretrained('kykim/bertshared-kor-base')", "performance": {"dataset": "70GB Korean text dataset", "accuracy": "42000 lower-cased subwords"}, "description": "Bert base model for Korean, trained on a 70GB Korean text dataset and 42000 lower-cased subwords. Can be used for Text2Text Generation tasks.", "model_name": "kykim/bertshared-kor-base"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "deepset/deberta-v3-large-squad2", "api_call": "AutoModelForQuestionAnswering.from_pretrained('deepset/deberta-v3-large-squad2')", "performance": {"dataset": "squad_v2", "accuracy": {"exact": 87.6105449339, "f1": 90.7530700887}}, "description": "This is the deberta-v3-large model, fine-tuned using the SQuAD2.0 dataset. It's been trained on question-answer pairs, including unanswerable questions, for the task of Question Answering.", "model_name": "deepset/deberta-v3-large-squad2"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "layoutlmv2-base-uncased_finetuned_docvqa", "api_call": "pipeline('question-answering', model='Sayantan1993/layoutlmv2-base-uncased_finetuned_docvqa')", "performance": {"dataset": "", "accuracy": ""}, "description": "A model for document question answering, fine-tuned on the DocVQA dataset using LayoutLMv2-base-uncased.", "model_name": "Sayantan1993/layoutlmv2-base-uncased_finetuned_docvqa"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "google/ncsnpp-church-256", "api_call": "DiffusionPipeline.from_pretrained('google/ncsnpp-church-256')", "performance": {"dataset": "CIFAR-10", "accuracy": {"Inception_score": 9.89, "FID": 2.2, "likelihood": 2.99}}, "description": "Score-Based Generative Modeling through Stochastic Differential Equations (SDE) for unconditional image generation. This model achieves record-breaking performance on CIFAR-10 and can generate high fidelity images of size 1024 x 1024.", "model_name": "google/ncsnpp-church-256"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "TehVenom/PPO_Pygway-V8p4_Dev-6b", "api_call": "pipeline('text-generation', model='TehVenom/PPO_Pygway-V8p4_Dev-6b')", "performance": {"dataset": "", "accuracy": ""}, "description": "TODO card. Mix of (GPT-J-6B-Janeway + PPO_HH_GPT-J) + Pygmalion-6b-DEV (V8 / Part 4). At a ratio of GPT-J-6B-Janeway - 20%, PPO_HH_GPT-J - 20%, Pygmalion-6b DEV (V8 / Part 4) - 60%.", "model_name": "TehVenom/PPO_Pygway-V8p4_Dev-6b"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Transformers", "functionality": "Transcription", "api_name": "facebook/wav2vec2-base-960h", "api_call": "Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-base-960h')", "performance": {"dataset": "LibriSpeech", "accuracy": {"clean": 3.4, "other": 8.6}}, "description": "Facebook's Wav2Vec2 base model pretrained and fine-tuned on 960 hours of Librispeech on 16kHz sampled speech audio. It is designed for automatic speech recognition and can transcribe audio files.", "model_name": "facebook/wav2vec2-base-960h"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "typeform/squeezebert-mnli", "api_call": "AutoModel.from_pretrained('typeform/squeezebert-mnli')", "performance": {"dataset": "multi_nli", "accuracy": "not provided"}, "description": "SqueezeBERT is a transformer model designed for efficient inference on edge devices. This specific model, typeform/squeezebert-mnli, is fine-tuned on the MultiNLI dataset for zero-shot classification tasks.", "model_name": "typeform/squeezebert-mnli"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Text Classification", "api_name": "shahrukhx01/question-vs-statement-classifier", "api_call": "AutoModelForSequenceClassification.from_pretrained('shahrukhx01/question-vs-statement-classifier')", "performance": {"dataset": "Haystack", "accuracy": "Not provided"}, "description": "Trained to add the feature for classifying queries between Question Query vs Statement Query using classification in Haystack", "model_name": "shahrukhx01/question-vs-statement-classifier"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "bigscience/bloomz-560m", "api_call": "AutoModelForCausalLM.from_pretrained('bigscience/bloomz-560m')", "performance": {"dataset": "bigscience/xP3", "accuracy": {"Winogrande XL (xl) validation set": 52.41, "XWinograd (en) test set": 51.01, "XWinograd (fr) test set": 51.81, "XWinograd (jp) test set": 52.03, "XWinograd (pt) test set": 53.99, "XWinograd (ru) test set": 53.97, "XWinograd (zh) test set": 54.76, "ANLI (r1) validation set": 33.4, "ANLI (r2) validation set": 33.4, "ANLI (r3) validation set": 33.5}}, "description": "BLOOMZ & mT0 are a family of models capable of following human instructions in dozens of languages zero-shot. Finetuned on the crosslingual task mixture (xP3), these models can generalize to unseen tasks & languages. Useful for tasks expressed in natural language, such as translation, summarization, and question answering.", "model_name": "bigscience/bloomz-560m"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "swin2SR-classical-sr-x4-64", "api_call": "pipeline('image-super-resolution', model='caidas/swin2SR-classical-sr-x4-64')", "performance": {"dataset": "", "accuracy": ""}, "description": "Swin2SR model that upscales images x4. It was introduced in the paper Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration by Conde et al. and first released in this repository. This model is intended for image super resolution.", "model_name": "caidas/swin2SR-classical-sr-x4-64"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "ocariz/universe_1400", "api_call": "DDPMPipeline.from_pretrained('ocariz/universe_1400')", "performance": {"dataset": "", "accuracy": ""}, "description": "This model is a diffusion model for unconditional image generation of the universe trained for 1400 epochs.", "model_name": "ocariz/universe_1400"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "DataIntelligenceTeam/eurocorpV4", "api_call": "AutoModelForTokenClassification.from_pretrained('DataIntelligenceTeam/eurocorpV4')", "performance": {"dataset": "sroie", "accuracy": 0.982}, "description": "This model is a fine-tuned version of microsoft/layoutlmv3-large on the sroie dataset. It achieves the following results on the evaluation set: Loss: 0.1239, Precision: 0.9548, Recall: 0.9602, F1: 0.9575, Accuracy: 0.9819", "model_name": "DataIntelligenceTeam/eurocorpV4"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "dsba-lab/koreapas-finetuned-korwikitq", "api_call": "pipeline('table-question-answering', model='dsba-lab/koreapas-finetuned-korwikitq')", "performance": {"dataset": "korwikitq", "accuracy": null}, "description": "A Korean Table Question Answering model finetuned on the korwikitq dataset.", "model_name": "dsba-lab/koreapas-finetuned-korwikitq"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Audio Classification", "api_name": "distil-ast-audioset", "api_call": "AutoModelForSequenceClassification.from_pretrained('bookbot/distil-ast-audioset')", "performance": {"dataset": "AudioSet", "accuracy": 0.0714}, "description": "Distil Audio Spectrogram Transformer AudioSet is an audio classification model based on the Audio Spectrogram Transformer architecture. This model is a distilled version of MIT/ast-finetuned-audioset-10-10-0.4593 on the AudioSet dataset.", "model_name": "bookbot/distil-ast-audioset"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "decapoda-research/llama-13b-hf", "api_call": "pipeline('text-generation', model='decapoda-research/llama-13b-hf')", "performance": {"dataset": [{"name": "BoolQ", "accuracy": "85.3"}, {"name": "PIQA", "accuracy": "82.8"}, {"name": "SIQA", "accuracy": "52.3"}, {"name": "HellaSwag", "accuracy": "84.2"}, {"name": "WinoGrande", "accuracy": "77"}, {"name": "ARC-e", "accuracy": "81.5"}, {"name": "ARC-c", "accuracy": "56"}, {"name": "OBQACOPA", "accuracy": "60.2"}]}, "description": "LLaMA-13B is an auto-regressive language model based on the transformer architecture developed by the FAIR team of Meta AI. It is designed for research purposes, such as question answering, natural language understanding, and reading comprehension. The model has been trained on a variety of sources, including web data, GitHub, Wikipedia, and books in 20 languages. It has been evaluated on several benchmarks, including BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, and OpenBookQA.", "model_name": "decapoda-research/llama-13b-hf"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Semantic Segmentation", "api_name": "nvidia/segformer-b5-finetuned-cityscapes-1024-1024", "api_call": "SegformerForSemanticSegmentation.from_pretrained('nvidia/segformer-b5-finetuned-cityscapes-1024-1024')", "performance": {"dataset": "CityScapes", "accuracy": "Not provided"}, "description": "SegFormer model fine-tuned on CityScapes at resolution 1024x1024. It was introduced in the paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Xie et al. and first released in this repository.", "model_name": "nvidia/segformer-b5-finetuned-cityscapes-1024-1024"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Denoising Diffusion Probabilistic Models (DDPM)", "api_name": "google/ddpm-ema-celebahq-256", "api_call": "DDPMPipeline.from_pretrained('google/ddpm-ema-celebahq-256')", "performance": {"dataset": {"CIFAR10": {"Inception_score": 9.46, "FID_score": 3.17}, "LSUN": {"sample_quality": "similar to ProgressiveGAN"}}}, "description": "High quality image synthesis using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.", "model_name": "google/ddpm-ema-celebahq-256"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "openmmlab/upernet-convnext-small", "api_call": "UperNetModel.from_pretrained('openmmlab/upernet-convnext-small')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "UperNet framework for semantic segmentation, leveraging a ConvNeXt backbone. UperNet was introduced in the paper Unified Perceptual Parsing for Scene Understanding by Xiao et al. Combining UperNet with a ConvNeXt backbone was introduced in the paper A ConvNet for the 2020s.", "model_name": "openmmlab/upernet-convnext-small"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Entity Extraction", "api_name": "903929564", "api_call": "AutoModelForTokenClassification.from_pretrained('ismail-lucifer011/autotrain-job_all-903929564', use_auth_token=True)", "performance": {"dataset": "ismail-lucifer011/autotrain-data-job_all", "accuracy": 0.998941201}, "description": "A Token Classification model trained using AutoTrain for Entity Extraction. The model is based on DistilBERT and achieves high accuracy, precision, recall, and F1 score.", "model_name": "ismail-lucifer011/autotrain-job_all-903929564"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "Zixtrauce/BDBot4Epoch", "api_call": "pipeline('text-generation', model='Zixtrauce/BDBot4Epoch')", "performance": {"dataset": "unknown", "accuracy": "unknown"}, "description": "BrandonBot4Epochs is a conversational model based on the GPT-2 architecture for text generation. It can be used to generate responses in a chatbot-like interface.", "model_name": "Zixtrauce/BDBot4Epoch"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221215-095508", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221215-095508')", "performance": {"dataset": "DIODE", "accuracy": null}, "description": "A depth estimation model fine-tuned on the DIODE dataset using the GLPN model architecture.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221215-095508"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "tiny-random-CLIPSegModel", "api_call": "pipeline('zero-shot-image-classification', model='hf-tiny-model-private/tiny-random-CLIPSegModel')", "performance": {"dataset": "", "accuracy": ""}, "description": "A tiny random CLIPSegModel for zero-shot image classification.", "model_name": "hf-tiny-model-private/tiny-random-CLIPSegModel"}
{"domain": "Reinforcement Learning", "framework": "ML-Agents", "functionality": "SoccerTwos", "api_name": "0xid/poca-SoccerTwos", "api_call": "mlagents-load-from-hf --repo-id='0xid/poca-SoccerTwos' --local-dir='./downloads'", "performance": {"dataset": "SoccerTwos", "accuracy": "N/A"}, "description": "A trained model of a poca agent playing SoccerTwos using the Unity ML-Agents Library.", "model_name": "0xid/poca-SoccerTwos"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Sentiment Classification", "api_name": "hackathon-pln-es/wav2vec2-base-finetuned-sentiment-classification-MESD", "api_call": "Wav2Vec2ForSequenceClassification.from_pretrained('hackathon-pln-es/wav2vec2-base-finetuned-sentiment-classification-MESD')", "performance": {"dataset": "MESD", "accuracy": 0.9308}, "description": "This model is a fine-tuned version of facebook/wav2vec2-base on the MESD dataset. It is trained to classify the underlying sentiment of Spanish audio/speech.", "model_name": "hackathon-pln-es/wav2vec2-base-finetuned-sentiment-classification-MESD"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Conversational", "api_name": "facebook/blenderbot-3B", "api_call": "BlenderbotForConditionalGeneration.from_pretrained('facebook/blenderbot-3B')", "performance": {"dataset": "blended_skill_talk", "accuracy": "Not provided"}, "description": "BlenderBot-3B is a large-scale neural model designed for open-domain chatbot applications. It is trained on the blended_skill_talk dataset and can engage in multi-turn conversations, providing engaging talking points, asking and answering questions, and displaying knowledge, empathy, and personality. The model is available through the Hugging Face Transformers library.", "model_name": "facebook/blenderbot-3B"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Image Captioning", "api_name": "blip-image-captioning-base", "api_call": "BlipForConditionalGeneration.from_pretrained('Salesforce/blip-image-captioning-base')", "performance": {"dataset": "COCO", "accuracy": {"CIDEr": "+2.8%"}}, "description": "BLIP (Bootstrapping Language-Image Pre-training) is a new vision-language pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks. It effectively utilizes noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. This model is pre-trained on the COCO dataset with a base architecture (ViT base backbone).", "model_name": "Salesforce/blip-image-captioning-base"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Named Entity Recognition", "api_name": "flair/ner-english-large", "api_call": "SequenceTagger.load('flair/ner-english-large')", "performance": {"dataset": "conll2003", "accuracy": "94.36"}, "description": "This is the large 4-class NER model for English that ships with Flair. It predicts 4 tags: PER (person name), LOC (location name), ORG (organization name), and MISC (other name). The model is based on document-level XLM-R embeddings and FLERT.", "model_name": "flair/ner-english-large"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "License Plate Detection", "api_name": "keremberke/yolov5m-license-plate", "api_call": "yolov5.load('keremberke/yolov5m-license-plate')", "performance": {"dataset": "keremberke/license-plate-object-detection", "accuracy": 0.988}, "description": "A YOLOv5 model for license plate detection trained on a custom dataset. The model can detect license plates in images with high accuracy.", "model_name": "keremberke/yolov5m-license-plate"}
{"domain": "Audio Voice Activity Detection", "framework": "Hugging Face Transformers", "functionality": "Voice Activity Detection, Overlapped Speech Detection, Resegmentation", "api_name": "pyannote/segmentation", "api_call": "VoiceActivityDetection(segmentation='pyannote/segmentation')", "performance": {"dataset": {"ami": {"accuracy": {"onset": 0.684, "offset": 0.577, "min_duration_on": 0.181, "min_duration_off": 0.037}}, "dihard": {"accuracy": {"onset": 0.767, "offset": 0.377, "min_duration_on": 0.136, "min_duration_off": 0.067}}, "voxconverse": {"accuracy": {"onset": 0.767, "offset": 0.713, "min_duration_on": 0.182, "min_duration_off": 0.501}}}}, "description": "Model from End-to-end speaker segmentation for overlap-aware resegmentation, by Herv\u00e9 Bredin and Antoine Laurent. It provides voice activity detection, overlapped speech detection, and resegmentation functionalities.", "model_name": "pyannote/segmentation"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "tiennvcs/layoutlmv2-large-uncased-finetuned-infovqa", "api_call": "AutoModelForDocumentQuestionAnswering.from_pretrained('tiennvcs/layoutlmv2-large-uncased-finetuned-infovqa')", "performance": {"dataset": "unknown", "accuracy": {"Loss": 2.2207}}, "description": "This model is a fine-tuned version of microsoft/layoutlmv2-large-uncased on an unknown dataset.", "model_name": "tiennvcs/layoutlmv2-large-uncased-finetuned-infovqa"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "text2text-generation", "api_name": "google/pegasus-newsroom", "api_call": "pipeline('summarization', model='google/pegasus-newsroom')", "performance": {"dataset": "newsroom", "accuracy": "45.98/34.20/42.18"}, "description": "PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. The model is trained on both C4 and HugeNews datasets and is designed for summarization tasks.", "model_name": "google/pegasus-newsroom"}
{"domain": "Multimodal Graph Machine Learning", "framework": "Hugging Face Transformers", "functionality": "Graph Classification", "api_name": "graphormer-base-pcqm4mv2", "api_call": "AutoModel.from_pretrained('clefourrier/graphormer-base-pcqm4mv2')", "performance": {"dataset": "PCQM4M-LSCv2", "accuracy": "Not provided"}, "description": "The Graphormer is a graph Transformer model, pretrained on PCQM4M-LSCv2. Developed by Microsoft, it is designed for graph classification tasks or graph representation tasks, such as molecule modeling.", "model_name": "clefourrier/graphormer-base-pcqm4mv2"}
{"domain": "Tabular Tabular Regression", "framework": "Scikit-learn", "functionality": "baseline-trainer", "api_name": "merve/tips5wx_sbh5-tip-regression", "api_call": "joblib.load(hf_hub_download('merve/tips5wx_sbh5-tip-regression', 'sklearn_model.joblib'))", "performance": {"dataset": "tips5wx_sbh5", "r2": 0.389363, "neg_mean_squared_error": -1.092356}, "description": "Baseline Model trained on tips5wx_sbh5 to apply regression on tip", "model_name": "merve/tips5wx_sbh5-tip-regression"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "PyTorch Transformers", "functionality": "Table Question Answering", "api_name": "microsoft/tapex-large-finetuned-wikisql", "api_call": "BartForConditionalGeneration.from_pretrained('microsoft/tapex-large-finetuned-wikisql')", "performance": {"dataset": "wikisql", "accuracy": "N/A"}, "description": "TAPEX (Table Pre-training via Execution) is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries. TAPEX is based on the BART architecture, the transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. This model is the tapex-large model fine-tuned on the WikiSQL dataset.", "model_name": "microsoft/tapex-large-finetuned-wikisql"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "glpn-nyu-finetuned-diode-221116-104421", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221116-104421')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.3736, "Mae": 0.3079, "Rmse": 0.4321, "Abs Rel": 0.3666, "Log Mae": 0.1288, "Log Rmse": 0.1794, "Delta1": 0.4929, "Delta2": 0.7934, "Delta3": 0.9234}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221116-104421"}
{"domain": "Reinforcement Learning", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "edbeeching/decision-transformer-gym-hopper-expert", "api_call": "AutoModel.from_pretrained('edbeeching/decision-transformer-gym-hopper-expert')", "performance": {"dataset": "Gym Hopper environment", "accuracy": "Not provided"}, "description": "Decision Transformer model trained on expert trajectories sampled from the Gym Hopper environment", "model_name": "edbeeching/decision-transformer-gym-hopper-expert"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Automatic Speech Recognition", "api_name": "cpierse/wav2vec2-large-xlsr-53-esperanto", "api_call": "Wav2Vec2ForCTC.from_pretrained('cpierse/wav2vec2-large-xlsr-53-esperanto')", "performance": {"dataset": "common_voice", "accuracy": "12.31%"}, "description": "Fine-tuned facebook/wav2vec2-large-xlsr-53 on Esperanto using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz.", "model_name": "cpierse/wav2vec2-large-xlsr-53-esperanto"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "martinezomg/vit-base-patch16-224-diabetic-retinopathy", "api_call": "pipeline('image-classification', 'martinezomg/vit-base-patch16-224-diabetic-retinopathy')", "performance": {"dataset": "None", "accuracy": 0.7744}, "description": "This model is a fine-tuned version of google/vit-base-patch16-224 on an unspecified dataset. It is designed for image classification tasks, specifically for diabetic retinopathy detection.", "model_name": "martinezomg/vit-base-patch16-224-diabetic-retinopathy"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "PyTorch Transformers", "functionality": "Table-based QA", "api_name": "neulab/omnitab-large-1024shot", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('neulab/omnitab-large-1024shot')", "performance": {"dataset": "wikitablequestions", "accuracy": "Not provided"}, "description": "OmniTab is a table-based QA model proposed in OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering. neulab/omnitab-large-1024shot (based on BART architecture) is initialized with microsoft/tapex-large and continuously pretrained on natural and synthetic data (SQL2NL model trained in the 1024-shot setting).", "model_name": "neulab/omnitab-large-1024shot"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Embeddings", "api_name": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2", "api_call": "SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Automated evaluation"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Transformers", "functionality": "Cross-Encoder for Natural Language Inference", "api_name": "cross-encoder/nli-deberta-v3-base", "api_call": "CrossEncoder('cross-encoder/nli-deberta-v3-base')", "performance": {"dataset": {"SNLI-test": "92.38", "MNLI mismatched set": "90.04"}}, "description": "This model is based on microsoft/deberta-v3-base and was trained on the SNLI and MultiNLI datasets. For a given sentence pair, it will output three scores corresponding to the labels: contradiction, entailment, neutral.", "model_name": "cross-encoder/nli-deberta-v3-base"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind", "api_call": "CLIPModel.from_pretrained('laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind')", "performance": {"dataset": "ImageNet-1k", "accuracy": "79.1 - 79.4"}, "description": "A series of CLIP ConvNeXt-XXLarge models trained on LAION-2B (English), a subset of LAION-5B, using OpenCLIP. These models achieve between 79.1 and 79.4 top-1 zero-shot accuracy on ImageNet-1k. The models can be used for zero-shot image classification, image and text retrieval, and other related tasks.", "model_name": "laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "valhalla/t5-base-e2e-qg", "api_call": "pipeline('e2e-qg', model='valhalla/t5-base-e2e-qg')", "performance": {"dataset": "squad", "accuracy": "N/A"}, "description": "This is a T5-base model trained for the end-to-end question generation task. Simply input the text and the model will generate multiple questions. You can try the model using the inference API: enter the text and see the results.", "model_name": "valhalla/t5-base-e2e-qg"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-fr-es", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-fr-es')", "performance": {"dataset": "opus", "accuracy": {"BLEU": {"newssyscomb2009.fr.es": 34.3, "news-test2008.fr.es": 32.5, "newstest2009.fr.es": 31.6, "newstest2010.fr.es": 36.5, "newstest2011.fr.es": 38.3, "newstest2012.fr.es": 38.1, "newstest2013.fr.es": 34.0, "Tatoeba.fr.es": 53.2}, "chr-F": {"newssyscomb2009.fr.es": 0.601, "news-test2008.fr.es": 0.583, "newstest2009.fr.es": 0.586, "newstest2010.fr.es": 0.616, "newstest2011.fr.es": 0.622, "newstest2012.fr.es": 0.619, "newstest2013.fr.es": 0.587, "Tatoeba.fr.es": 0.709}}}, "description": "A French to Spanish translation model trained on the OPUS dataset using the Hugging Face Transformers library. The model is based on the transformer-align architecture and uses normalization and SentencePiece for pre-processing.", "model_name": "Helsinki-NLP/opus-mt-fr-es"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Document-level embeddings of research papers", "api_name": "malteos/scincl", "api_call": "AutoModel.from_pretrained('malteos/scincl')", "performance": {"dataset": "SciDocs", "accuracy": {"mag-f1": 81.2, "mesh-f1": 89.0, "co-view-map": 85.3, "co-view-ndcg": 92.2, "co-read-map": 87.7, "co-read-ndcg": 94.0, "cite-map": 93.6, "cite-ndcg": 97.4, "cocite-map": 91.7, "cocite-ndcg": 96.5, "recomm-ndcg": 54.3, "recomm-P@1": 19.6}}, "description": "SciNCL is a pre-trained BERT language model to generate document-level embeddings of research papers. It uses the citation graph neighborhood to generate samples for contrastive learning. Prior to the contrastive training, the model is initialized with weights from scibert-scivocab-uncased. The underlying citation embeddings are trained on the S2ORC citation graph.", "model_name": "malteos/scincl"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Diffusers", "api_name": "pravsels/ddpm-ffhq-vintage-finetuned-vintage-3epochs", "api_call": "DDPMPipeline.from_pretrained('pravsels/ddpm-ffhq-vintage-finetuned-vintage-3epochs')", "performance": {"dataset": "", "accuracy": ""}, "description": "Example Fine-Tuned Model for Unit 2 of the Diffusion Models Class", "model_name": "pravsels/ddpm-ffhq-vintage-finetuned-vintage-3epochs"}
{"domain": "Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "Rajaram1996/Hubert_emotion", "api_call": "HubertForSpeechClassification.from_pretrained('Rajaram1996/Hubert_emotion')", "performance": {"dataset": "unknown", "accuracy": "unknown"}, "description": "A pretrained model for predicting emotion in local audio files using Hubert.", "model_name": "Rajaram1996/Hubert_emotion"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-small-finetuned-sqa", "api_call": "pipeline('table-question-answering', model='google/tapas-small-finetuned-sqa')", "performance": {"dataset": "msr_sqa", "accuracy": 0.6155}, "description": "TAPAS small model fine-tuned on Sequential Question Answering (SQA). It uses relative position embeddings (i.e. resetting the position index at every cell of the table).", "model_name": "google/tapas-small-finetuned-sqa"}
{"domain": "Audio Automatic Speech Recognition", "framework": "PyTorch Transformers", "functionality": "Automatic Speech Recognition", "api_name": "ravirajoshi/wav2vec2-large-xls-r-300m-marathi", "api_call": "Wav2Vec2ForCTC.from_pretrained('ravirajoshi/wav2vec2-large-xls-r-300m-marathi')", "performance": {"dataset": "None", "accuracy": {"Loss": 0.5656, "Wer": 0.2156}}, "description": "This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on an unspecified dataset. It is designed for Automatic Speech Recognition in the Marathi language.", "model_name": "ravirajoshi/wav2vec2-large-xls-r-300m-marathi"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Information Retrieval", "api_name": "cross-encoder/ms-marco-TinyBERT-L-2-v2", "api_call": "AutoModelForSequenceClassification.from_pretrained('cross-encoder/ms-marco-TinyBERT-L-2-v2')", "performance": {"dataset": "TREC Deep Learning 2019", "accuracy": "69.84 (NDCG@10)"}, "description": "This model was trained on the MS Marco Passage Ranking task. It can be used for Information Retrieval: Given a query, encode the query with all possible passages (e.g. retrieved with ElasticSearch). Then sort the passages in a decreasing order. The training code is available here: SBERT.net Training MS Marco.", "model_name": "cross-encoder/ms-marco-TinyBERT-L-2-v2"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification, Feature Map Extraction, Image Embeddings", "api_name": "convnext_base.fb_in1k", "api_call": "timm.create_model('convnext_base.fb_in1k', pretrained=True)", "performance": {"dataset": "imagenet-1k", "accuracy": "83.82%"}, "description": "A ConvNeXt image classification model pretrained on ImageNet-1k by paper authors. It can be used for image classification, feature map extraction, and image embeddings.", "model_name": "timm/convnext_base.fb_in1k"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Generation", "api_name": "stabilityai/stable-diffusion-2-1", "api_call": "StableDiffusionPipeline.from_pretrained('stabilityai/stable-diffusion-2-1', torch_dtype=torch.float16)", "performance": {"dataset": "COCO2017", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion v2-1 is a diffusion-based text-to-image generation model developed by Robin Rombach and Patrick Esser. It is capable of generating and modifying images based on text prompts in English. The model is trained on a subset of the LAION-5B dataset and is primarily intended for research purposes.", "model_name": "stabilityai/stable-diffusion-2-1"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-mini-finetuned-wtq", "api_call": "AutoModelForTableQuestionAnswering.from_pretrained('google/tapas-mini-finetuned-wtq')", "performance": {"dataset": "wikitablequestions", "accuracy": 0.2854}, "description": "TAPAS mini model fine-tuned on WikiTable Questions (WTQ). It is pretrained on a large corpus of English data from Wikipedia and can be used for answering questions related to a table.", "model_name": "google/tapas-mini-finetuned-wtq"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-ViT-L-14-laion2B-s32B-b82K", "api_call": "CLIPModel.from_pretrained('laion/CLIP-ViT-L-14-laion2B-s32B-b82K')", "performance": {"dataset": "ImageNet-1k", "accuracy": 75.3}, "description": "A CLIP ViT L/14 model trained with the LAION-2B English subset of LAION-5B using OpenCLIP. Intended for research purposes and exploring zero-shot, arbitrary image classification. Can be used for interdisciplinary studies of the potential impact of such a model.", "model_name": "laion/CLIP-ViT-L-14-laion2B-s32B-b82K"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "mio/amadeus", "api_call": "./run.sh --skip_data_prep false --skip_train true --download_model mio/amadeus", "performance": {"dataset": "amadeus", "accuracy": "Not provided"}, "description": "This model was trained by mio using amadeus recipe in espnet.", "model_name": "mio/amadeus"}
{"domain": "Natural Language Processing Text Generation", "framework": "Transformers", "functionality": "Text Generation", "api_name": "bigscience/bloom-560m", "api_call": "pipeline('text-generation', model='bigscience/bloom-560m')", "performance": {"dataset": "Validation", "accuracy": {"Training Loss": 2.0, "Validation Loss": 2.2, "Perplexity": 8.9}}, "description": "BLOOM LM is a large open-science, open-access multilingual language model developed by BigScience. It is a transformer-based language model trained on 45 natural languages and 12 programming languages. The model has 559,214,592 parameters, 24 layers, and 16 attention heads.", "model_name": "bigscience/bloom-560m"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "microsoft/deberta-v2-xxlarge", "api_call": "DebertaV2ForMaskedLM.from_pretrained('microsoft/deberta-v2-xxlarge')", "performance": {"dataset": [{"name": "SQuAD 1.1", "accuracy": "F1/EM: 96.1/91.4"}, {"name": "SQuAD 2.0", "accuracy": "F1/EM: 92.2/89.7"}, {"name": "MNLI-m/mm", "accuracy": "Acc: 91.7/91.9"}, {"name": "SST-2", "accuracy": "Acc: 97.2"}, {"name": "QNLI", "accuracy": "Acc: 96.0"}, {"name": "CoLA", "accuracy": "MCC: 72.0"}, {"name": "RTE", "accuracy": "Acc: 93.5"}, {"name": "MRPC", "accuracy": "Acc/F1: 93.1/94.9"}, {"name": "QQP", "accuracy": "Acc/F1: 92.7/90.3"}, {"name": "STS-B", "accuracy": "P/S: 93.2/93.1"}]}, "description": "DeBERTa improves the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. It outperforms BERT and RoBERTa on the majority of NLU tasks with 80GB training data. This is the DeBERTa V2 xxlarge model with 48 layers, 1536 hidden size. The total parameters are 1.5B and it is trained with 160GB raw data.", "model_name": "microsoft/deberta-v2-xxlarge"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Generation", "api_name": "stabilityai/stable-diffusion-2", "api_call": "StableDiffusionPipeline.from_pretrained('stabilityai/stable-diffusion-2', scheduler=EulerDiscreteScheduler.from_pretrained('stabilityai/stable-diffusion-2', subfolder='scheduler'), torch_dtype=torch.float16)", "performance": {"dataset": "COCO2017 validation set", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion v2 is a diffusion-based text-to-image generation model that can generate and modify images based on text prompts. It uses a fixed, pretrained text encoder (OpenCLIP-ViT/H) and is primarily intended for research purposes, such as safe deployment of models with potential to generate harmful content, understanding limitations and biases of generative models, and generation of artworks for design and artistic processes.", "model_name": "stabilityai/stable-diffusion-2"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "nikcheerla/nooks-amd-detection-realtime", "api_call": "SentenceTransformer('nikcheerla/nooks-amd-detection-realtime')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Automated evaluation"}, "description": "This is a sentence-transformers model that maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "nikcheerla/nooks-amd-detection-realtime"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "donut-base-finetuned-cord-v2", "api_call": "AutoModel.from_pretrained('naver-clova-ix/donut-base-finetuned-cord-v2')", "performance": {"dataset": "CORD", "accuracy": "Not provided"}, "description": "Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a tensor of embeddings (of shape batch_size, seq_len, hidden_size), after which the decoder autoregressively generates text, conditioned on the encoding of the encoder. This model is fine-tuned on CORD, a document parsing dataset.", "model_name": "naver-clova-ix/donut-base-finetuned-cord-v2"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/trocr-small-printed", "api_call": "VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-small-printed')", "performance": {"dataset": "SROIE", "accuracy": "Not specified"}, "description": "TrOCR model fine-tuned on the SROIE dataset. It was introduced in the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al. and first released in this repository. The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image encoder was initialized from the weights of DeiT, while the text decoder was initialized from the weights of UniLM.", "model_name": "microsoft/trocr-small-printed"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Named Entity Recognition", "api_name": "d4data/biomedical-ner-all", "api_call": "AutoModelForTokenClassification.from_pretrained('d4data/biomedical-ner-all')", "performance": {"dataset": "Maccrobat", "accuracy": "Not provided"}, "description": "An English Named Entity Recognition model, trained on Maccrobat to recognize the bio-medical entities (107 entities) from a given text corpus (case reports etc.). This model was built on top of distilbert-base-uncased.", "model_name": "d4data/biomedical-ner-all"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "shi-labs/oneformer_ade20k_swin_large", "api_call": "OneFormerForUniversalSegmentation.from_pretrained('shi-labs/oneformer_ade20k_swin_large')", "performance": {"dataset": "scene_parse_150", "accuracy": null}, "description": "OneFormer model trained on the ADE20k dataset (large-sized version, Swin backbone). It was introduced in the paper OneFormer: One Transformer to Rule Universal Image Segmentation by Jain et al. and first released in this repository. OneFormer is the first multi-task universal image segmentation framework. It needs to be trained only once with a single universal architecture, a single model, and on a single dataset, to outperform existing specialized models across semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference, all with a single model.", "model_name": "shi-labs/oneformer_ade20k_swin_large"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-en-zh", "api_call": "pipeline('translation_en_to_zh', model='Helsinki-NLP/opus-mt-en-zh')", "performance": {"dataset": "Tatoeba-test.eng.zho", "accuracy": {"BLEU": 31.4, "chr-F": 0.268}}, "description": "A translation model for English to Chinese using the Hugging Face Transformers library. It is based on the Marian NMT model and trained on the OPUS dataset. The model requires a sentence-initial language token in the form of '>>id<<' (id = valid target language ID).", "model_name": "Helsinki-NLP/opus-mt-en-zh"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "neulab/omnitab-large-finetuned-wtq", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('neulab/omnitab-large-finetuned-wtq')", "performance": {"dataset": "wikitablequestions", "accuracy": null}, "description": "OmniTab is a table-based QA model proposed in OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering. The original Github repository is https://github.com/jzbjyb/OmniTab.", "model_name": "neulab/omnitab-large-finetuned-wtq"}
{"domain": "Natural Language Processing Summarization", "framework": "Transformers", "functionality": "text2text-generation", "api_name": "sshleifer/distilbart-cnn-12-6", "api_call": "BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-cnn-12-6')", "performance": {"dataset": [{"name": "cnn_dailymail", "accuracy": {"Rouge 2": "22.12", "Rouge-L": "36.99"}}]}, "description": "DistilBART is a distilled version of BART, a model for text summarization. This specific checkpoint, 'sshleifer/distilbart-cnn-12-6', is trained on the cnn_dailymail dataset and provides a fast and effective way to generate summaries of text. The model can be loaded using the Hugging Face Transformers library.", "model_name": "sshleifer/distilbart-cnn-12-6"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "layoutlm-vqa", "api_call": "pipeline('question-answering', model='pardeepSF/layoutlm-vqa')", "performance": {"dataset": "", "accuracy": ""}, "description": "A model for document question answering using the LayoutLM architecture.", "model_name": "pardeepSF/layoutlm-vqa"}
{"domain": "Audio Audio-to-Audio", "framework": "SpeechBrain", "functionality": "Audio Source Separation", "api_name": "sepformer-wsj02mix", "api_call": "separator.from_hparams(source='speechbrain/sepformer-wsj02mix')", "performance": {"dataset": "WSJ0-2Mix", "accuracy": "22.4 dB"}, "description": "This repository provides all the necessary tools to perform audio source separation with a SepFormer model, implemented with SpeechBrain, and pretrained on WSJ0-2Mix dataset.", "model_name": "speechbrain/sepformer-wsj02mix"}
{"domain": "Audio Automatic Speech Recognition", "framework": "pyannote.audio", "functionality": "Speaker Diarization", "api_name": "pyannote/speaker-diarization", "api_call": "Pipeline.from_pretrained('pyannote/speaker-diarization@2.1', use_auth_token='ACCESS_TOKEN_GOES_HERE')", "performance": {"dataset": "ami", "accuracy": {"DER%": "18.91", "FA%": "4.48", "Miss%": "9.51", "Conf%": "4.91"}}, "description": "This API provides an automatic speaker diarization pipeline using the pyannote.audio framework. It can process audio files and output speaker diarization results in RTTM format. The pipeline can also handle cases where the number of speakers is known in advance or when providing lower and/or upper bounds on the number of speakers.", "model_name": "pyannote/speaker-diarization"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-en-ru", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-en-ru')", "performance": {"dataset": "newstest2019-enru", "accuracy": "27.1"}, "description": "Helsinki-NLP/opus-mt-en-ru is a translation model trained on the OPUS dataset, which translates English text to Russian. It is based on the Marian NMT framework and can be used with Hugging Face Transformers.", "model_name": "Helsinki-NLP/opus-mt-en-ru"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "keremberke/yolov8m-forklift-detection", "api_call": "YOLO('keremberke/yolov8m-forklift-detection')", "performance": {"dataset": "forklift-object-detection", "accuracy": 0.846}, "description": "A YOLOv8 model for detecting forklifts and persons in images.", "model_name": "keremberke/yolov8m-forklift-detection"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/trocr-base-handwritten", "api_call": "VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')", "performance": {"dataset": "IAM", "accuracy": "Not specified"}, "description": "TrOCR model fine-tuned on the IAM dataset. It was introduced in the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al. and first released in this repository. The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image encoder was initialized from the weights of BEiT, while the text decoder was initialized from the weights of RoBERTa.", "model_name": "microsoft/trocr-base-handwritten"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Transformers", "functionality": "Text2Text Generation", "api_name": "t5_sentence_paraphraser", "api_call": "T5ForConditionalGeneration.from_pretrained('ramsrigouthamg/t5_sentence_paraphraser')", "performance": {"dataset": "", "accuracy": ""}, "description": "A T5 model for paraphrasing sentences", "model_name": "ramsrigouthamg/t5_sentence_paraphraser"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "clipseg-rd64-refined", "api_call": "pipeline('image-segmentation', model='CIDAS/clipseg-rd64-refined')", "performance": {"dataset": "", "accuracy": ""}, "description": "CLIPSeg model with reduced dimension 64, refined (using a more complex convolution). It was introduced in the paper Image Segmentation Using Text and Image Prompts by L\u00fcddecke et al. and first released in this repository. This model is intended for zero-shot and one-shot image segmentation.", "model_name": "CIDAS/clipseg-rd64-refined"}
{"domain": "Computer Vision Image-to-Image", "framework": "Diffusers", "functionality": "Text-to-Image", "api_name": "lllyasviel/control_v11p_sd15_scribble", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_scribble')", "performance": {"dataset": "Stable Diffusion v1-5", "accuracy": "Not specified"}, "description": "Controlnet v1.1 is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on Scribble images. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5.", "model_name": "lllyasviel/control_v11p_sd15_scribble"}
{"domain": "Audio Voice Activity Detection", "framework": "Hugging Face Transformers", "functionality": "Speaker segmentation, Voice activity detection, Overlapped speech detection, Resegmentation, Raw scores", "api_name": "pyannote/segmentation", "api_call": "Model.from_pretrained('pyannote/segmentation', use_auth_token='ACCESS_TOKEN_GOES_HERE')", "performance": {"dataset": {"AMI Mix-Headset": {"voice_activity_detection_accuracy": {"onset": 0.684, "offset": 0.577, "min_duration_on": 0.181, "min_duration_off": 0.037}, "overlapped_speech_detection_accuracy": {"onset": 0.448, "offset": 0.362, "min_duration_on": 0.116, "min_duration_off": 0.187}, "resegmentation_accuracy": {"onset": 0.542, "offset": 0.527, "min_duration_on": 0.044, "min_duration_off": 0.705}}, "DIHARD3": {"voice_activity_detection_accuracy": {"onset": 0.767, "offset": 0.377, "min_duration_on": 0.136, "min_duration_off": 0.067}, "overlapped_speech_detection_accuracy": {"onset": 0.43, "offset": 0.32, "min_duration_on": 0.091, "min_duration_off": 0.14400000000000002}, "resegmentation_accuracy": {"onset": 0.592, "offset": 0.489, "min_duration_on": 0.163, "min_duration_off": 0.182}}, "VoxConverse": {"voice_activity_detection_accuracy": {"onset": 0.767, "offset": 0.713, "min_duration_on": 0.182, "min_duration_off": 0.501}, "overlapped_speech_detection_accuracy": {"onset": 0.587, "offset": 0.426, "min_duration_on": 0.337, "min_duration_off": 0.112}, "resegmentation_accuracy": {"onset": 0.537, "offset": 0.724, "min_duration_on": 0.41000000000000003, "min_duration_off": 0.5630000000000001}}}}, "description": "A pre-trained model for speaker segmentation, voice activity detection, overlapped speech detection, and resegmentation using the pyannote.audio framework.", "model_name": "pyannote/segmentation"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "blip2-opt-2.7b", "api_call": "Blip2ForConditionalGeneration.from_pretrained('Salesforce/blip2-opt-2.7b')", "performance": {"dataset": "LAION", "accuracy": "Not specified"}, "description": "BLIP-2 model, leveraging OPT-2.7b (a large language model with 2.7 billion parameters). It was introduced in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Li et al. and first released in this repository. The goal for the model is to predict the next text token, given the query embeddings and the previous text. This allows the model to be used for tasks like image captioning, visual question answering (VQA), and chat-like conversations by feeding the image and the previous conversation as prompt to the model.", "model_name": "Salesforce/blip2-opt-2.7b"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "facebook/tts_transformer-fr-cv7_css10", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/tts_transformer-fr-cv7_css10')", "performance": {"dataset": "common_voice", "accuracy": "N/A"}, "description": "Transformer text-to-speech model from fairseq S^2. French, single-speaker male voice. Pre-trained on Common Voice v7, fine-tuned on CSS10.", "model_name": "facebook/tts_transformer-fr-cv7_css10"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-ca-es", "api_call": "MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-ca-es'), MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-ca-es')", "performance": {"dataset": "Tatoeba.ca.es", "accuracy": {"BLEU": 74.9, "chr-F": 0.863}}, "description": "A Hugging Face model for translation from Catalan (ca) to Spanish (es), trained on the OPUS dataset and using the transformer-align architecture. The training data was pre-processed with normalization and SentencePiece.", "model_name": "Helsinki-NLP/opus-mt-ca-es"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Asteroid", "api_name": "JorisCos/ConvTasNet_Libri2Mix_sepnoisy_16k", "api_call": "BaseModel.from_pretrained('JorisCos/ConvTasNet_Libri2Mix_sepnoisy_16k')", "performance": {"dataset": "Libri2Mix", "accuracy": {"si_sdr": 10.6171309498, "si_sdr_imp": 12.551811413, "sdr": 11.2318674645, "sdr_imp": 13.0597650097, "sir": 24.461138353, "sir_imp": 24.3718564523, "sar": 11.5649982725, "sar_imp": 4.6625257058, "stoi": 0.8701085139, "stoi_imp": 0.224541802}}, "description": "This model was trained by Joris Cosentino using the librimix recipe in Asteroid. It was trained on the sep_noisy task of the Libri2Mix dataset.", "model_name": "JorisCos/ConvTasNet_Libri2Mix_sepnoisy_16k"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "csarron/bert-base-uncased-squad-v1", "api_call": "pipeline('question-answering', model='csarron/bert-base-uncased-squad-v1', tokenizer='csarron/bert-base-uncased-squad-v1')", "performance": {"dataset": "SQuAD1.1", "accuracy": {"EM": 80.9, "F1": 88.2}}, "description": "BERT-base uncased model fine-tuned on SQuAD v1. This model is case-insensitive and does not make a difference between english and English.", "model_name": "csarron/bert-base-uncased-squad-v1"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "facebook/tts_transformer-es-css10", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/tts_transformer-es-css10')", "performance": {"dataset": "CSS10", "accuracy": null}, "description": "Transformer text-to-speech model from fairseq S^2. Spanish single-speaker male voice trained on CSS10.", "model_name": "facebook/tts_transformer-es-css10"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "padmalcom/wav2vec2-large-emotion-detection-german", "api_call": "pipeline('audio-classification', model='padmalcom/wav2vec2-large-emotion-detection-german')", "performance": {"dataset": "emo-DB", "accuracy": "Not provided"}, "description": "This wav2vec2 based emotion detection model is trained on the emo-DB dataset. It can classify emotions in German audio files into seven classes: anger, boredom, disgust, fear, happiness, sadness, and neutral.", "model_name": "padmalcom/wav2vec2-large-emotion-detection-german"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "kan-bayashi_ljspeech_vits", "api_call": "pipeline('text-to-speech', model='espnet/kan-bayashi_ljspeech_vits')", "performance": {"dataset": "ljspeech", "accuracy": "Not mentioned"}, "description": "A Text-to-Speech model trained on the ljspeech dataset using the ESPnet toolkit. This model can be used to convert text input into synthesized speech.", "model_name": "espnet/kan-bayashi_ljspeech_vits"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "CZ_DVQA_layoutxlm-base", "api_call": "LayoutXLMForQuestionAnswering.from_pretrained('fimu-docproc-research/CZ_DVQA_layoutxlm-base')", "performance": {"dataset": "", "accuracy": ""}, "description": "A Document Question Answering model based on LayoutXLM.", "model_name": "fimu-docproc-research/CZ_DVQA_layoutxlm-base"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "flax-sentence-embeddings/all_datasets_v4_MiniLM-L6", "api_call": "SentenceTransformer('flax-sentence-embeddings/all_datasets_v4_MiniLM-L6')", "performance": {"dataset": "1,097,953,922", "accuracy": "N/A"}, "description": "The model is trained on very large sentence level datasets using a self-supervised contrastive learning objective. It is fine-tuned on a 1B sentence pairs dataset, and it aims to capture the semantic information of input sentences. The sentence vector can be used for information retrieval, clustering, or sentence similarity tasks.", "model_name": "flax-sentence-embeddings/all_datasets_v4_MiniLM-L6"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Transformers", "functionality": "Translation, Summarization, Question Answering, Text Classification", "api_name": "t5-base", "api_call": "T5Model.from_pretrained('t5-base')", "performance": {"dataset": "c4", "accuracy": "See research paper, Table 14"}, "description": "T5-Base is a Text-To-Text Transfer Transformer (T5) model with 220 million parameters. It is designed to perform various NLP tasks, including machine translation, document summarization, question answering, and text classification. The model is pre-trained on the Colossal Clean Crawled Corpus (C4) and can be used with the Transformers library.", "model_name": "t5-base"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Embeddings", "api_name": "sentence-transformers/paraphrase-MiniLM-L6-v2", "api_call": "SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/paraphrase-MiniLM-L6-v2"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Asteroid", "api_name": "JorisCos/DPTNet_Libri1Mix_enhsingle_16k", "api_call": "pipeline('audio-to-audio', model='JorisCos/DPTNet_Libri1Mix_enhsingle_16k')", "performance": {"dataset": "Libri1Mix", "accuracy": {"si_sdr": 14.8296700373, "si_sdr_imp": 11.3798887315, "sdr": 15.3957126447, "sdr_imp": 11.8930498455, "sir": "Infinity", "sir_imp": "NaN", "sar": 15.3957126447, "sar_imp": 11.8930498455, "stoi": 0.9301948391, "stoi_imp": 0.1342750156}}, "description": "This model was trained by Joris Cosentino using the librimix recipe in Asteroid. It was trained on the enh_single task of the Libri1Mix dataset.", "model_name": "JorisCos/DPTNet_Libri1Mix_enhsingle_16k"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "ahotrod/electra_large_discriminator_squad2_512", "api_call": "AutoModelForQuestionAnswering.from_pretrained('ahotrod/electra_large_discriminator_squad2_512')", "performance": {"dataset": "SQuAD2.0", "accuracy": {"exact": 87.0967741935, "f1": 89.9834383272}}, "description": "ELECTRA_large_discriminator language model fine-tuned on SQuAD2.0 for question answering tasks.", "model_name": "ahotrod/electra_large_discriminator_squad2_512"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face", "functionality": "Text2Text Generation", "api_name": "facebook/mbart-large-50-many-to-many-mmt", "api_call": "MBartForConditionalGeneration.from_pretrained('facebook/mbart-large-50-many-to-many-mmt')", "performance": {"dataset": "Multilingual Translation", "accuracy": "Not specified"}, "description": "mBART-50 many-to-many multilingual machine translation model can translate directly between any pair of 50 languages. It was introduced in the Multilingual Translation with Extensible Multilingual Pretraining and Finetuning paper.", "model_name": "facebook/mbart-large-50-many-to-many-mmt"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "microsoft/wavlm-large", "api_call": "WavLMModel.from_pretrained('microsoft/wavlm-large')", "performance": {"dataset": "SUPERB benchmark", "accuracy": "state-of-the-art performance"}, "description": "WavLM-Large is a large model pretrained on 16kHz sampled speech audio. It is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. WavLM is pretrained on 60,000 hours of Libri-Light, 10,000 hours of GigaSpeech, and 24,000 hours of VoxPopuli. It achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements for various speech processing tasks on their representative benchmarks.", "model_name": "microsoft/wavlm-large"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221221-102136", "api_call": "pipeline('depth-estimation', model='sayakpaul/glpn-nyu-finetuned-diode-221221-102136')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.4222, "Mae": 0.41100000000000003, "Rmse": 0.6292, "Abs Rel": 0.3778, "Log Mae": 0.1636, "Log Rmse": 0.224, "Delta1": 0.432, "Delta2": 0.6806, "Delta3": 0.8068000000000001}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221221-102136"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "deepset/roberta-base-squad2-distilled", "api_call": "AutoModel.from_pretrained('deepset/roberta-base-squad2-distilled')", "performance": {"dataset": "squad_v2", "accuracy": {"exact": 79.8366040596, "f1": 83.9164070799}}, "description": "This model is a distilled version of deepset/roberta-large-squad2, trained on SQuAD 2.0 dataset for question answering tasks. It is based on the Roberta architecture and has been fine-tuned using Haystack's distillation feature.", "model_name": "deepset/roberta-base-squad2-distilled"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "dmis-lab/biobert-base-cased-v1.2", "api_call": "pipeline('fill-mask', model='dmis-lab/biobert-base-cased-v1.2')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "BioBERT is a pre-trained biomedical language representation model for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering.", "model_name": "dmis-lab/biobert-base-cased-v1.2"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Text2Text Generation", "api_name": "t5-efficient-large-nl36_fine_tune_sum_V2", "api_call": "pipeline('summarization', model='Samuel-Fipps/t5-efficient-large-nl36_fine_tune_sum_V2')", "performance": {"dataset": [{"name": "samsum", "accuracy": {"ROUGE-1": 54.933, "ROUGE-2": 31.797, "ROUGE-L": 47.006, "ROUGE-LSUM": 51.203, "loss": 1.131, "gen_len": 23.799}}, {"name": "cnn_dailymail", "accuracy": {"ROUGE-1": 34.406, "ROUGE-2": 14.127, "ROUGE-L": 24.335, "ROUGE-LSUM": 31.658, "loss": 2.446, "gen_len": 45.928}}]}, "description": "A T5-based summarization model trained on the Samsum dataset. This model can be used for text-to-text generation tasks such as summarization without adding 'summarize' to the start of the input string. It has been fine-tuned for 10K steps with a batch size of 10.", "model_name": "Samuel-Fipps/t5-efficient-large-nl36_fine_tune_sum_V2"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "deepset/roberta-base-squad2-covid", "api_call": "pipeline('question-answering', model=RobertaForQuestionAnswering.from_pretrained('deepset/roberta-base-squad2-covid'), tokenizer=RobertaTokenizer.from_pretrained('deepset/roberta-base-squad2-covid'))", "performance": {"dataset": "squad_v2", "accuracy": {"XVAL_EM": 0.1789099526, "XVAL_f1": 0.49925444210000003, "XVAL_top_3_recall": 0.8021327014}}, "description": "This model is a Roberta-based model fine-tuned on SQuAD-style CORD-19 annotations for the task of extractive question answering in the context of COVID-19. It can be used with the Hugging Face Transformers library for question answering tasks.", "model_name": "deepset/roberta-base-squad2-covid"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/distiluse-base-multilingual-cased-v2", "api_call": "SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 512 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/distiluse-base-multilingual-cased-v2"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Spoken Language Identification", "api_name": "TalTechNLP/voxlingua107-epaca-tdnn", "api_call": "EncoderClassifier.from_hparams(source='TalTechNLP/voxlingua107-epaca-tdnn')", "performance": {"dataset": "VoxLingua107", "accuracy": "93%"}, "description": "This is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain. The model uses the ECAPA-TDNN architecture that has previously been used for speaker recognition. The model can classify a speech utterance according to the language spoken. It covers 107 different languages.", "model_name": "TalTechNLP/voxlingua107-epaca-tdnn"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "cl-tohoku/bert-base-japanese-char", "api_call": "AutoModelForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-char')", "performance": {"dataset": "wikipedia", "accuracy": "N/A"}, "description": "This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.", "model_name": "cl-tohoku/bert-base-japanese-char"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Image-to-Image", "api_name": "lllyasviel/sd-controlnet-hed", "api_call": "ControlNetModel.from_pretrained('lllyasviel/sd-controlnet-hed')", "performance": {"dataset": "3M edge-image, caption pairs", "accuracy": "Not provided"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on HED Boundary. It can be used in combination with Stable Diffusion.", "model_name": "lllyasviel/sd-controlnet-hed"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "bigcode/santacoder", "api_call": "AutoModelForCausalLM.from_pretrained('bigcode/santacoder', trust_remote_code=True)", "performance": {"dataset": "bigcode/the-stack", "accuracy": {"pass@1 on MultiPL HumanEval (Python)": 0.18, "pass@10 on MultiPL HumanEval (Python)": 0.29, "pass@100 on MultiPL HumanEval (Python)": 0.49, "pass@1 on MultiPL MBPP (Python)": 0.35000000000000003, "pass@10 on MultiPL MBPP (Python)": 0.58, "pass@100 on MultiPL MBPP (Python)": 0.77, "pass@1 on MultiPL HumanEval (JavaScript)": 0.16, "pass@10 on MultiPL HumanEval (JavaScript)": 0.27, "pass@100 on MultiPL HumanEval (JavaScript)": 0.47000000000000003, "pass@1 on MultiPL MBPP (Javascript)": 0.28, "pass@10 on MultiPL MBPP (Javascript)": 0.51, "pass@100 on MultiPL MBPP (Javascript)": 0.7000000000000001, "pass@1 on MultiPL HumanEval (Java)": 0.15, "pass@10 on MultiPL HumanEval (Java)": 0.26, "pass@100 on MultiPL HumanEval (Java)": 0.41000000000000003, "pass@1 on MultiPL MBPP (Java)": 0.28, "pass@10 on MultiPL MBPP (Java)": 0.44, "pass@100 on MultiPL MBPP (Java)": 0.59, "single_line on HumanEval FIM (Python)": 0.44, "single_line on MultiPL HumanEval FIM (Java)": 0.62, "single_line on MultiPL HumanEval FIM (JavaScript)": 0.6000000000000001, "BLEU on CodeXGLUE code-to-text (Python)": 18.13}}, "description": "The SantaCoder models are a series of 1.1B parameter models trained on the Python, Java, and JavaScript subset of The Stack (v1.1) (which excluded opt-out requests). The main model uses Multi Query Attention, was trained using near-deduplication and comment-to-code ratio as filtering criteria and using the Fill-in-the-Middle objective. In addition there are several models that were trained on datasets with different filter parameters and with architecture and objective variations.", "model_name": "bigcode/santacoder"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-Video Generation", "api_name": "mo-di-bear-guitar", "api_call": "TuneAVideoPipeline.from_pretrained('nitrosocke/mo-di-diffusion', unet=UNet3DConditionModel.from_pretrained('Tune-A-Video-library/mo-di-bear-guitar', subfolder='unet', torch_dtype=torch.float16), torch_dtype=torch.float16)", "performance": {"dataset": "Not mentioned", "accuracy": "Not mentioned"}, "description": "Tune-A-Video is a text-to-video generation model based on the Hugging Face framework. The model generates videos based on textual prompts in a modern Disney style.", "model_name": "nitrosocke/mo-di-diffusion"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "keremberke/yolov8m-blood-cell-detection", "api_call": "YOLO('keremberke/yolov8m-blood-cell-detection')", "performance": {"dataset": "blood-cell-object-detection", "accuracy": 0.927}, "description": "A YOLOv8 model for blood cell detection, including Platelets, RBC, and WBC. Trained on the blood-cell-object-detection dataset.", "model_name": "keremberke/yolov8m-blood-cell-detection"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "ntrant7/sd-class-butterflies-32", "api_call": "DDPMPipeline.from_pretrained('ntrant7/sd-class-butterflies-32')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "This model is a diffusion model for unconditional image generation of cute butterflies.", "model_name": "ntrant7/sd-class-butterflies-32"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Named Entity Recognition", "api_name": "Dizex/InstaFoodRoBERTa-NER", "api_call": "AutoModelForTokenClassification.from_pretrained('Dizex/InstaFoodRoBERTa-NER')", "performance": {"dataset": "Dizex/InstaFoodSet", "accuracy": {"f1": 0.91, "precision": 0.89, "recall": 0.93}}, "description": "InstaFoodRoBERTa-NER is a fine-tuned RoBERTa model that is ready to use for Named Entity Recognition of food entities in informal, social-media-style text. It has been trained to recognize a single entity: food (FOOD). Specifically, this model is a roberta-base model that was fine-tuned on a dataset consisting of 400 English Instagram posts related to food.", "model_name": "Dizex/InstaFoodRoBERTa-NER"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "License Plate Detection", "api_name": "keremberke/yolov5s-license-plate", "api_call": "yolov5.load('keremberke/yolov5s-license-plate')", "performance": {"dataset": "keremberke/license-plate-object-detection", "accuracy": 0.985}, "description": "A YOLOv5 based license plate detection model trained on a custom dataset.", "model_name": "keremberke/yolov5s-license-plate"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "facebook/timesformer-base-finetuned-ssv2", "api_call": "TimesformerForVideoClassification.from_pretrained('facebook/timesformer-base-finetuned-ssv2')", "performance": {"dataset": "Something Something v2", "accuracy": "Not provided"}, "description": "TimeSformer model pre-trained on Something Something v2. It was introduced in the paper Is Space-Time Attention All You Need for Video Understanding? by Bertasius et al. and first released in this repository.", "model_name": "facebook/timesformer-base-finetuned-ssv2"}
{"domain": "Audio Automatic Speech Recognition", "framework": "CTranslate2", "functionality": "Automatic Speech Recognition", "api_name": "guillaumekln/faster-whisper-large-v2", "api_call": "WhisperModel('large-v2')", "performance": {"dataset": "99 languages", "accuracy": "Not provided"}, "description": "Whisper large-v2 model for CTranslate2. This model can be used in CTranslate2 or projects based on CTranslate2 such as faster-whisper.", "model_name": "guillaumekln/faster-whisper-large-v2"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling", "api_name": "distilbert-base-multilingual-cased", "api_call": "pipeline('fill-mask', model='distilbert-base-multilingual-cased')", "performance": {"dataset": [{"name": "XNLI", "accuracy": {"English": 78.2, "Spanish": 69.1, "Chinese": 64.0, "German": 66.3, "Arabic": 59.1, "Urdu": 54.7}}]}, "description": "This model is a distilled version of the BERT base multilingual model. It is trained on the concatenation of Wikipedia in 104 different languages. The model has 6 layers, a hidden dimension of 768, and 12 heads, totaling 134M parameters. On average, this model, referred to as DistilmBERT, is twice as fast as mBERT-base.", "model_name": "distilbert-base-multilingual-cased"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/git-base-coco", "api_call": "pipeline('text-generation', model='microsoft/git-base-coco')", "performance": {"dataset": "COCO", "accuracy": "Refer to the paper for evaluation results."}, "description": "GIT (short for GenerativeImage2Text) model, base-sized version, fine-tuned on COCO. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. The model is a Transformer decoder conditioned on both CLIP image tokens and text tokens. It can be used for tasks like image and video captioning, visual question answering (VQA) on images and videos, and even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).", "model_name": "microsoft/git-base-coco"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "julien-c/hotdog-not-hotdog", "api_call": "pipeline('image-classification', model='julien-c/hotdog-not-hotdog')", "performance": {"dataset": "", "accuracy": 0.8250000000000001}, "description": "A model that classifies images as hotdog or not hotdog.", "model_name": "julien-c/hotdog-not-hotdog"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "deepset/xlm-roberta-large-squad2", "api_call": "AutoModelForQuestionAnswering.from_pretrained('deepset/xlm-roberta-large-squad2')", "performance": {"squad_v2": {"exact_match": 81.828, "f1": 84.889}}, "description": "Multilingual XLM-RoBERTa large model for extractive question answering on various languages. Trained on SQuAD 2.0 dataset and evaluated on SQuAD dev set, German MLQA, and German XQuAD.", "model_name": "deepset/xlm-roberta-large-squad2"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Transcription and Translation", "api_name": "openai/whisper-small", "api_call": "WhisperForConditionalGeneration.from_pretrained('openai/whisper-small')", "performance": {"dataset": "LibriSpeech (clean) test set", "accuracy": "3.432 WER"}, "description": "Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. It is a Transformer-based encoder-decoder model and supports transcription and translation in various languages.", "model_name": "openai/whisper-small"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "impira/layoutlm-document-qa", "api_call": "pipeline('question-answering', model=LayoutLMForQuestionAnswering.from_pretrained('impira/layoutlm-document-qa', return_dict=True))", "performance": {"dataset": ["SQuAD2.0", "DocVQA"], "accuracy": "Not provided"}, "description": "A fine-tuned version of the multi-modal LayoutLM model for the task of question answering on documents.", "model_name": "impira/layoutlm-document-qa"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Depth Estimation", "api_name": "lllyasviel/sd-controlnet-depth", "api_call": "ControlNetModel.from_pretrained('lllyasviel/sd-controlnet-depth')", "performance": {"dataset": "3M depth-image, caption pairs", "accuracy": "500 GPU-hours with Nvidia A100 80G using Stable Diffusion 1.5 as a base model"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on Depth estimation. It can be used in combination with Stable Diffusion.", "model_name": "lllyasviel/sd-controlnet-depth"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "facebook/detr-resnet-101-dc5", "api_call": "DetrForObjectDetection.from_pretrained('facebook/detr-resnet-101-dc5')", "performance": {"dataset": "COCO 2017 validation", "accuracy": "AP 44.9"}, "description": "DETR (End-to-End Object Detection) model with ResNet-101 backbone (dilated C5 stage). The model is trained on COCO 2017 object detection dataset and achieves an average precision (AP) of 44.9 on the COCO 2017 validation set.", "model_name": "facebook/detr-resnet-101-dc5"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "saltacc/anime-ai-detect", "api_call": "pipeline('image-classification', model='saltacc/anime-ai-detect')", "performance": {"dataset": "aibooru and imageboard sites", "accuracy": "96%"}, "description": "A BEiT classifier to see if anime art was made by an AI or a human.", "model_name": "saltacc/anime-ai-detect"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "functionality": "Masked Language Modeling", "api_name": "xlm-roberta-base", "api_call": "pipeline('fill-mask', model='xlm-roberta-base')", "performance": {"dataset": "CommonCrawl", "accuracy": "N/A"}, "description": "XLM-RoBERTa is a multilingual version of RoBERTa pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It can be used for masked language modeling and is intended to be fine-tuned on a downstream task.", "model_name": "xlm-roberta-base"}
{"domain": "Audio Audio-to-Audio", "framework": "Fairseq", "functionality": "Speech-to-speech translation", "api_name": "xm_transformer_unity_hk-en", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/xm_transformer_unity_hk-en')", "performance": {"dataset": ["TED", "drama", "TAT"], "accuracy": "Not specified"}, "description": "A speech-to-speech translation model with two-pass decoder (UnitY) trained on Hokkien-English data from TED, drama, and TAT domains. It uses Facebook's Unit HiFiGAN for speech synthesis.", "model_name": "facebook/xm_transformer_unity_hk-en"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Table Extraction", "api_name": "keremberke/yolov8m-table-extraction", "api_call": "YOLO('keremberke/yolov8m-table-extraction')", "performance": {"dataset": "table-extraction", "accuracy": 0.9520000000000001}, "description": "A YOLOv8 model for table extraction in images, capable of detecting both bordered and borderless tables. Trained using the keremberke/table-extraction dataset.", "model_name": "keremberke/yolov8m-table-extraction"}
{"domain": "Natural Language Processing Token Classification", "framework": "Flair", "functionality": "Named Entity Recognition", "api_name": "flair/ner-english-ontonotes-large", "api_call": "SequenceTagger.load('flair/ner-english-ontonotes-large')", "performance": {"dataset": "Ontonotes", "accuracy": 90.93}, "description": "English NER in Flair (Ontonotes large model). This is the large 18-class NER model for English that ships with Flair. It predicts 18 tags such as cardinal value, date value, event name, building name, geo-political entity, language name, law name, location name, money name, affiliation, ordinal value, organization name, percent value, person name, product name, quantity value, time value, and name of work of art. The model is based on document-level XLM-R embeddings and FLERT.", "model_name": "flair/ner-english-ontonotes-large"}
{"domain": "Audio Voice Activity Detection", "framework": "pyannote.audio", "functionality": "Automatic Speech Recognition", "api_name": "pyannote/voice-activity-detection", "api_call": "Pipeline.from_pretrained('pyannote/voice-activity-detection')", "performance": {"dataset": "ami", "accuracy": "Not specified"}, "description": "A pretrained voice activity detection pipeline that detects active speech in audio files.", "model_name": "pyannote/voice-activity-detection"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Visual Question Answering", "api_name": "Salesforce/blip-vqa-capfilt-large", "api_call": "BlipForQuestionAnswering.from_pretrained('Salesforce/blip-vqa-capfilt-large')", "performance": {"dataset": "VQA", "accuracy": "+1.6% in VQA score"}, "description": "BLIP is a new Vision-Language Pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks. It effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. The model achieves state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval, image captioning, and VQA.", "model_name": "Salesforce/blip-vqa-capfilt-large"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-ViT-g-14-laion2B-s34B-b88K", "api_call": "pipeline('zero-shot-image-classification', model='laion/CLIP-ViT-g-14-laion2B-s34B-b88K')", "performance": {"dataset": null, "accuracy": null}, "description": "A zero-shot image classification model based on OpenCLIP, which can classify images into various categories without requiring any training data for those categories.", "model_name": "laion/CLIP-ViT-g-14-laion2B-s34B-b88K"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "keremberke/yolov8m-csgo-player-detection", "api_call": "YOLO('keremberke/yolov8m-csgo-player-detection')", "performance": {"dataset": "csgo-object-detection", "accuracy": 0.892}, "description": "An object detection model trained to detect Counter-Strike: Global Offensive (CS:GO) players. The model is based on the YOLOv8 architecture and can identify 'ct', 'cthead', 't', and 'thead' labels.", "model_name": "keremberke/yolov8m-csgo-player-detection"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "microsoft/beit-base-patch16-224", "api_call": "BeitForImageClassification.from_pretrained('microsoft/beit-base-patch16-224')", "performance": {"dataset": "ImageNet", "accuracy": "Refer to tables 1 and 2 of the original paper"}, "description": "BEiT model pre-trained in a self-supervised fashion on ImageNet-21k (14 million images, 21,841 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224.", "model_name": "microsoft/beit-base-patch16-224"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "cambridgeltl/SapBERT-from-PubMedBERT-fulltext", "api_call": "AutoModel.from_pretrained('cambridgeltl/SapBERT-from-PubMedBERT-fulltext')", "performance": {"dataset": "UMLS", "accuracy": "N/A"}, "description": "SapBERT is a pretraining scheme that self-aligns the representation space of biomedical entities. It is trained with UMLS 2020AA (English only) and uses microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. The input should be a string of biomedical entity names, and the [CLS] embedding of the last layer is regarded as the output.", "model_name": "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-en-es", "api_call": "pipeline('translation_en_to_es', model='Helsinki-NLP/opus-mt-en-es')", "performance": {"dataset": "Tatoeba-test.eng.spa", "accuracy": 54.9}, "description": "This model is a translation model from English to Spanish using the Hugging Face Transformers library. It is based on the Marian framework and trained on the OPUS dataset. The model achieves a BLEU score of 54.9 on the Tatoeba test set.", "model_name": "Helsinki-NLP/opus-mt-en-es"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Named Entity Recognition", "api_name": "flair/ner-english-fast", "api_call": "SequenceTagger.load('flair/ner-english-fast')", "performance": {"dataset": "conll2003", "accuracy": "F1-Score: 92.92"}, "description": "This is the fast 4-class NER model for English that ships with Flair. It predicts 4 tags: PER (person name), LOC (location name), ORG (organization name), and MISC (other name). The model is based on Flair embeddings and LSTM-CRF.", "model_name": "flair/ner-english-fast"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Embeddings", "api_name": "sentence-transformers/nli-mpnet-base-v2", "api_call": "SentenceTransformer('sentence-transformers/nli-mpnet-base-v2')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Automated evaluation"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/nli-mpnet-base-v2"}
{"domain": "Natural Language Processing Question Answering", "framework": "PyTorch Transformers", "functionality": "Question Answering", "api_name": "deepset/bert-medium-squad2-distilled", "api_call": "AutoModel.from_pretrained('deepset/bert-medium-squad2-distilled')", "performance": {"dataset": "squad_v2", "exact": 68.6431398972, "f1": 72.7637083791}, "description": "This model is a distilled version of deepset/bert-large-uncased-whole-word-masking-squad2, trained on the SQuAD 2.0 dataset for question answering tasks. It is based on the BERT-medium architecture and uses the Hugging Face Transformers library.", "model_name": "deepset/bert-medium-squad2-distilled"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "Intel/dpt-hybrid-midas", "api_call": "DPTForDepthEstimation.from_pretrained('Intel/dpt-hybrid-midas', low_cpu_mem_usage=True)", "performance": {"dataset": "MIX 6", "accuracy": "11.06"}, "description": "Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation. Introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in this repository. DPT uses the Vision Transformer (ViT) as backbone and adds a neck + head on top for monocular depth estimation. This repository hosts the hybrid version of the model as stated in the paper. DPT-Hybrid diverges from DPT by using ViT-hybrid as a backbone and taking some activations from the backbone.", "model_name": "Intel/dpt-hybrid-midas"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Denoising Diffusion Probabilistic Models (DDPM)", "api_name": "google/ddpm-bedroom-256", "api_call": "DDPMPipeline.from_pretrained('google/ddpm-bedroom-256')", "performance": {"dataset": "CIFAR10", "accuracy": {"Inception score": 9.46, "FID score": 3.17}}, "description": "We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.", "model_name": "google/ddpm-bedroom-256"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/all-mpnet-base-v2", "api_call": "SentenceTransformer('sentence-transformers/all-mpnet-base-v2')", "performance": {"dataset": [{"name": "MS Marco", "accuracy": "Not provided"}]}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/all-mpnet-base-v2"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face", "functionality": "Visual Question Answering", "api_name": "JosephusCheung/GuanacoVQAOnConsumerHardware", "api_call": "pipeline('visual-question-answering', model='JosephusCheung/GuanacoVQAOnConsumerHardware')", "performance": {"dataset": "JosephusCheung/GuanacoVQADataset", "accuracy": "unknown"}, "description": "A Visual Question Answering model trained on the GuanacoVQADataset, designed to work on consumer hardware like Colab Free T4 GPU. The model can be used to answer questions about images.", "model_name": "JosephusCheung/GuanacoVQAOnConsumerHardware"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "Rakib/roberta-base-on-cuad", "api_call": "AutoModelForQuestionAnswering.from_pretrained('Rakib/roberta-base-on-cuad')", "performance": {"dataset": "cuad", "accuracy": "46.6%"}, "description": "This model is trained for the task of Question Answering on Legal Documents using the CUAD dataset. It is based on the RoBERTa architecture and can be used to extract answers from legal contracts and documents.", "model_name": "Rakib/roberta-base-on-cuad"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "optimum/roberta-base-squad2", "api_call": "AutoModelForQuestionAnswering.from_pretrained('deepset/roberta-base-squad2')", "performance": {"dataset": "squad_v2", "accuracy": {"exact": 79.8702939442, "f1": 82.9125116958}}, "description": "This is an ONNX conversion of the deepset/roberta-base-squad2 model for extractive question answering. It is trained on the SQuAD 2.0 dataset and is compatible with the Transformers library.", "model_name": "deepset/roberta-base-squad2"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Image Synthesis", "api_name": "google/ddpm-cifar10-32", "api_call": "DDPMPipeline.from_pretrained('google/ddpm-cifar10-32')", "performance": {"dataset": "CIFAR10", "accuracy": {"Inception_score": 9.46, "FID_score": 3.17}}, "description": "Denoising Diffusion Probabilistic Models (DDPM) is a class of latent variable models inspired by nonequilibrium thermodynamics. It is used for high-quality image synthesis. The model supports different noise schedulers such as scheduling_ddpm, scheduling_ddim, and scheduling_pndm.", "model_name": "google/ddpm-cifar10-32"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "Realistic_Vision_V1.4", "api_call": "pipeline('text-to-image', model='SG161222/Realistic_Vision_V1.4')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "Realistic_Vision_V1.4 is a text-to-image model that generates high-quality and detailed images based on textual prompts. It can be used for various applications such as generating realistic portraits, landscapes, and other types of images.", "model_name": "SG161222/Realistic_Vision_V1.4"}
{"domain": "Tabular Tabular Regression", "framework": "Scikit-learn", "functionality": "GradientBoostingRegressor", "api_name": "Fish-Weight", "api_call": "load('path_to_folder/example.pkl')", "performance": {"dataset": "Fish dataset", "accuracy": "Not provided"}, "description": "This is a GradientBoostingRegressor on a fish dataset. This model is intended for educational purposes.", "model_name": "path_to_folder/example.pkl"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "gpt2", "api_call": "pipeline('text-generation', model='gpt2')", "performance": {"dataset": {"LAMBADA": {"accuracy": "35.13"}, "CBT-CN": {"accuracy": "45.99"}, "CBT-NE": {"accuracy": "87.65"}, "WikiText2": {"accuracy": "83.4"}, "PTB": {"accuracy": "29.41"}, "enwiki8": {"accuracy": "65.85"}, "text8": {"accuracy": "1.16"}, "WikiText103": {"accuracy": "1.17"}, "1BW": {"accuracy": "37.50"}}}, "description": "GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.", "model_name": "gpt2"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "functionality": "Masked Language Modeling", "api_name": "bert-base-chinese", "api_call": "AutoModelForMaskedLM.from_pretrained('bert-base-chinese')", "performance": {"dataset": "[More Information Needed]", "accuracy": "[More Information Needed]"}, "description": "This model has been pre-trained for Chinese. Training and random input masking have been applied independently to word pieces (as in the original BERT paper). It can be used for masked language modeling.", "model_name": "bert-base-chinese"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "functionality": "Fill-Mask", "api_name": "microsoft/deberta-v3-base", "api_call": "DebertaModel.from_pretrained('microsoft/deberta-v3-base')", "performance": {"dataset": {"SQuAD 2.0": {"F1": 88.4, "EM": 85.4}, "MNLI-m/mm": {"ACC": "90.6/90.7"}}}, "description": "DeBERTa V3 improves the BERT and RoBERTa models using disentangled attention and enhanced mask decoder. It further improves the efficiency of DeBERTa using ELECTRA-Style pre-training with Gradient Disentangled Embedding Sharing. The DeBERTa V3 base model comes with 12 layers and a hidden size of 768. It has only 86M backbone parameters with a vocabulary containing 128K tokens which introduces 98M parameters in the Embedding layer. This model was trained using the 160GB data as DeBERTa V2.", "model_name": "microsoft/deberta-v3-base"}
{"domain": "Tabular Tabular Classification", "framework": "Joblib", "functionality": "Transformers", "api_name": "abhishek/autotrain-adult-census-xgboost", "api_call": "AutoModel.from_pretrained('abhishek/autotrain-adult-census-xgboost')", "performance": {"dataset": "scikit-learn/adult-census-income", "accuracy": 0.8750191924}, "description": "A binary classification model trained on the Adult Census Income dataset using the XGBoost algorithm. The model predicts whether an individual's income is above or below $50,000 per year.", "model_name": "abhishek/autotrain-adult-census-xgboost"}
{"domain": "Multimodal Document Question Answer", "framework": "Transformers", "functionality": "Document Question Answering", "api_name": "tiny-random-LayoutLMForQuestionAnswering", "api_call": "AutoModelForQuestionAnswering.from_pretrained('hf-tiny-model-private/tiny-random-LayoutLMForQuestionAnswering')", "performance": {"dataset": "", "accuracy": ""}, "description": "A tiny random LayoutLM model for question answering. This model is not pretrained and serves as an example for the LayoutLM architecture.", "model_name": "hf-tiny-model-private/tiny-random-LayoutLMForQuestionAnswering"}
{"domain": "Natural Language Processing Conversational", "framework": "PyTorch Transformers", "functionality": "text-generation", "api_name": "satvikag/chatbot", "api_call": "AutoModelWithLMHead.from_pretrained('output-small')", "performance": {"dataset": "Kaggle game script dataset", "accuracy": "Not provided"}, "description": "DialoGPT Trained on the Speech of a Game Character, Joshua from The World Ends With You.", "model_name": "output-small"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Named Entity Recognition", "api_name": "Jean-Baptiste/roberta-large-ner-english", "api_call": "AutoModelForTokenClassification.from_pretrained('Jean-Baptiste/roberta-large-ner-english')", "performance": {"dataset": "conll2003", "accuracy": {"PER": {"precision": 0.9914000000000001, "recall": 0.9927, "f1": 0.992}, "ORG": {"precision": 0.9627, "recall": 0.9661000000000001, "f1": 0.9644}, "LOC": {"precision": 0.9795, "recall": 0.9862000000000001, "f1": 0.9828}, "MISC": {"precision": 0.9292, "recall": 0.9262, "f1": 0.9277000000000001}, "Overall": {"precision": 0.974, "recall": 0.9766, "f1": 0.9753000000000001}}}, "description": "roberta-large-ner-english is an English NER model that was fine-tuned from roberta-large on the conll2003 dataset. The model was validated on emails/chat data and outperformed other models on this type of data specifically. In particular, the model seems to work better on entities that don't start with an upper case.", "model_name": "Jean-Baptiste/roberta-large-ner-english"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Embeddings", "api_name": "sentence-transformers/paraphrase-albert-small-v2", "api_call": "SentenceTransformer('sentence-transformers/paraphrase-albert-small-v2')", "performance": {"dataset": ["snli", "multi_nli", "ms_marco"], "accuracy": "https://seb.sbert.net"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/paraphrase-albert-small-v2"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "sayakpaul/videomae-base-finetuned-ucf101-subset", "api_call": "AutoModelForVideoClassification.from_pretrained('sayakpaul/videomae-base-finetuned-ucf101-subset')", "performance": {"dataset": "unknown", "accuracy": 0.8645}, "description": "This model is a fine-tuned version of MCG-NJU/videomae-base on an unknown dataset. It achieves the following results on the evaluation set: Loss: 0.3992, Accuracy: 0.8645.", "model_name": "sayakpaul/videomae-base-finetuned-ucf101-subset"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "NLI-based Zero Shot Text Classification", "api_name": "facebook/bart-large-mnli", "api_call": "AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')", "performance": {"dataset": "multi_nli", "accuracy": "Not specified"}, "description": "This is the checkpoint for bart-large after being trained on the MultiNLI (MNLI) dataset. The model can be used for zero-shot text classification by posing the sequence to be classified as the NLI premise and constructing a hypothesis from each candidate label. The probabilities for entailment and contradiction are then converted to label probabilities.", "model_name": "facebook/bart-large-mnli"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "setu4993/LaBSE", "api_call": "BertModel.from_pretrained('setu4993/LaBSE')", "performance": {"dataset": "CommonCrawl and Wikipedia", "accuracy": "Not Specified"}, "description": "Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.", "model_name": "setu4993/LaBSE"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "deepset/minilm-uncased-squad2", "api_call": "AutoModelForQuestionAnswering.from_pretrained('deepset/minilm-uncased-squad2')", "performance": {"dataset": "squad_v2", "accuracy": {"exact": 76.1307167523, "f1": 79.4978650022}}, "description": "MiniLM-L12-H384-uncased is a language model fine-tuned for extractive question answering on the SQuAD 2.0 dataset. It is based on the microsoft/MiniLM-L12-H384-uncased model and can be used with the Hugging Face Transformers library.", "model_name": "deepset/minilm-uncased-squad2"}
{"domain": "Multimodal Image-to-Text", "framework": "Transformers", "functionality": "Image Captioning", "api_name": "blip-image-captioning-large", "api_call": "BlipForConditionalGeneration.from_pretrained('Salesforce/blip-image-captioning-large')", "performance": {"dataset": "COCO", "accuracy": {"image-text retrieval": "+2.7% recall@1", "image captioning": "+2.8% CIDEr", "VQA": "+1.6% VQA score"}}, "description": "BLIP is a Vision-Language Pre-training (VLP) framework that achieves state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval, image captioning, and VQA. It effectively utilizes noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.", "model_name": "Salesforce/blip-image-captioning-large"}
{"domain": "Reinforcement Learning", "framework": "Unity ML-Agents", "functionality": "Train and play SoccerTwos", "api_name": "Raiden-1001/poca-Soccerv7.1", "api_call": "mlagents-load-from-hf --repo-id='Raiden-1001/poca-Soccerv7.1' --local-dir='./downloads'", "performance": {"dataset": "SoccerTwos", "accuracy": "Not provided"}, "description": "A trained model of a poca agent playing SoccerTwos using the Unity ML-Agents Library.", "model_name": "Raiden-1001/poca-Soccerv7.1"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "wav2vec2-random-tiny-classifier", "api_call": "pipeline('audio-classification', model=Wav2Vec2ForCTC.from_pretrained('anton-l/wav2vec2-random-tiny-classifier'))", "performance": {"dataset": "", "accuracy": ""}, "description": "An audio classification model based on wav2vec2.", "model_name": "anton-l/wav2vec2-random-tiny-classifier"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "vision-encoder-decoder", "api_name": "jinhybr/OCR-DocVQA-Donut", "api_call": "pipeline('document-question-answering', model='jinhybr/OCR-DocVQA-Donut')", "performance": {"dataset": "DocVQA", "accuracy": "Not provided"}, "description": "Donut model fine-tuned on DocVQA. It consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a tensor of embeddings, after which the decoder autoregressively generates text, conditioned on the encoding of the encoder.", "model_name": "jinhybr/OCR-DocVQA-Donut"}
{"domain": "Audio Voice Activity Detection", "framework": "Hugging Face Transformers", "functionality": "Voice Activity Detection", "api_name": "anilbs/segmentation", "api_call": "VoiceActivityDetection(segmentation='anilbs/segmentation')", "performance": {"dataset": [{"name": "AMI Mix-Headset", "accuracy": {"onset": 0.684, "offset": 0.577, "min_duration_on": 0.181, "min_duration_off": 0.037}}, {"name": "DIHARD3", "accuracy": {"onset": 0.767, "offset": 0.377, "min_duration_on": 0.136, "min_duration_off": 0.067}}, {"name": "VoxConverse", "accuracy": {"onset": 0.767, "offset": 0.713, "min_duration_on": 0.182, "min_duration_off": 0.501}}]}, "description": "Model from End-to-end speaker segmentation for overlap-aware resegmentation, by Herv\u00e9 Bredin and Antoine Laurent. Online demo is available as a Hugging Face Space.", "model_name": "anilbs/segmentation"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text2Text Generation", "api_name": "google/flan-t5-xxl", "api_call": "T5ForConditionalGeneration.from_pretrained('google/flan-t5-xxl')", "performance": {"dataset": [{"name": "MMLU", "accuracy": "75.2%"}]}, "description": "FLAN-T5 XXL is a fine-tuned version of the T5 language model, achieving state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. It has been fine-tuned on more than 1000 additional tasks covering multiple languages, including English, German, and French. It can be used for research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning and question answering.", "model_name": "google/flan-t5-xxl"}
{"domain": "Natural Language Processing Text Generation", "framework": "PyTorch Transformers", "functionality": "Text Generation", "api_name": "decapoda-research/llama-7b-hf", "api_call": "AutoModel.from_pretrained('decapoda-research/llama-7b-hf')", "performance": {"dataset": [{"name": "BoolQ", "accuracy": 76.5}, {"name": "PIQA", "accuracy": 79.8}, {"name": "SIQA", "accuracy": 48.9}, {"name": "HellaSwag", "accuracy": 76.1}, {"name": "WinoGrande", "accuracy": 70.1}, {"name": "ARC-e", "accuracy": 76.7}, {"name": "ARC-c", "accuracy": 47.6}, {"name": "OBQAC", "accuracy": 57.2}, {"name": "COPA", "accuracy": 93}]}, "description": "LLaMA-7B is an auto-regressive language model based on the transformer architecture. It is designed for research on large language models, including question answering, natural language understanding, and reading comprehension. The model is trained on various sources, including CCNet, C4, GitHub, Wikipedia, Books, ArXiv, and Stack Exchange, with the majority of the dataset being in English.", "model_name": "decapoda-research/llama-7b-hf"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "facebook/bart-base", "api_call": "BartModel.from_pretrained('facebook/bart-base')", "performance": {"dataset": "arxiv", "accuracy": "Not provided"}, "description": "BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering).", "model_name": "facebook/bart-base"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K-augreg", "api_call": "pipeline('image-classification', model='laion/CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K-augreg')", "performance": {"dataset": "ImageNet-1k", "accuracy": "70.8-71.7%"}, "description": "A series of CLIP ConvNeXt-Base (w/ wide embed dim) models trained on subsets of LAION-5B using OpenCLIP. The models utilize the timm ConvNeXt-Base model (convnext_base) as the image tower, and the same text tower as the RN50x4 (depth 12, embed dim 640) model from OpenAI CLIP.", "model_name": "laion/CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K-augreg"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "zero-shot-object-detection", "api_name": "google/owlvit-large-patch14", "api_call": "OwlViTForObjectDetection.from_pretrained('google/owlvit-large-patch14')", "performance": {"dataset": "COCO", "accuracy": "Not specified"}, "description": "OWL-ViT is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. It uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. OWL-ViT is trained on publicly available image-caption data and fine-tuned on publicly available object detection datasets such as COCO and OpenImages.", "model_name": "google/owlvit-large-patch14"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-Video Synthesis", "api_name": "damo-vilab/text-to-video-ms-1.7b-legacy", "api_call": "DiffusionPipeline.from_pretrained('damo-vilab/text-to-video-ms-1.7b-legacy', torch_dtype=torch.float16)", "performance": {"dataset": ["LAION5B", "ImageNet", "Webvid"], "accuracy": "Not provided"}, "description": "This model is based on a multi-stage text-to-video generation diffusion model, which inputs a description text and returns a video that matches the text description. Only English input is supported.", "model_name": "damo-vilab/text-to-video-ms-1.7b-legacy"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "google/pix2struct-chartqa-base", "api_call": "Pix2StructForConditionalGeneration.from_pretrained('google/pix2struct-chartqa-base')", "performance": {"dataset": "ChartQA", "accuracy": "Not provided"}, "description": "Pix2Struct is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The model is pretrained by learning to parse masked screenshots of web pages into simplified HTML. It can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.", "model_name": "google/pix2struct-chartqa-base"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "layoutlmv3-base-mpdocvqa", "api_call": "LayoutLMv3ForQuestionAnswering.from_pretrained('rubentito/layoutlmv3-base-mpdocvqa')", "performance": {"dataset": "rubentito/mp-docvqa", "accuracy": {"ANLS": 0.45380000000000004, "APPA": 51.9426}}, "description": "This is the pretrained LayoutLMv3 model from the Microsoft hub, fine-tuned on the Multipage DocVQA (MP-DocVQA) dataset. This model was used as a baseline in Hierarchical multimodal transformers for Multi-Page DocVQA.", "model_name": "rubentito/layoutlmv3-base-mpdocvqa"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face", "functionality": "Conversational", "api_name": "facebook/blenderbot_small-90M", "api_call": "BlenderbotForConditionalGeneration.from_pretrained('facebook/blenderbot_small-90M')", "performance": {"dataset": "blended_skill_talk", "accuracy": "Not provided"}, "description": "Blenderbot is a chatbot model that provides engaging talking points and listens to its partners, both asking and answering questions, and displaying knowledge, empathy, and personality appropriately, depending on the situation.", "model_name": "facebook/blenderbot_small-90M"}
{"domain": "Natural Language Processing Summarization", "framework": "Transformers", "functionality": "text2text-generation", "api_name": "financial-summarization-pegasus", "api_call": "PegasusForConditionalGeneration.from_pretrained('human-centered-summarization/financial-summarization-pegasus')", "performance": {"dataset": "xsum", "accuracy": {"ROUGE-1": 35.206, "ROUGE-2": 16.569, "ROUGE-L": 30.128, "ROUGE-LSUM": 30.171}}, "description": "This model was fine-tuned on a novel financial news dataset, which consists of 2K articles from Bloomberg, on topics such as stock, markets, currencies, rate and cryptocurrencies. It is based on the PEGASUS model and in particular PEGASUS fine-tuned on the Extreme Summarization (XSum) dataset: google/pegasus-xsum model. PEGASUS was originally proposed by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu in PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization.", "model_name": "human-centered-summarization/financial-summarization-pegasus"}
{"domain": "Multimodal Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Geolocalization", "api_name": "geolocal/StreetCLIP", "api_call": "CLIPModel.from_pretrained('geolocal/StreetCLIP')", "performance": {"dataset": [{"name": "IM2GPS", "accuracy": {"25km": 28.3, "200km": 45.1, "750km": 74.7, "2500km": 88.2}}, {"name": "IM2GPS3K", "accuracy": {"25km": 22.4, "200km": 37.4, "750km": 61.3, "2500km": 80.4}}]}, "description": "StreetCLIP is a robust foundation model for open-domain image geolocalization and other geographic and climate-related tasks. Trained on an original dataset of 1.1 million street-level urban and rural geo-tagged images, it achieves state-of-the-art performance on multiple open-domain image geolocalization benchmarks in zero-shot, outperforming supervised models trained on millions of images.", "model_name": "geolocal/StreetCLIP"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "pygmalion-6b", "api_call": "AutoModelForCausalLM.from_pretrained('pygmalion-6b')", "performance": {"dataset": "56MB of dialogue data", "accuracy": "Not specified"}, "description": "Pygmalion 6B is a proof-of-concept dialogue model based on EleutherAI's GPT-J-6B. The fine-tuning dataset consisted of 56MB of dialogue data gathered from multiple sources, which includes both real and partially machine-generated conversations. The model was initialized from the uft-6b ConvoGPT model and fine-tuned on ~48.5 million tokens for ~5k steps on 4 NVIDIA A40s using DeepSpeed.", "model_name": "pygmalion-6b"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Speech Recognition", "api_name": "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn", "api_call": "Wav2Vec2Model.from_pretrained('jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn')", "performance": {"dataset": "Common Voice zh-CN", "accuracy": {"WER": 82.37, "CER": 19.03}}, "description": "Fine-tuned XLSR-53 large model for speech recognition in Chinese. Fine-tuned facebook/wav2vec2-large-xlsr-53 on Chinese using the train and validation splits of Common Voice 6.1, CSS10 and ST-CMDS.", "model_name": "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-small-finetuned-wtq", "api_call": "TapasForQuestionAnswering.from_pretrained('google/tapas-small-finetuned-wtq'), TapasTokenizer.from_pretrained('google/tapas-small-finetuned-wtq')", "performance": {"dataset": "wikitablequestions", "accuracy": 0.37620000000000003}, "description": "TAPAS small model fine-tuned on WikiTable Questions (WTQ). This model was pre-trained on MLM and an additional step which the authors call intermediate pre-training, and then fine-tuned in a chain on SQA, WikiSQL and finally WTQ. It uses relative position embeddings (i.e. resetting the position index at every cell of the table).", "model_name": "google/tapas-small-finetuned-wtq"}
{"domain": "Audio Audio-to-Audio", "framework": "Fairseq", "functionality": "audio", "api_name": "textless_sm_cs_en", "api_call": "Wav2Vec2Model.from_pretrained(cached_download('https://huggingface.co/facebook/textless_sm_cs_en/resolve/main/model.pt'))", "performance": {"dataset": "", "accuracy": ""}, "description": "A speech-to-speech translation model for converting between languages without using text as an intermediate representation. This model is designed for the task of audio-to-audio translation.", "model_name": "huggingface.co/facebook"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/gtr-t5-base", "api_call": "SentenceTransformer('sentence-transformers/gtr-t5-base')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "N/A"}, "description": "This is a sentence-transformers model that maps sentences & paragraphs to a 768 dimensional dense vector space. The model was specifically trained for the task of semantic search.", "model_name": "sentence-transformers/gtr-t5-base"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "dperales/layoutlmv2-base-uncased_finetuned_docvqa", "api_call": "LayoutLMv2ForQuestionAnswering.from_pretrained('dperales/layoutlmv2-base-uncased_finetuned_docvqa')", "performance": {"dataset": "", "accuracy": ""}, "description": "A model for Document Question Answering based on the LayoutLMv2 architecture, fine-tuned on the DocVQA dataset.", "model_name": "dperales/layoutlmv2-base-uncased_finetuned_docvqa"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/trocr-small-stage1", "api_call": "VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-small-stage1')", "performance": {"dataset": "IAM", "accuracy": "Not provided"}, "description": "TrOCR pre-trained only model. It was introduced in the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al. and first released in this repository. The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image encoder was initialized from the weights of DeiT, while the text decoder was initialized from the weights of UniLM. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. Next, the Transformer text decoder autoregressively generates tokens.", "model_name": "microsoft/trocr-small-stage1"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "microsoft/xclip-base-patch16-zero-shot", "api_call": "XCLIPModel.from_pretrained('microsoft/xclip-base-patch16-zero-shot')", "performance": {"dataset": [{"name": "HMDB-51", "accuracy": 44.6}, {"name": "UCF-101", "accuracy": 72.0}, {"name": "Kinetics-600", "accuracy": 65.2}]}, "description": "X-CLIP is a minimal extension of CLIP for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs. This allows the model to be used for tasks like zero-shot, few-shot or fully supervised video classification and video-text retrieval.", "model_name": "microsoft/xclip-base-patch16-zero-shot"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "tiennvcs/layoutlmv2-base-uncased-finetuned-vi-infovqa", "api_call": "pipeline('question-answering', model='tiennvcs/layoutlmv2-base-uncased-finetuned-vi-infovqa')", "performance": {"dataset": "unknown", "accuracy": {"Loss": 4.3332}}, "description": "This model is a fine-tuned version of microsoft/layoutlmv2-base-uncased on an unknown dataset.", "model_name": "tiennvcs/layoutlmv2-base-uncased-finetuned-vi-infovqa"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Asteroid", "api_name": "ConvTasNet_Libri2Mix_sepclean_8k", "api_call": "hf_hub_download(repo_id='JorisCos/ConvTasNet_Libri2Mix_sepclean_8k')", "performance": {"dataset": "Libri2Mix", "accuracy": {"si_sdr": 14.7645436345, "si_sdr_imp": 14.7640293756, "sdr": 15.2933797075, "sdr_imp": 15.1141466051, "sir": 24.0929046611, "sir_imp": 23.9136696831, "sar": 16.0605590692, "sar_imp": -51.9807844413, "stoi": 0.9311142441, "stoi_imp": 0.2181737614}}, "description": "This model was trained by Joris Cosentino using the librimix recipe in Asteroid. It was trained on the sep_clean task of the Libri2Mix dataset.", "model_name": "JorisCos/ConvTasNet_Libri2Mix_sepclean_8k"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "keremberke/yolov8m-pcb-defect-segmentation", "api_call": "YOLO('keremberke/yolov8m-pcb-defect-segmentation')", "performance": {"dataset": "pcb-defect-segmentation", "accuracy": {"mAP@0.5(box)": 0.5680000000000001, "mAP@0.5(mask)": 0.557}}, "description": "A YOLOv8 model for PCB defect segmentation trained on the pcb-defect-segmentation dataset. The model can detect and segment defects in PCB images, such as Dry_joint, Incorrect_installation, PCB_damage, and Short_circuit.", "model_name": "keremberke/yolov8m-pcb-defect-segmentation"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Table Extraction", "api_name": "keremberke/yolov8n-table-extraction", "api_call": "YOLO('keremberke/yolov8n-table-extraction')", "performance": {"dataset": "table-extraction", "accuracy": 0.967}, "description": "An object detection model for extracting tables from documents. Supports two label types: 'bordered' and 'borderless'.", "model_name": "keremberke/yolov8n-table-extraction"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221215-112116", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221215-112116')", "performance": {"dataset": "DIODE", "accuracy": ""}, "description": "A depth estimation model fine-tuned on the DIODE dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221215-112116"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Named Entity Recognition", "api_name": "flair/ner-german", "api_call": "SequenceTagger.load('flair/ner-german')", "performance": {"dataset": "conll2003", "accuracy": "87.94"}, "description": "This is the standard 4-class NER model for German that ships with Flair. It predicts 4 tags: PER (person name), LOC (location name), ORG (organization name), and MISC (other name). The model is based on Flair embeddings and LSTM-CRF.", "model_name": "flair/ner-german"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Text Classification", "api_name": "svalabs/gbert-large-zeroshot-nli", "api_call": "pipeline('zero-shot-classification', model='svalabs/gbert-large-zeroshot-nli')", "performance": {"dataset": "XNLI TEST-Set", "accuracy": "85.6%"}, "description": "A German zeroshot classification model based on the German BERT large model from deepset.ai and finetuned for natural language inference using machine-translated nli sentence pairs from mnli, anli, and snli datasets.", "model_name": "svalabs/gbert-large-zeroshot-nli"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "BaptisteDoyen/camembert-base-xnli", "api_call": "pipeline('zero-shot-classification', model='BaptisteDoyen/camembert-base-xnli')", "performance": {"dataset": "xnli", "accuracy": {"validation": 81.4, "test": 81.7}}, "description": "Camembert-base model fine-tuned on the French part of the XNLI dataset. One of the few zero-shot classification models that work on French.", "model_name": "BaptisteDoyen/camembert-base-xnli"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Generate and modify images based on text prompts", "api_name": "stabilityai/stable-diffusion-2-depth", "api_call": "StableDiffusionDepth2ImgPipeline.from_pretrained('stabilityai/stable-diffusion-2-depth', torch_dtype=torch.float16)", "performance": {"dataset": "COCO2017 validation set", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion v2 is a latent diffusion model that generates and modifies images based on text prompts. It uses a fixed, pretrained text encoder (OpenCLIP-ViT/H) and is developed by Robin Rombach and Patrick Esser. The model works with English language prompts and is intended for research purposes only.", "model_name": "stabilityai/stable-diffusion-2-depth"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Question Generation", "api_name": "mrm8488/t5-base-finetuned-question-generation-ap", "api_call": "AutoModelWithLMHead.from_pretrained('mrm8488/t5-base-finetuned-question-generation-ap')", "performance": {"dataset": "SQuAD", "accuracy": "Not provided"}, "description": "Google's T5 model fine-tuned on SQuAD v1.1 for Question Generation by prepending the answer to the context.", "model_name": "mrm8488/t5-base-finetuned-question-generation-ap"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Contextual Representation", "api_name": "indobenchmark/indobert-base-p1", "api_call": "AutoModel.from_pretrained('indobenchmark/indobert-base-p1')", "performance": {"dataset": "Indo4B", "accuracy": "23.43 GB of text"}, "description": "IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model is trained using a masked language modeling (MLM) objective and next sentence prediction (NSP) objective.", "model_name": "indobenchmark/indobert-base-p1"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-video-synthesis", "api_name": "damo-vilab/text-to-video-ms-1.7b", "api_call": "DiffusionPipeline.from_pretrained('damo-vilab/text-to-video-ms-1.7b', torch_dtype=torch.float16, variant='fp16')", "performance": {"dataset": "Webvid", "accuracy": "Not specified"}, "description": "A multi-stage text-to-video generation diffusion model that inputs a description text and returns a video that matches the text description. The model consists of three sub-networks: text feature extraction model, text feature-to-video latent space diffusion model, and video latent space to video visual space model. It supports English input only and has a wide range of applications.", "model_name": "damo-vilab/text-to-video-ms-1.7b"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "google/ncsnpp-ffhq-256", "api_call": "DiffusionPipeline.from_pretrained('google/ncsnpp-ffhq-256')", "performance": {"dataset": "CIFAR-10", "accuracy": {"Inception score": 9.89, "FID": 2.2, "Likelihood": 2.99}}, "description": "Score-Based Generative Modeling through Stochastic Differential Equations (SDE) for unconditional image generation. Achieves record-breaking performance on CIFAR-10 and demonstrates high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.", "model_name": "google/ncsnpp-ffhq-256"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "text2text-generation", "api_name": "google/bigbird-pegasus-large-bigpatent", "api_call": "BigBirdPegasusForConditionalGeneration.from_pretrained('google/bigbird-pegasus-large-bigpatent')", "performance": {"dataset": "big_patent", "accuracy": "Not provided"}, "description": "BigBird, a sparse-attention based transformer, extends Transformer-based models like BERT to much longer sequences. It can handle sequences up to a length of 4096 at a much lower compute cost compared to BERT. BigBird has achieved state-of-the-art results on various tasks involving very long sequences such as long documents summarization and question-answering with long contexts.", "model_name": "google/bigbird-pegasus-large-bigpatent"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli", "api_call": "pipeline('zero-shot-classification', model='MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli')", "performance": {"dataset": [{"name": "mnli_test_m", "accuracy": 0.912}, {"name": "mnli_test_mm", "accuracy": 0.908}, {"name": "anli_test", "accuracy": 0.7020000000000001}, {"name": "anli_test_r3", "accuracy": 0.64}, {"name": "ling_test", "accuracy": 0.87}, {"name": "wanli_test", "accuracy": 0.77}]}, "description": "This model was fine-tuned on the MultiNLI, Fever-NLI, Adversarial-NLI (ANLI), LingNLI and WANLI datasets, which comprise 885 242 NLI hypothesis-premise pairs. This model is the best performing NLI model on the Hugging Face Hub as of 06.06.22 and can be used for zero-shot classification. It significantly outperforms all other large models on the ANLI benchmark.", "model_name": "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "ToddGoldfarb/Cadet-Tiny", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('ToddGoldfarb/Cadet-Tiny', low_cpu_mem_usage=True)", "performance": {"dataset": "allenai/soda", "accuracy": ""}, "description": "Cadet-Tiny is a very small conversational model trained on the SODA dataset. Cadet-Tiny is intended for inference at the edge (on something as small as a 2GB RAM Raspberry Pi). Cadet-Tiny is trained from the t5-small pretrained model from Google and, as a result, is about 2% of the size of the Cosmo-3B model.", "model_name": "ToddGoldfarb/Cadet-Tiny"}
{"domain": "Audio Text-to-Speech", "framework": "SpeechBrain", "functionality": "Text-to-Speech", "api_name": "tts-hifigan-ljspeech", "api_call": "HIFIGAN.from_hparams(source='speechbrain/tts-hifigan-ljspeech', savedir=tmpdir)", "performance": {"dataset": "LJSpeech", "accuracy": "Not specified"}, "description": "This repository provides all the necessary tools for using a HiFi-GAN vocoder trained with LJSpeech. The pre-trained model takes a spectrogram as input and produces a waveform as output. Typically, a vocoder is used after a TTS model that converts an input text into a spectrogram. The sampling frequency is 22050 Hz.", "model_name": "speechbrain/tts-hifigan-ljspeech"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "bert-large-uncased-whole-word-masking-finetuned-squad", "api_call": "AutoModel.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')", "performance": {"dataset": "SQuAD", "accuracy": {"f1": 93.15, "exact_match": 86.91}}, "description": "BERT large model (uncased) whole word masking finetuned on SQuAD. The model was pretrained on BookCorpus and English Wikipedia. It was trained with two objectives: Masked language modeling (MLM) and Next sentence prediction (NSP). This model should be used as a question-answering model.", "model_name": "bert-large-uncased-whole-word-masking-finetuned-squad"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-large-finetuned-sqa", "api_call": "TapasForQuestionAnswering.from_pretrained('google/tapas-large-finetuned-sqa')", "performance": {"dataset": "msr_sqa", "accuracy": 0.7289}, "description": "TAPAS large model fine-tuned on Sequential Question Answering (SQA). This model was pre-trained on MLM and an additional step which the authors call intermediate pre-training, and then fine-tuned on SQA. It uses relative position embeddings (i.e. resetting the position index at every cell of the table).", "model_name": "google/tapas-large-finetuned-sqa"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "sd-class-butterflies-32", "api_call": "DDPMPipeline.from_pretrained('clp/sd-class-butterflies-32')", "performance": {"dataset": null, "accuracy": null}, "description": "This model is a diffusion model for unconditional image generation of cute butterflies.", "model_name": "clp/sd-class-butterflies-32"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Transformers", "functionality": "Cross-Encoder for Natural Language Inference", "api_name": "cross-encoder/nli-distilroberta-base", "api_call": "CrossEncoder('cross-encoder/nli-distilroberta-base')", "performance": {"dataset": "SNLI and MultiNLI", "accuracy": "See SBERT.net - Pretrained Cross-Encoder for evaluation results"}, "description": "This model was trained using SentenceTransformers Cross-Encoder class on the SNLI and MultiNLI datasets. For a given sentence pair, it will output three scores corresponding to the labels: contradiction, entailment, neutral.", "model_name": "cross-encoder/nli-distilroberta-base"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/trocr-base-printed", "api_call": "VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-printed')", "performance": {"dataset": "SROIE", "accuracy": "Not provided"}, "description": "TrOCR model fine-tuned on the SROIE dataset. It was introduced in the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al. and first released in this repository. The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image encoder was initialized from the weights of BEiT, while the text decoder was initialized from the weights of RoBERTa.", "model_name": "microsoft/trocr-base-printed"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "speech-enhancement", "api_name": "speechbrain/metricgan-plus-voicebank", "api_call": "SpectralMaskEnhancement.from_hparams(source='speechbrain/metricgan-plus-voicebank', savedir='pretrained_models/metricgan-plus-voicebank')", "performance": {"dataset": "Voicebank", "accuracy": {"Test PESQ": "3.15", "Test STOI": "93.0"}}, "description": "MetricGAN-trained model for Enhancement", "model_name": "speechbrain/metricgan-plus-voicebank"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Token Classification", "api_name": "dbmdz/bert-large-cased-finetuned-conll03-english", "api_call": "AutoModelForTokenClassification.from_pretrained('dbmdz/bert-large-cased-finetuned-conll03-english')", "performance": {"dataset": "CoNLL-03", "accuracy": "Not provided"}, "description": "This is a BERT-large-cased model fine-tuned on the CoNLL-03 dataset for token classification tasks.", "model_name": "dbmdz/bert-large-cased-finetuned-conll03-english"}
{"domain": "Audio Audio Classification", "framework": "PyTorch Transformers", "functionality": "Emotion Recognition", "api_name": "superb/wav2vec2-base-superb-er", "api_call": "pipeline('audio-classification', model='superb/wav2vec2-base-superb-er')", "performance": {"dataset": "IEMOCAP", "accuracy": 0.6258}, "description": "This is a ported version of S3PRL's Wav2Vec2 for the SUPERB Emotion Recognition task. The base model is wav2vec2-base, which is pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. For more information refer to SUPERB: Speech processing Universal PERformance Benchmark.", "model_name": "superb/wav2vec2-base-superb-er"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "facebook/timesformer-base-finetuned-k600", "api_call": "TimesformerForVideoClassification.from_pretrained('facebook/timesformer-base-finetuned-k600')", "performance": {"dataset": "Kinetics-600", "accuracy": null}, "description": "TimeSformer model pre-trained on Kinetics-600. It was introduced in the paper TimeSformer: Is Space-Time Attention All You Need for Video Understanding? by Bertasius et al. and first released in this repository.", "model_name": "facebook/timesformer-base-finetuned-k600"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "sup-simcse-roberta-large", "api_call": "AutoModel.from_pretrained('princeton-nlp/sup-simcse-roberta-large')", "performance": {"dataset": "STS tasks", "accuracy": "Spearman's correlation (See associated paper Appendix B)"}, "description": "A pretrained RoBERTa-large model for simple contrastive learning of sentence embeddings. It can be used for feature extraction and has been evaluated on semantic textual similarity (STS) tasks and downstream transfer tasks.", "model_name": "princeton-nlp/sup-simcse-roberta-large"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Image Upscaling", "api_name": "stabilityai/sd-x2-latent-upscaler", "api_call": "StableDiffusionLatentUpscalePipeline.from_pretrained('stabilityai/sd-x2-latent-upscaler', torch_dtype=torch.float16)", "performance": {"dataset": "LAION-2B", "accuracy": "Not specified"}, "description": "Stable Diffusion x2 latent upscaler is a diffusion-based upscaler model developed by Katherine Crowson in collaboration with Stability AI. It is designed to upscale Stable Diffusion's latent denoised image embeddings, allowing for fast text-to-image and upscaling pipelines. The model was trained on a high-resolution subset of the LAION-2B dataset and works with all Stable Diffusion checkpoints.", "model_name": "stabilityai/sd-x2-latent-upscaler"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k", "api_call": "pipeline('zero-shot-image-classification', model='laion/CLIP-ViT-bigG-14-laion2B-39B-b160k')", "performance": {"dataset": "ImageNet-1k", "accuracy": "80.1"}, "description": "A CLIP ViT-bigG/14 model trained with the LAION-2B English subset of LAION-5B using OpenCLIP. The model is intended for research purposes and enables researchers to better understand and explore zero-shot, arbitrary image classification. It can be used for interdisciplinary studies of the potential impact of such models. The model achieves 80.1% zero-shot top-1 accuracy on ImageNet-1k.", "model_name": "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Text Summarization", "api_name": "distilbart-cnn-12-6-samsum", "api_call": "pipeline('summarization', model='philschmid/distilbart-cnn-12-6-samsum')", "performance": {"dataset": "samsum", "accuracy": {"ROUGE-1": 41.09, "ROUGE-2": 20.746, "ROUGE-L": 31.595, "ROUGE-LSUM": 38.339}}, "description": "This model is a DistilBART-based text summarization model trained on the SAMsum dataset. It can be used to generate summaries of conversational text.", "model_name": "philschmid/distilbart-cnn-12-6-samsum"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "hubert-large-ll60k", "api_call": "HubertModel.from_pretrained('facebook/hubert-large-ll60k')", "performance": {"dataset": "Libri-Light", "accuracy": "matches or improves upon the state-of-the-art wav2vec 2.0 performance"}, "description": "Hubert-Large is a self-supervised speech representation learning model pretrained on 16kHz sampled speech audio. It is designed to deal with the unique problems in speech representation learning, such as multiple sound units in each input utterance, no lexicon of input sound units during the pre-training phase, and variable lengths of sound units with no explicit segmentation. The model relies on an offline clustering step to provide aligned target labels for a BERT-like prediction loss.", "model_name": "facebook/hubert-large-ll60k"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "deepset/deberta-v3-base-squad2", "api_call": "AutoModelForQuestionAnswering.from_pretrained('deepset/deberta-v3-base-squad2')", "performance": {"dataset": "squad_v2", "accuracy": {"Exact Match": 83.825, "F1": 87.41}}, "description": "This is the deberta-v3-base model, fine-tuned using the SQuAD2.0 dataset. It's been trained on question-answer pairs, including unanswerable questions, for the task of Question Answering.", "model_name": "deepset/deberta-v3-base-squad2"}
{"domain": "Audio Text-to-Speech", "framework": "SpeechBrain", "functionality": "Text-to-Speech", "api_name": "speechbrain/tts-tacotron2-ljspeech", "api_call": "Tacotron2.from_hparams(source='speechbrain/tts-tacotron2-ljspeech')", "performance": {"dataset": "LJSpeech", "accuracy": "Not specified"}, "description": "This repository provides all the necessary tools for Text-to-Speech (TTS) with SpeechBrain using a Tacotron2 pretrained on LJSpeech. The pre-trained model takes a short text as input and produces a spectrogram as output. The final waveform can be obtained by applying a vocoder (e.g., HiFi-GAN) to the generated spectrogram.", "model_name": "speechbrain/tts-tacotron2-ljspeech"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "superb/wav2vec2-base-superb-ks", "api_call": "pipeline('audio-classification', model='superb/wav2vec2-base-superb-ks')", "performance": {"dataset": "Speech Commands dataset v1.0", "accuracy": {"s3prl": 0.9623, "transformers": 0.9643}}, "description": "Wav2Vec2-Base for Keyword Spotting (KS) task in the SUPERB benchmark. The base model is pretrained on 16kHz sampled speech audio. The KS task detects preregistered keywords by classifying utterances into a predefined set of words. The model is trained on the Speech Commands dataset v1.0.", "model_name": "superb/wav2vec2-base-superb-ks"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Transformers", "functionality": "Text Generation", "api_name": "google/t5-v1_1-base", "api_call": "pipeline('text2text-generation', model='google/t5-v1_1-base')", "performance": {"dataset": "c4", "accuracy": "Not provided"}, "description": "Google's T5 Version 1.1 is a text-to-text transformer model that can be applied to NLP tasks such as summarization, question answering, and text classification. It is pre-trained on the Colossal Clean Crawled Corpus (C4) only, without mixing in downstream tasks, so it needs to be fine-tuned before it is used on a downstream task.", "model_name": "google/t5-v1_1-base"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221215-092352", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221215-092352')", "performance": {"dataset": "DIODE", "accuracy": ""}, "description": "A depth estimation model fine-tuned on the DIODE dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221215-092352"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "timm/eva02_enormous_patch14_plus_clip_224.laion2b_s9b_b144k", "api_call": "clip.load('timm/eva02_enormous_patch14_plus_clip_224.laion2b_s9b_b144k')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "This model is a zero-shot image classification model based on OpenCLIP. It can be used for classifying images into various categories without any additional training.", "model_name": "timm/eva02_enormous_patch14_plus_clip_224.laion2b_s9b_b144k"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Speech Recognition", "api_name": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic", "api_call": "Wav2Vec2Model.from_pretrained('jonatasgrosman/wav2vec2-large-xlsr-53-arabic')", "performance": {"dataset": "Common Voice ar", "accuracy": {"WER": 39.59, "CER": 18.18}}, "description": "Fine-tuned XLSR-53 large model for speech recognition in Arabic. Fine-tuned facebook/wav2vec2-large-xlsr-53 on Arabic using the train and validation splits of Common Voice 6.1 and Arabic Speech Corpus.", "model_name": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "results-yelp", "api_call": "AutoTokenizer.from_pretrained('bert-base-uncased')", "performance": {"dataset": "Yelp", "accuracy": 0.9302}, "description": "This model is a fine-tuned version of textattack/bert-base-uncased-yelp-polarity on a filtered and manually reviewed Yelp dataset containing restaurant reviews only. It is intended to perform text classification, specifically sentiment analysis, on text data obtained from restaurant reviews to determine if the particular review is positive or negative.", "model_name": "bert-base-uncased"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "deep-reinforcement-learning", "api_name": "ppo-BreakoutNoFrameskip-v4", "api_call": "load_from_hub(repo_id='sb3/ppo-BreakoutNoFrameskip-v4',filename='{MODEL FILENAME}.zip',)", "performance": {"dataset": "BreakoutNoFrameskip-v4", "accuracy": "398.00 +/- 16.30"}, "description": "This is a trained model of a PPO agent playing BreakoutNoFrameskip-v4 using the stable-baselines3 library and the RL Zoo. The RL Zoo is a training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.", "model_name": "sb3/ppo-BreakoutNoFrameskip-v4"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Semantic Segmentation", "api_name": "nvidia/segformer-b2-finetuned-cityscapes-1024-1024", "api_call": "SegformerForSemanticSegmentation.from_pretrained('nvidia/segformer-b2-finetuned-cityscapes-1024-1024')", "performance": {"dataset": "Cityscapes", "accuracy": "Not provided"}, "description": "SegFormer model fine-tuned on CityScapes at resolution 1024x1024. It was introduced in the paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Xie et al. and first released in this repository.", "model_name": "nvidia/segformer-b2-finetuned-cityscapes-1024-1024"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "timm/vit_large_patch14_clip_224.openai_ft_in12k_in1k", "api_call": "pipeline('image-classification', model='timm/vit_large_patch14_clip_224.openai_ft_in12k_in1k', framework='pt')", "performance": {"dataset": "", "accuracy": ""}, "description": "A ViT-L/14 image classification model from timm: CLIP-pretrained by OpenAI, then fine-tuned on ImageNet-12k and ImageNet-1k.", "model_name": "timm/vit_large_patch14_clip_224.openai_ft_in12k_in1k"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "DialoGPT-large", "api_call": "AutoModelForCausalLM.from_pretrained('microsoft/DialoGPT-large')", "performance": {"dataset": "Reddit discussion thread", "accuracy": "Comparable to human response quality under a single-turn conversation Turing test"}, "description": "DialoGPT is a SOTA large-scale pretrained dialogue response generation model for multiturn conversations. The human evaluation results indicate that the response generated from DialoGPT is comparable to human response quality under a single-turn conversation Turing test. The model is trained on 147M multi-turn dialogue from Reddit discussion thread.", "model_name": "microsoft/DialoGPT-large"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Transcription and Translation", "api_name": "openai/whisper-medium", "api_call": "WhisperForConditionalGeneration.from_pretrained('openai/whisper-medium')", "performance": {"dataset": [{"name": "LibriSpeech (clean)", "accuracy": 2.9}, {"name": "LibriSpeech (other)", "accuracy": 5.9}, {"name": "Common Voice 11.0", "accuracy": 53.87}]}, "description": "Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. It is a Transformer-based encoder-decoder model and was trained on either English-only data or multilingual data.", "model_name": "openai/whisper-medium"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Transformers", "functionality": "Text2Text Generation", "api_name": "castorini/doc2query-t5-base-msmarco", "api_call": "T5ForConditionalGeneration.from_pretrained('castorini/doc2query-t5-base-msmarco')", "performance": {"dataset": "MS MARCO", "accuracy": "Not specified"}, "description": "A T5 model trained on the MS MARCO dataset for generating queries from documents.", "model_name": "castorini/doc2query-t5-base-msmarco"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "git-large-r-textcaps", "api_call": "pipeline('text-generation', model='microsoft/git-large-r-textcaps')", "performance": {"dataset": "TextCaps", "accuracy": ""}, "description": "GIT (short for GenerativeImage2Text) model, large-sized version, fine-tuned on TextCaps. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. The model is trained using 'teacher forcing' on a large number of (image, text) pairs. The goal for the model is simply to predict the next text token, given the image tokens and previous text tokens. This allows the model to be used for tasks like image and video captioning, visual question answering (VQA) on images and videos, and even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).", "model_name": "microsoft/git-large-r-textcaps"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-base-finetuned-wikisql-supervised", "api_call": "TapasForQuestionAnswering.from_pretrained('google/tapas-base-finetuned-wikisql-supervised')", "performance": {"dataset": "wikisql", "accuracy": "Not provided"}, "description": "TAPAS is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion. It was pretrained with two objectives: Masked language modeling (MLM) and Intermediate pre-training. Fine-tuning is done by adding a cell selection head and an aggregation head on top of the pre-trained model, and then jointly training these randomly initialized classification heads with the base model on SQA and WikiSQL.", "model_name": "google/tapas-base-finetuned-wikisql-supervised"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221122-082237", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221122-082237')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.3421, "Mae": 0.27, "Rmse": 0.4042, "Abs Rel": 0.3279, "Log Mae": 0.1132, "Log Rmse": 0.1688, "Delta1": 0.5839, "Delta2": 0.8408, "Delta3": 0.9309}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset. It is used for depth estimation tasks.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221122-082237"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/table-transformer-detection", "api_call": "TableTransformerForObjectDetection.from_pretrained('microsoft/table-transformer-detection')", "performance": {"dataset": "PubTables1M", "accuracy": "Not provided"}, "description": "Table Transformer (DETR) model trained on PubTables1M for detecting tables in documents. Introduced in the paper PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents by Smock et al.", "model_name": "microsoft/table-transformer-detection"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "fcakyon/timesformer-large-finetuned-k400", "api_call": "TimesformerForVideoClassification.from_pretrained('fcakyon/timesformer-large-finetuned-k400')", "performance": {"dataset": "Kinetics-400", "accuracy": "Not provided"}, "description": "TimeSformer model pre-trained on Kinetics-400 for video classification into one of the 400 possible Kinetics-400 labels. Introduced in the paper 'TimeSformer: Is Space-Time Attention All You Need for Video Understanding?' by Bertasius et al.", "model_name": "fcakyon/timesformer-large-finetuned-k400"}
{"domain": "Tabular Tabular Regression", "framework": "Hugging Face", "functionality": "Predicting Pokemon HP", "api_name": "julien-c/pokemon-predict-hp", "api_call": "pipeline('regression', model='julien-c/pokemon-predict-hp')", "performance": {"dataset": "julien-c/kaggle-rounakbanik-pokemon", "accuracy": {"mean_absolute_error": 15.909, "model_loss": 647.605}}, "description": "A tabular regression model trained on the julien-c/kaggle-rounakbanik-pokemon dataset to predict the HP of Pokemon.", "model_name": "julien-c/pokemon-predict-hp"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Semantic Segmentation", "api_name": "nvidia/segformer-b0-finetuned-ade-512-512", "api_call": "SegformerForSemanticSegmentation.from_pretrained('nvidia/segformer-b0-finetuned-ade-512-512')", "performance": {"dataset": "ADE20k", "accuracy": "Not provided"}, "description": "SegFormer model fine-tuned on ADE20k at resolution 512x512. It was introduced in the paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Xie et al. and first released in this repository.", "model_name": "nvidia/segformer-b0-finetuned-ade-512-512"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Image Inpainting", "api_name": "lllyasviel/control_v11p_sd15_inpaint", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_inpaint')", "performance": {"dataset": "Stable Diffusion v1-5", "accuracy": "Not specified"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on inpaint images.", "model_name": "lllyasviel/control_v11p_sd15_inpaint"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "impira/layoutlm-invoices", "api_call": "pipeline('question-answering', model='impira/layoutlm-invoices')", "performance": {"dataset": "proprietary dataset of invoices, SQuAD2.0, and DocVQA", "accuracy": "not provided"}, "description": "This is a fine-tuned version of the multi-modal LayoutLM model for the task of question answering on invoices and other documents. It has been fine-tuned on a proprietary dataset of invoices as well as both SQuAD2.0 and DocVQA for general comprehension. Unlike other QA models, which can only extract consecutive tokens (because they predict the start and end of a sequence), this model can predict longer-range, non-consecutive sequences with an additional classifier head.", "model_name": "impira/layoutlm-invoices"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "facebook/detr-resnet-101", "api_call": "DetrForObjectDetection.from_pretrained('facebook/detr-resnet-101')", "performance": {"dataset": "COCO 2017", "accuracy": "43.5 AP"}, "description": "DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper End-to-End Object Detection with Transformers by Carion et al. and first released in this repository.", "model_name": "facebook/detr-resnet-101"}
{"domain": "Multimodal Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224", "api_call": "pipeline('zero-shot-image-classification', model='microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224')", "performance": {"dataset": "PMC-15M", "accuracy": "State of the art"}, "description": "BiomedCLIP is a biomedical vision-language foundation model pretrained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central, using contrastive learning. It uses PubMedBERT as the text encoder and Vision Transformer as the image encoder, with domain-specific adaptations. It can perform various vision-language processing (VLP) tasks such as cross-modal retrieval, image classification, and visual question answering.", "model_name": "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "MoritzLaurer/DeBERTa-v3-xsmall-mnli-fever-anli-ling-binary", "api_call": "AutoModelForSequenceClassification.from_pretrained('MoritzLaurer/DeBERTa-v3-xsmall-mnli-fever-anli-ling-binary')", "performance": {"dataset": {"mnli-m-2c": {"accuracy": 0.925}, "mnli-mm-2c": {"accuracy": 0.922}, "fever-nli-2c": {"accuracy": 0.892}, "anli-all-2c": {"accuracy": 0.676}, "anli-r3-2c": {"accuracy": 0.665}, "lingnli-2c": {"accuracy": 0.888}}}, "description": "This model was trained on 782 357 hypothesis-premise pairs from 4 NLI datasets: MultiNLI, Fever-NLI, LingNLI and ANLI. The base model is DeBERTa-v3-xsmall from Microsoft. The v3 variant of DeBERTa substantially outperforms previous versions of the model by including a different pre-training objective.", "model_name": "MoritzLaurer/DeBERTa-v3-xsmall-mnli-fever-anli-ling-binary"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_fr_css10", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_fr_css10')", "performance": {"dataset": "covost2", "accuracy": ""}, "description": "A text-to-speech model trained on mtedx, covost2, europarl_st, and voxpopuli datasets for English, French, Spanish, and Italian languages. Licensed under cc-by-nc-4.0.", "model_name": "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_fr_css10"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "GroNLP/bert-base-dutch-cased", "api_call": "AutoModel.from_pretrained('GroNLP/bert-base-dutch-cased')", "performance": {"dataset": [{"name": "CoNLL-2002", "accuracy": "90.24"}, {"name": "SoNaR-1", "accuracy": "84.93"}, {"name": "spaCy UD LassySmall", "accuracy": "86.10"}]}, "description": "BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.", "model_name": "GroNLP/bert-base-dutch-cased"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "thatdramebaazguy/roberta-base-squad", "api_call": "pipeline(task='question-answering',model='thatdramebaazguy/roberta-base-squad')", "performance": {"dataset": [{"name": "SQuADv1", "accuracy": {"exact_match": 83.6045, "f1": 91.1709}}, {"name": "MoviesQA", "accuracy": {"exact_match": 51.6494, "f1": 68.2615}}]}, "description": "This is Roberta Base trained to do the SQuAD Task. This makes a QA model capable of answering questions.", "model_name": "thatdramebaazguy/roberta-base-squad"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Automatic Speech Recognition and Speech Translation", "api_name": "openai/whisper-base", "api_call": "WhisperForConditionalGeneration.from_pretrained('openai/whisper-base')", "performance": {"dataset": "LibriSpeech (clean) test set", "accuracy": "5.009 WER"}, "description": "Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning.", "model_name": "openai/whisper-base"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Audio Spectrogram", "api_name": "audio-spectrogram-transformer", "api_call": "ASTModel.from_pretrained('MIT/ast-finetuned-audioset-10-10-0.4593')", "performance": {"dataset": "", "accuracy": ""}, "description": "One custom ast model for testing of HF repos", "model_name": "MIT/ast-finetuned-audioset-10-10-0.4593"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "blip2-flan-t5-xl", "api_call": "Blip2ForConditionalGeneration.from_pretrained('Salesforce/blip2-flan-t5-xl')", "performance": {"dataset": "LAION", "accuracy": "Not provided"}, "description": "BLIP-2 model, leveraging Flan T5-xl (a large language model). It was introduced in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Li et al. and first released in this repository. The goal for the model is to predict the next text token, given the query embeddings and the previous text. This allows the model to be used for tasks like image captioning, visual question answering (VQA), and chat-like conversations by feeding the image and the previous conversation as a prompt to the model.", "model_name": "Salesforce/blip2-flan-t5-xl"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221228-072509", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221228-072509')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.4012, "Mae": 0.403, "Rmse": 0.6173, "Abs Rel": 0.3487, "Log Mae": 0.1574, "Log Rmse": 0.211, "Delta1": 0.4308, "Delta2": 0.6997, "Delta3": 0.8249}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221228-072509"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "EimisAnimeDiffusion_1.0v", "api_call": "DiffusionPipeline.from_pretrained('eimiss/EimisAnimeDiffusion_1.0v')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "EimisAnimeDiffusion_1.0v is a text-to-image model trained with high-quality and detailed anime images. It works well on anime and landscape generations and supports a Gradio Web UI.", "model_name": "eimiss/EimisAnimeDiffusion_1.0v"}
{"domain": "Natural Language Processing Translation", "framework": "Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-fi-en", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-fi-en')", "performance": {"dataset": [{"name": "newsdev2015-enfi-fineng.fin.eng", "accuracy": "BLEU: 25.3, chr-F: 0.536"}, {"name": "newstest2015-enfi-fineng.fin.eng", "accuracy": "BLEU: 26.9, chr-F: 0.547"}, {"name": "newstest2016-enfi-fineng.fin.eng", "accuracy": "BLEU: 29.0, chr-F: 0.571"}, {"name": "newstest2017-enfi-fineng.fin.eng", "accuracy": "BLEU: 32.3, chr-F: 0.594"}, {"name": "newstest2018-enfi-fineng.fin.eng", "accuracy": "BLEU: 23.8, chr-F: 0.517"}, {"name": "newstest2019-fien-fineng.fin.eng", "accuracy": "BLEU: 29.0, chr-F: 0.565"}, {"name": "newstestB2016-enfi-fineng.fin.eng", "accuracy": "BLEU: 24.5, chr-F: 0.527"}, {"name": "newstestB2017-enfi-fineng.fin.eng", "accuracy": "BLEU: 27.4, chr-F: 0.557"}, {"name": "newstestB2017-fien-fineng.fin.eng", "accuracy": "BLEU: 27.4, chr-F: 0.557"}, {"name": "Tatoeba-test.fin.eng", "accuracy": "BLEU: 53.4, chr-F: 0.697"}]}, "description": "Helsinki-NLP/opus-mt-fi-en is a machine translation model for translating Finnish text to English text. It is trained on the OPUS dataset and can be used with the Hugging Face Transformers library.", "model_name": "Helsinki-NLP/opus-mt-fi-en"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "abhishek/autotrain-dog-vs-food", "api_call": "pipeline('image-classification', model='abhishek/autotrain-dog-vs-food')", "performance": {"dataset": "sasha/dog-food", "accuracy": 0.998}, "description": "A pre-trained model for classifying images as either dog or food using Hugging Face's AutoTrain framework.", "model_name": "abhishek/autotrain-dog-vs-food"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Conversational", "api_name": "microsoft/GODEL-v1_1-large-seq2seq", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('microsoft/GODEL-v1_1-large-seq2seq')", "performance": {"dataset": "Reddit discussion thread, instruction and knowledge grounded dialogs", "accuracy": "Not provided"}, "description": "GODEL is a large-scale pre-trained model for goal-directed dialogs. It is parameterized with a Transformer-based encoder-decoder model and trained for response generation grounded in external text, which allows more effective fine-tuning on dialog tasks that require conditioning the response on information that is external to the current conversation (e.g., a retrieved document). The pre-trained model can be efficiently fine-tuned and adapted to accomplish a new dialog task with a handful of task-specific dialogs. The v1.1 model is trained on 551M multi-turn dialogs from Reddit discussion thread, and 5M instruction and knowledge grounded dialogs.", "model_name": "microsoft/GODEL-v1_1-large-seq2seq"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "nvidia/segformer-b0-finetuned-cityscapes-1024-1024", "api_call": "SegformerForSemanticSegmentation.from_pretrained('nvidia/segformer-b0-finetuned-cityscapes-1024-1024')", "performance": {"dataset": "CityScapes", "accuracy": "Not provided"}, "description": "SegFormer model fine-tuned on CityScapes at resolution 1024x1024. It was introduced in the paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Xie et al. and first released in this repository.", "model_name": "nvidia/segformer-b0-finetuned-cityscapes-1024-1024"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "lysandre/tiny-vit-random", "api_call": "ViTForImageClassification.from_pretrained('lysandre/tiny-vit-random')", "performance": {"dataset": "", "accuracy": ""}, "description": "A tiny-vit-random model for image classification using Hugging Face Transformers.", "model_name": "lysandre/tiny-vit-random"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "promptcap-coco-vqa", "api_call": "PromptCap('vqascore/promptcap-coco-vqa')", "performance": {"dataset": {"coco": {"accuracy": "150 CIDEr"}, "OK-VQA": {"accuracy": "60.4%"}, "A-OKVQA": {"accuracy": "59.6%"}}}, "description": "PromptCap is a captioning model that can be controlled by natural language instruction. The instruction may contain a question that the user is interested in. It achieves SOTA performance on COCO captioning (150 CIDEr) and knowledge-based VQA tasks when paired with GPT-3 (60.4% on OK-VQA and 59.6% on A-OKVQA).", "model_name": "vqascore/promptcap-coco-vqa"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "kan-bayashi_csmsc_tts_train_tacotron2_raw_phn_pypinyin_g2p_phone_train.loss.best", "api_call": "Text2Speech.from_pretrained('espnet/kan-bayashi_csmsc_tts_train_tacotron2_raw_phn_pypinyin_g2p_phone_train.loss.best')", "performance": {"dataset": "csmsc", "accuracy": "Not specified"}, "description": "A pre-trained Text-to-Speech model for Chinese language using ESPnet framework. It can be used to convert text input into speech output in Chinese.", "model_name": "espnet/kan-bayashi_csmsc_tts_train_tacotron2_raw_phn_pypinyin_g2p_phone_train.loss.best"}
{"domain": "Audio Text-to-Speech", "framework": "ONNX", "functionality": "Text-to-Speech", "api_name": "NeuML/ljspeech-jets-onnx", "api_call": "TextToSpeech('NeuML/ljspeech-jets-onnx')", "performance": {"dataset": "ljspeech", "accuracy": null}, "description": "ESPnet JETS Text-to-Speech (TTS) Model for ONNX exported using the espnet_onnx library. Can be used with txtai pipeline or directly with ONNX.", "model_name": "NeuML/ljspeech-jets-onnx"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "martin-ha/toxic-comment-model", "api_call": "pipeline(model='martin-ha/toxic-comment-model')", "performance": {"dataset": "held-out test set", "accuracy": 0.9400000000000001, "f1-score": 0.59}, "description": "This model is a fine-tuned version of the DistilBERT model to classify toxic comments.", "model_name": "martin-ha/toxic-comment-model"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "zero-shot-object-detection", "api_name": "google/owlvit-base-patch16", "api_call": "OwlViTForObjectDetection.from_pretrained('google/owlvit-base-patch16')", "performance": {"dataset": "COCO", "accuracy": "Not provided"}, "description": "OWL-ViT is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features.", "model_name": "google/owlvit-base-patch16"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Transformers", "functionality": "Text Classification", "api_name": "typeform/distilbert-base-uncased-mnli", "api_call": "AutoModelForSequenceClassification.from_pretrained('typeform/distilbert-base-uncased-mnli')", "performance": {"dataset": "multi_nli", "accuracy": 0.8206875509}, "description": "This is the uncased DistilBERT model fine-tuned on Multi-Genre Natural Language Inference (MNLI) dataset for the zero-shot classification task.", "model_name": "typeform/distilbert-base-uncased-mnli"}
{"domain": "Natural Language Processing Text Generation", "framework": "Transformers", "functionality": "Conversational", "api_name": "microsoft/DialoGPT-large", "api_call": "AutoModelForCausalLM.from_pretrained('microsoft/DialoGPT-large')", "performance": {"dataset": "Reddit discussion thread", "accuracy": "Comparable to human response quality under a single-turn conversation Turing test"}, "description": "DialoGPT is a state-of-the-art large-scale pretrained dialogue response generation model for multi-turn conversations. The model is trained on 147M multi-turn dialogues from Reddit discussion threads.", "model_name": "microsoft/DialoGPT-large"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "tts_transformer-ar-cv7", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/tts_transformer-ar-cv7')", "performance": {"dataset": "common_voice", "accuracy": "Not specified"}, "description": "Transformer text-to-speech model for Arabic language with a single-speaker male voice, trained on Common Voice v7 dataset.", "model_name": "facebook/tts_transformer-ar-cv7"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-de-en", "api_call": "translation_pipeline('translation_de_to_en', model='Helsinki-NLP/opus-mt-de-en')", "performance": {"dataset": "opus", "accuracy": {"newssyscomb2009.de.en": 29.4, "news-test2008.de.en": 27.8, "newstest2009.de.en": 26.8, "newstest2010.de.en": 30.2, "newstest2011.de.en": 27.4, "newstest2012.de.en": 29.1, "newstest2013.de.en": 32.1, "newstest2014-deen.de.en": 34.0, "newstest2015-ende.de.en": 34.2, "newstest2016-ende.de.en": 40.4, "newstest2017-ende.de.en": 35.7, "newstest2018-ende.de.en": 43.7, "newstest2019-deen.de.en": 40.1, "Tatoeba.de.en": 55.4}}, "description": "A German to English translation model trained on the OPUS dataset using the Hugging Face Transformers library.", "model_name": "Helsinki-NLP/opus-mt-de-en"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Language model", "api_name": "google/flan-t5-small", "api_call": "T5ForConditionalGeneration.from_pretrained('google/flan-t5-small')", "performance": {"dataset": [{"name": "MMLU", "accuracy": "75.2% (five-shot, reported for Flan-PaLM 540B)"}]}, "description": "FLAN-T5 small is a version of the T5 language model fine-tuned on more than 1000 additional tasks covering multiple languages. The FLAN paper reports state-of-the-art performance on several benchmarks for its largest model, Flan-PaLM 540B, such as 75.2% on five-shot MMLU; that figure was not measured on this small checkpoint. The model is designed for research on language models, including zero-shot and few-shot NLP tasks, reasoning, question answering, fairness, and safety research. It has not been tested in real-world applications and should not be used directly in any application without prior assessment of safety and fairness concerns specific to the application.", "model_name": "google/flan-t5-small"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "microsoft/swinv2-tiny-patch4-window8-256", "api_call": "AutoModelForImageClassification.from_pretrained('microsoft/swinv2-tiny-patch4-window8-256')", "performance": {"dataset": "imagenet-1k", "accuracy": "Not provided"}, "description": "Swin Transformer v2 model pre-trained on ImageNet-1k at resolution 256x256. It was introduced in the paper Swin Transformer V2: Scaling Up Capacity and Resolution by Liu et al. and first released in this repository. The Swin Transformer is a type of Vision Transformer. It builds hierarchical feature maps by merging image patches in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window. Swin Transformer v2 adds 3 main improvements: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images.", "model_name": "microsoft/swinv2-tiny-patch4-window8-256"}
{"domain": "Tabular Tabular Classification", "framework": "Scikit-learn", "functionality": "Wine Quality classification", "api_name": "osanseviero/wine-quality", "api_call": "joblib.load(cached_download(hf_hub_url('julien-c/wine-quality', 'sklearn_model.joblib')))", "performance": {"dataset": "winequality-red.csv", "accuracy": 0.6616635397}, "description": "A Simple Example of Scikit-learn Pipeline for Wine Quality classification. Inspired by https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976 by Saptashwa Bhattacharyya.", "model_name": "julien-c/wine-quality"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "google/ddpm-cat-256", "api_call": "DDPMPipeline.from_pretrained('google/ddpm-cat-256')", "performance": {"dataset": "CIFAR10", "accuracy": {"Inception_score": 9.46, "FID_score": 3.17}}, "description": "Denoising Diffusion Probabilistic Models (DDPM) is a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. It can generate high-quality images using discrete noise schedulers such as scheduling_ddpm, scheduling_ddim, and scheduling_pndm. The model is trained on the unconditional CIFAR10 dataset and 256x256 LSUN, obtaining an Inception score of 9.46 and a state-of-the-art FID score of 3.17.", "model_name": "google/ddpm-cat-256"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "ControlNet - M-LSD Straight Line Version", "api_name": "lllyasviel/sd-controlnet-mlsd", "api_call": "ControlNetModel.from_pretrained('lllyasviel/sd-controlnet-mlsd')", "performance": {"dataset": "600k edge-image, caption pairs generated from Places2", "accuracy": "Not specified"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on M-LSD straight line detection. It can be used in combination with Stable Diffusion.", "model_name": "lllyasviel/sd-controlnet-mlsd"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7", "api_call": "AutoModelForSequenceClassification.from_pretrained('MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7')", "performance": {"dataset": [{"name": "MultiNLI-matched", "accuracy": 0.857}, {"name": "MultiNLI-mismatched", "accuracy": 0.856}, {"name": "ANLI-all", "accuracy": 0.537}, {"name": "ANLI-r3", "accuracy": 0.497}, {"name": "WANLI", "accuracy": 0.732}, {"name": "LingNLI", "accuracy": 0.788}, {"name": "fever-nli", "accuracy": 0.761}]}, "description": "This multilingual model can perform natural language inference (NLI) on 100 languages and is therefore also suitable for multilingual zero-shot classification. The underlying mDeBERTa-v3-base model was pre-trained by Microsoft on the CC100 multilingual dataset with 100 languages. The model was then fine-tuned on the XNLI dataset and on the multilingual-NLI-26lang-2mil7 dataset. Both datasets contain more than 2.7 million hypothesis-premise pairs in 27 languages spoken by more than 4 billion people.", "model_name": "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "LunarLander-v2", "api_name": "araffin/ppo-LunarLander-v2", "api_call": "PPO.load_from_hub('araffin/ppo-LunarLander-v2', 'ppo-LunarLander-v2.zip')", "performance": {"dataset": "LunarLander-v2", "accuracy": "283.49 +/- 13.74"}, "description": "This is a trained model of a PPO agent playing LunarLander-v2 using the stable-baselines3 library.", "model_name": "araffin/ppo-LunarLander-v2"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Named Entity Recognition", "api_name": "distilbert-base-multilingual-cased-ner-hrl", "api_call": "AutoModelForTokenClassification.from_pretrained('Davlan/distilbert-base-multilingual-cased-ner-hrl')", "performance": {"dataset": [{"name": "ANERcorp", "language": "Arabic"}, {"name": "conll 2003", "language": "German"}, {"name": "conll 2003", "language": "English"}, {"name": "conll 2002", "language": "Spanish"}, {"name": "Europeana Newspapers", "language": "French"}, {"name": "Italian I-CAB", "language": "Italian"}, {"name": "Latvian NER", "language": "Latvian"}, {"name": "conll 2002", "language": "Dutch"}, {"name": "Paramopama + Second Harem", "language": "Portuguese"}, {"name": "MSRA", "language": "Chinese"}], "accuracy": "Not specified"}, "description": "distilbert-base-multilingual-cased-ner-hrl is a Named Entity Recognition model for 10 high-resourced languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese) based on a fine-tuned distilled BERT base model. It has been trained to recognize three types of entities: location (LOC), organizations (ORG), and person (PER).", "model_name": "Davlan/distilbert-base-multilingual-cased-ner-hrl"}
{"domain": "Audio Audio-to-Audio", "framework": "Fairseq", "functionality": "speech-to-speech-translation", "api_name": "xm_transformer_unity_en-hk", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/xm_transformer_unity_en-hk')", "performance": {"dataset": "MuST-C", "accuracy": null}, "description": "Speech-to-speech translation model with two-pass decoder (UnitY) from fairseq: English-Hokkien. Trained with supervised data in TED domain, and weakly supervised data in TED and Audiobook domain.", "model_name": "facebook/xm_transformer_unity_en-hk"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "openai/clip-vit-large-patch14-336", "api_call": "CLIPModel.from_pretrained('openai/clip-vit-large-patch14-336')", "performance": {"dataset": "unknown", "accuracy": "N/A"}, "description": "This model was trained from scratch on an unknown dataset.", "model_name": "openai/clip-vit-large-patch14-336"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "videomae-large", "api_call": "VideoMAEForPreTraining.from_pretrained('MCG-NJU/videomae-large')", "performance": {"dataset": "Kinetics-400", "accuracy": "Not provided"}, "description": "VideoMAE is an extension of Masked Autoencoders (MAE) to video. The architecture of the model is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches. Videos are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds fixed sine/cosine position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of videos that can then be used to extract features useful for downstream tasks.", "model_name": "MCG-NJU/videomae-large"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-video synthesis", "api_name": "damo-vilab/text-to-video-ms-1.7b", "api_call": "DiffusionPipeline.from_pretrained('damo-vilab/text-to-video-ms-1.7b', torch_dtype=torch.float16, variant='fp16')", "performance": {"dataset": "Webvid, ImageNet, LAION5B", "accuracy": "N/A"}, "description": "This model is based on a multi-stage text-to-video generation diffusion model, which inputs a description text and returns a video that matches the text description. The model consists of three sub-networks: text feature extraction model, text feature-to-video latent space diffusion model, and video latent space to video visual space model. The overall model parameters are about 1.7 billion. Currently, it only supports English input.", "model_name": "damo-vilab/text-to-video-ms-1.7b"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Information Retrieval", "api_name": "cross-encoder/ms-marco-MiniLM-L-12-v2", "api_call": "AutoModelForSequenceClassification.from_pretrained('cross-encoder/ms-marco-MiniLM-L-12-v2')", "performance": {"dataset": {"TREC Deep Learning 2019": {"NDCG@10": 74.31}, "MS Marco Passage Reranking": {"MRR@10": 39.02, "accuracy": "960 Docs / Sec"}}}, "description": "This model was trained on the MS Marco Passage Ranking task. The model can be used for Information Retrieval: Given a query, encode the query with all possible passages (e.g. retrieved with ElasticSearch). Then sort the passages in a decreasing order. See SBERT.net Retrieve & Re-rank for more details. The training code is available here: SBERT.net Training MS Marco", "model_name": "cross-encoder/ms-marco-MiniLM-L-12-v2"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Speech-to-speech translation", "api_name": "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur')", "performance": {"dataset": "covost2", "accuracy": null}, "description": "Speech-to-speech translation model from fairseq S2UT (paper/code) for Spanish-English. Trained on mTEDx, CoVoST 2, Europarl-ST, and VoxPopuli.", "model_name": "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Speech Recognition", "api_name": "jonatasgrosman/wav2vec2-large-xlsr-53-russian", "api_call": "SpeechRecognitionModel('jonatasgrosman/wav2vec2-large-xlsr-53-russian')", "performance": {"dataset": "mozilla-foundation/common_voice_6_0", "accuracy": {"Test WER": 13.3, "Test CER": 2.88, "Test WER (+LM)": 9.57, "Test CER (+LM)": 2.24}}, "description": "Fine-tuned XLSR-53 large model for speech recognition in Russian. Fine-tuned facebook/wav2vec2-large-xlsr-53 on Russian using the train and validation splits of Common Voice 6.1 and CSS10.", "model_name": "jonatasgrosman/wav2vec2-large-xlsr-53-russian"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "videomae-small-finetuned-ssv2", "api_call": "VideoMAEForVideoClassification.from_pretrained('MCG-NJU/videomae-small-finetuned-ssv2')", "performance": {"dataset": "Something-Something V2", "accuracy": {"top-1": 66.8, "top-5": 90.3}}, "description": "VideoMAE is an extension of Masked Autoencoders (MAE) to video. The architecture of the model is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches. Videos are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds fixed sine/cosine position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of videos that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled videos for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire video.", "model_name": "MCG-NJU/videomae-small-finetuned-ssv2"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "bert-large-cased-whole-word-masking-finetuned-squad", "api_call": "AutoModel.from_pretrained('bert-large-cased-whole-word-masking-finetuned-squad')", "performance": {"dataset": [{"name": "BookCorpus", "accuracy": "N/A"}, {"name": "English Wikipedia", "accuracy": "N/A"}]}, "description": "BERT large model (cased) whole word masking finetuned on SQuAD. This model is cased and trained with a new technique: Whole Word Masking. After pre-training, this model was fine-tuned on the SQuAD dataset.", "model_name": "bert-large-cased-whole-word-masking-finetuned-squad"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Speech Recognition", "api_name": "jonatasgrosman/wav2vec2-large-xlsr-53-english", "api_call": "Wav2Vec2Model.from_pretrained('jonatasgrosman/wav2vec2-large-xlsr-53-english')", "performance": {"dataset": "mozilla-foundation/common_voice_6_0", "accuracy": {"Test WER": 19.06, "Test CER": 7.69, "Test WER (+LM)": 14.81, "Test CER (+LM)": 6.84}}, "description": "Fine-tuned facebook/wav2vec2-large-xlsr-53 on English using the train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16kHz.", "model_name": "jonatasgrosman/wav2vec2-large-xlsr-53-english"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Entity Extraction", "api_name": "903429548", "api_call": "AutoModelForTokenClassification.from_pretrained('ismail-lucifer011/autotrain-company_all-903429548', use_auth_token=True)", "performance": {"dataset": "ismail-lucifer011/autotrain-data-company_all", "accuracy": 0.9979930567}, "description": "A token classification model trained using AutoTrain for entity extraction. The model is based on the distilbert architecture and trained on the ismail-lucifer011/autotrain-data-company_all dataset. It can be used to identify and extract company names from text.", "model_name": "ismail-lucifer011/autotrain-company_all-903429548"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/git-large-vqav2", "api_call": "AutoModel.from_pretrained('microsoft/git-large-vqav2')", "performance": {"dataset": "VQAv2", "accuracy": "Refer to the paper"}, "description": "GIT (short for GenerativeImage2Text) model, large-sized version, fine-tuned on VQAv2. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. The model is a Transformer decoder conditioned on both CLIP image tokens and text tokens. It can be used for tasks like image and video captioning, visual question answering (VQA) on images and videos, and even image classification.", "model_name": "microsoft/git-large-vqav2"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/trocr-large-printed", "api_call": "VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-printed')", "performance": {"dataset": "SROIE", "accuracy": "Not provided"}, "description": "TrOCR model fine-tuned on the SROIE dataset. It was introduced in the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al. and first released in this repository. The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image encoder was initialized from the weights of BEiT, while the text decoder was initialized from the weights of RoBERTa.", "model_name": "microsoft/trocr-large-printed"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling", "api_name": "albert-base-v2", "api_call": "pipeline('fill-mask', model='albert-base-v2')", "performance": {"dataset": {"SQuAD1.1": "90.2/83.2", "SQuAD2.0": "82.1/79.3", "MNLI": "84.6", "SST-2": "92.9", "RACE": "66.8"}, "accuracy": "82.3"}, "description": "ALBERT Base v2 is a transformers model pretrained on a large corpus of English data in a self-supervised fashion using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model, as all ALBERT models, is uncased: it does not make a difference between english and English.", "model_name": "albert-base-v2"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "nvidia/mit-b0", "api_call": "SegformerForImageClassification.from_pretrained('nvidia/mit-b0')", "performance": {"dataset": "imagenet_1k", "accuracy": "Not provided"}, "description": "SegFormer encoder fine-tuned on Imagenet-1k. It was introduced in the paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Xie et al. and first released in this repository. SegFormer consists of a hierarchical Transformer encoder and a lightweight all-MLP decode head to achieve great results on semantic segmentation benchmarks such as ADE20K and Cityscapes.", "model_name": "nvidia/mit-b0"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "distilroberta-base", "api_call": "pipeline('fill-mask', model='distilroberta-base')", "performance": {"dataset": "openwebtext", "accuracy": "Not provided"}, "description": "DistilRoBERTa is a distilled version of the RoBERTa-base model, designed to be smaller, faster, and lighter. It is a Transformer-based language model trained on the OpenWebTextCorpus, which is a reproduction of OpenAI's WebText dataset. The model has 6 layers, 768 dimensions, and 12 heads, totaling 82M parameters. It is primarily intended for fine-tuning on downstream tasks such as sequence classification, token classification, or question answering.", "model_name": "distilroberta-base"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Translation, Summarization, Question Answering, Sentiment Analysis, Regression", "api_name": "t5-large", "api_call": "T5Model.from_pretrained('t5-large')", "performance": {"dataset": "c4", "accuracy": "See research paper, Table 14"}, "description": "T5-Large is a Text-To-Text Transfer Transformer (T5) model with 770 million parameters. It is designed to handle a variety of NLP tasks, including translation, summarization, question answering, sentiment analysis, and regression. The model is pre-trained on the Colossal Clean Crawled Corpus (C4) and fine-tuned on various supervised and unsupervised tasks.", "model_name": "t5-large"}
{"domain": "Multimodal Graph Machine Learning", "framework": "Hugging Face Transformers", "functionality": "GTA5 AI model", "api_name": "GTA5_PROCESS_LEARNING_AI", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('janpase97/codeformer-pretrained')", "performance": {"dataset": "MNIST", "accuracy": "Not specified"}, "description": "This AI model is designed to train on the MNIST dataset with a specified data cap and save the trained model as an .onnx file. It can be attached to the GTA5 game process by PID and checks if the targeted application is running. The model is trained on a GPU if available.", "model_name": "janpase97/codeformer-pretrained"}
{"domain": "Audio Voice Activity Detection", "framework": "Hugging Face Transformers", "functionality": "Voice Activity Detection, Speech-to-Noise Ratio, and C50 Room Acoustics Estimation", "api_name": "pyannote/brouhaha", "api_call": "Model.from_pretrained('pyannote/brouhaha', use_auth_token='ACCESS_TOKEN_GOES_HERE')", "performance": {"dataset": "LibriSpeech, AudioSet, EchoThief, MIT-Acoustical-Reverberation-Scene", "accuracy": "Not provided"}, "description": "Brouhaha is a joint voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation model. It is based on the PyTorch framework and uses the pyannote.audio library.", "model_name": "pyannote/brouhaha"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "tuner007/pegasus_paraphrase", "api_call": "PegasusForConditionalGeneration.from_pretrained('tuner007/pegasus_paraphrase')", "performance": {"dataset": "unknown", "accuracy": "unknown"}, "description": "PEGASUS fine-tuned for paraphrasing", "model_name": "tuner007/pegasus_paraphrase"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Speech Enhancement", "api_name": "speechbrain/sepformer-wham16k-enhancement", "api_call": "separator.from_hparams(source='speechbrain/sepformer-wham16k-enhancement', savedir='pretrained_models/sepformer-wham16k-enhancement')", "performance": {"dataset": "WHAM!", "accuracy": {"Test-Set SI-SNR": "14.3 dB", "Test-Set PESQ": "2.20"}}, "description": "This repository provides all the necessary tools to perform speech enhancement (denoising) with a SepFormer model, implemented with SpeechBrain, and pretrained on WHAM! dataset with 16k sampling frequency, which is basically a version of WSJ0-Mix dataset with environmental noise and reverberation in 8k.", "model_name": "speechbrain/sepformer-wham16k-enhancement"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table-based QA", "api_name": "neulab/omnitab-large", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('neulab/omnitab-large')", "performance": {"dataset": "wikitablequestions", "accuracy": null}, "description": "OmniTab is a table-based QA model proposed in OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering. neulab/omnitab-large (based on BART architecture) is initialized with microsoft/tapex-large and continuously pretrained on natural and synthetic data.", "model_name": "neulab/omnitab-large"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "google/vit-base-patch16-224-in21k", "api_call": "ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')", "performance": {"dataset": "ImageNet-21k", "accuracy": "Refer to tables 2 and 5 of the original paper"}, "description": "The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.", "model_name": "google/vit-base-patch16-224-in21k"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling", "api_name": "bert-base-cased", "api_call": "pipeline('fill-mask', model='bert-base-cased')", "performance": {"dataset": "GLUE", "accuracy": 79.6}, "description": "BERT base model (cased) is a pre-trained transformer model on English language using a masked language modeling (MLM) objective. It was introduced in a paper and first released in a repository. This model is case-sensitive, which means it can differentiate between 'english' and 'English'. The model can be used for masked language modeling or next sentence prediction, but it's mainly intended to be fine-tuned on a downstream task.", "model_name": "bert-base-cased"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Named Entity Recognition", "api_name": "dslim/bert-large-NER", "api_call": "AutoModelForTokenClassification.from_pretrained('dslim/bert-large-NER')", "performance": {"dataset": "conll2003", "accuracy": {"f1": 0.92, "precision": 0.92, "recall": 0.919}}, "description": "bert-large-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance for the NER task. It has been trained to recognize four types of entities: location (LOC), organizations (ORG), person (PER) and Miscellaneous (MISC).", "model_name": "dslim/bert-large-NER"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221116-110652", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221116-110652')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.40180000000000005, "Mae": 0.3272, "Rmse": 0.4546, "Abs Rel": 0.3934, "Log Mae": 0.138, "Log Rmse": 0.1907, "Delta1": 0.45980000000000004, "Delta2": 0.7659, "Delta3": 0.9082}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset. It is used for depth estimation tasks.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221116-110652"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Information Retrieval", "api_name": "cross-encoder/ms-marco-MiniLM-L-6-v2", "api_call": "AutoModelForSequenceClassification.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')", "performance": {"dataset": "MS Marco Passage Reranking", "accuracy": "MRR@10: 39.01%"}, "description": "This model was trained on the MS Marco Passage Ranking task and can be used for Information Retrieval. Given a query, encode the query with all possible passages, then sort the passages in a decreasing order.", "model_name": "cross-encoder/ms-marco-MiniLM-L-6-v2"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Summarization", "api_name": "mrm8488/t5-base-finetuned-summarize-news", "api_call": "AutoModelWithLMHead.from_pretrained('mrm8488/t5-base-finetuned-summarize-news')", "performance": {"dataset": "News Summary", "accuracy": "Not provided"}, "description": "Google's T5 base fine-tuned on News Summary dataset for summarization downstream task. The dataset consists of 4515 examples and contains Author_name, Headlines, Url of Article, Short text, Complete Article. Time period ranges from February to August 2017.", "model_name": "mrm8488/t5-base-finetuned-summarize-news"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Paraphrasing", "api_name": "prithivida/parrot_paraphraser_on_T5", "api_call": "Parrot(model_tag='prithivida/parrot_paraphraser_on_T5', use_gpu=False)", "performance": {"dataset": "Not mentioned", "accuracy": "Not mentioned"}, "description": "Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models. It offers knobs to control Adequacy, Fluency, and Diversity as per your needs. It mainly focuses on augmenting texts typed-into or spoken-to conversational interfaces for building robust NLU models.", "model_name": "prithivida/parrot_paraphraser_on_T5"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup", "api_call": "pipeline('image-classification', model='laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup')", "performance": {"dataset": "ImageNet-1k", "accuracy": "76.9"}, "description": "A series of CLIP ConvNeXt-Large (w/ extra text depth, vision MLP head) models trained on the LAION-2B (english) subset of LAION-5B using OpenCLIP. The models utilize the timm ConvNeXt-Large model (convnext_large) as the image tower, a MLP (fc - gelu - drop - fc) head in vision tower instead of the single projection of other CLIP models, and a text tower with same width but 4 layers more depth than ViT-L / RN50x16 models (depth 16, embed dim 768).", "model_name": "laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-large-finetuned-wikisql-supervised", "api_call": "pipeline('table-question-answering', model='google/tapas-large-finetuned-wikisql-supervised')", "performance": {"dataset": "wikisql", "accuracy": "Not provided"}, "description": "TAPAS is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion. It can be used for answering questions related to a table.", "model_name": "google/tapas-large-finetuned-wikisql-supervised"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/tapex-base-finetuned-wikisql", "api_call": "BartForConditionalGeneration.from_pretrained('microsoft/tapex-base-finetuned-wikisql')", "performance": {"dataset": "wikisql"}, "description": "TAPEX (Table Pre-training via Execution) is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries.", "model_name": "microsoft/tapex-base-finetuned-wikisql"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/trocr-large-handwritten", "api_call": "VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-handwritten')", "performance": {"dataset": "IAM", "accuracy": "Not specified"}, "description": "TrOCR model fine-tuned on the IAM dataset. It was introduced in the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al. and first released in this repository. The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image encoder was initialized from the weights of BEiT, while the text decoder was initialized from the weights of RoBERTa.", "model_name": "microsoft/trocr-large-handwritten"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "nitrosocke/nitro-diffusion", "api_call": "StableDiffusionPipeline.from_pretrained('nitrosocke/nitro-diffusion', torch_dtype=torch.float16)", "performance": {"dataset": "Stable Diffusion", "accuracy": "N/A"}, "description": "Nitro Diffusion is a fine-tuned Stable Diffusion model trained on three artstyles simultaneously while keeping each style separate from the others. It allows for high control of mixing, weighting, and single style use.", "model_name": "nitrosocke/nitro-diffusion"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/mask2former-swin-large-coco-panoptic", "api_call": "Mask2FormerForUniversalSegmentation.from_pretrained('facebook/mask2former-swin-large-coco-panoptic')", "performance": {"dataset": "COCO", "accuracy": "Not provided"}, "description": "Mask2Former model trained on COCO panoptic segmentation (large-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, both in terms of performance and efficiency.", "model_name": "facebook/mask2former-swin-large-coco-panoptic"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-ru-en", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-ru-en')", "performance": {"dataset": "newstest2019-ruen.ru.en", "accuracy": 31.4}, "description": "A Russian to English translation model developed by the Language Technology Research Group at the University of Helsinki. It is based on the Transformer-align architecture and trained on the OPUS dataset. The model can be used for translation and text-to-text generation tasks.", "model_name": "Helsinki-NLP/opus-mt-ru-en"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "cross-encoder/nli-deberta-v3-small", "api_call": "CrossEncoder('cross-encoder/nli-deberta-v3-small')", "performance": {"dataset": {"SNLI-test": "91.65", "MNLI-mismatched": "87.55"}, "accuracy": {"SNLI-test": "91.65", "MNLI-mismatched": "87.55"}}, "description": "Cross-Encoder for Natural Language Inference based on microsoft/deberta-v3-small, trained on the SNLI and MultiNLI datasets. For a given sentence pair, it will output three scores corresponding to the labels: contradiction, entailment, neutral.", "model_name": "cross-encoder/nli-deberta-v3-small"}
{"domain": "Tabular Tabular Regression", "framework": "Scikit-learn", "functionality": "Tabular Regression", "api_name": "rajistics/california_housing", "api_call": "RandomForestRegressor()", "performance": {"dataset": "", "accuracy": ""}, "description": "A RandomForestRegressor model trained on the California Housing dataset for predicting housing prices.", "model_name": "RandomForestRegressor()"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "text2text-generation", "api_name": "google/pegasus-cnn_dailymail", "api_call": "PegasusForConditionalGeneration.from_pretrained('google/pegasus-cnn_dailymail')", "performance": {"dataset": "cnn_dailymail", "accuracy": "44.16/21.56/41.30"}, "description": "PEGASUS model for abstractive summarization, pretrained on the CNN/DailyMail dataset.", "model_name": "google/pegasus-cnn_dailymail"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Transformers", "functionality": "Zero-Shot Classification", "api_name": "valhalla/distilbart-mnli-12-9", "api_call": "pipeline('zero-shot-classification', model='valhalla/distilbart-mnli-12-9')", "performance": {"dataset": "MNLI", "accuracy": {"matched_acc": 89.56, "mismatched_acc": 89.52}}, "description": "distilbart-mnli is the distilled version of bart-large-mnli created using the No Teacher Distillation technique proposed for BART summarisation by Huggingface. It is used for zero-shot text classification tasks.", "model_name": "valhalla/distilbart-mnli-12-9"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "zero-shot-object-detection", "api_name": "google/owlvit-base-patch32", "api_call": "OwlViTForObjectDetection.from_pretrained('google/owlvit-base-patch32')", "performance": {"dataset": "COCO and OpenImages", "accuracy": "Not specified"}, "description": "OWL-ViT is a zero-shot text-conditioned object detection model that uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. The model can be used to query an image with one or multiple text queries.", "model_name": "google/owlvit-base-patch32"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "text-generation", "api_name": "pygmalion-2.7b", "api_call": "pipeline('text-generation', model='PygmalionAI/pygmalion-2.7b')", "performance": {"dataset": "56MB of dialogue data", "accuracy": "N/A"}, "description": "Pygmalion 2.7B is a proof-of-concept dialogue model based on EleutherAI's gpt-neo-2.7B. It is fine-tuned on 56MB of dialogue data gathered from multiple sources, including real and partially machine-generated conversations. The model is intended for use in generating conversational responses and can be used with a specific input format that includes character persona, dialogue history, and user input message.", "model_name": "PygmalionAI/pygmalion-2.7b"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "wavymulder/Analog-Diffusion", "api_call": "pipeline('text-to-image', model='wavymulder/Analog-Diffusion')", "performance": {"dataset": "analog photographs", "accuracy": "Not specified"}, "description": "Analog Diffusion is a dreambooth model trained on a diverse set of analog photographs. It can generate images based on text prompts with an analog style. Use the activation token 'analog style' in your prompt to get the desired output. The model is available on the Hugging Face Inference API and can be used with the transformers library.", "model_name": "wavymulder/Analog-Diffusion"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-ROMANCE-en", "api_call": "MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-ROMANCE-en')", "performance": {"dataset": "opus", "accuracy": {"BLEU": 62.2, "chr-F": 0.75}}, "description": "A model for translating Romance languages to English, trained on the OPUS dataset. It supports multiple source languages such as French, Spanish, Portuguese, Italian, and Romanian, among others. The model is based on the transformer architecture and uses normalization and SentencePiece for pre-processing.", "model_name": "Helsinki-NLP/opus-mt-ROMANCE-en"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Transformers", "functionality": "Zero-Shot Classification", "api_name": "cross-encoder/nli-deberta-v3-xsmall", "api_call": "pipeline('zero-shot-classification', model='cross-encoder/nli-deberta-v3-xsmall')", "performance": {"dataset": {"SNLI-test": "91.64", "MNLI_mismatched": "87.77"}}, "description": "This model is a Cross-Encoder for Natural Language Inference, trained on the SNLI and MultiNLI datasets. It can be used for zero-shot classification tasks.", "model_name": "cross-encoder/nli-deberta-v3-xsmall"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Speech Emotion Recognition", "api_name": "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition", "api_call": "Wav2Vec2ForCTC.from_pretrained('ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition')", "performance": {"dataset": "RAVDESS", "accuracy": 0.8223}, "description": "The model is a fine-tuned version of jonatasgrosman/wav2vec2-large-xlsr-53-english for a Speech Emotion Recognition (SER) task. The dataset used to fine-tune the original pre-trained model is the RAVDESS dataset. This dataset provides 1440 samples of recordings from actors performing on 8 different emotions in English, which are: emotions = ['angry', 'calm', 'disgust', 'fearful', 'happy', 'neutral', 'sad', 'surprised'].", "model_name": "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Asteroid", "api_name": "mpariente/DPRNNTasNet-ks2_WHAM_sepclean", "api_call": "pipeline('audio-source-separation', model='mpariente/DPRNNTasNet-ks2_WHAM_sepclean')", "performance": {"dataset": "WHAM!", "si_sdr": 19.3167434907, "si_sdr_imp": 19.3178952739, "sdr": 19.6808534719, "sdr_imp": 19.5298092933, "sir": 30.3622139987, "sir_imp": 30.2111698201, "sar": 20.1555325134, "sar_imp": -129.0209176235, "stoi": 0.9777266431, "stoi_imp": 0.2396809152}, "description": "This model was trained by Manuel Pariente using the wham/DPRNN recipe in Asteroid. It was trained on the sep_clean task of the WHAM! dataset.", "model_name": "mpariente/DPRNNTasNet-ks2_WHAM_sepclean"}
{"domain": "Audio Audio-to-Audio", "framework": "Fairseq", "functionality": "speech-to-speech-translation", "api_name": "facebook/textless_sm_ro_en", "api_call": "pipeline('audio-to-audio', model='facebook/textless_sm_ro_en')", "performance": {"dataset": "unknown", "accuracy": "unknown"}, "description": "A speech-to-speech translation model for Romanian to English developed by Facebook AI", "model_name": "facebook/textless_sm_ro_en"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "distilbert-base-cased-distilled-squad", "api_call": "DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')", "performance": {"dataset": "SQuAD v1.1", "accuracy": {"Exact Match": 79.6, "F1": 86.996}}, "description": "DistilBERT base cased distilled SQuAD is a fine-tuned checkpoint of DistilBERT-base-cased, trained using knowledge distillation on the SQuAD v1.1 dataset. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark. This model can be used for question answering.", "model_name": "distilbert-base-cased-distilled-squad"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Text2Text Generation", "api_name": "mrm8488/bert2bert_shared-spanish-finetuned-summarization", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('mrm8488/bert2bert_shared-spanish-finetuned-summarization')", "performance": {"dataset": "mlsum", "accuracy": {"Rouge1": 26.24, "Rouge2": 8.9, "RougeL": 21.01, "RougeLsum": 21.02}}, "description": "Spanish BERT2BERT (BETO) fine-tuned on MLSUM ES for summarization", "model_name": "mrm8488/bert2bert_shared-spanish-finetuned-summarization"}
{"domain": "Natural Language Processing Translation", "framework": "Transformers", "functionality": "Translation", "api_name": "opus-mt-de-es", "api_call": "pipeline('translation_de_to_es', model='Helsinki-NLP/opus-mt-de-es')", "performance": {"dataset": "Tatoeba.de.es", "accuracy": {"BLEU": 48.5, "chr-F": 0.676}}, "description": "A German to Spanish translation model based on the OPUS dataset and trained using the transformer-align architecture. The model is pre-processed with normalization and SentencePiece tokenization.", "model_name": "Helsinki-NLP/opus-mt-de-es"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "patrickjohncyh/fashion-clip", "api_call": "CLIPModel.from_pretrained('patrickjohncyh/fashion-clip')", "performance": {"dataset": [{"name": "FMNIST", "accuracy": 0.8300000000000001}, {"name": "KAGL", "accuracy": 0.73}, {"name": "DEEP", "accuracy": 0.62}]}, "description": "FashionCLIP is a CLIP-based model developed to produce general product representations for fashion concepts. Leveraging the pre-trained checkpoint (ViT-B/32) released by OpenAI, it is trained on a large, high-quality novel fashion dataset to study whether domain specific fine-tuning of CLIP-like models is sufficient to produce product representations that are zero-shot transferable to entirely new datasets and tasks.", "model_name": "patrickjohncyh/fashion-clip"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "deepset/tinyroberta-squad2", "api_call": "AutoModelForQuestionAnswering.from_pretrained('deepset/tinyroberta-squad2')", "performance": {"dataset": "squad_v2", "accuracy": {"exact": 78.6911479828, "f1": 81.9198998537}}, "description": "This is the distilled version of the deepset/roberta-base-squad2 model. This model has a comparable prediction quality and runs at twice the speed of the base model.", "model_name": "deepset/tinyroberta-squad2"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/distiluse-base-multilingual-cased-v1", "api_call": "SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "N/A"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 512 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/distiluse-base-multilingual-cased-v1"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "facebook/timesformer-hr-finetuned-k400", "api_call": "TimesformerForVideoClassification.from_pretrained('facebook/timesformer-hr-finetuned-k400')", "performance": {"dataset": "Kinetics-400", "accuracy": "Not specified"}, "description": "TimeSformer model pre-trained on Kinetics-400 for video classification into one of the 400 possible Kinetics-400 labels. Introduced in the paper TimeSformer: Is Space-Time Attention All You Need for Video Understanding? by Bertasius et al.", "model_name": "facebook/timesformer-hr-finetuned-k400"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Engineering", "api_name": "microsoft/unixcoder-base", "api_call": "AutoModel.from_pretrained('microsoft/unixcoder-base')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "UniXcoder is a unified cross-modal pre-trained model that leverages multimodal data (i.e. code comment and AST) to pretrain code representation. Developed by Microsoft Team and shared by Hugging Face. It is based on the RoBERTa model and trained on English language data. The model can be used for feature engineering tasks.", "model_name": "microsoft/unixcoder-base"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "keremberke/yolov8s-hard-hat-detection", "api_call": "YOLO('keremberke/yolov8s-hard-hat-detection')", "performance": {"dataset": "hard-hat-detection", "accuracy": 0.834}, "description": "An object detection model trained to detect hard hats and no-hard hats in images. The model is based on YOLOv8 architecture and can be used for safety applications.", "model_name": "keremberke/yolov8s-hard-hat-detection"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "text2vec-large-chinese", "api_call": "AutoModel.from_pretrained('GanymedeNil/text2vec-large-chinese')", "performance": {"dataset": "https://huggingface.co/shibing624/text2vec-base-chinese", "accuracy": "Not provided"}, "description": "A Chinese sentence similarity model based on the derivative model of https://huggingface.co/shibing624/text2vec-base-chinese, replacing MacBERT with LERT, and keeping other training conditions unchanged.", "model_name": "GanymedeNil/text2vec-large-chinese"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "lewtun/tiny-random-mt5", "api_call": "AutoModel.from_pretrained('lewtun/tiny-random-mt5')", "performance": {"dataset": "", "accuracy": ""}, "description": "A tiny random mt5 model for text generation", "model_name": "lewtun/tiny-random-mt5"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "SYSPIN/Marathi_Male_TTS", "api_call": "api.load('ESPnet/espnet_model_zoo:SYSPIN/Marathi_Male_TTS')", "performance": {"dataset": "", "accuracy": ""}, "description": "A Marathi Male Text-to-Speech model using the ESPnet framework.", "model_name": "ESPnet/espnet_model_zoo"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "facebook/convnext-base-224", "api_call": "ConvNextForImageClassification.from_pretrained('facebook/convnext-base-224')", "performance": {"dataset": "imagenet-1k", "accuracy": null}, "description": "ConvNeXT is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, that claims to outperform them. The authors started from a ResNet and 'modernized' its design by taking the Swin Transformer as inspiration. You can use the raw model for image classification.", "model_name": "facebook/convnext-base-224"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/all-roberta-large-v1", "api_call": "SentenceTransformer('sentence-transformers/all-roberta-large-v1')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Automated evaluation"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/all-roberta-large-v1"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling", "api_name": "bert-base-multilingual-cased", "api_call": "pipeline('fill-mask', model='bert-base-multilingual-cased')", "performance": {"dataset": "wikipedia", "accuracy": "Not provided"}, "description": "BERT multilingual base model (cased) is pretrained on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective. The model is case sensitive and can be used for masked language modeling or next sentence prediction. It is intended to be fine-tuned on a downstream task, such as sequence classification, token classification, or question answering.", "model_name": "bert-base-multilingual-cased"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Token Classification", "api_name": "ckiplab/bert-base-chinese-ws", "api_call": "AutoModel.from_pretrained('ckiplab/bert-base-chinese-ws')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "This project provides traditional Chinese transformers models (including ALBERT, BERT, GPT2) and NLP tools (including word segmentation, part-of-speech tagging, named entity recognition).", "model_name": "ckiplab/bert-base-chinese-ws"}
{"domain": "Audio Audio-to-Audio", "framework": "SpeechBrain", "functionality": "Audio Source Separation", "api_name": "speechbrain/sepformer-whamr", "api_call": "separator.from_hparams(source='speechbrain/sepformer-whamr', savedir='pretrained_models/sepformer-whamr')", "performance": {"dataset": "WHAMR!", "accuracy": "13.7 dB SI-SNRi"}, "description": "This repository provides all the necessary tools to perform audio source separation with a SepFormer model, implemented with SpeechBrain, and pretrained on WHAMR! dataset, which is basically a version of WSJ0-Mix dataset with environmental noise and reverberation.", "model_name": "speechbrain/sepformer-whamr"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Transcription and Translation", "api_name": "openai/whisper-tiny", "api_call": "WhisperForConditionalGeneration.from_pretrained('openai/whisper-tiny')", "performance": {"dataset": "LibriSpeech (clean)", "accuracy": 7.54}, "description": "Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. It is a Transformer-based encoder-decoder model that can be used for transcription and translation tasks.", "model_name": "openai/whisper-tiny"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "facebook/tts_transformer-ru-cv7_css10", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/tts_transformer-ru-cv7_css10')", "performance": {"dataset": "common_voice", "accuracy": null}, "description": "Transformer text-to-speech model from fairseq S^2. Russian single-speaker male voice. Pre-trained on Common Voice v7, fine-tuned on CSS10.", "model_name": "facebook/tts_transformer-ru-cv7_css10"}
{"domain": "Reinforcement Learning", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "decision-transformer-gym-hopper-medium", "api_call": "AutoModel.from_pretrained('edbeeching/decision-transformer-gym-hopper-medium')", "performance": {"dataset": "Gym Hopper environment", "accuracy": "Not provided"}, "description": "Decision Transformer model trained on medium trajectories sampled from the Gym Hopper environment.", "model_name": "edbeeching/decision-transformer-gym-hopper-medium"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "luhua/chinese_pretrain_mrc_roberta_wwm_ext_large", "api_call": "pipeline('question-answering', model='luhua/chinese_pretrain_mrc_roberta_wwm_ext_large')", "performance": {"dataset": "Dureader-2021", "accuracy": "83.1"}, "description": "A Chinese MRC roberta_wwm_ext_large model trained on a large amount of Chinese MRC data. This model has significantly improved performance on reading comprehension and classification tasks. It has helped multiple users achieve top 5 results in the Dureader-2021 competition.", "model_name": "luhua/chinese_pretrain_mrc_roberta_wwm_ext_large"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "kakaobrain/align-base", "api_call": "AlignModel.from_pretrained('kakaobrain/align-base')", "performance": {"dataset": "COYO-700M", "accuracy": "on-par or outperforms Google ALIGN's reported metrics"}, "description": "The ALIGN model is a dual-encoder architecture with EfficientNet as its vision encoder and BERT as its text encoder. It learns to align visual and text representations with contrastive learning. This implementation is trained on the open source COYO dataset and can be used for zero-shot image classification and multi-modal embedding retrieval.", "model_name": "kakaobrain/align-base"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "bart-large-cnn-samsum-ChatGPT_v3", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('Qiliang/bart-large-cnn-samsum-ChatGPT_v3')", "performance": {"dataset": "unknown", "accuracy": "unknown"}, "description": "This model is a fine-tuned version of philschmid/bart-large-cnn-samsum on an unknown dataset.", "model_name": "Qiliang/bart-large-cnn-samsum-ChatGPT_v3"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "Apocalypse-19/shoe-generator", "api_call": "DDPMPipeline.from_pretrained('Apocalypse-19/shoe-generator')", "performance": {"dataset": "custom dataset", "accuracy": "128x128 resolution"}, "description": "This model is a diffusion model for unconditional image generation of shoes trained on a custom dataset at 128x128 resolution.", "model_name": "Apocalypse-19/shoe-generator"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Speaker Verification", "api_name": "speechbrain/spkrec-xvect-voxceleb", "api_call": "EncoderClassifier.from_hparams(source='speechbrain/spkrec-xvect-voxceleb', savedir='pretrained_models/spkrec-xvect-voxceleb')", "performance": {"dataset": "Voxceleb1-test set (Cleaned)", "accuracy": "EER(%) 3.2"}, "description": "This repository provides all the necessary tools to extract speaker embeddings with a pretrained TDNN model using SpeechBrain. The system is trained on Voxceleb1 + Voxceleb2 training data.", "model_name": "speechbrain/spkrec-xvect-voxceleb"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/all-distilroberta-v1", "api_call": "SentenceTransformer('sentence-transformers/all-distilroberta-v1')", "performance": {"dataset": [{"name": "s2orc", "accuracy": "Not provided"}, {"name": "MS Marco", "accuracy": "Not provided"}, {"name": "yahoo_answers_topics", "accuracy": "Not provided"}]}, "description": "This is a sentence-transformers model that maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/all-distilroberta-v1"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling and Next Sentence Prediction", "api_name": "bert-large-uncased", "api_call": "pipeline('fill-mask', model='bert-large-uncased')", "performance": {"dataset": {"SQUAD 1.1 F1/EM": "91.0/84.3", "Multi NLI Accuracy": "86.05"}}, "description": "BERT large model (uncased) is a transformer model pretrained on a large corpus of English data using a masked language modeling (MLM) objective. It has 24 layers, 1024 hidden dimensions, 16 attention heads, and 336M parameters. The model is intended to be fine-tuned on a downstream task, such as sequence classification, token classification, or question answering.", "model_name": "bert-large-uncased"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Diffusion-based text-to-image generation model", "api_name": "lllyasviel/control_v11p_sd15_normalbae", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_normalbae')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "ControlNet v1.1 is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on normalbae images. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5.", "model_name": "lllyasviel/control_v11p_sd15_normalbae"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "Pendulum-v1", "api_name": "ppo-Pendulum-v1", "api_call": "load_from_hub(repo_id='HumanCompatibleAI/ppo-Pendulum-v1',filename='{MODEL FILENAME}.zip',)", "performance": {"dataset": "Pendulum-v1", "accuracy": "-336.89 +/- 406.36"}, "description": "This is a trained model of a PPO agent playing Pendulum-v1 using the stable-baselines3 library and the RL Zoo. The RL Zoo is a training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.", "model_name": "HumanCompatibleAI/ppo-Pendulum-v1"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "mio/tokiwa_midori", "api_call": "./run.sh --skip_data_prep false --skip_train true --download_model mio/tokiwa_midori", "performance": {"dataset": "amadeus", "accuracy": "Not provided"}, "description": "This model was trained by mio using amadeus recipe in espnet.", "model_name": "mio/tokiwa_midori"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-en-ar", "api_call": "pipeline('translation_en_to_ar', model='Helsinki-NLP/opus-mt-en-ar')", "performance": {"dataset": "Tatoeba-test.eng.ara", "accuracy": {"BLEU": 14.0, "chr-F": 0.437}}, "description": "A Hugging Face Transformers model for English to Arabic translation, trained on the Tatoeba dataset. It uses a transformer architecture and requires a sentence initial language token in the form of '>>id<<' (id = valid target language ID).", "model_name": "Helsinki-NLP/opus-mt-en-ar"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Image Captioning", "api_name": "microsoft/git-base", "api_call": "pipeline('image-to-text', model='microsoft/git-base')", "performance": {"dataset": ["COCO", "Conceptual Captions (CC3M)", "SBU", "Visual Genome (VG)", "Conceptual Captions (CC12M)", "ALT200M"], "accuracy": "Refer to the paper for evaluation results"}, "description": "GIT (short for GenerativeImage2Text) model, base-sized version. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. The model is trained using 'teacher forcing' on many (image, text) pairs. The goal for the model is simply to predict the next text token, given the image tokens and previous text tokens. This allows the model to be used for tasks like image and video captioning, visual question answering (VQA) on images and videos, and even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).", "model_name": "microsoft/git-base"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "stabilityai/sd-vae-ft-mse", "api_call": "StableDiffusionPipeline.from_pretrained('CompVis/stable-diffusion-v1-4', vae=AutoencoderKL.from_pretrained('stabilityai/sd-vae-ft-mse'))", "performance": {"dataset": [{"name": "COCO 2017 (256x256, val, 5000 images)", "accuracy": {"rFID": "4.70", "PSNR": "24.5 +/- 3.7", "SSIM": "0.71 +/- 0.13", "PSIM": "0.92 +/- 0.27"}}, {"name": "LAION-Aesthetics 5+ (256x256, subset, 10000 images)", "accuracy": {"rFID": "1.88", "PSNR": "27.3 +/- 4.7", "SSIM": "0.83 +/- 0.11", "PSIM": "0.65 +/- 0.34"}}]}, "description": "This model is a fine-tuned VAE decoder for the Stable Diffusion Pipeline. It is designed to be used with the diffusers library and can be integrated into existing workflows by including a vae argument to the StableDiffusionPipeline. The model has been finetuned on a 1:1 ratio of LAION-Aesthetics and LAION-Humans datasets and has been evaluated on COCO 2017 and LAION-Aesthetics 5+ datasets.", "model_name": "CompVis/stable-diffusion-v1-4"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "darkstorm2150/Protogen_v2.2_Official_Release", "api_call": "StableDiffusionPipeline.from_pretrained('darkstorm2150/Protogen_v2.2_Official_Release')", "performance": {"dataset": "Various datasets", "accuracy": "Not specified"}, "description": "Protogen v2.2 is a text-to-image model that generates high-quality images based on text prompts. It was warm-started with Stable Diffusion v1-5 and fine-tuned on a large amount of data from large datasets new and trending on civitai.com. The model can be used with the Stable Diffusion Pipeline and supports trigger words like 'modelshoot style' to enforce camera capture.", "model_name": "darkstorm2150/Protogen_v2.2_Official_Release"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/all-MiniLM-L12-v2", "api_call": "SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')", "performance": {"dataset": "1,170,060,424 training pairs", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/all-MiniLM-L12-v2"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "facebook/timesformer-hr-finetuned-ssv2", "api_call": "TimesformerForVideoClassification.from_pretrained('facebook/timesformer-hr-finetuned-ssv2')", "performance": {"dataset": "Something Something v2", "accuracy": "Not provided"}, "description": "TimeSformer model pre-trained on Something Something v2. It was introduced in the paper TimeSformer: Is Space-Time Attention All You Need for Video Understanding? by Bertasius et al. and first released in this repository.", "model_name": "facebook/timesformer-hr-finetuned-ssv2"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "functionality": "Masked Language Modeling", "api_name": "xlm-roberta-large", "api_call": "pipeline('fill-mask', model='xlm-roberta-large')", "performance": {"dataset": "CommonCrawl", "accuracy": "N/A"}, "description": "XLM-RoBERTa is a multilingual version of RoBERTa pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It is designed for masked language modeling and can be fine-tuned on downstream tasks such as sequence classification, token classification, or question answering.", "model_name": "xlm-roberta-large"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/trocr-small-handwritten", "api_call": "VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-small-handwritten')", "performance": {"dataset": "IAM", "accuracy": "Not provided"}, "description": "TrOCR model fine-tuned on the IAM dataset. It was introduced in the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al. and first released in this repository.", "model_name": "microsoft/trocr-small-handwritten"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "functionality": "Fill-Mask", "api_name": "neuralmind/bert-base-portuguese-cased", "api_call": "AutoModelForPreTraining.from_pretrained('neuralmind/bert-base-portuguese-cased')", "performance": {"dataset": "brWaC", "accuracy": "state-of-the-art"}, "description": "BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performances on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment. It is available in two sizes: Base and Large.", "model_name": "neuralmind/bert-base-portuguese-cased"}
{"domain": "Reinforcement Learning", "framework": "ML-Agents", "functionality": "SoccerTwos", "api_name": "Raiden-1001/poca-Soccerv7", "api_call": "mlagents-load-from-hf --repo-id='Raiden-1001/poca-Soccerv7.1' --local-dir='./downloads'", "performance": {"dataset": "SoccerTwos", "accuracy": "Not provided"}, "description": "This is a trained model of a poca agent playing SoccerTwos using the Unity ML-Agents Library.", "model_name": "Raiden-1001/poca-Soccerv7.1"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "financial-sentiment-analysis", "api_name": "yiyanghkust/finbert-tone", "api_call": "BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)", "performance": {"dataset": "10,000 manually annotated sentences from analyst reports", "accuracy": "superior performance on financial tone analysis task"}, "description": "FinBERT is a BERT model pre-trained on financial communication text. It is trained on the following three financial communication corpora: Corporate Reports 10-K & 10-Q, Earnings Call Transcripts, and Analyst Reports. This released finbert-tone model is the FinBERT model fine-tuned on 10,000 manually annotated (positive, negative, neutral) sentences from analyst reports. This model achieves superior performance on the financial tone analysis task.", "model_name": "yiyanghkust/finbert-tone"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "nguyenvulebinh/wav2vec2-base-vietnamese-250h", "api_call": "Wav2Vec2ForCTC.from_pretrained('nguyenvulebinh/wav2vec2-base-vietnamese-250h')", "performance": {"dataset": [{"name": "VIVOS", "accuracy": 6.15}, {"name": "Common Voice vi", "accuracy": 11.52}]}, "description": "Vietnamese end-to-end speech recognition using wav2vec 2.0. Pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset, using 16kHz sampled speech audio.", "model_name": "nguyenvulebinh/wav2vec2-base-vietnamese-250h"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Transformers", "functionality": "Multilingual Sequence-to-Sequence", "api_name": "facebook/mbart-large-50", "api_call": "MBartForConditionalGeneration.from_pretrained('facebook/mbart-large-50')", "performance": {"dataset": "Multilingual Denoising Pretraining", "accuracy": "Not specified"}, "description": "mBART-50 is a multilingual Sequence-to-Sequence model pre-trained using the 'Multilingual Denoising Pretraining' objective. It was introduced in Multilingual Translation with Extensible Multilingual Pretraining and Finetuning paper.", "model_name": "facebook/mbart-large-50"}
{"domain": "Natural Language Processing Text Generation", "framework": "Transformers", "functionality": "Text Generation", "api_name": "sshleifer/tiny-gpt2", "api_call": "GPT2LMHeadModel.from_pretrained('sshleifer/tiny-gpt2')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "A tiny GPT-2 model for text generation, suitable for low-resource environments and faster inference. This model is part of the Hugging Face Transformers library and can be used for generating text given a prompt.", "model_name": "sshleifer/tiny-gpt2"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Generative Commonsense Reasoning", "api_name": "mrm8488/t5-base-finetuned-common_gen", "api_call": "AutoModelWithLMHead.from_pretrained('mrm8488/t5-base-finetuned-common_gen')", "performance": {"dataset": "common_gen", "accuracy": {"ROUGE-2": 17.1, "ROUGE-L": 39.47}}, "description": "Google's T5 fine-tuned on CommonGen for Generative Commonsense Reasoning. CommonGen is a constrained text generation task, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.", "model_name": "mrm8488/t5-base-finetuned-common_gen"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens", "api_call": "SentenceTransformer('sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model that maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "CQI_Visual_Question_Awnser_PT_v0", "api_call": "pipeline('question-answering', model=LayoutLMForQuestionAnswering.from_pretrained('microsoft/layoutlm-base-uncased'))", "performance": {"dataset": [{"accuracy": 0.9943976999999999}, {"accuracy": 0.9912158999999999}, {"accuracy": 0.59147286}]}, "description": "A model for visual question answering in Portuguese and English, capable of processing PDFs and images to extract information and answer questions.", "model_name": "microsoft/layoutlm-base-uncased"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "sentiment_analysis_generic_dataset", "api_call": "pipeline('text-classification', model='Seethal/sentiment_analysis_generic_dataset')", "performance": {"dataset": "generic_dataset", "accuracy": "Not specified"}, "description": "This is a fine-tuned downstream version of the bert-base-uncased model for sentiment analysis. This model is not intended for further downstream fine-tuning for any other tasks. This model is trained on a classified dataset for text classification.", "model_name": "Seethal/sentiment_analysis_generic_dataset"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "google/pegasus-pubmed", "api_call": "AutoModel.from_pretrained('google/pegasus-pubmed')", "performance": {"dataset": [{"name": "xsum", "accuracy": "47.60/24.83/39.64"}, {"name": "cnn_dailymail", "accuracy": "44.16/21.56/41.30"}, {"name": "newsroom", "accuracy": "45.98/34.20/42.18"}, {"name": "multi_news", "accuracy": "47.65/18.75/24.95"}, {"name": "gigaword", "accuracy": "39.65/20.47/36.76"}, {"name": "wikihow", "accuracy": "46.39/22.12/38.41"}, {"name": "reddit_tifu", "accuracy": "27.99/9.81/22.94"}, {"name": "big_patent", "accuracy": "52.29/33.08/41.66"}, {"name": "arxiv", "accuracy": "44.21/16.95/25.67"}, {"name": "pubmed", "accuracy": "45.97/20.15/28.25"}, {"name": "aeslc", "accuracy": "37.68/21.25/36.51"}, {"name": "billsum", "accuracy": "59.67/41.58/47.59"}]}, "description": "The PEGASUS model is designed for abstractive summarization. It is pretrained on a mixture of C4 and HugeNews datasets and stochastically samples important sentences. The model uses a gap sentence ratio between 15% and 45% and a sentencepiece tokenizer that encodes newline characters.", "model_name": "google/pegasus-pubmed"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "fcakyon/timesformer-hr-finetuned-k400", "api_call": "TimesformerForVideoClassification.from_pretrained('fcakyon/timesformer-hr-finetuned-k400')", "performance": {"dataset": "Kinetics-400", "accuracy": "Not provided"}, "description": "TimeSformer model pre-trained on Kinetics-400 for video classification into one of the 400 possible Kinetics-400 labels. Introduced in the paper 'TimeSformer: Is Space-Time Attention All You Need for Video Understanding?' by Bertasius et al.", "model_name": "fcakyon/timesformer-hr-finetuned-k400"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face", "functionality": "Question Answering", "api_name": "impira/layoutlm-document-qa", "api_call": "pipeline('question-answering', model=LayoutLMForQuestionAnswering.from_pretrained('impira/layoutlm-document-qa', return_dict=True))", "performance": {"dataset": "SQuAD2.0 and DocVQA", "accuracy": "Not provided"}, "description": "A fine-tuned version of the multi-modal LayoutLM model for the task of question answering on documents.", "model_name": "impira/layoutlm-document-qa"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Conversational", "api_name": "pygmalion-350m", "api_call": "pipeline('conversational', model='PygmalionAI/pygmalion-350m')", "performance": {"dataset": "The Pile", "accuracy": "N/A"}, "description": "This is a proof-of-concept fine-tune of Facebook's OPT-350M model optimized for dialogue, to be used as a stepping stone to higher parameter models. Disclaimer: NSFW data was included in the fine-tuning of this model. Although SFW inputs will usually result in SFW outputs, you are advised to chat at your own risk. This model is not suitable for use by minors.", "model_name": "PygmalionAI/pygmalion-350m"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/git-base-vqav2", "api_call": "pipeline('visual-question-answering', model='microsoft/git-base-vqav2')", "performance": {"dataset": "VQAv2", "accuracy": "Refer to the paper for evaluation results"}, "description": "GIT (short for GenerativeImage2Text) model, base-sized version, fine-tuned on VQAv2. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository.", "model_name": "microsoft/git-base-vqav2"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "vicgalle/xlm-roberta-large-xnli-anli", "api_call": "XLMRobertaForSequenceClassification.from_pretrained('vicgalle/xlm-roberta-large-xnli-anli')", "performance": {"dataset": [{"name": "XNLI-es", "accuracy": "93.7%"}, {"name": "XNLI-fr", "accuracy": "93.2%"}, {"name": "ANLI-R1", "accuracy": "68.5%"}, {"name": "ANLI-R2", "accuracy": "53.6%"}, {"name": "ANLI-R3", "accuracy": "49.0%"}]}, "description": "XLM-RoBERTa-large model fine-tuned on several NLI datasets, ready to use for zero-shot classification.", "model_name": "vicgalle/xlm-roberta-large-xnli-anli"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "facebook/regnet-y-008", "api_call": "RegNetForImageClassification.from_pretrained('zuppif/regnet-y-040')", "performance": {"dataset": "imagenet-1k", "accuracy": "Not provided"}, "description": "RegNet model trained on imagenet-1k. It was introduced in the paper Designing Network Design Spaces and first released in this repository.", "model_name": "zuppif/regnet-y-040"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli", "api_call": "AutoModelForSequenceClassification.from_pretrained('MoritzLaurer/mDeBERTa-v3-base-mnli-xnli')", "performance": {"dataset": {"average": 0.808, "ar": 0.802, "bg": 0.829, "de": 0.825, "el": 0.826, "en": 0.883, "es": 0.845, "fr": 0.834, "hi": 0.771, "ru": 0.813, "sw": 0.748, "th": 0.793, "tr": 0.807, "ur": 0.74, "vi": 0.795, "zh": 0.8116}, "accuracy": "0.808"}, "description": "This multilingual model can perform natural language inference (NLI) on 100 languages and is therefore also suitable for multilingual zero-shot classification. The underlying model was pre-trained by Microsoft on the CC100 multilingual dataset. It was then fine-tuned on the XNLI dataset, which contains hypothesis-premise pairs from 15 languages, as well as the English MNLI dataset.", "model_name": "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "philschmid/distilbert-onnx", "api_call": "pipeline('question-answering', model='philschmid/distilbert-onnx')", "performance": {"dataset": "squad", "accuracy": "F1 score: 87.1"}, "description": "This model is a fine-tuned checkpoint of DistilBERT-base-cased, fine-tuned using (a second step of) knowledge distillation on SQuAD v1.1.", "model_name": "philschmid/distilbert-onnx"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Asteroid", "api_name": "Awais/Audio_Source_Separation", "api_call": "pipeline('audio-source-separation', model='Awais/Audio_Source_Separation')", "performance": {"dataset": "Libri2Mix", "accuracy": {"si_sdr": 14.7645436345, "si_sdr_imp": 14.7640293756, "sdr": 15.2933797075, "sdr_imp": 15.1141466051, "sir": 24.0929046611, "sir_imp": 23.9136696831, "sar": 16.0605590692, "sar_imp": -51.9807844413, "stoi": 0.9311142441, "stoi_imp": 0.2181737614}}, "description": "This model was trained by Joris Cosentino using the librimix recipe in Asteroid. It was trained on the sep_clean task of the Libri2Mix dataset.", "model_name": "Awais/Audio_Source_Separation"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "text2text-generation", "api_name": "google/bigbird-pegasus-large-arxiv", "api_call": "BigBirdPegasusForConditionalGeneration.from_pretrained('google/bigbird-pegasus-large-arxiv')", "performance": {"dataset": "scientific_papers", "accuracy": {"ROUGE-1": 36.028, "ROUGE-2": 13.417, "ROUGE-L": 21.961, "ROUGE-LSUM": 29.648}}, "description": "BigBird is a sparse-attention-based transformer that extends Transformer-based models, such as BERT, to much longer sequences. Moreover, BigBird comes along with a theoretical understanding of the capabilities of a complete transformer that the sparse model can handle. BigBird was introduced in this paper and first released in this repository. BigBird relies on block sparse attention instead of normal attention (i.e. BERT's attention) and can handle sequences up to a length of 4096 at a much lower compute cost compared to BERT. It has achieved SOTA on various tasks involving very long sequences, such as long-document summarization and question answering with long contexts.", "model_name": "google/bigbird-pegasus-large-arxiv"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Diffusion-based text-to-image generation", "api_name": "lllyasviel/control_v11p_sd15_softedge", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_softedge')", "performance": {"dataset": "ControlNet", "accuracy": "Not provided"}, "description": "ControlNet v1.1 is a diffusion-based text-to-image generation model that controls pretrained large diffusion models to support additional input conditions. This checkpoint corresponds to the ControlNet conditioned on soft edges. It can be used in combination with Stable Diffusion, such as runwayml/stable-diffusion-v1-5.", "model_name": "lllyasviel/control_v11p_sd15_softedge"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "hustvl/yolos-small", "api_call": "YolosForObjectDetection.from_pretrained('hustvl/yolos-small')", "performance": {"dataset": "COCO 2017 validation", "accuracy": "36.1 AP"}, "description": "YOLOS model fine-tuned on COCO 2017 object detection (118k annotated images). It was introduced in the paper You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection by Fang et al. and first released in this repository. YOLOS is a Vision Transformer (ViT) trained using the DETR loss. Despite its simplicity, a base-sized YOLOS model is able to achieve 42 AP on COCO validation 2017 (similar to DETR and more complex frameworks such as Faster R-CNN).", "model_name": "hustvl/yolos-small"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "hustvl/yolos-tiny", "api_call": "YolosForObjectDetection.from_pretrained('hustvl/yolos-tiny')", "performance": {"dataset": "COCO 2017 validation", "accuracy": "28.7 AP"}, "description": "YOLOS is a Vision Transformer (ViT) trained using the DETR loss. Despite its simplicity, a base-sized YOLOS model is able to achieve 42 AP on COCO validation 2017 (similar to DETR and more complex frameworks such as Faster R-CNN). The model is trained using a bipartite matching loss: one compares the predicted classes + bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N (so if an image only contains 4 objects, 96 annotations will just have 'no object' as class and 'no bounding box' as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between each of the N queries and each of the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU loss (for the bounding boxes) are used to optimize the parameters of the model.", "model_name": "hustvl/yolos-tiny"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Automatic Speech Recognition", "api_name": "facebook/hubert-large-ls960-ft", "api_call": "HubertForCTC.from_pretrained('facebook/hubert-large-ls960-ft')", "performance": {"dataset": "LibriSpeech (clean)", "accuracy": "1.900 WER"}, "description": "Facebook's Hubert-Large-Finetuned is an Automatic Speech Recognition model fine-tuned on 960h of Librispeech on 16kHz sampled speech audio. It is based on the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. The model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech and Libri-light benchmarks with various fine-tuning subsets.", "model_name": "facebook/hubert-large-ls960-ft"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Diffusion Models", "api_name": "lllyasviel/control_v11p_sd15s2_lineart_anime", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15s2_lineart_anime')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on lineart_anime images.", "model_name": "lllyasviel/control_v11p_sd15s2_lineart_anime"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Abstractive Russian Summarization", "api_name": "cointegrated/rut5-base-absum", "api_call": "T5ForConditionalGeneration.from_pretrained('cointegrated/rut5-base-absum')", "performance": {"dataset": ["csebuetnlp/xlsum", "IlyaGusev/gazeta", "mlsum"], "accuracy": "Not provided"}, "description": "This is a model for abstractive Russian summarization, based on cointegrated/rut5-base-multitask and fine-tuned on 4 datasets.", "model_name": "cointegrated/rut5-base-absum"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "sd-class-pandas-32", "api_call": "DDPMPipeline.from_pretrained('schdoel/sd-class-AFHQ-32')", "performance": {"dataset": "AFHQ", "accuracy": "Not provided"}, "description": "This model is a diffusion model for unconditional image generation of cute \ud83e\udd8b.", "model_name": "schdoel/sd-class-AFHQ-32"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "keremberke/yolov8m-nlf-head-detection", "api_call": "YOLO('keremberke/yolov8m-nlf-head-detection')", "performance": {"dataset": "nfl-object-detection", "accuracy": 0.28700000000000003}, "description": "A YOLOv8 model trained for head detection in American football. The model is capable of detecting helmets, blurred helmets, difficult helmets, partial helmets, and sideline helmets.", "model_name": "keremberke/yolov8m-nlf-head-detection"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-Video Generation", "api_name": "redshift-man-skiing", "api_call": "TuneAVideoPipeline.from_pretrained('nitrosocke/redshift-diffusion', unet=UNet3DConditionModel.from_pretrained('Tune-A-Video-library/redshift-man-skiing', subfolder='unet', torch_dtype=torch.float16), torch_dtype=torch.float16)", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "Tune-A-Video - Redshift is a text-to-video generation model based on the nitrosocke/redshift-diffusion model. It generates videos based on textual prompts, such as 'a man is skiing' or '(redshift style) spider man is skiing'.", "model_name": "nitrosocke/redshift-diffusion"}
{"domain": "Natural Language Processing Translation", "framework": "Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-es-en", "api_call": "pipeline('translation_es_to_en', model='Helsinki-NLP/opus-mt-es-en')", "performance": {"dataset": [{"name": "newssyscomb2009-spaeng.spa.eng", "accuracy": {"BLEU": 30.6, "chr-F": 0.57}}, {"name": "news-test2008-spaeng.spa.eng", "accuracy": {"BLEU": 27.9, "chr-F": 0.553}}, {"name": "newstest2009-spaeng.spa.eng", "accuracy": {"BLEU": 30.4, "chr-F": 0.572}}, {"name": "newstest2010-spaeng.spa.eng", "accuracy": {"BLEU": 36.1, "chr-F": 0.614}}, {"name": "newstest2011-spaeng.spa.eng", "accuracy": {"BLEU": 34.2, "chr-F": 0.599}}, {"name": "newstest2012-spaeng.spa.eng", "accuracy": {"BLEU": 37.9, "chr-F": 0.624}}, {"name": "newstest2013-spaeng.spa.eng", "accuracy": {"BLEU": 35.3, "chr-F": 0.609}}, {"name": "Tatoeba-test.spa.eng", "accuracy": {"BLEU": 59.6, "chr-F": 0.739}}]}, "description": "Helsinki-NLP/opus-mt-es-en is a machine translation model trained to translate from Spanish to English using the Hugging Face Transformers library. The model is based on the Marian framework and was trained on the OPUS dataset.", "model_name": "Helsinki-NLP/opus-mt-es-en"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "emilyalsentzer/Bio_ClinicalBERT", "api_call": "AutoModel.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')", "performance": {"dataset": "MIMIC III", "accuracy": "Not provided"}, "description": "Bio_ClinicalBERT is a model initialized with BioBERT and trained on all MIMIC notes. It can be used for various NLP tasks in the clinical domain, such as Named Entity Recognition (NER) and Natural Language Inference (NLI).", "model_name": "emilyalsentzer/Bio_ClinicalBERT"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Asteroid", "api_name": "DCCRNet_Libri1Mix_enhsingle_16k", "api_call": "AutoModelForAudioToAudio.from_pretrained('JorisCos/DCCRNet_Libri1Mix_enhsingle_16k')", "performance": {"dataset": "Libri1Mix", "accuracy": {"si_sdr": 13.3297673983, "si_sdr_imp": 9.8799860925, "sdr": 13.87279933, "sdr_imp": 10.3701365308, "sir": "Infinity", "sir_imp": "NaN", "sar": 13.87279933, "sar_imp": 10.3701365308, "stoi": 0.9140907016, "stoi_imp": 0.118170878}}, "description": "This model was trained by Joris Cosentino using the librimix recipe in Asteroid. It was trained on the enh_single task of the Libri1Mix dataset.", "model_name": "JorisCos/DCCRNet_Libri1Mix_enhsingle_16k"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "videomae-base-finetuned-RealLifeViolenceSituations-subset", "api_call": "AutoModelForVideoClassification.from_pretrained('dangle124/videomae-base-finetuned-RealLifeViolenceSituations-subset')", "performance": {"dataset": "unknown", "accuracy": 0.9533}, "description": "This model is a fine-tuned version of MCG-NJU/videomae-base on an unknown dataset. It is trained for a video classification task, specifically RealLifeViolenceSituations.", "model_name": "dangle124/videomae-base-finetuned-RealLifeViolenceSituations-subset"}
{"domain": "Natural Language Processing Summarization", "framework": "Transformers", "functionality": "Text Summarization", "api_name": "sshleifer/distilbart-cnn-6-6", "api_call": "BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-cnn-6-6')", "performance": {"dataset": {"cnn_dailymail": {"Rouge 2": 20.17, "Rouge-L": 29.7}, "xsum": {"Rouge 2": 20.92, "Rouge-L": 35.73}}}, "description": "DistilBART model for text summarization, trained on the CNN/Daily Mail and XSum datasets. It is a smaller and faster version of BART, suitable for summarizing English text.", "model_name": "sshleifer/distilbart-cnn-6-6"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "cerebras/Cerebras-GPT-111M", "api_call": "AutoModelForCausalLM.from_pretrained('cerebras/Cerebras-GPT-111M')", "performance": {"dataset": "The Pile", "accuracy": {"PILE_test_xent": 2.566, "Hella-Swag": 0.268, "PIQA": 0.594, "Wino-Grande": 0.488, "Lambada": 0.194, "ARC-e": 0.38, "ARC-c": 0.166, "OpenBookQA": 0.118, "Downstream_Average": 0.315}}, "description": "Cerebras-GPT-111M is a transformer-based language model with 111M parameters, trained on the Pile dataset using the GPT-3 style architecture. It is intended for use in research and as a foundation model for NLP applications, ethics, and alignment research. The model can be fine-tuned for various tasks and is licensed under Apache 2.0.", "model_name": "cerebras/Cerebras-GPT-111M"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "glpn-kitti-finetuned-diode", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-kitti-finetuned-diode')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.5845, "Rmse": 0.6175}}, "description": "This model is a fine-tuned version of vinvino02/glpn-kitti on the diode-subset dataset.", "model_name": "sayakpaul/glpn-kitti-finetuned-diode"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Transformers", "functionality": "Zero-Shot Classification", "api_name": "valhalla/distilbart-mnli-12-6", "api_call": "pipeline('zero-shot-classification', model='valhalla/distilbart-mnli-12-6')", "performance": {"dataset": "MNLI", "accuracy": {"matched_acc": "89.19", "mismatched_acc": "89.01"}}, "description": "distilbart-mnli is the distilled version of bart-large-mnli created using the No Teacher Distillation technique proposed for BART summarisation by Huggingface. It is designed for zero-shot classification tasks.", "model_name": "valhalla/distilbart-mnli-12-6"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K", "api_call": "pipeline('zero-shot-image-classification', model='laion/CLIP-ViT-B-32-laion2B-s34B-b79K')", "performance": {"dataset": "ImageNet-1k", "accuracy": 66.6}, "description": "A CLIP ViT-B/32 model trained with the LAION-2B English subset of LAION-5B using OpenCLIP. It enables researchers to better understand and explore zero-shot, arbitrary image classification. The model can be used for zero-shot image classification, image and text retrieval, among others.", "model_name": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Speech Emotion Recognition", "api_name": "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim", "api_call": "EmotionModel.from_pretrained('audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim')", "performance": {"dataset": "msp-podcast", "accuracy": "Not provided"}, "description": "Model for Dimensional Speech Emotion Recognition based on Wav2vec 2.0. The model expects a raw audio signal as input and outputs predictions for arousal, dominance and valence in a range of approximately 0...1. In addition, it also provides the pooled states of the last transformer layer. The model was created by fine-tuning Wav2Vec2-Large-Robust on MSP-Podcast (v1.7). The model was pruned from 24 to 12 transformer layers before fine-tuning. An ONNX export of the model is available from doi:10.5281/zenodo.6221127. Further details are given in the associated paper and tutorial.", "model_name": "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Named Entity Recognition", "api_name": "Babelscape/wikineural-multilingual-ner", "api_call": "AutoModelForTokenClassification.from_pretrained('Babelscape/wikineural-multilingual-ner')", "performance": {"dataset": "Babelscape/wikineural-multilingual-ner", "accuracy": "span-based F1-score up to 6 points over previous state-of-the-art systems for data creation"}, "description": "A multilingual Named Entity Recognition (NER) model fine-tuned on the WikiNEuRal dataset, supporting 9 languages (de, en, es, fr, it, nl, pl, pt, ru). It is based on the mBERT architecture and trained on all 9 languages jointly. The model can be used with the Hugging Face Transformers pipeline for NER tasks.", "model_name": "Babelscape/wikineural-multilingual-ner"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "facebook/timesformer-base-finetuned-k400", "api_call": "TimesformerForVideoClassification.from_pretrained('facebook/timesformer-base-finetuned-k400')", "performance": {"dataset": "Kinetics-400", "accuracy": "Not provided"}, "description": "TimeSformer is a video classification model pre-trained on Kinetics-400. It was introduced in the paper Is Space-Time Attention All You Need for Video Understanding? by Bertasius et al. and first released in this repository. The model can be used for video classification into one of the 400 possible Kinetics-400 labels.", "model_name": "facebook/timesformer-base-finetuned-k400"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "microsoft/codebert-base", "api_call": "AutoModel.from_pretrained('microsoft/codebert-base')", "performance": {"dataset": "CodeSearchNet", "accuracy": "n/a"}, "description": "Pretrained weights for CodeBERT: A Pre-Trained Model for Programming and Natural Languages. The model is trained on bi-modal data (documents & code) of CodeSearchNet. This model is initialized with Roberta-base and trained with MLM+RTD objective.", "model_name": "microsoft/codebert-base"}
{"domain": "Audio Text-to-Speech", "framework": "Hugging Face Transformers", "functionality": "Text-to-Speech", "api_name": "microsoft/speecht5_tts", "api_call": "SpeechT5ForTextToSpeech.from_pretrained('microsoft/speecht5_tts')", "performance": {"dataset": "LibriTTS", "accuracy": "Not specified"}, "description": "SpeechT5 model fine-tuned for speech synthesis (text-to-speech) on LibriTTS. It is a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. It can be used for a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.", "model_name": "microsoft/speecht5_tts"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Sentiment Analysis", "api_name": "siebert/sentiment-roberta-large-english", "api_call": "pipeline('sentiment-analysis', model='siebert/sentiment-roberta-large-english')", "performance": {"dataset": [{"name": "McAuley and Leskovec (2013) (Reviews)", "accuracy": 98.0}, {"name": "McAuley and Leskovec (2013) (Review Titles)", "accuracy": 87.0}, {"name": "Yelp Academic Dataset", "accuracy": 96.5}, {"name": "Maas et al. (2011)", "accuracy": 96.0}, {"name": "Kaggle", "accuracy": 96.0}, {"name": "Pang and Lee (2005)", "accuracy": 91.0}, {"name": "Nakov et al. (2013)", "accuracy": 88.5}, {"name": "Shamma (2009)", "accuracy": 87.0}, {"name": "Blitzer et al. (2007) (Books)", "accuracy": 92.5}, {"name": "Blitzer et al. (2007) (DVDs)", "accuracy": 92.5}, {"name": "Blitzer et al. (2007) (Electronics)", "accuracy": 95.0}, {"name": "Blitzer et al. (2007) (Kitchen devices)", "accuracy": 98.5}, {"name": "Pang et al. (2002)", "accuracy": 95.5}, {"name": "Speriosu et al. (2011)", "accuracy": 85.5}, {"name": "Hartmann et al. (2019)", "accuracy": 98.0}], "average_accuracy": 93.2}, "description": "This model ('SiEBERT', prefix for 'Sentiment in English') is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment. The model was fine-tuned and evaluated on 15 data sets from diverse text sources to enhance generalization across different types of texts (reviews, tweets, etc.). Consequently, it outperforms models trained on only one type of text (e.g., movie reviews from the popular SST-2 benchmark) when used on new data.", "model_name": "siebert/sentiment-roberta-large-english"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Speech Enhancement", "api_name": "speechbrain/mtl-mimic-voicebank", "api_call": "WaveformEnhancement.from_hparams('speechbrain/mtl-mimic-voicebank', 'pretrained_models/mtl-mimic-voicebank')", "performance": {"dataset": "Voicebank", "accuracy": {"Test PESQ": 3.05, "Test COVL": 3.74, "Valid WER": 2.89, "Test WER": 2.8}}, "description": "This repository provides all the necessary tools to perform enhancement and robust ASR training (EN) within SpeechBrain. For a better experience we encourage you to learn more about SpeechBrain. Model performance for release 22-06-21: Test PESQ 3.05, Test COVL 3.74, Valid WER 2.89, Test WER 2.80. Works with SpeechBrain v0.5.12.", "model_name": "speechbrain/mtl-mimic-voicebank"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "Zixtrauce/JohnBot", "api_call": "AutoModelForCausalLM.from_pretrained('Zixtrauce/JohnBot')", "performance": {"dataset": "", "accuracy": ""}, "description": "JohnBot is a conversational model based on the gpt2 architecture and trained using the Hugging Face Transformers library. It can be used for generating text responses in a chat-based interface.", "model_name": "Zixtrauce/JohnBot"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/wav2vec2-xlsr-53-espeak-cv-ft", "api_call": "Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-xlsr-53-espeak-cv-ft')", "performance": {"dataset": "common_voice", "accuracy": "Not specified"}, "description": "Wav2Vec2-Large-XLSR-53 finetuned on multi-lingual Common Voice for phonetic label recognition in multiple languages. The model outputs a string of phonetic labels, and a dictionary mapping phonetic labels to words has to be used to map the phonetic output labels to output words.", "model_name": "facebook/wav2vec2-xlsr-53-espeak-cv-ft"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "google/pix2struct-textcaps-base", "api_call": "Pix2StructForConditionalGeneration.from_pretrained('google/pix2struct-textcaps-base')", "performance": {"dataset": "TextCaps", "accuracy": "state-of-the-art"}, "description": "Pix2Struct is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. It is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks.", "model_name": "google/pix2struct-textcaps-base"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/speecht5_vc", "api_call": "SpeechT5ForSpeechToSpeech.from_pretrained('microsoft/speecht5_vc')", "performance": {"dataset": "CMU ARCTIC", "accuracy": "Not specified"}, "description": "SpeechT5 model fine-tuned for voice conversion (speech-to-speech) on CMU ARCTIC. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. It is designed to improve the modeling capability for both speech and text. This model can be used for speech conversion tasks.", "model_name": "microsoft/speecht5_vc"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Sentiment Analysis", "api_name": "cardiffnlp/twitter-roberta-base-sentiment", "api_call": "AutoModelForSequenceClassification.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')", "performance": {"dataset": "tweet_eval", "accuracy": "Not provided"}, "description": "Twitter-roBERTa-base for Sentiment Analysis. This is a roBERTa-base model trained on ~58M tweets and finetuned for sentiment analysis with the TweetEval benchmark. This model is suitable for English.", "model_name": "cardiffnlp/twitter-roberta-base-sentiment"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "mio/Artoria", "api_call": "pipeline('text-to-speech', model='mio/Artoria')", "performance": {"dataset": "fate", "accuracy": "Not provided"}, "description": "This model was trained by mio using fate recipe in espnet. It is a text-to-speech model that can convert text input into speech output.", "model_name": "mio/Artoria"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Image-to-Image", "api_name": "GreeneryScenery/SheepsControlV5", "api_call": "pipeline('image-to-image', model='GreeneryScenery/SheepsControlV5')", "performance": {"dataset": "poloclub/diffusiondb", "accuracy": "Not provided"}, "description": "SheepsControlV5 is an image-to-image model trained on the poloclub/diffusiondb dataset. It is designed for transforming input images into a different style or representation.", "model_name": "GreeneryScenery/SheepsControlV5"}
{"domain": "Audio Voice Activity Detection", "framework": "pyannote.audio", "functionality": "Speaker Diarization", "api_name": "philschmid/pyannote-speaker-diarization-endpoint", "api_call": "Pipeline.from_pretrained('philschmid/pyannote-speaker-diarization-endpoint')", "performance": {"dataset": [{"name": "AISHELL-4", "accuracy": {"DER%": 14.61, "FA%": 3.31, "Miss%": 4.35, "Conf%": 6.95}}, {"name": "AMI Mix-Headset only_words", "accuracy": {"DER%": 18.21, "FA%": 3.28, "Miss%": 11.07, "Conf%": 3.87}}, {"name": "AMI Array1-01 only_words", "accuracy": {"DER%": 29.0, "FA%": 2.71, "Miss%": 21.61, "Conf%": 4.68}}, {"name": "CALLHOME Part2", "accuracy": {"DER%": 30.24, "FA%": 3.71, "Miss%": 16.86, "Conf%": 9.66}}, {"name": "DIHARD 3 Full", "accuracy": {"DER%": 20.99, "FA%": 4.25, "Miss%": 10.74, "Conf%": 6.0}}, {"name": "REPERE Phase 2", "accuracy": {"DER%": 12.62, "FA%": 1.55, "Miss%": 3.3, "Conf%": 7.76}}, {"name": "VoxConverse v0.0.2", "accuracy": {"DER%": 12.76, "FA%": 3.45, "Miss%": 3.85, "Conf%": 5.46}}]}, "description": "A speaker diarization pipeline that uses pyannote.audio to perform voice activity detection, speaker change detection, and overlapped speech detection. It can handle fully automatic processing with no manual intervention and can be fine-tuned with various hyperparameters.", "model_name": "philschmid/pyannote-speaker-diarization-endpoint"}
{"domain": "Audio Audio-to-Audio", "framework": "SpeechBrain", "functionality": "Audio Source Separation", "api_name": "speechbrain/sepformer-wham", "api_call": "separator.from_hparams(source='speechbrain/sepformer-wham', savedir='pretrained_models/sepformer-wham')", "performance": {"dataset": "WHAM!", "accuracy": "16.3 dB SI-SNRi"}, "description": "This repository provides all the necessary tools to perform audio source separation with a SepFormer model, implemented with SpeechBrain, and pretrained on WHAM! dataset, which is basically a version of WSJ0-Mix dataset with environmental noise.", "model_name": "speechbrain/sepformer-wham"}
{"domain": "Audio Automatic Speech Recognition", "framework": "pyannote.audio", "functionality": "overlapped-speech-detection", "api_name": "pyannote/overlapped-speech-detection", "api_call": "Pipeline.from_pretrained('pyannote/overlapped-speech-detection', use_auth_token='ACCESS_TOKEN_GOES_HERE')", "performance": {"dataset": "ami", "accuracy": null}, "description": "Automatic overlapped speech detection using pyannote.audio framework. The model detects when two or more speakers are active in an audio file.", "model_name": "pyannote/overlapped-speech-detection"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Generation", "api_name": "CompVis/stable-diffusion-v1-4", "api_call": "StableDiffusionPipeline.from_pretrained('CompVis/stable-diffusion-v1-4')", "performance": {"dataset": "COCO2017 validation set", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The Stable-Diffusion-v1-4 checkpoint was fine-tuned on 225k steps at resolution 512x512 on laion-aesthetics v2 5+ and 10% dropping of the text-conditioning to improve classifier-free guidance sampling. This model is intended for research purposes and can be used for generating artworks, design, educational or creative tools, and research on generative models.", "model_name": "CompVis/stable-diffusion-v1-4"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "codet5-large-ntp-py", "api_call": "T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-large-ntp-py')", "performance": {"dataset": "APPS benchmark", "accuracy": "See Table 5 of the paper"}, "description": "CodeT5 is a family of encoder-decoder language models for code from the paper: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. The checkpoint included in this repository is denoted as CodeT5-large-ntp-py (770M), which is introduced by the paper: CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi.", "model_name": "Salesforce/codet5-large-ntp-py"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "deepset/roberta-large-squad2", "api_call": "pipeline('question-answering', model='deepset/roberta-large-squad2')", "performance": {"dataset": "squad_v2", "accuracy": "Not provided"}, "description": "A pre-trained RoBERTa model for question answering tasks, specifically trained on the SQuAD v2 dataset. It can be used to answer questions based on a given context.", "model_name": "deepset/roberta-large-squad2"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Multilingual Translation", "api_name": "facebook/m2m100_418M", "api_call": "M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_418M')", "performance": {"dataset": "WMT", "accuracy": "Not provided"}, "description": "M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. It can directly translate between the 9,900 directions of 100 languages. To translate into a target language, the target language id is forced as the first generated token.", "model_name": "facebook/m2m100_418M"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "dmis-lab/biobert-v1.1", "api_call": "AutoModel.from_pretrained('dmis-lab/biobert-v1.1')", "performance": {"dataset": "", "accuracy": ""}, "description": "BioBERT is a pre-trained biomedical language representation model for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering.", "model_name": "dmis-lab/biobert-v1.1"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "deepset/roberta-base-squad2", "api_call": "AutoModelForQuestionAnswering.from_pretrained('deepset/roberta-base-squad2')", "performance": {"dataset": "squad_v2", "accuracy": {"exact": 79.8702939442, "f1": 82.9125116958}}, "description": "This is the roberta-base model, fine-tuned using the SQuAD2.0 dataset for the task of Question Answering. It's been trained on question-answer pairs, including unanswerable questions.", "model_name": "deepset/roberta-base-squad2"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "ast-finetuned-speech-commands-v2", "api_call": "AutoModelForAudioClassification.from_pretrained('MIT/ast-finetuned-speech-commands-v2')", "performance": {"dataset": "Speech Commands v2", "accuracy": "98.120"}, "description": "Audio Spectrogram Transformer (AST) model fine-tuned on Speech Commands v2. It was introduced in the paper AST: Audio Spectrogram Transformer by Gong et al. and first released in this repository. The Audio Spectrogram Transformer is equivalent to ViT, but applied on audio. Audio is first turned into an image (as a spectrogram), after which a Vision Transformer is applied. The model gets state-of-the-art results on several audio classification benchmarks.", "model_name": "MIT/ast-finetuned-speech-commands-v2"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "vintedois-diffusion-v0-1", "api_call": "pipeline('text-to-image', model='22h/vintedois-diffusion-v0-1')", "performance": {"dataset": "large amount of high quality images", "accuracy": "not specified"}, "description": "Vintedois (22h) Diffusion model trained by Predogl and piEsposito with open weights, configs and prompts. This model generates beautiful images without a lot of prompt engineering. It can also generate high-fidelity faces in a small number of steps.", "model_name": "22h/vintedois-diffusion-v0-1"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "pygmalion-6b", "api_call": "AutoModelForCausalLM.from_pretrained('waifu-workshop/pygmalion-6b')", "performance": {"dataset": "56MB of dialogue data gathered from multiple sources", "accuracy": "Not specified"}, "description": "Pygmalion 6B is a proof-of-concept dialogue model based on EleutherAI's GPT-J-6B. It is fine-tuned on 56MB of dialogue data gathered from multiple sources, which includes both real and partially machine-generated conversations. The model is intended for conversational text generation and can be used to play a character in a dialogue.", "model_name": "waifu-workshop/pygmalion-6b"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "text2text-generation", "api_name": "blip2-opt-6.7b", "api_call": "pipeline('text2text-generation', model='salesforce/blip2-opt-6.7b')", "performance": {"dataset": "LAION", "accuracy": "Not specified"}, "description": "BLIP-2 model, leveraging OPT-6.7b (a large language model with 6.7 billion parameters). It was introduced in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Li et al. and first released in this repository. The goal for the model is to predict the next text token, given the query embeddings and the previous text. This allows the model to be used for tasks like image captioning, visual question answering (VQA), and chat-like conversations by feeding the image and the previous conversation as the prompt to the model.", "model_name": "salesforce/blip2-opt-6.7b"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Language model", "api_name": "google/flan-t5-large", "api_call": "T5ForConditionalGeneration.from_pretrained('google/flan-t5-large')", "performance": {"dataset": [{"name": "MMLU", "accuracy": "75.2%"}]}, "description": "FLAN-T5 large is a language model fine-tuned on over 1000 tasks and multiple languages. It achieves state-of-the-art performance on several benchmarks, including 75.2% on five-shot MMLU. The model is based on pretrained T5 and fine-tuned with instructions for better zero-shot and few-shot performance. It can be used for research on language models, zero-shot NLP tasks, in-context few-shot learning NLP tasks, reasoning, question answering, and advancing fairness and safety research.", "model_name": "google/flan-t5-large"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "microsoft/beit-base-patch16-224-pt22k-ft22k", "api_call": "BeitForImageClassification.from_pretrained('microsoft/beit-base-patch16-224-pt22k-ft22k')", "performance": {"dataset": "ImageNet-22k", "accuracy": "Not specified"}, "description": "BEiT model pre-trained in a self-supervised fashion on ImageNet-22k - also called ImageNet-21k (14 million images, 21,841 classes) at resolution 224x224, and fine-tuned on the same dataset at resolution 224x224. It was introduced in the paper BEIT: BERT Pre-Training of Image Transformers by Hangbo Bao, Li Dong and Furu Wei and first released in this repository.", "model_name": "microsoft/beit-base-patch16-224-pt22k-ft22k"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-en-de", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-en-de')", "performance": {"dataset": "newstest2018-ende.en.de", "accuracy": {"BLEU": 45.2, "chr-F": 0.69}}, "description": "The Helsinki-NLP/opus-mt-en-de model is a translation model developed by the Language Technology Research Group at the University of Helsinki. It translates English text to German using the Hugging Face Transformers library. The model is trained on the OPUS dataset and has a BLEU score of 45.2 on the newstest2018-ende.en.de dataset.", "model_name": "Helsinki-NLP/opus-mt-en-de"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "uclanlp/visualbert-vqa", "api_call": "AutoModelForQuestionAnswering.from_pretrained('uclanlp/visualbert-vqa')", "performance": {"dataset": "", "accuracy": ""}, "description": "A VisualBERT model for Visual Question Answering.", "model_name": "uclanlp/visualbert-vqa"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "deepset/bert-base-cased-squad2", "api_call": "AutoModelForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')", "performance": {"dataset": "squad_v2", "accuracy": {"exact_match": 71.152, "f1": 74.671}}, "description": "This is a BERT base cased model trained on SQuAD v2", "model_name": "deepset/bert-base-cased-squad2"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "kan-bayashi_jvs_tts_finetune_jvs001_jsut_vits_raw_phn_jaconv_pyopenjta-truncated-178804", "api_call": "AutoModelForCausalLM.from_pretrained('espnet/kan-bayashi_jvs_tts_finetune_jvs001_jsut_vits_raw_phn_jaconv_pyopenjta-truncated-178804')", "performance": {"dataset": "", "accuracy": ""}, "description": "A Japanese text-to-speech model trained using the ESPnet framework. It is designed to convert text input into natural-sounding speech.", "model_name": "espnet/kan-bayashi_jvs_tts_finetune_jvs001_jsut_vits_raw_phn_jaconv_pyopenjta-truncated-178804"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Transformers", "functionality": "Sentence Correction", "api_name": "flexudy/t5-base-multi-sentence-doctor", "api_call": "AutoModelWithLMHead.from_pretrained('flexudy/t5-base-multi-sentence-doctor')", "performance": {"dataset": "tatoeba", "accuracy": "Not specified"}, "description": "Sentence doctor is a T5 model that attempts to correct the errors or mistakes found in sentences. Model works on English, German and French text.", "model_name": "flexudy/t5-base-multi-sentence-doctor"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/tapex-large-finetuned-wtq", "api_call": "BartForConditionalGeneration.from_pretrained('microsoft/tapex-large-finetuned-wtq')", "performance": {"dataset": "wikitablequestions", "accuracy": "Not provided"}, "description": "TAPEX (Table Pre-training via Execution) is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries. TAPEX is based on the BART architecture, the transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. This model is the tapex-large model fine-tuned on the WikiTableQuestions dataset.", "model_name": "microsoft/tapex-large-finetuned-wtq"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Image Generation", "api_name": "runwayml/stable-diffusion-inpainting", "api_call": "StableDiffusionInpaintPipeline.from_pretrained('runwayml/stable-diffusion-inpainting', revision='fp16', torch_dtype=torch.float16)", "performance": {"dataset": {"name": "LAION-2B (en)", "accuracy": "Not optimized for FID scores"}}, "description": "Stable Diffusion Inpainting is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input, with the extra capability of inpainting the pictures by using a mask.", "model_name": "runwayml/stable-diffusion-inpainting"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "uer/albert-base-chinese-cluecorpussmall", "api_call": "AlbertForMaskedLM.from_pretrained('uer/albert-base-chinese-cluecorpussmall')", "performance": {"dataset": "CLUECorpusSmall", "accuracy": "Not provided"}, "description": "This is the set of Chinese ALBERT models pre-trained by UER-py on the CLUECorpusSmall dataset. The model can be used for tasks like text generation and feature extraction.", "model_name": "uer/albert-base-chinese-cluecorpussmall"}
{"domain": "Multimodal Document Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "frizwankhan/entity-linking-model-final", "api_call": "pipeline('question-answering', model='frizwankhan/entity-linking-model-final')", "performance": {"dataset": "", "accuracy": ""}, "description": "A Document Question Answering model based on layoutlmv2", "model_name": "frizwankhan/entity-linking-model-final"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "imdanboy/jets", "api_call": "pipeline('text-to-speech', model='imdanboy/jets')", "performance": {"dataset": "ljspeech", "accuracy": null}, "description": "This model was trained by imdanboy using ljspeech recipe in espnet.", "model_name": "imdanboy/jets"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/mask2former-swin-large-cityscapes-semantic", "api_call": "Mask2FormerForUniversalSegmentation.from_pretrained('facebook/mask2former-swin-large-cityscapes-semantic')", "performance": {"dataset": "Cityscapes", "accuracy": "Not specified"}, "description": "Mask2Former model trained on Cityscapes semantic segmentation (large-sized version, Swin backbone). It addresses instance, semantic and panoptic segmentation by predicting a set of masks and corresponding labels. The model outperforms the previous SOTA, MaskFormer, in terms of performance and efficiency.", "model_name": "facebook/mask2former-swin-large-cityscapes-semantic"}
{"domain": "Computer Vision Image-to-Image", "framework": "Diffusers", "functionality": "Text-to-Image Diffusion Models", "api_name": "lllyasviel/control_v11p_sd15_openpose", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_openpose')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on openpose images.", "model_name": "lllyasviel/control_v11p_sd15_openpose"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "facebook/detr-resnet-50", "api_call": "DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')", "performance": {"dataset": "COCO 2017 validation", "accuracy": "42.0 AP"}, "description": "DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper End-to-End Object Detection with Transformers by Carion et al. and first released in this repository.", "model_name": "facebook/detr-resnet-50"}
{"domain": "Natural Language Processing Token Classification", "framework": "Hugging Face Transformers", "functionality": "Named Entity Recognition", "api_name": "Jean-Baptiste/camembert-ner", "api_call": "AutoModelForTokenClassification.from_pretrained('Jean-Baptiste/camembert-ner')", "performance": {"dataset": "wikiner-fr", "accuracy": {"overall_f1": 0.8914000000000001, "PER_f1": 0.9483, "ORG_f1": 0.8181, "LOC_f1": 0.8955000000000001, "MISC_f1": 0.8146}}, "description": "camembert-ner is a Named Entity Recognition (NER) model fine-tuned from camemBERT on the wikiner-fr dataset. It can recognize entities such as persons, organizations, locations, and miscellaneous entities.", "model_name": "Jean-Baptiste/camembert-ner"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "videomae-base-ssv2", "api_call": "VideoMAEForPreTraining.from_pretrained('MCG-NJU/videomae-base-short-ssv2')", "performance": {"dataset": "Something-Something-v2", "accuracy": ""}, "description": "VideoMAE is an extension of Masked Autoencoders (MAE) to video. The architecture of the model is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches. Videos are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds fixed sine/cosine position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of videos that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled videos for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire video.", "model_name": "MCG-NJU/videomae-base-short-ssv2"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/all-MiniLM-L12-v1", "api_call": "SentenceTransformer('sentence-transformers/all-MiniLM-L12-v1')", "performance": {"dataset": [{"name": "Sentence Embeddings Benchmark", "url": "https://seb.sbert.net"}], "accuracy": "Not provided"}, "description": "This is a sentence-transformers model that maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/all-MiniLM-L12-v1"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-221215-093747", "api_call": "AutoModel.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-221215-093747')", "performance": {"dataset": "DIODE", "accuracy": ""}, "description": "A depth estimation model fine-tuned on the DIODE dataset.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-221215-093747"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "google/flan-t5-xl", "api_call": "T5ForConditionalGeneration.from_pretrained('google/flan-t5-xl')", "performance": {"dataset": [{"name": "MMLU", "accuracy": "75.2%"}]}, "description": "FLAN-T5 XL is a large-scale language model fine-tuned on more than 1000 tasks covering multiple languages. It achieves state-of-the-art performance on several benchmarks and is designed for research on zero-shot and few-shot NLP tasks, such as reasoning, question answering, and understanding the limitations of current large language models.", "model_name": "google/flan-t5-xl"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Diffusers", "api_name": "Minecraft-Skin-Diffusion", "api_call": "DDPMPipeline.from_pretrained('WiNE-iNEFF/Minecraft-Skin-Diffusion')", "performance": {"dataset": "", "accuracy": ""}, "description": "Unconditional Image Generation model for generating Minecraft skins using diffusion-based methods.", "model_name": "WiNE-iNEFF/Minecraft-Skin-Diffusion"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg", "api_call": "pipeline('image-classification', model='laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg')", "performance": {"dataset": "ImageNet-1k", "accuracy": "75.9%"}, "description": "A series of CLIP ConvNeXt-Large (w/ extra text depth, vision MLP head) models trained on LAION-2B (english), a subset of LAION-5B, using OpenCLIP. The models are trained at 256x256 image resolution and achieve a 75.9 top-1 zero-shot accuracy on ImageNet-1k.", "model_name": "laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Hugging Face Transformers", "functionality": "Unconditional Image Generation", "api_name": "utyug1/sd-class-butterflies-32", "api_call": "DDPMPipeline.from_pretrained('utyug1/sd-class-butterflies-32')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "This model is a diffusion model for unconditional image generation of cute butterflies.", "model_name": "utyug1/sd-class-butterflies-32"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling", "api_name": "bert-large-cased", "api_call": "pipeline('fill-mask', model='bert-large-cased')", "performance": {"dataset": {"SQUAD 1.1": {"F1": 91.5, "EM": 84.8}, "Multi NLI": {"accuracy": 86.09}}}, "description": "BERT large model (cased) pretrained on English language using a masked language modeling (MLM) objective. It has 24 layers, 1024 hidden dimensions, 16 attention heads, and 336M parameters.", "model_name": "bert-large-cased"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/blenderbot-1B-distill", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('facebook/blenderbot-1B-distill')", "performance": {"dataset": "blended_skill_talk", "accuracy": "Not mentioned"}, "description": "BlenderBot-1B is a large-scale open-domain chatbot model that can engage in conversations, ask and answer questions, and display knowledge, empathy, and personality. This distilled version is smaller and faster than the original 9.4B parameter model, making it more accessible for use.", "model_name": "facebook/blenderbot-1B-distill"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "glpn-nyu", "api_call": "GLPNForDepthEstimation.from_pretrained('vinvino02/glpn-nyu')", "performance": {"dataset": "NYUv2", "accuracy": "Not provided"}, "description": "Global-Local Path Networks (GLPN) model trained on NYUv2 for monocular depth estimation. It was introduced in the paper Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth by Kim et al. and first released in this repository.", "model_name": "vinvino02/glpn-nyu"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "fxmarty/resnet-tiny-beans", "api_call": "pipeline('image-classification', model='fxmarty/resnet-tiny-beans')", "performance": {"dataset": "beans", "accuracy": "Not provided"}, "description": "A model trained on the beans dataset, just for testing and having a really tiny model.", "model_name": "fxmarty/resnet-tiny-beans"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Asteroid", "api_name": "ConvTasNet_Libri2Mix_sepclean_16k", "api_call": "Asteroid('JorisCos/ConvTasNet_Libri2Mix_sepclean_16k')", "performance": {"dataset": "Libri2Mix", "accuracy": {"si_sdr": 15.2436713569, "si_sdr_imp": 15.2430341785, "sdr": 15.6681089196, "sdr_imp": 15.578229918, "sir": 25.2951007566, "sir_imp": 25.2052199213, "sar": 16.3076825902, "sar_imp": -51.6498996376, "stoi": 0.9394951175, "stoi_imp": 0.2264019274}}, "description": "This model was trained by Joris Cosentino using the librimix recipe in Asteroid. It was trained on the sep_clean task of the Libri2Mix dataset.", "model_name": "JorisCos/ConvTasNet_Libri2Mix_sepclean_16k"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-convnext_base_w-laion2B-s13B-b82K", "api_call": "CLIPModel.from_pretrained('laion/CLIP-convnext_base_w-laion2B-s13B-b82K')", "performance": {"dataset": "ImageNet-1k", "accuracy": "70.8 - 71.7%"}, "description": "A series of CLIP ConvNeXt-Base (w/ wide embed dim) models trained on subsets of LAION-5B using OpenCLIP. The models achieve between 70.8 and 71.7 zero-shot top-1 accuracy on ImageNet-1k. The models can be used for zero-shot image classification, image and text retrieval, and other related tasks.", "model_name": "laion/CLIP-convnext_base_w-laion2B-s13B-b82K"}
{"domain": "Computer Vision Unconditional Image Generation", "framework": "Transformers", "functionality": "Unconditional Image Generation", "api_name": "ceyda/butterfly_cropped_uniq1K_512", "api_call": "LightweightGAN.from_pretrained('ceyda/butterfly_cropped_uniq1K_512')", "performance": {"dataset": "huggan/smithsonian_butterflies_subset", "accuracy": "FID score on 100 images"}, "description": "Butterfly GAN model based on the paper 'Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis'. The model is intended for fun and learning purposes. It was trained on 1000 images from the huggan/smithsonian_butterflies_subset dataset, with a focus on low data training as mentioned in the paper. The model generates high-quality butterfly images.", "model_name": "ceyda/butterfly_cropped_uniq1K_512"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "facebook/dino-vits8", "api_call": "ViTModel.from_pretrained('facebook/dino-vits8')", "performance": {"dataset": "imagenet-1k", "accuracy": null}, "description": "Vision Transformer (ViT) model trained using the DINO method. It was introduced in the paper Emerging Properties in Self-Supervised Vision Transformers by Mathilde Caron, Hugo Touvron, Ishan Misra, Herv\u00e9 J\u00e9gou, Julien Mairal, Piotr Bojanowski, Armand Joulin and first released in this repository.", "model_name": "facebook/dino-vits8"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": ["Translation", "Summarization", "Question Answering", "Text Classification", "Text Regression"], "api_name": "t5-small", "api_call": "T5Model.from_pretrained('t5-small')", "performance": {"dataset": "c4", "accuracy": "See research paper, Table 14 for full results"}, "description": "T5-Small is a Text-To-Text Transfer Transformer (T5) model with 60 million parameters. It is designed to perform a variety of NLP tasks, including machine translation, document summarization, question answering, and classification tasks. The model is pre-trained on the Colossal Clean Crawled Corpus (C4) and can be fine-tuned for specific tasks.", "model_name": "t5-small"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "rasa/LaBSE", "api_call": "AutoModel.from_pretrained('rasa/LaBSE')", "performance": {"dataset": "", "accuracy": ""}, "description": "LaBSE (Language-agnostic BERT Sentence Embedding) model for extracting sentence embeddings in multiple languages.", "model_name": "rasa/LaBSE"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Embeddings", "api_name": "sentence-transformers/distilbert-base-nli-stsb-mean-tokens", "api_call": "SentenceTransformer('sentence-transformers/distilbert-base-nli-stsb-mean-tokens')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/distilbert-base-nli-stsb-mean-tokens"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "hkunlp/instructor-base", "api_call": "INSTRUCTOR('hkunlp/instructor-base')", "performance": {"dataset": "MTEB AmazonCounterfactualClassification (en)", "accuracy": 86.209}, "description": "Instructor is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domains (e.g., science, finance, etc.) by simply providing the task instruction, without any finetuning. Instructor achieves state-of-the-art performance on 70 diverse embedding tasks.", "model_name": "hkunlp/instructor-base"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Conversational", "api_name": "microsoft/GODEL-v1_1-base-seq2seq", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('microsoft/GODEL-v1_1-base-seq2seq')", "performance": {"dataset": "Reddit discussion thread, instruction and knowledge grounded dialogs", "accuracy": "N/A"}, "description": "GODEL is a large-scale pre-trained model for goal-directed dialogs. It is parameterized with a Transformer-based encoder-decoder model and trained for response generation grounded in external text, which allows more effective fine-tuning on dialog tasks that require conditioning the response on information that is external to the current conversation (e.g., a retrieved document). The pre-trained model can be efficiently fine-tuned and adapted to accomplish a new dialog task with a handful of task-specific dialogs. The v1.1 model is trained on 551M multi-turn dialogs from Reddit discussion thread, and 5M instruction and knowledge grounded dialogs.", "model_name": "microsoft/GODEL-v1_1-base-seq2seq"}
{"domain": "Audio Voice Activity Detection", "framework": "pyannote.audio", "functionality": "Speaker diarization", "api_name": "johnislarry/cloned-pyannote-speaker-diarization-endpoint", "api_call": "Pipeline.from_pretrained('pyannote/speaker-diarization@2.1',use_auth_token='ACCESS_TOKEN_GOES_HERE')", "performance": {"dataset": [{"name": "AISHELL-4", "accuracy": {"DER%": 14.61, "FA%": 3.31, "Miss%": 4.35, "Conf%": 6.95}}, {"name": "AMI Mix-Headset only_words", "accuracy": {"DER%": 18.21, "FA%": 3.2800000000000002, "Miss%": 11.07, "Conf%": 3.87}}, {"name": "AMI Array1-01 only_words", "accuracy": {"DER%": 29.0, "FA%": 2.71, "Miss%": 21.61, "Conf%": 4.68}}, {"name": "CALLHOME Part2", "accuracy": {"DER%": 30.24, "FA%": 3.71, "Miss%": 16.86, "Conf%": 9.66}}, {"name": "DIHARD 3 Full", "accuracy": {"DER%": 20.99, "FA%": 4.25, "Miss%": 10.74, "Conf%": 6.0}}, {"name": "REPERE Phase 2", "accuracy": {"DER%": 12.62, "FA%": 1.55, "Miss%": 3.3, "Conf%": 7.76}}, {"name": "VoxConverse v0.0.2", "accuracy": {"DER%": 12.76, "FA%": 3.45, "Miss%": 3.85, "Conf%": 5.46}}]}, "description": "This API provides speaker diarization functionality using the pyannote.audio framework. It is capable of processing audio files and outputting speaker diarization results in RTTM format. The API supports providing the number of speakers, minimum and maximum number of speakers, and adjusting the segmentation onset threshold.", "model_name": "pyannote/speaker-diarization"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "kha-white/manga-ocr-base", "api_call": "pipeline('ocr', model='kha-white/manga-ocr-base')", "performance": {"dataset": "manga109s", "accuracy": ""}, "description": "Optical character recognition for Japanese text, with the main focus being Japanese manga. It uses Vision Encoder Decoder framework. Manga OCR can be used as a general purpose printed Japanese OCR, but its main goal was to provide a high quality text recognition, robust against various scenarios specific to manga: both vertical and horizontal text, text with furigana, text overlaid on images, wide variety of fonts and font styles, and low quality images.", "model_name": "kha-white/manga-ocr-base"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "Zixtrauce/BaekBot", "api_call": "pipeline('conversational', model='Zixtrauce/BaekBot')", "performance": {"dataset": "", "accuracy": ""}, "description": "BaekBot is a conversational model based on the GPT-2 architecture for text generation. It can be used for generating human-like responses in a chat-like environment.", "model_name": "Zixtrauce/BaekBot"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-medium-finetuned-sqa", "api_call": "pipeline('table-question-answering', model='google/tapas-medium-finetuned-sqa')", "performance": {"dataset": "msr_sqa", "accuracy": 0.6561}, "description": "TAPAS medium model fine-tuned on Sequential Question Answering (SQA). This model is pretrained on a large corpus of English data from Wikipedia and uses relative position embeddings. It can be used for answering questions related to a table in a conversational set-up.", "model_name": "google/tapas-medium-finetuned-sqa"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "facebook/opt-66b", "api_call": "AutoModelForCausalLM.from_pretrained('facebook/opt-66b', torch_dtype=torch.float16)", "performance": {"dataset": "GPT-3", "accuracy": "roughly matched"}, "description": "OPT (Open Pre-trained Transformer) is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, designed to enable reproducible and responsible research at scale. OPT models are trained to roughly match the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. The pretrained-only model can be used for prompting for evaluation of downstream tasks as well as text generation.", "model_name": "facebook/opt-66b"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Program Synthesis", "api_name": "Salesforce/codegen-350M-multi", "api_call": "AutoModelForCausalLM.from_pretrained('Salesforce/codegen-350M-multi')", "performance": {"dataset": "HumanEval and MTPB", "accuracy": "Refer to the paper for accuracy details"}, "description": "CodeGen is a family of autoregressive language models for program synthesis. The checkpoint included in this repository is denoted as CodeGen-Multi 350M, where Multi means the model is initialized with CodeGen-NL 350M and further pre-trained on a dataset of multiple programming languages, and 350M refers to the number of trainable parameters. The model is capable of extracting features from given natural language and programming language texts, and calculating the likelihood of them. It is best at program synthesis, generating executable code given English prompts, and can complete partially-generated code as well.", "model_name": "Salesforce/codegen-350M-multi"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "vit_base_patch16_224.augreg2_in21k_ft_in1k", "api_call": "ViTForImageClassification.from_pretrained('timm/vit_base_patch16_224.augreg2_in21k_ft_in1k')", "performance": {"dataset": "", "accuracy": ""}, "description": "A Vision Transformer model for image classification, pretrained on ImageNet-21k and fine-tuned on ImageNet-1k.", "model_name": "timm/vit_base_patch16_224.augreg2_in21k_ft_in1k"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "nikcheerla/nooks-amd-detection-v2-full", "api_call": "SentenceTransformer('nikcheerla/nooks-amd-detection-v2-full')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks like clustering or semantic search.", "model_name": "nikcheerla/nooks-amd-detection-v2-full"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "shi-labs/oneformer_ade20k_swin_tiny", "api_call": "OneFormerForUniversalSegmentation.from_pretrained('shi-labs/oneformer_ade20k_swin_tiny')", "performance": {"dataset": "ADE20k", "accuracy": "Not provided"}, "description": "OneFormer is the first multi-task universal image segmentation framework. It needs to be trained only once with a single universal architecture, a single model, and on a single dataset, to outperform existing specialized models across semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference, all with a single model.", "model_name": "shi-labs/oneformer_ade20k_swin_tiny"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "google/deplot", "api_call": "Pix2StructForConditionalGeneration.from_pretrained('google/deplot')", "performance": {"dataset": "ChartQA", "accuracy": "24.0% improvement over finetuned SOTA"}, "description": "DePlot is a model that translates the image of a plot or chart to a linearized table. It decomposes the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs.", "model_name": "google/deplot"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "deep-reinforcement-learning", "api_name": "ppo-PongNoFrameskip-v4", "api_call": "load_from_hub(repo_id='sb3/ppo-PongNoFrameskip-v4',filename='{MODEL FILENAME}.zip',)", "performance": {"dataset": "PongNoFrameskip-v4", "accuracy": "21.00 +/- 0.00"}, "description": "This is a trained model of a PPO agent playing PongNoFrameskip-v4 using the stable-baselines3 library and the RL Zoo. The RL Zoo is a training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.", "model_name": "sb3/ppo-PongNoFrameskip-v4"}
{"domain": "Multimodal Document Question Answering", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "xhyi/layoutlmv3_docvqa_t11c5000", "api_call": "pipeline('question-answering', model='xhyi/layoutlmv3_docvqa_t11c5000')", "performance": {"dataset": "DocVQA", "accuracy": ""}, "description": "LayoutLMv3 model trained for document question answering task.", "model_name": "xhyi/layoutlmv3_docvqa_t11c5000"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Embeddings", "api_name": "sentence-transformers/bert-base-nli-mean-tokens", "api_call": "SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/bert-base-nli-mean-tokens"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "joeddav/xlm-roberta-large-xnli", "api_call": "XLMRobertaForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli')", "performance": {"dataset": {"xnli": "56.6k", "multi_nli": "8.73k"}, "accuracy": "Not specified"}, "description": "This model takes xlm-roberta-large and fine-tunes it on a combination of NLI data in 15 languages. It is intended to be used for zero-shot text classification, such as with the Hugging Face ZeroShotClassificationPipeline.", "model_name": "joeddav/xlm-roberta-large-xnli"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "wav2vec2-xlsr-53-russian-emotion-recognition", "api_call": "Wav2Vec2Model.from_pretrained('facebook/wav2vec2-large-xlsr-53')", "performance": {"dataset": "Russian Emotional Speech Dialogs", "accuracy": "72%"}, "description": "A model trained to recognize emotions in Russian speech using wav2vec2. It can classify emotions such as anger, disgust, enthusiasm, fear, happiness, neutral, and sadness.", "model_name": "facebook/wav2vec2-large-xlsr-53"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "swin2SR-lightweight-x2-64", "api_call": "Swin2SRForConditionalGeneration.from_pretrained('condef/Swin2SR-lightweight-x2-64')", "performance": {"dataset": "", "accuracy": ""}, "description": "Swin2SR model that upscales images x2. It was introduced in the paper Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration by Conde et al. and first released in this repository. This model is intended for lightweight image super resolution.", "model_name": "condef/Swin2SR-lightweight-x2-64"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Diffusion-based text-to-image generation model", "api_name": "lllyasviel/control_v11e_sd15_ip2p", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11e_sd15_ip2p')", "performance": {"dataset": "Stable Diffusion v1-5", "accuracy": "Not provided"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on instruct pix2pix images.", "model_name": "lllyasviel/control_v11e_sd15_ip2p"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "monologg/koelectra-small-v2-distilled-korquad-384", "api_call": "pipeline('question-answering', model='monologg/koelectra-small-v2-distilled-korquad-384')", "performance": {"dataset": "KorQuAD", "accuracy": "Not provided"}, "description": "A Korean Question Answering model based on Electra and trained on the KorQuAD dataset.", "model_name": "monologg/koelectra-small-v2-distilled-korquad-384"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Generation", "api_name": "runwayml/stable-diffusion-v1-5", "api_call": "StableDiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16)(prompt).images[0]", "performance": {"dataset": "COCO2017", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.", "model_name": "runwayml/stable-diffusion-v1-5"}
{"domain": "Natural Language Processing Translation", "framework": "Transformers", "functionality": "Text-to-Text Generation", "api_name": "optimum/t5-small", "api_call": "ORTModelForSeq2SeqLM.from_pretrained('optimum/t5-small')", "performance": {"dataset": "c4", "accuracy": "N/A"}, "description": "T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format. It can be used for translation, text-to-text generation, and summarization.", "model_name": "optimum/t5-small"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "superb/hubert-base-superb-er", "api_call": "pipeline('audio-classification', model='superb/hubert-base-superb-er')", "performance": {"dataset": "IEMOCAP", "accuracy": {"session1": 0.6492, "session2": 0.6359}}, "description": "Hubert-Base for Emotion Recognition is a ported version of S3PRL's Hubert for the SUPERB Emotion Recognition task. The base model is hubert-base-ls960, which is pretrained on 16kHz sampled speech audio. The model is used for predicting an emotion class for each utterance, and it is trained and evaluated on the IEMOCAP dataset.", "model_name": "superb/hubert-base-superb-er"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "vit_tiny_patch16_224.augreg_in21k_ft_in1k", "api_call": "timm.create_model('hf_hub:timm/vit_tiny_patch16_224.augreg_in21k_ft_in1k', pretrained=True)", "performance": {"dataset": "", "accuracy": ""}, "description": "A Vision Transformer model for image classification, pretrained on ImageNet-21k and fine-tuned on ImageNet-1k with augmentations and regularization.", "model_name": "timm/vit_tiny_patch16_224.augreg_in21k_ft_in1k"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "ivelin/donut-refexp-combined-v1", "api_call": "pipeline('visual-question-answering', model='ivelin/donut-refexp-combined-v1')", "performance": {"dataset": "ivelin/donut-refexp-combined-v1", "accuracy": "N/A"}, "description": "A visual question answering model that takes an image and a question as input and provides an answer based on the visual content of the image and the context of the question.", "model_name": "ivelin/donut-refexp-combined-v1"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "MCG-NJU/videomae-base-finetuned-kinetics", "api_call": "VideoMAEForVideoClassification.from_pretrained('MCG-NJU/videomae-base-finetuned-kinetics')", "performance": {"dataset": "Kinetics-400", "accuracy": {"top-1": 80.9, "top-5": 94.7}}, "description": "VideoMAE model pre-trained for 1600 epochs in a self-supervised way and fine-tuned in a supervised way on Kinetics-400. It was introduced in the paper VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Tong et al. and first released in this repository.", "model_name": "MCG-NJU/videomae-base-finetuned-kinetics"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "layoutlmv2-base-uncased_finetuned_docvqa", "api_call": "AutoModelForDocumentQuestionAnswering.from_pretrained('tiennvcs/layoutlmv2-base-uncased-finetuned-docvqa')", "performance": {"dataset": "None", "accuracy": {"Loss": 4.3167}}, "description": "This model is a fine-tuned version of microsoft/layoutlmv2-base-uncased on the None dataset.", "model_name": "tiennvcs/layoutlmv2-base-uncased-finetuned-docvqa"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Conversational", "api_name": "allenai/cosmo-xl", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('allenai/cosmo-xl')", "performance": {"dataset": {"allenai/soda": "", "allenai/prosocial-dialog": ""}, "accuracy": ""}, "description": "COSMO is a conversation agent with greater generalizability on both in- and out-of-domain chitchat datasets (e.g., DailyDialog, BlendedSkillTalk). It is trained on two datasets: SODA and ProsocialDialog. COSMO is especially aiming to model natural human conversations. It can accept situation descriptions as well as instructions on what role it should play in the situation.", "model_name": "allenai/cosmo-xl"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Speech Recognition", "api_name": "jonatasgrosman/wav2vec2-large-xlsr-53-japanese", "api_call": "SpeechRecognitionModel('jonatasgrosman/wav2vec2-large-xlsr-53-japanese')", "performance": {"dataset": "common_voice", "accuracy": {"WER": 81.8, "CER": 20.16}}, "description": "Fine-tuned XLSR-53 large model for speech recognition in Japanese. Trained on Common Voice 6.1, CSS10, and JSUT datasets. Make sure your speech input is sampled at 16kHz.", "model_name": "jonatasgrosman/wav2vec2-large-xlsr-53-japanese"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Token Classification", "api_name": "xlm-roberta-large-finetuned-conll03-english", "api_call": "AutoModelForTokenClassification.from_pretrained('xlm-roberta-large-finetuned-conll03-english')", "performance": {"dataset": "conll2003", "accuracy": "More information needed"}, "description": "The XLM-RoBERTa model is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data. This model is XLM-RoBERTa-large fine-tuned with the conll2003 dataset in English. It can be used for token classification tasks such as Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging.", "model_name": "xlm-roberta-large-finetuned-conll03-english"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Summarization", "api_name": "google/pegasus-large", "api_call": "pipeline('summarization', model='google/pegasus-large')", "performance": {"dataset": [{"name": "xsum", "accuracy": "47.60/24.83/39.64"}, {"name": "cnn_dailymail", "accuracy": "44.16/21.56/41.30"}, {"name": "newsroom", "accuracy": "45.98/34.20/42.18"}, {"name": "multi_news", "accuracy": "47.65/18.75/24.95"}, {"name": "gigaword", "accuracy": "39.65/20.47/36.76"}, {"name": "wikihow", "accuracy": "46.39/22.12/38.41"}, {"name": "reddit_tifu", "accuracy": "27.99/9.81/22.94"}, {"name": "big_patent", "accuracy": "52.29/33.08/41.66"}, {"name": "arxiv", "accuracy": "44.21/16.95/25.67"}, {"name": "pubmed", "accuracy": "45.97/20.15/28.25"}, {"name": "aeslc", "accuracy": "37.68/21.25/36.51"}, {"name": "billsum", "accuracy": "59.67/41.58/47.59"}]}, "description": "google/pegasus-large is a pre-trained model for abstractive text summarization based on the PEGASUS architecture. It is trained on a mixture of C4 and HugeNews datasets and uses a sentencepiece tokenizer that can encode newline characters. The model has been fine-tuned for various summarization tasks and achieves state-of-the-art performance on multiple benchmarks.", "model_name": "google/pegasus-large"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Conversational", "api_name": "Pi3141/DialoGPT-medium-elon-3", "api_call": "pipeline('text-generation', model='Pi3141/DialoGPT-medium-elon-3')", "performance": {"dataset": "Twitter tweets by Elon Musk", "accuracy": "N/A"}, "description": "DialoGPT model that talks like Elon Musk, trained on Twitter tweets by Elon Musk. This model will spew meaningless shit about 40% of the time. Trained on 8 epochs. But with a larger dataset this time. The AI can now use more emojis, I think.", "model_name": "Pi3141/DialoGPT-medium-elon-3"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "timm/mobilenetv3_large_100.ra_in1k", "api_call": "timm.create_model('mobilenetv3_large_100.ra_in1k', pretrained=True)", "performance": {"dataset": "imagenet-1k", "accuracy": "Not provided"}, "description": "A MobileNet-v3 image classification model. Trained on ImageNet-1k in timm using recipe template described below. Recipe details: RandAugment RA recipe. Inspired by and evolved from EfficientNet RandAugment recipes. Published as B recipe in ResNet Strikes Back. RMSProp (TF 1.0 behaviour) optimizer, EMA weight averaging. Step (exponential decay w/ staircase) LR schedule with warmup.", "model_name": "timm/mobilenetv3_large_100.ra_in1k"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-nyu-finetuned-diode-230131-041708", "api_call": "AutoModelForDepthEstimation.from_pretrained('sayakpaul/glpn-nyu-finetuned-diode-230131-041708')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.4425, "Mae": 0.427, "Rmse": 0.6196, "Abs_Rel": 0.4543, "Log_Mae": 0.1732, "Log_Rmse": 0.2288, "Delta1": 0.3787, "Delta2": 0.6298, "Delta3": 0.8083}}, "description": "This model is a fine-tuned version of vinvino02/glpn-nyu on the diode-subset dataset. It is used for depth estimation in computer vision tasks.", "model_name": "sayakpaul/glpn-nyu-finetuned-diode-230131-041708"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "keremberke/yolov8m-pothole-segmentation", "api_call": "YOLO('keremberke/yolov8m-pothole-segmentation')", "performance": {"dataset": "pothole-segmentation", "accuracy": {"mAP@0.5(box)": 0.858, "mAP@0.5(mask)": 0.895}}, "description": "A YOLOv8 model for pothole segmentation trained on keremberke/pothole-segmentation dataset. It can detect potholes in images and provide segmentation masks for the detected potholes.", "model_name": "keremberke/yolov8m-pothole-segmentation"}
{"domain": "Multimodal Document Question Answer", "framework": "Transformers", "functionality": "Document Question Answering", "api_name": "tiny-random-LayoutLMv3ForQuestionAnswering", "api_call": "LayoutLMv3ForQuestionAnswering.from_pretrained('hf-tiny-model-private/tiny-random-LayoutLMv3ForQuestionAnswering')", "performance": {"dataset": "", "accuracy": ""}, "description": "A tiny random LayoutLMv3 model for document question answering. Can be used with the Hugging Face Inference API.", "model_name": "hf-tiny-model-private/tiny-random-LayoutLMv3ForQuestionAnswering"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Text Classification", "api_name": "joeddav/distilbert-base-uncased-go-emotions-student", "api_call": "pipeline('text-classification', model='joeddav/distilbert-base-uncased-go-emotions-student')", "performance": {"dataset": "go_emotions"}, "description": "This model is distilled from the zero-shot classification pipeline on the unlabeled GoEmotions dataset. It is primarily intended as a demo of how an expensive NLI-based zero-shot model can be distilled to a more efficient student, allowing a classifier to be trained with only unlabeled data.", "model_name": "joeddav/distilbert-base-uncased-go-emotions-student"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "tiny-wav2vec2-stable-ln", "api_call": "pipeline('automatic-speech-recognition', model='ybelkada/tiny-wav2vec2-stable-ln')", "performance": {"dataset": null, "accuracy": null}, "description": "A tiny wav2vec2 model for Automatic Speech Recognition", "model_name": "ybelkada/tiny-wav2vec2-stable-ln"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "deep-reinforcement-learning", "api_name": "ppo-seals-CartPole-v0", "api_call": "load_from_hub(repo_id='HumanCompatibleAI/ppo-seals-CartPole-v0',filename='{MODEL FILENAME}.zip',)", "performance": {"dataset": "seals/CartPole-v0", "accuracy": "500.00 +/- 0.00"}, "description": "This is a trained model of a PPO agent playing seals/CartPole-v0 using the stable-baselines3 library and the RL Zoo. The RL Zoo is a training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.", "model_name": "HumanCompatibleAI/ppo-seals-CartPole-v0"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "sultan/BioM-ELECTRA-Large-SQuAD2", "api_call": "pipeline('question-answering', model='sultan/BioM-ELECTRA-Large-SQuAD2')", "performance": {"dataset": "SQuAD2.0 Dev", "accuracy": {"exact": 84.3342036554, "f1": 87.4935424189}}, "description": "BioM-ELECTRA-Large-SQuAD2 is a fine-tuned version of BioM-ELECTRA-Large, which was pre-trained on PubMed Abstracts, on the SQuAD2.0 dataset. Fine-tuning the biomedical language model on the SQuAD dataset helps improve the score on the BioASQ challenge. This model is suitable for working with BioASQ or biomedical QA tasks.", "model_name": "sultan/BioM-ELECTRA-Large-SQuAD2"}
{"domain": "Natural Language Processing Translation", "framework": "PyTorch Transformers", "functionality": "text2text-generation", "api_name": "facebook/nllb-200-distilled-600M", "api_call": "pipeline('translation_xx_to_yy', model='facebook/nllb-200-distilled-600M')", "performance": {"dataset": "Flores-200", "accuracy": "BLEU, spBLEU, chrF++"}, "description": "NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single sentence translation among 200 languages. The model was trained on general domain text data and is not intended to be used with domain specific texts, such as medical domain or legal domain. The model is not intended to be used for document translation.", "model_name": "facebook/nllb-200-distilled-600M"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "PyTorch Transformers", "functionality": "Table Question Answering", "api_name": "table-question-answering-tapas", "api_call": "pipeline('table-question-answering', model='Meena/table-question-answering-tapas')", "performance": {"dataset": [{"name": "SQA (Sequential Question Answering by Microsoft)", "accuracy": null}, {"name": "WTQ (Wiki Table Questions by Stanford University)", "accuracy": null}, {"name": "WikiSQL (by Salesforce)", "accuracy": null}]}, "description": "TAPAS learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed or refuted by the contents of a table. It is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. TAPAS uses relative position embeddings and has 7 token types that encode tabular structure. It is pre-trained on the masked language modeling (MLM) objective on a large dataset comprising millions of tables from English Wikipedia and corresponding texts.", "model_name": "Meena/table-question-answering-tapas"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face", "functionality": "Sentiment Analysis", "api_name": "cardiffnlp/twitter-xlm-roberta-base-sentiment", "api_call": "pipeline('sentiment-analysis', model='cardiffnlp/twitter-xlm-roberta-base-sentiment')", "performance": {"dataset": "Twitter", "accuracy": "Not provided"}, "description": "This is a multilingual XLM-roBERTa-base model trained on ~198M tweets and finetuned for sentiment analysis. The sentiment fine-tuning was done on 8 languages (Ar, En, Fr, De, Hi, It, Sp, Pt) but it can be used for more languages (see paper for details).", "model_name": "cardiffnlp/twitter-xlm-roberta-base-sentiment"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "glpn-kitti-finetuned-diode-221214-123047", "api_call": "pipeline('depth-estimation', model='sayakpaul/glpn-kitti-finetuned-diode-221214-123047')", "performance": {"dataset": "diode-subset", "accuracy": {"Loss": 0.3497, "Mae": 0.2847, "Rmse": 0.3977, "Abs Rel": 0.3477, "Log Mae": 0.1203, "Log Rmse": 0.1726, "Delta1": 0.5217, "Delta2": 0.8246, "Delta3": 0.9436}}, "description": "This model is a fine-tuned version of vinvino02/glpn-kitti on the diode-subset dataset. It is used for depth estimation in computer vision applications.", "model_name": "sayakpaul/glpn-kitti-finetuned-diode-221214-123047"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "nlpaueb/legal-bert-small-uncased", "api_call": "AutoModel.from_pretrained('nlpaueb/legal-bert-small-uncased')", "performance": {"dataset": "Legal Corpora", "accuracy": "Comparable to larger models"}, "description": "LEGAL-BERT is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications. This is the light-weight version of BERT-BASE (33% the size of BERT-BASE) pre-trained from scratch on legal data, which achieves comparable performance to larger models, while being much more efficient (approximately 4 times faster) with a smaller environmental footprint.", "model_name": "nlpaueb/legal-bert-small-uncased"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "sileod/deberta-v3-base-tasksource-nli", "api_call": "AutoModelForSequenceClassification.from_pretrained('sileod/deberta-v3-base-tasksource-nli')", "performance": {"dataset": ["glue", "piqa", "sciq"], "accuracy": "70% on WNLI"}, "description": "DeBERTa-v3-base fine-tuned with multi-task learning on 520 tasks of the tasksource collection. This checkpoint has strong zero-shot validation performance on many tasks, and can be used for zero-shot NLI pipeline (similar to bart-mnli but better).", "model_name": "sileod/deberta-v3-base-tasksource-nli"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "prompthero/openjourney-v4", "api_call": "pipeline('text-to-image', model='prompthero/openjourney-v4')", "performance": {"dataset": "Midjourney v4 images", "accuracy": "Not provided"}, "description": "Openjourney v4 is trained on +124k Midjourney v4 images by PromptHero. It is used for generating images based on text inputs.", "model_name": "prompthero/openjourney-v4"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Transformers", "functionality": "Zero-Shot Classification", "api_name": "valhalla/distilbart-mnli-12-3", "api_call": "pipeline('zero-shot-classification', model='valhalla/distilbart-mnli-12-3')", "performance": {"dataset": [{"name": "matched acc", "accuracy": 88.1}, {"name": "mismatched acc", "accuracy": 88.19}]}, "description": "distilbart-mnli is the distilled version of bart-large-mnli created using the No Teacher Distillation technique proposed for BART summarisation by Huggingface. It is a simple and effective technique with very little performance drop.", "model_name": "valhalla/distilbart-mnli-12-3"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Classification", "api_name": "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli", "api_call": "AutoModelForSequenceClassification.from_pretrained('MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli')", "performance": {"dataset": {"mnli-m": 0.903, "mnli-mm": 0.903, "fever-nli": 0.777, "anli-all": 0.579, "anli-r3": 0.495}, "accuracy": {"mnli-m": 0.903, "mnli-mm": 0.903, "fever-nli": 0.777, "anli-all": 0.579, "anli-r3": 0.495}}, "description": "This model was trained on the MultiNLI, Fever-NLI and Adversarial-NLI (ANLI) datasets, which comprise 763 913 NLI hypothesis-premise pairs. This base model outperforms almost all large models on the ANLI benchmark. The base model is DeBERTa-v3-base from Microsoft. The v3 variant of DeBERTa substantially outperforms previous versions of the model by including a different pre-training objective, see annex 11 of the original DeBERTa paper.", "model_name": "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/mask2former-swin-small-coco-instance", "api_call": "Mask2FormerForUniversalSegmentation.from_pretrained('facebook/mask2former-swin-small-coco-instance')", "performance": {"dataset": "COCO", "accuracy": "Not provided"}, "description": "Mask2Former model trained on COCO instance segmentation (small-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, in terms of both performance and efficiency.", "model_name": "facebook/mask2former-swin-small-coco-instance"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Emotion Recognition", "api_name": "superb/hubert-large-superb-er", "api_call": "pipeline('audio-classification', model='superb/hubert-large-superb-er')", "performance": {"dataset": "IEMOCAP", "accuracy": 0.6762}, "description": "This is a ported version of S3PRL's Hubert for the SUPERB Emotion Recognition task. The base model is hubert-large-ll60k, which is pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. For more information refer to SUPERB: Speech processing Universal PERformance Benchmark.", "model_name": "superb/hubert-large-superb-er"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Code Understanding and Generation", "api_name": "Salesforce/codet5-base", "api_call": "T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')", "performance": {"dataset": "code_search_net", "accuracy": "Refer to the paper for evaluation results on several downstream benchmarks"}, "description": "CodeT5 is a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. It supports both code understanding and generation tasks and allows for multi-task learning. The model can be used for tasks such as code summarization, code generation, code translation, code refinement, code defect detection, and code clone detection.", "model_name": "Salesforce/codet5-base"}
{"domain": "Computer Vision Image-to-Image", "framework": "Keras", "functionality": "Image Deblurring", "api_name": "google/maxim-s3-deblurring-gopro", "api_call": "from_pretrained_keras('google/maxim-s3-deblurring-gopro')", "performance": {"dataset": "GoPro", "accuracy": {"PSNR": 32.86, "SSIM": 0.961}}, "description": "MAXIM model pre-trained for image deblurring. It was introduced in the paper MAXIM: Multi-Axis MLP for Image Processing by Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li and first released in this repository.", "model_name": "google/maxim-s3-deblurring-gopro"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/all-MiniLM-L6-v2", "api_call": "SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')", "performance": {"dataset": "1B sentence pairs dataset", "accuracy": "https://seb.sbert.net"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/all-MiniLM-L6-v2"}
{"domain": "Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "mazkooleg/0-9up-unispeech-sat-base-ft", "api_call": "pipeline('audio-classification', model='mazkooleg/0-9up-unispeech-sat-base-ft')", "performance": {"dataset": "mazkooleg/0-9up_google_speech_commands_augmented_raw", "accuracy": 0.9979}, "description": "This model is a fine-tuned version of microsoft/unispeech-sat-base on the None dataset. It achieves the following results on the evaluation set: Loss: 0.0123, Accuracy: 0.9979.", "model_name": "mazkooleg/0-9up-unispeech-sat-base-ft"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "cl-tohoku/bert-base-japanese", "api_call": "AutoModelForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese')", "performance": {"dataset": "wikipedia", "accuracy": "N/A"}, "description": "This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization.", "model_name": "cl-tohoku/bert-base-japanese"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Language model", "api_name": "google/flan-t5-base", "api_call": "T5ForConditionalGeneration.from_pretrained('google/flan-t5-base')", "performance": {"dataset": [{"name": "MMLU", "accuracy": "75.2%"}]}, "description": "FLAN-T5 is a language model fine-tuned on more than 1000 additional tasks covering multiple languages. It achieves state-of-the-art performance on several benchmarks and is designed for research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering.", "model_name": "google/flan-t5-base"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Object Detection", "api_name": "keremberke/yolov8m-valorant-detection", "api_call": "YOLO('keremberke/yolov8m-valorant-detection')", "performance": {"dataset": "valorant-object-detection", "accuracy": 0.965}, "description": "A YOLOv8 model for object detection in Valorant game, trained on a custom dataset. It detects dropped spike, enemy, planted spike, and teammate objects.", "model_name": "keremberke/yolov8m-valorant-detection"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "OFA-Sys/chinese-clip-vit-base-patch16", "api_call": "ChineseCLIPModel.from_pretrained('OFA-Sys/chinese-clip-vit-base-patch16')", "performance": {"dataset": {"MUGE Text-to-Image Retrieval": {"accuracy": {"Zero-shot R@1": 63.0, "Zero-shot R@5": 84.1, "Zero-shot R@10": 89.2, "Finetune R@1": 68.9, "Finetune R@5": 88.7, "Finetune R@10": 93.1}}, "Flickr30K-CN Retrieval": {"accuracy": {"Zero-shot Text-to-Image R@1": 71.2, "Zero-shot Text-to-Image R@5": 91.4, "Zero-shot Text-to-Image R@10": 95.5, "Finetune Text-to-Image R@1": 83.8, "Finetune Text-to-Image R@5": 96.9, "Finetune Text-to-Image R@10": 98.6, "Zero-shot Image-to-Text R@1": 81.6, "Zero-shot Image-to-Text R@5": 97.5, "Zero-shot Image-to-Text R@10": 98.8, "Finetune Image-to-Text R@1": 95.3, "Finetune Image-to-Text R@5": 99.7, "Finetune Image-to-Text R@10": 100.0}}, "COCO-CN Retrieval": {"accuracy": {"Zero-shot Text-to-Image R@1": 69.2, "Zero-shot Text-to-Image R@5": 89.9, "Zero-shot Text-to-Image R@10": 96.1, "Finetune Text-to-Image R@1": 81.5, "Finetune Text-to-Image R@5": 96.9, "Finetune Text-to-Image R@10": 99.1, "Zero-shot Image-to-Text R@1": 63.0, "Zero-shot Image-to-Text R@5": 86.6, "Zero-shot Image-to-Text R@10": 92.9, "Finetune Image-to-Text R@1": 83.5, "Finetune Image-to-Text R@5": 97.3, "Finetune Image-to-Text R@10": 99.2}}, "Zero-shot Image Classification": {"accuracy": {"CIFAR10": 96.0, "CIFAR100": 79.7, "DTD": 51.2, "EuroSAT": 52.0, "FER": 55.1, "FGVC": 26.2, "KITTI": 49.9, "MNIST": 79.4, "PC": 63.5, "VOC": 84.9}}}}, "description": "Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. It uses ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder.", "model_name": "OFA-Sys/chinese-clip-vit-base-patch16"}
{"domain": "Natural Language Processing Text Generation", "framework": "Transformers", "functionality": "Text Generation", "api_name": "facebook/opt-125m", "api_call": "pipeline('text-generation', model='facebook/opt-125m')", "performance": {"dataset": "Various", "accuracy": "Roughly matches GPT-3 performance"}, "description": "OPT (Open Pre-trained Transformers) is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, designed to enable reproducible and responsible research at scale. It was predominantly pretrained with English text, but a small amount of non-English data is present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT can be used for prompting for evaluation of downstream tasks as well as text generation.", "model_name": "facebook/opt-125m"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Video Classification", "api_name": "videomae-base-finetuned-ucf101-subset", "api_call": "VideoMAEForVideoClassification.from_pretrained('zahrav/videomae-base-finetuned-ucf101-subset')", "performance": {"dataset": "unknown", "accuracy": 0.8968}, "description": "This model is a fine-tuned version of MCG-NJU/videomae-base on an unknown dataset. It is used for video classification tasks.", "model_name": "zahrav/videomae-base-finetuned-ucf101-subset"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "text2text-generation", "api_name": "tuner007/pegasus_summarizer", "api_call": "PegasusForConditionalGeneration.from_pretrained('tuner007/pegasus_summarizer')", "performance": {"dataset": "cnn_dailymail", "accuracy": {"ROUGE-1": 36.604, "ROUGE-2": 14.64, "ROUGE-L": 23.884, "ROUGE-LSUM": 32.902, "loss": 2.576, "gen_len": 76.398}}, "description": "PEGASUS fine-tuned for summarization", "model_name": "tuner007/pegasus_summarizer"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-base-finetuned-wtq", "api_call": "TapasForQuestionAnswering.from_pretrained('google/tapas-base-finetuned-wtq')", "performance": {"dataset": "wikitablequestions", "accuracy": 0.4638}, "description": "TAPAS base model fine-tuned on WikiTable Questions (WTQ). This model is pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion, and then fine-tuned on SQA, WikiSQL, and finally WTQ. It can be used for answering questions related to a table.", "model_name": "google/tapas-base-finetuned-wtq"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-en-it", "api_call": "pipeline('translation_en_to_it', model='Helsinki-NLP/opus-mt-en-it')", "performance": {"dataset": "opus", "accuracy": {"newssyscomb2009.en.it": {"BLEU": 30.9, "chr-F": 0.606}, "newstest2009.en.it": {"BLEU": 31.9, "chr-F": 0.604}, "Tatoeba.en.it": {"BLEU": 48.2, "chr-F": 0.6950000000000001}}}, "description": "A Transformer-based English to Italian translation model trained on the OPUS dataset. This model can be used for translation tasks using the Hugging Face Transformers library.", "model_name": "Helsinki-NLP/opus-mt-en-it"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Speech Emotion Recognition", "api_name": "harshit345/xlsr-wav2vec-speech-emotion-recognition", "api_call": "Wav2Vec2ForSpeechClassification.from_pretrained('harshit345/xlsr-wav2vec-speech-emotion-recognition')", "performance": {"dataset": "JTES v1.1", "accuracy": {"anger": 0.82, "disgust": 0.85, "fear": 0.78, "happiness": 0.84, "sadness": 0.86, "overall": 0.806}}, "description": "This model is trained on the JTES v1.1 dataset for speech emotion recognition. It uses the Wav2Vec2 architecture for audio classification and can recognize emotions like anger, disgust, fear, happiness, and sadness.", "model_name": "harshit345/xlsr-wav2vec-speech-emotion-recognition"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "valhalla/longformer-base-4096-finetuned-squadv1", "api_call": "AutoModelForQuestionAnswering.from_pretrained('valhalla/longformer-base-4096-finetuned-squadv1')", "performance": {"dataset": "squad_v1", "accuracy": {"Exact Match": 85.1466, "F1": 91.5415}}, "description": "This is the longformer-base-4096 model fine-tuned on the SQuAD v1 dataset for the question answering task. The Longformer model was created by Iz Beltagy, Matthew E. Peters, and Arman Cohan from AllenAI. As the paper explains, Longformer is a BERT-like model for long documents. The pre-trained model can handle sequences with up to 4096 tokens.", "model_name": "valhalla/longformer-base-4096-finetuned-squadv1"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "YituTech/conv-bert-base", "api_call": "AutoModel.from_pretrained('YituTech/conv-bert-base')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "A pre-trained ConvBERT model for feature extraction provided by YituTech, based on the Hugging Face Transformers library.", "model_name": "YituTech/conv-bert-base"}
{"domain": "Tabular Tabular Classification", "framework": "Scikit-learn", "functionality": "Classification", "api_name": "imodels/figs-compas-recidivism", "api_call": "joblib.load(cached_download(hf_hub_url('imodels/figs-compas-recidivism', 'sklearn_model.joblib')))", "performance": {"dataset": "imodels/compas-recidivism", "accuracy": 0.6759165485}, "description": "A tabular classification model for predicting recidivism using the COMPAS dataset. The model is an imodels.FIGSClassifier trained with Scikit-learn and can be used with the Hugging Face Inference API.", "model_name": "imodels/figs-compas-recidivism"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "layoutlmv2-base-uncased-finetuned-docvqa", "api_call": "AutoModelForDocumentQuestionAnswering.from_pretrained('tiennvcs/layoutlmv2-base-uncased-finetuned-docvqa')", "performance": {"dataset": "unknown", "accuracy": {"Loss": 1.194}}, "description": "This model is a fine-tuned version of microsoft/layoutlmv2-base-uncased on an unknown dataset.", "model_name": "tiennvcs/layoutlmv2-base-uncased-finetuned-docvqa"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/maskformer-swin-base-coco", "api_call": "MaskFormerForInstanceSegmentation.from_pretrained('facebook/maskformer-swin-base-coco')", "performance": {"dataset": "COCO", "accuracy": "Not provided"}, "description": "MaskFormer model trained on COCO panoptic segmentation (base-sized version, Swin backbone). It was introduced in the paper Per-Pixel Classification is Not All You Need for Semantic Segmentation and first released in this repository.", "model_name": "facebook/maskformer-swin-base-coco"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "layoutlmv2-base-uncased_finetuned_docvqa", "api_call": "AutoModel.from_pretrained('microsoft/layoutlmv2-base-uncased')", "performance": {"dataset": "None", "accuracy": {"Loss": 4.843}}, "description": "This model is a fine-tuned version of microsoft/layoutlmv2-base-uncased on the None dataset.", "model_name": "microsoft/layoutlmv2-base-uncased"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "superb/wav2vec2-base-superb-sid", "api_call": "pipeline('audio-classification', model='superb/wav2vec2-base-superb-sid')", "performance": {"dataset": "VoxCeleb1", "accuracy": 0.7518}, "description": "This is a ported version of S3PRL's Wav2Vec2 for the SUPERB Speaker Identification task. The base model is wav2vec2-base, which is pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. For more information, refer to SUPERB: Speech processing Universal PERformance Benchmark.", "model_name": "superb/wav2vec2-base-superb-sid"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Text Generation", "api_name": "bigscience/bloom-7b1", "api_call": "pipeline('text-generation', model='bigscience/bloom-7b1')", "performance": {"dataset": "Training Data", "accuracy": {"Training Loss": 2.3, "Validation Loss": 2.9, "Perplexity": 16}}, "description": "BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) is a transformer-based language model designed for text generation and as a pretrained base model for fine-tuning on specific tasks. It supports 48 languages and has 7,069,016,064 parameters. The model is trained on a diverse corpus containing 45 natural languages, 12 programming languages, and 1.5TB of pre-processed text.", "model_name": "bigscience/bloom-7b1"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Neural machine translation", "api_name": "opus-mt-tc-big-en-pt", "api_call": "MarianMTModel.from_pretrained('pytorch-models/opus-mt-tc-big-en-pt')", "performance": {"dataset": [{"name": "flores101-devtest", "accuracy": 50.4}, {"name": "tatoeba-test-v2021-08-07", "accuracy": 49.6}]}, "description": "Neural machine translation model for translating from English (en) to Portuguese (pt). This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world.", "model_name": "pytorch-models/opus-mt-tc-big-en-pt"}
{"domain": "Reinforcement Learning", "framework": "Stable-Baselines3", "functionality": "CartPole-v1", "api_name": "sb3/ppo-CartPole-v1", "api_call": "load_from_hub(repo_id='sb3/ppo-CartPole-v1',filename='{MODEL FILENAME}.zip',)", "performance": {"dataset": "CartPole-v1", "accuracy": "500.00 +/- 0.00"}, "description": "This is a trained model of a PPO agent playing CartPole-v1 using the stable-baselines3 library and the RL Zoo. The RL Zoo is a training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.", "model_name": "sb3/ppo-CartPole-v1"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "microsoft/resnet-50", "api_call": "ResNetForImageClassification.from_pretrained('microsoft/resnet-50')", "performance": {"dataset": "imagenet-1k", "accuracy": "~0.5% top1"}, "description": "ResNet-50 v1.5 is a pre-trained convolutional neural network for image classification on the ImageNet-1k dataset at resolution 224x224. It was introduced in the paper Deep Residual Learning for Image Recognition by He et al. ResNet (Residual Network) democratized the concepts of residual learning and skip connections, enabling the training of much deeper models. ResNet-50 v1.5 differs from the original model in the bottleneck blocks which require downsampling, v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. This difference makes ResNet50 v1.5 slightly more accurate but comes with a small performance drawback.", "model_name": "microsoft/resnet-50"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "text2text-generation", "api_name": "neulab/omnitab-large-1024shot-finetuned-wtq-1024shot", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('neulab/omnitab-large-1024shot-finetuned-wtq-1024shot')", "performance": {"dataset": "wikitablequestions", "accuracy": "Not provided"}, "description": "OmniTab is a table-based QA model proposed in OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering. The original Github repository is https://github.com/jzbjyb/OmniTab. neulab/omnitab-large-1024shot-finetuned-wtq-1024shot (based on BART architecture) is initialized with neulab/omnitab-large-1024shot and fine-tuned on WikiTableQuestions in the 1024-shot setting.", "model_name": "neulab/omnitab-large-1024shot-finetuned-wtq-1024shot"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face", "functionality": "Dialogue Response Generation", "api_name": "microsoft/DialoGPT-small", "api_call": "AutoModelForCausalLM.from_pretrained('microsoft/DialoGPT-small')", "performance": {"dataset": "Reddit discussion thread", "accuracy": "Comparable to human response quality under a single-turn conversation Turing test"}, "description": "DialoGPT is a state-of-the-art large-scale pretrained dialogue response generation model for multi-turn conversations. The model is trained on 147M multi-turn dialogues from Reddit discussion threads.", "model_name": "microsoft/DialoGPT-small"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "fastspeech2-en-male1", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/fastspeech2-en-200_speaker-cv4',arg_overrides={'vocoder': 'hifigan', 'fp16': False})", "performance": {"dataset": "common_voice", "accuracy": null}, "description": "FastSpeech 2 text-to-speech model from fairseq S^2. English, 200 male/female voices, trained on Common Voice v4.", "model_name": "facebook/fastspeech2-en-200_speaker-cv4"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "google/mobilenet_v1_0.75_192", "api_call": "AutoModelForImageClassification.from_pretrained('google/mobilenet_v1_0.75_192')", "performance": {"dataset": "imagenet-1k", "accuracy": "Not provided"}, "description": "MobileNet V1 model pre-trained on ImageNet-1k at resolution 192x192. It was introduced in MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications by Howard et al, and first released in this repository. MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases. They can be built upon for classification, detection, embeddings and segmentation similar to how other popular large scale models, such as Inception, are used. MobileNets can be run efficiently on mobile devices.", "model_name": "google/mobilenet_v1_0.75_192"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "wav2vec2", "api_name": "facebook/wav2vec2-large-960h-lv60-self", "api_call": "Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-large-960h-lv60-self')", "performance": {"dataset": "librispeech_asr", "accuracy": {"clean": 1.9, "other": 3.9}}, "description": "Facebook's Wav2Vec2 model pretrained and fine-tuned on 960 hours of Libri-Light and Librispeech on 16kHz sampled speech audio. The model was trained with Self-Training objective. The model is used for Automatic Speech Recognition and can be used as a standalone acoustic model.", "model_name": "facebook/wav2vec2-large-960h-lv60-self"}
{"domain": "Natural Language Processing Token Classification", "framework": "Transformers", "functionality": "Part-of-speech tagging", "api_name": "ckiplab/bert-base-chinese-pos", "api_call": "AutoModel.from_pretrained('ckiplab/bert-base-chinese-pos')", "performance": {"dataset": "", "accuracy": ""}, "description": "This project provides traditional Chinese transformers models (including ALBERT, BERT, GPT2) and NLP tools (including word segmentation, part-of-speech tagging, named entity recognition).", "model_name": "ckiplab/bert-base-chinese-pos"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "stabilityai/sd-vae-ft-ema", "api_call": "StableDiffusionPipeline.from_pretrained('CompVis/stable-diffusion-v1-4', vae=AutoencoderKL.from_pretrained('stabilityai/sd-vae-ft-ema'))", "performance": {"dataset": {"COCO 2017 (256x256, val, 5000 images)": {"accuracy": {"rFID": 4.42, "PSNR": "23.8 +/- 3.9", "SSIM": "0.69 +/- 0.13", "PSIM": "0.96 +/- 0.27"}}, "LAION-Aesthetics 5+ (256x256, subset, 10000 images)": {"accuracy": {"rFID": 1.77, "PSNR": "26.7 +/- 4.8", "SSIM": "0.82 +/- 0.12", "PSIM": "0.67 +/- 0.34"}}}}, "description": "This is a fine-tuned VAE decoder for the Stable Diffusion Pipeline. It has been fine-tuned on a 1:1 ratio of LAION-Aesthetics and LAION-Humans datasets. The decoder can be used as a drop-in replacement for the existing autoencoder.", "model_name": "CompVis/stable-diffusion-v1-4"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "vilt-finetuned-vqasi", "api_call": "ViltModel.from_pretrained('tufa15nik/vilt-finetuned-vqasi')", "performance": {"dataset": "", "accuracy": ""}, "description": "A Visual Question Answering model fine-tuned on the VQASI dataset by tufa15nik using the ViLT architecture. The model is designed to answer questions based on the content of an input image.", "model_name": "tufa15nik/vilt-finetuned-vqasi"}
{"domain": "Audio Text-to-Speech", "framework": "speechbrain", "functionality": "Text-to-Speech", "api_name": "tts-hifigan-german", "api_call": "HIFIGAN.from_hparams(source='padmalcom/tts-hifigan-german', savedir=tmpdir_vocoder)", "performance": {"dataset": "custom German dataset", "accuracy": "Not specified"}, "description": "A HiFi-GAN vocoder trained on a generated German dataset using mp3_to_training_data. The pre-trained model takes a spectrogram as input and produces a waveform as output. Typically, a vocoder is used after a TTS model that converts an input text into a spectrogram.", "model_name": "padmalcom/tts-hifigan-german"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Transformers", "functionality": "Table Question Answering", "api_name": "google/tapas-mini-finetuned-sqa", "api_call": "TapasForQuestionAnswering.from_pretrained('google/tapas-mini-finetuned-sqa')", "performance": {"dataset": "msr_sqa", "accuracy": 0.5148}, "description": "TAPAS mini model fine-tuned on Sequential Question Answering (SQA)", "model_name": "google/tapas-mini-finetuned-sqa"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/mask2former-swin-tiny-coco-instance", "api_call": "Mask2FormerForUniversalSegmentation.from_pretrained('facebook/mask2former-swin-tiny-coco-instance')", "performance": {"dataset": "COCO", "accuracy": "Not specified"}, "description": "Mask2Former model trained on COCO instance segmentation (tiny-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. This model addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. You can use this particular checkpoint for instance segmentation.", "model_name": "facebook/mask2former-swin-tiny-coco-instance"}
{"domain": "Natural Language Processing Conversational", "framework": "Hugging Face Transformers", "functionality": "text-generation", "api_name": "pygmalion-1.3b", "api_call": "pipeline('text-generation', 'PygmalionAI/pygmalion-1.3b')", "performance": {"dataset": "56MB of dialogue data", "accuracy": "Not provided"}, "description": "Pygmalion 1.3B is a proof-of-concept dialogue model based on EleutherAI's pythia-1.3b-deduped. It is designed for generating conversational responses and can be used with a specific input format that includes character persona, dialogue history, and user input message.", "model_name": "PygmalionAI/pygmalion-1.3b"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face", "functionality": "Zero-Shot Image Classification", "api_name": "laion/CLIP-convnext_base_w-laion_aesthetic-s13B-b82K", "api_call": "pipeline('image-classification', model='laion/CLIP-convnext_base_w-laion_aesthetic-s13B-b82K')", "performance": {"dataset": "ImageNet-1k", "accuracy": "70.8% to 71.7%"}, "description": "A series of CLIP ConvNeXt-Base (w/ wide embed dim) models trained on subsets of LAION-5B using OpenCLIP. These models achieve between 70.8% and 71.7% zero-shot top-1 accuracy on ImageNet-1k. They can be used for zero-shot image classification, image and text retrieval, and other tasks.", "model_name": "laion/CLIP-convnext_base_w-laion_aesthetic-s13B-b82K"}
{"domain": "Tabular Tabular Regression", "framework": "Scikit-learn", "functionality": "baseline-trainer", "api_name": "srg/outhimar_64-Close-regression", "api_call": "joblib.load(hf_hub_download('srg/outhimar_64-Close-regression', 'sklearn_model.joblib'))", "performance": {"dataset": "outhimar_64", "accuracy": {"r2": 0.9998579999999999, "neg_mean_squared_error": -1.067685}}, "description": "Baseline Model trained on outhimar_64 to apply regression on Close. Disclaimer: This model is trained with dabl library as a baseline, for better results, use AutoTrain. Logs of training including the models tried in the process can be found in logs.txt.", "model_name": "srg/outhimar_64-Close-regression"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Diffusion Models", "api_name": "lllyasviel/sd-controlnet-scribble", "api_call": "ControlNetModel.from_pretrained('lllyasviel/sd-controlnet-scribble')", "performance": {"dataset": "500k scribble-image, caption pairs", "accuracy": "Not provided"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on Scribble images. It can be used in combination with Stable Diffusion.", "model_name": "lllyasviel/sd-controlnet-scribble"}
{"domain": "Natural Language Processing Translation", "framework": "Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-it-en", "api_call": "pipeline('translation_it_to_en', model='Helsinki-NLP/opus-mt-it-en')", "performance": {"dataset": "opus", "accuracy": {"BLEU": {"newssyscomb2009.it.en": 35.3, "newstest2009.it.en": 34.0, "Tatoeba.it.en": 70.9}, "chr-F": {"newssyscomb2009.it.en": 0.6000000000000001, "newstest2009.it.en": 0.594, "Tatoeba.it.en": 0.808}}}, "description": "A transformer model for Italian to English translation trained on the OPUS dataset. It can be used for translating Italian text to English.", "model_name": "Helsinki-NLP/opus-mt-it-en"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "keremberke/yolov8s-pothole-segmentation", "api_call": "YOLO('keremberke/yolov8s-pothole-segmentation')", "performance": {"dataset": "pothole-segmentation", "accuracy": {"mAP@0.5(box)": 0.928, "mAP@0.5(mask)": 0.928}}, "description": "A YOLOv8 model for pothole segmentation. This model detects potholes in images and outputs bounding boxes and masks for the detected potholes.", "model_name": "keremberke/yolov8s-pothole-segmentation"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "wav2vec2-base-superb-sv", "api_call": "AutoModelForAudioXVector.from_pretrained('anton-l/wav2vec2-base-superb-sv')", "performance": {"dataset": "superb", "accuracy": "More information needed"}, "description": "This is a ported version of S3PRL's Wav2Vec2 for the SUPERB Speaker Verification task. The base model is wav2vec2-large-lv60, which is pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. For more information, refer to SUPERB: Speech processing Universal PERformance Benchmark.", "model_name": "anton-l/wav2vec2-base-superb-sv"}
{"domain": "Natural Language Processing Translation", "framework": "Transformers", "functionality": "Translation", "api_name": "opus-mt-fr-en", "api_call": "pipeline('translation_fr_to_en', model='Helsinki-NLP/opus-mt-fr-en')", "performance": {"dataset": "opus", "accuracy": {"BLEU": {"newsdiscussdev2015-enfr.fr.en": 33.1, "newsdiscusstest2015-enfr.fr.en": 38.7, "newssyscomb2009.fr.en": 30.3, "news-test2008.fr.en": 26.2, "newstest2009.fr.en": 30.2, "newstest2010.fr.en": 32.2, "newstest2011.fr.en": 33.0, "newstest2012.fr.en": 32.8, "newstest2013.fr.en": 33.9, "newstest2014-fren.fr.en": 37.8, "Tatoeba.fr.en": 57.5}}}, "description": "Helsinki-NLP/opus-mt-fr-en is a machine translation model trained to translate from French to English. It is based on the Marian NMT framework and trained on the OPUS dataset.", "model_name": "Helsinki-NLP/opus-mt-fr-en"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "shi-labs/oneformer_coco_swin_large", "api_call": "OneFormerForUniversalSegmentation.from_pretrained('shi-labs/oneformer_coco_swin_large')", "performance": {"dataset": "ydshieh/coco_dataset_script", "accuracy": "Not provided"}, "description": "OneFormer model trained on the COCO dataset (large-sized version, Swin backbone). It was introduced in the paper OneFormer: One Transformer to Rule Universal Image Segmentation by Jain et al. and first released in this repository. OneFormer is the first multi-task universal image segmentation framework. It needs to be trained only once with a single universal architecture, a single model, and on a single dataset, to outperform existing specialized models across semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference, all with a single model.", "model_name": "shi-labs/oneformer_coco_swin_large"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "lysandre/tiny-tapas-random-wtq", "api_call": "TapasForQuestionAnswering.from_pretrained('lysandre/tiny-tapas-random-wtq')", "performance": {"dataset": "WTQ", "accuracy": "Not provided"}, "description": "A tiny TAPAS model trained on the WikiTableQuestions dataset for table question answering tasks.", "model_name": "lysandre/tiny-tapas-random-wtq"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "dpt-large-redesign", "api_call": "AutoModelForDepthEstimation.from_pretrained('nielsr/dpt-large-redesign')", "performance": {"dataset": "", "accuracy": ""}, "description": "A depth estimation model based on the DPT architecture.", "model_name": "nielsr/dpt-large-redesign"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-nl-en", "api_call": "pipeline('translation_nl_to_en', model='Helsinki-NLP/opus-mt-nl-en')", "performance": {"dataset": "Tatoeba.nl.en", "accuracy": {"BLEU": 60.9, "chr-F": 0.749}}, "description": "A Dutch to English translation model based on the OPUS dataset, using a transformer-align architecture with normalization and SentencePiece pre-processing.", "model_name": "Helsinki-NLP/opus-mt-nl-en"}
{"domain": "Natural Language Processing Summarization", "framework": "Transformers", "functionality": "text2text-generation", "api_name": "csebuetnlp/mT5_multilingual_XLSum", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('csebuetnlp/mT5_multilingual_XLSum')", "performance": {"dataset": "xsum", "accuracy": {"ROUGE-1": 36.5, "ROUGE-2": 13.934, "ROUGE-L": 28.988, "ROUGE-LSUM": 28.996, "loss": 2.067, "gen_len": 26.973}}, "description": "This repository contains the mT5 checkpoint finetuned on the 45 languages of XL-Sum dataset. It is a multilingual abstractive summarization model that supports text-to-text generation for 43 languages.", "model_name": "csebuetnlp/mT5_multilingual_XLSum"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "vitouphy/wav2vec2-xls-r-300m-phoneme", "api_call": "Wav2Vec2ForCTC.from_pretrained('vitouphy/wav2vec2-xls-r-300m-phoneme')", "performance": {"dataset": "None", "accuracy": {"Loss": 0.3327, "Cer": 0.1332}}, "description": "This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the None dataset. It is designed for Automatic Speech Recognition tasks.", "model_name": "vitouphy/wav2vec2-xls-r-300m-phoneme"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "functionality": "Natural Language Inference", "api_name": "cointegrated/rubert-base-cased-nli-threeway", "api_call": "AutoModelForSequenceClassification.from_pretrained('cointegrated/rubert-base-cased-nli-threeway')", "performance": {"dataset": ["JOCI", "MNLI", "MPE", "SICK", "SNLI", "ANLI", "NLI-style FEVER", "IMPPRES"], "accuracy": {"ROC AUC": {"entailment": 0.91, "contradiction": 0.71, "neutral": 0.79}}}, "description": "This is the DeepPavlov/rubert-base-cased fine-tuned to predict the logical relationship between two short texts: entailment, contradiction, or neutral.", "model_name": "cointegrated/rubert-base-cased-nli-threeway"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "sberbank-ai/sbert_large_mt_nlu_ru", "api_call": "AutoModel.from_pretrained('sberbank-ai/sbert_large_mt_nlu_ru')", "performance": {"dataset": "Russian SuperGLUE", "accuracy": "Not provided"}, "description": "BERT large model multitask (cased) for Sentence Embeddings in Russian language.", "model_name": "sberbank-ai/sbert_large_mt_nlu_ru"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-Video", "api_name": "camenduru/text2-video-zero", "api_call": "pipeline('text-to-video', model='camenduru/text2-video-zero')", "performance": {"dataset": "", "accuracy": ""}, "description": "This model is used for generating videos from text inputs. It is based on the Hugging Face framework and can be used with the transformers library. The model is trained on a variety of text and video datasets, and can be used for tasks such as video summarization, video generation from text prompts, and more.", "model_name": "camenduru/text2-video-zero"}
{"domain": "Natural Language Processing Table Question Answering", "framework": "Hugging Face Transformers", "functionality": "Table Question Answering", "api_name": "navteca/tapas-large-finetuned-wtq", "api_call": "AutoModelForTableQuestionAnswering.from_pretrained('navteca/tapas-large-finetuned-wtq')", "performance": {"dataset": "wikisql", "accuracy": "Not provided"}, "description": "TAPAS large model fine-tuned on WikiTable Questions (WTQ). It is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion. It can be used for answering questions related to a table.", "model_name": "navteca/tapas-large-finetuned-wtq"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face", "functionality": "Visual Question Answering", "api_name": "sheldonxxxx/OFA_model_weights", "api_call": "AutoModel.from_pretrained('sheldonxxxx/OFA_model_weights')", "performance": {"dataset": "", "accuracy": ""}, "description": "This is an unofficial mirror of the model weights for use with https://github.com/OFA-Sys/OFA. The original link is too slow when downloading from outside of China.", "model_name": "sheldonxxxx/OFA_model_weights"}
{"domain": "Audio Voice Activity Detection", "framework": "Hugging Face", "functionality": "Voice Activity Detection", "api_name": "d4data/Indian-voice-cloning", "api_call": "pipeline('voice-activity-detection', model='d4data/Indian-voice-cloning')", "performance": {"dataset": "", "accuracy": ""}, "description": "A model for detecting voice activity in Indian languages.", "model_name": "d4data/Indian-voice-cloning"}
{"domain": "Tabular Tabular Classification", "framework": "Hugging Face", "functionality": "Carbon Emissions", "api_name": "Xinhhd/autotrain-zhongxin-contest-49402119333", "api_call": "AutoModel.from_pretrained('Xinhhd/autotrain-zhongxin-contest-49402119333')", "performance": {"dataset": "Xinhhd/autotrain-data-zhongxin-contest", "accuracy": 0.889}, "description": "A multi-class classification model trained with AutoTrain to predict carbon emissions based on input features.", "model_name": "Xinhhd/autotrain-zhongxin-contest-49402119333"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "SYSPIN/Telugu_Male_TTS", "api_call": "pipeline('text-to-speech', model='SYSPIN/Telugu_Male_TTS')", "performance": {"dataset": "", "accuracy": ""}, "description": "A Telugu Male Text-to-Speech model using the ESPnet framework, provided by Hugging Face.", "model_name": "SYSPIN/Telugu_Male_TTS"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "api_name": "m-a-p/MERT-v1-330M", "performance": {"dataset": "", "accuracy": null}, "description": "the development log of our music audio pre-training m-a-p model family: - 02/06/2023: and training released. - 17/03/2023: we release two advanced music understanding models, and , trained with new paradigm and dataset. they outperform the previous models and can better generalize to more tasks. - 14/03/2023: we retrained the mert-v0 model with open-source-only music dataset - 29/12/2022: a music understanding model trained with mlm paradigm, which performs better at downstream tasks. - 29/10/2022: a pre-trained mir model trained with byol paradigm. here is a table for quick model pick-up: name | pre-train paradigm | training data (hours) | pre-train context (seconds) | model size | transformer layer-dimension | feature rate | sample rate | release date; | mlm | 160k | 5 | 330m | 24-1024 | 75 hz | 24k hz | 17/03/2023; | mlm | 20k | 5 | 95m | 12-768 | 75 hz | 24k hz | 17/03/2023; | mlm | 900 | 5 | 95m | 12-768 | 50 hz | 16k hz | 14/03/2023; | mlm | 1000 | 5 | 95m | 12-768 | 50 hz | 16k hz | 29/12/2022; | byol | 1000 | 30 | 95m | 12-768 | 50 hz | 16k hz | 30/10/2022", "api_call": "", "model_name": "m-a-p/MERT-v1-330M"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "NbAiLab/nb-bert-base", "performance": {"dataset": "", "accuracy": null}, "description": "- release 1.1 march 11, 2021 - release 1.0 january 13, 2021 nb-bert-base is a general bert-base model built on the large digital collection at the national library of norway. this model is based on the same structure as , and is trained on a wide variety of norwegian text (both bokm\u00e5l and nynorsk) from the last 200 years.", "api_call": "", "model_name": "NbAiLab/nb-bert-base"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Generic", "api_name": "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", "performance": {"dataset": "", "accuracy": null}, "description": "this is an `xlm-roberta` fine-tuned to restore punctuation, true-case capitalize, and detect sentence boundaries full stops in 47 languages. if you want to just play with the model, the widget on this page will suffice. to use the model offline,", "api_call": "", "model_name": "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "api_name": "IlyaGusev/rut5_base_sum_gazeta", "performance": {"dataset": "", "accuracy": null}, "description": "this is the model for abstractive summarization for russian based on . colab: - dataset:", "api_call": "", "model_name": "IlyaGusev/rut5_base_sum_gazeta"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "uer/gpt2-chinese-lyric", "performance": {"dataset": "", "accuracy": null}, "description": "the model is pre-trained by , which is introduced in . besides, the model could also be pre-trained by introduced in , which inherits uer-py to support models with parameters above one billion, and extends it to a multimodal pre-training framework. the model is used to generate chinese lyrics. you can download the model from the , or , or via huggingface from the link you can use the model directly with a pipeline for text generation:", "api_call": "", "model_name": "uer/gpt2-chinese-lyric"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "almanach/camembert-bio-base", "performance": {"dataset": "", "accuracy": null}, "description": "camembert-bio is a state-of-the-art french biomedical language model built using continual-pretraining from . it was trained on a french public biomedical corpus of 413m words containing scientific documents, drug leaflets and clinical cases extracted from theses and articles. it shows 2.54 points of f1 score improvement on average on 5 different biomedical named entity recognition tasks compared to .", "api_call": "", "model_name": "almanach/camembert-bio-base"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-he", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: he\n*  OPUS readme: [en-he](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-he/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-he/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-he/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-he/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-he"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Human Pose Estimation", "api_name": "lllyasviel/sd-controlnet-openpose", "api_call": "ControlNetModel.from_pretrained('lllyasviel/sd-controlnet-openpose')", "performance": {"dataset": "200k pose-image, caption pairs", "accuracy": "Not specified"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on Human Pose Estimation. It can be used in combination with Stable Diffusion.", "model_name": "lllyasviel/sd-controlnet-openpose"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "dumitrescustefan/bert-base-romanian-cased-v1", "performance": {"dataset": "", "accuracy": null}, "description": "the bert base , cased model for romanian, trained on a 15gb corpus, version remember to always sanitize your text! replace ``s`` and ``t`` cedilla-letters to comma-letters with : because the model was not trained on cedilla ``s`` and ``t``s. if you don't, you will have decreased performance due to", "api_call": "", "model_name": "dumitrescustefan/bert-base-romanian-cased-v1"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "deepset/gbert-base-germandpr-question_encoder", "performance": {"dataset": "", "accuracy": null}, "description": "## Details\n- We trained a dense passage retrieval model with two gbert-base models as encoders of questions and passages.\n- The dataset is GermanDPR, a new, German language dataset, which we hand-annotated and published [online](https://deepset.ai/germanquad).\n- It comprises 9275 question/answer pairs in the training set and 1025 pairs in the test set.\nFor each pair, there are one positive context and three hard negative contexts.\n- As the basis of the training data, we used our hand-annotated GermanQuAD dataset as positive samples and generated hard negative samples from the latest German Wikipedia dump (6GB of raw txt files).\n- The data dump was cleaned with tailored scripts, leading to 2.8 million indexed passages from German Wikipedia.\n\nSee https://deepset.ai/germanquad for more details and dataset download.", "api_call": "", "model_name": "deepset/gbert-base-germandpr-question_encoder"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-ro", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: ro\n*  OPUS readme: [en-ro](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-ro/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-ro/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-ro/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-ro/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-ro"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "LeBenchmark/wav2vec2-FR-7K-large", "performance": {"dataset": "", "accuracy": null}, "description": "lebenchmark provides an ensemble of pretrained wav2vec2 models on different french datasets containing spontaneous, read, and broadcasted speech. it comes in 2 versions; the later version, lebenchmark 2.0, extends the first in both the number of pre-trained ssl models and the number of downstream tasks. for more information on the different benchmarks that can be used to evaluate the wav2vec2 models, please refer to our paper at: we release four different models that can be found under our huggingface organization. four different wav2vec2 architectures light, base, large and xlarge are coupled with our small 1k, medium 3k, large 7k, and extra large 14k corpus. in short:", "api_call": "", "model_name": "LeBenchmark/wav2vec2-FR-7K-large"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-ar-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: ar\n* target languages: en\n*  OPUS readme: [ar-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/ar-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ar-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-ar-en"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "api_name": "VoVanPhuc/sup-SimCSE-VietNamese-phobert-base", "performance": {"dataset": "", "accuracy": null}, "description": "Pre-trained SimeCSE_Vietnamese models are the state-of-the-art of Sentence Embeddings with Vietnamese : \n\n - SimeCSE_Vietnamese pre-training approach is based on [SimCSE](https://arxiv.org/abs/2104.08821) which optimizes the SimeCSE_Vietnamese pre-training procedure for more robust performance.\n - SimeCSE_Vietnamese encode input sentences using a pre-trained language model such as  [PhoBert](https://www.aclweb.org/anthology/2020.findings-emnlp.92/)\n - SimeCSE_Vietnamese works with both unlabeled and labeled data.", "api_call": "", "model_name": "VoVanPhuc/sup-SimCSE-VietNamese-phobert-base"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-sv-fi", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: sv\n* target languages: fi\n*  OPUS readme: [sv-fi](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/sv-fi/README.md)\n\n*  dataset: opus+bt\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus+bt-2020-04-07.zip](https://object.pouta.csc.fi/OPUS-MT-models/sv-fi/opus+bt-2020-04-07.zip)\n* test set translations: [opus+bt-2020-04-07.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/sv-fi/opus+bt-2020-04-07.test.txt)\n* test set scores: [opus+bt-2020-04-07.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/sv-fi/opus+bt-2020-04-07.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-sv-fi"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-eu-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: eu\n* target languages: en\n*  OPUS readme: [eu-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/eu-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/eu-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/eu-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/eu-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-eu-en"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Timm", "api_name": "timm/convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_320", "performance": {"dataset": "", "accuracy": null}, "description": "a convnext image classification model. clip image tower weights pretrained in on laion and fine-tuned on imagenet-12k followed by imagenet-1k in `timm` by ross wightman. please see related openclip model cards for more details on pretrain:", "api_call": "", "model_name": "timm/convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_320"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "api_name": "pierreguillou/bert-large-cased-squad-v1.1-portuguese", "performance": {"dataset": "", "accuracy": null}, "description": "the model was trained on the dataset squad v1.1 in portuguese from the . the language model used is the aka \"bert-large-portuguese-cased\" from : bertimbau is a pretrained bert model for brazilian portuguese that achieves state-of-the-art performances on three downstream nlp tasks: named entity recognition, sentence textual similarity and recognizing textual entailment. it is available in two sizes: base and large. all the information is in the blog post :", "api_call": "", "model_name": "pierreguillou/bert-large-cased-squad-v1.1-portuguese"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "princeton-nlp/unsup-simcse-bert-base-uncased", "performance": {"dataset": "", "accuracy": null}, "description": "more information needed - developed by: princeton nlp group - shared by optional: hugging face - model type: feature extraction - languages nlp: more information needed - license: more information needed - related models:  - parent model: bert - resources for more information:  - - -", "api_call": "", "model_name": "princeton-nlp/unsup-simcse-bert-base-uncased"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "openai/clip-vit-base-patch32", "api_call": "CLIPModel.from_pretrained('openai/clip-vit-base-patch32')", "performance": {"dataset": ["Food101", "CIFAR10", "CIFAR100", "Birdsnap", "SUN397", "Stanford Cars", "FGVC Aircraft", "VOC2007", "DTD", "Oxford-IIIT Pet dataset", "Caltech101", "Flowers102", "MNIST", "SVHN", "IIIT5K", "Hateful Memes", "SST-2", "UCF101", "Kinetics700", "Country211", "CLEVR Counting", "KITTI Distance", "STL-10", "RareAct", "Flickr30", "MSCOCO", "ImageNet", "ImageNet-A", "ImageNet-R", "ImageNet Sketch", "ObjectNet (ImageNet Overlap)", "Youtube-BB", "ImageNet-Vid"], "accuracy": "varies"}, "description": "The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner.", "model_name": "openai/clip-vit-base-patch32"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "microsoft/codereviewer", "performance": {"dataset": "", "accuracy": null}, "description": "codereviewer is a model pre-trained with code change and code review data to support code review tasks. zhiyu li, shuai lu, daya guo, nan duan, shailesh jannu, grant jenks, deep majumder, jared green, alexey svyatkovskiy, shengyu fu, neel sundaresan. if you use codereviewer, please consider citing the following paper:", "api_call": "", "model_name": "microsoft/codereviewer"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "dbddv01/gpt2-french-small", "performance": {"dataset": "", "accuracy": null}, "description": "a small french language model for french text generation and possibly more nlp tasks... introduction this french gpt2 model is based on openai gpt-2 small model.", "api_call": "", "model_name": "dbddv01/gpt2-french-small"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "glpn-kitti", "api_call": "GLPNForDepthEstimation.from_pretrained('vinvino02/glpn-kitti')", "performance": {"dataset": "KITTI", "accuracy": "Not provided"}, "description": "Global-Local Path Networks (GLPN) model trained on KITTI for monocular depth estimation. It was introduced in the paper Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth by Kim et al. and first released in this repository.", "model_name": "vinvino02/glpn-kitti"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Diffusers", "api_name": "timbrooks/instruct-pix2pix", "performance": {"dataset": "", "accuracy": null}, "description": "", "api_call": "", "model_name": "timbrooks/instruct-pix2pix"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "api_name": "LukasStankevicius/t5-base-lithuanian-news-summaries-175", "performance": {"dataset": "", "accuracy": null}, "description": "this is t5-base transformer model trained on lithuanian news summaries for 175 000 steps. it was created during the work . given the following article body from :", "api_call": "", "model_name": "LukasStankevicius/t5-base-lithuanian-news-summaries-175"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "functionality": "Transcription", "api_name": "openai/whisper-tiny.en", "api_call": "WhisperForConditionalGeneration.from_pretrained('openai/whisper-tiny.en')", "performance": {"dataset": "LibriSpeech (clean)", "accuracy": 8.437}, "description": "Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.", "model_name": "openai/whisper-tiny.en"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "ehsanaghaei/SecureBERT", "performance": {"dataset": "", "accuracy": null}, "description": "securebert is a roberta-based, domain-specific language model trained on a large cybersecurity-focused corpus. it is designed to represent and understand cybersecurity text more effectively than general-purpose models. was trained on extensive in-domain data crawled from diverse online resources. it has demonstrated strong performance in a range of cybersecurity nlp tasks. \ud83d\udc49 see the .", "api_call": "", "model_name": "ehsanaghaei/SecureBERT"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Text Summarization", "api_name": "facebook/bart-large-cnn", "api_call": "pipeline('summarization', model='facebook/bart-large-cnn')", "performance": {"dataset": "cnn_dailymail", "accuracy": {"ROUGE-1": 42.949, "ROUGE-2": 20.815, "ROUGE-L": 30.619, "ROUGE-LSUM": 40.038}}, "description": "BART (large-sized model), fine-tuned on CNN Daily Mail. BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering). This particular checkpoint has been fine-tuned on CNN Daily Mail, a large collection of text-summary pairs.", "model_name": "facebook/bart-large-cnn"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/multi-qa-MiniLM-L6-cos-v1", "api_call": "SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')", "performance": {"dataset": [{"name": "WikiAnswers", "accuracy": "77,427,422"}, {"name": "PAQ", "accuracy": "64,371,441"}, {"name": "Stack Exchange", "accuracy": "25,316,456"}, {"name": "MS MARCO", "accuracy": "17,579,773"}, {"name": "GOOAQ", "accuracy": "3,012,496"}, {"name": "Amazon-QA", "accuracy": "2,448,839"}, {"name": "Yahoo Answers", "accuracy": "1,198,260"}, {"name": "SearchQA", "accuracy": "582,261"}, {"name": "ELI5", "accuracy": "325,475"}, {"name": "Quora", "accuracy": "103,663"}, {"name": "Natural Questions (NQ)", "accuracy": "100,231"}, {"name": "SQuAD2.0", "accuracy": "87,599"}, {"name": "TriviaQA", "accuracy": "73,346"}]}, "description": "This is a sentence-transformers model that maps sentences & paragraphs to a 384-dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.", "model_name": "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"}
{"domain": "Computer Vision Image-to-Text", "framework": "Hugging Face Transformers", "api_name": "kpyu/video-blip-flan-t5-xl-ego4d", "performance": {"dataset": "", "accuracy": null}, "description": "videoblip model, leveraging with a large language model with 2.7 billion parameters as its llm backbone. videoblip is an augmented blip-2 that can handle videos. videoblip-opt uses off-the-shelf flan-t5 as the language model. it inherits the same risks and limitations from :", "api_call": "", "model_name": "kpyu/video-blip-flan-t5-xl-ego4d"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "MilaNLProc/feel-it-italian-emotion", "performance": {"dataset": "", "accuracy": null}, "description": "you can find the package that uses this model for emotion and sentiment classification it is meant to be a very simple interface over huggingface models. users should refer to the sentiment analysis is a common task to understand people's reactions online. still, we often need more nuanced information: is the post negative because the user is angry or because they are sad?", "api_call": "", "model_name": "MilaNLProc/feel-it-italian-emotion"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "api_name": "google/matcha-chartqa", "performance": {"dataset": "", "accuracy": null}, "description": "<img src=\"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/matcha_architecture.jpg\"\nalt=\"drawing\" width=\"600\"/>\n\nThis model is the MatCha model, fine-tuned on Chart2text-pew dataset. \n\n#  Table of Contents\n\n0. [TL;DR](#TL;DR)\n1. [Using the model](#using-the-model)\n2. [Contribution](#contribution)\n3. [Citation](#citation)\n\n# TL;DR\n\nThe abstract of the paper states that: \n> Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MATCHA (Math reasoning and Chart derendering pretraining) to enhance visual language models\u2019 capabilities jointly modeling charts/plots and language data. Specifically we propose several pretraining tasks that cover plot deconstruction and numerical reasoning which are the key capabilities in visual language modeling. We perform the MATCHA pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, MATCHA model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MATCHA pretraining transfers to domains such as screenshot, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MATCHA pretraining on broader visual language tasks.\n\n# Using the model \n\nYou should ask specific questions to the model in order to get consistent generations. 
Here we are asking the model whether the sum of values that are in a chart are greater than the largest value.\n\n```python\nfrom transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration\nimport requests\nfrom PIL import Image\n\nprocessor = Pix2StructProcessor.from_pretrained('google/matcha-chartqa')\nmodel = Pix2StructForConditionalGeneration.from_pretrained('google/matcha-chartqa')\n\nurl = \"https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/20294671002019.png\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\ninputs = processor(images=image, text=\"Is the sum of all 4 places greater than Laos?\", return_tensors=\"pt\")\npredictions = model.generate(**inputs, max_new_tokens=512)\nprint(processor.decode(predictions[0], skip_special_tokens=True))\n>>> No\n```\n\nTo run the predictions on GPU, simply add `.to(0)` when creating the model and when getting the inputs (`inputs = inputs.to(0)`)\n\n# Converting from T5x to huggingface\n\nYou can use the [`convert_pix2struct_checkpoint_to_pytorch.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pix2struct/convert_pix2struct_original_pytorch_to_hf.py) script as follows:\n```bash\npython convert_pix2struct_checkpoint_to_pytorch.py --t5x_checkpoint_path PATH_TO_T5X_CHECKPOINTS --pytorch_dump_path PATH_TO_SAVE --is_vqa\n```\nif you are converting a large model, run:\n```bash\npython convert_pix2struct_checkpoint_to_pytorch.py --t5x_checkpoint_path PATH_TO_T5X_CHECKPOINTS --pytorch_dump_path PATH_TO_SAVE --use-large --is_vqa\n```\nOnce saved, you can push your converted model with the following snippet:\n```python\nfrom transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor\n\nmodel = Pix2StructForConditionalGeneration.from_pretrained(PATH_TO_SAVE)\nprocessor = Pix2StructProcessor.from_pretrained(PATH_TO_SAVE)\n\nmodel.push_to_hub(\"USERNAME/MODEL_NAME\")\nprocessor.push_to_hub(\"USERNAME/MODEL_NAME\")\n```\n\n# 
Contribution\n\nThis model was originally contributed by Fangyu Liu, Francesco Piccinno et al. and added to the Hugging Face ecosystem by [Younes Belkada](https://huggingface.co/ybelkada).\n\n# Citation\n\nIf you want to cite this work, please consider citing the original paper:\n```\n@misc{liu2022matcha,\n      title={MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering}, \n      author={Fangyu Liu and Francesco Piccinno and Syrine Krichene and Chenxi Pang and Kenton Lee and Mandar Joshi and Yasemin Altun and Nigel Collier and Julian Martin Eisenschlos},\n      year={2022},\n      eprint={2212.09662},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```", "api_call": "", "model_name": "google/matcha-chartqa"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-af-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: af\n* target languages: en\n*  OPUS readme: [af-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/af-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/af-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-af-en"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "unicamp-dl/ptt5-base-portuguese-vocab", "performance": {"dataset": "", "accuracy": null}, "description": "ptt5 is a t5 model pretrained in the brwac corpus, a large collection of web pages in portuguese, improving t5's performance on portuguese sentence similarity and entailment tasks. it's available in three sizes small, base and large and two vocabularies google's t5 original and ours, trained on portuguese wikipedia. for further information or requests, please go to .", "api_call": "", "model_name": "unicamp-dl/ptt5-base-portuguese-vocab"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "indolem/indobertweet-base-uncased", "performance": {"dataset": "", "accuracy": null}, "description": "fajri koto, jey han lau, and timothy baldwin. . in proceedings of the 2021 conference on empirical methods in natural language processing emnlp 2021 , dominican republic virtual. is the first large-scale pretrained model for indonesian twitter", "api_call": "", "model_name": "indolem/indobertweet-base-uncased"}
{"domain": "Computer Vision Text-to-Video", "framework": "Hugging Face Transformers", "api_name": "Searchium-ai/clip4clip-webvid150k", "performance": {"dataset": "", "accuracy": null}, "description": "a clip4clip video-text retrieval model trained on a subset of the webvid dataset. the model and training method are described in the paper by luo et al., and implemented in the accompanying . the training process utilized the , a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.", "api_call": "", "model_name": "Searchium-ai/clip4clip-webvid150k"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "robotjung/SemiRealMix", "performance": {"dataset": "", "accuracy": null}, "description": "the result of many merges aimed at making semi-realistic human images. i use the following options to get good generation results: delicate, masterpiece, best shadow, 1 girl:1.3, korean girl:1.2, from side:1.2, from below:0.5, photorealistic:1.5, extremely detailed skin, studio, beige background, warm soft light, low contrast, head tilt", "api_call": "", "model_name": "robotjung/SemiRealMix"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Image Captioning", "api_name": "nlpconnect/vit-gpt2-image-captioning", "api_call": "VisionEncoderDecoderModel.from_pretrained('nlpconnect/vit-gpt2-image-captioning')", "performance": {"dataset": "Not provided", "accuracy": "Not provided"}, "description": "An image captioning model that uses transformers to generate captions for input images. The model is based on the Illustrated Image Captioning using transformers approach.", "model_name": "nlpconnect/vit-gpt2-image-captioning"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "nlpaueb/sec-bert-base", "performance": {"dataset": "", "accuracy": null}, "description": "sec-bert is a family of bert models for the financial domain, intended to assist financial nlp research and fintech applications. sec-bert consists of the following models: sec-bert-base this model: same architecture as bert-base trained on financial documents.", "api_call": "", "model_name": "nlpaueb/sec-bert-base"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es", "performance": {"dataset": "", "accuracy": null}, "description": "click to expand - - biomedical pretrained language model for spanish. this model is a model trained on a biomedical-clinical corpus in spanish collected from several sources.", "api_call": "", "model_name": "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-ur-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Urdu \n* target group: English \n*  OPUS readme: [urd-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/urd-eng/README.md)\n\n*  model: transformer-align\n* source language(s): urd\n* target language(s): eng\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/urd-eng/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/urd-eng/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/urd-eng/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-ur-en"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "yiyanghkust/finbert-esg-9-categories", "performance": {"dataset": "", "accuracy": null}, "description": "ESG analysis can help investors determine a business's long-term sustainability and identify associated risks. finbert-esg-9-categories is a FinBERT model fine-tuned on about 14,000 manually annotated sentences from firms' ESG reports and annual reports. It classifies a text into nine fine-grained ESG topics: Climate Change, Natural Capital, Pollution & Waste, Human Capital, Product Liability, Community Relations, Corporate Governance, Business Ethics & Values, and Non-ESG. It complements a related model that classifies a text into four coarse-grained ESG themes (E, S, G, or None). A detailed description of the nine topic definitions, examples for each topic, the training sample, and the model's performance is provided by the model authors.", "api_call": "", "model_name": "yiyanghkust/finbert-esg-9-categories"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "google/tapas-base", "performance": {"dataset": "", "accuracy": null}, "description": "This model has 2 versions which can be used. The latest version, which is the default one, corresponds to the `tapas_inter_masklm_base_reset` checkpoint. This model was pre-trained on MLM and an additional step which the authors call intermediate pre-training. It uses relative position embeddings by default (i.e. resetting the position index at every cell of the table). The other (non-default) version which can be used is the one with absolute position embeddings.\n\nTAPAS is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion. This means it was pretrained on the raw tables and associated texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:\n- Masked language modeling (MLM): taking a flattened table and associated context, the model randomly masks 15% of the words in the input, then runs the entire (partially masked) sequence through the model. The model then has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of a table and associated text.\n- Intermediate pre-training: to encourage numerical reasoning on tables, the authors additionally pre-trained the model by creating a balanced dataset of millions of syntactically created training examples. Here, the model must predict (classify) whether a sentence is supported or refuted by the contents of a table. The training examples are created based on synthetic as well as counterfactual statements.\n\nThis way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed or refuted by the contents of a table. Fine-tuning is done by adding one or more classification heads on top of the pre-trained model, and then jointly training these randomly initialized classification heads with the base model on a downstream task.", "api_call": "", "model_name": "google/tapas-base"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "bigscience/bloom", "performance": {"dataset": "", "accuracy": null}, "description": "BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text in 46 languages and 13 programming languages that is hardly distinguishable from text written by humans. BLOOM can also be instructed to perform text tasks it hasn't been explicitly trained for, by casting them as text generation tasks.", "api_call": "", "model_name": "bigscience/bloom"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "cointegrated/rut5-base-paraphraser", "performance": {"dataset": "", "accuracy": null}, "description": "This is a paraphraser for Russian sentences. It is recommended to use the model with the `encoder_no_repeat_ngram_size` argument.", "api_call": "", "model_name": "cointegrated/rut5-base-paraphraser"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Visual Question Answering", "api_name": "blip-vqa-base", "api_call": "BlipForQuestionAnswering.from_pretrained('Salesforce/blip-vqa-base')", "performance": {"dataset": "VQA", "accuracy": "+1.6% in VQA score"}, "description": "BLIP is a Vision-Language Pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks. It effectively utilizes noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. This model is trained on visual question answering with a base architecture (using ViT base backbone).", "model_name": "Salesforce/blip-vqa-base"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "api_name": "etalab-ia/camembert-base-squadFR-fquad-piaf", "performance": {"dataset": "", "accuracy": null}, "description": "French question-answering model, using CamemBERT base fine-tuned on a combination of three French Q&A datasets.", "api_call": "", "model_name": "etalab-ia/camembert-base-squadFR-fquad-piaf"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Diffusers", "api_name": "DionTimmer/controlnet_qrcode-control_v1p_sd15", "performance": {"dataset": "", "accuracy": null}, "description": "This repo holds the safetensors and diffusers versions of the QR code conditioned ControlNet for Stable Diffusion v1.5. The Stable Diffusion 2.1 version is marginally more effective, as it was developed to address my specific needs. However, this 1.5 version model was also trained on the same dataset for those who are using the older version. These models perform quite well in most cases, but please note that they are not 100% accurate. In some instances, the QR code shape might not come through as expected. You can increase the ControlNet weight to emphasize the QR code shape. However, be cautious as this might negatively impact the style of your output. To optimize for scanning, please generate your QR codes with correction mode 'H' (30%).", "api_call": "", "model_name": "DionTimmer/controlnet_qrcode-control_v1p_sd15"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "EleutherAI/gpt-j-6b", "performance": {"dataset": "", "accuracy": null}, "description": "GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. \"GPT-J\" refers to the class of model, while \"6B\" represents the number of trainable parameters.\n\n| Hyperparameter | Value |\n|----------------|-------|\n| n_parameters | 6053381344 |\n| n_layers | 28 (*) |\n| d_model | 4096 |\n| d_ff | 16384 |\n| n_heads | 16 |\n| d_head | 256 |\n| n_ctx | 2048 |\n| n_vocab | 50257/50400 (\u2020), same tokenizer as GPT-2/3 |\n| Positional encoding | Rotary Position Embedding (RoPE) |\n| RoPE dimensions | 64 |\n\n(*) Each layer consists of one feedforward block and one self attention block.\n(\u2020) Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer.\n\nThe model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.", "api_call": "", "model_name": "EleutherAI/gpt-j-6b"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment", "performance": {"dataset": "", "accuracy": null}, "description": "CAMeLBERT-DA SA Model is a Sentiment Analysis (SA) model that was built by fine-tuning the CAMeLBERT-DA model. For the fine-tuning, we used three datasets. Our fine-tuning procedure and the hyperparameters we used can be found in our paper, and our fine-tuning code is available separately.", "api_call": "", "model_name": "CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Summarization", "api_name": "google/pegasus-xsum", "api_call": "pipeline('summarization', model='google/pegasus-xsum')", "performance": {"dataset": [{"name": "xsum", "accuracy": {"ROUGE-1": 46.862, "ROUGE-2": 24.453, "ROUGE-L": 39.055, "ROUGE-LSUM": 39.099}}, {"name": "cnn_dailymail", "accuracy": {"ROUGE-1": 22.206, "ROUGE-2": 7.67, "ROUGE-L": 15.405, "ROUGE-LSUM": 19.218}}, {"name": "samsum", "accuracy": {"ROUGE-1": 21.81, "ROUGE-2": 4.253, "ROUGE-L": 17.447, "ROUGE-LSUM": 18.891}}]}, "description": "PEGASUS is a pre-trained model for abstractive summarization, developed by Google. It is based on the Transformer architecture and trained on both C4 and HugeNews datasets. The model is designed to extract gap sentences and generate summaries by stochastically sampling important sentences.", "model_name": "google/pegasus-xsum"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Diffusers", "api_name": "mfidabel/controlnet-segment-anything", "performance": {"dataset": "", "accuracy": null}, "description": "These are ControlNet weights trained on runwayml/stable-diffusion-v1-5 with a new type of conditioning. Example prompt: \"contemporary living room of a house\"; negative prompt: \"low quality\".", "api_call": "", "model_name": "mfidabel/controlnet-segment-anything"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Monocular Depth Estimation", "api_name": "Intel/dpt-large", "api_call": "DPTForDepthEstimation.from_pretrained('Intel/dpt-large')", "performance": {"dataset": "MIX 6", "accuracy": "10.82"}, "description": "Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation. Introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021). DPT uses the Vision Transformer (ViT) as backbone and adds a neck + head on top for monocular depth estimation.", "model_name": "Intel/dpt-large"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "naver-clova-ix/donut-base", "api_call": "AutoModel.from_pretrained('naver-clova-ix/donut-base')", "performance": {"dataset": "arxiv:2111.15664", "accuracy": "Not provided"}, "description": "Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a tensor of embeddings (of shape batch_size, seq_len, hidden_size), after which the decoder autoregressively generates text, conditioned on the encoding of the encoder.", "model_name": "naver-clova-ix/donut-base"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-eu", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: English \n* target group: Basque \n*  OPUS readme: [eng-eus](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-eus/README.md)\n\n*  model: transformer-align\n* source language(s): eng\n* target language(s): eus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-eus/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-eus/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-eus/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-eu"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Keras", "api_name": "google/maxim-s2-deraining-rain13k", "performance": {"dataset": "", "accuracy": null}, "description": "MAXIM model pre-trained for image deraining. It was introduced in a paper by Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Disclaimer: the team releasing MAXIM did not write a model card for this model, so this model card has been written by the Hugging Face team. MAXIM introduces a shared MLP-based backbone for different image processing tasks such as image deblurring, deraining, denoising, dehazing, low-light image enhancement, and retouching.", "api_call": "", "model_name": "google/maxim-s2-deraining-rain13k"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-hu", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: hu\n*  OPUS readme: [en-hu](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-hu/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-hu/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-hu/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-hu/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-hu"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "lambdalabs/dreambooth-avatar", "performance": {"dataset": "", "accuracy": null}, "description": "DreamBooth fine-tuning of Stable Diffusion v1.5.1 on the Avatar art style. This text-to-image Stable Diffusion model was trained with DreamBooth. Put in a text prompt and generate your own Avatar-style image! The base model is Stable Diffusion v1.5, trained using DreamBooth with 60 input images sized 512x512 displaying Avatar character images. The model learns to associate Avatar images with the style tokenized as 'avatarart style'. Prior preservation was used during training, with the class 'person', to avoid training bleeding into the representations for that class. Training ran on 2xA6000 GPUs for 700 steps with batch size 4 (a couple of hours), at a cost of about $4. Author: Eole Cervenka", "api_call": "", "model_name": "lambdalabs/dreambooth-avatar"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "wonrax/phobert-base-vietnamese-sentiment", "performance": {"dataset": "", "accuracy": null}, "description": "```python\nimport torch\nfrom transformers import AutoTokenizer, RobertaForSequenceClassification\n\nmodel = RobertaForSequenceClassification.from_pretrained(\"wonrax/phobert-base-vietnamese-sentiment\")\n\ntokenizer = AutoTokenizer.from_pretrained(\"wonrax/phobert-base-vietnamese-sentiment\", use_fast=False)\n\n# Just like PhoBERT: INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!\nsentence = '\u0110\u00e2y l\u00e0 m\u00f4_h\u00ecnh r\u1ea5t hay , ph\u00f9_h\u1ee3p v\u1edbi \u0111i\u1ec1u_ki\u1ec7n v\u00e0 nh\u01b0 c\u1ea7u c\u1ee7a nhi\u1ec1u ng\u01b0\u1eddi .'\n\n# Encode the sentence into a batch of one sequence of token ids\ninput_ids = torch.tensor([tokenizer.encode(sentence)])\n\nwith torch.no_grad():\n    out = model(input_ids)\n    print(out.logits.softmax(dim=-1).tolist())\n    # Output:\n    # [[0.002, 0.988, 0.01]]\n    #     ^      ^      ^\n    #    NEG    POS    NEU\n```", "api_call": "", "model_name": "wonrax/phobert-base-vietnamese-sentiment"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "SamLowe/roberta-base-go_emotions", "performance": {"dataset": "", "accuracy": null}, "description": "Model trained from roberta-base on the go_emotions dataset for multi-label classification. A version of this model in ONNX format (including an INT8 quantized ONNX version) is now available. These are faster for inference, especially for smaller batch sizes, massively reduce the size of the dependencies required for inference, make inference of the model more multi-platform, and in the case of the quantized version reduce the model file/download size by 75% whilst retaining almost all the accuracy if you only need inference.\n\ngo_emotions is based on Reddit data and has 28 labels. It is a multi-label dataset where one or multiple labels may apply for any given input text, hence this model is a multi-label classification model with 28 'probability' float outputs for any given input text. Typically a threshold of 0.5 is applied to the probabilities for the prediction for each label.\n\nAs provided in the notebook, evaluation of the multi-label output (the 28-dim output, binarized with a threshold of 0.5) using the dataset test split gives:\n- Accuracy: 0.474\n- Precision: 0.575\n- Recall: 0.396\n- F1: 0.450\n\nBut the metrics are more meaningful when measured per label, given the multi-label nature (each label is effectively an independent binary classification) and the fact that the labels are very unevenly represented in the dataset.\n\nWith a threshold of 0.5 applied to binarize the model outputs, as per the notebook, the metrics per label are:\n\n| label | accuracy | precision | recall | f1 | mcc | support | threshold |\n|---|---|---|---|---|---|---|---|\n| admiration | 0.946 | 0.725 | 0.675 | 0.699 | 0.670 | 504 | 0.5 |\n| amusement | 0.982 | 0.790 | 0.871 | 0.829 | 0.821 | 264 | 0.5 |\n| anger | 0.970 | 0.652 | 0.379 | 0.479 | 0.483 | 198 | 0.5 |\n| annoyance | 0.940 | 0.472 | 0.159 | 0.238 | 0.250 | 320 | 0.5 |\n| approval | 0.942 | 0.609 | 0.302 | 0.404 | 0.403 | 351 | 0.5 |\n| caring | 0.973 | 0.448 | 0.319 | 0.372 | 0.364 | 135 | 0.5 |\n| confusion | 0.972 | 0.500 | 0.431 | 0.463 | 0.450 | 153 | 0.5 |\n| curiosity | 0.950 | 0.537 | 0.356 | 0.428 | 0.412 | 284 | 0.5 |\n| desire | 0.987 | 0.630 | 0.410 | 0.496 | 0.502 | 83 | 0.5 |\n| disappointment | 0.974 | 0.625 | 0.199 | 0.302 | 0.343 | 151 | 0.5 |\n| disapproval | 0.950 | 0.494 | 0.307 | 0.379 | 0.365 | 267 | 0.5 |\n| disgust | 0.982 | 0.707 | 0.333 | 0.453 | 0.478 | 123 | 0.5 |\n| embarrassment | 0.994 | 0.750 | 0.243 | 0.367 | 0.425 | 37 | 0.5 |\n| excitement | 0.983 | 0.603 | 0.340 | 0.435 | 0.445 | 103 | 0.5 |\n| fear | 0.992 | 0.758 | 0.603 | 0.671 | 0.672 | 78 | 0.5 |\n| gratitude | 0.990 | 0.960 | 0.881 | 0.919 | 0.914 | 352 | 0.5 |\n| grief | 0.999 | 0.000 | 0.000 | 0.000 | 0.000 | 6 | 0.5 |\n| joy | 0.978 | 0.647 | 0.559 | 0.600 | 0.590 | 161 | 0.5 |\n| love | 0.982 | 0.773 | 0.832 | 0.802 | 0.793 | 238 | 0.5 |\n| nervousness | 0.996 | 0.600 | 0.130 | 0.214 | 0.278 | 23 | 0.5 |\n| optimism | 0.972 | 0.667 | 0.376 | 0.481 | 0.488 | 186 | 0.5 |\n| pride | 0.997 | 0.000 | 0.000 | 0.000 | 0.000 | 16 | 0.5 |\n| realization | 0.974 | 0.541 | 0.138 | 0.220 | 0.264 | 145 | 0.5 |\n| relief | 0.998 | 0.000 | 0.000 | 0.000 | 0.000 | 11 | 0.5 |\n| remorse | 0.991 | 0.553 | 0.750 | 0.636 | 0.640 | 56 | 0.5 |\n| sadness | 0.977 | 0.621 | 0.494 | 0.550 | 0.542 | 156 | 0.5 |\n| surprise | 0.981 | 0.750 | 0.404 | 0.525 | 0.542 | 141 | 0.5 |\n| neutral | 0.782 | 0.694 | 0.604 | 0.646 | 0.492 | 1787 | 0.5 |\n\nOptimizing the threshold per label for the one that gives the optimum F1 metrics gives slightly better metrics, sacrificing some precision for a greater gain in recall, hence to the benefit of F1 (how this was done is shown in the notebook):\n\n| label | accuracy | precision | recall | f1 | mcc | support | threshold |\n|---|---|---|---|---|---|---|---|\n| admiration | 0.940 | 0.651 | 0.776 | 0.708 | 0.678 | 504 | 0.25 |\n| amusement | 0.982 | 0.781 | 0.890 | 0.832 | 0.825 | 264 | 0.45 |\n| anger | 0.959 | 0.454 | 0.601 | 0.517 | 0.502 | 198 | 0.15 |\n| annoyance | 0.864 | 0.243 | 0.619 | 0.349 | 0.328 | 320 | 0.10 |\n| approval | 0.926 | 0.432 | 0.442 | 0.437 | 0.397 | 351 | 0.30 |\n| caring | 0.972 | 0.426 | 0.385 | 0.405 | 0.391 | 135 | 0.40 |\n| confusion | 0.974 | 0.548 | 0.412 | 0.470 | 0.462 | 153 | 0.55 |\n| curiosity | 0.943 | 0.473 | 0.711 | 0.568 | 0.552 | 284 | 0.25 |\n| desire | 0.985 | 0.518 | 0.530 | 0.524 | 0.516 | 83 | 0.25 |\n| disappointment | 0.974 | 0.562 | 0.298 | 0.390 | 0.398 | 151 | 0.40 |\n| disapproval | 0.941 | 0.414 | 0.468 | 0.439 | 0.409 | 267 | 0.30 |\n| disgust | 0.978 | 0.523 | 0.463 | 0.491 | 0.481 | 123 | 0.20 |\n| embarrassment | 0.994 | 0.567 | 0.459 | 0.507 | 0.507 | 37 | 0.10 |\n| excitement | 0.981 | 0.500 | 0.417 | 0.455 | 0.447 | 103 | 0.35 |\n| fear | 0.991 | 0.712 | 0.667 | 0.689 | 0.685 | 78 | 0.40 |\n| gratitude | 0.990 | 0.957 | 0.889 | 0.922 | 0.917 | 352 | 0.45 |\n| grief | 0.999 | 0.333 | 0.333 | 0.333 | 0.333 | 6 | 0.05 |\n| joy | 0.978 | 0.623 | 0.646 | 0.634 | 0.623 | 161 | 0.40 |\n| love | 0.982 | 0.740 | 0.899 | 0.812 | 0.807 | 238 | 0.25 |\n| nervousness | 0.996 | 0.571 | 0.348 | 0.432 | 0.444 | 23 | 0.25 |\n| optimism | 0.971 | 0.580 | 0.565 | 0.572 | 0.557 | 186 | 0.20 |\n| pride | 0.998 | 0.875 | 0.438 | 0.583 | 0.618 | 16 | 0.10 |\n| realization | 0.961 | 0.270 | 0.262 | 0.266 | 0.246 | 145 | 0.15 |\n| relief | 0.992 | 0.152 | 0.636 | 0.246 | 0.309 | 11 | 0.05 |\n| remorse | 0.991 | 0.541 | 0.946 | 0.688 | 0.712 | 56 | 0.10 |\n| sadness | 0.977 | 0.599 | 0.583 | 0.591 | 0.579 | 156 | 0.40 |\n| surprise | 0.977 | 0.543 | 0.674 | 0.601 | 0.593 | 141 | 0.15 |\n| neutral | 0.758 | 0.598 | 0.810 | 0.688 | 0.513 | 1787 | 0.25 |\n\nThis improves the overall metrics:\n- Precision: 0.542\n- Recall: 0.577\n- F1: 0.541\n\nOr if calculated weighted by the relative size of the support of each label:\n- Precision: 0.572\n- Recall: 0.677\n- F1: 0.611", "api_call": "", "model_name": "SamLowe/roberta-base-go_emotions"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "ControlNet", "api_name": "lllyasviel/control_v11p_sd15_lineart", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_lineart')", "performance": {"dataset": "ControlNet-1-1-preview", "accuracy": "Not provided"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on lineart images.", "model_name": "lllyasviel/control_v11p_sd15_lineart"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "TurkuNLP/bert-base-finnish-cased-v1", "performance": {"dataset": "", "accuracy": null}, "description": "**Release 1.0** (November 25, 2019)\n\nWe generally recommend the use of the cased model.\n\nPaper presenting Finnish BERT: [arXiv:1912.07076](https://arxiv.org/abs/1912.07076)", "api_call": "", "model_name": "TurkuNLP/bert-base-finnish-cased-v1"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "nbroad/ESG-BERT", "performance": {"dataset": "", "accuracy": null}, "description": "Domain-specific BERT model for text mining in sustainable investing.\n- Shared by (optional): Hugging Face\n- Model type: language model\n- Language(s) (NLP): en\n- License: more information needed\n- Parent model: BERT", "api_call": "", "model_name": "nbroad/ESG-BERT"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "nitrosocke/Ghibli-Diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "This is a fine-tuned Stable Diffusion model trained on images from modern anime feature films from Studio Ghibli. Use the tokens 'ghibli style' in your prompts for the effect.", "api_call": "", "model_name": "nitrosocke/Ghibli-Diffusion"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Nemo", "api_name": "nvidia/stt_ru_conformer_transducer_large", "performance": {"dataset": "", "accuracy": null}, "description": "This model transcribes speech into lowercase Cyrillic alphabet including space, and is trained on around 1636 hours of Russian speech data.\nIt is a non-autoregressive \"large\" variant of Conformer, with around 120 million parameters.\nSee the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.", "api_call": "", "model_name": "nvidia/stt_ru_conformer_transducer_large"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "kandinsky-community/kandinsky-2-1", "performance": {"dataset": "", "accuracy": null}, "description": "Kandinsky 2.1 inherits best practices from DALL-E 2 and latent diffusion while introducing some new ideas. It uses the CLIP model as a text and image encoder, and a diffusion image prior (mapping) between the latent spaces of the CLIP modalities. This approach increases the visual performance of the model and unveils new horizons in blending images and text-guided image manipulation.", "api_call": "", "model_name": "kandinsky-community/kandinsky-2-1"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "PORTULAN/albertina-ptbr", "performance": {"dataset": "", "accuracy": null}, "description": "This is the model card for Albertina-PT-BR, with 900M parameters, 24 layers and a hidden size of 1536. You may be interested in some of the other models in the Albertina encoder families. This model is distributed respecting the license granted by the data set on which it was trained, namely that it is \"available solely for academic research purposes, and you agreed not to use it for any commercial applications\".", "api_call": "", "model_name": "PORTULAN/albertina-ptbr"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "DGSpitzer/Cyberpunk-Anime-Diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "An AI model that generates cyberpunk anime characters! Based on a fine-tuned Waifu Diffusion V1.3 model with the Stable Diffusion V1.5 new VAE, trained in DreamBooth.", "api_call": "", "model_name": "DGSpitzer/Cyberpunk-Anime-Diffusion"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "git-large-textcaps", "api_call": "AutoModelForCausalLM.from_pretrained('microsoft/git-large-textcaps')", "performance": {"dataset": "TextCaps", "accuracy": "Refer to the paper"}, "description": "GIT (short for GenerativeImage2Text) model, large-sized version, fine-tuned on TextCaps. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. The model is trained using 'teacher forcing' on a lot of (image, text) pairs. The goal for the model is simply to predict the next text token, giving the image tokens and previous text tokens. This allows the model to be used for tasks like image and video captioning, visual question answering (VQA) on images and videos, and even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).", "model_name": "microsoft/git-large-textcaps"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "megagonlabs/transformers-ud-japanese-electra-base-ginza-510", "performance": {"dataset": "", "accuracy": null}, "description": "This is an ELECTRA model pretrained on approximately 200M Japanese sentences and then fine-tuned. The entire spaCy v3 model is distributed as a Python package from PyPI, along with a companion package that provides custom pipeline components to recognize the Japanese bunsetu-phrase structures.", "api_call": "", "model_name": "megagonlabs/transformers-ud-japanese-electra-base-ginza-510"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Sentiment Analysis", "api_name": "finiteautomata/beto-sentiment-analysis", "api_call": "pipeline('sentiment-analysis', model='finiteautomata/beto-sentiment-analysis')", "performance": {"dataset": "TASS 2020 corpus", "accuracy": ""}, "description": "Model trained on the TASS 2020 corpus (around 5k tweets) covering several dialects of Spanish. The base model is BETO, a BERT model trained on Spanish. Uses POS, NEG, NEU labels.", "model_name": "finiteautomata/beto-sentiment-analysis"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "lmsys/fastchat-t5-3b-v1.0", "performance": {"dataset": "", "accuracy": null}, "description": "FastChat-T5 is an open-source chatbot trained by fine-tuning Flan-T5-XL (3B parameters) on user-shared conversations collected from ShareGPT. It is based on an encoder-decoder transformer architecture, and can autoregressively generate responses to users' inputs.", "api_call": "", "model_name": "lmsys/fastchat-t5-3b-v1.0"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "api_name": "NbAiLab/nb-bert-base-mnli", "performance": {"dataset": "", "accuracy": null}, "description": "Release 1.0 (March 11, 2021). The most effective way of creating a good classifier is to fine-tune a pre-trained model for the specific task at hand. However, in many cases this is simply impossible. A very clever approach has been proposed that uses pre-trained MNLI models as zero-shot sequence classifiers. The method works by reformulating the question as an MNLI hypothesis. If we want to figure out if a text is about \"sport\", we simply state that \"This text is about sport\" (\"Denne teksten handler om sport\"). When the model is fine-tuned on the 400k-example MNLI task, it is in many cases able to solve these classification tasks. There is no MNLI set of this size in Norwegian, but we have trained the model on a machine-translated version of the original MNLI set.", "api_call": "", "model_name": "NbAiLab/nb-bert-base-mnli"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "staka/fugumt-en-ja", "performance": {"dataset": "", "accuracy": null}, "description": "This is a translation model using Marian-NMT.\nFor more details, please see [my repository](https://github.com/s-taka/fugumt).\n\n* source language: en\n* target language: ja \n\n### How to use\n\nThis model uses transformers and sentencepiece.\n```python\n!pip install transformers sentencepiece\n```\n\nYou can use this model directly with a pipeline:\n```python\nfrom transformers import pipeline\nfugu_translator = pipeline('translation', model='staka/fugumt-en-ja')\nfugu_translator('This is a cat.')\n```\n\nIf you want to translate multiple sentences, we recommend using [pySBD](https://github.com/nipunsadvilkar/pySBD).\n```python\n!pip install transformers sentencepiece pysbd\n\nimport pysbd\nseg_en = pysbd.Segmenter(language=\"en\", clean=False)\n\nfrom transformers import pipeline\nfugu_translator = pipeline('translation', model='staka/fugumt-en-ja')\ntxt = 'This is a cat. It is very cute.'\nprint(fugu_translator(seg_en.segment(txt)))\n```\n\n\n### Eval results\n\nThe results of the evaluation using [tatoeba](https://tatoeba.org/ja)(randomly selected 500 sentences) are as follows:\n\n|source |target |BLEU(*1)| \n|-------|-------|--------|\n|en     |ja     |32.7    |\n\n(*1) sacrebleu --tokenize ja-mecab", "api_call": "", "model_name": "staka/fugumt-en-ja"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Sentence-Transformers", "api_name": "Voicelab/sbert-large-cased-pl", "performance": {"dataset": "", "accuracy": null}, "description": "sentencebert is a modification of the pretrained bert network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. training was based on the original paper with a slight modification of how the training data was used. the goal of the model is to generate different embeddings based on the semantic and topic similarity of the given text. semantic textual similarity analyzes how similar two pieces of text are. read more about how the model was prepared in our .", "api_call": "", "model_name": "Voicelab/sbert-large-cased-pl"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "daspartho/prompt-extend", "performance": {"dataset": "", "accuracy": null}, "description": "text generation model for generating suitable style cues given the main idea for a prompt. it is a gpt-2 model trained on of stable diffusion prompts. the following hyperparameters were used during training:", "api_call": "", "model_name": "daspartho/prompt-extend"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "api_name": "aiautomationlab/german-news-title-gen-mt5", "performance": {"dataset": "", "accuracy": null}, "description": "this is a model for the task of news headline generation in german. while this task is very similar to summarization, there remain differences like length, structure, and language style, which cause state-of-the-art summarization models not to be suited best for headline generation and demand further fine tuning on this task. for this model, by google is used as a foundation model.", "api_call": "", "model_name": "aiautomationlab/german-news-title-gen-mt5"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "dbmdz/bert-base-german-uncased", "performance": {"dataset": "", "accuracy": null}, "description": "in this repository the mdz digital library team dbmdz at the bavarian state library open sources another german bert models \ud83c\udf89 in addition to the recently released", "api_call": "", "model_name": "dbmdz/bert-base-german-uncased"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "cffl/bert-base-styleclassification-subjective-neutral", "performance": {"dataset": "", "accuracy": null}, "description": "this model has been fine-tuned on the - a parallel corpus of 180,000 biased and neutralized sentence pairs along with contextual sentences and metadata. the model can be used to classify text as subjectively biased vs. neutrally toned. the development and modeling efforts that produced this model are documented in detail through . the model is intended purely as a research output for nlp and data science communities. we developed this model for the purpose of evaluating text style transfer output. specifically, we derive a style transfer intensity (sti) metric from the classifier's output distributions. we also extract feature importances from the model via with support a content preservation score (cps).", "api_call": "", "model_name": "cffl/bert-base-styleclassification-subjective-neutral"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-bg-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: bg\n* target languages: en\n*  OPUS readme: [bg-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/bg-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/bg-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-bg-en"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "lambdalabs/sd-pokemon-diffusers", "performance": {"dataset": "", "accuracy": null}, "description": "stable diffusion fine tuned on pok\u00e9mon by . put in a text prompt and generate your own pok\u00e9mon character, no \"prompt engineering\" required! if you want to find out how to train your own stable diffusion variants, see this from lambda labs. trained on using 2xa6000 gpus on for around 15,000 steps, about 6 hours, at a cost of about $10.", "api_call": "", "model_name": "lambdalabs/sd-pokemon-diffusers"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "api_name": "google/roberta2roberta_L-24_cnn_daily_mail", "performance": {"dataset": "", "accuracy": null}, "description": "the model was introduced in by sascha rothe, shashi narayan, aliaksei severyn and first released in . the model is an encoder-decoder model that was initialized on the `roberta-large` checkpoints for both the encoder", "api_call": "", "model_name": "google/roberta2roberta_L-24_cnn_daily_mail"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "rinna/japanese-roberta-base", "performance": {"dataset": "", "accuracy": null}, "description": "this repository provides a base-sized japanese roberta model. the model was trained using code from github repository by from transformers import autotokenizer, automodelformaskedlm", "api_call": "", "model_name": "rinna/japanese-roberta-base"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "ml6team/mt5-small-german-query-generation", "performance": {"dataset": "", "accuracy": null}, "description": "this model was created to generate possible queries for a german input article. for this model, we finetuned a multilingual t5 model on the machine translated version of the ms marco dataset. the model was trained for 1 epoch, on 200,000 unique queries of the dataset. we trained the model on one k80 gpu for 25,000 iterations with the following parameters: - learning rate: 1e-3 - train batch size: 8 - max input sequence length: 512 - max target sequence length: 64", "api_call": "", "model_name": "ml6team/mt5-small-german-query-generation"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-tl", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: tl\n*  OPUS readme: [en-tl](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-tl/README.md)\n\n*  dataset: opus+bt\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus+bt-2020-02-26.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-tl/opus+bt-2020-02-26.zip)\n* test set translations: [opus+bt-2020-02-26.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-tl/opus+bt-2020-02-26.test.txt)\n* test set scores: [opus+bt-2020-02-26.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-tl/opus+bt-2020-02-26.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-tl"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "EMBEDDIA/sloberta", "performance": {"dataset": "", "accuracy": null}, "description": "load in transformers library with: sloberta model is a monolingual slovene bert-like model. it is closely related to french camembert model the corpora used for training the model have 3.47 billion tokens in total. the subword vocabulary contains 32,000 tokens. the scripts and programs used for data preparation and training the model are available on sloberta was trained for 200,000 iterations or about 98 epochs.", "api_call": "", "model_name": "EMBEDDIA/sloberta"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-fr-ru", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: fr\n* target languages: ru\n*  OPUS readme: [fr-ru](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/fr-ru/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-24.zip](https://object.pouta.csc.fi/OPUS-MT-models/fr-ru/opus-2020-01-24.zip)\n* test set translations: [opus-2020-01-24.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/fr-ru/opus-2020-01-24.test.txt)\n* test set scores: [opus-2020-01-24.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/fr-ru/opus-2020-01-24.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-fr-ru"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "nghuyong/ernie-3.0-base-zh", "performance": {"dataset": "", "accuracy": null}, "description": "ernie 3.0: large-scale knowledge enhanced pre-training for language understanding and generation more detail: this released pytorch model is converted from the officially released paddlepaddle ernie model and", "api_call": "", "model_name": "nghuyong/ernie-3.0-base-zh"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "api_name": "osiria/deberta-italian-question-answering", "performance": {"dataset": "", "accuracy": null}, "description": "You can also try the model online using this web app: https://huggingface.co/spaces/osiria/deberta-italian-question-answering\n\n<h3>References</h3>\n\n[1] https://arxiv.org/abs/2111.09543\n\n[2] https://link.springer.com/chapter/10.1007/978-3-030-03840-3_29\n\n<h3>Limitations</h3>\n\nThis model was trained on the English SQuAD v2 and on SQuAD-IT, which is mainly a machine translated version of the original SQuAD v1.1. This means that the quality of the training set is limited by the machine translation.\nMoreover, the model is meant to answer questions under the assumption that the required information is actually contained in the given context (which is the underlying assumption of SQuAD v1.1). \nIf the assumption is violated, the model will try to return an answer in any case, which is going to be incorrect.\n\n<h3>License</h3>\n\nThe model is released under <b>MIT</b> license", "api_call": "", "model_name": "osiria/deberta-italian-question-answering"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/table-transformer-structure-recognition", "api_call": "pipeline('object-detection', model='microsoft/table-transformer-structure-recognition')", "performance": {"dataset": "PubTables1M", "accuracy": ""}, "description": "Table Transformer (DETR) model trained on PubTables1M for detecting the structure (like rows, columns) in tables.", "model_name": "microsoft/table-transformer-structure-recognition"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "biu-nlp/abstract-sim-sentence", "performance": {"dataset": "", "accuracy": null}, "description": "a model for mapping abstract sentence descriptions to sentences that fit the descriptions. trained on wikipedia. use to load the query and sentence encoder, and to encode a sentence with the model. note : the method uses a dual encoder architecture. this is the sentence encoder ; it should be used alongside the . usage example:", "api_call": "", "model_name": "biu-nlp/abstract-sim-sentence"}
{"domain": "Natural Language Processing Text Generation", "framework": "Transformers", "functionality": "Text Generation", "api_name": "distilgpt2", "api_call": "pipeline('text-generation', model='distilgpt2')", "performance": {"dataset": "WikiText-103", "accuracy": "21.100"}, "description": "DistilGPT2 is an English-language model pre-trained with the supervision of the 124 million parameter version of GPT-2. With 82 million parameters, it was developed using knowledge distillation and designed to be a faster, lighter version of GPT-2. It can be used for text generation, writing assistance, creative writing, entertainment, and more.", "model_name": "distilgpt2"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-uk-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: uk\n* target languages: en\n*  OPUS readme: [uk-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/uk-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-16.zip](https://object.pouta.csc.fi/OPUS-MT-models/uk-en/opus-2020-01-16.zip)\n* test set translations: [opus-2020-01-16.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/uk-en/opus-2020-01-16.test.txt)\n* test set scores: [opus-2020-01-16.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/uk-en/opus-2020-01-16.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-uk-en"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "api_name": "facebook/mask2former-swin-large-mapillary-vistas-semantic", "performance": {"dataset": "", "accuracy": null}, "description": "mask2former model trained on mapillary vistas semantic segmentation large-sized version, swin backbone. it was introduced in the paper and first released in . disclaimer: the team releasing mask2former did not write a model card for this model so this model card has been written by the hugging face team. mask2former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. hence, all 3 tasks are treated as if they were instance segmentation. mask2former outperforms the previous sota, both in terms of performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention transformer, (ii) adopting a transformer decoder with masked attention to boost performance without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.", "api_call": "", "model_name": "facebook/mask2former-swin-large-mapillary-vistas-semantic"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "Gustavosta/MagicPrompt-Dalle", "performance": {"dataset": "", "accuracy": null}, "description": "this is a model from the magicprompt series of models, which are models intended to generate prompt texts for imaging ais, in this case: . this model was trained with a set of about 26k of data filtered and extracted from various places such as: , and . this may be a relatively small dataset, but we have to consider that dall-e 2 is a closed service and we only have prompts from people who share it and have access to the service, for now. the set was trained with about 40,000 steps and i have plans to improve the model if possible. if you want to test the model with a demo, you can go to: \"\".", "api_call": "", "model_name": "Gustavosta/MagicPrompt-Dalle"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "api_name": "ixa-ehu/SciBERT-SQuAD-QuAC", "performance": {"dataset": "", "accuracy": null}, "description": "this is the fine tuned for question answering. scibert is a pre-trained language model based on bert that has been trained on a large corpus of scientific text. when fine tuning for question answering we combined and datasets. if using this model, please cite the following paper:", "api_call": "", "model_name": "ixa-ehu/SciBERT-SQuAD-QuAC"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Feature Extraction", "api_name": "facebook/dpr-question_encoder-single-nq-base", "api_call": "DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')", "performance": {"dataset": [{"name": "NQ", "accuracy": {"top_20": 78.4, "top_100": 85.4}}, {"name": "TriviaQA", "accuracy": {"top_20": 79.4, "top_100": 85.0}}, {"name": "WQ", "accuracy": {"top_20": 73.2, "top_100": 81.4}}, {"name": "TREC", "accuracy": {"top_20": 79.8, "top_100": 89.1}}, {"name": "SQuAD", "accuracy": {"top_20": 63.2, "top_100": 77.2}}]}, "description": "Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. dpr-question_encoder-single-nq-base is the question encoder trained using the Natural Questions (NQ) dataset (Lee et al., 2019; Kwiatkowski et al., 2019).", "model_name": "facebook/dpr-question_encoder-single-nq-base"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-sv-en", "api_call": "AutoModel.from_pretrained('Helsinki-NLP/opus-mt-sv-en')", "performance": {"dataset": "Tatoeba.sv.en", "accuracy": "BLEU: 64.5, chr-F: 0.763"}, "description": "A Swedish to English translation model trained on the OPUS dataset using the transformer-align architecture. The model is pre-processed with normalization and SentencePiece.", "model_name": "Helsinki-NLP/opus-mt-sv-en"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "cointegrated/rubert-tiny-sentiment-balanced", "performance": {"dataset": "", "accuracy": null}, "description": "this is the model fine-tuned for classification of sentiment for short russian texts. the problem is formulated as multiclass classification: `negative` vs `neutral` vs `positive`. the function below estimates the sentiment of the given text:", "api_call": "", "model_name": "cointegrated/rubert-tiny-sentiment-balanced"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "medalpaca/medalpaca-7b", "performance": {"dataset": "", "accuracy": null}, "description": "## Table of Contents\n\n[Model Description](#model-description)  \n- [Architecture](#architecture)    \n- [Training Data](#trainig-data)  \n[Model Usage](#model-usage)  \n[Limitations](#limitations)", "api_call": "", "model_name": "medalpaca/medalpaca-7b"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "xyn-ai/anything-v4.0", "performance": {"dataset": "", "accuracy": null}, "description": "fantasy.ai is the official and exclusive hosted ai generation platform that holds a commercial use license for anything v4.0, you can use their service at please report any unauthorized commercial use.", "api_call": "", "model_name": "xyn-ai/anything-v4.0"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "dandelin/vilt-b32-finetuned-vqa", "api_call": "ViltForQuestionAnswering.from_pretrained('dandelin/vilt-b32-finetuned-vqa')", "performance": {"dataset": "VQAv2", "accuracy": "to do"}, "description": "Vision-and-Language Transformer (ViLT) model fine-tuned on VQAv2. It was introduced in the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Kim et al. and first released in this repository.", "model_name": "dandelin/vilt-b32-finetuned-vqa"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Diffusers", "api_name": "lllyasviel/control_v11f1e_sd15_tile", "performance": {"dataset": "", "accuracy": null}, "description": "controlnet v1.1 was released in by . this checkpoint is a conversion of into `diffusers` format. it can be used in combination with stable diffusion , such as . controlnet was proposed in by lvmin zhang, maneesh agrawala. the abstract reads as follows:  we present a neural network structure, controlnet, to control pretrained large diffusion models to support additional input conditions. the controlnet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small trained with canny edge detection a monochrome image with white edges on a black background.   trained with pixel to pixel instruction no condition .   trained with image inpainting no condition.   trained with multi-level line segment detection an image with annotated line segments.   trained with depth estimation an image with depth information, usually represented as a grayscale image.   trained with surface normal estimation an image with surface normal information, usually represented as a color-coded image.   trained with image segmentation an image with segmented regions, usually represented as a color-coded image.   trained with line art generation an image with line art, usually black lines on a white background.   trained with anime line art generation an image with anime-style line art.   trained with human pose estimation an image with human poses, usually represented as a set of keypoints or skeletons.   trained with scribble-based image generation an image with scribbles, usually random or user-drawn strokes.   trained with soft edge image generation an image with soft edges, usually to create a more painterly or artistic effect.   trained with image shuffling an image with shuffled patches or regions.   
trained with image tiling a blurry image or part of an image .", "api_call": "", "model_name": "lllyasviel/control_v11f1e_sd15_tile"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "ElKulako/cryptobert", "performance": {"dataset": "", "accuracy": null}, "description": "for academic reference, cite the following paper: cryptobert is a pre-trained nlp model to analyse the language and sentiments of cryptocurrency-related social media posts and messages. it was built by further training the language model on the cryptocurrency domain, using a corpus of over 3.2m unique cryptocurrency-related social media posts. a research paper with more details will follow soon.", "api_call": "", "model_name": "ElKulako/cryptobert"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "moussaKam/barthez-orangesum-abstract", "api_call": "BarthezModel.from_pretrained('moussaKam/barthez-orangesum-abstract')", "performance": {"dataset": "orangeSum", "accuracy": ""}, "description": "Barthez model finetuned on orangeSum for abstract generation in French language", "model_name": "moussaKam/barthez-orangesum-abstract"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Speechbrain", "api_name": "speechbrain/sepformer-libri3mix", "performance": {"dataset": "", "accuracy": null}, "description": "this repository provides all the necessary tools to perform audio source separation with a model, implemented with speechbrain, and pretrained on libri3mix dataset. for a better experience we encourage you to learn more about . the model performance is 19.8 db si-snri on the test set of libri3mix dataset.", "api_call": "", "model_name": "speechbrain/sepformer-libri3mix"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "api_name": "google/pix2struct-ai2d-base", "performance": {"dataset": "", "accuracy": null}, "description": "![model_image](https://s3.amazonaws.com/moonup/production/uploads/1678713353867-62441d1d9fdefb55a0b7d12c.png)\n\n#  Table of Contents\n\n0. [TL;DR](#TL;DR)\n1. [Using the model](#using-the-model)\n2. [Contribution](#contribution)\n3. [Citation](#citation)\n\n# TL;DR\n\nPix2Struct is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captionning and visual question answering. The full list of available models can be found on the Table 1 of the paper:\n\n![Table 1 - paper](https://s3.amazonaws.com/moonup/production/uploads/1678712985040-62441d1d9fdefb55a0b7d12c.png)\n\n\nThe abstract of the model states that: \n> Visually-situated language is ubiquitous\u2014sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and\nforms. Perhaps due to this diversity, previous work has typically relied on domainspecific recipes with limited sharing of the underlying data, model architectures,\nand objectives. We present Pix2Struct, a pretrained image-to-text model for\npurely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse\nmasked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large\nsource of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. 
In addition to the novel pretraining strategy,\nwe introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions\nare rendered directly on top of the input image. For the first time, we show that a\nsingle pretrained model can achieve state-of-the-art results in six out of nine tasks\nacross four domains: documents, illustrations, user interfaces, and natural images.\n\n# Using the model \n\nThis model has been fine-tuned on VQA, you need to provide a question in a specific format, ideally in the format of a Choices question answering", "api_call": "", "model_name": "google/pix2struct-ai2d-base"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "prompthero/openjourney", "api_call": "StableDiffusionPipeline.from_pretrained('prompthero/openjourney', torch_dtype=torch.float16)", "performance": {"dataset": "Midjourney images", "accuracy": "Not specified"}, "description": "Openjourney is an open source Stable Diffusion fine-tuned model on Midjourney images, by PromptHero. It can be used for generating AI art based on text prompts.", "model_name": "prompthero/openjourney"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "api_name": "IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese", "performance": {"dataset": "", "accuracy": null}, "description": "- Main Page:[Fengshenbang](https://fengshenbang-lm.com/)\n- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM)", "api_call": "", "model_name": "IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "MonoHime/rubert-base-cased-sentiment-new", "performance": {"dataset": "", "accuracy": null}, "description": "russian texts sentiment classification. - developed by: tatyana voloshina - shared by optional: tatyana voloshina - model type: text classification - languages nlp: more information needed - license: more information needed - parent model: bert - resources for more information:  -", "api_call": "", "model_name": "MonoHime/rubert-base-cased-sentiment-new"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Transformers", "functionality": "Summarization", "api_name": "pszemraj/long-t5-tglobal-base-16384-book-summary", "api_call": "T5ForConditionalGeneration.from_pretrained('pszemraj/long-t5-tglobal-base-16384-book-summary')", "performance": {"dataset": "kmfoda/booksum", "accuracy": {"ROUGE-1": 36.408, "ROUGE-2": 6.065, "ROUGE-L": 16.721, "ROUGE-LSUM": 33.34}}, "description": "A fine-tuned version of google/long-t5-tglobal-base on the kmfoda/booksum dataset, which can be used to summarize long text and generate SparkNotes-esque summaries of arbitrary topics. The model generalizes reasonably well to academic and narrative text.", "model_name": "pszemraj/long-t5-tglobal-base-16384-book-summary"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "dreamlike-art/dreamlike-diffusion-1.0", "api_call": "StableDiffusionPipeline.from_pretrained('dreamlike-art/dreamlike-diffusion-1.0', torch_dtype=torch.float16)", "performance": {"dataset": "high quality art", "accuracy": "not provided"}, "description": "Dreamlike Diffusion 1.0 is SD 1.5 fine tuned on high quality art, made by dreamlike.art.", "model_name": "dreamlike-art/dreamlike-diffusion-1.0"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "api_name": "google/pix2struct-docvqa-base", "performance": {"dataset": "", "accuracy": null}, "description": "![model_image](https://s3.amazonaws.com/moonup/production/uploads/1678713353867-62441d1d9fdefb55a0b7d12c.png)\n\n# TL;DR\n\nPix2Struct is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The full list of available models can be found in Table 1 of the paper:\n\n![Table 1 - paper](https://s3.amazonaws.com/moonup/production/uploads/1678712985040-62441d1d9fdefb55a0b7d12c.png)\n\nThe abstract of the paper states that:\n> Visually-situated language is ubiquitous\u2014sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.\n\n# Using the model", "api_call": "", "model_name": "google/pix2struct-docvqa-base"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Sentence-Transformers", "api_name": "DMetaSoul/sbert-chinese-general-v2", "performance": {"dataset": "", "accuracy": null}, "description": "This model is based on the [bert-base-chinese](https://huggingface.co/bert-base-chinese) BERT model and was trained on [SimCLUE](https://github.com/CLUEbenchmark/SimCLUE), a semantic-similarity dataset with millions of examples. It is suited to **general-purpose semantic matching** scenarios, and in terms of results it shows **better generalization** across a variety of tasks.\n\nNote: a [lightweight version](https://huggingface.co/DMetaSoul/sbert-chinese-general-v2-distill) of this model has also been open-sourced.\n\n# Usage", "api_call": "", "model_name": "DMetaSoul/sbert-chinese-general-v2"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "0xJustin/Dungeons-and-Diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "for the new version, download 'd&diffusion3.0 protogen.ckpt'. the newest version is finetuned from protogen to great effect. it also works great at resolutions greater than 512x512! species in the new version: aarakocra, aasimar, air genasi, centaur, dragonborn, drow, dwarf, earth genasi, elf, firbolg, fire genasi, gith, gnome, goblin, goliath, halfling, human, illithid, kenku, kobold, lizardfolk, minotaur, orc, tabaxi, thrikreen, tiefling, tortle, warforged, water genasi", "api_call": "", "model_name": "0xJustin/Dungeons-and-Diffusion"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Sentence-Transformers", "api_name": "uer/sbert-base-chinese-nli", "performance": {"dataset": "", "accuracy": null}, "description": "this is the sentence embedding model pre-trained by , which is introduced in . besides, the model could also be pre-trained by introduced in , which inherits uer-py to support models with parameters above one billion, and extends it to a multimodal pre-training framework. you can use this model to extract sentence embeddings for sentence similarity task. we use cosine distance to calculate the embedding similarity here: is used as training data.", "api_call": "", "model_name": "uer/sbert-base-chinese-nli"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-sn-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: sn\n* target languages: en\n*  OPUS readme: [sn-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/sn-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-16.zip](https://object.pouta.csc.fi/OPUS-MT-models/sn-en/opus-2020-01-16.zip)\n* test set translations: [opus-2020-01-16.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/sn-en/opus-2020-01-16.test.txt)\n* test set scores: [opus-2020-01-16.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/sn-en/opus-2020-01-16.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-sn-en"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-ceb-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Cebuano \n* target group: English \n*  OPUS readme: [ceb-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/ceb-eng/README.md)\n\n*  model: transformer-align\n* source language(s): ceb\n* target language(s): eng\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/ceb-eng/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/ceb-eng/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/ceb-eng/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-ceb-en"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "cmarkea/distilcamembert-base-sentiment", "performance": {"dataset": "", "accuracy": null}, "description": "we present distilcamembert-sentiment, which is fine-tuned for the sentiment analysis task for the french language. this model is built using two datasets, amazon reviews and allocin\u00e9, to minimize the bias. indeed, amazon reviews are similar in messages and relatively short, contrary to allocin\u00e9 reviews, which are long and rich texts.", "api_call": "", "model_name": "cmarkea/distilcamembert-base-sentiment"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "nghuyong/ernie-health-zh", "performance": {"dataset": "", "accuracy": null}, "description": "ernie-health is a chinese biomedical language model pre-trained from in-domain text of de-identified online doctor-patient dialogues, electronic medical records, and textbooks.", "api_call": "", "model_name": "nghuyong/ernie-health-zh"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "cardiffnlp/twitter-roberta-base-irony", "performance": {"dataset": "", "accuracy": null}, "description": "this is a roberta-base model trained on 58m tweets and finetuned for irony detection with the tweeteval benchmark. this model has integrated into the . - paper: .", "api_call": "", "model_name": "cardiffnlp/twitter-roberta-base-irony"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-zh-en", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-zh-en')", "performance": {"dataset": "opus", "accuracy": {"BLEU": 36.1, "chr-F": 0.548}}, "description": "A Chinese to English translation model developed by the Language Technology Research Group at the University of Helsinki. It is based on the Marian NMT framework and trained on the OPUS dataset.", "model_name": "Helsinki-NLP/opus-mt-zh-en"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "api_name": "Sahajtomar/German_Zeroshot", "performance": {"dataset": "", "accuracy": null}, "description": "this model has as base model and fine-tuned it on xnli de dataset. the default hypothesis template is in english: `this text is `. when using this model, change it to \"in diesem geht es um .\" or something different. inference through the huggingface api may give poor results because it uses the english template by default. since the model is monolingual and not multilingual, the hypothesis template needs to be changed accordingly. accuracy: 85.5", "api_call": "", "model_name": "Sahajtomar/German_Zeroshot"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "nitrosocke/Arcane-Diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "this is the fine-tuned stable diffusion model trained on images from the tv show arcane. use the tokens arcane style in your prompts for the effect. if you enjoy my work, please consider supporting me", "api_call": "", "model_name": "nitrosocke/Arcane-Diffusion"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tl-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Tagalog \n* target group: English \n*  OPUS readme: [tgl-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/tgl-eng/README.md)\n\n*  model: transformer-align\n* source language(s): tgl_Latn\n* target language(s): eng\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/tgl-eng/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/tgl-eng/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/tgl-eng/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tl-en"}
{"domain": "Audio Text-to-Audio", "framework": "Hugging Face Transformers", "api_name": "facebook/musicgen-small", "performance": {"dataset": "", "accuracy": null}, "description": "musicgen is a text-to-music model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. it is a single stage auto-regressive transformer model trained over a 32khz encodec tokenizer with 4 codebooks sampled at 50 hz. unlike existing methods, like musiclm, musicgen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass.", "api_call": "", "model_name": "facebook/musicgen-small"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "Kirili4ik/mbart_ruDialogSum", "api_call": "MBartForConditionalGeneration.from_pretrained('Kirili4ik/mbart_ruDialogSum')", "performance": {"dataset": [{"name": "SAMSum Corpus (translated to Russian)", "accuracy": {"Validation ROUGE-1": 34.5, "Validation ROUGE-L": 33, "Test ROUGE-1": 31, "Test ROUGE-L": 28}}]}, "description": "MBart for Russian summarization, fine-tuned for dialogue summarization. This model was first fine-tuned by Ilya Gusev on the Gazeta dataset. We then fine-tuned that model on the SamSum dataset translated to Russian using GoogleTranslateAPI. We have also implemented a Telegram bot, @summarization_bot, with the inference of this model. Add it to the chat and get summaries instead of dozens of spam messages!", "model_name": "Kirili4ik/mbart_ruDialogSum"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/mask2former-swin-base-coco-panoptic", "api_call": "Mask2FormerForUniversalSegmentation.from_pretrained('facebook/mask2former-swin-base-coco-panoptic')", "performance": {"dataset": "COCO panoptic segmentation", "accuracy": null}, "description": "Mask2Former model trained on COCO panoptic segmentation (base-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, in terms of both performance and efficiency.", "model_name": "facebook/mask2former-swin-base-coco-panoptic"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "smanjil/German-MedBERT", "performance": {"dataset": "", "accuracy": null}, "description": "this is a model fine-tuned on the medical domain for the german language and based on german bert. it has only been trained on the masked language modeling objective, to improve results on target tasks. it can later be used for a downstream task of your choice; i used it for the nts-icd-10 text classification task. language model: bert-base-german-cased. language: german.", "api_call": "", "model_name": "smanjil/German-MedBERT"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-da-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: da\n* target languages: en\n*  OPUS readme: [da-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/da-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-da-en"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "deepset/gbert-base", "performance": {"dataset": "", "accuracy": null}, "description": "released, oct 2020, this is a german bert language model trained collaboratively by the makers of the original german bert aka \"bert-base-german-cased\" and the dbmdz bert aka bert-base-german-dbmdz-cased. in our , we outline the steps taken to train our model and show that it outperforms its predecessors. paper: architecture: bert base", "api_call": "", "model_name": "deepset/gbert-base"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "api_name": "microsoft/conditional-detr-resnet-50", "performance": {"dataset": "", "accuracy": null}, "description": "conditional detection transformer detr model trained end-to-end on coco 2017 object detection 118k annotated images. it was introduced in the paper by meng et al. and first released in . the recently-developed detr approach applies the transformer encoder and decoder architecture to object detection and achieves promising performance. in this paper, we handle the critical issue, slow training convergence, and present a conditional cross-attention mechanism for fast detr training. our approach is motivated by that the cross-attention in detr relies highly on the content embeddings for localizing the four extremities and predicting the box, which increases the need for high-quality content embeddings and thus the training difficulty. our approach, named conditional detr, learns a conditional spatial query from the decoder embedding for decoder multi-head cross-attention. the benefit is that through the conditional spatial query, each cross-attention head is able to attend to a band containing a distinct region, e.g., one object extremity or a region inside the object box. this narrows down the spatial range for localizing the distinct regions for object classification and box regression, thus relaxing the dependence on the content embeddings and easing the training. empirical results show that conditional detr converges 6.7\u00d7 faster for the backbones r50 and r101 and 10\u00d7 faster for stronger backbones dc5-r50 and dc5-r101. you can use the raw model for object detection. see the to look for all available conditional detr models.", "api_call": "", "model_name": "microsoft/conditional-detr-resnet-50"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "valhalla/t5-base-qg-hl", "performance": {"dataset": "", "accuracy": null}, "description": "this is a model trained for the answer-aware question generation task. the answer spans are highlighted within the text with special highlight tokens. you can play with the model using the inference api, just highlight the answer spans with `` tokens and end the text with ``. for example ` 42 is the answer to life, the universe and everything. `", "api_call": "", "model_name": "valhalla/t5-base-qg-hl"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "microsoft/deberta-large-mnli", "performance": {"dataset": "", "accuracy": null}, "description": "improves the bert and roberta models using disentangled attention and enhanced mask decoder. it outperforms bert and roberta on majority of nlu tasks with 80gb training data. please check the for more details and updates. this is the deberta large model fine-tuned with mnli task.", "api_call": "", "model_name": "microsoft/deberta-large-mnli"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "hakurei/waifu-diffusion", "api_call": "StableDiffusionPipeline.from_pretrained('hakurei/waifu-diffusion', torch_dtype=torch.float32)", "performance": {"dataset": "high-quality anime images", "accuracy": "not available"}, "description": "waifu-diffusion is a latent text-to-image diffusion model that has been conditioned on high-quality anime images through fine-tuning.", "model_name": "hakurei/waifu-diffusion"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-ca", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: ca\n*  OPUS readme: [en-ca](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-ca/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-ca/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-ca/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-ca/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-ca"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Keras", "api_name": "google/maxim-s2-dehazing-sots-outdoor", "performance": {"dataset": "", "accuracy": null}, "description": "maxim model pre-trained for image dehazing. it was introduced in the paper by zhengzhong tu, hossein talebi, han zhang, feng yang, peyman milanfar, alan bovik, yinxiao li and first released in . disclaimer: the team releasing maxim did not write a model card for this model so this model card has been written by the hugging face team. maxim introduces a shared mlp-based backbone for different image processing tasks such as image deblurring, deraining, denoising, dehazing, low-light image enhancement, and retouching. the following figure depicts the main components of maxim:", "api_call": "", "model_name": "google/maxim-s2-dehazing-sots-outdoor"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1", "performance": {"dataset": "", "accuracy": null}, "description": "- Main Page:[Fengshenbang](https://fengshenbang-lm.com/)\n- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM)", "api_call": "", "model_name": "IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tc-big-en-pt", "performance": {"dataset": "", "accuracy": null}, "description": "neural machine translation model for translating from english en to portuguese pt. this model is part of the , an effort to make neural machine translation models widely available and accessible for many languages in the world. all models are originally trained using the amazing framework of , an efficient nmt implementation written in pure c++. the models have been converted to pytorch using the transformers library by huggingface. training data is taken from and training pipelines use the procedures of . publications: and please, cite if you use this model.", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tc-big-en-pt"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "bigcode/tiny_starcoder_py", "performance": {"dataset": "", "accuracy": null}, "description": "this is a 164m parameters model with the same architecture as 8k context length, mqa & fim. it was trained on the python data from for 6 epochs which amounts to 100b tokens. the model was trained on github code, to assist with some tasks like . for pure code completion, we advise using our 15b models or .", "api_call": "", "model_name": "bigcode/tiny_starcoder_py"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "snrspeaks/t5-one-line-summary", "performance": {"dataset": "", "accuracy": null}, "description": "a t5 model trained on 370,000 research papers, to generate one line summary based on description/abstract of the papers. it is trained using library - a python package built on top of pytorch lightning\u26a1\ufe0f & transformers\ud83e\udd17 to quickly train t5 models", "api_call": "", "model_name": "snrspeaks/t5-one-line-summary"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "alexandrainst/da-hatespeech-detection-small", "performance": {"dataset": "", "accuracy": null}, "description": "the electra offensive model detects whether a danish text is offensive or not. it is based on the pretrained model. see the for more details.", "api_call": "", "model_name": "alexandrainst/da-hatespeech-detection-small"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "lodestones/P.A.W.F.E.C.T-Alpha", "performance": {"dataset": "", "accuracy": null}, "description": "diffusion model trained on 500k image-tags pairs scraped from furaffinity. alpha still, expect more epochs, more training data and overall better results in the future. \"anthro, fox, male, general, by 100racs\" epoch 24, no inpainting the tags contain the original fa tag list with tags appearing less than 40 times in total omitted, plus a tag corresponding to the general/mature/adult rating. if the artist also appears more than 40 times, an artist tag is added as well. the full list of tags and their number of occurrences are available . training was done on tpuv3s using the lion optimizer.", "api_call": "", "model_name": "lodestones/P.A.W.F.E.C.T-Alpha"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "law-ai/InLegalBERT", "performance": {"dataset": "", "accuracy": null}, "description": "model and tokenizer files for the inlegalbert model from the paper . for building the pre-training corpus of indian legal text, we collected a large corpus of case documents from the indian supreme court and many high courts of india. the court cases in our dataset range from 1950 to 2019, and belong to all legal domains, such as civil, criminal, constitutional, and so on.", "api_call": "", "model_name": "law-ai/InLegalBERT"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tc-big-en-ar", "performance": {"dataset": "", "accuracy": null}, "description": "neural machine translation model for translating from english en to arabic ar. this model is part of the , an effort to make neural machine translation models widely available and accessible for many languages in the world. all models are originally trained using the amazing framework of , an efficient nmt implementation written in pure c++. the models have been converted to pytorch using the transformers library by huggingface. training data is taken from and training pipelines use the procedures of . publications: and please, cite if you use this model.", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tc-big-en-ar"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "PyTorch Transformers", "functionality": "Feature Extraction", "api_name": "kobart-base-v2", "api_call": "BartModel.from_pretrained('gogamza/kobart-base-v2')", "performance": {"dataset": "NSMC", "accuracy": 0.901}, "description": "KoBART is a Korean encoder-decoder language model trained on over 40GB of Korean text using the BART architecture. It can be used for feature extraction and has been trained on a variety of data sources, including Korean Wiki, news, books, and more.", "model_name": "gogamza/kobart-base-v2"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "facebook/tts_transformer-zh-cv7_css10", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/tts_transformer-zh-cv7_css10', arg_overrides={'vocoder': 'hifigan', 'fp16': False})", "performance": {"dataset": "common_voice", "accuracy": "Not provided"}, "description": "Transformer text-to-speech model from fairseq S^2. Simplified Chinese, Single-speaker female voice, Pre-trained on Common Voice v7, fine-tuned on CSS10.", "model_name": "facebook/tts_transformer-zh-cv7_css10"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "api_name": "Rifky/Indobert-QA", "performance": {"dataset": "", "accuracy": null}, "description": "notice of attribution clarification: this is to clarify that muhammad fajrin buyang daffa is not, and has never been, a part of this project. they have made no contributions to this repository, and as such, shall not be given any attribution in relation to this work. for further inquiries, please contact rifky@genta.tech.", "api_call": "", "model_name": "Rifky/Indobert-QA"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Summarization", "api_name": "it5-base-news-summarization", "api_call": "pipeline('summarization', model='it5/it5-base-news-summarization')", "performance": {"dataset": "NewsSum-IT", "accuracy": {"Rouge1": 0.339, "Rouge2": 0.16, "RougeL": 0.263}}, "description": "IT5 Base model fine-tuned on news summarization on the Fanpage and Il Post corpora for Italian Language Understanding and Generation.", "model_name": "it5/it5-base-news-summarization"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-sq-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: sq\n* target languages: en\n*  OPUS readme: [sq-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/sq-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-16.zip](https://object.pouta.csc.fi/OPUS-MT-models/sq-en/opus-2020-01-16.zip)\n* test set translations: [opus-2020-01-16.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/sq-en/opus-2020-01-16.test.txt)\n* test set scores: [opus-2020-01-16.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/sq-en/opus-2020-01-16.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-sq-en"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "mgp-str", "api_call": "MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')", "performance": {"dataset": "MJSynth and SynthText", "accuracy": null}, "description": "MGP-STR is a pure vision Scene Text Recognition (STR) model, consisting of ViT and specially designed A^3 modules. It is trained on MJSynth and SynthText datasets and can be used for optical character recognition (OCR) on text images.", "model_name": "alibaba-damo/mgp-str-base"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "mrm8488/t5-base-finetuned-emotion", "performance": {"dataset": "", "accuracy": null}, "description": "base fine-tuned on dataset for emotion recognition downstream task. the t5 model was presented in by colin raffel, noam shazeer, adam roberts, katherine lee, sharan narang, michael matena, yanqi zhou, wei li, peter j. liu in here the abstract: transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing nlp. the effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. in this paper, we explore the landscape of transfer learning techniques for nlp by introducing a unified framework that converts every language problem into a text-to-text format. our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. by combining the insights from our exploration with scale and our new \u201ccolossal clean crawled corpus\u201d, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. to facilitate future work on transfer learning for nlp, we release our dataset, pre-trained models, and code.", "api_call": "", "model_name": "mrm8488/t5-base-finetuned-emotion"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "Salesforce/codegen25-7b-multi", "performance": {"dataset": "", "accuracy": null}, "description": "codegen2.5 is a family of autoregressive language models for program synthesis (authors include yingbo zhou and caiming xiong). building upon codegen2, the model is trained on starcoderdata for 1.4t tokens, achieving competitive results compared to starcoderbase-15.5b with less than half the size. like codegen2, this model is capable of infilling, and supports multiple programming languages. we then further train on python, then on instruction data. we release all the models as follows: codegen2.5-7b-multi (this repo): trained on starcoderdata, licensed under apache-2.0. codegen2.5-7b-mono: further trained on additional python tokens, licensed under apache-2.0. codegen2.5-7b-instruct: further trained from codegen2.5-7b-mono on instruction data, for research purposes only.", "api_call": "", "model_name": "Salesforce/codegen25-7b-multi"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Age Classification", "api_name": "nateraw/vit-age-classifier", "api_call": "ViTForImageClassification.from_pretrained('nateraw/vit-age-classifier')", "performance": {"dataset": "fairface", "accuracy": null}, "description": "A vision transformer finetuned to classify the age of a given person's face.", "model_name": "nateraw/vit-age-classifier"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "openai/clip-vit-large-patch14", "api_call": "CLIPModel.from_pretrained('openai/clip-vit-large-patch14')", "performance": {"dataset": ["Food101", "CIFAR10", "CIFAR100", "Birdsnap", "SUN397", "Stanford Cars", "FGVC Aircraft", "VOC2007", "DTD", "Oxford-IIIT Pet dataset", "Caltech101", "Flowers102", "MNIST", "SVHN", "IIIT5K", "Hateful Memes", "SST-2", "UCF101", "Kinetics700", "Country211", "CLEVR Counting", "KITTI Distance", "STL-10", "RareAct", "Flickr30", "MSCOCO", "ImageNet", "ImageNet-A", "ImageNet-R", "ImageNet Sketch", "ObjectNet (ImageNet Overlap)", "Youtube-BB", "ImageNet-Vid"], "accuracy": "varies depending on the dataset"}, "description": "The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner.", "model_name": "openai/clip-vit-large-patch14"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "snunlp/KR-FinBert-SC", "performance": {"dataset": "", "accuracy": null}, "description": "much progress has been made in the natural language processing (nlp) field, with numerous studies showing that domain adaptation using a small-scale corpus and fine-tuning with labeled data is effective for overall performance improvement. we proposed kr-finbert for the financial domain by further pre-training it on a financial corpus and fine-tuning it for sentiment analysis. as many studies have shown, the performance improvement through adaptation and conducting the downstream task was also clear in this experiment. the training data for this model is expanded from those of , texts from korean wikipedia, general news articles, legal texts crawled from the national law information center and . for the transfer learning, corporate-related economic news articles from 72 media sources (such as the financial times and the korean economy daily) and analyst reports from 16 securities companies (such as kiwoom securities and samsung securities) are added. the dataset includes 440,067 news titles with their content and 11,237 analyst reports. the total data size is about 13.22gb. for mlm training, we split the data line by line; the total number of lines is 6,379,315.", "api_call": "", "model_name": "snunlp/KR-FinBert-SC"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "cointegrated/rubert-tiny", "performance": {"dataset": "", "accuracy": null}, "description": "this is a very small distilled version of the model for russian and english (45 mb, 12m parameters). there is also an updated version of this model, with a larger vocabulary and better quality on practically all russian nlu tasks. this model is useful if you want to fine-tune it for a relatively simple russian task (e.g. ner or sentiment classification), and you care more about speed and size than about accuracy. it is approximately x10 smaller and faster than a base-sized bert. its `cls` embeddings can be used as a sentence representation aligned between russian and english. it was trained on the , and , using mlm loss distilled from , translation ranking loss, and `cls` embeddings distilled from , , laser and use.", "api_call": "", "model_name": "cointegrated/rubert-tiny"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tn-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: tn\n* target languages: en\n*  OPUS readme: [tn-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/tn-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-21.zip](https://object.pouta.csc.fi/OPUS-MT-models/tn-en/opus-2020-01-21.zip)\n* test set translations: [opus-2020-01-21.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/tn-en/opus-2020-01-21.test.txt)\n* test set scores: [opus-2020-01-21.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/tn-en/opus-2020-01-21.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tn-en"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-nl", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: nl\n*  OPUS readme: [en-nl](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-nl/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-nl/opus-2019-12-04.zip)\n* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-nl/opus-2019-12-04.test.txt)\n* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-nl/opus-2019-12-04.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-nl"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/dragon-plus-context-encoder", "api_call": "AutoModel.from_pretrained('facebook/dragon-plus-context-encoder')", "performance": {"dataset": "MS MARCO", "accuracy": 39.0}, "description": "DRAGON+ is a BERT-base sized dense retriever initialized from RetroMAE and further trained on the data augmented from MS MARCO corpus, following the approach described in How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval. The associated GitHub repository is available here https://github.com/facebookresearch/dpr-scale/tree/main/dragon. We use asymmetric dual encoder, with two distinctly parameterized encoders.", "model_name": "facebook/dragon-plus-context-encoder"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "dallinmackay/Van-Gogh-diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "v2 - fixed and working. this is a fine-tuned stable diffusion model based on v1.5, trained on screenshots from the film loving vincent. use the token lvngvncnt at the beginning of your prompts to use the style, e.g., \"lvngvncnt, beautiful woman at sunset\". this model works best with the euler sampler (not euler a). download the ckpt file from the \"files and versions\" tab into the stable diffusion models folder of your web-ui of choice.", "api_call": "", "model_name": "dallinmackay/Van-Gogh-diffusion"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/xclip-base-patch32", "api_call": "XCLIPModel.from_pretrained('microsoft/xclip-base-patch32')", "performance": {"dataset": "Kinetics 400", "accuracy": {"top-1": 80.4, "top-5": 95.0}}, "description": "X-CLIP is a minimal extension of CLIP for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs. This allows the model to be used for tasks like zero-shot, few-shot or fully supervised video classification and video-text retrieval.", "model_name": "microsoft/xclip-base-patch32"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "NlpHUST/gpt2-vietnamese", "performance": {"dataset": "", "accuracy": null}, "description": "gpt model pretrained on the vietnamese language using a causal language modeling (clm) objective. it was introduced in and first released at .", "api_call": "", "model_name": "NlpHUST/gpt2-vietnamese"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "dreamlike-art/dreamlike-anime-1.0", "api_call": "StableDiffusionPipeline.from_pretrained('dreamlike-art/dreamlike-anime-1.0', torch_dtype=torch.float16)(prompt, negative_prompt=negative_prompt)", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "Dreamlike Anime 1.0 is a high quality anime model, made by dreamlike.art. It can be used to generate anime-style images based on text prompts. The model is trained on 768x768px images and works best with prompts that include 'photo anime, masterpiece, high quality, absurdres'. It can be used with the Stable Diffusion Pipeline from the diffusers library.", "model_name": "dreamlike-art/dreamlike-anime-1.0"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-da", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: da\n*  OPUS readme: [en-da](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-da/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-da/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-da/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-da/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-da"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "saibo/legal-roberta-base", "performance": {"dataset": "", "accuracy": null}, "description": "we introduce legal-roberta, which is a domain-specific language representation model fine-tuned on large-scale legal corpora (4.6 gb).", "api_call": "", "model_name": "saibo/legal-roberta-base"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Sentence-Transformers", "api_name": "dangvantuan/sentence-camembert-large", "performance": {"dataset": "", "accuracy": null}, "description": "sentence-camembert-large is an embedding model for french. the purpose of this embedding model is to represent the content and semantics of a french sentence in a mathematical vector, which allows it to understand the meaning of the text beyond individual words in queries and documents, offering a powerful semantic search. the model is fine-tuned from a pre-trained model on a dataset.", "api_call": "", "model_name": "dangvantuan/sentence-camembert-large"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "Voicelab/vlt5-base-keywords", "performance": {"dataset": "", "accuracy": null}, "description": "our vlt5 model is a keyword generation model based on encoder-decoder architecture using transformer blocks presented by google. the vlt5 was trained on a corpus of scientific articles to predict a given set of keyphrases based on the concatenation of the article\u2019s abstract and title. it generates precise, yet not always complete keyphrases that describe the content of the article based only on the abstract. keywords generated with vlt5-base-keywords: encoder-decoder architecture, keyword generation.", "api_call": "", "model_name": "Voicelab/vlt5-base-keywords"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "functionality": "Summarization", "api_name": "lidiya/bart-large-xsum-samsum", "api_call": "pipeline('summarization', model='lidiya/bart-large-xsum-samsum')", "performance": {"dataset": "SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization", "accuracy": {"rouge1": 53.306, "rouge2": 28.355, "rougeL": 44.095}}, "description": "This model was obtained by fine-tuning facebook/bart-large-xsum on Samsum dataset.", "model_name": "lidiya/bart-large-xsum-samsum"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Babelscape/mrebel-large", "performance": {"dataset": "", "accuracy": null}, "description": "this is a multilingual version of rebel. it can be used as a standalone multilingual relation extraction system, or as a pretrained system to be tuned on multilingual relation extraction datasets. mrebel is introduced in an acl 2023 paper. we present a new multilingual relation extraction dataset and train a multilingual version of rebel, which reframed relation extraction as a seq2seq task. if you use the code or model, please reference this work in your paper (bibtex key: huguet-cabot-et-al-2023-redfm-dataset).", "api_call": "", "model_name": "Babelscape/mrebel-large"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-bn-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Bengali \n* target group: English \n*  OPUS readme: [ben-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/ben-eng/README.md)\n\n*  model: transformer-align\n* source language(s): ben\n* target language(s): eng\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/ben-eng/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/ben-eng/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/ben-eng/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-bn-en"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "darkstorm2150/Protogen_x5.8_Official_Release", "performance": {"dataset": "", "accuracy": null}, "description": "## General info\nProtogen x5.8\n\nProtogen was warm-started with [Stable Diffusion v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) and \nis rebuilt using dreamlikePhotoRealV2.ckpt as a core, adding small amounts during merge checkpoints.", "api_call": "", "model_name": "darkstorm2150/Protogen_x5.8_Official_Release"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "fabiochiu/t5-base-tag-generation", "performance": {"dataset": "", "accuracy": null}, "description": "this model is fine-tuned on the dataset for predicting article tags using the article textual content as input. while usually formulated as a multi-label classification problem, this model deals with tag generation as a text2text generation task. the dataset is composed of medium articles and their tags. however, each medium article can have at most five tags, therefore the author needs to choose what he/she believes are the best tags, mainly for seo-related purposes. this means that an article with the \"python\" tag may not have the \"programming languages\" tag, even though the first implies the latter. to clean the dataset accounting for this problem, a hand-made taxonomy of about 1000 tags was built. using the taxonomy, the tags of each article have been augmented, e.g. an article with the \"python\" tag will have the \"programming languages\" tag as well, as the taxonomy says that \"python\" is part of \"programming languages\". the taxonomy is not public; if you are interested in it, please send an email to chiusanofabio94@gmail.com.", "api_call": "", "model_name": "fabiochiu/t5-base-tag-generation"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "ItsJayQz/Marvel_WhatIf_Diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "this model was trained on images from the animated marvel disney+ show what if, which include characters, backgrounds, and some objects. please check out the important information on the usage of the model below.", "api_call": "", "model_name": "ItsJayQz/Marvel_WhatIf_Diffusion"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "apanc/russian-sensitive-topics", "performance": {"dataset": "", "accuracy": null}, "description": "this model is trained on the dataset of sensitive topics of the russian language. the concept of sensitive topics is described in an article presented at the workshop for balto-slavic nlp at the eacl-2021 conference. please note that this article describes the first version of the dataset, while the model is trained on the extended version of the dataset open-sourced on our or on . the properties of the dataset are the same as those described in the article; the only difference is the size. the model predicts combinations of 18 sensitive topics described in the . you can find step-by-step instructions for using the model. the dataset consists partially of manually labeled samples and partially of semi-automatically labeled samples. learn more in our article. we tested the performance of the classifier only on the manually labeled part of the data, which is why some topics are not well represented in the test set.", "api_call": "", "model_name": "apanc/russian-sensitive-topics"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "xiaolxl/GuoFeng3", "performance": {"dataset": "", "accuracy": null}, "description": "# Solemn declaration: this model must not be used to train style models based on portraits of celebrities or public figures, because doing so would cause controversy and harm the development of the AI community.\n\n# Note: the training material does not contain any material of real people.\n\n| Version | Sample image |\n| --- | --- |\n| **GuoFeng3.4** | ![e5.jpg](https://ai-studio-static-online.cdn.bcebos.com/5e78944f992747f79723af0fdd9cb5a306ecddde0dd941ac8e220c45dd8fcff7) |\n| **GuoFeng3.3** | ![min_00193-3556647833.png.jpg](https://ai-studio-static-online.cdn.bcebos.com/fd09b7f02da24d3391bea0c639a14a80c12aec9467484d67a7ab5a32cef84bb1) |\n| **GuoFeng3.2_light** | ![178650.png](https://ai-studio-static-online.cdn.bcebos.com/9d5e36ad89f947a39b631f70409366c3bd531aa3a1214be7b0cf115daa62fb94) |\n| **GuoFeng3.2** | ![00044-4083026190-1girl, beautiful, realistic.png.png](https://ai-studio-static-online.cdn.bcebos.com/ff5c7757f97849ecb5320bfbe7b692d1cb12da547c9348058a842ea951369ff8) |\n| **GuoFeng3** | ![e1.png](https://ai-studio-static-online.cdn.bcebos.com/be966cf5c86d431cb33d33396560f546fdd4c15789d54203a8bd15c35abd7dc2) |\n\n# Introduction - GuoFeng3\n\nWelcome to the GuoFeng3 model (TIP: the name of this version has been slightly adjusted). This is a gorgeous ancient-Chinese-style model, which can also be described as an ancient-style game character model with a 2.5D texture. The third generation greatly reduces the difficulty of getting started and adds scene elements and male ancient-style characters. In addition, elements of other styles were added so that the model adapts better to other tags. This generation partially repairs broken faces and hands, and the training image size has been increased to 1024 on the longest side.\n\nAccording to personal experiments and feedback received, the second generation of the GuoFeng model series performs better than the third generation on characters and head shots. If you need this, try the second generation.\n\nVersion 2.0: [https://huggingface.co/xiaolxl/Gf_style2](https://huggingface.co/xiaolxl/Gf_style2)\n\nGuoFeng3: original model\n\nGuoFeng3.1: fine-tuned to repair the portraits of GuoFeng3\n\nGuoFeng3.2: if you don't know whether to choose GuoFeng3 or GuoFeng2, you can use this version directly\n\nGuoFeng3.2_light: GuoFeng3.2 merged with a Lora trained with Noise Offset, so that the model can draw more beautiful light and shadow effects (Lora: epi_noiseoffset/Theovercomer8's Contrast Fix)\n\nGuoFeng3.2_Lora: Lora version of GuoFeng3.2\n\nGuoFeng3.2_Lora_big_light: Lora version of GuoFeng3.2_light with increased dimension\n\nGuoFeng3.2_f16: half-precision version of GuoFeng3.2\n\nGuoFeng3.2_light_f16: half-precision version of GuoFeng3.2_light\n\nGuoFeng3.3: this version is a major update and improvement based on 3.2 and can handle full body images. Even if your tags are not very good, the model will automatically adjust the image, but as a result the faces it produces will be quite similar. This model does not seem to need upscaling; my output size is 768*1024 and the clarity is quite good. Portrait orientation is recommended; landscape images may be unclear. Euler a is sufficient. (DPM++ SDE Karras and DDIM are also good)\n\nGuoFeng3.4: this version was retrained to handle full body images, and its content differs significantly from the previous versions. The overall painting style was also adjusted and the degree of overfitting reduced, so that more Lora can be used to adjust the image and content.\n\n# Install\n\n1. Put the GuoFeng3.ckpt model into the SD directory\n\n2. This model comes with a VAE. If your program does not support it, remember to select any VAE file, otherwise the images will be gray\n\n# How to use\n\n**TIP: after a day of testing, we found that many characters may have red-eye problems; try adding red eyes to the negative words. If the colors are too vivid, try lowering CFG**\n\nSimple: the third generation greatly reduces the difficulty of getting started\n\n======\n\nIf faces collapse in full body images, it is recommended to delete the full body keyword or use the automatic face repair plugin:\n\nSource address (outside China): https://github.com/ototadana/sd-face-editor.git\n\nMirror address (inside China): https://jihulab.com/xiaolxl_pub/sd-face-editor.git\n\n=====\n\n- **Keywords:**\n```\nbest quality, masterpiece, highres, 1girl,china dress,Beautiful face\n```\n\n- **Negative words:**\n```\nNSFW, lowres,bad anatomy,bad hands, text, error, missing fingers,extra digit, fewer digits, cropped, worstquality, low quality, normal quality,jpegartifacts,signature, watermark, username,blurry,bad feet\n```\n\n---\n\nAdvanced: if you want to make the picture as good as possible, try the following configuration\n\n- Sampling steps:**50**\n\n- Sampler:**DPM++ SDE Karras or DDIM**\n\n- The size of the picture should be at least **1024**\n\n- CFG:**4-6**\n\n- **Better negative words (thanks to group members for providing them):**\n```\n(((simple background))),monochrome ,lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, lowres, bad anatomy, bad hands, text, error, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, ugly,pregnant,vore,duplicate,morbid,mut ilated,tran nsexual, hermaphrodite,long neck,mutated hands,poorly drawn hands,poorly drawn face,mutation,deformed,blurry,bad anatomy,bad proportions,malformed limbs,extra limbs,cloned face,disfigured,gross proportions, (((missing arms))),(((missing legs))), (((extra arms))),(((extra legs))),pubic hair, plump,bad legs,error legs,username,blurry,bad feet\n```\n\n- **If you want richer elements, you can add the following keywords**\n```\nBeautiful face,\nhair ornament, solo,looking at viewer,smile,closed mouth,lips\nchina dress,dress,hair ornament, necklace, jewelry, long hair, earrings, chinese clothes,\narchitecture,east asian architecture,building,outdoors,rooftop,city,cityscape\n```\n\n# Examples\n\n(You can find the original images in the file list and load them into the WebUI to view keywords and other information)\n\n<img src=https://huggingface.co/xiaolxl/GuoFeng3/resolve/main/examples/e1.png>\n\n<img src=https://huggingface.co/xiaolxl/GuoFeng3/resolve/main/examples/e2.png>\n\n<img src=https://huggingface.co/xiaolxl/GuoFeng3/resolve/main/examples/e3.png>\n\n<img src=https://huggingface.co/xiaolxl/GuoFeng3/resolve/main/examples/e4.png>", "api_call": "", "model_name": "xiaolxl/GuoFeng3"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "stanford-crfm/BioMedLM", "performance": {"dataset": "", "accuracy": null}, "description": "note: this model was previously known as pubmedgpt 2.7b, but we have changed the name due to a request from the nih, which holds the trademark for \"pubmed\". biomedlm 2.7b is a new language model trained exclusively on biomedical abstracts and papers from . this gpt-style model can achieve strong results on a variety of biomedical nlp tasks, including a new state of the art performance of 50.3% accuracy on the medqa biomedical question answering task. as an autoregressive language model, biomedlm 2.7b is also capable of natural language generation. however, we have only begun to explore the generation capabilities and limitations of this model, and we emphasize that this model\u2019s generation capabilities are for research purposes only and not suitable for production. in releasing this model, we hope to advance both the development of biomedical nlp applications and best practices for responsibly training and utilizing domain-specific language models; issues of reliability, truthfulness, and explainability are top of mind for us. this model was a joint collaboration of stanford crfm and mosaicml. - developed by: stanford crfm, mosaicml - shared by: stanford crfm - model type: language model - languages nlp: en - license:", "api_call": "", "model_name": "stanford-crfm/BioMedLM"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment", "performance": {"dataset": "", "accuracy": null}, "description": "- Main Page: [Fengshenbang](https://fengshenbang-lm.com/)\n- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM)", "api_call": "", "model_name": "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-ca-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: ca\n* target languages: en\n*  OPUS readme: [ca-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/ca-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ca-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ca-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-ca-en"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Image Segmentation", "api_name": "openmmlab/upernet-convnext-small", "api_call": "UperNetModel.from_pretrained('openmmlab/upernet-convnext-small')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "UperNet framework for semantic segmentation, leveraging a ConvNeXt backbone. UperNet was introduced in the paper Unified Perceptual Parsing for Scene Understanding by Xiao et al. Combining UperNet with a ConvNeXt backbone was introduced in the paper A ConvNet for the 2020s.", "model_name": "openmmlab/upernet-convnext-small"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-uk", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: uk\n*  OPUS readme: [en-uk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-uk/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-uk/opus-2020-01-08.zip)\n* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-uk/opus-2020-01-08.test.txt)\n* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-uk/opus-2020-01-08.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-uk"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "CLTL/MedRoBERTa.nl", "performance": {"dataset": "", "accuracy": null}, "description": "check out our website! this model is a roberta-based model pre-trained from scratch on dutch hospital notes sourced from electronic health records. the model is not fine-tuned. all code used for the creation of medroberta.nl can be found at . the model can be fine-tuned on any type of task. since it is a domain-specific model trained on medical data, it is meant to be used on medical nlp tasks for dutch.", "api_call": "", "model_name": "CLTL/MedRoBERTa.nl"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-hi-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: hi\n* target languages: en\n*  OPUS readme: [hi-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/hi-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/hi-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/hi-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/hi-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-hi-en"}
{"domain": "Audio Text-to-Audio", "framework": "Hugging Face Diffusers", "api_name": "riffusion/riffusion-model-v1", "performance": {"dataset": "", "accuracy": null}, "description": "Riffusion is an app for real-time music generation with stable diffusion.\n\nRead about it at https://www.riffusion.com/about and try it at https://www.riffusion.com/.\n\n* Code: https://github.com/riffusion/riffusion\n* Web app: https://github.com/hmartiro/riffusion-app\n* Model checkpoint: https://huggingface.co/riffusion/riffusion-model-v1\n* Discord: https://discord.gg/yu6SRwvX4v\n\nThis repository contains the model files, including:\n\n * a diffusers formatted library\n * a compiled checkpoint file\n * a traced unet for improved inference speed\n * a seed image library for use with riffusion-app", "api_call": "", "model_name": "riffusion/riffusion-model-v1"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "dbmdz/german-gpt2", "performance": {"dataset": "", "accuracy": null}, "description": "in this repository we release yet another gpt-2 model, that was trained on various texts for german. the model is meant to be an entry point for fine-tuning on other texts, and it is definitely not as good or \"dangerous\" as the english gpt-3 model. we do not plan extensive pr or staged releases for this model \ud83d\ude09 note: the model was initially released under an anonymous alias `anonymous-german-nlp/german-gpt2` so we now \"de-anonymize\" it.", "api_call": "", "model_name": "dbmdz/german-gpt2"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "datificate/gpt2-small-spanish", "performance": {"dataset": "", "accuracy": null}, "description": "the spanish description follows the english description. gpt2-small-spanish is a state-of-the-art language model for spanish based on the gpt-2 small model. it was trained on spanish wikipedia using transfer learning and fine-tuning techniques. the training took around 70 hours with four nvidia gtx 1080-ti gpus with 11gb of ddr5 and with around 3gb of processed training data.", "api_call": "", "model_name": "datificate/gpt2-small-spanish"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tc-big-fi-en", "performance": {"dataset": "", "accuracy": null}, "description": "neural machine translation model for translating from finnish fi to english en. this model is part of the , an effort to make neural machine translation models widely available and accessible for many languages in the world. all models are originally trained using the amazing framework of , an efficient nmt implementation written in pure c++. the models have been converted to pytorch using the transformers library by huggingface. training data is taken from and training pipelines use the procedures of . publications: and please, cite if you use this model.", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tc-big-fi-en"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "SCUT-DLVCLab/lilt-infoxlm-base", "performance": {"dataset": "", "accuracy": null}, "description": "language-independent layout transformer - infoxlm model by stitching a pre-trained infoxlm and a pre-trained language-independent layout transformer (lilt) together. it was introduced in the paper by wang et al. and first released in . disclaimer: the team releasing lilt did not write a model card for this model so this model card has been written by the hugging face team. the language-independent layout transformer (lilt) makes it possible to combine any pre-trained roberta encoder from the hub (hence, in any language) with a lightweight layout transformer to obtain a layoutlm-like model for any language.", "api_call": "", "model_name": "SCUT-DLVCLab/lilt-infoxlm-base"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-ko-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Korean \n* target group: English \n*  OPUS readme: [kor-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/kor-eng/README.md)\n\n*  model: transformer-align\n* source language(s): kor kor_Hang kor_Latn\n* target language(s): eng\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/kor-eng/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/kor-eng/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/kor-eng/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-ko-en"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "Yale-LILY/brio-cnndm-uncased", "performance": {"dataset": "", "accuracy": null}, "description": "abstractive summarization models are commonly trained using maximum likelihood estimation, which assumes a deterministic one-point target distribution in which an ideal model will assign all the probability mass to the reference summary. this assumption may lead to performance degradation during inference, where the model needs to compare several system-generated candidate summaries that have deviated from the reference summary. to address this problem, we propose a novel training paradigm which assumes a non-deterministic distribution so that different candidate summaries are assigned probability mass according to their quality. - developed by: yale lily lab - shared by optional: yale lily lab - model type: text2text generation - languages nlp: more information needed - license: more information needed - parent model: bart - resources for more information:", "api_call": "", "model_name": "Yale-LILY/brio-cnndm-uncased"}
{"domain": "Computer Vision Mask Generation", "framework": "Hugging Face Transformers", "api_name": "facebook/sam-vit-huge", "performance": {"dataset": "", "accuracy": null}, "description": "<p>\n\t<img src=\"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/sam-architecture.png\" alt=\"Model architecture\">\n\t<em> Detailed architecture of Segment Anything Model (SAM).</em>\n</p>\n\n\n#  Table of Contents\n\n0. [TL;DR](#TL;DR)\n1. [Model Details](#model-details)\n2. [Usage](#usage)\n3. [Citation](#citation)\n\n# TL;DR\n\n\n[Link to original repository](https://github.com/facebookresearch/segment-anything)\n\n| <img src=\"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/sam-beancans.png\" alt=\"Snow\" width=\"600\" height=\"600\"> | <img src=\"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/sam-dog-masks.png\" alt=\"Forest\" width=\"600\" height=\"600\"> | <img src=\"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/sam-car-seg.png\" alt=\"Mountains\" width=\"600\" height=\"600\"> |\n|---------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|\n\n\nThe **Segment Anything Model (SAM)** produces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. 
It has been trained on a [dataset](https://segment-anything.com/dataset/index.html) of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks.\nThe abstract of the paper states:\n\n>  We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at [https://segment-anything.com](https://segment-anything.com) to foster research into foundation models for computer vision.\n\n**Disclaimer**: Content from **this** model card has been written by the Hugging Face team, and parts of it were copied from the original [SAM model card](https://github.com/facebookresearch/segment-anything).\n\n# Model Details\n\nThe SAM model is made up of the following modules:\n  - The `VisionEncoder`: a ViT-based image encoder. It computes the image embeddings using attention on patches of the image. Relative positional embeddings are used.\n  - The `PromptEncoder`: generates embeddings for points and bounding boxes.\n  - The `MaskDecoder`: a two-way transformer which performs cross attention between the image embedding and the point embeddings (->) and between the point embeddings and the image embeddings. The outputs are fed\n  - The `Neck`: predicts the output masks based on the contextualized masks produced by the `MaskDecoder`.\n# Usage", "api_call": "", "model_name": "facebook/sam-vit-huge"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "api_name": "joeddav/bart-large-mnli-yahoo-answers", "performance": {"dataset": "", "accuracy": null}, "description": "this model takes and fine-tunes it on yahoo answers topic classification. it can be used to predict whether a topic label can be assigned to a given sequence, whether or not the label has been seen before. you can play with an interactive demo of this zero-shot technique with this model, as well as the non-finetuned , . this model was fine-tuned on topic classification and will perform best at zero-shot topic classification. use `hypothesis template \"this text is about .\"` as this is the template used during fine-tuning.", "api_call": "", "model_name": "joeddav/bart-large-mnli-yahoo-answers"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "Salesforce/codegen-350M-mono", "performance": {"dataset": "", "accuracy": null}, "description": "codegen is a family of autoregressive language models for program synthesis from the paper: by erik nijkamp, bo pang, hiroaki hayashi, lifu tu, huan wang, yingbo zhou, silvio savarese, caiming xiong. the models were originally released in , under 3 pre-training data variants `nl`, `multi`, `mono` and 4 model size variants `350m`, `2b`, `6b`, `16b`. the checkpoint included in this repository is denoted as codegen-mono 350m in the paper, where \"mono\" means the model is initialized with codegen-multi 350m and further pre-trained on a python programming language dataset, and \"350m\" refers to the number of trainable parameters. this checkpoint codegen-mono 350m was first initialized with codegen-multi 350m, and then pre-trained on the bigpython dataset. the data consists of 71.7b tokens of python programming language. see section 2.1 of the for more details.", "api_call": "", "model_name": "Salesforce/codegen-350M-mono"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tc-big-en-ko", "performance": {"dataset": "", "accuracy": null}, "description": "## Table of Contents\n- [Model Details](#model-details)\n- [Uses](#uses)\n- [Risks, Limitations and Biases](#risks-limitations-and-biases)\n- [How to Get Started With the Model](#how-to-get-started-with-the-model)\n- [Training](#training)\n- [Evaluation](#evaluation)\n- [Citation Information](#citation-information)\n- [Acknowledgements](#acknowledgements)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tc-big-en-ko"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "valhalla/t5-base-e2e-qg", "api_call": "pipeline('e2e-qg', model='valhalla/t5-base-e2e-qg')", "performance": {"dataset": "squad", "accuracy": "N/A"}, "description": "This is a T5-base model trained for end-to-end question generation task. Simply input the text and the model will generate multiple questions. You can play with the model using the inference API, just put the text and see the results!", "model_name": "valhalla/t5-base-e2e-qg"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-fr-es", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-fr-es')", "performance": {"dataset": "opus", "accuracy": {"BLEU": {"newssyscomb2009.fr.es": 34.3, "news-test2008.fr.es": 32.5, "newstest2009.fr.es": 31.6, "newstest2010.fr.es": 36.5, "newstest2011.fr.es": 38.3, "newstest2012.fr.es": 38.1, "newstest2013.fr.es": 34.0, "Tatoeba.fr.es": 53.2}, "chr-F": {"newssyscomb2009.fr.es": 0.601, "news-test2008.fr.es": 0.583, "newstest2009.fr.es": 0.586, "newstest2010.fr.es": 0.616, "newstest2011.fr.es": 0.622, "newstest2012.fr.es": 0.619, "newstest2013.fr.es": 0.587, "Tatoeba.fr.es": 0.709}}}, "description": "A French to Spanish translation model trained on the OPUS dataset using the Hugging Face Transformers library. The model is based on the transformer-align architecture and uses normalization and SentencePiece for pre-processing.", "model_name": "Helsinki-NLP/opus-mt-fr-es"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Document-level embeddings of research papers", "api_name": "malteos/scincl", "api_call": "AutoModel.from_pretrained('malteos/scincl')", "performance": {"dataset": "SciDocs", "accuracy": {"mag-f1": 81.2, "mesh-f1": 89.0, "co-view-map": 85.3, "co-view-ndcg": 92.2, "co-read-map": 87.7, "co-read-ndcg": 94.0, "cite-map": 93.6, "cite-ndcg": 97.4, "cocite-map": 91.7, "cocite-ndcg": 96.5, "recomm-ndcg": 54.3, "recomm-P@1": 19.6}}, "description": "SciNCL is a pre-trained BERT language model to generate document-level embeddings of research papers. It uses the citation graph neighborhood to generate samples for contrastive learning. Prior to the contrastive training, the model is initialized with weights from scibert-scivocab-uncased. The underlying citation embeddings are trained on the S2ORC citation graph.", "model_name": "malteos/scincl"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Generation", "api_name": "stabilityai/stable-diffusion-2-1", "api_call": "StableDiffusionPipeline.from_pretrained('stabilityai/stable-diffusion-2-1', torch_dtype=torch.float16)", "performance": {"dataset": "COCO2017", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion v2-1 is a diffusion-based text-to-image generation model developed by Robin Rombach and Patrick Esser. It is capable of generating and modifying images based on text prompts in English. The model is trained on a subset of the LAION-5B dataset and is primarily intended for research purposes.", "model_name": "stabilityai/stable-diffusion-2-1"}
{"domain": "Computer Vision Text-to-3D", "framework": "Hugging Face Diffusers", "api_name": "openai/shap-e", "performance": {"dataset": "", "accuracy": null}, "description": "shap-e introduces a diffusion process that can generate a 3d image from a text prompt. it was introduced in by heewoo jun and alex nichol from openai. original repository of shap-e can be found here: the authors of shap-e didn't author this model card. they provide a separate model card .  the abstract of the shap-e paper:  we present shap-e, a conditional generative model for 3d assets. unlike recent work on 3d generative models which produce a single output representation, shap-e directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. we train shap-e in two stages: first, we train an encoder that deterministically maps 3d assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. when trained on a large dataset of paired 3d and text data, our resulting models are capable of generating complex and diverse 3d assets in a matter of seconds. when compared to point-e, an explicit generative model over point clouds, shap-e converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space. we release model weights, inference code, and samples at .", "api_call": "", "model_name": "openai/shap-e"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "ItsJayQz/GTA5_Artwork_Diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "this model was trained on artwork from the loading screens, gta story mode, and gta online dlcs, which includes characters, backgrounds, chop, and some objects. the model can do people and portraits pretty easily, as well as cars and houses.", "api_call": "", "model_name": "ItsJayQz/GTA5_Artwork_Diffusion"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "potsawee/t5-large-generation-squad-QuestionAnswer", "performance": {"dataset": "", "accuracy": null}, "description": "- input: `context` e.g. news article - output: `question answer` the answers in the training data squad are highly extractive; therefore, this model will generate extractive answers. if you would like to have abstractive questions/answers, you can use our model trained on the race dataset:", "api_call": "", "model_name": "potsawee/t5-large-generation-squad-QuestionAnswer"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "api_name": "smp111/terrain_recognition", "performance": {"dataset": "", "accuracy": null}, "description": "autogenerated by huggingpics\ud83e\udd17\ud83d\uddbc\ufe0f create your own image classifier for anything by running . report any issues with the demo at the .", "api_call": "", "model_name": "smp111/terrain_recognition"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "mio/amadeus", "api_call": "./run.sh --skip_data_prep false --skip_train true --download_model mio/amadeus", "performance": {"dataset": "amadeus", "accuracy": "Not provided"}, "description": "This model was trained by mio using amadeus recipe in espnet.", "model_name": "mio/amadeus"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-en-zh", "api_call": "pipeline('translation_en_to_zh', model='Helsinki-NLP/opus-mt-en-zh')", "performance": {"dataset": "Tatoeba-test.eng.zho", "accuracy": {"BLEU": 31.4, "chr-F": 0.268}}, "description": "A translation model for English to Chinese using the Hugging Face Transformers library. It is based on the Marian NMT model and trained on the OPUS dataset. The model requires a sentence initial language token in the form of '>>id<<' (id = valid target language ID).", "model_name": "Helsinki-NLP/opus-mt-en-zh"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Generation", "api_name": "stabilityai/stable-diffusion-2", "api_call": "StableDiffusionPipeline.from_pretrained('stabilityai/stable-diffusion-2', scheduler=EulerDiscreteScheduler.from_pretrained('stabilityai/stable-diffusion-2', subfolder=scheduler), torch_dtype=torch.float16)", "performance": {"dataset": "COCO2017 validation set", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion v2 is a diffusion-based text-to-image generation model that can generate and modify images based on text prompts. It uses a fixed, pretrained text encoder (OpenCLIP-ViT/H) and is primarily intended for research purposes, such as safe deployment of models with potential to generate harmful content, understanding limitations and biases of generative models, and generation of artworks for design and artistic processes.", "model_name": "stabilityai/stable-diffusion-2"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "kandinsky-community/kandinsky-2-2-controlnet-depth", "performance": {"dataset": "", "accuracy": null}, "description": "kandinsky inherits best practices from dall-e 2 and latent diffusion while introducing some new ideas. it uses the clip model as a text and image encoder, and diffusion image prior mapping between latent spaces of clip modalities. this approach increases the visual performance of the model and unveils new horizons in blending images and text-guided image manipulation.", "api_call": "", "model_name": "kandinsky-community/kandinsky-2-2-controlnet-depth"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "api_name": "nsi319/legal-pegasus", "performance": {"dataset": "", "accuracy": null}, "description": "legal-pegasus is a finetuned version of for the legal domain, trained to perform the abstractive summarization task. the maximum length of the input sequence is 1024 tokens. this model was trained on a dataset consisting of more than 2700 litigation releases and complaints.", "api_call": "", "model_name": "nsi319/legal-pegasus"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "donut-base-finetuned-cord-v2", "api_call": "AutoModel.from_pretrained('naver-clova-ix/donut-base-finetuned-cord-v2')", "performance": {"dataset": "CORD", "accuracy": "Not provided"}, "description": "Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a tensor of embeddings (of shape batch_size, seq_len, hidden_size), after which the decoder autoregressively generates text, conditioned on the encoding of the encoder. This model is fine-tuned on CORD, a document parsing dataset.", "model_name": "naver-clova-ix/donut-base-finetuned-cord-v2"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "sbcBI/sentiment_analysis_model", "performance": {"dataset": "", "accuracy": null}, "description": "pretrained model on the english language using a masked language modeling (mlm) objective. it was introduced in and first released in . this model is uncased: it does not distinguish between lowercase and uppercase text. bert is a transformers model pretrained on a large corpus of english data in a self-supervised fashion. this means it was pretrained on the raw texts only, with no humans labelling them in any way, which is why it can use lots of publicly available data, with an automatic process to generate inputs and labels from those texts. more precisely, it was pretrained with two objectives: - masked language modeling (mlm): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. this is different from traditional recurrent neural networks (rnns) that usually see the words one after the other, or from autoregressive models like gpt which internally mask the future tokens. it allows the model to learn a bidirectional representation of the sentence. - next sentence prediction (nsp): the model concatenates two masked sentences as inputs during pretraining. sometimes they correspond to sentences that were next to each other in the original text, sometimes not. the model then has to predict if the two sentences were following each other or not. this way, the model learns an inner representation of the english language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the bert model as inputs.", "api_call": "", "model_name": "sbcBI/sentiment_analysis_model"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Sentence-Transformers", "api_name": "PM-AI/bi-encoder_msmarco_bert-base_german", "performance": {"dataset": "", "accuracy": null}, "description": "this model can be used for semantic search and document retrieval to find relevant passages based on a query. it was trained on a machine-translated msmarco dataset for german with hard negatives and margin mse loss. combining these elements results in a sota transformer for asymmetric search.", "api_call": "", "model_name": "PM-AI/bi-encoder_msmarco_bert-base_german"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "Meli/GPT2-Prompt", "performance": {"dataset": "", "accuracy": null}, "description": "generate a short story from an input prompt. put the vocab ` endprompt` after your input. example of an input:", "api_call": "", "model_name": "Meli/GPT2-Prompt"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "api_name": "rizvandwiki/gender-classification-2", "performance": {"dataset": "", "accuracy": null}, "description": "autogenerated by huggingpics\ud83e\udd17\ud83d\uddbc\ufe0f create your own image classifier for anything by running . report any issues with the demo at the .", "api_call": "", "model_name": "rizvandwiki/gender-classification-2"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "api_name": "tonyassi/camera-lens-focal-length", "performance": {"dataset": "", "accuracy": null}, "description": "this model predicts the focal length that the camera lens used to capture an image. it takes in an image and returns one of the following labels: - ultra-wide - wide this model is a fine-tuned version of .", "api_call": "", "model_name": "tonyassi/camera-lens-focal-length"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "potsawee/t5-large-generation-race-Distractor", "performance": {"dataset": "", "accuracy": null}, "description": "- input: `question answer context` - output: list of 3 distractors t5-large model is fine-tuned to the race dataset where the input is the concatenation of question, answer, context and the output is a list of 3 distractors. this is the second component in the question generation pipeline i.e. `g2` in our ,", "api_call": "", "model_name": "potsawee/t5-large-generation-race-Distractor"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/trocr-base-handwritten", "api_call": "VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')", "performance": {"dataset": "IAM", "accuracy": "Not specified"}, "description": "TrOCR model fine-tuned on the IAM dataset. It was introduced in the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al. and first released in this repository. The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image encoder was initialized from the weights of BEiT, while the text decoder was initialized from the weights of RoBERTa.", "model_name": "microsoft/trocr-base-handwritten"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "clipseg-rd64-refined", "api_call": "pipeline('image-segmentation', model='CIDAS/clipseg-rd64-refined')", "performance": {"dataset": "", "accuracy": ""}, "description": "CLIPSeg model with reduce dimension 64, refined (using a more complex convolution). It was introduced in the paper Image Segmentation Using Text and Image Prompts by L\u00fcddecke et al. and first released in this repository. This model is intended for zero-shot and one-shot image segmentation.", "model_name": "CIDAS/clipseg-rd64-refined"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "google/t5-small-ssm-nq", "performance": {"dataset": "", "accuracy": null}, "description": "for closed book question answering . the model was pre-trained using t5's denoising objective on , subsequently additionally pre-trained using 's salient span masking objective on , and finally fine-tuned on . note : the model was fine-tuned on 100% of the train splits of for 10k steps.", "api_call": "", "model_name": "google/t5-small-ssm-nq"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "blip2-opt-2.7b", "api_call": "Blip2ForConditionalGeneration.from_pretrained('Salesforce/blip2-opt-2.7b')", "performance": {"dataset": "LAION", "accuracy": "Not specified"}, "description": "BLIP-2 model, leveraging OPT-2.7b (a large language model with 2.7 billion parameters). It was introduced in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Li et al. and first released in this repository. The goal for the model is to predict the next text token, given the query embeddings and the previous text. This allows the model to be used for tasks like image captioning, visual question answering (VQA), and chat-like conversations by feeding the image and the previous conversation as prompt to the model.", "model_name": "Salesforce/blip2-opt-2.7b"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "blanchefort/rubert-base-cased-sentiment-rusentiment", "performance": {"dataset": "", "accuracy": null}, "description": "This is a [DeepPavlov/rubert-base-cased-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational) model trained on [RuSentiment](http://text-machine.cs.uml.edu/projects/rusentiment/).", "api_call": "", "model_name": "blanchefort/rubert-base-cased-sentiment-rusentiment"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-ca-es", "api_call": "MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-ca-es') , MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-ca-es')", "performance": {"dataset": "Tatoeba.ca.es", "accuracy": {"BLEU": 74.9, "chr-F": 0.863}}, "description": "A Hugging Face model for translation between Catalan (ca) and Spanish (es) languages, based on the OPUS dataset and using the transformer-align architecture. The model has been pre-processed with normalization and SentencePiece.", "model_name": "Helsinki-NLP/opus-mt-ca-es"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "aubmindlab/aragpt2-base", "performance": {"dataset": "", "accuracy": null}, "description": "you can find more information in our paper. the code in this repository was used to train all gpt2 variants. the code supports training and fine-tuning gpt2 on gpus and tpus via the tpuestimator api. gpt2-base and medium use the code from the `gpt2` folder and can train models from the repository.", "api_call": "", "model_name": "aubmindlab/aragpt2-base"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "kandinsky-community/kandinsky-2-2-decoder", "performance": {"dataset": "", "accuracy": null}, "description": "kandinsky inherits best practices from dall-e 2 and latent diffusion while introducing some new ideas. it uses the clip model as a text and image encoder, and diffusion image prior mapping between latent spaces of clip modalities. this approach increases the visual performance of the model and unveils new horizons in blending images and text-guided image manipulation. the kandinsky model is created by , , , , and", "api_call": "", "model_name": "kandinsky-community/kandinsky-2-2-decoder"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "pucpr/biobertpt-all", "performance": {"dataset": "", "accuracy": null}, "description": "the paper contains clinical and biomedical bert-based models for portuguese language, initialized with bert-multilingual-cased & trained on clinical notes and biomedical literature. this model card describes the biobertptall model, a full version with clinical narratives and biomedical literature in portuguese language. load the model via the transformers library:", "api_call": "", "model_name": "pucpr/biobertpt-all"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "AstraliteHeart/pony-diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "pony-diffusion is a latent text-to-image diffusion model that has been conditioned on high-quality pony sfw-ish images through fine-tuning. with special thanks to for providing finetuning expertise and for providing necessary compute. the model originally used for fine-tuning is an early finetuned checkpoint of on top of , which is a latent image diffusion model trained on . this particular checkpoint has been fine-tuned with a learning rate of 5.0e-6 for 4 epochs on approximately 80k pony text-image pairs using tags from derpibooru which all have score greater than `500` and belong to categories `safe` or `suggestive`.", "api_call": "", "model_name": "AstraliteHeart/pony-diffusion"}
{"domain": "Audio Text-to-Speech", "framework": "Fairseq", "functionality": "Text-to-Speech", "api_name": "facebook/tts_transformer-es-css10", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/tts_transformer-es-css10')", "performance": {"dataset": "CSS10", "accuracy": null}, "description": "Transformer text-to-speech model from fairseq S^2. Spanish single-speaker male voice trained on CSS10.", "model_name": "facebook/tts_transformer-es-css10"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-gmq-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: North Germanic languages \n* target group: English \n*  OPUS readme: [gmq-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/gmq-eng/README.md)\n\n*  model: transformer\n* source language(s): dan fao isl nno nob nob_Hebr non_Latn swe\n* target language(s): eng\n* model: transformer\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus2m-2020-07-26.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/gmq-eng/opus2m-2020-07-26.zip)\n* test set translations: [opus2m-2020-07-26.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/gmq-eng/opus2m-2020-07-26.test.txt)\n* test set scores: [opus2m-2020-07-26.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/gmq-eng/opus2m-2020-07-26.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-gmq-en"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-cs-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: cs\n* target languages: en\n*  OPUS readme: [cs-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/cs-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cs-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-cs-en"}
{"domain": "Audio Text-to-Speech", "framework": "ESPnet", "functionality": "Text-to-Speech", "api_name": "kan-bayashi_ljspeech_vits", "api_call": "pipeline('text-to-speech', model='espnet/kan-bayashi_ljspeech_vits')", "performance": {"dataset": "ljspeech", "accuracy": "Not mentioned"}, "description": "A Text-to-Speech model trained on the ljspeech dataset using the ESPnet toolkit. This model can be used to convert text input into synthesized speech.", "model_name": "espnet/kan-bayashi_ljspeech_vits"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "Salesforce/codet5-small", "performance": {"dataset": "", "accuracy": null}, "description": "pre-trained codet5 model. it was introduced in the paper by yue wang, weishi wang, shafiq joty, steven c.h. hoi and first released in . disclaimer: the team releasing codet5 did not write a model card for this model so this model card has been written by the hugging face team more specifically, . from the abstract: \"we present codet5, a unified pre-trained encoder-decoder transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better nl-pl alignment. comprehensive experiments show that codet5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including pl-nl, nl-pl, and pl-pl. further analysis reveals that our model can better capture semantic information from code.\"", "api_call": "", "model_name": "Salesforce/codet5-small"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "grammarly/coedit-large", "performance": {"dataset": "", "accuracy": null}, "description": "this model was obtained by fine-tuning the corresponding `google/flan-t5-large` model on the coedit dataset. details of the dataset can be found in our paper and repository. paper: coedit: text editing by task-specific instruction tuning authors: vipul raheja, dhruv kumar, ryan koo, dongyeop kang - languages nlp : english - finetuned from model: google/flan-t5-large", "api_call": "", "model_name": "grammarly/coedit-large"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "facebook/wmt19-de-en", "performance": {"dataset": "", "accuracy": null}, "description": "this is a ported version of for de-en. for more details, please see, . the abbreviation fsmt stands for fairseqmachinetranslation all four models are available:", "api_call": "", "model_name": "facebook/wmt19-de-en"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "lcw99/t5-base-korean-text-summary", "performance": {"dataset": "", "accuracy": null}, "description": "this model is a fine-tuned version of the paust/pko-t5-base model, trained on the aihub \"summary and report generation data\". this model provides a short summary of long sentences in korean.", "api_call": "", "model_name": "lcw99/t5-base-korean-text-summary"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Embeddings", "api_name": "sentence-transformers/paraphrase-MiniLM-L6-v2", "api_call": "SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')", "performance": {"dataset": "https://seb.sbert.net", "accuracy": "Not provided"}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/paraphrase-MiniLM-L6-v2"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "Davlan/afro-xlmr-base", "performance": {"dataset": "", "accuracy": null}, "description": "afroxlmr-base was created by mlm adaptation of xlm-r-base model on 17 african languages afrikaans, amharic, hausa, igbo, malagasy, chichewa, oromo, naija, kinyarwanda, kirundi, shona, somali, sesotho, swahili, isixhosa, yoruba, and isizulu covering the major african language families and 3 high-resource languages arabic, french, and english. language xlm-r-minilm xlm-r-base xlm-r-large afro-xlmr-base afro-xlmr-small afro-xlmr-mini - - - - - - -", "api_call": "", "model_name": "Davlan/afro-xlmr-base"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "funnel-transformer/small", "performance": {"dataset": "", "accuracy": null}, "description": "pretrained model on english language using a similar objective as . it was introduced in and first released in . this model is uncased: it does not make a difference funnel transformer is a transformers model pretrained on a large corpus of english data in a self-supervised fashion. this means it was pretrained on the raw texts only, with no humans labelling them in any way which is why it can use lots of publicly available data with an automatic process to generate inputs and labels from those texts. more precisely, a small language model corrupts the input texts and serves as a generator of inputs for this model, and the pretraining objective is to predict which token is an original and which one has been replaced, a bit like a gan training. this way, the model learns an inner representation of the english language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the bert model as inputs.", "api_call": "", "model_name": "funnel-transformer/small"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "apanc/russian-inappropriate-messages", "performance": {"dataset": "", "accuracy": null}, "description": "the 'inappropriateness' substance we tried to collect in the dataset and detect with the model is not a substitution of toxicity , it is rather a derivative of toxicity. so the model based on our dataset could serve as an additional layer of inappropriateness filtering after toxicity and obscenity filtration . you can detect the exact sensitive topic by using . the proposed pipeline is shown in the scheme below. you can also train one classifier for both toxicity and inappropriateness detection. the data to be mixed with toxic labelled samples could be found on our or on this model is trained on the dataset of inappropriate messages of the russian language. generally, an inappropriate utterance is an utterance that has not obscene words or any kind of toxic intent, but can still harm the reputation of the speaker. find some sample for more intuition in the table below. learn more about the concept of inappropriateness presented at the workshop for balto-slavic nlp at the eacl-2021 conference. please note that this article describes the first version of the dataset, while the model is trained on the extended version of the dataset open-sourced on our or on . the properties of the dataset are the same as the one described in the article, the only difference is the size.", "api_call": "", "model_name": "apanc/russian-inappropriate-messages"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Spoken Language Identification", "api_name": "TalTechNLP/voxlingua107-epaca-tdnn", "api_call": "EncoderClassifier.from_hparams(source='TalTechNLP/voxlingua107-epaca-tdnn')", "performance": {"dataset": "VoxLingua107", "accuracy": "93%"}, "description": "This is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain. The model uses the ECAPA-TDNN architecture that has previously been used for speaker recognition. The model can classify a speech utterance according to the language spoken. It covers 107 different languages.", "model_name": "TalTechNLP/voxlingua107-epaca-tdnn"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Fill-Mask", "api_name": "cl-tohoku/bert-base-japanese-char", "api_call": "AutoModelForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-char')", "performance": {"dataset": "wikipedia", "accuracy": "N/A"}, "description": "This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.", "model_name": "cl-tohoku/bert-base-japanese-char"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "fabiochiu/t5-small-medium-title-generation", "performance": {"dataset": "", "accuracy": null}, "description": "this model is fine-tuned on the dataset for predicting article titles using the article textual content as input. there are two versions of the model: - : trained from . - : trained from . visit the to try the model with different text generation parameters.", "api_call": "", "model_name": "fabiochiu/t5-small-medium-title-generation"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-Video Generation", "api_name": "mo-di-bear-guitar", "api_call": "TuneAVideoPipeline.from_pretrained('nitrosocke/mo-di-diffusion', unet=UNet3DConditionModel.from_pretrained('Tune-A-Video-library/mo-di-bear-guitar', subfolder='unet', torch_dtype=torch.float16), torch_dtype=torch.float16)", "performance": {"dataset": "Not mentioned", "accuracy": "Not mentioned"}, "description": "Tune-A-Video is a text-to-video generation model based on the Hugging Face framework. The model generates videos based on textual prompts in a modern Disney style.", "model_name": "nitrosocke/mo-di-diffusion"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-lv-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: lv\n* target languages: en\n*  OPUS readme: [lv-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/lv-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/lv-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/lv-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/lv-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-lv-en"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "gerulata/slovakbert", "performance": {"dataset": "", "accuracy": null}, "description": "slovakbert pretrained model on slovak language using a masked language modeling mlm objective. this model is case-sensitive: it makes a difference between slovensko and Slovensko. you can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. important : the model was not trained on the \u201c and \u201d direct quote characters - so before tokenizing the text, it is advised to replace all \u201c and \u201d direct quote marks with a single \" (double quote mark).", "api_call": "", "model_name": "gerulata/slovakbert"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Sentence-Transformers", "api_name": "pkshatech/simcse-ja-bert-base-clcmlp", "performance": {"dataset": "", "accuracy": null}, "description": "model name: `pkshatech/simcse-ja-bert-base-clcmlp` this is a japanese model. you can easily extract sentence embedding representations from japanese sentences. this model is based on and trained on dataset, which is a japanese natural language inference dataset. you can use this model easily with .", "api_call": "", "model_name": "pkshatech/simcse-ja-bert-base-clcmlp"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "uer/roberta-base-finetuned-chinanews-chinese", "performance": {"dataset": "", "accuracy": null}, "description": "this is the set of 5 chinese roberta-base classification models fine-tuned by , which is introduced in . besides, the models could also be fine-tuned by introduced in , which inherits uer-py to support models with parameters above one billion, and extends it to a multimodal pre-training framework. you can download the 5 chinese roberta-base classification models either from the , or via huggingface from the links below:  dataset link   :: ::   jd full roberta-base-finetuned-jd-full-chinese jd full   jd binary roberta-base-finetuned-jd-binary-chinese jd binary   dianping roberta-base-finetuned-dianping-chinese dianping   ifeng roberta-base-finetuned-ifeng-chinese ifeng   chinanews roberta-base-finetuned-chinanews-chinese chinanews", "api_call": "", "model_name": "uer/roberta-base-finetuned-chinanews-chinese"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "julien-c/hotdog-not-hotdog", "api_call": "pipeline('image-classification', model='julien-c/hotdog-not-hotdog')", "performance": {"dataset": "", "accuracy": 0.8250000000000001}, "description": "A model that classifies images as hotdog or not hotdog.", "model_name": "julien-c/hotdog-not-hotdog"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "aychang/roberta-base-imdb", "performance": {"dataset": "", "accuracy": null}, "description": "a simple base roberta model trained on the \"imdb\" dataset. this is a minimal language model trained on a benchmark dataset.", "api_call": "", "model_name": "aychang/roberta-base-imdb"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-fi-de", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: fi\n* target languages: de\n*  OPUS readme: [fi-de](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/fi-de/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-04.zip](https://object.pouta.csc.fi/OPUS-MT-models/fi-de/opus-2019-12-04.zip)\n* test set translations: [opus-2019-12-04.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/fi-de/opus-2019-12-04.test.txt)\n* test set scores: [opus-2019-12-04.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/fi-de/opus-2019-12-04.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-fi-de"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Document Question Answering", "api_name": "impira/layoutlm-document-qa", "api_call": "pipeline('question-answering', model=LayoutLMForQuestionAnswering.from_pretrained('impira/layoutlm-document-qa', return_dict=True))", "performance": {"dataset": ["SQuAD2.0", "DocVQA"], "accuracy": "Not provided"}, "description": "A fine-tuned version of the multi-modal LayoutLM model for the task of question answering on documents.", "model_name": "impira/layoutlm-document-qa"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "saltacc/anime-ai-detect", "api_call": "pipeline('image-classification', model='saltacc/anime-ai-detect')", "performance": {"dataset": "aibooru and imageboard sites", "accuracy": "96%"}, "description": "A BEiT classifier to see if anime art was made by an AI or a human.", "model_name": "saltacc/anime-ai-detect"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Diffusers", "api_name": "lllyasviel/control_v11e_sd15_shuffle", "performance": {"dataset": "", "accuracy": null}, "description": "controlnet v1.1 was released in by . this checkpoint is a conversion of into `diffusers` format. it can be used in combination with stable diffusion , such as . controlnet was proposed in by lvmin zhang, maneesh agrawala. the abstract reads as follows:  we present a neural network structure, controlnet, to control pretrained large diffusion models to support additional input conditions. the controlnet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small.", "api_call": "", "model_name": "lllyasviel/control_v11e_sd15_shuffle"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "uer/gpt2-chinese-cluecorpussmall", "performance": {"dataset": "", "accuracy": null}, "description": "the set of gpt2 models, except for gpt2-xlarge model, are pre-trained by , which is introduced in . the gpt2-xlarge model is pre-trained by introduced in , which inherits uer-py to support models with parameters above one billion, and extends it to a multimodal pre-training framework. besides, the other models could also be pre-trained by tencentpretrain. the model is used to generate chinese texts. you can download the set of chinese gpt2 models either from the , or via huggingface from the links below:  link    ::   gpt2-distil l 6/h 768 distil   gpt2 l 12/h 768 base   gpt2-medium l 24/h 1024 medium   gpt2-large l 36/h 1280 large   gpt2-xlarge l 48/h 1600 xlarge  note that the 6-layer model is called gpt2-distil model because it follows the configuration of , and the pre-training does not involve the supervision of larger models.", "api_call": "", "model_name": "uer/gpt2-chinese-cluecorpussmall"}
{"domain": "Audio Audio-to-Audio", "framework": "Fairseq", "functionality": "Speech-to-speech translation", "api_name": "xm_transformer_unity_hk-en", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/xm_transformer_unity_hk-en')", "performance": {"dataset": ["TED", "drama", "TAT"], "accuracy": "Not specified"}, "description": "A speech-to-speech translation model with two-pass decoder (UnitY) trained on Hokkien-English data from TED, drama, and TAT domains. It uses Facebook's Unit HiFiGAN for speech synthesis.", "model_name": "facebook/xm_transformer_unity_hk-en"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "flax-community/t5-recipe-generation", "performance": {"dataset": "", "accuracy": null}, "description": "This model was developed as part of the Flax/JAX Community Week, organized by Hugging Face, with TPU usage sponsored by Google. A demo is available on Hugging Face Spaces.", "api_call": "", "model_name": "flax-community/t5-recipe-generation"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-et-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: et\n* target languages: en\n*  OPUS readme: [et-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/et-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/et-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/et-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/et-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-et-en"}
{"domain": "Natural Language Processing Text Ranking", "framework": "Hugging Face Sentence-Transformers", "api_name": "cross-encoder/quora-distilroberta-base", "performance": {"dataset": "", "accuracy": null}, "description": "This model was trained using the SentenceTransformers Cross-Encoder class on the Quora Duplicate Questions dataset. It predicts a score between 0 and 1 indicating how likely it is that two given questions are duplicates. Note: the model is not suitable for estimating the similarity of questions. For example, the two questions \"how to learn java\" and \"how to learn python\" will result in a rather low score, as these are not duplicates.", "api_call": "", "model_name": "cross-encoder/quora-distilroberta-base"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "cambridgeltl/SapBERT-from-PubMedBERT-fulltext", "api_call": "AutoModel.from_pretrained('cambridgeltl/SapBERT-from-PubMedBERT-fulltext')", "performance": {"dataset": "UMLS", "accuracy": "N/A"}, "description": "SapBERT is a pretraining scheme that self-aligns the representation space of biomedical entities. It is trained with UMLS 2020AA (English only) and uses microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. The input should be a string of biomedical entity names, and the [CLS] embedding of the last layer is regarded as the output.", "model_name": "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "baichuan-inc/Baichuan-13B-Chat", "performance": {"dataset": "", "accuracy": null}, "description": "Baichuan-13B-Chat is the aligned (chat) version of Baichuan-13B. Baichuan-13B has 13 billion parameters and was trained on 1.4 trillion tokens, about 40% more than LLaMA-13B. It uses ALiBi position encoding and has a context window of 4096 tokens.", "api_call": "", "model_name": "baichuan-inc/Baichuan-13B-Chat"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "Onodofthenorth/SD_PixelArt_SpriteSheet_Generator", "performance": {"dataset": "", "accuracy": null}, "description": "This Stable Diffusion checkpoint allows you to generate pixel art sprite sheets from four different angles. The first sample images are my results after merging this model with another model trained on my wife. Merging another model with this one is the easiest way to get a consistent character in each view. It still requires some experimentation with img2img settings to get the results you want. For left and right views, I suggest picking your best result and mirroring it. Once you are satisfied, take the image into Photoshop or Krita, remove the background, and scale it to the desired size. You can then scale back up to display your results; this also clears up some of the color murkiness in the initial outputs. This model can be used just like any other Stable Diffusion model.", "api_call": "", "model_name": "Onodofthenorth/SD_PixelArt_SpriteSheet_Generator"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-no-de", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Norwegian \n* target group: German \n*  OPUS readme: [nor-deu](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/nor-deu/README.md)\n\n*  model: transformer-align\n* source language(s): nno nob\n* target language(s): deu\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm4k,spm4k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/nor-deu/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/nor-deu/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/nor-deu/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-no-de"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-fi-sv", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: fi\n* target languages: sv\n*  OPUS readme: [fi-sv](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/fi-sv/README.md)\n\n*  dataset: opus+bt\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus+bt-2020-04-11.zip](https://object.pouta.csc.fi/OPUS-MT-models/fi-sv/opus+bt-2020-04-11.zip)\n* test set translations: [opus+bt-2020-04-11.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/fi-sv/opus+bt-2020-04-11.test.txt)\n* test set scores: [opus+bt-2020-04-11.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/fi-sv/opus+bt-2020-04-11.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-fi-sv"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Sentence Transformers", "api_name": "sentence-transformers/all-mpnet-base-v2", "api_call": "SentenceTransformer('sentence-transformers/all-mpnet-base-v2')", "performance": {"dataset": [{"name": "MS Marco", "accuracy": "Not provided"}]}, "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.", "model_name": "sentence-transformers/all-mpnet-base-v2"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "api_name": "dima806/facial_emotions_image_detection", "performance": {"dataset": "", "accuracy": null}, "description": null, "api_call": "", "model_name": "dima806/facial_emotions_image_detection"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "nitrosocke/archer-diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "This is a fine-tuned Stable Diffusion model trained on screenshots from the TV show Archer. Use the token `archer style` in your prompts for the effect.", "api_call": "", "model_name": "nitrosocke/archer-diffusion"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "Rakib/roberta-base-on-cuad", "api_call": "AutoModelForQuestionAnswering.from_pretrained('Rakib/roberta-base-on-cuad')", "performance": {"dataset": "cuad", "accuracy": "46.6%"}, "description": "This model is trained for the task of Question Answering on Legal Documents using the CUAD dataset. It is based on the RoBERTa architecture and can be used to extract answers from legal contracts and documents.", "model_name": "Rakib/roberta-base-on-cuad"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "staka/fugumt-ja-en", "performance": {"dataset": "", "accuracy": null}, "description": "This is a translation model using Marian-NMT.\nFor more details, please see [my repository](https://github.com/s-taka/fugumt).\n\n* source language: ja\n* target language: en \n\n### How to use\n\nThis model uses transformers and sentencepiece.\n```python\n!pip install transformers sentencepiece\n```\n\nYou can use this model directly with a pipeline:\n\n```python\nfrom transformers import pipeline\nfugu_translator = pipeline('translation', model='staka/fugumt-ja-en')\nfugu_translator('\u732b\u306f\u304b\u308f\u3044\u3044\u3067\u3059\u3002')\n```\n\n### Eval results\n\nThe results of the evaluation using [tatoeba](https://tatoeba.org/ja)(randomly selected 500 sentences) are as follows:\n\n|source |target |BLEU(*1)| \n|-------|-------|--------|\n|ja     |en     |39.1    |\n\n(*1) sacrebleu", "api_call": "", "model_name": "staka/fugumt-ja-en"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "Realistic_Vision_V1.4", "api_call": "pipeline('text-to-image', model='SG161222/Realistic_Vision_V1.4')", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "Realistic_Vision_V1.4 is a text-to-image model that generates high-quality and detailed images based on textual prompts. It can be used for various applications such as generating realistic portraits, landscapes, and other types of images.", "model_name": "SG161222/Realistic_Vision_V1.4"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "anferico/bert-for-patents", "performance": {"dataset": "", "accuracy": null}, "description": "BERT for Patents is a model trained by Google on 100M+ patents (not just US patents). It is based on BERT-large. The original release includes a TensorFlow checkpoint.", "api_call": "", "model_name": "anferico/bert-for-patents"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "rinna/japanese-gpt2-medium", "performance": {"dataset": "", "accuracy": null}, "description": "This repository provides a medium-sized Japanese GPT-2 model. The model was trained using code from the GitHub repository rinnakk/japanese-pretrained-models by rinna Co., Ltd. It can be loaded with `AutoTokenizer` and `AutoModelForCausalLM` from the transformers library.", "api_call": "", "model_name": "rinna/japanese-gpt2-medium"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "setu4993/LaBSE", "api_call": "BertModel.from_pretrained('setu4993/LaBSE')", "performance": {"dataset": "CommonCrawl and Wikipedia", "accuracy": "Not Specified"}, "description": "Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.", "model_name": "setu4993/LaBSE"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "w11wo/indonesian-roberta-base-sentiment-classifier", "performance": {"dataset": "", "accuracy": null}, "description": "Indonesian RoBERTa Base Sentiment Classifier is a sentiment text-classification model based on the RoBERTa model. The model was originally the pre-trained Indonesian RoBERTa Base model, which was then fine-tuned on IndoNLU's `SmSA` dataset consisting of Indonesian comments and reviews. After training, the model achieved an evaluation accuracy of 94.36% and F1-macro of 92.42%. On the benchmark test set, the model achieved an accuracy of 93.2% and F1-macro of 91.02%. Hugging Face's `Trainer` class from the Transformers library was used to train the model. PyTorch was used as the backend framework during training, but the model remains compatible with other frameworks.", "api_call": "", "model_name": "w11wo/indonesian-roberta-base-sentiment-classifier"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-hu-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: hu\n* target languages: en\n*  OPUS readme: [hu-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/hu-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/hu-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/hu-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/hu-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-hu-en"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "clip-italian/clip-italian", "performance": {"dataset": "", "accuracy": null}, "description": "With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with only 1.4 million training samples. Our Italian CLIP model is built upon the Italian BERT model provided by dbmdz and the OpenAI vision transformer.", "api_call": "", "model_name": "clip-italian/clip-italian"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "fnlp/bart-base-chinese", "performance": {"dataset": "", "accuracy": null}, "description": "12/30/2022: An updated version of CPT and Chinese BART has been released. In the new version, we changed the vocabulary: we replace the old BERT vocabulary with a larger one of size 51271 built from the training data, in which we (1) add 6800+ missing Chinese characters (most of them traditional Chinese characters); (2) remove redundant tokens (e.g. Chinese character tokens with a prefix); (3) add some English tokens to reduce OOV. This is an implementation of Chinese BART-Base. Authors: Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Fei Yang, Li Zhe, Hujun Bao, Xipeng Qiu.", "api_call": "", "model_name": "fnlp/bart-base-chinese"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-th-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Thai \n* target group: English \n*  OPUS readme: [tha-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/tha-eng/README.md)\n\n*  model: transformer-align\n* source language(s): tha\n* target language(s): eng\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/tha-eng/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/tha-eng/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/tha-eng/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-th-en"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "api_name": "moussaKam/barthez-orangesum-title", "performance": {"dataset": "", "accuracy": null}, "description": "finetuning: examples/seq2seq/ (as of Nov 06, 2020)\n\nMetrics: ROUGE-2 > 23\n\npaper: https://arxiv.org/abs/2010.12321 \\\ngithub: https://github.com/moussaKam/BARThez\n\n```\n@article{eddine2020barthez,\n  title={BARThez: a Skilled Pretrained French Sequence-to-Sequence Model},\n  author={Eddine, Moussa Kamal and Tixier, Antoine J-P and Vazirgiannis, Michalis},\n  journal={arXiv preprint arXiv:2010.12321},\n  year={2020}\n}\n```", "api_call": "", "model_name": "moussaKam/barthez-orangesum-title"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-de-fi", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: de\n* target languages: fi\n*  OPUS readme: [de-fi](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/de-fi/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2020-01-08.zip)\n* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2020-01-08.test.txt)\n* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fi/opus-2020-01-08.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-de-fi"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "McGill-NLP/roberta-large-faithcritic", "performance": {"dataset": "", "accuracy": null}, "description": "Model description: roberta-large-faithcritic is the RoBERTa large model fine-tuned on FaithCritic, a derivative of the FaithDial dataset. The objective is to predict whether an utterance is faithful or not, given the source knowledge.", "api_call": "", "model_name": "McGill-NLP/roberta-large-faithcritic"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "api_name": "mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es", "performance": {"dataset": "", "accuracy": null}, "description": "This model is a distilled version of BETO fine-tuned on SQuAD-es-v2.0 for Q&A. Distillation makes the model smaller, faster, cheaper and lighter than the non-distilled version. It was fine-tuned on the same dataset as that version, but with distillation applied during the process and one more training epoch.", "api_call": "", "model_name": "mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es"}
{"domain": "Natural Language Processing Summarization", "framework": "Transformers", "functionality": "text2text-generation", "api_name": "financial-summarization-pegasus", "api_call": "PegasusForConditionalGeneration.from_pretrained('human-centered-summarization/financial-summarization-pegasus')", "performance": {"dataset": "xsum", "accuracy": {"ROUGE-1": 35.206, "ROUGE-2": 16.569, "ROUGE-L": 30.128, "ROUGE-LSUM": 30.171}}, "description": "This model was fine-tuned on a novel financial news dataset, which consists of 2K articles from Bloomberg, on topics such as stock, markets, currencies, rate and cryptocurrencies. It is based on the PEGASUS model and in particular PEGASUS fine-tuned on the Extreme Summarization (XSum) dataset: google/pegasus-xsum model. PEGASUS was originally proposed by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu in PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization.", "model_name": "human-centered-summarization/financial-summarization-pegasus"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "lucas-leme/FinBERT-PT-BR", "performance": {"dataset": "", "accuracy": null}, "description": "FinBERT-PT-BR is a pre-trained NLP model for analyzing the sentiment of Brazilian Portuguese financial texts. The model was trained in two main stages: language modeling and sentiment modeling. In the first stage, a language model was trained on more than 1.4 million financial news texts in Portuguese. Building on this first training, it was possible to train a sentiment classifier with only a few labeled texts (500) that showed satisfactory convergence.", "api_call": "", "model_name": "lucas-leme/FinBERT-PT-BR"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "facebook/wmt19-ru-en", "performance": {"dataset": "", "accuracy": null}, "description": "This is a ported version of the fairseq WMT19 transformer for ru-en. The abbreviation FSMT stands for FairSeqMachineTranslation.", "api_call": "", "model_name": "facebook/wmt19-ru-en"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "PygmalionAI/pygmalion-6b", "performance": {"dataset": "", "accuracy": null}, "description": "Pygmalion 6B is a proof-of-concept dialogue model based on EleutherAI's GPT-J-6B. Warning: this model is not suitable for use by minors. It will output X-rated content under certain circumstances. The fine-tuning dataset consisted of 56MB of dialogue data gathered from multiple sources, which includes both real and partially machine-generated conversations.", "api_call": "", "model_name": "PygmalionAI/pygmalion-6b"}
{"domain": "Computer Vision Video Classification", "framework": "Hugging Face Transformers", "functionality": "Feature Extraction", "api_name": "microsoft/xclip-base-patch16-zero-shot", "api_call": "XCLIPModel.from_pretrained('microsoft/xclip-base-patch16-zero-shot')", "performance": {"dataset": [{"name": "HMDB-51", "accuracy": 44.6}, {"name": "UCF-101", "accuracy": 72.0}, {"name": "Kinetics-600", "accuracy": 65.2}]}, "description": "X-CLIP is a minimal extension of CLIP for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs. This allows the model to be used for tasks like zero-shot, few-shot or fully supervised video classification and video-text retrieval.", "model_name": "microsoft/xclip-base-patch16-zero-shot"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "api_name": "NchuNLP/Chinese-Question-Answering", "performance": {"dataset": "", "accuracy": null}, "description": "This model was fine-tuned on the DRCD dataset, using question-answer pairs for the task of question answering. Contact: Han Cheng Yu (boy19990222@gmail.com), Yao-Chung Fan (yfan@nchu.edu.tw).", "api_call": "", "model_name": "NchuNLP/Chinese-Question-Answering"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "dccuchile/bert-base-spanish-wwm-uncased", "performance": {"dataset": "", "accuracy": null}, "description": "BETO is a BERT model trained on a large Spanish corpus. BETO is similar in size to BERT-Base and was trained with the whole word masking technique. TensorFlow and PyTorch checkpoints are available for the uncased and cased versions, along with results on Spanish benchmarks comparing BETO with other models, including non-BERT-based ones.", "api_call": "", "model_name": "dccuchile/bert-base-spanish-wwm-uncased"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "flaubert/flaubert_base_cased", "performance": {"dataset": "", "accuracy": null}, "description": "FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) supercomputer. Along with FlauBERT comes FLUE: an evaluation setup for French NLP systems similar to the popular GLUE benchmark. The goal is to enable further reproducible experiments in the future and to share models and progress on the French language.", "api_call": "", "model_name": "flaubert/flaubert_base_cased"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Diffusers", "api_name": "ioclab/control_v1p_sd15_brightness", "performance": {"dataset": "", "accuracy": null}, "description": "This model brings brightness control to Stable Diffusion, allowing users to colorize grayscale images or recolor generated images.", "api_call": "", "model_name": "ioclab/control_v1p_sd15_brightness"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-video-synthesis", "api_name": "damo-vilab/text-to-video-ms-1.7b", "api_call": "DiffusionPipeline.from_pretrained('damo-vilab/text-to-video-ms-1.7b', torch_dtype=torch.float16, variant='fp16')", "performance": {"dataset": "Webvid", "accuracy": "Not specified"}, "description": "A multi-stage text-to-video generation diffusion model that inputs a description text and returns a video that matches the text description. The model consists of three sub-networks: text feature extraction model, text feature-to-video latent space diffusion model, and video latent space to video visual space model. It supports English input only and has a wide range of applications.", "model_name": "damo-vilab/text-to-video-ms-1.7b"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "shibing624/chinese-alpaca-plus-7b-hf", "performance": {"dataset": "", "accuracy": null}, "description": "**Release of the Chinese LLaMA, Alpaca Plus (7B) model**\n\n\nThe Chinese LLaMA, Alpaca Plus (7B) release improves on the base version as follows:\n\n- The training data was further expanded: LLaMA to 120G of text (general domain) and Alpaca to 4M instruction examples (with a focus on added STEM-related data)\n- Alpaca was trained with a larger rank and achieves a lower validation loss than the original version\n- Evaluation results show that Alpaca-Plus-7B outperforms the base Alpaca-7B, and on some tasks approaches or exceeds the 13B version\n- In this round of comparison, 7B scored 65.3, 13B scored 70.9, and Plus-7B scored 75.3; for detailed results see the [evaluation](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/examples/README.md)\n\nThis model contains the weights `chinese-alpaca-plus-7b-hf`, obtained by merging the `Chinese LLaMA LoRA` and `Chinese Alpaca LoRA` into the `original LLaMA-7B` and converting the result to HuggingFace-format weights (.bin files). It can be used directly or trained further.\n\n13b-hf weights: https://huggingface.co/shibing624/chinese-alpaca-plus-13b-hf\n\ntest case:\n\n|input_text|predict|\n|:-- |:--- |\n|\u4e3a\u4ec0\u4e48\u5929\u7a7a\u662f\u84dd\u8272\u7684\uff1f (Why is the sky blue?)|\u5929\u7a7a\u662f\u84dd\u8272\u7684\uff0c\u662f\u56e0\u4e3a\u5927\u6c14\u5c42\u4e2d\u7684\u6c14\u4f53\u5206\u5b50\u4f1a\u6563\u5c04\u592a\u9633\u5149\u4e2d\u7684\u84dd\u8272\u5149\uff0c\u4f7f\u5f97\u6211\u4eec\u770b\u5230\u7684\u5929\u7a7a\u662f\u84dd\u8272\u7684\u3002 (The sky is blue because gas molecules in the atmosphere scatter the blue light in sunlight, so the sky we see is blue.)|", "api_call": "", "model_name": "shibing624/chinese-alpaca-plus-7b-hf"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-id", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: id\n*  OPUS readme: [en-id](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-id/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-id/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-id/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-id/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-id"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "pierreguillou/gpt2-small-portuguese", "performance": {"dataset": "", "accuracy": null}, "description": "GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model. It was trained on Portuguese Wikipedia using transfer learning and fine-tuning techniques in just over a day, on one NVIDIA V100 32GB GPU and with a little more than 1GB of training data. It is a proof of concept that it is possible to get a state-of-the-art language model in any language with low resources. It was fine-tuned from the English pre-trained GPT-2 small using the Hugging Face libraries Transformers and Tokenizers wrapped into the fastai v2 deep learning framework. All the fastai v2 fine-tuning techniques were used. It is now available on Hugging Face.", "api_call": "", "model_name": "pierreguillou/gpt2-small-portuguese"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "tblard/tf-allocine", "performance": {"dataset": "", "accuracy": null}, "description": "A French sentiment analysis model, based on CamemBERT and fine-tuned on a large-scale dataset scraped from Allociné.fr user reviews.", "api_call": "", "model_name": "tblard/tf-allocine"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "google/muril-base-cased", "performance": {"dataset": "", "accuracy": null}, "description": "MuRIL: Multilingual Representations for Indian Languages. MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. The pre-trained model is released in this repository with the MLM layer intact, enabling masked word predictions. The encoder has also been released separately with an additional pre-processing module that processes raw text into the expected input format for the encoder.", "api_call": "", "model_name": "google/muril-base-cased"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-sla", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: English \n* target group: Slavic languages \n*  OPUS readme: [eng-sla](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-sla/README.md)\n\n*  model: transformer\n* source language(s): eng\n* target language(s): bel bel_Latn bos_Latn bul bul_Latn ces csb_Latn dsb hrv hsb mkd orv_Cyrl pol rue rus slv srp_Cyrl srp_Latn ukr\n* model: transformer\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)\n* download original weights: [opus2m-2020-08-01.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-sla/opus2m-2020-08-01.zip)\n* test set translations: [opus2m-2020-08-01.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-sla/opus2m-2020-08-01.test.txt)\n* test set scores: [opus2m-2020-08-01.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-sla/opus2m-2020-08-01.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-sla"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "speech-enhancement", "api_name": "speechbrain/metricgan-plus-voicebank", "api_call": "SpectralMaskEnhancement.from_hparams(source='speechbrain/metricgan-plus-voicebank', savedir='pretrained_models/metricgan-plus-voicebank')", "performance": {"dataset": "Voicebank", "accuracy": {"Test PESQ": "3.15", "Test STOI": "93.0"}}, "description": "MetricGAN-trained model for Enhancement", "model_name": "speechbrain/metricgan-plus-voicebank"}
{"domain": "Audio Audio Classification", "framework": "PyTorch Transformers", "functionality": "Emotion Recognition", "api_name": "superb/wav2vec2-base-superb-er", "api_call": "pipeline('audio-classification', model='superb/wav2vec2-base-superb-er')", "performance": {"dataset": "IEMOCAP", "accuracy": 0.6258}, "description": "This is a ported version of S3PRL's Wav2Vec2 for the SUPERB Emotion Recognition task. The base model is wav2vec2-base, which is pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. For more information refer to SUPERB: Speech processing Universal PERformance Benchmark.", "model_name": "superb/wav2vec2-base-superb-er"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "MilaNLProc/feel-it-italian-sentiment", "performance": {"dataset": "", "accuracy": null}, "description": "you can find the package that uses this model for emotion and sentiment classification . it is meant to be a very simple interface over huggingface models. users should refer to the . sentiment analysis is a common task to understand people's reactions online. still, we often need more nuanced information: is the post negative because the user is angry or because they are sad?", "api_call": "", "model_name": "MilaNLProc/feel-it-italian-sentiment"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "superb/wav2vec2-base-superb-ks", "api_call": "pipeline('audio-classification', model='superb/wav2vec2-base-superb-ks')", "performance": {"dataset": "Speech Commands dataset v1.0", "accuracy": {"s3prl": 0.9623, "transformers": 0.9643}}, "description": "Wav2Vec2-Base for Keyword Spotting (KS) task in the SUPERB benchmark. The base model is pretrained on 16kHz sampled speech audio. The KS task detects preregistered keywords by classifying utterances into a predefined set of words. The model is trained on the Speech Commands dataset v1.0.", "model_name": "superb/wav2vec2-base-superb-ks"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "yikuan8/Clinical-Longformer", "performance": {"dataset": "", "accuracy": null}, "description": "clinical-longformer is a clinical knowledge enriched version of longformer that was further pre-trained using mimic-iii clinical notes. it allows up to 4,096 tokens as the model input. clinical-longformer consistently out-performs clinicalbert across 10 baseline datasets by at least 2 percent. those downstream experiments broadly cover named entity recognition (ner), question answering (qa), natural language inference (nli) and text classification tasks. for more details, please refer to . we also provide a sister model at . we initialized clinical-longformer from the pre-trained weights of the base version of longformer. the pre-training process was distributed in parallel to 6 32gb tesla v100 gpus. fp16 precision was enabled to accelerate training. we pre-trained clinical-longformer for 200,000 steps with batch size of 6\u00d73. the learning rates were 3e-5 for both models. the entire pre-training process took more than 2 weeks.", "api_call": "", "model_name": "yikuan8/Clinical-Longformer"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "api_name": "Idan0405/ClipMD", "performance": {"dataset": "", "accuracy": null}, "description": "clipmd is a medical image-text matching model based on openai's clip model with a sliding window text encoder. the model uses a vit-b/32 transformer architecture as an image encoder and uses a masked sliding window self-attention transformer as a text encoder. these encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. the model was fine-tuned on the .", "api_call": "", "model_name": "Idan0405/ClipMD"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-pl-fr", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: pl\n* target languages: fr\n*  OPUS readme: [pl-fr](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/pl-fr/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-16.zip](https://object.pouta.csc.fi/OPUS-MT-models/pl-fr/opus-2020-01-16.zip)\n* test set translations: [opus-2020-01-16.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/pl-fr/opus-2020-01-16.test.txt)\n* test set scores: [opus-2020-01-16.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/pl-fr/opus-2020-01-16.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-pl-fr"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Diffusers", "api_name": "instruction-tuning-sd/cartoonizer", "performance": {"dataset": "", "accuracy": null}, "description": "this pipeline is an 'instruction-tuned' version of . it was fine-tuned from the existing . motivation behind this pipeline partly comes from and partly", "api_call": "", "model_name": "instruction-tuning-sd/cartoonizer"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "functionality": "Image Classification", "api_name": "timm/vit_large_patch14_clip_224.openai_ft_in12k_in1k", "api_call": "pipeline('image-classification', model='timm/vit_large_patch14_clip_224.openai_ft_in12k_in1k', framework='pt')", "performance": {"dataset": "", "accuracy": ""}, "description": "A ViT-Large (patch size 14, 224x224 input) image classification model from timm, initialized from OpenAI CLIP weights and fine-tuned on ImageNet-12k and then ImageNet-1k.", "model_name": "timm/vit_large_patch14_clip_224.openai_ft_in12k_in1k"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "cointegrated/rubert-tiny2-cedr-emotion-detection", "performance": {"dataset": "", "accuracy": null}, "description": "this is the model fine-tuned for classification of emotions in russian sentences. the task is multilabel classification, because one sentence can contain multiple emotions. the model on the described in the paper by sboev et al. the model has been trained with the adam optimizer for 40 epochs with learning rate `1e-5` and batch size 64.", "api_call": "", "model_name": "cointegrated/rubert-tiny2-cedr-emotion-detection"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/table-transformer-detection", "api_call": "TableTransformerForObjectDetection.from_pretrained('microsoft/table-transformer-detection')", "performance": {"dataset": "PubTables1M", "accuracy": "Not provided"}, "description": "Table Transformer (DETR) model trained on PubTables1M for detecting tables in documents. Introduced in the paper PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents by Smock et al.", "model_name": "microsoft/table-transformer-detection"}
{"domain": "Multimodal Document Question Answer", "framework": "Hugging Face Transformers", "functionality": "Question Answering", "api_name": "impira/layoutlm-invoices", "api_call": "pipeline('document-question-answering', model='impira/layoutlm-invoices')", "performance": {"dataset": "proprietary dataset of invoices, SQuAD2.0, and DocVQA", "accuracy": "not provided"}, "description": "This is a fine-tuned version of the multi-modal LayoutLM model for the task of question answering on invoices and other documents. It has been fine-tuned on a proprietary dataset of invoices as well as both SQuAD2.0 and DocVQA for general comprehension. Unlike other QA models, which can only extract consecutive tokens (because they predict the start and end of a sequence), this model can predict longer-range, non-consecutive sequences with an additional classifier head.", "model_name": "impira/layoutlm-invoices"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "CAMeL-Lab/bert-base-arabic-camelbert-ca", "performance": {"dataset": "", "accuracy": null}, "description": "camelbert is a collection of bert models pre-trained on arabic texts with different sizes and variants. we release pre-trained language models for modern standard arabic (msa), dialectal arabic (da), and classical arabic (ca), in addition to a model pre-trained on a mix of the three. we also provide additional models that are pre-trained on a scaled-down set of the msa variant (half, quarter, eighth, and sixteenth). the details are described in the paper \".\" this model card describes camelbert-ca (`bert-base-arabic-camelbert-ca`), a model pre-trained on the ca (classical arabic) dataset.\n\n| | model | variant | size | words |\n|-|-|:-:|-:|-:|\n| | `bert-base-arabic-camelbert-mix` | ca,da,msa | 167gb | 17.3b |\n| \u2714 | `bert-base-arabic-camelbert-ca` | ca | 6gb | 847m |\n| | `bert-base-arabic-camelbert-da` | da | 54gb | 5.8b |\n| | `bert-base-arabic-camelbert-msa` | msa | 107gb | 12.6b |\n| | `bert-base-arabic-camelbert-msa-half` | msa | 53gb | 6.3b |\n| | `bert-base-arabic-camelbert-msa-quarter` | msa | 27gb | 3.1b |\n| | `bert-base-arabic-camelbert-msa-eighth` | msa | 14gb | 1.6b |\n| | `bert-base-arabic-camelbert-msa-sixteenth` | msa | 6gb | 746m |", "api_call": "", "model_name": "CAMeL-Lab/bert-base-arabic-camelbert-ca"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "nlpaueb/bert-base-greek-uncased-v1", "performance": {"dataset": "", "accuracy": null}, "description": "A Greek version of BERT pre-trained language model.", "api_call": "", "model_name": "nlpaueb/bert-base-greek-uncased-v1"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "HooshvareLab/bert-fa-base-uncased", "performance": {"dataset": "", "accuracy": null}, "description": "a transformer-based model for persian language understanding. we reconstructed the vocabulary and fine-tuned the parsbert v1.1 on the new persian corpora in order to provide some functionalities for using parsbert in other scopes. please follow the repo for the latest information about previous and current models. parsbert is a monolingual language model based on google\u2019s bert architecture. this model is pre-trained on large persian corpora with various writing styles from numerous subjects (e.g., scientific, novels, news) with more than `3.9m` documents, `73m` sentences, and `1.3b` words.", "api_call": "", "model_name": "HooshvareLab/bert-fa-base-uncased"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-gmw-gmw", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: West Germanic languages\n* target group: West Germanic languages\n* OPUS readme: [gmw-gmw](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/gmw-gmw/README.md)\n\n* model: transformer\n* source language(s): afr ang_Latn deu eng enm_Latn frr fry gos gsw ksh ltz nds nld pdc sco stq swg yid\n* target language(s): afr ang_Latn deu eng enm_Latn frr fry gos gsw ksh ltz nds nld pdc sco stq swg yid\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)\n* download original weights: [opus-2020-07-27.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/gmw-gmw/opus-2020-07-27.zip)\n* test set translations: [opus-2020-07-27.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/gmw-gmw/opus-2020-07-27.test.txt)\n* test set scores: [opus-2020-07-27.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/gmw-gmw/opus-2020-07-27.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-gmw-gmw"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "daveni/twitter-xlm-roberta-emotion-es", "performance": {"dataset": "", "accuracy": null}, "description": "note: this model & model card are based on the . this is an xlm-roberta-base model trained on 198m tweets and finetuned for emotion analysis in spanish. this model was presented to the emoevales competition, part of , where the proposed task was the classification of spanish tweets into seven different classes: anger, disgust, fear, joy, sadness, surprise, and other. we achieved the first position in the competition with a macro-averaged f1 score of 71.70%.", "api_call": "", "model_name": "daveni/twitter-xlm-roberta-emotion-es"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Audio Spectrogram", "api_name": "audio-spectrogram-transformer", "api_call": "ASTModel.from_pretrained('MIT/ast-finetuned-audioset-10-10-0.4593')", "performance": {"dataset": "", "accuracy": ""}, "description": "One custom ast model for testing of HF repos", "model_name": "MIT/ast-finetuned-audioset-10-10-0.4593"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "blip2-flan-t5-xl", "api_call": "Blip2ForConditionalGeneration.from_pretrained('Salesforce/blip2-flan-t5-xl')", "performance": {"dataset": "LAION", "accuracy": "Not provided"}, "description": "BLIP-2 model, leveraging Flan T5-xl (a large language model). It was introduced in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Li et al. and first released in this repository. The goal for the model is to predict the next text token, given the query embeddings and the previous text. This allows the model to be used for tasks like image captioning, visual question answering (VQA), and chat-like conversations by feeding the image and the previous conversation as a prompt to the model.", "model_name": "Salesforce/blip2-flan-t5-xl"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "EimisAnimeDiffusion_1.0v", "api_call": "DiffusionPipeline.from_pretrained('eimiss/EimisAnimeDiffusion_1.0v')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "EimisAnimeDiffusion_1.0v is a text-to-image model trained with high-quality and detailed anime images. It works well on anime and landscape generations and supports a Gradio Web UI.", "model_name": "eimiss/EimisAnimeDiffusion_1.0v"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "NumbersStation/nsql-350M", "performance": {"dataset": "", "accuracy": null}, "description": "nsql is a family of autoregressive open-source large foundation models (fms) designed specifically for sql generation tasks. the checkpoint included in this repository is based on from salesforce and further pre-trained on a dataset of general sql queries and then fine-tuned on a dataset composed of text-to-sql pairs. the general sql queries are the sql subset from , containing 1m training samples. the labeled text-to-sql pairs come from more than 20 public sources across the web from standard datasets. we hold out spider and geoquery datasets for use in evaluation.", "api_call": "", "model_name": "NumbersStation/nsql-350M"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "oliverguhr/german-sentiment-bert", "performance": {"dataset": "", "accuracy": null}, "description": "this model was trained for sentiment classification of german language texts. to achieve the best results, all model inputs need to be preprocessed with the same procedure that was applied during training. to simplify the usage of the model, we provide a python package that bundles the code needed for preprocessing and inference. the model uses google's bert architecture and was trained on 1.834 million german-language samples. the training data contains texts from various domains like twitter, facebook and movie, app and hotel reviews.", "api_call": "", "model_name": "oliverguhr/german-sentiment-bert"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "bolbolzaban/gpt2-persian", "performance": {"dataset": "", "accuracy": null}, "description": "bolbolzaban/gpt2-persian is a gpt2 language model trained with hyperparameters similar to the standard gpt2-medium, with the following differences: 1. the context size is reduced from 1024 to 256 subwords in order to make the training affordable. 2. instead of bpe, the google sentencepiece tokenizer is used for tokenization.", "api_call": "", "model_name": "bolbolzaban/gpt2-persian"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "laion/clap-htsat-unfused", "performance": {"dataset": "", "accuracy": null}, "description": "Model card for CLAP: Contrastive Language-Audio Pretraining\n\n# TL;DR\n\nThe abstract of the paper states that:\n\n> Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public.\n\n# Usage\n\nYou can use this model for zero shot audio classification or extracting audio and/or textual features.", "api_call": "", "model_name": "laion/clap-htsat-unfused"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "soleimanian/financial-roberta-large-sentiment", "performance": {"dataset": "", "accuracy": null}, "description": null, "api_call": "", "model_name": "soleimanian/financial-roberta-large-sentiment"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "distilbert-base-cased", "performance": {"dataset": "", "accuracy": null}, "description": "this model is a distilled version of the . it was introduced in . the code for the distillation process can be found . distilbert is a transformers model, smaller and faster than bert, which was pretrained on the same corpus in a self-supervised fashion, using the bert base model as a teacher. this means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts using the bert base model. more precisely, it was pretrained with three objectives: - distillation loss: the model was trained to return the same probabilities as the bert base model. - masked language modeling (mlm): this is part of the original training loss of the bert base model. when taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. this is different from traditional recurrent neural networks (rnns) that usually see the words one after the other, or from autoregressive models like gpt which internally mask the future tokens. it allows the model to learn a bidirectional representation of the sentence. - cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the bert base model. this way, the model learns the same inner representation of the english language as its teacher model, while being faster for inference or downstream tasks.", "api_call": "", "model_name": "distilbert-base-cased"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-cy-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: cy\n* target languages: en\n*  OPUS readme: [cy-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/cy-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/cy-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-cy-en"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Transformers", "functionality": "Text Classification", "api_name": "typeform/distilbert-base-uncased-mnli", "api_call": "AutoModelForSequenceClassification.from_pretrained('typeform/distilbert-base-uncased-mnli')", "performance": {"dataset": "multi_nli", "accuracy": 0.8206875509}, "description": "This is the uncased DistilBERT model fine-tuned on Multi-Genre Natural Language Inference (MNLI) dataset for the zero-shot classification task.", "model_name": "typeform/distilbert-base-uncased-mnli"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "TurkuNLP/gpt3-finnish-small", "performance": {"dataset": "", "accuracy": null}, "description": "generative pretrained transformer with 186m parameters for finnish. turkunlp finnish gpt-3-models are a model family of pretrained monolingual gpt-style language models that are based on bloom-architecture. note that the models are pure language models, meaning that they are not for dialogue", "api_call": "", "model_name": "TurkuNLP/gpt3-finnish-small"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "opus-mt-de-en", "api_call": "pipeline('translation_de_to_en', model='Helsinki-NLP/opus-mt-de-en')", "performance": {"dataset": "opus", "accuracy": {"newssyscomb2009.de.en": 29.4, "news-test2008.de.en": 27.8, "newstest2009.de.en": 26.8, "newstest2010.de.en": 30.2, "newstest2011.de.en": 27.4, "newstest2012.de.en": 29.1, "newstest2013.de.en": 32.1, "newstest2014-deen.de.en": 34.0, "newstest2015-ende.de.en": 34.2, "newstest2016-ende.de.en": 40.4, "newstest2017-ende.de.en": 35.7, "newstest2018-ende.de.en": 43.7, "newstest2019-deen.de.en": 40.1, "Tatoeba.de.en": 55.4}}, "description": "A German to English translation model trained on the OPUS dataset using the Hugging Face Transformers library.", "model_name": "Helsinki-NLP/opus-mt-de-en"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "JosephusCheung/ACertainThing", "performance": {"dataset": "", "accuracy": null}, "description": "anything3.0 is an overfitted model that takes liberties when it shouldn't be generating human images and certain details. however, the community has given it a high rating, and i believe that is because many lazy people who don't know how to write a prompt can use this overfitted model to generate high-quality images even if their prompts are poorly written. here is an acertain version of anything3.0, made with dreambooth idea of integrated, initialized with .", "api_call": "", "model_name": "JosephusCheung/ACertainThing"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "nitrosocke/Nitro-Diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "welcome to nitro diffusion - the first multi-style model trained from scratch! this is a fine-tuned stable diffusion model trained on three art styles simultaneously while keeping each style separate from the others. this allows for high control of mixing, weighting and single style use. use the tokens archer style, arcane style or modern disney style in your prompts for the effect. you can also use more than one for a mixed style.", "api_call": "", "model_name": "nitrosocke/Nitro-Diffusion"}
{"domain": "Audio Audio-to-Audio", "framework": "Fairseq", "functionality": "speech-to-speech-translation", "api_name": "xm_transformer_unity_en-hk", "api_call": "load_model_ensemble_and_task_from_hf_hub('facebook/xm_transformer_unity_en-hk')", "performance": {"dataset": "MuST-C", "accuracy": null}, "description": "Speech-to-speech translation model with two-pass decoder (UnitY) from fairseq: English-Hokkien. Trained with supervised data in TED domain, and weakly supervised data in TED and Audiobook domain.", "model_name": "facebook/xm_transformer_unity_en-hk"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "Stancld/longt5-tglobal-large-16384-pubmed-3k_steps", "performance": {"dataset": "", "accuracy": null}, "description": "introduced as an extension of a successful . this is an unofficial longt5-large-16384-pubmed-3k steps checkpoint, i.e., a large configuration of the longt5 model with `transient-global` attention, fine-tuned on for 3,000 training steps. it may be worth continuing the fine-tuning, as we did not train the model until convergence. the results of the fine-tuned model on the evaluation set were obtained using `beam search 3` and without any specific calibration of generation parameters. the scores in the original paper are higher, very likely due to a higher number of training steps.", "api_call": "", "model_name": "Stancld/longt5-tglobal-large-16384-pubmed-3k_steps"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-sv", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: sv\n*  OPUS readme: [en-sv](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-sv/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-02-26.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-sv/opus-2020-02-26.zip)\n* test set translations: [opus-2020-02-26.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-sv/opus-2020-02-26.test.txt)\n* test set scores: [opus-2020-02-26.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-sv/opus-2020-02-26.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-sv"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "mrm8488/t5-base-finetuned-wikiSQL", "performance": {"dataset": "", "accuracy": null}, "description": "fine-tuned on for english to sql translation. the t5 model was presented in by colin raffel, noam shazeer, adam roberts, katherine lee, sharan narang, michael matena, yanqi zhou, wei li, peter j. liu. here is the abstract: transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (nlp). the effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. in this paper, we explore the landscape of transfer learning techniques for nlp by introducing a unified framework that converts every language problem into a text-to-text format. our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. by combining the insights from our exploration with scale and our new \u201ccolossal clean crawled corpus\u201d, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. to facilitate future work on transfer learning for nlp, we release our dataset, pre-trained models, and code.", "api_call": "", "model_name": "mrm8488/t5-base-finetuned-wikiSQL"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tc-big-en-es", "performance": {"dataset": "", "accuracy": null}, "description": "neural machine translation model for translating from english (en) to spanish (es). this model is part of the , an effort to make neural machine translation models widely available and accessible for many languages in the world. all models are originally trained using the amazing framework of , an efficient nmt implementation written in pure c++. the models have been converted to pytorch using the transformers library by huggingface. training data is taken from and training pipelines use the procedures of .", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tc-big-en-es"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "Dr-BERT/DrBERT-7GB", "performance": {"dataset": "", "accuracy": null}, "description": "in recent years, pre-trained language models (plms) achieve the best performance on a wide range of natural language processing (nlp) tasks. while the first models were trained on general domain data, specialized ones have emerged to more effectively treat specific domains. in this paper, we propose an original study of plms in the medical domain for the french language. we compare, for the first time, the performance of plms trained on both public data from the web and private data from healthcare establishments. we also evaluate different learning strategies on a set of biomedical tasks. finally, we release the first specialized plms for the biomedical field in french, called drbert, as well as the largest corpus of medical data under free license on which these models are trained.", "api_call": "", "model_name": "Dr-BERT/DrBERT-7GB"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "api_name": "facebook/mask2former-swin-large-mapillary-vistas-panoptic", "performance": {"dataset": "", "accuracy": null}, "description": "mask2former model trained on mapillary vistas panoptic segmentation (large-sized version, swin backbone). it was introduced in the paper \"masked-attention mask transformer for universal image segmentation\" and first released in its original repository. disclaimer: the team releasing mask2former did not write a model card for this model so this model card has been written by the hugging face team. mask2former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. hence, all 3 tasks are treated as if they were instance segmentation. mask2former outperforms the previous sota, maskformer, both in terms of performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention transformer, (ii) adopting a transformer decoder with masked attention to boost performance without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.", "api_call": "", "model_name": "facebook/mask2former-swin-large-mapillary-vistas-panoptic"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-nl-fr", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: nl\n* target languages: fr\n*  OPUS readme: [nl-fr](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/nl-fr/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-24.zip](https://object.pouta.csc.fi/OPUS-MT-models/nl-fr/opus-2020-01-24.zip)\n* test set translations: [opus-2020-01-24.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/nl-fr/opus-2020-01-24.test.txt)\n* test set scores: [opus-2020-01-24.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/nl-fr/opus-2020-01-24.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-nl-fr"}
{"domain": "Computer Vision Text-to-3D", "framework": "Hugging Face Diffusers", "api_name": "Intel/ldm3d", "performance": {"dataset": "", "accuracy": null}, "description": "the ldm3d model was proposed in a paper authored by gabriela ben melech stan, diana wofk, scottie fox, alex redden, will saxton, jean yu, estelle aflalo, shao-yen tseng, fabio nonato, matthias muller, and vasudev lal. the paper was accepted for publication in 2023. for better results, do not hesitate to use our new checkpoint, based on a slightly different architecture.", "api_call": "", "model_name": "Intel/ldm3d"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "savasy/bert-base-turkish-sentiment-cased", "performance": {"dataset": "", "accuracy": null}, "description": "this model is used for sentiment analysis and is based on berturk for the turkish language. please cite the authors' work if you use it in your study. the dataset is taken from the studies [2] (paper-2) and [3] (paper-3), and merged.", "api_call": "", "model_name": "savasy/bert-base-turkish-sentiment-cased"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-bat-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Baltic languages \n* target group: English \n*  OPUS readme: [bat-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/bat-eng/README.md)\n\n*  model: transformer\n* source language(s): lav lit ltg prg_Latn sgs\n* target language(s): eng\n* model: transformer\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus2m-2020-07-31.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/bat-eng/opus2m-2020-07-31.zip)\n* test set translations: [opus2m-2020-07-31.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/bat-eng/opus2m-2020-07-31.test.txt)\n* test set scores: [opus2m-2020-07-31.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/bat-eng/opus2m-2020-07-31.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-bat-en"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling", "api_name": "albert-base-v2", "api_call": "pipeline('fill-mask', model='albert-base-v2')", "performance": {"dataset": {"SQuAD1.1": "90.2/83.2", "SQuAD2.0": "82.1/79.3", "MNLI": "84.6", "SST-2": "92.9", "RACE": "66.8"}, "accuracy": "82.3"}, "description": "ALBERT Base v2 is a transformers model pretrained on a large corpus of English data in a self-supervised fashion using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model, as all ALBERT models, is uncased: it does not make a difference between english and English.", "model_name": "albert-base-v2"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "IDEA-CCNL/Wenzhong-GPT2-110M", "performance": {"dataset": "", "accuracy": null}, "description": "- Main Page: [Fengshenbang](https://fengshenbang-lm.com/)\n- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM)\n## Brief Introduction\n\nFocused on handling NLG tasks, Chinese GPT2-Small.", "api_call": "", "model_name": "IDEA-CCNL/Wenzhong-GPT2-110M"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling", "api_name": "bert-base-cased", "api_call": "pipeline('fill-mask', model='bert-base-cased')", "performance": {"dataset": "GLUE", "accuracy": 79.6}, "description": "BERT base model (cased) is a pre-trained transformer model on English language using a masked language modeling (MLM) objective. It was introduced in a paper and first released in a repository. This model is case-sensitive, which means it can differentiate between 'english' and 'English'. The model can be used for masked language modeling or next sentence prediction, but it's mainly intended to be fine-tuned on a downstream task.", "model_name": "bert-base-cased"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "bigscience/T0_3B", "performance": {"dataset": "", "accuracy": null}, "description": "how do i pronounce the name of the model? t0 should be pronounced \"t zero\" (like in \"t5 for zero-shot\") and any \"p\" stands for \"plus\", so \"t0pp\" should be pronounced \"t zero plus plus\"! t0 shows zero-shot task generalization on english natural language prompts, outperforming gpt-3 on many tasks, while being 16x smaller. it is a series of encoder-decoder models trained on a large set of different tasks specified in natural language prompts. we convert numerous english supervised datasets into prompts, each with multiple templates using varying formulations. these prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks specified in natural language. to obtain t0, we fine-tune a pretrained language model on this multitask mixture covering many different nlp tasks.", "api_call": "", "model_name": "bigscience/T0_3B"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "wavymulder/wavyfusion", "performance": {"dataset": "", "accuracy": null}, "description": "wavyfusion - this is a dreambooth trained on a very diverse dataset ranging from photographs to paintings. the goal was to make a varied, general purpose model for illustrated styles. in your prompt, use the activation token: `wa-vy style`", "api_call": "", "model_name": "wavymulder/wavyfusion"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-sk-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: sk\n* target languages: en\n*  OPUS readme: [sk-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/sk-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-16.zip](https://object.pouta.csc.fi/OPUS-MT-models/sk-en/opus-2020-01-16.zip)\n* test set translations: [opus-2020-01-16.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/sk-en/opus-2020-01-16.test.txt)\n* test set scores: [opus-2020-01-16.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/sk-en/opus-2020-01-16.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-sk-en"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "wavymulder/timeless-diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "timeless diffusion - this is a dreambooth model trained on a diverse set of colourized photographs from the 1880s-1980s. use the activation token `timeless style` in your prompt (i recommend at the start).", "api_call": "", "model_name": "wavymulder/timeless-diffusion"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "kit-nlp/bert-base-japanese-sentiment-irony", "performance": {"dataset": "", "accuracy": null}, "description": "this is a bert base model for sentiment analysis in japanese, additionally finetuned for automatic irony detection. the model was based on an existing japanese sentiment analysis bert model and later finetuned on a dataset containing ironic and sarcastic tweets. the finetuned model with all attached files is licensed under cc by-sa 4.0, or creative commons attribution-sharealike 4.0 international license.", "api_call": "", "model_name": "kit-nlp/bert-base-japanese-sentiment-irony"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Summarization", "api_name": "mrm8488/t5-base-finetuned-summarize-news", "api_call": "AutoModelWithLMHead.from_pretrained('mrm8488/t5-base-finetuned-summarize-news')", "performance": {"dataset": "News Summary", "accuracy": "Not provided"}, "description": "Google's T5 base fine-tuned on News Summary dataset for summarization downstream task. The dataset consists of 4515 examples and contains Author_name, Headlines, Url of Article, Short text, Complete Article. Time period ranges from February to August 2017.", "model_name": "mrm8488/t5-base-finetuned-summarize-news"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-de-ZH", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: de\n* target languages: cmn,cn,yue,ze_zh,zh_cn,zh_CN,zh_HK,zh_tw,zh_TW,zh_yue,zhs,zht,zh\n*  OPUS readme: [de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)\n* download original weights: [opus-2020-01-20.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh/opus-2020-01-20.zip)\n* test set translations: [opus-2020-01-20.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh/opus-2020-01-20.test.txt)\n* test set scores: [opus-2020-01-20.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh/opus-2020-01-20.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-de-ZH"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "joachimsallstrom/Double-Exposure-Diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "double exposure diffusion - this is version 2 of the double exposure diffusion model, trained specifically on images of people and a few animals.", "api_call": "", "model_name": "joachimsallstrom/Double-Exposure-Diffusion"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "facebook/encodec_24khz", "performance": {"dataset": "", "accuracy": null}, "description": "this model card provides details and information about encodec, a state-of-the-art real-time audio codec developed by meta ai. encodec is a high-fidelity audio codec leveraging neural networks. it introduces a streaming encoder-decoder architecture with quantized latent space, trained in an end-to-end fashion. the model simplifies and speeds up training using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. it also includes a novel loss balancer mechanism that stabilizes training by decoupling the choice of hyperparameters from the typical scale of the loss. additionally, lightweight transformer models are used to further compress the obtained representation while maintaining real-time performance. - developed by: meta ai - model type: audio codec", "api_call": "", "model_name": "facebook/encodec_24khz"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tc-big-en-tr", "performance": {"dataset": "", "accuracy": null}, "description": "neural machine translation model for translating from english (en) to turkish (tr). this model is part of the opus-mt project, an effort to make neural machine translation models widely available and accessible for many languages in the world. all models are originally trained using the amazing framework of marian nmt, an efficient nmt implementation written in pure c++. the models have been converted to pytorch using the transformers library by huggingface. training data is taken from opus and training pipelines use the procedures of opus-mt-train. please cite the associated publications if you use this model.", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tc-big-en-tr"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "facebook/wmt19-en-de", "performance": {"dataset": "", "accuracy": null}, "description": "this is a ported version of the fairseq wmt19 transformer for en-de. for more details, please see facebook fair's wmt19 news translation task submission. the abbreviation fsmt stands for fairseqmachinetranslation. all four models are available.", "api_call": "", "model_name": "facebook/wmt19-en-de"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-vi", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: English \n* target group: Vietnamese \n*  OPUS readme: [eng-vie](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-vie/README.md)\n\n*  model: transformer-align\n* source language(s): eng\n* target language(s): vie vie_Hani\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* a sentence initial language token is required in the form of `>>id<<` (id = valid target language ID)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-vie/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-vie/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-vie/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-vi"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "NoCrypt/SomethingV2_2", "performance": {"dataset": "", "accuracy": null}, "description": "welcome to somethingv2.2 - an improved anime latent diffusion model based on somethingv2. a lot of things have been discovered lately, such as a way to merge models using mbw automatically, offset noise to get much darker results, and even vae tuning. this model is intended to use all of those features as improvements.", "api_call": "", "model_name": "NoCrypt/SomethingV2_2"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "functionality": "zero-shot-object-detection", "api_name": "google/owlvit-base-patch32", "api_call": "OwlViTForObjectDetection.from_pretrained('google/owlvit-base-patch32')", "performance": {"dataset": "COCO and OpenImages", "accuracy": "Not specified"}, "description": "OWL-ViT is a zero-shot text-conditioned object detection model that uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. The model can be used to query an image with one or multiple text queries.", "model_name": "google/owlvit-base-patch32"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "wavymulder/Analog-Diffusion", "api_call": "pipeline('text-to-image', model='wavymulder/Analog-Diffusion')", "performance": {"dataset": "analog photographs", "accuracy": "Not specified"}, "description": "Analog Diffusion is a dreambooth model trained on a diverse set of analog photographs. It can generate images based on text prompts with an analog style. Use the activation token 'analog style' in your prompt to get the desired output. The model is available on the Hugging Face Inference API and can be used with the transformers library.", "model_name": "wavymulder/Analog-Diffusion"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "Salesforce/codet5p-220m", "performance": {"dataset": "", "accuracy": null}, "description": "codet5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. encoder-only, decoder-only, and encoder-decoder) to support a wide range of code understanding and generation tasks. it is introduced in the accompanying paper. compared to the original codet5 family (base: `220m`, large: `770m`), codet5+ is pretrained with a diverse set of pretraining tasks including span denoising, causal language modeling, contrastive learning, and text-code matching to learn rich representations from both unimodal code data and bimodal code-text data. additionally, it employs a simple yet effective compute-efficient pretraining method to initialize the model components with frozen off-the-shelf llms to efficiently scale up the model (i.e. `2b`, `6b`, `16b`), and adopts a \"shallow encoder and deep decoder\" architecture. furthermore, it is instruction-tuned to align with natural language instructions (see our instructcodet5+ 16b).", "api_call": "", "model_name": "Salesforce/codet5p-220m"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "unicamp-dl/translation-pt-en-t5", "performance": {"dataset": "", "accuracy": null}, "description": "this repository provides an implementation of t5 for pt-en translation using a modest hardware setup. we propose some changes to the tokenizer and post-processing that improve the results, and we used a portuguese pretrained model for the translation. to use the model, just follow the \"use in transformers\" instructions. it is necessary to add a few words before the input to define the task for t5. you can also create a pipeline for it, for example with the phrase \"eu gosto de comer arroz\".", "api_call": "", "model_name": "unicamp-dl/translation-pt-en-t5"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "EleutherAI/polyglot-ko-1.3b", "performance": {"dataset": "", "accuracy": null}, "description": "polyglot-ko is a series of large-scale korean autoregressive language models made by the eleutherai polyglot team.\n\n| hyperparameter | value |\n|---|---|\n| n_parameters | 1,331,810,304 |\n| n_layers | 24 |\n| d_model | 2,048 |\n| d_ff | 8,192 |\n| n_heads | 16 |\n| d_head | 128 |\n| n_ctx | 2,048 |\n| n_vocab | 30,003 / 30,080 |\n| positional encoding | rotary position embedding (rope) |\n| rope dimensions | 64 |\n\nthe model consists of 24 transformer layers with a model dimension of 2048, and a feedforward dimension of 8192. the model dimension is split into 16 heads, each with a dimension of 128. rotary position embedding (rope) is applied to 64 dimensions of each head. the model is trained with a tokenization vocabulary of 30003.", "api_call": "", "model_name": "EleutherAI/polyglot-ko-1.3b"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Asteroid", "api_name": "mpariente/DPRNNTasNet-ks2_WHAM_sepclean", "api_call": "pipeline('audio-source-separation', model='mpariente/DPRNNTasNet-ks2_WHAM_sepclean')", "performance": {"dataset": "WHAM!", "si_sdr": 19.3167434907, "si_sdr_imp": 19.3178952739, "sdr": 19.6808534719, "sdr_imp": 19.5298092933, "sir": 30.3622139987, "sir_imp": 30.2111698201, "sar": 20.1555325134, "sar_imp": -129.0209176235, "stoi": 0.9777266431, "stoi_imp": 0.2396809152}, "description": "This model was trained by Manuel Pariente using the wham/DPRNN recipe in Asteroid. It was trained on the sep_clean task of the WHAM! dataset.", "model_name": "mpariente/DPRNNTasNet-ks2_WHAM_sepclean"}
{"domain": "Natural Language Processing Question Answering", "framework": "Transformers", "functionality": "Question Answering", "api_name": "distilbert-base-cased-distilled-squad", "api_call": "DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')", "performance": {"dataset": "SQuAD v1.1", "accuracy": {"Exact Match": 79.6, "F1": 86.996}}, "description": "DistilBERT base cased distilled SQuAD is a fine-tuned checkpoint of DistilBERT-base-cased, trained using knowledge distillation on the SQuAD v1.1 dataset. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark. This model can be used for question answering.", "model_name": "distilbert-base-cased-distilled-squad"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "functionality": "Text2Text Generation", "api_name": "mrm8488/bert2bert_shared-spanish-finetuned-summarization", "api_call": "AutoModelForSeq2SeqLM.from_pretrained('mrm8488/bert2bert_shared-spanish-finetuned-summarization')", "performance": {"dataset": "mlsum", "accuracy": {"Rouge1": 26.24, "Rouge2": 8.9, "RougeL": 21.01, "RougeLsum": 21.02}}, "description": "Spanish BERT2BERT (BETO) fine-tuned on MLSUM ES for summarization", "model_name": "mrm8488/bert2bert_shared-spanish-finetuned-summarization"}
{"domain": "Natural Language Processing Translation", "framework": "Transformers", "functionality": "Translation", "api_name": "opus-mt-de-es", "api_call": "pipeline('translation_de_to_es', model='Helsinki-NLP/opus-mt-de-es')", "performance": {"dataset": "Tatoeba.de.es", "accuracy": {"BLEU": 48.5, "chr-F": 0.676}}, "description": "A German to Spanish translation model based on the OPUS dataset and trained using the transformer-align architecture. The model is pre-processed with normalization and SentencePiece tokenization.", "model_name": "Helsinki-NLP/opus-mt-de-es"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "rinna/japanese-clip-vit-b-16", "performance": {"dataset": "", "accuracy": null}, "description": "![rinna-icon](./rinna.png)\n\nThis is a Japanese [CLIP (Contrastive Language-Image Pre-Training)](https://arxiv.org/abs/2103.00020) model trained by [rinna Co., Ltd.](https://corp.rinna.co.jp/).\n\nPlease see [japanese-clip](https://github.com/rinnakk/japanese-clip) for the other available models.\n\n\n# How to use the model\n\n\n1. Install package\n\n```shell\n$ pip install git+https://github.com/rinnakk/japanese-clip.git\n```\n\n2. Run\n\n```python\nimport io\nimport requests\nfrom PIL import Image\nimport torch\nimport japanese_clip as ja_clip\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n\n\nmodel, preprocess = ja_clip.load(\"rinna/japanese-clip-vit-b-16\", cache_dir=\"/tmp/japanese_clip\", device=device)\ntokenizer = ja_clip.load_tokenizer()\n\nimg = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))\nimage = preprocess(img).unsqueeze(0).to(device)\nencodings = ja_clip.tokenize(\n    texts=[\"\u72ac\", \"\u732b\", \"\u8c61\"],\n    max_seq_len=77,\n    device=device,\n    tokenizer=tokenizer, # this is optional. if you don't pass, load tokenizer each time\n)\n\nwith torch.no_grad():\n    image_features = model.get_image_features(image)\n    text_features = model.get_text_features(**encodings)\n    \n    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)\n\nprint(\"Label probs:\", text_probs)  # prints: [[1.0, 0.0, 0.0]]\n```\n\n# Model architecture\nThe model uses a ViT-B/16 Transformer architecture as an image encoder and a 12-layer BERT as a text encoder. The image encoder was initialized from the [AugReg `vit-base-patch16-224` model](https://github.com/google-research/vision_transformer).\n\n# Training\nThe model was trained on [CC12M](https://github.com/google-research-datasets/conceptual-12m) with the captions translated to Japanese.\n\n# Release date\nMay 12, 2022\n\n# How to cite\n```bibtex\n@misc{rinna-japanese-clip-vit-b-16,\n    title = {rinna/japanese-clip-vit-b-16},\n    author = {Shing, Makoto and Zhao, Tianyu and Sawada, Kei},\n    url = {https://huggingface.co/rinna/japanese-clip-vit-b-16}\n}\n\n@inproceedings{sawada2024release,\n    title = {Release of Pre-Trained Models for the {J}apanese Language},\n    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},\n    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},\n    month = {5},\n    year = {2024},\n    pages = {13898--13905},\n    url = {https://aclanthology.org/2024.lrec-main.1213},\n    note = {\\url{https://arxiv.org/abs/2404.01657}}\n}\n```\n\n# License\n\n[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)", "api_call": "", "model_name": "rinna/japanese-clip-vit-b-16"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "patrickjohncyh/fashion-clip", "api_call": "CLIPModel.from_pretrained('patrickjohncyh/fashion-clip')", "performance": {"dataset": [{"name": "FMNIST", "accuracy": 0.8300000000000001}, {"name": "KAGL", "accuracy": 0.73}, {"name": "DEEP", "accuracy": 0.62}]}, "description": "FashionCLIP is a CLIP-based model developed to produce general product representations for fashion concepts. Leveraging the pre-trained checkpoint (ViT-B/32) released by OpenAI, it is trained on a large, high-quality novel fashion dataset to study whether domain specific fine-tuning of CLIP-like models is sufficient to produce product representations that are zero-shot transferable to entirely new datasets and tasks.", "model_name": "patrickjohncyh/fashion-clip"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-id-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: id\n* target languages: en\n*  OPUS readme: [id-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/id-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/id-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/id-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/id-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-id-en"}
{"domain": "Multimodal Feature Extraction", "framework": "Hugging Face Transformers", "functionality": "Feature Engineering", "api_name": "microsoft/unixcoder-base", "api_call": "AutoModel.from_pretrained('microsoft/unixcoder-base')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "UniXcoder is a unified cross-modal pre-trained model that leverages multimodal data (i.e. code comment and AST) to pretrain code representation. Developed by Microsoft Team and shared by Hugging Face. It is based on the RoBERTa model and trained on English language data. The model can be used for feature engineering tasks.", "model_name": "microsoft/unixcoder-base"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-ine-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Indo-European languages \n* target group: English \n*  OPUS readme: [ine-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/ine-eng/README.md)\n\n*  model: transformer\n* source language(s): afr aln ang_Latn arg asm ast awa bel bel_Latn ben bho bos_Latn bre bul bul_Latn cat ces cor cos csb_Latn cym dan deu dsb egl ell enm_Latn ext fao fra frm_Latn frr fry gcf_Latn gla gle glg glv gom gos got_Goth grc_Grek gsw guj hat hif_Latn hin hrv hsb hye ind isl ita jdt_Cyrl ksh kur_Arab kur_Latn lad lad_Latn lat_Latn lav lij lit lld_Latn lmo ltg ltz mai mar max_Latn mfe min mkd mwl nds nld nno nob nob_Hebr non_Latn npi oci ori orv_Cyrl oss pan_Guru pap pdc pes pes_Latn pes_Thaa pms pnb pol por prg_Latn pus roh rom ron rue rus san_Deva scn sco sgs sin slv snd_Arab spa sqi srp_Cyrl srp_Latn stq swe swg tgk_Cyrl tly_Latn tmw_Latn ukr urd vec wln yid zlm_Latn zsm_Latn zza\n* target language(s): eng\n* model: transformer\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus2m-2020-08-01.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/ine-eng/opus2m-2020-08-01.zip)\n* test set translations: [opus2m-2020-08-01.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/ine-eng/opus2m-2020-08-01.test.txt)\n* test set scores: [opus2m-2020-08-01.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/ine-eng/opus2m-2020-08-01.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-ine-en"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-vi-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Vietnamese \n* target group: English \n*  OPUS readme: [vie-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/vie-eng/README.md)\n\n*  model: transformer-align\n* source language(s): vie vie_Hani\n* target language(s): eng\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/vie-eng/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/vie-eng/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/vie-eng/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-vi-en"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Sentence-Transformers", "api_name": "sentence-transformers/paraphrase-xlm-r-multilingual-v1", "performance": {"dataset": "", "accuracy": null}, "description": "this is a model: it maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search. using this model becomes easy when you have installed: then you can use the model like this:", "api_call": "", "model_name": "sentence-transformers/paraphrase-xlm-r-multilingual-v1"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-de-fr", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: de\n* target languages: fr\n*  OPUS readme: [de-fr](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/de-fr/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2020-01-08.zip)\n* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2020-01-08.test.txt)\n* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-fr/opus-2020-01-08.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-de-fr"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-is-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: is\n* target languages: en\n*  OPUS readme: [is-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/is-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/is-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/is-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/is-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-is-en"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-binary", "performance": {"dataset": "", "accuracy": null}, "description": "a transformer-based model for persian language understanding we reconstructed the vocabulary and fine-tuned the parsbert v1.1 on the new persian corpora in order to provide some functionalities for using parsbert in other scopes! please follow the repo for the latest information about previous and current models.", "api_call": "", "model_name": "HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-binary"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "onlplab/alephbert-base", "performance": {"dataset": "", "accuracy": null}, "description": "state-of-the-art language model for hebrew. based on google's bert architecture . 1. oscar hebrew section 10 gb text, 20 million sentences.", "api_call": "", "model_name": "onlplab/alephbert-base"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Speaker Verification", "api_name": "speechbrain/spkrec-xvect-voxceleb", "api_call": "EncoderClassifier.from_hparams(source='speechbrain/spkrec-xvect-voxceleb', savedir='pretrained_models/spkrec-xvect-voxceleb')", "performance": {"dataset": "Voxceleb1-test set (Cleaned)", "accuracy": "EER(%) 3.2"}, "description": "This repository provides all the necessary tools to extract speaker embeddings with a pretrained TDNN model using SpeechBrain. The system is trained on Voxceleb 1+ Voxceleb2 training data.", "model_name": "speechbrain/spkrec-xvect-voxceleb"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Transformers", "functionality": "Masked Language Modeling and Next Sentence Prediction", "api_name": "bert-large-uncased", "api_call": "pipeline('fill-mask', model='bert-large-uncased')", "performance": {"dataset": {"SQUAD 1.1 F1/EM": "91.0/84.3", "Multi NLI Accuracy": "86.05"}}, "description": "BERT large model (uncased) is a transformer model pretrained on a large corpus of English data using a masked language modeling (MLM) objective. It has 24 layers, 1024 hidden dimensions, 16 attention heads, and 336M parameters. The model is intended to be fine-tuned on a downstream task, such as sequence classification, token classification, or question answering.", "model_name": "bert-large-uncased"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Image Captioning", "api_name": "microsoft/git-base", "api_call": "pipeline('image-to-text', model='microsoft/git-base')", "performance": {"dataset": ["COCO", "Conceptual Captions (CC3M)", "SBU", "Visual Genome (VG)", "Conceptual Captions (CC12M)", "ALT200M"], "accuracy": "Refer to the paper for evaluation results"}, "description": "GIT (short for GenerativeImage2Text) model, base-sized version. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. The model is trained using 'teacher forcing' on a lot of (image, text) pairs. The goal for the model is simply to predict the next text token, giving the image tokens and previous text tokens. This allows the model to be used for tasks like image and video captioning, visual question answering (VQA) on images and videos, and even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).", "model_name": "microsoft/git-base"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "seyonec/ChemBERTa-zinc-base-v1", "performance": {"dataset": "", "accuracy": null}, "description": "deep learning for chemistry and materials science remains a novel field with lots of potiential. however, the popularity of transfer learning based methods in areas such as nlp and computer vision have not yet been effectively developed in computational chemistry + machine learning. using huggingface's suite of models and the bytelevel tokenizer, we are able to train on a large corpus of 100k smiles strings from a commonly known benchmark dataset, zinc. training roberta over 5 epochs, the model achieves a decent loss of 0.398, but may likely continue to decline if trained for a larger number of epochs. the model can predict tokens within a smiles sequence/molecule, allowing for variants of a molecule within discoverable chemical space to be predicted. by applying the representations of functional groups and atoms learned by the model, we can try to tackle problems of toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of bert. finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly identify important substructures in various chemical properties.", "api_call": "", "model_name": "seyonec/ChemBERTa-zinc-base-v1"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "malmarjeh/t5-arabic-text-summarization", "performance": {"dataset": "", "accuracy": null}, "description": "Paper: [Arabic abstractive text summarization using RNN-based and transformer-based architectures](https://www.sciencedirect.com/science/article/abs/pii/S0306457322003284).\n\nDataset: [link](https://data.mendeley.com/datasets/7kr75c9h24/1).\n\nThe model can be used as follows:\n```python\nfrom transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline\nfrom arabert.preprocess import ArabertPreprocessor\n\nmodel_name=\"malmarjeh/t5-arabic-text-summarization\"\npreprocessor = ArabertPreprocessor(model_name=\"\")\n\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForSeq2SeqLM.from_pretrained(model_name)\npipeline = pipeline(\"text2text-generation\",model=model,tokenizer=tokenizer)\n\ntext = \"\u0634\u0647\u062f\u062a \u0645\u062f\u064a\u0646\u0629 \u0637\u0631\u0627\u0628\u0644\u0633\u060c \u0645\u0633\u0627\u0621 \u0623\u0645\u0633 \u0627\u0644\u0623\u0631\u0628\u0639\u0627\u0621\u060c \u0627\u062d\u062a\u062c\u0627\u062c\u0627\u062a \u0634\u0639\u0628\u064a\u0629 \u0648\u0623\u0639\u0645\u0627\u0644 \u0634\u063a\u0628 \u0644\u0644\u064a\u0648\u0645 \u0627\u0644\u062b\u0627\u0644\u062b \u0639\u0644\u0649 \u0627\u0644\u062a\u0648\u0627\u0644\u064a\u060c \u0648\u0630\u0644\u0643 \u0628\u0633\u0628\u0628 \u062a\u0631\u062f\u064a \u0627\u0644\u0648\u0636\u0639 \u0627\u0644\u0645\u0639\u064a\u0634\u064a \u0648\u0627\u0644\u0627\u0642\u062a\u0635\u0627\u062f\u064a. 
\u0648\u0627\u0646\u062f\u0644\u0639\u062a \u0645\u0648\u0627\u062c\u0647\u0627\u062a \u0639\u0646\u064a\u0641\u0629 \u0648\u0639\u0645\u0644\u064a\u0627\u062a \u0643\u0631 \u0648\u0641\u0631 \u0645\u0627 \u0628\u064a\u0646 \u0627\u0644\u062c\u064a\u0634 \u0627\u0644\u0644\u0628\u0646\u0627\u0646\u064a \u0648\u0627\u0644\u0645\u062d\u062a\u062c\u064a\u0646 \u0627\u0633\u062a\u0645\u0631\u062a \u0644\u0633\u0627\u0639\u0627\u062a\u060c \u0625\u062b\u0631 \u0645\u062d\u0627\u0648\u0644\u0629 \u0641\u062a\u062d \u0627\u0644\u0637\u0631\u0642\u0627\u062a \u0627\u0644\u0645\u0642\u0637\u0648\u0639\u0629\u060c \u0645\u0627 \u0623\u062f\u0649 \u0625\u0644\u0649 \u0625\u0635\u0627\u0628\u0629 \u0627\u0644\u0639\u0634\u0631\u0627\u062a \u0645\u0646 \u0627\u0644\u0637\u0631\u0641\u064a\u0646.\"\ntext = preprocessor.preprocess(text)\n\nresult = pipeline(text,\n            pad_token_id=tokenizer.eos_token_id,\n            num_beams=3,\n            repetition_penalty=3.0,\n            max_length=200,\n            length_penalty=1.0,\n            no_repeat_ngram_size = 3)[0]['generated_text']\nresult\n>>> '\u0645\u0648\u0627\u062c\u0647\u0627\u062a \u0639\u0646\u064a\u0641\u0629 \u0628\u064a\u0646 \u0627\u0644\u062c\u064a\u0634 \u0627\u0644\u0644\u0628\u0646\u0627\u0646\u064a \u0648\u0645\u062d\u062a\u062c\u064a\u0646 \u0641\u064a \u0637\u0631\u0627\u0628\u0644\u0633'\n```", "api_call": "", "model_name": "malmarjeh/t5-arabic-text-summarization"}
{"domain": "Multimodal Visual Question Answering", "framework": "Hugging Face Transformers", "api_name": "google/pix2struct-infographics-vqa-large", "performance": {"dataset": "", "accuracy": null}, "description": "![model_image](https://s3.amazonaws.com/moonup/production/uploads/1678713353867-62441d1d9fdefb55a0b7d12c.png)\n\n#  Table of Contents\n\n0. [TL;DR](#TL;DR)\n1. [Using the model](#using-the-model)\n2. [Contribution](#contribution)\n3. [Citation](#citation)\n\n# TL;DR\n\nPix2Struct is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captionning and visual question answering. The full list of available models can be found on the Table 1 of the paper:\n\n![Table 1 - paper](https://s3.amazonaws.com/moonup/production/uploads/1678712985040-62441d1d9fdefb55a0b7d12c.png)\n\n\nThe abstract of the model states that: \n> Visually-situated language is ubiquitous\u2014sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and\nforms. Perhaps due to this diversity, previous work has typically relied on domainspecific recipes with limited sharing of the underlying data, model architectures,\nand objectives. We present Pix2Struct, a pretrained image-to-text model for\npurely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse\nmasked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large\nsource of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. 
In addition to the novel pretraining strategy,\nwe introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions\nare rendered directly on top of the input image. For the first time, we show that a\nsingle pretrained model can achieve state-of-the-art results in six out of nine tasks\nacross four domains: documents, illustrations, user interfaces, and natural images.\n\n# Using the model", "api_call": "", "model_name": "google/pix2struct-infographics-vqa-large"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "indigo-ai/BERTino", "performance": {"dataset": "", "accuracy": null}, "description": "this repository hosts bertino, an italian distilbert model pre-trained by on a large general-domain italian corpus. bertino is task-agnostic and can be fine-tuned for every downstream task.", "api_call": "", "model_name": "indigo-ai/BERTino"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-it-fr", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: it\n* target languages: fr\n*  OPUS readme: [it-fr](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/it-fr/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-16.zip](https://object.pouta.csc.fi/OPUS-MT-models/it-fr/opus-2020-01-16.zip)\n* test set translations: [opus-2020-01-16.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/it-fr/opus-2020-01-16.test.txt)\n* test set scores: [opus-2020-01-16.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/it-fr/opus-2020-01-16.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-it-fr"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-da-de", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: da\n* target languages: de\n*  OPUS readme: [da-de](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/da-de/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-26.zip](https://object.pouta.csc.fi/OPUS-MT-models/da-de/opus-2020-01-26.zip)\n* test set translations: [opus-2020-01-26.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-de/opus-2020-01-26.test.txt)\n* test set scores: [opus-2020-01-26.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/da-de/opus-2020-01-26.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-da-de"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "praeclarum/cuneiform", "performance": {"dataset": "", "accuracy": null}, "description": "this is a translation network that understands sumerian and akkadian languages written in cuneiform. it was trained on cuneiform transcribed in the cdli atf format. for example: the network was trained to translate from the ancient languages:", "api_call": "", "model_name": "praeclarum/cuneiform"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "valhalla/t5-base-qa-qg-hl", "performance": {"dataset": "", "accuracy": null}, "description": "this is multi-task model trained for question answering and answer aware question generation tasks. for question generation the answer spans are highlighted within the text with special highlight tokens `` and prefixed with 'generate question: '. for qa the input is processed like this `question: question text context: context text ` you can play with the model using the inference api. here's how you can use it", "api_call": "", "model_name": "valhalla/t5-base-qa-qg-hl"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "functionality": "Generative Commonsense Reasoning", "api_name": "mrm8488/t5-base-finetuned-common_gen", "api_call": "AutoModelWithLMHead.from_pretrained('mrm8488/t5-base-finetuned-common_gen')", "performance": {"dataset": "common_gen", "accuracy": {"ROUGE-2": 17.1, "ROUGE-L": 39.47}}, "description": "Google's T5 fine-tuned on CommonGen for Generative Commonsense Reasoning. CommonGen is a constrained text generation task, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts; the task is to generate a coherent sentence describing an everyday scenario using these concepts.", "model_name": "mrm8488/t5-base-finetuned-common_gen"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-es-de", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: es\n* target languages: de\n*  OPUS readme: [es-de](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/es-de/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-16.zip](https://object.pouta.csc.fi/OPUS-MT-models/es-de/opus-2020-01-16.zip)\n* test set translations: [opus-2020-01-16.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/es-de/opus-2020-01-16.test.txt)\n* test set scores: [opus-2020-01-16.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/es-de/opus-2020-01-16.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-es-de"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "api_name": "yangy50/garbage-classification", "performance": {"dataset": "", "accuracy": null}, "description": "garbage classification refers to the separation of several types of different categories in accordance with the environmental impact of the use of the value of the composition of garbage components and the requirements of existing treatment methods. the significance of garbage classification: 1. garbage classification reduces the mutual pollution between different garbage, which is beneficial to the recycling of materials.", "api_call": "", "model_name": "yangy50/garbage-classification"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "Sygil/Sygil-Diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "this model is a fine-tune of stable diffusion, trained on the , with the big advantage of allowing the use of multiple namespaces labeled tags to control various parts of the final generation. while current models usually are prone to \u201ccontext errors\u201d and need substantial negative prompting to set them on the right track, the use of namespaces in this model eg. \u201cspecies:seal\u201d or \u201cstudio:dc\u201d stop the model from misinterpreting a seal as the singer seal, or dc comics as washington dc.", "api_call": "", "model_name": "Sygil/Sygil-Diffusion"}
{"domain": "Natural Language Processing Question Answering", "framework": "Hugging Face Transformers", "api_name": "ZeyadAhmed/AraElectra-Arabic-SQuADv2-QA", "performance": {"dataset": "", "accuracy": null}, "description": "this is the model, fine-tuned using the dataset. it's been trained on question-answer pairs, including unanswerable questions, for the task of question answering. with help of to predicted unanswerable question. overview language model: araelectra", "api_call": "", "model_name": "ZeyadAhmed/AraElectra-Arabic-SQuADv2-QA"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "voidful/context-only-question-generator", "performance": {"dataset": "", "accuracy": null}, "description": "this model is a sequence-to-sequence question generator which takes context as an input, and generates a question as an output. it is based on a pretrained `bart-base` model. inputs should be organised into the following format: this model is a sequence-to-sequence question generator which takes context as an input, and generates a question as an output. it is based on a pretrained `bart-base` model.", "api_call": "", "model_name": "voidful/context-only-question-generator"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-de-da", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: de\n* target languages: da\n*  OPUS readme: [de-da](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/de-da/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-29.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-da/opus-2020-01-29.zip)\n* test set translations: [opus-2020-01-29.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-da/opus-2020-01-29.test.txt)\n* test set scores: [opus-2020-01-29.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-da/opus-2020-01-29.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-de-da"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Diffusion Models", "api_name": "lllyasviel/control_v11p_sd15s2_lineart_anime", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15s2_lineart_anime')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on lineart_anime images.", "model_name": "lllyasviel/control_v11p_sd15s2_lineart_anime"}
{"domain": "Multimodal Text-to-Video", "framework": "Hugging Face", "functionality": "Text-to-Video Generation", "api_name": "redshift-man-skiing", "api_call": "TuneAVideoPipeline.from_pretrained('nitrosocke/redshift-diffusion', unet=UNet3DConditionModel.from_pretrained('Tune-A-Video-library/redshift-man-skiing', subfolder='unet', torch_dtype=torch.float16), torch_dtype=torch.float16)", "performance": {"dataset": "N/A", "accuracy": "N/A"}, "description": "Tune-A-Video - Redshift is a text-to-video generation model based on the nitrosocke/redshift-diffusion model. It generates videos based on textual prompts, such as 'a man is skiing' or '(redshift style) spider man is skiing'.", "model_name": "nitrosocke/redshift-diffusion"}
{"domain": "Natural Language Processing Translation", "framework": "Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-es-en", "api_call": "pipeline('translation_es_to_en', model='Helsinki-NLP/opus-mt-es-en')", "performance": {"dataset": [{"name": "newssyscomb2009-spaeng.spa.eng", "accuracy": {"BLEU": 30.6, "chr-F": 0.5700000000000001}}, {"name": "news-test2008-spaeng.spa.eng", "accuracy": {"BLEU": 27.9, "chr-F": 0.553}}, {"name": "newstest2009-spaeng.spa.eng", "accuracy": {"BLEU": 30.4, "chr-F": 0.5720000000000001}}, {"name": "newstest2010-spaeng.spa.eng", "accuracy": {"BLEU": 36.1, "chr-F": 0.614}}, {"name": "newstest2011-spaeng.spa.eng", "accuracy": {"BLEU": 34.2, "chr-F": 0.599}}, {"name": "newstest2012-spaeng.spa.eng", "accuracy": {"BLEU": 37.9, "chr-F": 0.624}}, {"name": "newstest2013-spaeng.spa.eng", "accuracy": {"BLEU": 35.3, "chr-F": 0.609}}, {"name": "Tatoeba-test.spa.eng", "accuracy": {"BLEU": 59.6, "chr-F": 0.739}}]}, "description": "Helsinki-NLP/opus-mt-es-en is a machine translation model trained to translate from Spanish to English using the Hugging Face Transformers library. The model is based on the Marian framework and was trained on the OPUS dataset.", "model_name": "Helsinki-NLP/opus-mt-es-en"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Asteroid", "api_name": "DCCRNet_Libri1Mix_enhsingle_16k", "api_call": "AutoModelForAudioToAudio.from_pretrained('JorisCos/DCCRNet_Libri1Mix_enhsingle_16k')", "performance": {"dataset": "Libri1Mix", "accuracy": {"si_sdr": 13.3297673983, "si_sdr_imp": 9.8799860925, "sdr": 13.87279933, "sdr_imp": 10.3701365308, "sir": "Infinity", "sir_imp": "NaN", "sar": 13.87279933, "sar_imp": 10.3701365308, "stoi": 0.9140907016, "stoi_imp": 0.11817087800000001}}, "description": "This model was trained by Joris Cosentino using the librimix recipe in Asteroid. It was trained on the enh_single task of the Libri1Mix dataset.", "model_name": "JorisCos/DCCRNet_Libri1Mix_enhsingle_16k"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-it-de", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: it\n* target languages: de\n*  OPUS readme: [it-de](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/it-de/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-20.zip](https://object.pouta.csc.fi/OPUS-MT-models/it-de/opus-2020-01-20.zip)\n* test set translations: [opus-2020-01-20.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/it-de/opus-2020-01-20.test.txt)\n* test set scores: [opus-2020-01-20.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/it-de/opus-2020-01-20.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-it-de"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-pl-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: pl\n* target languages: en\n*  OPUS readme: [pl-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/pl-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/pl-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/pl-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/pl-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-pl-en"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "sander-wood/text-to-music", "performance": {"dataset": "", "accuracy": null}, "description": "this language-music model takes fine-tunes on 282,870 english text-music pairs, where all scores are represented in abc notation. it was introduced in the paper by wu et al. and released in . it is capable of generating complete and semantically consistent sheet music directly from descriptions in natural language based on text. to the best of our knowledge, this is the first model that achieves text-conditional symbolic music generation which is trained on real text-music pairs, and the music is generated entirely by the model and without any hand-crafted rules. this language-music model is available for online use and experience on . with this online platform, you can easily input your desired text descriptions and receive a generated sheet music output from the model. this language-music model takes fine-tunes on 282,870 english text-music pairs, where all scores are represented in abc notation. it was introduced in the paper by wu et al. and released in . it is capable of generating complete and semantically consistent sheet music directly from descriptions in natural language based on text. to the best of our knowledge, this is the first model that achieves text-conditional symbolic music generation which is trained on real text-music pairs, and the music is generated entirely by the model and without any hand-crafted rules. this language-music model is available for online use and experience on . with this online platform, you can easily input your desired text descriptions and receive a generated sheet music output from the model. due to copyright reasons, we are unable to publicly release the training dataset of this model. 
instead, we have made available the wikimt dataset, which includes 1010 pairs of text-music data and can be used to evaluate the performance of language-music models.", "api_call": "", "model_name": "sander-wood/text-to-music"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "api_name": "VietAI/vit5-large-vietnews-summarization", "performance": {"dataset": "", "accuracy": null}, "description": "State-of-the-art pretrained Transformer-based encoder-decoder model for Vietnamese.\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vit5-pretrained-text-to-text-transformer-for/abstractive-text-summarization-on-vietnews)](https://paperswithcode.com/sota/abstractive-text-summarization-on-vietnews?p=vit5-pretrained-text-to-text-transformer-for)", "api_call": "", "model_name": "VietAI/vit5-large-vietnews-summarization"}
{"domain": "Audio Voice Activity Detection", "framework": "Hugging Face Pyannote-Audio", "api_name": "philschmid/pyannote-segmentation", "performance": {"dataset": "", "accuracy": null}, "description": "Model from *[End-to-end speaker segmentation for overlap-aware resegmentation](http://arxiv.org/abs/2104.04045)*,  \nby Herv\u00e9 Bredin and Antoine Laurent.\n\n[Online demo](https://huggingface.co/spaces/pyannote/pretrained-pipelines) is available as a Hugging Face Space.", "api_call": "", "model_name": "philschmid/pyannote-segmentation"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "humarin/chatgpt_paraphraser_on_T5_base", "performance": {"dataset": "", "accuracy": null}, "description": "this model was trained on our . this dataset is based on the , texts from the and the . this model is based on the t5-base model. we used \"transfer learning\" to get our model to generate paraphrases as well as chatgpt. now we can say that this is one of the best paraphrasers on hugging face.", "api_call": "", "model_name": "humarin/chatgpt_paraphraser_on_T5_base"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "wavymulder/portraitplus", "performance": {"dataset": "", "accuracy": null}, "description": "portrait+ - this is a dreambooth model trained on a diverse set of close to medium range portraits of people. use `portrait+ style` in your prompt (i recommend at the start).", "api_call": "", "model_name": "wavymulder/portraitplus"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "api_name": "kassem112/rare", "performance": {"dataset": "", "accuracy": null}, "description": "autogenerated by huggingpics\ud83e\udd17\ud83d\uddbc\ufe0f create your own image classifier for anything by running . report any issues with the demo at the .", "api_call": "", "model_name": "kassem112/rare"}
{"domain": "Computer Vision Object Detection", "framework": "Hugging Face Transformers", "api_name": "valentinafeve/yolos-fashionpedia", "performance": {"dataset": "", "accuracy": null}, "description": "this is a fine-tuned object detection model for fashion. for more details of the implementation you can check the source code. the dataset used for its training is available", "api_call": "", "model_name": "valentinafeve/yolos-fashionpedia"}
{"domain": "Audio Text-to-Speech", "framework": "Hugging Face Transformers", "functionality": "Text-to-Speech", "api_name": "microsoft/speecht5_tts", "api_call": "SpeechT5ForTextToSpeech.from_pretrained('microsoft/speecht5_tts')", "performance": {"dataset": "LibriTTS", "accuracy": "Not specified"}, "description": "SpeechT5 model fine-tuned for speech synthesis (text-to-speech) on LibriTTS. It is a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. It can be used for a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.", "model_name": "microsoft/speecht5_tts"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Sentiment Analysis", "api_name": "siebert/sentiment-roberta-large-english", "api_call": "pipeline('sentiment-analysis', model='siebert/sentiment-roberta-large-english')", "performance": {"dataset": [{"name": "McAuley and Leskovec (2013) (Reviews)", "accuracy": 98.0}, {"name": "McAuley and Leskovec (2013) (Review Titles)", "accuracy": 87.0}, {"name": "Yelp Academic Dataset", "accuracy": 96.5}, {"name": "Maas et al. (2011)", "accuracy": 96.0}, {"name": "Kaggle", "accuracy": 96.0}, {"name": "Pang and Lee (2005)", "accuracy": 91.0}, {"name": "Nakov et al. (2013)", "accuracy": 88.5}, {"name": "Shamma (2009)", "accuracy": 87.0}, {"name": "Blitzer et al. (2007) (Books)", "accuracy": 92.5}, {"name": "Blitzer et al. (2007) (DVDs)", "accuracy": 92.5}, {"name": "Blitzer et al. (2007) (Electronics)", "accuracy": 95.0}, {"name": "Blitzer et al. (2007) (Kitchen devices)", "accuracy": 98.5}, {"name": "Pang et al. (2002)", "accuracy": 95.5}, {"name": "Speriosu et al. (2011)", "accuracy": 85.5}, {"name": "Hartmann et al. (2019)", "accuracy": 98.0}], "average_accuracy": 93.2}, "description": "This model ('SiEBERT', prefix for 'Sentiment in English') is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment. The model was fine-tuned and evaluated on 15 data sets from diverse text sources to enhance generalization across different types of texts (reviews, tweets, etc.). Consequently, it outperforms models trained on only one type of text (e.g., movie reviews from the popular SST-2 benchmark) when used on new data as shown below.", "model_name": "siebert/sentiment-roberta-large-english"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "api_name": "Recognai/zeroshot_selectra_medium", "performance": {"dataset": "", "accuracy": null}, "description": "zero-shot selectra is a fine-tuned on the spanish portion of the . you can use it with hugging face's to make . in comparison to our previous zero-shot classifier , zero-shot selectra is much more lightweight. as shown in the metrics section, the small version (5 times fewer parameters) performs slightly worse, while the medium version (3 times fewer parameters) outperforms the beto based zero-shot classifier. the `hypothesis template` parameter is important and should be in spanish. in the widget on the right, this parameter is set to its default value: \"this example is .\", so different results are expected.", "api_call": "", "model_name": "Recognai/zeroshot_selectra_medium"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "facebook/wmt19-en-ru", "performance": {"dataset": "", "accuracy": null}, "description": "this is a ported version of for en-ru. for more details, please see, . the abbreviation fsmt stands for fairseqmachinetranslation. all four models are available:", "api_call": "", "model_name": "facebook/wmt19-en-ru"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "logo-wizard/logo-diffusion-checkpoint", "performance": {"dataset": "", "accuracy": null}, "description": "this is checkpoint based on and . the weights were fine-tuned on the dataset. you can find some example images in the following. we recommend using this model with the following prompt template: positive: f\"a logo of company industry, some objects, colors, modern, minimalism, vector art, 2d, best quality, centered\"", "api_call": "", "model_name": "logo-wizard/logo-diffusion-checkpoint"}
{"domain": "Audio Audio-to-Audio", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "microsoft/speecht5_vc", "api_call": "SpeechT5ForSpeechToSpeech.from_pretrained('microsoft/speecht5_vc')", "performance": {"dataset": "CMU ARCTIC", "accuracy": "Not specified"}, "description": "SpeechT5 model fine-tuned for voice conversion (speech-to-speech) on CMU ARCTIC. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. It is designed to improve the modeling capability for both speech and text. This model can be used for speech conversion tasks.", "model_name": "microsoft/speecht5_vc"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Sentiment Analysis", "api_name": "cardiffnlp/twitter-roberta-base-sentiment", "api_call": "AutoModelForSequenceClassification.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')", "performance": {"dataset": "tweet_eval", "accuracy": "Not provided"}, "description": "Twitter-roBERTa-base for Sentiment Analysis. This is a roBERTa-base model trained on ~58M tweets and finetuned for sentiment analysis with the TweetEval benchmark. This model is suitable for English.", "model_name": "cardiffnlp/twitter-roberta-base-sentiment"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "RUCAIBox/mvp", "performance": {"dataset": "", "accuracy": null}, "description": "the mvp model was proposed in by tianyi tang, junyi li, wayne xin zhao and ji-rong wen. the detailed information and instructions can be found . mvp is supervised pre-trained using a mixture of labeled datasets. it follows a standard transformer encoder-decoder architecture. mvp is specially designed for natural language generation and can be adapted to a wide range of generation tasks, including but not limited to summarization, data-to-text generation, open-ended dialogue system, story generation, question answering, question generation, task-oriented dialogue system, commonsense generation, paraphrase generation, text style transfer, and text simplification. our model can also be adapted to natural language understanding tasks such as sequence classification and extractive question answering.", "api_call": "", "model_name": "RUCAIBox/mvp"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "DTAI-KULeuven/robbert-v2-dutch-sentiment", "performance": {"dataset": "", "accuracy": null}, "description": "this is a finetuned model based on . we used , which consists of book reviews from . hence our example sentences about books. we did some limited experiments to test if this also works for other domains, but this was not exactly amazing. we released a distilled model and a `base`-sized model. both models perform quite well, so there is only a slight performance tradeoff: model identifier layers params. accuracy", "api_call": "", "model_name": "DTAI-KULeuven/robbert-v2-dutch-sentiment"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "nitrosocke/spider-verse-diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "spider-verse diffusion this is the fine-tuned stable diffusion model trained on movie stills from sony's into the spider-verse. use the tokens spiderverse style in your prompts for the effect.", "api_call": "", "model_name": "nitrosocke/spider-verse-diffusion"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-es-ru", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: es\n* target languages: ru\n*  OPUS readme: [es-ru](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/es-ru/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-20.zip](https://object.pouta.csc.fi/OPUS-MT-models/es-ru/opus-2020-01-20.zip)\n* test set translations: [opus-2020-01-20.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/es-ru/opus-2020-01-20.test.txt)\n* test set scores: [opus-2020-01-20.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/es-ru/opus-2020-01-20.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-es-ru"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "OpenAssistant/reward-model-deberta-v3-large-v2", "performance": {"dataset": "", "accuracy": null}, "description": "reward model (rm) trained to predict which generated answer a human would judge as better, given a question. rms are useful in these domains: - qa model evaluation", "api_call": "", "model_name": "OpenAssistant/reward-model-deberta-v3-large-v2"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image", "api_name": "vintedois-diffusion-v0-1", "api_call": "pipeline('text-to-image', model='22h/vintedois-diffusion-v0-1')", "performance": {"dataset": "large amount of high quality images", "accuracy": "not specified"}, "description": "Vintedois (22h) Diffusion model trained by Predogl and piEsposito with open weights, configs and prompts. This model generates beautiful images without a lot of prompt engineering. It can also generate high fidelity faces with a little amount of steps.", "model_name": "22h/vintedois-diffusion-v0-1"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "johnslegers/epic-diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "ep\u00eec diffusion is a general purpose model based on stable diffusion 1.x intended to replace the official sd releases as your default model. it is focused on providing high quality output in a wide range of different styles, with support", "api_call": "", "model_name": "johnslegers/epic-diffusion"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "Babelscape/rebel-large", "performance": {"dataset": "", "accuracy": null}, "description": "this is the model card for the findings of emnlp 2021 paper . we present a new linearization approach and a reframing of relation extraction as a seq2seq task. the paper can be found . if you use the code, please reference this work in your paper: @inproceedingshuguet-cabot-navigli-2021-rebel-relation, title \"rebel: relation extraction by end-to-end language generation\",", "api_call": "", "model_name": "Babelscape/rebel-large"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-ja-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: ja\n* target languages: en\n*  OPUS readme: [ja-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/ja-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/ja-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/ja-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/ja-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-ja-en"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tc-big-zls-en", "performance": {"dataset": "", "accuracy": null}, "description": "neural machine translation model for translating from south slavic languages zls to english en. this model is part of the , an effort to make neural machine translation models widely available and accessible for many languages in the world. all models are originally trained using the amazing framework of , an efficient nmt implementation written in pure c++. the models have been converted to pytorch using the transformers library by huggingface. training data is taken from and training pipelines use the procedures of . publications: and please, cite if you use this model.", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tc-big-zls-en"}
{"domain": "Computer Vision Text-to-Video", "framework": "Hugging Face Diffusers", "api_name": "cerspense/zeroscope_v2_576w", "performance": {"dataset": "", "accuracy": null}, "description": "a watermark-free modelscope-based video model optimized for producing high-quality 16:9 compositions and a smooth video output. this model was trained from the using 9,923 clips and 29,769 tagged frames at 24 frames, 576x320 resolution. zeroscope v2 576w is specifically designed for upscaling with using vid2vid in the extension by . leveraging this model as a preliminary step allows for superior overall compositions at higher resolutions in zeroscope v2 xl, permitting faster exploration in 576x320 before transitioning to a high-resolution render. see some that have been upscaled to 1024x576 using zeroscope v2 xl. courtesy of zeroscope v2 576w uses 7.9gb of vram when rendering 30 frames at 576x320", "api_call": "", "model_name": "cerspense/zeroscope_v2_576w"}
{"domain": "Natural Language Processing Zero-Shot Classification", "framework": "Hugging Face Transformers", "api_name": "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7", "performance": {"dataset": "", "accuracy": null}, "description": "this multilingual model can perform natural language inference nli on 100 languages and is therefore also suitable for multilingual zero-shot classification. the underlying mdeberta-v3-base model was pre-trained by microsoft on the with 100 languages. the model was then fine-tuned on the and on the . both datasets contain more than 2.7 million hypothesis-premise pairs in 27 languages spoken by more than 4 billion people. as of december 2021, mdeberta-v3-base is the best performing multilingual base-sized transformer model introduced by microsoft in . this model was trained on the and the validation dataset.", "api_call": "", "model_name": "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/mask2former-swin-large-cityscapes-semantic", "api_call": "Mask2FormerForUniversalSegmentation.from_pretrained('facebook/mask2former-swin-large-cityscapes-semantic')", "performance": {"dataset": "Cityscapes", "accuracy": "Not specified"}, "description": "Mask2Former model trained on Cityscapes semantic segmentation (large-sized version, Swin backbone). It addresses instance, semantic and panoptic segmentation by predicting a set of masks and corresponding labels. The model outperforms the previous SOTA, MaskFormer, in terms of performance and efficiency.", "model_name": "facebook/mask2former-swin-large-cityscapes-semantic"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "airesearch/wangchanberta-base-att-spm-uncased", "performance": {"dataset": "", "accuracy": null}, "description": "pretrained roberta base model on assorted thai texts (78.5 gb). the script and documentation can be found at . the architecture of the pretrained model is based on roberta (liu et al., 2019).", "api_call": "", "model_name": "airesearch/wangchanberta-base-att-spm-uncased"}
{"domain": "Computer Vision Image-to-Image", "framework": "Diffusers", "functionality": "Text-to-Image Diffusion Models", "api_name": "lllyasviel/control_v11p_sd15_openpose", "api_call": "ControlNetModel.from_pretrained('lllyasviel/control_v11p_sd15_openpose')", "performance": {"dataset": "Not specified", "accuracy": "Not specified"}, "description": "ControlNet is a neural network structure to control diffusion models by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on openpose images.", "model_name": "lllyasviel/control_v11p_sd15_openpose"}
{"domain": "Computer Vision Depth Estimation", "framework": "Hugging Face Transformers", "functionality": "Depth Estimation", "api_name": "glpn-nyu", "api_call": "GLPNForDepthEstimation.from_pretrained('vinvino02/glpn-nyu')", "performance": {"dataset": "NYUv2", "accuracy": "Not provided"}, "description": "Global-Local Path Networks (GLPN) model trained on NYUv2 for monocular depth estimation. It was introduced in the paper Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth by Kim et al. and first released in this repository.", "model_name": "vinvino02/glpn-nyu"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "ahmedrachid/FinancialBERT-Sentiment-Analysis", "performance": {"dataset": "", "accuracy": null}, "description": "is a bert model pre-trained on a large corpora of financial texts. the purpose is to enhance financial nlp research and practice in financial domain, hoping that financial practitioners and researchers can benefit from this model without the necessity of the significant computational resources required to train the model. the model was fine-tuned for sentiment analysis task on financial phrasebank dataset. experiments show that this model outperforms the general bert and other financial domain-specific models. more details on `financialbert`'s pre-training process can be found at:", "api_call": "", "model_name": "ahmedrachid/FinancialBERT-Sentiment-Analysis"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "Fictiverse/Stable_Diffusion_PaperCut_Model", "performance": {"dataset": "", "accuracy": null}, "description": "this is the fine-tuned stable diffusion model trained on paper cut images. use papercut in your prompts. based on stablediffusion 1.5 model", "api_call": "", "model_name": "Fictiverse/Stable_Diffusion_PaperCut_Model"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tc-big-en-fr", "performance": {"dataset": "", "accuracy": null}, "description": "neural machine translation model for translating from english en to french fr. this model is part of the , an effort to make neural machine translation models widely available and accessible for many languages in the world. all models are originally trained using the amazing framework of , an efficient nmt implementation written in pure c++. the models have been converted to pytorch using the transformers library by huggingface. training data is taken from and training pipelines use the procedures of . publications: and please, cite if you use this model.", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tc-big-en-fr"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "facebook/dpr-question_encoder-multiset-base", "performance": {"dataset": "", "accuracy": null}, "description": "", "api_call": "", "model_name": "facebook/dpr-question_encoder-multiset-base"}
{"domain": "Natural Language Processing Sentence Similarity", "framework": "Hugging Face Sentence-Transformers", "api_name": "intfloat/e5-small-v2", "performance": {"dataset": "", "accuracy": null}, "description": "liang wang, nan yang, xiaolong huang, binxing jiao, linjun yang, daxin jiang, rangan majumder, furu wei, arxiv 2022. this model has 12 layers and the embedding size is 384.", "api_call": "", "model_name": "intfloat/e5-small-v2"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "Mizuiro-sakura/luke-japanese-large-sentiment-analysis-wrime", "performance": {"dataset": "", "accuracy": null}, "description": "# This model is based on Luke-japanese-large-lite\nThis model is a fine-tuned model based on studio-ousia/Luke-japanese-large-lite.\nIt can identify which emotions (joy, sadness, anticipation, surprise, anger, fear, disgust, or trust) are included.\nThis model was fine-tuned using the wrime dataset.\n\n# what is Luke?\u3000Luke\u3068\u306f\uff1f[1] \nLUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on transformer. LUKE treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. LUKE adopts an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores.\n\nLUKE achieves state-of-the-art results on five popular NLP benchmarks including SQuAD v1.1 (extractive question answering), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), TACRED (relation classification), and Open Entity (entity typing).\nluke-japanese\u306f\u3001\u5358\u8a9e\u3068\u30a8\u30f3\u30c6\u30a3\u30c6\u30a3\u306e\u77e5\u8b58\u62e1\u5f35\u578b\u8a13\u7df4\u6e08\u307f Transformer \u30e2\u30c7\u30ebLUKE\u306e\u65e5\u672c\u8a9e\u7248\u3067\u3059\u3002LUKE \u306f\u5358\u8a9e\u3068\u30a8\u30f3\u30c6\u30a3\u30c6\u30a3\u3092\u72ec\u7acb\u3057\u305f\u30c8\u30fc\u30af\u30f3\u3068\u3057\u3066\u6271\u3044\u3001\u3053\u308c\u3089\u306e\u6587\u8108\u3092\u8003\u616e\u3057\u305f\u8868\u73fe\u3092\u51fa\u529b\u3057\u307e\u3059\u3002\n\n# how to use \u4f7f\u3044\u65b9\n\u30b9\u30c6\u30c3\u30d71\uff1apython\u3068pytorch, 
sentencepiece\u306e\u30a4\u30f3\u30b9\u30c8\u30fc\u30eb\u3068transformers\u306e\u30a2\u30c3\u30d7\u30c7\u30fc\u30c8\uff08\u30d0\u30fc\u30b8\u30e7\u30f3\u304c\u53e4\u3059\u304e\u308b\u3068LukeTokenizer\u304c\u5165\u3063\u3066\u3044\u306a\u3044\u305f\u3081\uff09\nupdate transformers and install sentencepiece, python and pytorch\n\n\u30b9\u30c6\u30c3\u30d72\uff1a\u4e0b\u8a18\u306e\u30b3\u30fc\u30c9\u3092\u5b9f\u884c\u3059\u308b\nPlease execute this code\n\n\n```python\nfrom transformers import AutoTokenizer, AutoModelForSequenceClassification, LukeConfig\nimport torch\ntokenizer = AutoTokenizer.from_pretrained(\"Mizuiro-sakura/luke-japanese-large-sentiment-analysis-wrime\")\nconfig = LukeConfig.from_pretrained('Mizuiro-sakura/luke-japanese-large-sentiment-analysis-wrime', output_hidden_states=True)    \nmodel = AutoModelForSequenceClassification.from_pretrained('Mizuiro-sakura/luke-japanese-large-sentiment-analysis-wrime', config=config)\n\ntext='\u3059\u3054\u304f\u697d\u3057\u304b\u3063\u305f\u3002\u307e\u305f\u884c\u304d\u305f\u3044\u3002'\n\nmax_seq_length=512\ntoken=tokenizer(text,\n        truncation=True,\n        max_length=max_seq_length,\n        padding=\"max_length\")\noutput=model(torch.tensor(token['input_ids']).unsqueeze(0), torch.tensor(token['attention_mask']).unsqueeze(0))\nmax_index=torch.argmax(torch.tensor(output.logits))\n\nif max_index==0:\n    print('joy\u3001\u3046\u308c\u3057\u3044')\nelif max_index==1:\n    print('sadness\u3001\u60b2\u3057\u3044')\nelif max_index==2:\n    print('anticipation\u3001\u671f\u5f85')\nelif max_index==3:\n    print('surprise\u3001\u9a5a\u304d')\nelif max_index==4:\n    print('anger\u3001\u6012\u308a')\nelif max_index==5:\n    print('fear\u3001\u6050\u308c')\nelif max_index==6:\n    print('disgust\u3001\u5acc\u60aa')\nelif max_index==7:\n    print('trust\u3001\u4fe1\u983c')\n```\n\n# Acknowledgments\u3000\u8b1d\u8f9e\nLuke\u306e\u958b\u767a\u8005\u3067\u3042\u308b\u5c71\u7530\u5148\u751f\u3068Studio 
ousia\u3055\u3093\u306b\u306f\u611f\u8b1d\u3044\u305f\u3057\u307e\u3059\u3002\nI would like to thank Mr.Yamada @ikuyamada and Studio ousia @StudioOusia.\n\n# Citation\n[1]@inproceedings{yamada2020luke,\n  title={LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention},\n  author={Ikuya Yamada and Akari Asai and Hiroyuki Shindo and Hideaki Takeda and Yuji Matsumoto},\n  booktitle={EMNLP},\n  year={2020}\n}", "api_call": "", "model_name": "Mizuiro-sakura/luke-japanese-large-sentiment-analysis-wrime"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-eo-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: eo\n* target languages: en\n*  OPUS readme: [eo-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/eo-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/eo-en/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/eo-en/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/eo-en/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-eo-en"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "kha-white/manga-ocr-base", "api_call": "pipeline('ocr', model='kha-white/manga-ocr-base')", "performance": {"dataset": "manga109s", "accuracy": ""}, "description": "Optical character recognition for Japanese text, with the main focus being Japanese manga. It uses Vision Encoder Decoder framework. Manga OCR can be used as a general purpose printed Japanese OCR, but its main goal was to provide a high quality text recognition, robust against various scenarios specific to manga: both vertical and horizontal text, text with furigana, text overlaid on images, wide variety of fonts and font styles, and low quality images.", "model_name": "kha-white/manga-ocr-base"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "api_name": "cafeai/cafe_aesthetic", "performance": {"dataset": "", "accuracy": null}, "description": "since people are downloading this and i don't know why, i'll add some information. this model is an image classifier fine-tuned on `microsoft/beit-base-patch16-384`. its purpose is to be used in the dataset conditioning step for the , a fine-tune effort for stable diffusion. as wd1.4 is planned to have a significantly large dataset (15m images), it is infeasible to analyze every image manually to determine whether or not it should be included in the final training dataset. this image classifier is trained on approximately 3.5k real-life and anime/manga images. its purpose is to remove aesthetically worthless images from our dataset by classifying them as \"`not aesthetic`\". the image classifier was trained to err on the side of caution and will generally tend to include images unless they are in a \"manga-like\" format, have messy lines and/or are sketches, or include an unacceptable amount of text (namely text that covers the primary subject of the image). the idea is that certain images will hurt an sd fine-tune. note: this classifier is not perfect, just like every other classifier out there. however, with a sufficiently large dataset, any imperfections or misclassifications should average themselves out due to the law of large numbers.", "api_call": "", "model_name": "cafeai/cafe_aesthetic"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "api_name": "thanathorn/mt5-cpe-kmutt-thai-sentence-sum", "performance": {"dataset": "", "accuracy": null}, "description": "this repository contains the finetuned mt5-base model for thai sentence summarization. the architecture of the model is based on mt5 model and fine-tuned on text-summarization pairs in thai. also, this project is a senior project of computer engineering student at king mongkut\u2019s university of technology thonburi. see the example on google colab rouge-1: 61.7805", "api_call": "", "model_name": "thanathorn/mt5-cpe-kmutt-thai-sentence-sum"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-gl-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Galician \n* target group: English \n*  OPUS readme: [glg-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/glg-eng/README.md)\n\n*  model: transformer-align\n* source language(s): glg\n* target language(s): eng\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm12k,spm12k)\n* download original weights: [opus-2020-06-16.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/glg-eng/opus-2020-06-16.zip)\n* test set translations: [opus-2020-06-16.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/glg-eng/opus-2020-06-16.test.txt)\n* test set scores: [opus-2020-06-16.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/glg-eng/opus-2020-06-16.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-gl-en"}
{"domain": "Multimodal Image-to-Text", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "google/deplot", "api_call": "Pix2StructForConditionalGeneration.from_pretrained('google/deplot')", "performance": {"dataset": "ChartQA", "accuracy": "24.0% improvement over finetuned SOTA"}, "description": "DePlot is a model that translates the image of a plot or chart to a linearized table. It decomposes the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs.", "model_name": "google/deplot"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "s-nlp/bart-base-detox", "performance": {"dataset": "", "accuracy": null}, "description": "model overview this is the model presented in the paper . the model itself is model trained on the parallel detoxification dataset paradetox, achieving sota results for the detoxification task. more details, code and data can be found .", "api_call": "", "model_name": "s-nlp/bart-base-detox"}
{"domain": "Multimodal Text-to-Image", "framework": "Hugging Face", "functionality": "Text-to-Image Generation", "api_name": "runwayml/stable-diffusion-v1-5", "api_call": "StableDiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16)(prompt).images[0]", "performance": {"dataset": "COCO2017", "accuracy": "Not optimized for FID scores"}, "description": "Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.", "model_name": "runwayml/stable-diffusion-v1-5"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tc-big-en-fi", "performance": {"dataset": "", "accuracy": null}, "description": "neural machine translation model for translating from english en to finnish fi. this model is part of the , an effort to make neural machine translation models widely available and accessible for many languages in the world. all models are originally trained using the amazing framework of , an efficient nmt implementation written in pure c++. the models have been converted to pytorch using the transformers library by huggingface. training data is taken from and training pipelines use the procedures of . publications: and please, cite if you use this model.", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tc-big-en-fi"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "MaRiOrOsSi/t5-base-finetuned-question-answering", "performance": {"dataset": "", "accuracy": null}, "description": "this model is the result produced by christian di maio and giacomo nunziati for the language processing technologies exam. reference for fine-tuned on for generative question answering by just prepending the question to the context . the code used for t5 training is available at this .", "api_call": "", "model_name": "MaRiOrOsSi/t5-base-finetuned-question-answering"}
{"domain": "Natural Language Processing Text2Text Generation", "framework": "Hugging Face Transformers", "api_name": "mrm8488/t5-base-finetuned-span-sentiment-extraction", "performance": {"dataset": "", "accuracy": null}, "description": "all credits to base fine-tuned on for span sentiment extraction downstream task. the t5 model was presented in by colin raffel, noam shazeer, adam roberts, katherine lee, sharan narang, michael matena, yanqi zhou, wei li, peter j. liu in here the abstract:", "api_call": "", "model_name": "mrm8488/t5-base-finetuned-span-sentiment-extraction"}
{"domain": "Natural Language Processing Text Classification", "framework": "Transformers", "functionality": "Text Classification", "api_name": "joeddav/distilbert-base-uncased-go-emotions-student", "api_call": "pipeline('text-classification', model='joeddav/distilbert-base-uncased-go-emotions-student')", "performance": {"dataset": "go_emotions"}, "description": "This model is distilled from the zero-shot classification pipeline on the unlabeled GoEmotions dataset. It is primarily intended as a demo of how an expensive NLI-based zero-shot model can be distilled to a more efficient student, allowing a classifier to be trained with only unlabeled data.", "model_name": "joeddav/distilbert-base-uncased-go-emotions-student"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-no-sv", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: Norwegian \n* target group: Swedish \n*  OPUS readme: [nor-swe](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/nor-swe/README.md)\n\n*  model: transformer-align\n* source language(s): nno nob\n* target language(s): swe\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm4k,spm4k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/nor-swe/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/nor-swe/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/nor-swe/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-no-sv"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "pdelobelle/robbert-v2-dutch-base", "performance": {"dataset": "", "accuracy": null}, "description": "is the state-of-the-art dutch bert model. it is a large pre-trained general dutch language model that can be fine-tuned on a given dataset to perform any text classification, regression or token-tagging task. as such, it has been successfully used by many and for achieving state-of-the-art performance for a wide range of dutch natural language processing tasks, including: - - sentiment analysis ,", "api_call": "", "model_name": "pdelobelle/robbert-v2-dutch-base"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "facebook/mbart-large-cc25", "performance": {"dataset": "", "accuracy": null}, "description": "Pretrained (not finetuned) multilingual mbart model.\nOriginal Languages\n```\nexport langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN\n```\n\nOriginal Code: https://github.com/pytorch/fairseq/tree/master/examples/mbart\nDocs:  https://huggingface.co/transformers/master/model_doc/mbart.html\nFinetuning Code: examples/seq2seq/finetune.py (as of Aug 20, 2020)\n\nCan also be finetuned for summarization.", "api_call": "", "model_name": "facebook/mbart-large-cc25"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "bert-base-german-cased", "performance": {"dataset": "", "accuracy": null}, "description": "**Update April 3rd, 2020**: we updated the vocabulary file on deepset's s3 to conform with the default tokenization of punctuation tokens. \nFor details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60). If you want to use the old vocab we have also uploaded a [\"deepset/bert-base-german-cased-oldvocab\"](https://huggingface.co/deepset/bert-base-german-cased-oldvocab) model.\n \n## Details\n- We trained using Google's Tensorflow code on a single cloud TPU v2 with standard settings.\n- We trained 810k steps with a batch size of 1024 for sequence length 128 and 30k steps with sequence length 512. Training took about 9 days.\n- As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).\n- We cleaned the data dumps with tailored scripts and segmented sentences with spacy v2.1. To create tensorflow records we used the recommended sentencepiece library for creating the word piece vocabulary and tensorflow scripts to convert the text to data usable by BERT.\n\n\nSee https://deepset.ai/german-bert for more details", "api_call": "", "model_name": "bert-base-german-cased"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Diffusers", "api_name": "openai/shap-e-img2img", "performance": {"dataset": "", "accuracy": null}, "description": "shap-e introduces a diffusion process that can generate a 3d image from a text prompt. it was introduced in by heewoo jun and alex nichol from openai. original repository of shap-e can be found here: the authors of shap-e didn't author this model card. they provide a separate model card .  the abstract of the shap-e paper:  we present shap-e, a conditional generative model for 3d assets. unlike recent work on 3d generative models which produce a single output representation, shap-e directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. we train shap-e in two stages: first, we train an encoder that deterministically maps 3d assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. when trained on a large dataset of paired 3d and text data, our resulting models are capable of generating complex and diverse 3d assets in a matter of seconds. when compared to point-e, an explicit generative model over point clouds, shap-e converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space. we release model weights, inference code, and samples at .", "api_call": "", "model_name": "openai/shap-e-img2img"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-es-fr", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: es\n* target languages: fr\n*  OPUS readme: [es-fr](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/es-fr/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-08.zip](https://object.pouta.csc.fi/OPUS-MT-models/es-fr/opus-2020-01-08.zip)\n* test set translations: [opus-2020-01-08.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/es-fr/opus-2020-01-08.test.txt)\n* test set scores: [opus-2020-01-08.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/es-fr/opus-2020-01-08.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-es-fr"}
{"domain": "Computer Vision Image Segmentation", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "facebook/mask2former-swin-small-coco-instance", "api_call": "Mask2FormerForUniversalSegmentation.from_pretrained('facebook/mask2former-swin-small-coco-instance')", "performance": {"dataset": "COCO", "accuracy": "Not provided"}, "description": "Mask2Former model trained on COCO instance segmentation (small-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, both in terms of performance and efficiency.", "model_name": "facebook/mask2former-swin-small-coco-instance"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Emotion Recognition", "api_name": "superb/hubert-large-superb-er", "api_call": "pipeline('audio-classification', model='superb/hubert-large-superb-er')", "performance": {"dataset": "IEMOCAP", "accuracy": 0.6762}, "description": "This is a ported version of S3PRL's Hubert for the SUPERB Emotion Recognition task. The base model is hubert-large-ll60k, which is pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. For more information, refer to SUPERB: Speech processing Universal PERformance Benchmark.", "model_name": "superb/hubert-large-superb-er"}
{"domain": "Computer Vision Image-to-Image", "framework": "Keras", "functionality": "Image Deblurring", "api_name": "google/maxim-s3-deblurring-gopro", "api_call": "from_pretrained_keras('google/maxim-s3-deblurring-gopro')", "performance": {"dataset": "GoPro", "accuracy": {"PSNR": 32.86, "SSIM": 0.961}}, "description": "MAXIM model pre-trained for image deblurring. It was introduced in the paper MAXIM: Multi-Axis MLP for Image Processing by Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li and first released in this repository.", "model_name": "google/maxim-s3-deblurring-gopro"}
{"domain": "Computer Vision Text-to-Video", "framework": "Hugging Face Diffusers", "api_name": "PAIR/text2video-zero-controlnet-canny-avatar", "performance": {"dataset": "", "accuracy": null}, "description": "is a zero-shot text to video generator. it can perform `zero-shot text-to-video generation`, `video instruct pix2pix` instruction-guided video editing, `text and pose conditional video generation`, `text and canny-edge conditional video generation`, and `text, canny-edge and dreambooth conditional video generation`. for more information about this work,", "api_call": "", "model_name": "PAIR/text2video-zero-controlnet-canny-avatar"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "redrussianarmy/gpt2-turkish-cased", "performance": {"dataset": "", "accuracy": null}, "description": "in this repository i release a gpt-2 model that was trained on various turkish texts. the model is meant to be an entry point for fine-tuning on other texts. i used a turkish corpus taken from oscar-corpus.", "api_call": "", "model_name": "redrussianarmy/gpt2-turkish-cased"}
{"domain": "Computer Vision Zero-Shot Image Classification", "framework": "Hugging Face Transformers", "functionality": "Zero-Shot Image Classification", "api_name": "OFA-Sys/chinese-clip-vit-base-patch16", "api_call": "ChineseCLIPModel.from_pretrained('OFA-Sys/chinese-clip-vit-base-patch16')", "performance": {"dataset": {"MUGE Text-to-Image Retrieval": {"accuracy": {"Zero-shot R@1": 63.0, "Zero-shot R@5": 84.1, "Zero-shot R@10": 89.2, "Finetune R@1": 68.9, "Finetune R@5": 88.7, "Finetune R@10": 93.1}}, "Flickr30K-CN Retrieval": {"accuracy": {"Zero-shot Text-to-Image R@1": 71.2, "Zero-shot Text-to-Image R@5": 91.4, "Zero-shot Text-to-Image R@10": 95.5, "Finetune Text-to-Image R@1": 83.8, "Finetune Text-to-Image R@5": 96.9, "Finetune Text-to-Image R@10": 98.6, "Zero-shot Image-to-Text R@1": 81.6, "Zero-shot Image-to-Text R@5": 97.5, "Zero-shot Image-to-Text R@10": 98.8, "Finetune Image-to-Text R@1": 95.3, "Finetune Image-to-Text R@5": 99.7, "Finetune Image-to-Text R@10": 100.0}}, "COCO-CN Retrieval": {"accuracy": {"Zero-shot Text-to-Image R@1": 69.2, "Zero-shot Text-to-Image R@5": 89.9, "Zero-shot Text-to-Image R@10": 96.1, "Finetune Text-to-Image R@1": 81.5, "Finetune Text-to-Image R@5": 96.9, "Finetune Text-to-Image R@10": 99.1, "Zero-shot Image-to-Text R@1": 63.0, "Zero-shot Image-to-Text R@5": 86.6, "Zero-shot Image-to-Text R@10": 92.9, "Finetune Image-to-Text R@1": 83.5, "Finetune Image-to-Text R@5": 97.3, "Finetune Image-to-Text R@10": 99.2}}, "Zero-shot Image Classification": {"accuracy": {"CIFAR10": 96.0, "CIFAR100": 79.7, "DTD": 51.2, "EuroSAT": 52.0, "FER": 55.1, "FGVC": 26.2, "KITTI": 49.9, "MNIST": 79.4, "PC": 63.5, "VOC": 84.9}}}}, "description": "Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. It uses ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder.", "model_name": "OFA-Sys/chinese-clip-vit-base-patch16"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "ogkalu/Comic-Diffusion", "performance": {"dataset": "", "accuracy": null}, "description": "v2 is here. trained on 6 styles at once, it allows anyone to create unique but consistent styles by mixing any number of the tokens. even changing the order of the same list influences results so there's a lot to experiment with here. this was created so anyone could create their comic projects with ease and flexibility. it is the culmination of all my experimentation with dreambooth thus far. the tokens for v2 are - - charliebo artstyle", "api_call": "", "model_name": "ogkalu/Comic-Diffusion"}
{"domain": "Natural Language Processing Text Classification", "framework": "Hugging Face Transformers", "api_name": "sismetanin/rubert-toxic-pikabu-2ch", "performance": {"dataset": "", "accuracy": null}, "description": "rubert-toxic is a model fine-tuned on . you can find a detailed description of the data used and the fine-tuning process in . you can also find this information at . system p r f1", "api_call": "", "model_name": "sismetanin/rubert-toxic-pikabu-2ch"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-en-it", "api_call": "pipeline('translation_en_to_it', model='Helsinki-NLP/opus-mt-en-it')", "performance": {"dataset": "opus", "accuracy": {"newssyscomb2009.en.it": {"BLEU": 30.9, "chr-F": 0.606}, "newstest2009.en.it": {"BLEU": 31.9, "chr-F": 0.604}, "Tatoeba.en.it": {"BLEU": 48.2, "chr-F": 0.6950000000000001}}}, "description": "A Transformer-based English to Italian translation model trained on the OPUS dataset. This model can be used for translation tasks using the Hugging Face Transformers library.", "model_name": "Helsinki-NLP/opus-mt-en-it"}
{"domain": "Audio Automatic Speech Recognition", "framework": "Hugging Face Transformers", "api_name": "sweetcocoa/pop2piano", "performance": {"dataset": "", "accuracy": null}, "description": "pop2piano, a transformer network that generates piano covers given waveforms of pop music. pop2piano was proposed in the paper by jongho choi and kyogu lee.", "api_call": "", "model_name": "sweetcocoa/pop2piano"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "api_name": "enzostvs/hair-color", "performance": {"dataset": "", "accuracy": null}, "description": "autogenerated by huggingpics\ud83e\udd17\ud83d\uddbc\ufe0f create your own image classifier for anything by running . report any issues with the demo at the .", "api_call": "", "model_name": "enzostvs/hair-color"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-tr-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: tr\n* target languages: en\n*  OPUS readme: [tr-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/tr-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-16.zip](https://object.pouta.csc.fi/OPUS-MT-models/tr-en/opus-2020-01-16.zip)\n* test set translations: [opus-2020-01-16.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/tr-en/opus-2020-01-16.test.txt)\n* test set scores: [opus-2020-01-16.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/tr-en/opus-2020-01-16.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-tr-en"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-st-en", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: st\n* target languages: en\n*  OPUS readme: [st-en](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/st-en/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-16.zip](https://object.pouta.csc.fi/OPUS-MT-models/st-en/opus-2020-01-16.zip)\n* test set translations: [opus-2020-01-16.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/st-en/opus-2020-01-16.test.txt)\n* test set scores: [opus-2020-01-16.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/st-en/opus-2020-01-16.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-st-en"}
{"domain": "Computer Vision Text-to-Image", "framework": "Hugging Face Diffusers", "api_name": "wavymulder/modelshoot", "performance": {"dataset": "", "accuracy": null}, "description": "modelshoot style use `modelshoot style` in your prompt i recommend at the start i also suggest your prompts include subject and location, for example \"`amy adams at the construction site`\" , as this helps the model to resolve backgrounds and small details.", "api_call": "", "model_name": "wavymulder/modelshoot"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-hi", "performance": {"dataset": "", "accuracy": null}, "description": "* source group: English \n* target group: Hindi \n*  OPUS readme: [eng-hin](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-hin/README.md)\n\n*  model: transformer-align\n* source language(s): eng\n* target language(s): hin\n* model: transformer-align\n* pre-processing: normalization + SentencePiece (spm32k,spm32k)\n* download original weights: [opus-2020-06-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-hin/opus-2020-06-17.zip)\n* test set translations: [opus-2020-06-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-hin/opus-2020-06-17.test.txt)\n* test set scores: [opus-2020-06-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-hin/opus-2020-06-17.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-hi"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "api_name": "0x70DA/down-syndrome-classifier", "performance": {"dataset": "", "accuracy": null}, "description": "autogenerated by huggingpics\ud83e\udd17\ud83d\uddbc\ufe0f create your own image classifier for anything by running . report any issues with the demo at the .", "api_call": "", "model_name": "0x70DA/down-syndrome-classifier"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "KB/bert-base-swedish-cased", "performance": {"dataset": "", "accuracy": null}, "description": "the national library of sweden / kblab releases three pretrained language models based on bert and albert. the models are trained on approximately 15-20gb of text 200m sentences, 3000m tokens from various sources books, news, government publications, swedish wikipedia and internet forums aiming to provide a representative bert model for swedish text. a more complete description will be published later on. the following three models are currently available: - bert-base-swedish-cased v1 - a bert trained with the same hyperparameters as first published by google.", "api_call": "", "model_name": "KB/bert-base-swedish-cased"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-fr-de", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: fr\n* target languages: de\n*  OPUS readme: [fr-de](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/fr-de/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-09.zip](https://object.pouta.csc.fi/OPUS-MT-models/fr-de/opus-2020-01-09.zip)\n* test set translations: [opus-2020-01-09.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/fr-de/opus-2020-01-09.test.txt)\n* test set scores: [opus-2020-01-09.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/fr-de/opus-2020-01-09.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-fr-de"}
{"domain": "Natural Language Processing Fill-Mask", "framework": "Hugging Face Transformers", "api_name": "recobo/agriculture-bert-uncased", "performance": {"dataset": "", "accuracy": null}, "description": "a bert-based language model further pre-trained from the checkpoint of . the dataset gathered is a balance between scientific and general works in agriculture domain and encompassing knowledge from different areas of agriculture research and practical knowledge. the corpus contains 1.2 million paragraphs from national agricultural library nal from the us gov. and 5.3 million paragraphs from books and common literature from the agriculture domain .", "api_call": "", "model_name": "recobo/agriculture-bert-uncased"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-de-nl", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: de\n* target languages: nl\n*  OPUS readme: [de-nl](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/de-nl/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2020-01-20.zip](https://object.pouta.csc.fi/OPUS-MT-models/de-nl/opus-2020-01-20.zip)\n* test set translations: [opus-2020-01-20.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-nl/opus-2020-01-20.test.txt)\n* test set scores: [opus-2020-01-20.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/de-nl/opus-2020-01-20.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-de-nl"}
{"domain": "Natural Language Processing Translation", "framework": "Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-it-en", "api_call": "pipeline('translation_it_to_en', model='Helsinki-NLP/opus-mt-it-en')", "performance": {"dataset": "opus", "accuracy": {"BLEU": {"newssyscomb2009.it.en": 35.3, "newstest2009.it.en": 34.0, "Tatoeba.it.en": 70.9}, "chr-F": {"newssyscomb2009.it.en": 0.6000000000000001, "newstest2009.it.en": 0.594, "Tatoeba.it.en": 0.808}}}, "description": "A transformer model for Italian to English translation trained on the OPUS dataset. It can be used for translating Italian text to English.", "model_name": "Helsinki-NLP/opus-mt-it-en"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "api_name": "Helsinki-NLP/opus-mt-en-cs", "performance": {"dataset": "", "accuracy": null}, "description": "* source languages: en\n* target languages: cs\n*  OPUS readme: [en-cs](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/en-cs/README.md)\n\n*  dataset: opus\n* model: transformer-align\n* pre-processing: normalization + SentencePiece\n* download original weights: [opus-2019-12-18.zip](https://object.pouta.csc.fi/OPUS-MT-models/en-cs/opus-2019-12-18.zip)\n* test set translations: [opus-2019-12-18.test.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-cs/opus-2019-12-18.test.txt)\n* test set scores: [opus-2019-12-18.eval.txt](https://object.pouta.csc.fi/OPUS-MT-models/en-cs/opus-2019-12-18.eval.txt)", "api_call": "", "model_name": "Helsinki-NLP/opus-mt-en-cs"}
{"domain": "Audio Audio Classification", "framework": "Hugging Face Transformers", "functionality": "Transformers", "api_name": "wav2vec2-base-superb-sv", "api_call": "AutoModelForAudioXVector.from_pretrained('anton-l/wav2vec2-base-superb-sv')", "performance": {"dataset": "superb", "accuracy": "More information needed"}, "description": "This is a ported version of S3PRL's Wav2Vec2 for the SUPERB Speaker Verification task. The base model is wav2vec2-large-lv60, which is pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. For more information, refer to SUPERB: Speech processing Universal PERformance Benchmark.", "model_name": "anton-l/wav2vec2-base-superb-sv"}
{"domain": "Natural Language Processing Translation", "framework": "Transformers", "functionality": "Translation", "api_name": "opus-mt-fr-en", "api_call": "pipeline('translation_fr_to_en', model='Helsinki-NLP/opus-mt-fr-en')", "performance": {"dataset": "opus", "accuracy": {"BLEU": {"newsdiscussdev2015-enfr.fr.en": 33.1, "newsdiscusstest2015-enfr.fr.en": 38.7, "newssyscomb2009.fr.en": 30.3, "news-test2008.fr.en": 26.2, "newstest2009.fr.en": 30.2, "newstest2010.fr.en": 32.2, "newstest2011.fr.en": 33.0, "newstest2012.fr.en": 32.8, "newstest2013.fr.en": 33.9, "newstest2014-fren.fr.en": 37.8, "Tatoeba.fr.en": 57.5}}}, "description": "Helsinki-NLP/opus-mt-fr-en is a machine translation model trained to translate from French to English. It is based on the Marian NMT framework and trained on the OPUS dataset.", "model_name": "Helsinki-NLP/opus-mt-fr-en"}
{"domain": "Computer Vision Image-to-Image", "framework": "Hugging Face Diffusers", "api_name": "CrucibleAI/ControlNetMediaPipeFace", "performance": {"dataset": "", "accuracy": null}, "description": "## Table of Contents:\n- Overview: Samples, Contents, and Construction\n- Usage: Downloading, Training, and Inference\n- License\n- Credits and Thanks\n\n# Overview:\n\nThis dataset is designed to train a ControlNet with human facial expressions.  It includes keypoints for pupils to allow gaze direction.  Training has been tested on Stable Diffusion v2.1 base (512) and Stable Diffusion v1.5.", "api_call": "", "model_name": "CrucibleAI/ControlNetMediaPipeFace"}
{"domain": "Natural Language Processing Translation", "framework": "Hugging Face Transformers", "functionality": "Translation", "api_name": "Helsinki-NLP/opus-mt-nl-en", "api_call": "pipeline('translation_nl_to_en', model='Helsinki-NLP/opus-mt-nl-en')", "performance": {"dataset": "Tatoeba.nl.en", "accuracy": {"BLEU": 60.9, "chr-F": 0.749}}, "description": "A Dutch to English translation model based on the OPUS dataset, using a transformer-align architecture with normalization and SentencePiece pre-processing.", "model_name": "Helsinki-NLP/opus-mt-nl-en"}
{"domain": "Natural Language Processing Feature Extraction", "framework": "Hugging Face Transformers", "api_name": "monsoon-nlp/hindi-bert", "performance": {"dataset": "", "accuracy": null}, "description": "this is a first attempt at a hindi language model trained with google research's . as of 2022 i recommend google's muril model trained on english, hindi, and other major indian languages, both in their script and latinized script : and for causal language models, i would suggest though this is a large model", "api_call": "", "model_name": "monsoon-nlp/hindi-bert"}
{"domain": "Natural Language Processing Summarization", "framework": "Hugging Face Transformers", "api_name": "google/bigbird-pegasus-large-pubmed", "performance": {"dataset": "", "accuracy": null}, "description": "bigbird, is a sparse-attention based transformer which extends transformer based models, such as bert to much longer sequences. moreover, bigbird comes along with a theoretical understanding of the capabilities of a complete transformer that the sparse model can handle. bigbird was introduced in this and first released in this . disclaimer: the team releasing bigbird did not write a model card for this model so this model card has been written by the hugging face team. bigbird relies on block sparse attention instead of normal attention i.e. bert's attention and can handle sequences up to a length of 4096 at a much lower compute cost compared to bert. it has achieved sota on various tasks involving very long sequences such as long documents summarization, question-answering with long contexts.", "api_call": "", "model_name": "google/bigbird-pegasus-large-pubmed"}
{"domain": "Natural Language Processing Text Generation", "framework": "Hugging Face Transformers", "api_name": "uer/gpt2-chinese-poem", "performance": {"dataset": "", "accuracy": null}, "description": "the model is pre-trained by , which is introduced in . besides, the model could also be pre-trained by introduced in , which inherits uer-py to support models with parameters above one billion, and extends it to a multimodal pre-training framework. the model is used to generate chinese ancient poems. you can download the model from the , or , or via huggingface from the link . since the parameter skip special tokens is used in the pipelines.py, special tokens such as sep, unk will be deleted, the output results of hosted inference api right may not be properly displayed.", "api_call": "", "model_name": "uer/gpt2-chinese-poem"}
{"domain": "Computer Vision Image Classification", "framework": "Hugging Face Transformers", "api_name": "JuanMa360/kitchen-style-classification", "performance": {"dataset": "", "accuracy": null}, "description": "House & Apartments Classification model\ud83e\udd17\ud83d\uddbc\ufe0f", "api_call": "", "model_name": "JuanMa360/kitchen-style-classification"}
