{
    "name": "spambase",
    "n_num_features": 57,
    "n_cat_features": 0,
    "train_size": 2944,
    "val_size": 736,
    "test_size": 921,
    "source": "https://www.openml.org/search?type=data&status=active&id=44&sort=runs",
    "task_intro": "**Author**: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt    \n**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/spambase)   \n**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)\n\nSPAM E-mail Database  \nThe \"spam\" concept is diverse: advertisements for products/websites, make money fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam.  Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam.  These are useful when constructing a personalized spam filter.  One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.\n \nFor background on spam:  \nCranor, Lorrie F., LaMacchia, Brian A.  Spam! Communications of the ACM, 41(8):74-83, 1998.  \n\n### Attribute Information:  \nThe last column denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters.  \n\nFor the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:  \n\n48 continuous real [0,100] attributes of type  \nword_freq_WORD = percentage of words in the e-mail that match WORD,  i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail.  A \"word\" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.\n \n6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurences) / total characters in e-mail\n \n1 continuous real [1,...] attribute of type capital_run_length_average\n = average length of uninterrupted sequences of capital letters\n \n1 continuous integer [1,...] attribute of type capital_run_length_longest\n = length of longest uninterrupted sequence of capital letters\n \n1 continuous integer [1,...] attribute of type capital_run_length_total\n = sum of length of uninterrupted sequences of capital letters\n = total number of capital letters in the e-mail\n \n1 nominal {0,1} class attribute of type spam\n = denotes whether the e-mail was considered spam (1) or not (0), \n i.e. unsolicited commercial e-mail.",
    "task_type": "binclass",
    "openml_id": 44,
    "imbalance_ratio": 1.5377826806398236,
    "n_classes": 2,
    "num_feature_intro": {
        "word_freq_make": "word_freq_make",
        "word_freq_address": "word_freq_address",
        "word_freq_all": "word_freq_all",
        "word_freq_3d": "word_freq_3d",
        "word_freq_our": "word_freq_our",
        "word_freq_over": "word_freq_over",
        "word_freq_remove": "word_freq_remove",
        "word_freq_internet": "word_freq_internet",
        "word_freq_order": "word_freq_order",
        "word_freq_mail": "word_freq_mail",
        "word_freq_receive": "word_freq_receive",
        "word_freq_will": "word_freq_will",
        "word_freq_people": "word_freq_people",
        "word_freq_report": "word_freq_report",
        "word_freq_addresses": "word_freq_addresses",
        "word_freq_free": "word_freq_free",
        "word_freq_business": "word_freq_business",
        "word_freq_email": "word_freq_email",
        "word_freq_you": "word_freq_you",
        "word_freq_credit": "word_freq_credit",
        "word_freq_your": "word_freq_your",
        "word_freq_font": "word_freq_font",
        "word_freq_000": "word_freq_000",
        "word_freq_money": "word_freq_money",
        "word_freq_hp": "word_freq_hp",
        "word_freq_hpl": "word_freq_hpl",
        "word_freq_george": "word_freq_george",
        "word_freq_650": "word_freq_650",
        "word_freq_lab": "word_freq_lab",
        "word_freq_labs": "word_freq_labs",
        "word_freq_telnet": "word_freq_telnet",
        "word_freq_857": "word_freq_857",
        "word_freq_data": "word_freq_data",
        "word_freq_415": "word_freq_415",
        "word_freq_85": "word_freq_85",
        "word_freq_technology": "word_freq_technology",
        "word_freq_1999": "word_freq_1999",
        "word_freq_parts": "word_freq_parts",
        "word_freq_pm": "word_freq_pm",
        "word_freq_direct": "word_freq_direct",
        "word_freq_cs": "word_freq_cs",
        "word_freq_meeting": "word_freq_meeting",
        "word_freq_original": "word_freq_original",
        "word_freq_project": "word_freq_project",
        "word_freq_re": "word_freq_re",
        "word_freq_edu": "word_freq_edu",
        "word_freq_table": "word_freq_table",
        "word_freq_conference": "word_freq_conference",
        "char_freq_%3B": "char_freq_%3B",
        "char_freq_%28": "char_freq_%28",
        "char_freq_%5B": "char_freq_%5B",
        "char_freq_%21": "char_freq_%21",
        "char_freq_%24": "char_freq_%24",
        "char_freq_%23": "char_freq_%23",
        "capital_run_length_average": "capital_run_length_average",
        "capital_run_length_longest": "capital_run_length_longest",
        "capital_run_length_total": "capital_run_length_total"
    },
    "cat_feature_intro": {}
}