{"cells":[{"cell_type":"markdown","source":["# Extracting tweets using the Twitter API\n","##### Step 1: Get the Blodgett_500k.csv file from the dataset folder.\n","##### Step 2: Get access to a Twitter developer account and obtain your API key and secret.\n","##### Step 3: Place both files in the same folder, or change the paths in the code to wherever the files are located.\n","##### Step 4: Run all the cells to produce the Blodget_50k.csv file, which contains 50k tweets that can be used to test our model.\n","\n","Note: It is often easier to use the [open-source data](https://github.com/slanglab/twitteraae) that is already available, as retrieving tweets through the API takes time. This source yields approximately 1M tweets. Of these, the tweets to which the Blodgett classifier assigns a high likelihood (>0.9) of being in the AAE dialect number close to 500k. This 500k subset is the \"Blodgett_500k.csv\" file used in the code below."],"metadata":{"id":"saJtPFsmH8Wx"}},{"cell_type":"markdown","source":["## Importing the libraries"],"metadata":{"id":"jGGm4n9WJRji"}},{"cell_type":"code","source":["import tweepy\n","import pandas as pd"],"metadata":{"id":"GP1AsG73lRe1"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Reading the Blodgett 500k file and the Twitter secret key"],"metadata":{"id":"OBunhMK1JW4d"}},{"cell_type":"code","source":["df = pd.read_csv(\"/content/Blodgett_500k.csv\")\n","\n","# Make sure you have Twitter elevated access to extract the tweets.\n","with open(\"/content/twitter-api-secret.txt\") as api_file:\n","  api_secret = api_file.read().splitlines()"],"metadata":{"id":"9ETirh39mJ3Q"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Authentication"],"metadata":{"id":"QuQnl-hFJhBr"}},{"cell_type":"code","source":["auth = tweepy.AppAuthHandler(api_secret[0], api_secret[1])\n","\n","api = tweepy.API(auth, wait_on_rate_limit=True)"],"metadata":{"id":"3i7PTLl9mZcD"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Dictionary where all the tweet IDs and tweet texts will be stored"],"metadata":{"id":"4OLTb3FUJlZC"}},{"cell_type":"code","source":["data_dict = {\"Tweet_id\":[],\"Tweet_text\":[]}"],"metadata":{"id":"waFbQiiWmtR-"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## API call to fetch the tweets and store them in the dictionary"],"metadata":{"id":"XRlS0DvGJ5BO"}},{"cell_type":"code","source":["for tweet_id in df[\"Tweet_id\"]:\n","  if len(data_dict[\"Tweet_id\"]) == 50000:\n","    break\n","  try:\n","    tweet = api.get_status(tweet_id)._json[\"text\"]\n","    data_dict[\"Tweet_id\"].append(tweet_id)\n","    data_dict[\"Tweet_text\"].append(tweet)\n","  except Exception:  # Skip tweets that could not be fetched (e.g., deleted or removed) so the loop does not crash.\n","    pass\n","\n"],"metadata":{"id":"AVYulo1xmrVl"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## Creating a data frame from the dictionary and generating a CSV file for later use"],"metadata":{"id":"gJCEREpLKBEj"}},{"cell_type":"code","source":["tweet_df = pd.DataFrame(data_dict)\n","tweet_df.to_csv(\"Blodget_50k.csv\")"],"metadata":{"id":"vSfEeEfbnsn4"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":[],"metadata":{"id":"wiVo6AWZn4NL"},"execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.6"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":0}