Leveraging Twitter API for NLP: A Simple Tweet Collection Script

TwitterAPI ,NLPScripts ,DataCollection ,PythonProgramming

Introduction

In the realm of Natural Language Processing (NLP), data is paramount. Social media platforms, especially Twitter, offer a rich source of real-time data that can be leveraged for various NLP tasks. This blog post explores a Python script designed to collect tweets from Twitter, providing a foundational tool for NLP enthusiasts and researchers.

Motivations

With the Twitter API offering extensive access to real-time tweet data, the need to gather and analyze this data for NLP purposes has grown. Whether for sentiment analysis, trend detection, or linguistic research, having a reliable method to collect tweets on specific topics is crucial. This script simplifies the process of collecting tweets by leveraging Twitter’s streaming API, allowing users to focus on analyzing the data rather than dealing with complex data collection issues.

Expectatives

The script is designed to:

Stream real-time tweets based on a specific keyword.
Collect tweets until a predefined count is reached.
Save the collected tweets in JSON format for easy analysis.
Be easily adaptable for various NLP tasks, making it a versatile tool for researchers and developers.

Code

Here’s the Python script used for collecting tweets:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Torrez, Milton N.

import os
import uuid
import json
import configparser
import tweepy
from tweepy import Stream
from time import sleep
import sys

# Load configuration
config = configparser.ConfigParser()
config.read("config.ini")

# Twitter API credentials
ACCESS_TOKEN = os.getenv("TWITTER_ACCESS_TOKEN", config.get("twitter", "access_token"))
ACCESS_TOKEN_SECRET = os.getenv("TWITTER_ACCESS_TOKEN_SECRET", config.get("twitter", "access_token_secret"))
CONSUMER_KEY = os.getenv("TWITTER_CONSUMER_KEY", config.get("twitter", "api_key"))
CONSUMER_SECRET = os.getenv("TWITTER_CONSUMER_SECRET", config.get("twitter", "api_secret_key"))

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        super().__init__(api)
        self.tweets_list = []

    def on_status(self, status):
        print("=" * 20)
        print("TAMAÑO DE LA LISTA -> ", len(self.tweets_list))
        print("=" * 20)
        if len(self.tweets_list) <= int(arguments[2]):
            self.add_new_status(status)
        else:
            self.save_tweets()
            self.clear_list()

    def clear_list(self):
        print("CLEARING LIST!")
        self.tweets_list.clear()

    def save_tweets(self):
        path = f"./tweets/{arguments[1]}"

        if not os.path.exists(path):
            os.makedirs(path)

        with open(
            f"./tweets/{arguments[1]}/{arguments[1]}_{arguments[2]}_{uuid.uuid4().hex[:3]}.json",
            "w",
            encoding="utf-8",
        ) as f:
            json.dump(self.tweets_list, f, ensure_ascii=False, indent=4)

    def add_new_status(self, status):
        print("ADDING NEW TWEET!")
        self.tweets_list.append(status._json)

    def on_error(self, status_code):
        if status_code == 420:
            print("Rate limit exceeded, stopping stream.")
            return False
        print(f"Encountered an error: {status_code}")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python main.py <keyword> <tweet_count>")
        sys.exit(1)

    arguments = sys.argv
    stream_listener = MyStreamListener(api)
    stream = tweepy.Stream(auth=api.auth, listener=stream_listener)

    try:
        stream.filter(
            track=[arguments[1]],
            languages=["es"]
        )
    except KeyboardInterrupt:
        print("Stream stopped by user.")
    except Exception as e:
        print(f"An error occurred: {e}")

Code Explanation

Configuration Loading: The script starts by loading API credentials from a config.ini file or environment variables. This is essential for authenticating with the Twitter API.
Authentication: Using tweepy.OAuthHandler, the script sets up the authentication required for accessing Twitter’s API.
StreamListener Class: MyStreamListener inherits from tweepy.StreamListener and manages the collection and storage of tweets. Key methods include:
on_status: Called when a new tweet arrives. It checks if the tweet count has reached the limit and either saves the data or continues collecting.
clear_list: Clears the tweet list after saving.
save_tweets: Saves the collected tweets to a JSON file.
add_new_status: Adds a new tweet to the list.
on_error: Handles errors, including rate limits.
Main Execution: The script takes command-line arguments for the keyword and tweet count, initializes the stream, and starts filtering tweets based on the provided keyword.

Conclusions

The script offers a straightforward and effective method for collecting tweets for NLP tasks. By focusing on a specific keyword, users can gather relevant tweets in real-time and store them for later analysis. This approach simplifies the data collection process, allowing researchers and developers to concentrate on analyzing and interpreting the data.

Future Work

Enhanced Error Handling: Implement more robust error handling and logging to better manage various API issues and connectivity problems.
Data Storage Improvements: Explore alternative data storage solutions, such as databases, to handle large volumes of tweets more efficiently.
Feature Expansion: Extend the script to support additional Twitter API features, such as user timelines or specific tweet types, to enhance data collection capabilities.
User Interface: Develop a user-friendly interface for setting parameters and managing the data collection process, making the tool more accessible to non-technical users.
API Updates: Stay updated with changes in the Twitter API and adapt the script as needed to maintain functionality and compliance.