IdeaFlux is a Reddit comment mining tool. It uses the structure of a post's comment tree to heuristically select the single thread of comments that appears most interesting in terms of comment engagement.
IdeaFlux is able to:
- mine comments of any Reddit subreddit into a database
- rank comment sections of posts in terms of user engagement
- output the interesting submissions through an API
- generate data visualizations of the interesting submissions on a webpage
IdeaFlux can be used as-is: all you need to do is access [this website]. Here is a snapshot of it:
To run your own instance instead, complete the following steps:
- First of all, you need a Reddit account.
- Then you need to create an app within Reddit to get the OAuth2 keys for accessing the API. For more information, see the section "The Reddit API" at https://www.storybench.org/how-to-scrape-reddit-with-python/. Generating the keys is easier than you might think.
- After that, create a YAML file (login.yaml) containing your credentials for the application, then place it in the project's root directory. You can do that with a tool such as https://codebeautify.org/yaml-editor-online. The file must follow the template below:
login:
  client_id: [API client ID]
  client_secret: [API client secret]
  user_agent: [API name]
  username: [your reddit username]
  password: [your reddit password]
- Then place login.yaml in path/to/backend/configs/.
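A malformed credentials file is easy to miss until the first API call fails, so it can help to check it up front. The sketch below is one possible way to do that, not IdeaFlux's own loader: the function names are hypothetical, and it assumes the five keys of the template are nested under a top-level login mapping and that PyYAML is installed.

```python
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent", "username", "password")


def validate_login(config):
    """Return the mapping stored under 'login', or raise if a key is missing or empty."""
    login = config.get("login") if isinstance(config, dict) else None
    if not isinstance(login, dict):
        raise ValueError("login.yaml must contain a top-level 'login' mapping")
    missing = [key for key in REQUIRED_KEYS if not login.get(key)]
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    return login


def load_login(path):
    """Read the YAML file at `path` and validate it (hypothetical helper)."""
    import yaml  # PyYAML, imported lazily so validate_login works without it

    with open(path, encoding="utf-8") as fh:
        return validate_login(yaml.safe_load(fh))
```

If the project talks to Reddit through PRAW (an assumption), the returned dict could be passed on as praw.Reddit(**login), since its keys match that constructor's keyword arguments.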
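To run the backend, install the dependencies and start PostgreSQL, Redis, and the API server. Both redis-server and uvicorn run in the foreground, so start each in its own terminal:
pip install -r requirements.txt
sudo service postgresql start
redis-server
uvicorn main:app --reload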
The core of IdeaFlux is the following: it summarizes the comments of the top posts of subreddits of your choice by recursively selecting the comment that generated the most comments:
- The root node is the topic itself. It greedily selects its most prosperous comment (the one that generated the most comments) and outputs it to a text file;
- Then this parent comment selects its most prosperous child node;
- This process is repeated recursively until it reaches a leaf node.
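The greedy selection described above can be sketched as follows. This is a minimal illustration rather than IdeaFlux's actual code. It assumes that comments are dicts carrying the comment_id and parent_id fields, that top-level comments use the submission id as their parent_id, and that "generated the most comments" means the largest number of descendants (counting only direct replies is an equally plausible reading). All function names are hypothetical.

```python
from collections import defaultdict


def build_children(comments):
    """Map each parent id to the list of its direct replies."""
    children = defaultdict(list)
    for comment in comments:
        children[comment["parent_id"]].append(comment)
    return children


def subtree_size(comment_id, children):
    """Count every descendant (replies, replies to replies, ...) of a comment."""
    return sum(
        1 + subtree_size(child["comment_id"], children)
        for child in children.get(comment_id, [])
    )


def select_spine(root_id, children):
    """Walk down from the submission, always following the busiest reply."""
    spine = []
    current = root_id
    while children.get(current):
        best = max(
            children[current],
            key=lambda c: subtree_size(c["comment_id"], children),
        )
        spine.append(best)
        current = best["comment_id"]
    return spine


comments = [
    {"comment_id": "a", "parent_id": "post", "body": "A"},
    {"comment_id": "b", "parent_id": "post", "body": "B"},
    {"comment_id": "c", "parent_id": "b", "body": "C"},
    {"comment_id": "d", "parent_id": "b", "body": "D"},
    {"comment_id": "e", "parent_id": "d", "body": "E"},
]
spine = select_spine("post", build_children(comments))
print([c["comment_id"] for c in spine])  # ['b', 'd', 'e']
```

When two replies tie, max keeps the first one it encounters, so the result depends on the order in which the comments were collected.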
For the backend, IdeaFlux stores all the mined data in PostgreSQL, implements a simple task queue and caching system with Redis, and serves the APIs with FastAPI. For the frontend, we use d3.js to visualize the most interesting posts on Reddit.
There are two database systems: PostgreSQL and Redis.
PostgreSQL has three main tables (submissions, reddit_comments, submissions_user_engagement), which can be created using the following queries:
-- Keys are declared VARCHAR: in PostgreSQL, CHAR without a length means CHAR(1),
-- which is too short to hold a Reddit ID.
CREATE TABLE submissions (
    submission_id VARCHAR PRIMARY KEY,
    comments_num FLOAT,
    subreddit_id VARCHAR,
    subreddit_display_name VARCHAR,
    title VARCHAR,
    created TIMESTAMP,
    url VARCHAR,
    body VARCHAR
);
CREATE TABLE reddit_comments (
    comment_id VARCHAR PRIMARY KEY,
    parent_id VARCHAR,
    submission_id VARCHAR,
    body VARCHAR
);
CREATE TABLE submissions_user_engagement (
    submission_id VARCHAR PRIMARY KEY,
    user_engagement FLOAT NOT NULL,
    spine_body VARCHAR
);
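For a fresh environment, the three tables could be created from Python in an idempotent way. The sketch below is only an illustration under stated assumptions: init_tables is a hypothetical helper, the keys are declared VARCHAR (CHAR without a length is CHAR(1) in PostgreSQL), and the demo runs against an in-memory SQLite database as a stand-in so that it works without a server. Any DB-API connection, such as one from psycopg2, exposes the same cursor() and commit() methods.

```python
import sqlite3  # stand-in database for the demo only

SCHEMA = (
    """CREATE TABLE IF NOT EXISTS submissions (
        submission_id VARCHAR PRIMARY KEY,
        comments_num FLOAT,
        subreddit_id VARCHAR,
        subreddit_display_name VARCHAR,
        title VARCHAR,
        created TIMESTAMP,
        url VARCHAR,
        body VARCHAR
    )""",
    """CREATE TABLE IF NOT EXISTS reddit_comments (
        comment_id VARCHAR PRIMARY KEY,
        parent_id VARCHAR,
        submission_id VARCHAR,
        body VARCHAR
    )""",
    """CREATE TABLE IF NOT EXISTS submissions_user_engagement (
        submission_id VARCHAR PRIMARY KEY,
        user_engagement FLOAT NOT NULL,
        spine_body VARCHAR
    )""",
)


def init_tables(conn):
    """Create the three tables if they are missing; safe to call repeatedly."""
    cur = conn.cursor()
    for statement in SCHEMA:
        cur.execute(statement)
    conn.commit()
    cur.close()


if __name__ == "__main__":
    demo = sqlite3.connect(":memory:")
    init_tables(demo)
    init_tables(demo)  # the second call is a no-op thanks to IF NOT EXISTS
    demo.close()
```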
Redis holds two objects: a hash table and a task queue.
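How these two objects are used is not spelled out here, so the following is only a guess at the general shape: a Redis list acting as a first-in, first-out task queue and a Redis hash acting as a cache. The key names, the task payload, and the function names are all hypothetical. The client argument is expected to behave like a redis-py client, of which only rpush, lpop, hset, and hget are used.

```python
import json

QUEUE_KEY = "ideaflux:tasks"  # hypothetical key name
CACHE_KEY = "ideaflux:cache"  # hypothetical key name


def enqueue_task(client, subreddit):
    """Append a mining task to the tail of the queue (a Redis list)."""
    client.rpush(QUEUE_KEY, json.dumps({"subreddit": subreddit}))


def dequeue_task(client):
    """Pop the oldest task from the head of the queue, or return None if empty."""
    raw = client.lpop(QUEUE_KEY)
    return json.loads(raw) if raw is not None else None


def cache_put(client, field, value):
    """Store a JSON-serializable value under one field of the cache hash."""
    client.hset(CACHE_KEY, field, json.dumps(value))


def cache_get(client, field):
    """Return the cached value for a field, or None on a cache miss."""
    raw = client.hget(CACHE_KEY, field)
    return json.loads(raw) if raw is not None else None
```

Pushing on one end of the list and popping from the other keeps tasks in arrival order. Values are serialized to JSON because Redis stores strings, and json.loads accepts the bytes that redis-py returns by default.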
FastAPI serves the following APIs:
- Check which tables exist in the DB
    SELECT * FROM information_schema.tables WHERE table_schema = 'public';
- Get the list of subreddit names that have been mined
    SELECT DISTINCT subreddit_display_name FROM submissions;
- Get the list of comments for a given submission id
    SELECT * FROM reddit_comments
    WHERE submission_id = '{submission_id}';
- Get the list of submissions (posts) for a given subreddit, sorted by user engagement score in descending order
    SELECT *
    FROM
        submissions
    INNER JOIN
        submissions_user_engagement
    ON
        submissions_user_engagement.submission_id = submissions.submission_id
    WHERE
        submissions.subreddit_display_name = '{subreddit_name}'
    ORDER BY
        user_engagement DESC;
- Get the list of submissions (posts) that are in the top 5 of each subreddit within the last day, sorted by decreasing user engagement score
    SELECT *
    FROM (
        SELECT
            *, ROW_NUMBER() OVER (PARTITION BY submissions.subreddit_id ORDER BY user_engagement DESC) AS ua_rank
        FROM
            submissions
        INNER JOIN
            submissions_user_engagement
        ON
            submissions_user_engagement.submission_id = submissions.submission_id
        WHERE created > now() - interval '1 day'
    ) ranks
    WHERE ua_rank <= 5
    -- Sort in the outer query: row order coming out of a subquery is not guaranteed.
    ORDER BY
        user_engagement DESC;
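The queries above show '{submission_id}' and '{subreddit_name}' as placeholders. If those values come from API request parameters, pasting them into the SQL string would allow SQL injection, so one option is to let the database driver bind them. The sketch below assumes a DB-API cursor that uses the %s placeholder style (as psycopg2 does). The function names are hypothetical and this is not necessarily how IdeaFlux runs its queries.

```python
COMMENTS_SQL = "SELECT * FROM reddit_comments WHERE submission_id = %s;"

RANKED_SQL = """
SELECT *
FROM submissions
INNER JOIN submissions_user_engagement
    ON submissions_user_engagement.submission_id = submissions.submission_id
WHERE submissions.subreddit_display_name = %s
ORDER BY user_engagement DESC;
"""


def fetch_comments(cur, submission_id):
    """Return all comments of one submission; the driver binds the value."""
    cur.execute(COMMENTS_SQL, (submission_id,))
    return cur.fetchall()


def fetch_ranked_submissions(cur, subreddit_name):
    """Return a subreddit's submissions, most engaging first."""
    cur.execute(RANKED_SQL, (subreddit_name,))
    return cur.fetchall()
```

Because the value travels separately from the SQL text, a submission id such as x'; DROP TABLE submissions; -- is treated as an ordinary string that matches no row.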
To do:
- program SQL table init for a new environment
- improve logging messages
- write commonly used SQL queries
- implement a rate limiter for the APIs
- improve JavaScript responsiveness (PC/tablet/phone)
- implement memory cache for /subreddits/top{top_nth}-th using Redis
- implement redirect page (in case of too many requests)
- make tooltip more evident (frontend)
- add unit tests to FastAPI
- deploy API to AWS