Sentiment Analysis and Stock Price Prediction using HuggingFace and SparkML

This is a group project, part of which satisfies the course project requirements for Distributed Data Systems (MSDS 697) in the University of San Francisco MSDS program.

Team Members

Gurusankar Gopalakrishnan
Devendra Govil
Maneel Reddy
Akshay Pamnani
Youshi Zhang

ML Objectives:

Scrape Reddit posts, financial 10-K reports, and stock ticker price data using PRAW, PMAW, the EDGAR API, and Alpha Vantage.
Automate the data pipeline using PySpark, Airflow, MongoDB, and GCS.
Perform sentiment analysis on the Reddit posts and 10-K reports using a pre-trained model (Hugging Face FinBERT).
Use the sentiment scores and ticker features such as EBITDA and 52-week highs/lows to predict stock prices.
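
As a rough illustration of how the last two objectives connect, the sketch below collapses per-document sentiment probabilities into a single score and averages them per ticker and day, so that they could be joined with price features. It is not the code in this repo. The input shape mimics the list of label/score dictionaries that a Hugging Face text-classification pipeline returns when asked for all class scores. The positive-minus-negative scoring rule, the function names, and the sample records are assumptions made up for this example.

```python
from collections import defaultdict
from statistics import mean


def sentiment_score(label_scores):
    """Collapse class probabilities into one score in [-1, 1].

    Assumed convention (not taken from the repo): P(positive) - P(negative).
    """
    probs = {item["label"].lower(): item["score"] for item in label_scores}
    return probs.get("positive", 0.0) - probs.get("negative", 0.0)


def daily_ticker_sentiment(records):
    """Average the per-document scores for each (ticker, date) pair."""
    buckets = defaultdict(list)
    for rec in records:
        key = (rec["ticker"], rec["date"])
        buckets[key].append(sentiment_score(rec["label_scores"]))
    return {key: mean(scores) for key, scores in buckets.items()}


# Hypothetical records: tickers, dates, and probabilities are invented.
records = [
    {"ticker": "AAA", "date": "2023-01-03", "label_scores": [
        {"label": "positive", "score": 0.75},
        {"label": "negative", "score": 0.125},
        {"label": "neutral", "score": 0.125}]},
    {"ticker": "AAA", "date": "2023-01-03", "label_scores": [
        {"label": "positive", "score": 0.25},
        {"label": "negative", "score": 0.5},
        {"label": "neutral", "score": 0.25}]},
    {"ticker": "BBB", "date": "2023-01-03", "label_scores": [
        {"label": "neutral", "score": 1.0}]},
]

features = daily_ticker_sentiment(records)
for (ticker, date), score in sorted(features.items()):
    print(ticker, date, round(score, 4))
```

Averaging is only one possible aggregation; weighting posts by upvotes or document length would be an equally reasonable choice under different assumptions.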

Install

We recommend creating a new conda environment from the environment.yml file in the repo, for example with conda env create -f environment.yml, and then activating the environment it defines.

Citation

This repo relies on the EDGAR-CRAWLER repo, available at https://github.com/nlpaueb/edgar-crawler.

The BibTeX citation is:

@inproceedings{loukas-etal-2021-edgar,
    title = "{EDGAR}-{CORPUS}: Billions of Tokens Make The World Go Round",
    author = "Loukas, Lefteris  and
      Fergadiotis, Manos  and
      Androutsopoulos, Ion  and
      Malakasiotis, Prodromos",
    booktitle = "Proceedings of the Third Workshop on Economics and Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.econlp-1.2",
    pages = "13--18",
}

The full paper is available at https://arxiv.org/abs/2109.14394

License

This project is licensed under the GNU General Public License v3.0; see the LICENSE file.

