TL;DR: We’re releasing a time-series dataset for S&P 500 companies that joins market data, such as stock price and trade volume, with news events and sentiment distilled from the world’s financial media using NLP.
You can view and download SNES v1.0 from here.
News headlines and markets interact in myriad ways, from anticipatory analyst commentary that shapes traders’ sentiment around a stock to after-the-fact coverage of notable events such as new product releases or factory expansions. It has long been known that news data is a rich source of information that could contribute significantly to better modelling and understanding of financial markets and their dynamics. Because market events and news are both inherently time-based, time series analysis lends itself well to this kind of data.
Market data such as prices and trade volumes is a fairly accessible commodity these days. This is thanks to a mature ecosystem of data providers and data processors that has developed over the past 10–15 years in the financial services space.
News data, on the other hand, is harder to obtain and, more importantly, to harness. Many vendors who provide market data also provide news feeds and headlines. However, to distill high-quality time series data that reflects events and sentiment from news articles, practitioners need to develop and apply high-quality NLP models, which could take months or years given the level of domain expertise, technical skills and training data required.
Today we’re releasing Stock-NewsEventsSentiment (SNES) 1.0 — a large dataset consisting of daily market and news time series data for S&P 500 companies over a period of 21 months (October 2020 to July 2022). In addition to news sentiment, SNES covers the following events extracted from the news in relation to each company:
We’re sharing the code and the methodology for compiling this dataset at the bottom of this article.
You can view and download the dataset from Kaggle.
SNES consists of two files:
2. data.csv/data.parquet: The main dataset containing stock price, trade volume, news events and news sentiment for S&P 500 companies during the period Oct 2020-Jul 2022.
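As a rough illustration of working with the main file, the sketch below groups the rows of a daily CSV into one date-sorted series per company using only the Python standard library. The function name load_series and the column names "Symbol" and "Date" are placeholders we made up for this example, and the sketch assumes dates are stored in ISO format (YYYY-MM-DD); check the actual header of data.csv before relying on any of them.

```python
import csv
from collections import defaultdict


def load_series(csv_lines, key_column, date_column):
    """Group rows of a daily CSV into one date-sorted list per company.

    `csv_lines` is any iterable of text lines (an open file works).
    `key_column` and `date_column` are supplied by the caller because the
    real column names in data.csv are not assumed here.
    """
    series = defaultdict(list)
    for row in csv.DictReader(csv_lines):
        series[row[key_column]].append(row)
    for rows in series.values():
        # Assumes ISO-formatted dates, which sort correctly as plain strings.
        rows.sort(key=lambda r: r[date_column])
    return dict(series)


# Hypothetical usage once data.csv has been downloaded:
# with open("data.csv", newline="") as f:
#     by_company = load_series(f, key_column="Symbol", date_column="Date")
```

Keeping one sorted list per company makes it straightforward to line up price, volume, event and sentiment values by day before any further time series analysis.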
Below we’ve included a few visualisations to help you get a better understanding of the SNES dataset:
We scrape Wikipedia for the current list of S&P 500 companies and attributes such as GICS industries and sub-industries. We use Wikidata’s SPARQL API to retrieve the Wikidata ID of each company in the S&P 500. We use the yfinance Python package to retrieve stock prices and trade volumes. Finally, we use Aylien’s News API to retrieve the news time series data for each event category and sentiment value we’re interested in.
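To make the Wikidata step more concrete, here is a minimal sketch of how ticker symbols could be mapped to Wikidata IDs through the public SPARQL endpoint. It is not the code used to build SNES. It assumes companies are matched through the "ticker symbol" qualifier (P249) on their "stock exchange" statement (P414), and the function names and the User-Agent string are invented for this example.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_URL = "https://query.wikidata.org/sparql"


def build_ticker_query(tickers):
    """Build a SPARQL query that maps ticker symbols to Wikidata entities.

    Assumption: a company is found via the ticker symbol qualifier (P249)
    attached to its stock exchange statement (P414).
    """
    # json.dumps adds the double quotes SPARQL needs around string literals.
    values = " ".join(json.dumps(t) for t in tickers)
    return (
        "SELECT ?item ?ticker WHERE { "
        f"VALUES ?ticker {{ {values} }} "
        "?item p:P414 ?listing . "
        "?listing pq:P249 ?ticker . "
        "}"
    )


def parse_bindings(response):
    """Convert a SPARQL JSON response into {ticker: [Wikidata ID, ...]}."""
    ids = {}
    for row in response["results"]["bindings"]:
        ticker = row["ticker"]["value"]
        # Entity URIs end with the ID, e.g. ".../entity/Q123" -> "Q123".
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        matches = ids.setdefault(ticker, [])
        if qid not in matches:
            matches.append(qid)
    return ids


def fetch_wikidata_ids(tickers):
    """Run the query against the public endpoint (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": build_ticker_query(tickers), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WIKIDATA_SPARQL_URL}?{params}",
        # Placeholder User-Agent; replace with one identifying your project.
        headers={"User-Agent": "snes-example/0.1 (illustrative sketch)"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return parse_bindings(json.load(resp))
```

Each ticker maps to a list rather than a single ID because the same symbol can be listed on more than one exchange, so any ambiguous matches would still need to be resolved, for example against the company names scraped from Wikipedia.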
You can see the code used to retrieve SNES here.