July 27, 2022

Stock-NewsEventsSentiment (SNES) 1.0: A time-series dataset for joint news and market data analysis of stocks

TL;DR: We’re releasing a time-series dataset for S&P 500 companies that joins market data, such as stock price and trade volume, with news events and sentiment distilled from the world’s financial media using NLP.

You can view and download SNES v1.0 from Kaggle.

Introduction

News headlines and markets interact in myriad ways, from anticipatory analyst commentary that shifts traders’ sentiment around a stock to after-the-fact coverage of notable events such as new product releases or factory expansions. It has long been known that news data is a rich source of information that could contribute significantly to better modelling and understanding of financial markets and their dynamics. Because both market events and news are time-based, time series analysis lends itself well to this kind of data.

Market data such as prices and trade volumes is a fairly accessible commodity these days. This is thanks to a mature ecosystem of data providers and data processors that has developed over the past 10–15 years in the financial services space.

News data, on the other hand, is harder to obtain and, more importantly, to harness. Many vendors who provide market data also provide news feeds and headlines. However, to distill high-quality time series data that reflects events and sentiment from news articles, practitioners need to develop and apply high-quality NLP models, which could take months or years given the level of domain expertise, technical skills and training data required.

Today we’re releasing Stock-NewsEventsSentiment (SNES) 1.0, a large dataset consisting of daily market and news time series data for S&P 500 companies over a period of 21 months (October 2020 to July 2022). In addition to news sentiment, SNES covers events extracted from the news in relation to each company, such as Adverse Events, New Products and Corporate Earnings.

We’re sharing the code and the methodology for compiling this dataset at the bottom of this article.

Downloading Stock-NewsEventsSentiment (SNES)

You can view and download the dataset from Kaggle.

Exploring the dataset

SNES consists of two files:

  1. sp500wiki.csv/sp500wiki.parquet: List of S&P 500 companies as of July 2022 and various metadata in tabular format.

  2. data.csv/data.parquet: The main dataset containing stock price, trade volume, news events and news sentiment for S&P 500 companies during the period October 2020 to July 2022.
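
As a quick way in, here is a minimal sketch of loading the main file with pandas and pulling out one company's time series. The column names `date` and `ticker`, the helper names and the sample rows are all assumptions made for illustration, not the dataset's actual schema, so check the real header before relying on them.

```python
import io

import pandas as pd

# Hypothetical column names: the real header of data.csv may differ,
# so inspect `df.columns` after loading and adjust these.
DATE_COL = "date"
TICKER_COL = "ticker"


def load_snes(source, date_col=DATE_COL):
    """Read the main SNES table from a path or file-like object."""
    df = pd.read_csv(source)
    if date_col in df.columns:
        # Parse dates so rows can be sorted and treated as a time series.
        df[date_col] = pd.to_datetime(df[date_col])
    return df


def series_for(df, ticker, date_col=DATE_COL, ticker_col=TICKER_COL):
    """Return one company's rows, ordered by date and indexed by it."""
    rows = df[df[ticker_col] == ticker]
    return rows.sort_values(date_col).set_index(date_col)


# Tiny made-up sample standing in for data.csv (tickers and values are fake).
sample = io.StringIO(
    "date,ticker,close,news_volume\n"
    "2020-10-02,AAA,101.5,4\n"
    "2020-10-01,AAA,100.0,3\n"
    "2020-10-01,BBB,50.0,7\n"
)
df = load_snes(sample)
aaa = series_for(df, "AAA")
```

For the real file you would pass a path instead, for example `load_snes("data.csv")`. `pd.read_parquet("data.parquet")` reads the Parquet version if `pyarrow` or `fastparquet` is installed.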

Below we’ve included a few visualisations to help you get a better understanding of the SNES dataset:

  1. A snapshot of 20 randomly selected companies and the following attributes: Stock Price, Trade Volume, News Volume, Negative News, Adverse Events, New Products and Corporate Earnings. Note the regularity in Corporate Earnings events (rightmost column), which matches intuition.

  2. Event types and volume by GICS industry sector

  3. Event types and volume by GICS sub-industry

  4. Event types and volume for 20 randomly selected technology companies

  5. Events over time for Microsoft (note the regularity of Corporate Earnings, which matches intuition)

  6. Sentiment over time for Microsoft

  7. Stock price and trade volume over time for Microsoft

  8. Market and news data over time for Microsoft

Data collection methodology

We scrape Wikipedia for the current list of S&P 500 companies and attributes such as GICS industry and sub-industry. We leverage Wikidata’s SPARQL API to retrieve the Wikidata ID of each company in the S&P 500. We use the yfinance Python package to retrieve stock prices and trade volumes. Finally, we use Aylien’s News API to retrieve the news time series data for each event category and sentiment value we’re interested in.
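
The last step, putting the market and news series for one company on a shared daily index, could look roughly like the sketch below. This is not the code used to build SNES: the function name, column names and sample values are hypothetical. Treating a day with no news row as zero events, and leaving prices empty on non-trading days, is an assumption made for the example.

```python
import pandas as pd


def combine_daily(market, news, count_cols):
    """Align daily market and news frames on one shared date index.

    market:     one row per trading day (e.g. close, volume)
    news:       one row per calendar day with coverage (e.g. counts, sentiment)
    count_cols: news columns holding counts, where a missing day means 0
    """
    combined = market.join(news, how="outer").sort_index()
    # Assumption: no news row for a day is read as zero events that day.
    combined[count_cols] = combined[count_cols].fillna(0).astype(int)
    # Prices and sentiment are left as NaN where no observation exists,
    # rather than being forward-filled.
    return combined


# Made-up inputs: a Friday and a Monday of trading, news on Friday and Saturday.
market = pd.DataFrame(
    {"close": [100.0, 101.5], "volume": [1000, 1200]},
    index=pd.to_datetime(["2020-10-02", "2020-10-05"]),
)
news = pd.DataFrame(
    {"news_volume": [3, 1], "sentiment": [0.2, -0.1]},
    index=pd.to_datetime(["2020-10-02", "2020-10-03"]),
)
daily = combine_daily(market, news, ["news_volume"])
```

Here `daily` has three rows: Saturday keeps its news count but has no price, and Monday keeps its price with a news count of zero. Whether to forward-fill prices over weekends or drop non-trading days instead depends on the analysis you have in mind.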

You can see the code used to retrieve SNES here.



