XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
PRELIMINARIES: Please turn on "word wrap" when reading this document.


*********** Research Project

Please cite the following paper when using this data:

Dynamic Interpretation of Emerging Risks in the Financial Sector, Kathleen Hanley and Gerard Hoberg 
Review of Financial Studies, forthcoming 2019.

Paper url: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2792943


*********** Data Description

The included file is a tab-delimited file containing the complete word lists associated with each semantic theme included in the static model.  The file has three columns.  You can read it directly into any programming or statistical environment, or you can review it in Excel.

1. Phrase: A semantic theme included in the model.  Each theme is either a single word or a bigram.  Each theme is repeated on many rows, one row for each of the highest-ranked individual terms that best characterize the bigram or individual word according to the word2vec semantic theme analysis performed by metaHeuristica.  Note that we requested 250 words per theme from metaHeuristica, but fewer appear here because we pruned the list to keep only single words (not bigrams) that contain only alphabetic characters.  The included terms are those with the highest cosine similarity to the given theme.  For example, standards has the highest cosine similarity to accounting relative to all other words (except accounting itself).

2. Word: A word that is among the most proximate to the semantic theme in the first column, one word per row.

3. Cosine: The cosine similarity between the semantic vector of the theme (e.g., accounting) and that of the proximate word (e.g., standards).  These two words have a cosine similarity of 0.488.
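
As a minimal sketch of reading the three columns described above, the following Python snippet parses tab-delimited rows into a dictionary keyed by theme.  It is an illustration only and is not part of the original code.  The function name load_theme_words, the has_header switch (this document does not say whether the file has a header row), and the file name theme_words.txt in the comment are all assumptions.  The in-memory sample contains only the one row quoted in this document.

```python
import csv
import io
from collections import defaultdict


def load_theme_words(handle, has_header=False):
    """Read (phrase, word, cosine) rows from a tab-delimited text stream.

    Returns a dict mapping each theme phrase to a list of (word, cosine)
    pairs sorted by descending cosine similarity.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # assumption: skip one header row if present
    themes = defaultdict(list)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        phrase, word, cosine = row
        themes[phrase].append((word, float(cosine)))
    for pairs in themes.values():
        pairs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(themes)


# In-memory stand-in for the real file, holding only the example row
# mentioned in this document (accounting / standards / 0.488).
sample = io.StringIO("accounting\tstandards\t0.488\n")
themes = load_theme_words(sample)
print(themes["accounting"][0])

# With the real file (hypothetical file name), one would instead write:
# with open("theme_words.txt", newline="") as fh:
#     themes = load_theme_words(fh)
```

Keying the result by theme makes it easy to pull the full word list for one theme at a time, which is the unit used in the computations described in Note 2.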


*********** Important Notes

*Note 1: These are raw word lists and are not screened for stop words.  This gives users the flexibility to handle stop words as they wish.
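
For users who do want to screen stop words, a minimal Python sketch follows.  It is only one possible approach.  The function name drop_stop_words and the three-word stop list are hypothetical placeholders, and the second sample row is invented solely to show a stop word being removed.

```python
def drop_stop_words(rows, stop_words):
    """Return the (phrase, word, cosine) rows whose word is not a stop word."""
    stop = {w.lower() for w in stop_words}
    return [row for row in rows if row[1].lower() not in stop]


# Hypothetical stop-word list; substitute whichever list suits your application.
STOP_WORDS = {"the", "and", "of"}

# The first row is the example from this document.  The second row is
# invented purely for illustration and is not taken from the data.
rows = [
    ("accounting", "standards", 0.488),
    ("accounting", "the", 0.100),
]
filtered = drop_stop_words(rows, STOP_WORDS)
print(filtered)
```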

*Note 2: We include the semantic word lists because they are the core database for both the static and the dynamic models.  From these lists, it is mostly straightforward to run the rest of the computations.  One would first need to compute the cosine similarity of each set of 250 words for a given theme to the risk factor section of each bank in each year.  This gives the bank-specific exposures to the given risk theme.  Once you have the bank exposures, you can compute the joint exposure for any pair of banks by multiplying the two banks' exposures together.  Finally, you can regress the pairwise covariance for each pair on the joint risk exposures plus the controls, which gives the risk exposure scores (marginal R-squared values) for each risk theme.  These scores are the basis for the main results in the paper.  This description is very spartan and oversimplified but is included to give a high-level perspective.  Please read the paper in full to get all of the details.
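
The first two steps above (bank-level exposure and pairwise joint exposure) can be sketched in Python as follows.  This is a rough illustration under stated assumptions and is not the procedure used in the paper.  It assumes the theme is represented as a unit-weight bag of words and the risk factor text as raw term counts, and the paper may weight or tokenize differently.  The names theme_exposure and joint_exposure, the second theme word, and both text snippets are made up for the example.  The regression step is not sketched here.

```python
import math
import re
from collections import Counter


def theme_exposure(theme_words, text):
    """Cosine similarity between a theme word list and a text.

    Assumption: the theme vector puts weight 1 on each theme word, and the
    text vector holds raw counts of lowercase alphabetic tokens.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    theme = {w.lower() for w in theme_words}
    dot = sum(counts[w] for w in theme)
    norm_theme = math.sqrt(len(theme))
    norm_text = math.sqrt(sum(c * c for c in counts.values()))
    if norm_theme == 0 or norm_text == 0:
        return 0.0
    return dot / (norm_theme * norm_text)


def joint_exposure(exposure_a, exposure_b):
    """Joint exposure of a bank pair: the product of the two exposures."""
    return exposure_a * exposure_b


# Invented inputs for illustration only (not taken from the data or filings).
theme_words = ["standards", "auditing"]
text_bank_a = "accounting standards standards"
text_bank_b = "new standards"

exposure_a = theme_exposure(theme_words, text_bank_a)
exposure_b = theme_exposure(theme_words, text_bank_b)
joint = joint_exposure(exposure_a, exposure_b)
print(exposure_a, exposure_b, joint)
```

In a full replication, the joint exposures for every bank pair and theme would then enter the covariance regression alongside the controls described in the paper.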

*Note 3: We hope this data is useful to you.  We also hope to add more data and info to this repository over time.  We are happy to provide this resource, but unfortunately given our intense work schedules, the authors of this research project are unable to provide any support for using the data or the code.  Thanks for your understanding and best wishes for your own research.