XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
SUMMARY

This is the complete database of firm-industry-year mappings indicating which of the 300 industries the given firm likely belongs to in the given year.
Many firms map to multiple industries, indicating high levels of scope.  Note that this is a more advanced data structure than many researchers will need,
and you might consider using the simpler firm-year panel database offered on this same website that provides the scope value for each firm in each year.

This database is more advanced because it has higher dimensionality: it shows the specific industries each firm maps to.  The variable Text_loading indicates
the degree of textual similarity between the firm's business description and the keywords in the given industry, as noted in the paper referenced below.  The higher this value, the more
relevant the industry is to the given firm.  Each industry also has a label, and the keywords most proximate to each industry are included in the labels file, which indicates the
specific types of products or services that firms serving the given industry likely provide.

This data is computed using doc2vec and K-means clustering along with a procedure to map each firm to each of the centroid clusters.  The result is that 
we can map each firm to its most proximate industries and also count how many industries each firm likely belongs to.  These mappings are contained in this
file, and the industry labels are contained in the second file.


****** NOTE:  Please read the technical descriptions below and the data and methods of the primary reference paper below before using the data.  



**************************************************************************************************************
**************************************************************************************************************
********************************************** Citations *****************************************************
********************************************** Citations *****************************************************
********************************************** Citations *****************************************************
**************************************************************************************************************
**************************************************************************************************************

This data is the result of a research project by Gerard Hoberg and Gordon Phillips.
The intent of the project is to measure scope across firms and over time, assess how firms are changing their
scope profiles in recent years, and understand the links between scope and corporate finance policies.
The paper also illustrates how increases in scope likely offset decreases in HHIs over the past 25 years, and
indicates that competition might not be declining as rapidly as some studies suggest.
This article should be cited when using this data for the purpose of academic research.

Primary reference:
Scope, Scale and Concentration
Gerard Hoberg and Gordon Phillips, Journal of Finance forthcoming (accepted 2023).

***********
Auxiliary reference 1: This article is a precursor to the above scope paper.  In particular, it develops the original horizontal TNIC industries.

Text-Based Network Industries and Endogenous Product Differentiation
Gerard Hoberg and Gordon Phillips, Journal of Political Economy (October 2016), 124 (5) 1423-1465.

***********
Auxiliary reference 2: This article is also a precursor.  It uses horizontal relatedness data (TNIC) to study mergers and acquisitions.
  This article is especially relevant because the scope paper above also studies M&A.  The two papers are complementary, as their theories
  and results are distinct.  See the papers for details.

Product Market Synergies and Competition in Mergers and Acquisitions: A Text-Based Analysis
Gerard Hoberg and Gordon Phillips, Review of Financial Studies (October 2010) 23 (10), 3773-3811.



**********************************************************************************************************************
**********************************************************************************************************************
********************************************** Technical Details *****************************************************
********************************************** Technical Details *****************************************************
********************************************** Technical Details *****************************************************
**********************************************************************************************************************
**********************************************************************************************************************

****** Please read the data and methods section of the above primary reference for complete details regarding how the variables are constructed.  ********


Technical Note 1) The database is a firm-industry-year panel and thus contains the fields: year, gvkey, Industry, and Text_loading.  The Text_loading variable indicates the degree of 
textual similarity between the given firm (gvkey) and the given industry (Industry) in the given year. Note that the industry is simply a code from one to 300.  This code 
is generated by our system and the number itself has no specific meaning.  Rather, it is a class variable that identifies a specific industry. To see which industry a given
industry code refers to, please see the industry labels file included in this same data repository.  That file lists the most spatially proximate keywords for each industry.
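
As an illustration of the panel layout described in Technical Note 1, the following is a minimal Python sketch that groups observations by firm-year, ranks each
firm's industries by Text_loading, and counts how many industries each firm-year maps to.  The rows are made-up toy values, not actual data, and the count computed
here is only an illustration; it is not necessarily identical to the scope value in the simpler firm-year panel.

```python
from collections import defaultdict

# Toy rows in the documented layout: year, gvkey, Industry, Text_loading.
# All values are invented for illustration and are not taken from the data.
rows = [
    {"year": 1997, "gvkey": 1001, "Industry": 12, "Text_loading": 0.41},
    {"year": 1997, "gvkey": 1001, "Industry": 87, "Text_loading": 0.63},
    {"year": 1997, "gvkey": 1001, "Industry": 250, "Text_loading": 0.22},
    {"year": 1997, "gvkey": 2002, "Industry": 87, "Text_loading": 0.55},
    {"year": 1998, "gvkey": 1001, "Industry": 87, "Text_loading": 0.60},
]

# Collect each firm-year's (Industry, Text_loading) pairs.
by_firm_year = defaultdict(list)
for r in rows:
    by_firm_year[(r["gvkey"], r["year"])].append((r["Industry"], r["Text_loading"]))

# Rank industries within each firm-year from highest to lowest Text_loading.
ranked = {
    key: sorted(pairs, key=lambda p: p[1], reverse=True)
    for key, pairs in by_firm_year.items()
}

# Number of industries mapped to each firm-year, and the top-ranked industry code.
n_industries = {key: len(pairs) for key, pairs in ranked.items()}
top_industry = {key: pairs[0][0] for key, pairs in ranked.items()}

for key in sorted(ranked):
    print(key, n_industries[key], top_industry[key])
```

The industry codes in the output are class labels only; their meaning has to be looked up in the industry labels file.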

Technical Note 2) Each file contains gvkey, year, and industry fields in addition to the scope variable.  It is important to note that we already did
the merge to COMPUSTAT, so you do not have to repeat this.  The data contained here is not lagged.  Consider a COMPUSTAT firm with a fiscal year ending
on Sept 30th, 1997, for example (i.e., the Compustat variable datadate is 19970930).  The corresponding observations for this firm in this database would have the
year set to 1997. These observations would be based on the product description of the 10-K report that was associated with this 9/30/1997 fiscal year end.  More generally,
the year field in this database is always set to be the first four digits of the datadate variable (the year part), so the database uses the calendar year convention for
convenience.  Because this data is merged by fiscal year end, the firm-industry mappings in this file should conveniently be viewed as being time-synchronous based on the year
identified as the first four digits of the datadate Compustat variable.
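
The year convention above can be sketched in Python as follows.  This is a minimal illustration only: the helper name year_from_datadate, the toy records, and the
"sale" field are assumptions made for the example, and the sketch simply attaches a firm-level Compustat-style variable to each observation by matching on gvkey and year.

```python
def year_from_datadate(datadate):
    """Return the year part (first four digits) of a YYYYMMDD datadate."""
    text = str(datadate)
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {datadate!r}")
    return int(text[:4])

# Hypothetical Compustat-style records (values invented for illustration).
compustat_rows = [
    {"gvkey": 1001, "datadate": 19970930, "sale": 120.5},
    {"gvkey": 1001, "datadate": 19980930, "sale": 133.0},
]

# Toy observations in this database's layout.
scope_rows = [
    {"year": 1997, "gvkey": 1001, "Industry": 87, "Text_loading": 0.63},
    {"year": 1997, "gvkey": 1001, "Industry": 12, "Text_loading": 0.41},
    {"year": 1998, "gvkey": 1001, "Industry": 87, "Text_loading": 0.60},
]

# Index the Compustat-style rows by (gvkey, year of datadate).
# Caveat: if a firm had two datadates in one calendar year, this dict would
# silently keep the last one; this sketch does not address that case.
compustat_by_key = {
    (r["gvkey"], year_from_datadate(r["datadate"])): r for r in compustat_rows
}

# Attach the matching firm-year variables to each firm-industry-year row.
merged = []
for s in scope_rows:
    match = compustat_by_key.get((s["gvkey"], s["year"]))
    if match is not None:
        merged.append({**s, "datadate": match["datadate"], "sale": match["sale"]})

for m in merged:
    print(m)
```

No lag is applied in this sketch, consistent with the note that the data is not lagged; any lagging needed for a particular research design is left to the researcher.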

Technical Note 3) The scope mappings are based on a 2% granularity following Hoberg and Phillips 2016.  This results in a rather fine granularity, with roughly 7.4
industries being mapped to the average firm.  This is a richer granularity than the standard Compustat database, but it also reflects the fact that firms are operating in
many markets.  If the researcher desires a coarser granularity, they can achieve this on their own simply by applying a minimum value for the Text_loading variable and
deleting any observations for which this variable is below that value. The higher the threshold value, the smaller the database will become and hence the coarser the
granularity will be.  Researchers have the flexibility to design this as is helpful for their work.
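
The thresholding described in Technical Note 3 can be sketched in Python as follows.  The function names, the toy rows, and the threshold of 0.40 are
assumptions chosen for illustration; an appropriate threshold depends on the research design and is not prescribed here.

```python
from collections import Counter

def apply_min_loading(rows, min_loading):
    """Keep only observations whose Text_loading is at least min_loading."""
    return [r for r in rows if r["Text_loading"] >= min_loading]

def mean_industries_per_firm_year(rows):
    """Average number of industries per (gvkey, year) among the rows given."""
    counts = Counter((r["gvkey"], r["year"]) for r in rows)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)

# Toy rows (values invented for illustration, not taken from the data).
rows = [
    {"year": 1997, "gvkey": 1001, "Industry": 87, "Text_loading": 0.63},
    {"year": 1997, "gvkey": 1001, "Industry": 12, "Text_loading": 0.41},
    {"year": 1997, "gvkey": 1001, "Industry": 250, "Text_loading": 0.22},
    {"year": 1997, "gvkey": 2002, "Industry": 87, "Text_loading": 0.55},
    {"year": 1997, "gvkey": 2002, "Industry": 140, "Text_loading": 0.30},
]

# Hypothetical threshold; higher values give a smaller, coarser database.
coarse = apply_min_loading(rows, 0.40)

print(len(rows), mean_industries_per_firm_year(rows))
print(len(coarse), mean_industries_per_firm_year(coarse))
```

Note that a firm-year whose loadings all fall below the threshold drops out of the filtered data entirely, so averages computed afterward are taken over the remaining firm-years only.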