This file describes the content of the financial constraints database that accompanies the following paper (please reference it when using this data): "Redefining Financial Constraints: A Text-Based Analysis", by Gerard Hoberg and Vojislav Maksimovic, 2015, Review of Financial Studies 28 (5), 1312-1352.

****** Please read the data section of the paper for complete details regarding how the variables are computed. ******

****** This file is best read in a text editor with word wrap turned on. ******

We note some basic details about the data here (see the paper for full detail). First, the data is indexed by Compustat gvkey and by year, so it can be merged into existing databases easily. The year refers to the calendar year in which the given firm's fiscal year ends. The data is not lagged; if lagged data is desired, the researcher must construct the lags on their own (a brief illustrative sketch appears at the end of this file). Also note that this data has been extended since the paper was written and now covers the years 1997 to 2015.

In order to be included in this database, a firm must have a machine readable Capitalization and Liquidity Subsection (henceforth "CAPLIQ") of the MD&A section of the 10-K. This section of the 10-K was parsed using the metaHeuristica software program, and we thank Christopher Ball of metaHeuristica for providing us with access. In the paper, we found that firms that do not have such a machine readable CAPLIQ are UNlikely to be constrained. Hence if the goal of the researcher is to identify a set of potentially constrained firms, then looking only among the firms that have this section is appropriate. The database is tailored to be useful in this way, and firms that are not in this database are likely less interesting for constraints research.

We also note that we drop Financials (SIC codes 6000 to 6999) from this database, as Financial firms typically have different information in their CAPLIQ Subsections and we do not feel the methods in the paper would be appropriate for coding Financials.

*** There are four constraint variables included in the database (please read the data section of the above paper for more details) ***

1) delaycon: firms with higher values are more similar to a set of firms known to be at risk of delaying their investments due to issues with liquidity.

2) equitydelaycon: firms with higher values are more similar to a set of firms that (A) are at risk of delaying their investments due to liquidity issues and (B) indicate plans to issue equity (presumably to address their liquidity challenges).

3) debtdelaycon: analogous to the above, but the firm indicates plans to issue debt (presumably to solve their liquidity problems).

4) privdelaycon: analogous to the above, but the firm indicates plans to issue private placements (presumably to solve their liquidity problems).

** Important Technical note: The above constraint variables are based on cosine similarities, but are also purged of boilerplate content in a second stage. As a result, they have mean values that are close to zero, and the interpretation is only that a higher value indicates a higher level of likely constraints. Given this scaling, it is NOT appropriate to set a constraint variable to zero for observations in a researcher's sample where that variable is missing. Doing so would presume that the given firm has an average level of constraints, which is likely not valid per the discussion above.
Instead, if a researcher wishes to include the constraint variables in a regression where some observations are missing (and wants to retain those observations by controlling for the missing values rather than dropping them), we recommend (A) including a dummy in the regression that equals one for observations where the given variable is missing, and (B) then setting the constraint variable to zero for those missing observations. This procedure fits the intercept for the missing-value observations from the data, and does not presume any particular value for those firms, which we feel is appropriate (see the illustrative sketch at the end of this file).

*** Note on data updating: This data was extended through 2015, but we are unable to update it further, as the project was run on an earlier version of the metaHeuristica platform that used a different coding algorithm.
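
As a usage illustration only, here is a minimal sketch in Python (pandas) of merging the data into an existing firm-year panel and constructing a lag. The file names "hm_constraints.csv" and "my_panel.csv" and the exact file layout are assumptions for illustration, not part of the official distribution; adjust them to your own setup.

    import pandas as pd

    # Load the constraints data (columns assumed: gvkey, year, delaycon,
    # equitydelaycon, debtdelaycon, privdelaycon) and the researcher's own panel.
    con = pd.read_csv("hm_constraints.csv")
    panel = pd.read_csv("my_panel.csv")

    # Merge on gvkey and the calendar year in which the fiscal year ends.
    # The constraint data is NOT lagged, so this is a contemporaneous merge.
    merged = panel.merge(con, on=["gvkey", "year"], how="left")

    # If a one-year lag is desired, construct it explicitly, e.g.:
    # (this simple shift assumes consecutive firm-years; handle gaps as needed)
    con = con.sort_values(["gvkey", "year"])
    con["delaycon_lag1"] = con.groupby("gvkey")["delaycon"].shift(1)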
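
Continuing from the sketch above, here is a minimal sketch of the recommended missing-value procedure (dummy plus zero-fill), using statsmodels. The outcome variable "y" and control "x" are hypothetical names standing in for the researcher's own variables.

    import statsmodels.formula.api as smf

    df = merged.copy()

    # (A) Dummy equal to one where the constraint measure is missing.
    df["delaycon_missing"] = df["delaycon"].isna().astype(int)

    # (B) Only after adding the dummy is it acceptable to set the missing
    #     constraint values to zero; the dummy lets the regression fit a
    #     separate intercept for the missing-value observations.
    df["delaycon_filled"] = df["delaycon"].fillna(0.0)

    # Illustrative regression with hypothetical outcome y and control x.
    model = smf.ols("y ~ delaycon_filled + delaycon_missing + x", data=df).fit()
    print(model.summary())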