The Glimpse (Predicting)

Predicting Successful Organizations

How to use a Machine Learning Model to predict success

From data collection, variable definition, data integration and data transformation to evaluation metrics.

Data sources and variable definitions

The samples of this research come from two different sources. The first set, the inttrend sample, has been collected from this company's portal. It includes the samples from the previous study and incorporates data collected from the website in December 2020. Those samples include funding rounds from different categories, in addition to Insurtech organizations.

The second set, the Insurtech sample, was structured and collected by the Insurance Sector staff in 2019. It is the same sample used in last year’s report (Insurtech Global Outlook 2020).

A large fraction of the inttrend sample will be used to train the model, which will be tested with the remaining small fraction of inttrend samples.

Finally, the model will also be tested with the Insurtech sample. Samples that have not yet been labelled as successful, but that the model predicts to be successful, will be examined in detail, as they might evolve into successful startups in the near future.
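The evaluation design above can be sketched as follows. This is a minimal illustration, not the study's code: the 80/20 split, the fixed seed and the field names are assumptions, since the text only speaks of a "large fraction" and a "small fraction".

```python
import random

def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test).

    train_fraction=0.8 is an assumed value; the study does not state it.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-ins for the two sources (one dict per funding round).
inttrend_rounds = [{"round_id": i, "source": "inttrend"} for i in range(100)]
insurtech_rounds = [{"round_id": i, "source": "insurtech"} for i in range(100, 120)]

# Train on most of the inttrend sample, test on the remainder.
train, holdout = split_samples(inttrend_rounds)

# The Insurtech sample is never used for training; it is a second test set.
external_test = insurtech_rounds
```

Because each sample is a funding round, a purely random split like this one can place rounds of the same company on both sides. Whether the study split by round or by company is not stated.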

In this study, each funding round has been considered as a sample. Each sample has different features regarding the round, the invested organization, its founder, and the investor. This section explains in detail how the final data sample was obtained, filtered, built, and processed.

  • The round: the first dataset contains information about the funding rounds.
  • The invested organization: the second dataset contains information on the organizations being studied, with one entry per organization.
  • Founders: the third dataset contains information on the founders of the organization receiving the investment in the funding round.
  • The investors: the fourth and final dataset contains information regarding the investors, which can be either individuals or organizations.

Data Integration

To create a complete and unified dataset like the one depicted in Figure 1, combining the information from the datasets mentioned above, the following process has been followed:

Figure 1. Table exemplifying the different parts and sources of the final sample. In green, data extracted from the website inttrend.com; in blue, data provided by the Insurtech staff; and in orange and yellow, the final composition of each sample and its respective target.

  1. The Organizations dataset and the Acquisitions (in the case of inttrend data) or Target (in the case of Insurtech data) dataset have been merged to obtain a dataset with one sample per company, including the target value. If the startup appears in the Target dataset, it has been considered successful (1); if not, unsuccessful (0).
  2. The resulting dataset has then been merged with the Funding Rounds dataset. In the Funding Rounds dataset there can be more than one entry for each organization, so for each sample the information of the organization and the target have been added to the same row.
  3. Once one sample per funding round has been included in the dataset, the lead investor of the funding round has been used to merge that dataset with the Investors dataset. If the lead investor field in the main dataset was null, the first regular investor has been used instead.
  4. The dataset has been merged with the Schools and Founders datasets (two separate datasets that had already been combined) to obtain information on the founders of each startup. Only approximately 30% of the samples included information on the founders, so in the remaining 70% those fields were left blank. Two different setups were later implemented: one with the whole dataset, omitting the founders’ information, and another with 30% of the dataset, including the founders’ information.
  5. The previous steps have been performed independently for both the inttrend and the Insurtech samples. In this final step, both datasets were concatenated, resulting in the final sample. The distribution of the final sample following that process (source, successful, and whether it contains founders’ information) is depicted in Figures 2 and 3.
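Steps 1 to 3 above can be sketched in plain Python. This is an illustrative outline under assumptions, not the pipeline used in the study: all field names (`org_id`, `lead_investor_id`, `investor_ids`, and so on) are hypothetical, and the founders merge of step 4 is omitted for brevity.

```python
def build_samples(organizations, targets, rounds, investors):
    """Join the tables into one row per funding round (steps 1 to 3)."""
    target_ids = {t["org_id"] for t in targets}
    orgs = {o["org_id"]: o for o in organizations}
    investors_by_id = {i["investor_id"]: i for i in investors}

    samples = []
    for r in rounds:
        org = orgs.get(r["org_id"])
        if org is None:
            continue  # round without a matching organization

        # Step 3: use the lead investor, else the first regular investor.
        investor_id = r.get("lead_investor_id")
        if investor_id is None and r.get("investor_ids"):
            investor_id = r["investor_ids"][0]
        investor = investors_by_id.get(investor_id, {})

        samples.append({
            "round_id": r["round_id"],
            "org_id": org["org_id"],
            "org_country": org.get("country"),
            "investor_type": investor.get("type"),
            # Step 1: organizations found in the target table are labelled 1.
            "successful": 1 if org["org_id"] in target_ids else 0,
        })
    return samples

# Hypothetical toy data.
organizations = [{"org_id": "a", "country": "ES"}, {"org_id": "b", "country": "US"}]
targets = [{"org_id": "a"}]
rounds = [
    {"round_id": 1, "org_id": "a", "lead_investor_id": "x", "investor_ids": ["x", "y"]},
    {"round_id": 2, "org_id": "a", "lead_investor_id": None, "investor_ids": ["y"]},
    {"round_id": 3, "org_id": "b", "lead_investor_id": None, "investor_ids": []},
]
investors = [
    {"investor_id": "x", "type": "venture_capital"},
    {"investor_id": "y", "type": "accelerator"},
]
samples = build_samples(organizations, targets, rounds, investors)
```

Note that the target is a property of the organization, so every round of a successful company carries the label 1.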

Figure 3. Distribution of the final sample for three different flags: source, successful, and whether founders’ information is included. From top to bottom, the inttrend sample and the Insurtech sample are depicted.

Data Transformation

Data transformation is one of the main tasks when using a machine-learning model. First, the data is selected and cleaned: samples with empty columns whose values cannot be replaced are deleted, as are samples with inconsistent values (outliers). Then, in order to extract more value from the original features, two different strategies are applied to the data: feature transformation (making changes to the original data) and feature engineering (creating new variables from current features).

A) Round Type

We have four categories for the funding round according to the Fitalent team. We considered the rounds that were in the Exit Stage to be successful. See the categories below:

B) Dates

The two dates considered as final features in this study (the organization’s founding date and the founder’s degree completion date) have each been reduced to a single integer containing the year.
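A minimal sketch of this reduction is shown below. It assumes the raw values are either `date` objects or ISO-formatted strings such as "2016-03-01"; the actual input format is not described in the study.

```python
from datetime import date

def to_year(value):
    """Return the year of a date-like value as an int, or None if missing."""
    if value is None or value == "":
        return None
    if isinstance(value, date):
        return value.year
    # Assumption: strings are ISO formatted (YYYY-MM-DD).
    return date.fromisoformat(value).year

founded_year = to_year("2016-03-01")
degree_year = to_year(date(2009, 6, 30))
```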

C) Countries

Three of the final features contain country information: the country where the organization was founded, the founder’s country of origin, and the country of origin of the investor or investor company.

D) Investor types

The investor types field includes a combination of 21 different investor types for each sample. Upon investigation, we have seen that 12 categories come up frequently (such as venture_capital, accelerator, micro_vc, private_equity firm), while others (such as syndicate, university_program, pension_funds) have only a residual frequency in the dataset.
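One common way to handle a multi-valued field like this is a multi-hot encoding that keeps one column per frequent type and pools the rare ones. The sketch below is an assumption about how such a field could be encoded; the study does not say how the residual types were treated, and the `min_count` cutoff is hypothetical.

```python
from collections import Counter

def encode_investor_types(rows, min_count=2):
    """Multi-hot encode investor types, pooling rare types into 'other'."""
    # Count in how many samples each type appears.
    counts = Counter(t for types in rows for t in set(types))
    frequent = sorted(t for t, c in counts.items() if c >= min_count)
    columns = frequent + ["other"]

    encoded = []
    for types in rows:
        present = set(types)
        vector = [1 if t in present else 0 for t in frequent]
        # The last column flags the presence of any rare type.
        vector.append(1 if any(t not in frequent for t in present) else 0)
        encoded.append(vector)
    return columns, encoded

# Hypothetical toy rows: the investor types attached to four samples.
rows = [
    ["venture_capital", "accelerator"],
    ["venture_capital"],
    ["micro_vc", "syndicate"],
    ["micro_vc", "accelerator"],
]
columns, encoded = encode_investor_types(rows)
```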

E) Categories

The information regarding the category of an organization has been extracted from the website inttrend.com, which includes a list of different categories or subcategories for each organization. By analyzing the category groups that fell into each of the 4 clusters, we could label the clusters as BFSI, Health, Data and Software.

Prediction Method

This section explains the method used to predict successful startups, the different setups that have been tested to obtain the best method, the metrics used for its evaluation, the experiment results and the insights on the Insurtech sample.

Prediction Models

In last year’s study, three different machine-learning models were tested to predict the success of startups: Logistic Regression, Random Forest and Support Vector Machines, with Random Forest obtaining the best results. Taking this into account, in this study we have used the XGBoost [4] model, which, like Random Forest, is a tree-based algorithm.

XGBoost (eXtreme Gradient Boosting) is an open-source software library that provides an implementation of the gradient boosted decision tree algorithm for different programming languages. Designed for speed and performance, it has recently dominated applied machine learning for structured or tabular data. Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.
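The boosting idea described above can be illustrated with a toy example. The sketch below is not XGBoost and not the model used in the study: it is a bare-bones gradient boosting loop with one-split trees (stumps) on a single numeric feature and squared error, written only to show how each new model is fitted to the errors of the current ensemble. XGBoost adds regularization and many optimizations that are not shown here.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that best fits the residuals.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - (left_mean if x <= threshold else right_mean)) ** 2
                    for x, r in zip(xs, residuals))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, learning_rate=0.5, max_rounds=50, tolerance=1e-6):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    loss = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    for _ in range(max_rounds):
        # The residuals are the errors made by the existing ensemble.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left, right = fit_stump(xs, residuals)
        predictions = [p + learning_rate * (left if x <= threshold else right)
                       for x, p in zip(xs, predictions)]
        new_loss = sum((y - p) ** 2 for y, p in zip(ys, predictions))
        stumps.append((threshold, left, right))
        improvement = loss - new_loss
        loss = new_loss
        if improvement < tolerance:
            break  # the latest model brought no meaningful improvement
    return base, stumps, loss

# Toy data: the label switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
base, stumps, final_loss = boost(xs, ys)
```

The learning rate shrinks each stump's contribution, which is why several rounds are needed even on this trivially separable data.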

In this year’s study, two setups have been trained on the dataset built above; they differ in the data used for each. The first setup (Setup A) includes the whole dataset but no information regarding the founders, while the second setup (Setup B) includes only a portion of the dataset but adds the founders’ information.
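The difference between the two setups comes down to a column filter versus a row filter. A minimal sketch, assuming hypothetical founder field names and `None` for missing founder values:

```python
# Hypothetical names for the founder-related features.
FOUNDER_FIELDS = ("founder_country", "founder_degree_year")

def setup_a(samples):
    """All rounds, with the founder fields dropped."""
    return [{k: v for k, v in s.items() if k not in FOUNDER_FIELDS}
            for s in samples]

def setup_b(samples):
    """Only the rounds whose founder fields are all present."""
    return [s for s in samples
            if all(s.get(f) is not None for f in FOUNDER_FIELDS)]

# Hypothetical toy samples.
all_samples = [
    {"round_id": 1, "founder_country": "ES", "founder_degree_year": 2009},
    {"round_id": 2, "founder_country": None, "founder_degree_year": None},
    {"round_id": 3, "founder_country": "US", "founder_degree_year": None},
]
samples_a = setup_a(all_samples)
samples_b = setup_b(all_samples)
```

Setup A trades features for coverage, while Setup B trades coverage for features.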

The evaluation metrics applied to the classifiers were the True Positive Rate (TPR), False Positive Rate (FPR), False Negative Rate (FNR) and Precision. These rates are commonly used for binary classification problems, and they also allow a fair evaluation of problems with class imbalance. In a binary classification problem, each predicted sample falls into one of four categories: True Negative, False Positive, False Negative or True Positive. The following table explains the difference between them and describes what each means in this particular study.

Using the previously trained models from Setup A and Setup B, the Insurtech dataset has been classified with the results shown below. There are a total of 1708 samples from different companies and their financing rounds. The confusion matrix explained above, with the TN, FP, FN and TP counts for the Insurtech sample, is the following:

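                         Classified as unsuccessful    Classified as successful
(So far) unsuccessful    1,485 (TN)                    36 (FP)
Successful               111 (FN)                      76 (TP)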

The table above can be interpreted as follows:

  • 1,485 funding rounds of (so far) unsuccessful companies have been classified as unsuccessful.
  • 76 funding rounds of successful companies have been correctly classified as successful.
  • 111 funding rounds of successful companies have been incorrectly classified as unsuccessful.
  • 36 funding rounds of (so far) unsuccessful companies have been classified as successful.
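The four evaluation rates follow directly from these counts. The short sketch below applies the standard definitions to the figures listed above; the function name is illustrative.

```python
def rates(tp, fp, fn, tn):
    """Standard binary classification rates from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),        # successful rounds correctly found
        "FNR": fn / (tp + fn),        # successful rounds missed
        "FPR": fp / (fp + tn),        # unsuccessful rounds flagged as successful
        "precision": tp / (tp + fp),  # flagged rounds that are truly successful
    }

# Counts for the Insurtech sample, taken from the list above.
insurtech = rates(tp=76, fp=36, fn=111, tn=1485)
total = 76 + 36 + 111 + 1485
```

With these counts, the TPR is roughly 0.41, the FNR 0.59, the FPR 0.024 and the Precision 0.68, and the four cells add up to the 1708 samples of the Insurtech dataset.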

The last point is particularly interesting: those 36 funding rounds, corresponding to 29 different companies, belong to startups that are currently unsuccessful. The model, which has evaluated thousands of funding rounds at different stages, has nevertheless classified them as successful, which suggests that they may well become successful in the future.

Figure 9 is particularly descriptive. All 1708 funding round samples are sorted by the success probability output by the model, with higher-probability rounds at the top and lower-probability rounds at the bottom. The red line indicates the probability threshold: all samples above that line have been classified as successful and all samples below it as unsuccessful.

The colour of the points determines the label of the sample: in blue, funding rounds of companies that at this moment are not successful, while in orange, rounds of already successful companies. It is compelling to note how the majority of blue samples have a low success probability and the samples with a high probability are mostly orange.

The 36 blue points positioned above the red line are those that correspond to the 29 different companies that will form the shortlist of startups that are likely to become successful in the future.
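The shortlist logic described above (rank by probability, apply the threshold, keep the rounds not yet labelled as successful, then collapse them to distinct companies) can be sketched as follows. The threshold of 0.5, the field names and the toy scores are assumptions; the study does not state the threshold value.

```python
def shortlist(rounds, threshold=0.5):
    """Return (flagged rounds, distinct companies) above the threshold
    whose current label is still unsuccessful."""
    ranked = sorted(rounds, key=lambda r: r["probability"], reverse=True)
    flagged = [r for r in ranked
               if r["probability"] >= threshold and r["successful"] == 0]
    companies = []
    for r in flagged:
        # Several rounds may belong to the same company; keep each once.
        if r["org_id"] not in companies:
            companies.append(r["org_id"])
    return flagged, companies

# Hypothetical scored rounds.
scored = [
    {"round_id": 1, "org_id": "a", "probability": 0.91, "successful": 1},
    {"round_id": 2, "org_id": "b", "probability": 0.74, "successful": 0},
    {"round_id": 3, "org_id": "b", "probability": 0.62, "successful": 0},
    {"round_id": 4, "org_id": "c", "probability": 0.30, "successful": 0},
    {"round_id": 5, "org_id": "d", "probability": 0.12, "successful": 1},
]
flagged, companies = shortlist(scored)
```

Deduplicating by company is what turns a number of flagged rounds into a smaller number of shortlisted startups, as with the 36 rounds and 29 companies in the study.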

Insurtech Sample

It is also interesting to see which stage the classified funding rounds are at. Figure 10 compares the number of rounds of the successful startups (orange) with the rounds of currently unsuccessful startups that have been classified as successful (blue) across the four round types. Apart from the difference in scale, the distribution of the samples is quite similar for the two groups, leaving aside the Exit Stage, which by definition determines the success of a startup. The majority of the samples are in the Early Stage (which includes the grant, seed, series_a and undisclosed stages).

Figure 10. Distribution of the successful rounds (orange) and the unsuccessful samples classified as successful (blue) grouped by round type.