Text mining

The European Commission funds thousands of research projects each year in collaboration with universities and private companies. It needed a classification system that examines each contract and detects the type of funding.

There are various types of funding. The duration and the amount of money vary depending on the type. Each funding is governed by a contract with the European Commission. We needed to classify 90,000 documents written in 10 languages into 12 categories. Each year, 8,000 new contracts were added to the pool. A team of 4 specialized lawyers had already spent 3 years classifying 13,000 contracts. At roughly 4,300 contracts per year, the team could not keep pace with the 8,000 new contracts arriving annually, so the task had to be automated.

8 weeks

It took 2 days to build the 120 models needed. The rest of the time was spent automating the processes and checking the conflicting contracts with the lawyers. The whole project was completed and automated in 8 weeks. We didn’t use any dictionary to reduce the number of columns in the dataset. We simply fed the raw dataset into TIMi.

96% accuracy

The final accuracy of the prediction system was 96%. Whenever the classification produced by the predictive models differed from the one assigned by the experts, we double-checked the contract. Sometimes a contract had to be reviewed by 8 experts before a final answer was reached. In the end, we found that around 30% of the contracts had been incorrectly classified by the humans.
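The double-checking step described above can be illustrated with a minimal sketch. It is not the actual TIMi workflow: the contract IDs, category names, and the helper functions `find_conflicts` and `agreement_rate` are hypothetical and exist only to show how disagreements between expert labels and model predictions might be collected for review.

```python
# Illustrative sketch only: IDs, labels, and function names are assumptions,
# not part of the original project or of TIMi.

def find_conflicts(expert_labels, model_labels):
    """Return the sorted IDs of contracts labeled differently by experts and model."""
    return sorted(
        cid
        for cid, label in expert_labels.items()
        if cid in model_labels and model_labels[cid] != label
    )

def agreement_rate(expert_labels, model_labels):
    """Return the fraction of shared contracts on which experts and model agree."""
    shared = [cid for cid in expert_labels if cid in model_labels]
    if not shared:
        return 0.0
    agree = sum(1 for cid in shared if expert_labels[cid] == model_labels[cid])
    return agree / len(shared)

# Hypothetical sample: four contracts, one disagreement.
expert = {"C001": "grant", "C002": "fellowship", "C003": "grant", "C004": "procurement"}
model = {"C001": "grant", "C002": "grant", "C003": "grant", "C004": "procurement"}

conflicts = find_conflicts(expert, model)
rate = agreement_rate(expert, model)
print(conflicts)  # ['C002']
print(rate)       # 0.75
```

Contracts in the conflict list would go back to the experts for review. As the case study notes, a disagreement does not imply that the model is wrong: the review may just as well overturn the original human label.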