Tech Corner May 2, 2024
To fully solve the problem of unstructured data processing for financial services firms, platforms have to move beyond template-based and OCR-only solutions. AI and machine learning allow us to build automated workflows that can understand the content within a document, even if the format varies. Machine learning is based on statistical models that are trained to recognize patterns and identify relationships from data, allowing them to make predictions or decisions without explicit programming. These models serve as the backbone of machine learning-based applications like Alkymi, enabling systems to automate tasks, make decisions and extract meaningful information from complex datasets.
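To make this concrete, here is a minimal sketch of a statistical model learning to distinguish document types from labeled examples rather than hand-written rules. The sample texts, labels, and library choice (scikit-learn) are illustrative assumptions only, not a description of Alkymi's production models.

```python
# A toy text classifier: patterns are learned from labeled examples, not
# encoded as explicit rules or templates. Training texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Capital call notice: please remit your contribution of $250,000 by June 1.",
    "This notice confirms a capital contribution due to the fund.",
    "Brokerage account statement for the period ending March 31.",
    "Your account statement: holdings, trades, and cash balances for the quarter.",
]
train_labels = ["capital_notice", "capital_notice",
                "brokerage_statement", "brokerage_statement"]

# Fit the model; features and decision boundary are learned from the data.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)

# The trained model can then label a document it has never seen before.
print(model.predict(["Notice of capital call for Fund IV, amount due $1,200,000"]))
```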
In financial services, this challenge is compounded by a lack of standardization. The documents our customers frequently process, like Capital Notices or Brokerage Account Statements, may come in thousands of varying formats. These formats may differ in layout, structure, and style, with different arrangements of text, tables, or other elements. Each fund or broker may have their own format—their name may appear in the document header in one instance, but in the introductory paragraph in another. The same sender might even present their data differently every time they send a document.
Alkymi builds machine learning models dedicated to solving our customers’ unstructured document data challenges. How do we build these models, and how do they understand document formats they’ve never seen before? We break down our process into 5 steps below.
Before we build and train a model, we need to understand our customers’ document and workflow challenges. What do their documents look like? What data do they need to extract? How should the data be represented and organized? How does the structured data move downstream? Is there a relationship between different data elements? Our customers range from investment managers and asset management firms to insurance companies and beyond. Each customer's workflow and data requirements are unique and specific to their business.
Our data scientists leverage an extensive array of machine learning models and tools that can be tailored to each unique use case. Based on their detailed analysis of the documents and workflow, and in collaboration with each customer, our team decides on the best approach for the scenario. Options include layout transformer-based ML models, OpenAI’s proprietary large language models, in-house algorithms that locate elements on a page using regular expressions, anchoring, and LLM-based tools, or whichever combination of these tools and models is best suited to the workflow.
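As one illustration of the anchoring approach, the sketch below locates a value by searching near a known anchor phrase instead of at a fixed position on the page. The function name, anchor phrase, and window size are hypothetical and kept deliberately simple; this is not Alkymi's implementation.

```python
# Hedged sketch of anchoring with regular expressions: find a data element by
# searching a window of text after a known anchor phrase, so layout changes
# elsewhere on the page do not affect the result.
import re

def extract_amount_near_anchor(page_text, anchor="capital call amount"):
    """Return the first currency amount appearing shortly after the anchor phrase."""
    anchor_match = re.search(anchor, page_text, flags=re.IGNORECASE)
    if not anchor_match:
        return None
    window = page_text[anchor_match.end():anchor_match.end() + 200]
    amount_match = re.search(r"\$\s?[\d,]+(?:\.\d{2})?", window)
    return amount_match.group(0) if amount_match else None

print(extract_amount_near_anchor(
    "Dear Limited Partner, your capital call amount is $250,000.00, due June 1."
))
```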
Using a representative sample of documents from our customer, our team meticulously labels the location of each data element in every document using our own platform. This is an iterative process of labeling, analyzing, and incorporating customer feedback to ensure clarity around requirements, minimal ambiguity, and most importantly, consistency across labeled documents. With a consistent training set of high-quality, clearly labeled documents, model training can begin.
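To show what a labeled example might contain, the record below ties each target field to its value and its location on the page. The schema, field names, and coordinates are hypothetical, included only to make the labeling step concrete.

```python
# Hypothetical shape of one labeled training example: each field is tied to
# its value and a bounding box giving its location on the page.
labeled_example = {
    "document_id": "capital_notice_0001.pdf",
    "page": 1,
    "labels": [
        {
            "field": "fund_name",
            "value": "Example Growth Fund IV",
            "bbox": (72.0, 96.5, 310.2, 112.0),  # (x0, y0, x1, y1) in page coordinates
        },
        {
            "field": "call_amount",
            "value": "$250,000.00",
            "bbox": (402.1, 540.3, 489.7, 556.0),
        },
    ],
}
```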
Much of what has been done until this point could be considered preparation. We employ a rigorous approach to training each machine learning model focused on building for generalization. In this context, generalization refers to the ability of the model to perform effectively on unseen data, including formats it did not encounter during training. We define a successful model in part by its ability to effectively generalize, because unstructured documents, like Capital Notices or Brokerage Statements, come in a myriad of formats, rendering it impossible to train on 100% of the formats that will appear during production.
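One common way to measure this kind of generalization is to hold out entire document formats during training and evaluate only on those unseen formats. The sketch below assumes a hypothetical record of (features, label, format_id) per document; it is an illustration, not Alkymi's evaluation code.

```python
# Split documents so that held-out formats never appear in training; accuracy
# on the test split then reflects performance on unseen layouts.
def split_by_format(documents, holdout_formats):
    train, test = [], []
    for features, label, format_id in documents:
        (test if format_id in holdout_formats else train).append((features, label))
    return train, test

docs = [
    ("features_a1", "capital_notice", "fund_a"),
    ("features_b1", "capital_notice", "fund_b"),
    ("features_c1", "brokerage_statement", "broker_c"),
]
# Documents from "broker_c" are excluded from training entirely.
train_set, unseen_format_set = split_by_format(docs, holdout_formats={"broker_c"})
```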
There are several approaches our Data Science team leverages to ensure high-quality models, including:
The quality of the training dataset or datasets is critical to training the model. While selecting documents, our team identifies any that are outliers: documents that would unexpectedly skew the model or that simply are not representative of the general body of documents (a simplified sketch of one such check appears below). Our validation system will flag these documents for review by the user during production.
Model training goes through multiple iterations of training and QA, integrating subject matter expert and customer feedback at every step.
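As a simplified illustration of the outlier check mentioned above, the sketch below flags documents whose length deviates sharply from the rest of a candidate training set. The statistic, threshold, and sample counts are illustrative assumptions, not Alkymi's actual validation logic.

```python
# Flag documents whose token count sits far from the rest of the set; such
# outliers may skew the model or be unrepresentative of the document body.
from statistics import mean, stdev

def flag_length_outliers(doc_token_counts, z_threshold=2.5):
    """Return ids of documents whose length is more than z_threshold standard deviations from the mean."""
    counts = list(doc_token_counts.values())
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [doc_id for doc_id, n in doc_token_counts.items()
            if abs(n - mu) / sigma > z_threshold]

token_counts = {f"doc_{i:03d}": n for i, n in enumerate(
    [800, 820, 850, 860, 880, 900, 910, 930, 950, 12400]
)}
print(flag_length_outliers(token_counts))  # the 12,400-token document is flagged
```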
Once Alkymi’s model is trained and validated, it is ready for deployment into production. From this point onward, we conduct ongoing maintenance: proactively monitoring model performance, capturing model confidence scores, and performing re-trainings to improve specific fields or formats as needed. Additionally, any manual reviews performed by customers are analyzed and incorporated into tailored re-trainings. As our customers provide more data to Alkymi through the manual review process, the quality of the model continues to improve.
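As a sketch of how confidence scores can drive this loop, the example below routes low-confidence extractions to a manual review queue whose corrections can later feed re-training. The threshold and record shape are illustrative assumptions rather than Alkymi's production logic.

```python
# Route extractions by confidence: values below the threshold are queued for
# human review, and the corrected results can be folded into later re-trainings.
REVIEW_THRESHOLD = 0.90  # illustrative cutoff, not a recommended setting

def route_extraction(field, value, confidence):
    """Accept high-confidence extractions; flag the rest for manual review."""
    return {
        "field": field,
        "value": value,
        "confidence": confidence,
        "needs_review": confidence < REVIEW_THRESHOLD,
    }

results = [
    route_extraction("call_amount", "$250,000.00", 0.98),
    route_extraction("due_date", "June 1, 2024", 0.74),  # sent to manual review
]
review_queue = [r for r in results if r["needs_review"]]
print(f"{len(review_queue)} field(s) flagged for manual review")
```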
At Alkymi, training machine learning models isn’t simply about algorithms and data—it’s about understanding our customers’ needs, leveraging the right tools and techniques, and delivering tangible results.