Please find the FINTOC 2023 Shared Task Call for Participation below.
Apologies for cross-posting.
With best wishes,
FinTOC 2023 Shared Task organizing committee
---
Call for participation:
FNP-2023 Shared Task: FinTOC - Financial Document Structure Extraction
Practical Information:
To be held as part of the 5th Financial Narrative Processing Workshop (FNP 2023) https://wp.lancs.ac.uk/cfie/fnp2023/ during the 2023 IEEE International Conference on Big Data (IEEE BigData 2023) http://bigdataieee.org/BigData2023/, Sorrento, Italy, from 15th to 18th December 2023. The workshop is a one-day event; the exact date is to be announced.
===================
Shared Task URL: http://wp.lancs.ac.uk/cfie/fintoc2023/
Workshop URL: https://wp.lancs.ac.uk/cfie/fnp2023/
Participation Form: https://docs.google.com/forms/d/e/1FAIpQLSdqUKy3YGho0Cw2GF__VHilHZZbR75UDG3J...
___________________________________________________________
Shared Task Description:
A vast and continuously growing volume of financial documents is being created and published in machine-readable formats, predominantly PDF. Unfortunately, these documents often lack comprehensive structural information, which hinders efficient analysis and interpretation. Nevertheless, they play a crucial role in enabling firms to report their activities, financial situation, and investment plans to shareholders, investors, and the financial markets, serving as corporate annual reports that offer detailed financial and operational information.
In certain countries like the United States and France, regulators such as the SEC (Securities and Exchange Commission) and the AMF (Financial Markets Authority) have implemented requirements for firms to adhere to specific reporting templates. These regulations aim to promote standardization and consistency across firms' disclosures. However, in various European countries, management typically possesses more flexibility in determining what, where, and how to report financial information, resulting in a lack of standardization among financial documents published within the same market.
Although some research has been conducted on the recognition of book and document tables of contents (TOCs), most existing work has focused on small-scale, application-dependent, and domain-specific datasets. This limited scope poses challenges when dealing with a vast collection of heterogeneous documents and books, where TOCs from different domains exhibit significant variations in visual layout and style, making TOC recognition and extraction an intricate problem. Indeed, compared with regular books, which are typically provided in full-text format with limited structural information such as pages and paragraphs, financial documents possess a more complex structure: they consist of various elements, including parts, sections, sub-sections, and even sub-sub-sections, incorporating both textual and non-textual content. Moreover, TOC pages are not always present to help readers navigate the document, and when they are, they often only provide access to the main sections.
In this shared task, our objective is to analyze various types of financial documents, encompassing KIIDs (Key Investor Information Documents), Prospectuses (official PDF documents in which investment funds meticulously describe their characteristics and investment modalities), Réglements, and Financial Annual Reports/Financial Statements (which provide a detailed overview of a company's financial performance and operations over the course of a fiscal year). These documents play a vital role in providing crucial information to investors, stakeholders, and regulatory bodies. While the content they must contain is often prescribed and regulated, their format lacks standardization, leading to a significant degree of variability: presentation styles range from plain text to more visually rich and data-driven graphical and tabular representations. Notably, the majority of these documents are published without a table of contents. A TOC is typically essential for readers, as it enables easy navigation within the document by providing a clear outline of headers and corresponding page numbers. Additionally, TOCs serve as a valuable resource for legal teams, facilitating verification that all required contents are included. Consequently, the automated extraction of the structure of these documents is becoming increasingly useful for numerous firms worldwide.
Our primary focus for this edition is to expand table-of-contents extraction to a wider variety of financial documents; the task will involve developing highly efficient algorithms and methodologies to address the challenges such a dataset poses. Our aim is to achieve a level of generalization ensuring that the developed systems can be applied to different types of financial documents. This broader scope allows us to explore the applicability of our methodologies across a range of financial document categories, such as KIIDs, Prospectuses, Réglements, and Financial Annual Reports/Financial Statements. In this way, we aim to demonstrate the versatility and effectiveness of the ML algorithms used in TOC extraction, enabling a streamlined and consistent approach across various financial document types.
In addition, for this edition, we are excited to introduce a dataset that goes beyond textual annotations. Our proposed dataset will include visual (spatial) annotations that capture the coordinates of the titles and hierarchical structure of the documents. This comprehensive approach enables a more holistic analysis and understanding of financial documents.
By incorporating visual annotations, we can capture the visual cues and design elements that contribute to the overall structure and organization of the documents. This allows us to delve deeper into the visual representation of the table of contents and extract valuable insights from the visual hierarchy present in these financial documents. The combination of textual and visual annotations provides a richer and more nuanced dataset, making it possible to increase the accuracy and effectiveness of the machine learning algorithms and methodologies employed in TOC extraction.
Thanks to the contribution of the Autonomous University of Madrid (UAM, Spain), the fifth edition of the FinTOC Shared Task welcomes a specific track for Spanish documents, continuing from the previous edition.
In this edition, systems will be scored based on their performance in both Title detection and TOC generation using more precise evaluation metrics based on visual annotations.
Participants are required to register for the Shared Task. Once registered, all participating teams will receive a common training dataset consisting of PDF documents along with the associated TOC annotations.
To participate, please use the registration form below to add details about your team: https://docs.google.com/forms/d/e/1FAIpQLSdqUKy3YGho0Cw2GF__VHilHZZbR75UDG3J... (now open as of 06/01/2023)
_____________________________________________
Important Dates:
- 1st Call for papers & shared task participants: June 12, 2023
- 2nd Call for papers & shared task participants: July 17, 2023
- Final Call for papers & shared task participants: August 17, 2023
- Training set release: August 21, 2023
- Blind test set release: September 21, 2023
- Systems submission: October 03, 2023
- Release of results: October 09, 2023
- Paper submission deadline: October 18, 2023 (anywhere in the world)
- Notification of paper acceptance to authors: November 01, 2023
- Camera-ready of accepted papers: November 15, 2023
- Workshop date (1-day event): December 15-18, 2023 (exact date to be announced)
_____________________________________________
Contact:
For any questions on the shared task please contact us on:
fin.toc.task@gmail.com
_____________________________________________
Shared Task Organizers:
- Abderrahim Ait Azzi, 3DS Outscale (ex Fortia), France
- Sandra Bellato, 3DS Outscale (ex Fortia), France
- Blanca Carbajo Coronado, Universidad Autónoma de Madrid
- Dr Ismail El Maarouf, Imprevicible
- Dr Juyeon Kang, 3DS Outscale (ex Fortia), France
- Prof. Ana Gisbert, Universidad Autónoma de Madrid
- Prof. Antonio Moreno Sandoval, Universidad Autónoma de Madrid