Automatic building up documents taxonomy through metadata analysis

Maurizio Talamo; Giorgio Gambosi; Alessandra Aversa; Simone Bonazzoli

doi:10.2352/issn.2168-3204.2009.6.1.art00033

In cooperation with CNIPA (the Italian Authority for the use of ICT in the Public Administration), we studied and developed a new solution for effective access to legal data, especially law texts, norms and rules. This information, represented in structured XML-based documents, is also available at the section or paragraph level. We are experimenting with this kind of system in the civil status data domain, because a vertical research project, structured at the semantic level, allows information to be collected and a body of uniform rules to be built. The system is based on a statistical similarity relationship. It lets the user also consider information that, even if not immediately returned by the query, may still be relevant to the user's information needs, because it discovers new information and relationships within the set of documents. The system provides the usual functionality for ad hoc retrieval of laws, sections and paragraphs of interest, implemented by means of XML-retrieval techniques. In addition, given the text of a certain law, section or paragraph, it applies document similarity algorithms to derive the set of laws, sections and paragraphs that probably treat the same subject. Furthermore, by performing suitable text parsing, the system extracts from each document all explicit references to other laws (and even references to sections and paragraphs). In this way the system is able, in response to a given query, to return not only all laws (and the corresponding sections and paragraphs) that may be relevant to the specified subject, but also, for each returned law, a set of laws (sections, paragraphs) that are related to it either explicitly (by means of an explicit reference in the text) or implicitly (by statistical similarity).
These items are then ranked by applying a suitable, user-tunable function of both explicit references (in a link-analysis style) and implicit ones. By iteratively applying the same approach to each considered law, section or paragraph, the user can browse the document corpus, moving along significant (explicit or implicit) relationships among text items. This search technology employs a new class of database designed for exploring information, not just managing transactions; it lets users prioritize and personalize their choices rather than directing them down a classification path. Users can thus find what they are looking for and also discover new information and relationships.
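
As an illustration only, the combination of explicit references and implicit similarity described above might be sketched as follows. This is not the authors' implementation: the citation pattern, the bag-of-words cosine similarity, the weight `alpha`, and all function names are assumptions introduced here, and the abstract does not specify which similarity algorithm or ranking function the system actually uses.

```python
import math
import re
from collections import Counter

# Hypothetical citation pattern ("Law 241/1990"); real legal texts
# would need a much richer parser than this single regex.
REF_PATTERN = re.compile(r"\bLaw\s+(\d+/\d{4})")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def explicit_references(text):
    """Return the set of law identifiers explicitly cited in the text."""
    return set(REF_PATTERN.findall(text))


def rank_related(source_id, corpus, alpha=0.5):
    """Rank all other items by a tunable mix of explicit and implicit links.

    alpha = 1.0 uses explicit references only; alpha = 0.0 uses
    statistical similarity only (an assumed, simplified scoring rule).
    """
    source_text = corpus[source_id]
    cited = explicit_references(source_text)
    scored = []
    for doc_id, text in corpus.items():
        if doc_id == source_id:
            continue
        explicit = 1.0 if doc_id in cited else 0.0
        implicit = cosine_similarity(source_text, text)
        scored.append((doc_id, alpha * explicit + (1 - alpha) * implicit))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy corpus with invented texts, keyed by law identifier.
    corpus = {
        "241/1990": "Rules on administrative procedure and access to documents.",
        "15/2005": "Amendments to Law 241/1990 on administrative procedure.",
        "196/2003": "Code for the protection of personal data.",
    }
    for doc_id, score in rank_related("15/2005", corpus, alpha=0.5):
        print(doc_id, round(score, 3))
```

Calling `rank_related` again on any returned item mirrors the iterative browsing described above, and changing `alpha` shifts the ranking between link-driven and similarity-driven results.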