Application Of Network Analysis For Finding Relatedness Among Legal Documents By Using Case Citation Data

Nidha Khanam *    Rupali Sunil Wagh **
* PG Scholar, Department of Computer Science Engineering, Christ University, Bangalore, Karnataka, India.
** Associate Professor, Department of Computer Science Engineering, Christ University, Bangalore, Karnataka, India.

Abstract

Information Retrieval (IR) is the activity of searching and extracting information from web resources based on the information needs of a user. In domains such as the legal domain, the information being searched is stored in large databases and is available as documents written in natural language. Because of the huge amount of information available as text documents, there is a paradigm shift towards knowledge-based information retrieval. The knowledge management requirements of the legal domain are very challenging due to the complex structure of legal documents such as acts, judgements, and petitions. Citations across these documents can thus be considered a very important component of legal processes. Citation analysis in the legal domain is used to examine citation patterns and to find relationships between legal documents. Citations can be represented as a network of legal documents in which every document represents a legal concept. In this study, similarities between legal documents are analyzed and visualized using network analysis. Unlike other techniques, where similarity is defined between two objects directly, network analysis allows relatedness to be analyzed with the help of betweenness and paths. Citations in the judgements of Indian courts are used to build the network structure, which is then evaluated using network metrics.

Keywords:

Introduction

The legal domain, through its various processes, generates huge amounts of data in the form of legal documents and texts. Documents containing legal information can be categorized under several headings such as court lawsuits, affidavits, common law, statements, and convictions. The most important of these are the collections of judgements delivered by judges of different courts. In the legal domain, this information load makes finding related legal documents difficult. To find similar documents, a practitioner has to browse legal databases based on prior experience and knowledge, which is time-consuming and complicated. Guided by this knowledge, a lawyer typically starts from a prior judgement relevant to the requirement and then searches for further judgements similar to it for analysis. Since the data is huge and complicated, a computerised mechanism that identifies similar judgements on the basis of their citations is desirable.

Legal informatics is a specialized field which aims at the application of information technology for effective management of legal information. Lawyers are among the most conservative professionals and have traditionally been late adopters of technology; nevertheless, the gap between information technology and lawyers is decreasing, in particular due to the Internet and the richness of legal sources that can be found online. The legal domain comprises documents that include both structured and unstructured data. Structured data includes citations, case IDs, locations, dates, etc., while unstructured data comprises the plain text present in the document. Judgments in a common law system provide links to the judgments delivered in similar previous cases. These links are referred to as citations. A citation gives details of the judgements and of the provisions considered while deciding a case, and also points to cases that fall under the same category. Citation analysis is used in the legal domain to build a network from case-citation data. The analysis can also be used to examine citation patterns or frequencies and to find relationships between the cited documents.

This paper exhibits similarities of legal documents based on the citations used in court judgements. For this study, data related to cyber crimes covered under the Information Technology Act, 2000 is used. These crimes are committed over an electronic medium and include hacking, cyber stalking, social network abuse, email fraud, spam, etc.

Network analysis is a method of representing complex data by structuring it as a network of nodes and edges. It captures the node structure of the data so that similarities can be found using network metrics. Different measures, such as clustering coefficient, density, connectivity, degree distribution, in-degree, out-degree, betweenness, and closeness, are used to interpret the data [18]. In the proposed work, a case (i.e., a judgement) is considered a node and a citation referred to in the judgement is considered an edge. Thus, if there is a path from one node to another (say, case 1 cites citation 1 and case 2 also cites citation 1), a network connecting case 1 and case 2 is generated based on this similarity.
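To make this idea concrete, the following minimal sketch (in R with the igraph package, which the later analysis also uses) builds a tiny network in which two hypothetical cases cite the same prior decision; the case and citation names are illustrative placeholders, not data from the study.

library(igraph)

# Each row is one citation link: a case (source) citing a prior decision (target).
edges <- data.frame(
  from = c("case_1", "case_2", "case_2"),
  to   = c("citation_1", "citation_1", "citation_2")
)

g <- graph_from_data_frame(edges, directed = TRUE)

# Ignoring edge direction, case_1 and case_2 are connected through citation_1,
# so a path of length 2 exists between them and the two cases are related.
distances(g, v = "case_1", to = "case_2", mode = "all")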

1. Related Work

Litigation is the process of taking legal action [12]. Analytics in the legal domain offers litigators an advantage over opposing counsel by providing data-driven insights into how judges, attorneys, and parties have behaved in similar cases in the past and how they are likely to behave in the future. Legal analytics [2] relies on advanced knowledge management technologies such as machine learning and natural language processing to structure and analyse data from huge dockets and legal documents. Information about cases tried in the past is compared with cases currently under trial; such anecdotal comparison mostly relies on small samples. Using legal analytics, an attorney can run a series of searches and adjust parameters to provide the client with a range of possible outcomes for different legal approaches, including remedies and damages awarded. Tools for building knowledge-based systems in the domain of law have also been described. The first tool is a conceptual model that matches a legal task with the model most suitable for its computerized performance. The second, a sequenced transition network [13], reduces the involvement of the knowledge engineer so that knowledge can be maintained and developed easily. For solving legal problems, a logic-based approach is explained in [1], which addresses two main problems: a technique is created for the representation of legal texts, and the language-related problems involved in legal terms are evaluated. The structure of legal judgements is complex and contains citations to other judgements. For effective search of similar legal documents, research has been carried out on web searching and information retrieval [6]. That work exploits links or citations for finding similar judgements and proposes paragraph-links to improve the estimation of similarity between legal documents.

Citation analysis is often used to understand knowledge transfer across domains; however, it is very difficult for researchers to measure impact through Google Patents because of the large number of articles. In response, a method was introduced in [5] to extract patent citation counts from Google Patents for large sets of articles, with automated filtering of duplicate results. The importance of analyzing patents for evaluating programs and research groups is presented in [14], which links patent analysis with research evaluation. That paper has two parts: the first reviews patent analysis research based on SNRPs, and the second describes mapping technology development to understand the relationship between authors and inventors.

Processing information in the legal domain has several issues [17]; the main ones discussed are the complexity of handling legal-domain knowledge with existing methods and techniques, and identifying ways of storing and retrieving the required data. In the legal domain, extraction of similar precedents is important. Analysis of legal documents involves text that is present in semi-structured and unstructured forms; text documents with legal information are written in natural language, and the data stored in these databases is retrieved through text mining. The work in [16] explains grouping of legal text documents based on their contents, without any external query input, using text mining techniques. Legal processes depend heavily on interpretation by human experts and are therefore considered very complex. Establishing similarity between two cases across legal documents is a common but significant task for the legal expert. In [15], network analysis of legally relevant documents is used to compare similarity measures based on cosine similarity and citation similarity; the citation-based similarity measure was found to be more robust in that study.

Semantic substrates have also been used for network visualization: nodes are placed in non-overlapping regions based on node attributes, and users interactively adjust sliders to control link visibility. The NVSS method [10] was applied to legal precedent data. The paper begins with user tasks on a basic network containing simple nodes and links, and then builds up the network with labels, directed links, link attributes, and node attributes. The studies showed that small networks with 10-50 nodes and 20-100 links were handled successfully. The semantic-substrate visualization was implemented using the Java Universal Network/Graph (JUNG) framework. Citation analysis has also been applied to web-based documents, including scientific papers, legal briefs, emails, and Wikipedia, to trace patterns of co-citation as well as the use of key phrases and arguments, and to help resolve conflicts. After document linkage is found, analysis is carried out using graph-theoretic algorithms, for which prestige and betweenness centrality metrics are described. The main aim was to find strongly related nodes that cluster together; when clusters became too numerous, degree thresholds were applied to reduce the number of nodes.

Semantic substrates are further described in [19], where the user can see patterns related to attributes. One attribute can be time, placed on the x-axis so that older documents can be analysed together with newer ones, while the y-axis describes US legal documents or highly ranked journals; this reveals clusters, patterns, trends, outliers, and gaps. After node layout, the user can review links that are strongly or weakly connected. Two main principles were appropriate region alignment and placement of regions so as to reduce the number of long edges. The authors also described predator-prey relationships and US Senate voting patterns; using semantic substrates, common nodes can be merged into a single meta-node, and the NodeXL tool was used to build the network.

For the analysis of large collections of text documents, a system called Jigsaw [11] was introduced to represent connections between entities across different documents. For large collections of evidence, disparate facts are pieced together, reading between the lines to connect separate activities into a larger plot. The objective of that research was to develop visual representations of the information within textual documents which highlight and identify connections between entities; the system acts as an interactive visual index onto the documents, guiding the analyst to the documents worth reading. The approach was assessed with the help of themes, plots, and stories embedded in the documents.

Citations hold very important information in the legal domain. In [8], links, relationships, nodes, and judgements are modelled as a social network to find important and legally relevant cases based on statistical measures that would help scholars and legal practitioners. The dataset covers Articles 264 to 300 of the Indian Constitution, which deal with property, contracts, finance, and suits. Results were reported on the basis of the dispersion of the citation network, comparing degree-based and structural methods [8].
Tf-idf cosine similarity was also measured for pairs of cases, formulated in terms of the in-degree and out-degree of both cases. Semantic information retrieval in the legal domain has also been described [7]: the basic tasks are search and browsing, which also rely on a graph, but it is difficult to combine semantic content and textual links between documents. Two approaches were proposed to discover and allow approximate answers to a given search query, providing larger scalability on one hand and richer semantics with fewer queries on the other. The possibilities of automated mapping of legislative texts are described in [3]: a query- and script-based approach over a database was introduced to extract information, and the inner link structure of Hungarian legislative texts was visualized. Finally, a Legislation Network is described in [4], where data was collected from a legislation corpus spanning more than 60 years. A topological structure was built for the analysis of this complex data, and temporal analysis was performed on the evolving legislation network; this resulted in insights into the evolution of legislation properties and a clearer explanation of its structure.

2. Methodology

This work applies network analysis to find similarities among legal judgements. Figure 1 shows the methodology followed in the analysis; the various steps are described below.

Figure 1. Methodology

2.1 Data Collection and Data Preprocessing

Data required for the analysis is collected from the website “indiankanoon.org”. For this analysis, the authors have focused only on cases of Indian courts. The data is collected in a two-column format in which the first column contains cases and the second column contains the citations they refer to. This data is semi-structured and requires preprocessing, which is done using the R tool. The generic preprocessing steps are removal of references to Sections and Acts, followed by assigning IDs to both cases and citations.
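A rough sketch of these preprocessing steps in R is given below; the file names, the column names ("case", "citation"), and the Section/Act filtering pattern are assumptions for illustration, not the authors' actual script.

raw <- read.csv("case_citation_raw.csv", stringsAsFactors = FALSE)

# Drop rows whose cited entry refers to a Section or an Act rather than a case.
clean <- raw[!grepl("\\bSection\\b|\\bAct\\b", raw$citation, ignore.case = TRUE), ]

# Assign a numeric ID to every distinct case and citation.
all_docs <- unique(c(clean$case, clean$citation))
doc_id <- setNames(seq_along(all_docs), all_docs)
clean$case_id <- doc_id[clean$case]
clean$citation_id <- doc_id[clean$citation]

write.csv(clean, "case_citation_clean.csv", row.names = FALSE)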

2.2 Building Knowledge Model

The two-column data obtained after preprocessing is represented as nodes and edges. A node in this analysis represents a judgement (a case), and an edge between nodes indicates that the two cases are similar. Since edges are very important in citation network analysis, a directed graph is built and analyzed with the measures explained below.
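Continuing the sketch above, the two preprocessed columns can be turned into a directed citation graph with igraph; the column names are the hypothetical ones from the preprocessing sketch.

library(igraph)

clean <- read.csv("case_citation_clean.csv", stringsAsFactors = FALSE)

# Edge list: each row is a directed edge from the citing case to the cited document.
g <- graph_from_data_frame(d = clean[, c("case", "citation")], directed = TRUE)

vcount(g)   # number of nodes (cases and cited documents)
ecount(g)   # number of edges (citation links)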

2.3 Evaluation of the Model

The network model thus obtained from nodes and edges is then analyzed with the help of network metrics such as density, centrality, betweenness, connected components, transitivity, and clustering coefficient. Instead of considering only a direct link from one node to another, the existence of a longer path from a source node to a destination node is also considered when deciding similarity. In Figure 2, the colours of the nodes representing cases depict their connectedness: green represents higher connectivity, whereas red represents lower connectivity based on case similarity. In Figure 3, purple nodes represent higher connectivity, whereas yellow nodes represent lower connectivity with the other cases in the dataset.
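The metrics listed above can be computed on the graph g from the previous sketch with standard igraph functions; the parameter choices here (weak components, global transitivity, ignoring edge direction for paths) are assumptions, since the paper does not state the exact settings used.

library(igraph)

edge_density(g)                    # density of the network
degree(g, mode = "in")             # in-degree of each node
betweenness(g, directed = TRUE)    # betweenness centrality
components(g, mode = "weak")$no    # number of weakly connected components
transitivity(g, type = "global")   # transitivity / clustering coefficient

# Path-based relatedness: two cases are related if any path connects them,
# not only a direct edge (edge direction is ignored here).
distances(g, mode = "all")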

Figure 2. Network Structure for 100 Dataset

Figure 3. Similarity Network Structure

In the network obtained, a direct link or a path from one node to another signifies similarity. From the visual representation of a network of four cases, it is evident that the cases are similar based on their citations.

The analyzed data yields similarity values in the range 0-1. Table 1 summarizes the network metrics: Length gives the average shortest path between all node pairs (computed over a set of 10 vectors); Density is reported as a kernel density estimate of the observations, together with its kernel and bandwidth; Diameter gives the longest path length between two nodes. Figure 4 shows the degree distribution.

Table 1. Network Metrics

Figure 4. Degree Distribution [9]

The results are analyzed with a degree limit of 35; the degree distribution gives a value of 0.89.
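For reference, a degree-distribution plot like Figure 4 can be produced in the style of the igraph tutorial cited as [9]; the axis limit of 35 mirrors the degree limit mentioned above, while the remaining plotting details are assumptions.

library(igraph)

deg <- degree(g, mode = "all")
deg_dist <- degree_distribution(g, cumulative = TRUE, mode = "all")

# Cumulative degree distribution, plotted as in [9], truncated at degree 35.
plot(x = 0:max(deg), y = 1 - deg_dist, pch = 19, cex = 0.8,
     xlim = c(0, 35), xlab = "Degree", ylab = "Cumulative frequency")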

Conclusion and Future Work

Finding similar cases is one of the most researched problems in the legal domain. Through this paper, the authors have presented the application of network analysis to citations in Indian court judgements, which can be used not only to find similarity between cases, but also to understand the interrelationships among various legal ideas through citation links. The network can be further analyzed by applying more network measures and link-analysis algorithms. Information such as judgement year, court, and judge can also be analyzed to identify the most relevant laws. Edges can be weighted by designing knowledge-based weights, and such a structure can be very useful in understanding legal knowledge.

References

[1]. Branting, L. K. (2017). Data-centric and logic-based models for automated legal problem solving. Artificial Intelligence and Law, 25(1), 5-27.
[2]. Byrd, O. (2017). Legal Analytics vs. Legal Research: What's the Difference? In Law Technology Today. Retrieved from http://www.lawtechnologytoday.org/2017/06/legalanalytics-vs-legal-research/
[3]. Hamp, G. & Markovich, R. (2015). Automated Reference Extraction in Hungarian Legislative Texts and Visualization of their Inner Link Structures. Openlaws Open Data Workshop. NAIL.
[4]. Koniaris, M., Anagnostopoulos, I., & Vassiliou, Y. (2015). Network Analysis in the Legal Domain: A complex model for European Union legal sources. Journal of Complex Networks, 1-29.
[5]. Kousha, K., & Thelwall, M. (2017). Patent citation analysis with Google. Journal of the Association for Information Science and Technology, 68(1), 48-61.
[6]. Kumar, S., Reddy, P. K., Reddy, V. B., & Suri, M. (2013, March). Finding similar legal judgements under common law system. In International Workshop on Databases in Networked Information Systems (pp. 103-116). Springer, Berlin, Heidelberg.
[7]. Mimouni, N., Nazarenko, A., & Salotti, S. (2015, December). Answering Complex Queries on Legal Networks: a Direct and a Structured IR Approaches. In Network Analysis in Law Workshop (held in connection with JURIX 2015).
[8]. Minocha, A., Singh, N., & Srivastava, A. (2015, May). Finding Relevant Indian Judgments using Dispersion of Citation Network. In Proceedings of the 24th International Conference on World Wide Web (pp. 1085-1088). ACM.
[9]. Ognyanova, K. (2016). Network Analysis and Visualization with R and igraph. In www.kateto.net. Retrieved from http://kateto.net/networks-r-igraph
[10]. Shneiderman, B. & Aris, A. (2006). Network visualization by semantic substrates. IEEE Transactions on Visualization and Computer Graphics, 12(5), 733-740.
[11]. Stasko, J., Görg, C., & Liu, Z. (2008). Jigsaw: supporting investigative analysis through interactive visualization. Information Visualization, 7(2), 118-132.
[12]. Steiner, D. (2016). Data Analytics and Your Law Firm. In Law Technology Today. Retrieved from http://www.lawtechnologytoday.org/2016/04/big-data-law-firm-dataanalytics-influencing-cases/
[13]. Stranieri, A. & Zeleznikow, J. (2000). Tools for intelligent decision support system development in the legal domain. In Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2000) (pp. 186-189). IEEE.
[14]. van Raan, A. F. (2017). Patent citations analysis and its value in research evaluation: A review and a new approach to map technology-relevant research. Journal of Data and Information Science, 2(1), 13-50.
[15]. Wagh, R. S. & Anand, D. (2017). Application of citation network analysis for improved similarity index estimation of legal case documents: A study. IEEE International Conference on Current Trends in Advanced Computing (ICCTAC) (pp. 1-5).
[16]. Wagh, R. S. (2013). Knowledge Discovery from Legal Documents Dataset using Text Mining Techniques. International Journal of Computer Applications, 66(23), 32-34.
[17]. Wagh, R. S. (2014). Exploratory Analysis of Legal Documents using Unsupervised Text Mining Techniques. IJERT International Journal of Engineering Research & Technology, 3(2), 2264-2267.
[18]. Wolfram Alpha LLC. (2009). Graph Measures & Metrics-Wolfram Language Documentation. Wolfram Research.
[19]. Wong, P. C., Chen, C., Gorg, C., Shneiderman, B., Stasko, J., & Thomas, J. (2011). Graph Analytics-Lessons Learned and Challenges Ahead. IEEE Computer Graphics and Applications, 31(5), 18-29.