Update CL XML and head_matter fields with data from CAP #4614

Open
wants to merge 27 commits into
base: main

jtmst
Collaborator

@jtmst jtmst commented Oct 24, 2024

Description

This PR introduces a new management command update_cap_cases along with corresponding unit tests. The command is designed to update CourtListener (CL) cases with the latest data from the Caselaw Access Project (CAP).

Key Changes

  • Added update_cap_cases.py management command
  • Implemented test_update_cap_cases.py for unit testing
  • The command processes crosswalk files, fetches CAP HTML and CL XML, and updates CL data accordingly
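The flow described in the last bullet could be sketched roughly as follows. This is a minimal illustration, not the code in update_cap_cases.py: the crosswalk is assumed to be a JSON list of dicts, the keys `cap_path` and `cl_cluster_id` are hypothetical, and the fetch and save steps are passed in as callables because their real implementations are not shown in this PR description.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_crosswalk(path: Path) -> list[dict]:
    """Read one crosswalk file; assumed here to be a JSON list of dicts."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def process_entries(
    entries: Iterable[dict],
    fetch_cap_html: Callable[[str], str],
    fetch_cl_xml: Callable[[int], str],
    save_update: Callable[[int, str, str], None],
) -> int:
    """Hand each CAP/CL pair to save_update and return how many were updated."""
    updated = 0
    for entry in entries:
        # Hypothetical keys; the real crosswalk schema may differ.
        cap_path = entry["cap_path"]
        cluster_id = entry["cl_cluster_id"]
        cap_html = fetch_cap_html(cap_path)
        cl_xml = fetch_cl_xml(cluster_id)
        if not cap_html or not cl_xml:
            # Skip rather than overwrite CL data with an empty document.
            continue
        save_update(cluster_id, cap_html, cl_xml)
        updated += 1
    return updated
```

Injecting the fetch and save callables keeps the loop testable without network or database access.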

Testing

Unit tests have been added for the core functionality of the new command.

Note

The crosswalk files must be generated with the generate_capcrosswalk.py command before this script will work.

@jtmst jtmst marked this pull request as ready for review October 28, 2024 14:06
@jtmst jtmst requested a review from mlissner October 28, 2024 16:36
@mlissner
Member

@flooie, to you for triage, analysis, or both! :)

Member

@quevon24 quevon24 left a comment

I found a problem. I was testing the command with this cluster: https://www.courtlistener.com/opinion/1539264/go/ (https://static.case.law/a2d/191/html/0138-01.html) and saw that the resulting XML removed the link from the first footnote, and the first footnote in the updated XML also contains a wrong link:

<footnote data-label="1" id="footnote_1_1">
<footnote citation-index="1" href="#fn2_ref" label="139">1</footnote>
<p data-blocks="[[&quot;BL_175.11&quot;,175,[157,2661,695,68]]]" id="b175-8">. Schibi v. Schibi, <citation data-cite="136 Conn. 190" data-index="0" href="/citations/?q=136%20Conn.%20190">136 Conn. 190</citation>, <citation data-cite="69 A.2d 831" data-index="1" href="/citations/?q=69%20A.2d%20831">69 A.2d 831</citation>, <citation data-cite="14 A.L.R. 2d 620" data-index="2" href="/citations/?q=14%20A.L.R.%202d%20620">14 A.L.R.2d 620</citation>.</p>
</footnote>


When we ran the Harvard merger command to update opinions and metadata, it also fixed the footnotes (regenerated the tag and link), and that may be breaking the update_cap_html_with_cl_xml function.

Here is how we fixed the footnotes to be linked correctly: https://github.com/freelawproject/courtlistener/blob/main/cl/corpus_importer/management/commands/harvard_merge.py#L518

This is the XML of the cluster I mentioned above:

<?xml version="1.0" encoding="utf-8"?><opinion type="majority"><author id="b174-23"> HOOD, Chief Judge. </author><p id="b174-24"> This appeal is by a husband from an order dismissing his complaint seeking a divorce on the ground of five years voluntary separation. </p><p id="b174-25"> The facts, as found by the trial court, are these. A child was born out of wedlock to the parties in April of 1955. In November of that year the parties were legally married, but separated eight days later and have not lived together since that time. Prior to and at the time of the marriage the parties agreed that the purpose of the marriage was to give the child a legal name and that if they were not satisfied with the marriage a divorce could be obtained. </p><p id="b174-26"> The trial court denied the divorce on the ground that the agreement of the parties prior to and at the time of marriage was collusive and contrary to law. </p><p id="b175-4"><span citation-index="1" class="star-pagination" label="139"> *139 </span> The parties, as the court found, were legally married. Although a marriage is entered into solely for the purpose of legitimizing a child born out of wedlock, such a marriage is a valid one. <a class="footnote" href="#fn1" id="fn1_ref"> 1 </a> The court also found that the parties had lived separate and apart for more than five years. The court did not expressly state that the separation was voluntary, but that is implicit in its finding, and there is no intimation in the record that the separation was other than voluntary. Under our law proof of a valid marriage and five years voluntary separation entitles either party to a divorce. <a class="footnote" href="#fn2" id="fn2_ref"> 2 </a> The sole question is whether the agreement of the parties at the time of marriage bars granting of the divorce. </p><p id="b175-5"> The agreement did not constitute collusion in a legal sense. 
In general it may be said that collusion, in the law of divorce, implies a corrupt agreement by which evidence is fabricated or suppressed in an attempt to deceive the court and obtain a divorce where legal grounds do not exist. Such was not the case here, but the trial court apparently was of the opinion that an agreement before marriage that if the marriage was unsatisfactory the parties could and would separate and thereafter obtain a divorce, was collusive in nature and contrary to law. </p><p id="b175-6"> When our divorce law was amended in 1935 to include five years voluntary separation as a ground for divorce, it made possible that parties to a marriage could put an end to the marriage by their own voluntary action and after the required period either party could have the marriage legally dissolved. In such a dissolution proceeding there is no question of the innocence or guilt of either party and the reason for the separation is not material. The only issue is the existence of the voluntary separation for the required time. </p><p id="b175-7"> The result is that an agreement by the parties prior to entering marriage that they may voluntarily separate, end the marriage and be divorced, is nothing more than a recognition of the rights given them by law. Such an agreement cannot be said to-be contrary to law. </p><p id="b175-11"> Reversed with instructions to award appellant a divorce. </p><div class="footnotes"><div class="footnote" id="fn1" label="1"><a class="footnote" href="#fn1_ref"> 1 </a><p id="b175-8"> . Schibi v. Schibi, 136 Conn. 190, 69 A.2d 831, 14 A.L.R.2d 620. </p></div><div class="footnote" id="fn2" label="2"><a class="footnote" href="#fn2_ref"> 2 </a><p id="b175-24"> . Code 1961, 16-403. </p></div></div></opinion>
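A check for the symptom above might look like the sketch below. It assumes the back-link convention visible in the CL XML (`#fn1_ref` for the footnote labeled 1) and flags any outer `<footnote>` whose nested marker points at a different reference. The function name is hypothetical, and this is not the logic used in harvard_merge.py.

```python
import xml.etree.ElementTree as ET


def find_mislinked_footnotes(xml_fragment: str) -> list[str]:
    """Return ids of footnotes whose nested marker links to the wrong ref.

    The fragment must not include an XML declaration, because it is
    wrapped in a synthetic root element before parsing.
    """
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    mislinked = []
    for outer in root.iter("footnote"):
        label = outer.get("data-label")
        if label is None:
            # Nested marker elements carry no data-label; skip them here.
            continue
        expected_href = f"#fn{label}_ref"  # assumed CL back-link pattern
        for marker in outer.findall("footnote"):
            if marker.get("href") != expected_href:
                mislinked.append(outer.get("id", ""))
                break
    return mislinked
```

Run against the fragment quoted above, footnote 1 is reported because its marker points at `#fn2_ref`.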

@jtmst
Collaborator Author

jtmst commented Oct 30, 2024

Taking a look at this. We might be able to just plug in the fix_footnotes logic as a fix. @flooie, you had mentioned making some site-wide modifications to footnotes sometime soon; does that come into play here at all?

@jtmst
Collaborator Author

jtmst commented Nov 4, 2024

@quevon24 I'm looking at this again, and I don't think we should expect this to display the same when we re-import it: the XML is still just CAP XML, but this page is displaying a Lawbox import. We did look at examples of how these footnotes are being brought in from CAP, and it seemed like it wasn't an issue pending @flooie's work on footnote styling.

@quevon24
Member

quevon24 commented Nov 7, 2024

I've already checked the opinions where xml_harvard is the main source and I didn't find any problems when updating the xml since I didn't find any footnotes with the format I described above. So they should be displayed correctly when updating the xml if that's the case.

So I don't think there is a problem since it is not the main source shown in courtlistener. What do you think @flooie?

@quevon24 quevon24 self-requested a review November 7, 2024 19:32
@quevon24
Member

I made these changes:

  • I inspected the DB again and found clusters where we are displaying the xml_harvard value, so it is necessary to revert the changes in the XML (footnotes and pagination) before comparing CAP and CL data: https://www.courtlistener.com/opinion/5361439/go/ https://www.courtlistener.com/opinion/5361432/go/
  • Use VerboseCommand instead of BaseCommand to reduce code
  • Improve log messages to make them more readable
  • Update tests
  • Remove tqdm
  • Add delay argument because we are updating opinions and xml_harvard field is being indexed in ES
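The delay argument in the last bullet could work along these lines. This is a standalone sketch using plain argparse rather than the actual VerboseCommand subclass. The `--delay` flag name, its default, and the helper names are assumptions for illustration.

```python
import argparse
import time
from typing import Callable, Iterable


def build_parser() -> argparse.ArgumentParser:
    """Parser with a hypothetical --delay option, in seconds."""
    parser = argparse.ArgumentParser(description="Throttled update sketch")
    parser.add_argument(
        "--delay",
        type=float,
        default=0.0,
        help="Seconds to wait after each update so indexing can keep up.",
    )
    return parser


def update_with_delay(
    cluster_ids: Iterable[int],
    update_one: Callable[[int], None],
    delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply update_one to each id, pausing `delay` seconds after each."""
    count = 0
    for cluster_id in cluster_ids:
        update_one(cluster_id)
        count += 1
        if delay > 0:
            sleep(delay)
    return count
```

Passing `sleep` as a parameter lets tests replace the real pause with a recorder.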

@quevon24 quevon24 requested a review from flooie November 22, 2024 16:21
@quevon24 quevon24 assigned flooie and unassigned quevon24 Nov 22, 2024
Contributor

@flooie flooie left a comment

@jtmst

Sorry it took some time to get back to you on this. I have a number of concerns about this PR, and I think it might be helpful to hop on a call after Thanksgiving. I'm going to highlight one particular problem that I came across, but I think there may be a handful of others.

As far as I can tell, there is no attempt to merge the changes we identified with the CAP data. Let me take a random example: this is a case where something on CAP went attorney-happy and tagged basically the entire document as attorney tags.

At the end of the CAP file you get

 <opinion type="majority">
    </opinion>

See links at bottom.

Because we have to do this at scale, I can sheepishly state that we didn't get it all correct. We converted all the errant attorney tags to p tags and wrapped all the content after the court into the opinion, and we did not create a headmatter. We had no method to correctly parse out where the opinion actually started here.

When I tested your code it took the entire opinion prior to the empty opinion and stashed it as headmatter and generated an empty opinion. It reverted all the fixes we generated and made 26 attorney tags.
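One possible guard against this failure mode is sketched below. It assumes the CAP document can be parsed as namespace-free XML, and it uses a hypothetical tag name (`attorneys`) and an arbitrary threshold. It refuses an update when every opinion element is empty or when the attorney tag count looks implausible.

```python
import xml.etree.ElementTree as ET


def looks_unsafe_to_apply(
    cap_xml: str,
    attorney_tag: str = "attorneys",  # assumed tag name, adjust as needed
    max_attorney_tags: int = 5,       # arbitrary illustrative threshold
) -> bool:
    """Return True when the CAP data should not overwrite the CL fixes."""
    root = ET.fromstring(cap_xml)
    # An opinion counts as non-empty only if it holds non-whitespace text.
    has_opinion_text = any(
        "".join(opinion.itertext()).strip() for opinion in root.iter("opinion")
    )
    attorney_count = sum(1 for _ in root.iter(attorney_tag))
    return (not has_opinion_text) or attorney_count > max_attorney_tags
```

A document like the one described above, with an empty majority opinion and 26 attorney tags, would be flagged and could be routed to manual review instead of being applied.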

https://ia902209.us.archive.org/10/items//law.free.cap.p3d.443/597.12576835.json
https://www.courtlistener.com/opinion/8255415/white-v-premo/#p3
/opinion/8255415/white-v-premo/

I think this PR is going to have to wait until the front end 🤞 gets approved this week so we can do a thorough inspection of how these eventual changes would affect the front end and the css.
