common programming operations
- convert html to txt (java)
xy-241 committed Dec 18, 2024
1 parent 06fdff2 commit 78f0045
Showing 1 changed file with 39 additions and 2 deletions.
41 changes: 39 additions & 2 deletions content/Programming/Common Programming Operations.md
@@ -7,8 +7,9 @@ tags:
- java
- cpp
- python
- programming
Creation Date: 2024-12-14, 20:31
Last Date: 2024-12-16T22:14:18+08:00
Last Date: 2024-12-18T23:05:46+08:00
References:
draft:
description:
@@ -120,4 +121,40 @@ minHeap.push(5);
minHeap.pop();
// Return the smallest element from the heap
minHeap.top().val;
```


## Convert HTML to TXT
---
>[!important]
> We don’t want to extract the content as a single string because this would result in losing all the formatting information provided by the HTML tags. Reformatting the content based on the document structure afterward is tedious, error-prone, and not scalable.
>
> Instead, the idea is to retain the formatting information we need before removing all the HTML tags. Then, we can use this retained formatting information to generate a text file with the desired formatting.

```java title="Java"
import java.io.FileWriter;
import java.io.IOException;
import java.util.UUID;
import org.jsoup.Jsoup;

// Generate a placeholder string using UUID to avoid conflicts with the HTML content
String uniquePlaceholder = UUID.randomUUID().toString();

// Swap each <br /> for the placeholder so the line-break information survives tag removal
String htmlContent = rawHtml.replace("<br />", "<span>" + uniquePlaceholder + "</span>");

// Parse the modified HTML content using Jsoup to extract plain text,
// then replace the placeholder with actual line breaks (\n) to restore the original formatting
String txtContent = Jsoup.parse(htmlContent).text().replace(uniquePlaceholder, "\n");

// Write the plain text content into the file.
// try-with-resources closes the writer even if write() throws an IOException
try (FileWriter writer = new FileWriter("output.txt")) {
    writer.write(txtContent);
}
```

>[!code]
> The code example above assumes that `<br />` is the only markup used to denote line breaks in the given HTML string. The replacement is an exact string match, so variants such as `<br>` or `<br/>` are not replaced. If other tags or methods are used for formatting, additional handling is required.
>
> Also note that `FileWriter` writes the content into the text file as-is and does no formatting of its own. The line breaks in the output come from the `\n` characters that replace the placeholder.
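
The "additional handling" mentioned above could look like the following minimal sketch. It covers the `<br>`, `<br/>`, and `<br />` spellings in any letter case, assuming the tag carries no attributes. To keep it self-contained, it swaps Jsoup for a naive regex tag stripper from the standard library, so it is only an illustration of the placeholder idea and not a replacement for a real HTML parser. The class name `HtmlToTextSketch` and the method `toText` are hypothetical names introduced for this example.

```java
import java.util.UUID;
import java.util.regex.Pattern;

// Hypothetical helper: a stdlib-only stand-in for the Jsoup-based snippet above
class HtmlToTextSketch {
    // Matches <br>, <br/>, <br /> in any letter case (assumption: the tag has no attributes)
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    // Naive tag matcher; a real HTML parser such as Jsoup is far more robust than this regex
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String toText(String rawHtml) {
        // A UUID contains only hex digits and hyphens, so it is safe as a regex replacement string
        String placeholder = UUID.randomUUID().toString();
        // 1. Retain the line-break information before the tags are removed
        String marked = BR.matcher(rawHtml).replaceAll(placeholder);
        // 2. Remove the remaining tags and collapse runs of whitespace into single spaces
        String plain = TAG.matcher(marked).replaceAll("").replaceAll("\\s+", " ").trim();
        // 3. Turn the retained placeholders back into real line breaks
        return plain.replace(placeholder, "\n");
    }

    public static void main(String[] args) {
        System.out.println(toText("<p>Hello<br>big<BR/>wide<br />world</p>"));
    }
}
```

The placeholder matters here for the same reason as in the Jsoup version: step 2 collapses whitespace, so a `\n` inserted before that step would be lost.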
