Html2Xhtml
Html2Xhtml is a .NET 4.0 library for converting HTML to XHTML, licensed under the GPLv2 or later.
I tested Html2Xhtml in the local reconstruction of a large online database of the European Union. Tidy/Tidy.NET would not even produce valid output most of the time, and Chilkat's HTML-to-XML was a bit slow and produced strange results (misplaced, missing, or inexplicable elements). I created this library in an attempt to find a free, fast and reliable conversion tool. It converts 2 to 4 times faster than every other library I tested.
Html2Xhtml, combined with the power of LINQ to XML, is an excellent tool for all large-scale data extraction and web crawling scenarios.
Download
- Windows: 1.1.2-4
Documentation
C# Example
Program
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Corsis.Xhtml;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            Html2XhtmlBasicExample();
            Html2XDocumentExample();
        }

        // Malformed input: upper-case tag, unquoted attributes, unclosed <p>, stray </a>.
        static string html = "<html><head><TITLE>title</TITLE></head><body>I♥NY<p>b<br>c:±<img src=2 nonsense=x></a><font size=2>c</font></body></html>";

        // Converts the HTML and prints the resulting XHTML as text.
        static void Html2XhtmlBasicExample()
        {
            var xhtml = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToEnd();
            Console.WriteLine(xhtml);
        }

        // Converts the HTML and loads the result into a LINQ to XML XDocument.
        static void Html2XDocumentExample()
        {
            var xdoc = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToXDocument(keepXhtmlNamespace: true);
            Console.WriteLine(xdoc);
        }
    }
}
Output
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
title
</title>
</head>
<body>
I♥NY
<p>
b<br />c:±<img src="2" alt="" /><font size="2">c</font>
</p>
</body>
</html>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"[]>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
title
</title>
</head>
<body>
I♥NY
<p>
b<br />c:±<img src="2" alt="" /><font size="2">c</font></p></body>
</html>
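Once the XHTML is in an XDocument, it can be queried with plain LINQ to XML. The following is a minimal sketch that uses only System.Xml.Linq and a hand-written stand-in for the converter output, so it does not call Html2Xhtml itself. The class and method names (XhtmlQuerySketch, GetTitle, GetImageSources) are made up for illustration. Because the elements carry the XHTML namespace (as with keepXhtmlNamespace: true), element names have to be qualified with it.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlQuerySketch
{
    // The XHTML namespace, as declared on the <html> element in the output above.
    static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Returns the trimmed text of the first <title> element, or null if there is none.
    public static string GetTitle(XDocument doc)
    {
        var title = doc.Descendants(Xhtml + "title").FirstOrDefault();
        return title == null ? null : title.Value.Trim();
    }

    // Returns the src attribute of every <img> element that has one, in document order.
    public static string[] GetImageSources(XDocument doc)
    {
        return doc.Descendants(Xhtml + "img")
                  .Select(img => (string)img.Attribute("src")) // attributes are not namespaced
                  .Where(src => src != null)
                  .ToArray();
    }

    public static void Main()
    {
        // Stand-in for converter output; the DOCTYPE is left out so parsing needs no DTD.
        var doc = XDocument.Parse(
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title> title </title></head>" +
            "<body>text<p>b<br />c<img src=\"2\" alt=\"\" /><font size=\"2\">c</font></p></body></html>");

        Console.WriteLine(GetTitle(doc));
        Console.WriteLine(string.Join(",", GetImageSources(doc)));
    }
}
```

If the namespace is stripped instead (which the second example in the e-mail below appears to rely on), the same queries work with unqualified names such as "title" and "img".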
Initial E-Mail
May contain obsolete code!
.
.
.
=Technical improvement=
Until now we have used a commercial library to walk HTML documents cleanly: http://www.chilkatsoft.com/buyHtmlToXml.asp
Now, in collaboration with a colleague (http://www.it.uc3m.es/jaf/), I have developed a completely new conversion routine.
* delivers perfect results, always! (Chilkat's could not convert a few documents)
* runs 2x to 4x faster!
* is freely available under the GPL
* has a very flexible API
.
.
.
A relevant example:
Console.Title = "Loading PreLex Raw";
var preLexHtmls = BIO.Load<ConcurrentDictionary<string, string>>("all.prelex.htmls.bio");
var preLexDGs = new ConcurrentDictionary<string, string>();

Console.Title = "Extracting DGs";
Parallel.ForEach(preLexHtmls.Keys, link =>
{
    var html = preLexHtmls[link];
    // Convert the raw HTML to XHTML and load it as an XDocument.
    var hxml = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html), tabLength: 0).ReadToXDocument();

    var prs = from ktd in hxml.Descendants("td")
              let font = ktd.Element("font")
              where font != null && font.Value.Clean() == "Primarily responsible" // 1
              let tableName = ktd.p_("table").Descendants("b").FirstOrDefault(b => b.Value == "Adoption by Commission") // 2, 3
              where tableName != null
              // FirstOrDefault, so a missing value cell yields null instead of throwing.
              let vtd = ktd.ElementsAfterSelf("td").FirstOrDefault() // 4
              where vtd != null
              select vtd.Value.Clean();

    var pra = prs.ToArray();
    preLexDGs.TryAdd(link, pra.WhenEmptySingleDefault()[0]);
});
Excerpt from the source HTML:
2 <table BORDER=0 CELLSPACING=0 WIDTH="100%" BGCOLOR="#CECEFF">
    <tr>
      <td width="1%" BGCOLOR="#CCCCCC"> </td>
      <td WIDTH="20%" BGCOLOR="#CCCCCC">
        <font face="Arial">
          <font size=-2>
            <B>16-10-2006</B>
          </font>
        </font>
      </td>
      <td ALIGN=CENTER WIDTH="69%" BGCOLOR="#A0A0FF">
        <font face="Arial">
          <font size=-2>
3           <B>Adoption by Commission</B>
          </font>
        </font>
      </td>
    </tr>
    <tr>
      <td width="3"> </td>
      <td VALIGN=TOP><font face="Arial"><font size=-2>Decision mode:</font></font></td>
      <td VALIGN=TOP><font face="Arial"><font size=-2>Written procedure</font></font></td>
    </tr>
    <tr>
      <td width="3"> </td>
      <td VALIGN=TOP><font face="Arial"><font size=-2>Addressee</font></font></td>
      <td VALIGN=TOP><font face="Arial"><font size=-2>Council; Court of Auditors; European Parliament</font></font></td>
    </tr>
    <tr>
      <td width="3"> </td>
1     <td VALIGN=TOP><font face="Arial"><font size=-2>Primarily responsible</font></font></td>
4     <td VALIGN=TOP><font face="Arial"><font size=-2>DG Health, Consumer Protection</font></font></td>
I plan to release this library as open source within a few days.
Using this method instead of nothing but simple regular expressions gives
both high-quality output and 10x to 20x shorter development times.
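The e-mail example calls three helpers, Clean(), p_() and WhenEmptySingleDefault(), whose definitions do not appear on this page. The sketch below shows one way they could behave. Their behavior here is an assumption guessed from how the example uses them, not the original implementation, and the class name HelperSketch is made up.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class HelperSketch
{
    // Assumed behavior: trim the string and collapse whitespace runs into single spaces.
    public static string Clean(this string s)
    {
        return string.Join(" ", s.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
    }

    // Assumed behavior: the nearest ancestor element with the given name, or null.
    public static XElement p_(this XElement e, XName name)
    {
        return e.Ancestors(name).FirstOrDefault();
    }

    // Assumed behavior: the array itself when non-empty, otherwise a
    // one-element array holding default(T), so that indexing [0] is always safe.
    public static T[] WhenEmptySingleDefault<T>(this T[] items)
    {
        return items.Length > 0 ? items : new[] { default(T) };
    }

    public static void Main()
    {
        var table = XElement.Parse(
            "<table><tr><td><font> Primarily   responsible </font></td><td>DG Health</td></tr></table>");
        var td = table.Descendants("td").First();

        Console.WriteLine(td.Value.Clean());                              // Primarily responsible
        Console.WriteLine(td.p_("table") == table);                       // True
        Console.WriteLine(new string[0].WhenEmptySingleDefault().Length); // 1
    }
}
```

With helpers like these, a page that lacks the "Primarily responsible" row would store null for its link rather than fail on an empty array.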
Remarks
Html2Xhtml is implemented as a binding to html2xhtml, the HTML-to-XHTML converter of the same name written in C.
