Html2Xhtml

From CORSIS

Html2Xhtml is a .NET 4.0 library for converting HTML to XHTML. It is licensed under the GPLv2 or later.

I tested Html2Xhtml in the local reconstruction of a large online database of the European Union. Tidy/Tidy.NET failed to produce valid output most of the time, and Chilkat's HTML-to-XML was somewhat slow and produced strange results (misplaced, missing, or inexplicable elements). I created this library in an attempt to find a free, fast, and reliable conversion tool. In my tests it converted 2 to 4 times faster than the other libraries I tried.

Html2Xhtml, combined with the power of LINQ to XML, is well suited to large-scale data extraction and web crawling scenarios.
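
As a rough illustration of the LINQ to XML half of that pairing, the sketch below queries a string that is already well-formed XHTML. It uses only System.Xml.Linq; the LinkExtractor class, the ExtractLinks method and the sample markup are made up for this example and are not part of Html2Xhtml. When the XHTML namespace is kept in the document, element names have to be qualified with it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class LinkExtractor
{
    // XHTML elements live in this namespace when it is kept in the document.
    static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Returns the href of every <a> element in a well-formed XHTML string.
    public static List<string> ExtractLinks(string xhtml)
    {
        var doc = XDocument.Parse(xhtml);
        return doc.Descendants(Xhtml + "a")
                  .Select(a => (string)a.Attribute("href")) // null when href is absent
                  .Where(href => href != null)
                  .ToList();
    }

    static void Main()
    {
        // Hypothetical sample input, standing in for converter output.
        var xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<p><a href=\"a.html\">A</a><br /><a href=\"b.html\">B</a><a>no href</a></p>" +
            "</body></html>";

        foreach (var link in ExtractLinks(xhtml))
            Console.WriteLine(link);
    }
}
```

If the namespace has been stripped from the document, the plain name "a" can be passed to Descendants instead.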

Download

Documentation

C# Example

Program

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using Corsis.Xhtml;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            Html2XhtmlBasicExample();
            Html2XDocumentExample();
        }

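        // Deliberately malformed input: upper-case tags, an unclosed <p>, unquoted
        // attribute values, an unknown attribute and a stray </a>.
        static string html = "<html><head><TITLE>title</TITLE></head><body>I♥NY<p>b<br>c:±<img src=2 nonsense=x></a><font size=2>c</font></body></html>";

        // Converts the HTML and prints the resulting XHTML as text.
        static void Html2XhtmlBasicExample()
        {
            var xhtml = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToEnd();

            Console.WriteLine(xhtml);
        }

        // Converts the HTML and loads the result into a LINQ to XML XDocument.
        static void Html2XDocumentExample()
        {
            var xdoc = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToXDocument(keepXhtmlNamespace: true);

            Console.WriteLine(xdoc);
        }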
    }
}

Output

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
      title
    </title>
  </head>
  <body>
    I♥NY
    <p>
      b<br />c:±<img src="2" alt="" /><font size="2">c</font>
    </p>
  </body>
</html>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"[]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
      title
    </title>
  </head>
  <body>
    I♥NY
    <p>
      b<br />c:±<img src="2" alt="" /><font size="2">c</font></p></body>
</html>

Initial E-Mail

May contain obsolete code!

.
.
.

=Technical improvement=

Until today we used a commercial library to walk HTML documents cleanly: http://www.chilkatsoft.com/buyHtmlToXml.asp

Now, in collaboration with a colleague (http://www.it.uc3m.es/jaf/), I have developed a completely new conversion routine.

    * delivers perfect results, always! (Chilkat's could not convert a few documents)
    * runs 2x to 4x faster!
    * is freely available under the GPL
    * has a very flexible API

.
.
.

A relevant example:

            Console.Title = "Loading PreLex Raw";
            var preLexHtmls = BIO.Load<ConcurrentDictionary<string, string>>("all.prelex.htmls.bio");
            var preLexDGs = new ConcurrentDictionary<string, string>();
            Console.Title = "Extracting DGs";
            Parallel.ForEach(preLexHtmls.Keys, link =>
            {
                var html = preLexHtmls[link];
                var hxml = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html), tabLength: 0).ReadToXDocument();
                var prs = from ktd in hxml.Descendants("td")
                          let font = ktd.Element("font")
                          where font != null && font.Value.Clean() == "Primarily responsible" // 1
                          let tableName = ktd.p_("table").Descendants("b").FirstOrDefault(b => b.Value == "Adoption by Commission") // 2, 3
                          where tableName != null
                          let vtd = ktd.ElementsAfterSelf("td").FirstOrDefault() // 4
                          where vtd != null
                          select vtd.Value.Clean();
                var pra = prs.ToArray();
                preLexDGs.TryAdd(link, pra.WhenEmptySingleDefault()[0]);
            });

Excerpt from the source HTML (the numbers at the left margin correspond to the numbered comments in the code above):

2    <table BORDER=0 CELLSPACING=0 WIDTH="100%" BGCOLOR="#CECEFF">
        <tr>
            <td width="1%" BGCOLOR="#CCCCCC"> </td>
            <td WIDTH="20%" BGCOLOR="#CCCCCC">

                <font face="Arial">
                <font size=-2>
                <B>16-10-2006</B>
                </font>
                </font>
            </td>
            <td ALIGN=CENTER WIDTH="69%" BGCOLOR="#A0A0FF">
                    <font face="Arial">

                    <font size=-2>
3                    <B>Adoption by Commission</B>
                    </font>
                    </font>
            </td>
        </tr>
                <tr>
                <td width="3"> </td>
                <td VALIGN=TOP><font face="Arial"><font size=-2>Decision mode:</font></font></td>
                <td VALIGN=TOP><font face="Arial"><font size=-2>Written procedure</font></font></td>
                </tr>
            <tr>
            <td width="3"> </td>
            <td VALIGN=TOP><font face="Arial"><font size=-2>Addressee</font></font></td>
            <td VALIGN=TOP><font face="Arial"><font size=-2>Council; Court of Auditors; European Parliament</font></font></td>
            </tr>
            <tr>
            <td width="3"> </td>
1            <td VALIGN=TOP><font face="Arial"><font size=-2>Primarily responsible</font></font></td>
4            <td VALIGN=TOP><font face="Arial"><font size=-2>DG Health, Consumer Protection</font></font></td>

I plan to make this library available as open source within a few days.
Using this method instead of plain regular expressions yields both
high-quality output and development times that are 10x to 20x shorter.
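
The query in the e-mail relies on helpers (Clean(), p_() and WhenEmptySingleDefault()) that are not shown on this page. A minimal sketch of the same label/value lookup using only standard LINQ to XML might look as follows. It assumes the document has no XHTML namespace, uses Trim() in place of Clean() and Ancestors("table") in place of p_("table"), which may not match what the original helpers do. The LabelLookup class and FindValue method are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class LabelLookup
{
    // Finds the text of the <td> that follows the <td> whose text equals `label`,
    // restricted to cells whose nearest enclosing <table> contains a <b> element
    // with the text `tableTitle`. Returns null when nothing matches.
    public static string FindValue(XDocument doc, string tableTitle, string label)
    {
        var values =
            from labelCell in doc.Descendants("td")
            where labelCell.Value.Trim() == label
            let table = labelCell.Ancestors("table").FirstOrDefault() // nearest table
            where table != null
               && table.Descendants("b").Any(b => b.Value.Trim() == tableTitle)
            let valueCell = labelCell.ElementsAfterSelf("td").FirstOrDefault()
            where valueCell != null
            select valueCell.Value.Trim();

        return values.FirstOrDefault();
    }

    static void Main()
    {
        // Reduced, already well-formed version of the excerpt above.
        var doc = XDocument.Parse(
            "<table>" +
            "<tr><td><b>Adoption by Commission</b></td></tr>" +
            "<tr><td><font>Primarily responsible</font></td>" +
            "<td><font>DG Health, Consumer Protection</font></td></tr>" +
            "</table>");

        // prints "DG Health, Consumer Protection"
        Console.WriteLine(FindValue(doc, "Adoption by Commission", "Primarily responsible"));
    }
}
```

Using FirstOrDefault() rather than First() keeps the null checks meaningful: First() throws on an empty sequence instead of returning null.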

Remarks

Html2Xhtml is implemented as a binding to html2xhtml, the HTML-to-XHTML converter of the same name written in C.
