Extraction method for webpage text

Abstract

The invention relates to a method for extracting the main text of a web page. The method comprises the following steps: 1) extracting the page title with a regular expression; 2) preprocessing the page; 3) dynamically dividing the page into text blocks; 4) scoring the text blocks and selecting the best one; 5) iteratively expanding the selected text block. The method is fast, works well on many kinds of pages, including news portals, personal blogs, and community forums, and is both accurate and robust.
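The abstract names the five steps but gives no detail on how each is carried out. The following Python sketch is one possible reading, not the patented procedure. Every function name, the blank-line rule for splitting blocks, the length-plus-punctuation score, and the thresholds `max_gap`, `max_distance`, and `min_score` are assumptions made for illustration.

```python
import re

def extract_title(html):
    """Step 1: read the <title> element with a regular expression."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return m.group(1).strip() if m else ""

def preprocess(html):
    """Step 2: drop head, scripts, styles and comments, then strip all tags."""
    html = re.sub(r"(?is)<(script|style|head)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    html = re.sub(r"<[^>]+>", "", html)
    return [line.strip() for line in html.splitlines()]

def split_blocks(lines, max_gap=1):
    """Step 3 (assumed rule): more than max_gap empty lines ends a block.
    Blocks are (first_line, last_line) index pairs, both inclusive."""
    blocks, start, last, gap = [], None, None, 0
    for i, line in enumerate(lines):
        if line:
            if start is None:
                start = i
            last, gap = i, 0
        elif start is not None:
            gap += 1
            if gap > max_gap:
                blocks.append((start, last))
                start, gap = None, 0
    if start is not None:
        blocks.append((start, last))
    return blocks

def score(lines, block):
    """Step 4 (assumed score): text length plus a bonus per punctuation mark."""
    start, end = block
    text = "".join(lines[start:end + 1])
    return len(text) + 10 * len(re.findall(r"[,.;!?,。;!?]", text))

def expand(lines, blocks, best, max_distance=3, min_score=30):
    """Step 5 (assumed rule): keep absorbing a neighbouring block while it is
    close to the current span and scores at least min_score."""
    lo = hi = best
    grown = True
    while grown:
        grown = False
        if (lo > 0
                and blocks[lo][0] - blocks[lo - 1][1] <= max_distance
                and score(lines, blocks[lo - 1]) >= min_score):
            lo -= 1
            grown = True
        if (hi < len(blocks) - 1
                and blocks[hi + 1][0] - blocks[hi][1] <= max_distance
                and score(lines, blocks[hi + 1]) >= min_score):
            hi += 1
            grown = True
    return blocks[lo][0], blocks[hi][1]

def extract(html):
    """Run the five steps and return (title, body_text)."""
    title = extract_title(html)
    lines = preprocess(html)
    blocks = split_blocks(lines)
    if not blocks:
        return title, ""
    best = max(range(len(blocks)), key=lambda i: score(lines, blocks[i]))
    start, end = expand(lines, blocks, best)
    body = "\n".join(line for line in lines[start:end + 1] if line)
    return title, body
```

Calling `extract(html)` on a page string returns the title and the lines of the expanded block. Working on line indices rather than a DOM tree keeps the sketch dependent only on regular expressions, which fits the abstract's emphasis on speed, but the real scoring and expansion criteria may differ.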


Patent Citations (5)

    Publication number | Publication date | Assignee | Title
    CN-101937438-A | January 05, 2011 | 富士通株式会社 | Web page content extraction method and apparatus
    CN-102663023-A | September 12, 2012 | 浙江盘石信息技术有限公司 | Implementation method for extracting web content
    CN-102810097-A | December 05, 2012 | 高德软件有限公司 | Method and device for extracting webpage text content
    CN-103064827-A | April 24, 2013 | 盘古文化传播有限公司 | Method and apparatus for web page content extraction
    JP-4606439-B2 | January 05, 2011 | Mobiders, Inc. | File conversion apparatus for converting an HTML file into a Flash image, and conversion method thereof

Non-Patent Citations (0)

Cited By (0)