
Text Extraction

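Foxit PDF SDK for Android exposes APIs for text extraction, selection, search, and retrieval through its Core SDK. The text content of a PDF is organized by TextPage objects, each of which corresponds to exactly one page. From a TextPage you can obtain the characters and words on a page, the text within a given character range, and the text inside a given rectangle.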

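TextPage also serves as the basis for other text-related features:

  • Text search: construct a TextSearch from a TextPage.
  • Hyperlink access: construct a PageTextLinks from a TextPage.
  • Selection highlighting and region calculation: use a TextPage to compute the text rectangles between a selection's start and end points.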

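How to get the text area on a page from selection start and end points

This example shows how to compute the text rectangles between a selection's start and end points, for use cases such as selection highlighting or range annotation.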

java
import com.foxit.sdk.PDFException;
import com.foxit.sdk.common.Constants;
import com.foxit.sdk.common.fxcrt.PointF;
import com.foxit.sdk.common.fxcrt.RectF;
import com.foxit.sdk.pdf.PDFPage;
import com.foxit.sdk.pdf.TextPage;

import java.util.ArrayList;

// Get the text area on page by selection.
// The starting selection position and ending selection position are specified by startPos and endPos.
public ArrayList<RectF> getTextRectsBySelection(PDFPage page, PointF startPos, PointF endPos) {
    try {
        // If the page hasn't been parsed yet, throw an exception.
        if (!page.isParsed()) {
            throw new PDFException(Constants.e_ErrNotParsed, "PDF Page should be parsed first");
        }

        // Create a text page from the parsed PDF page.
        TextPage textPage = new TextPage(page, TextPage.e_ParseTextNormal);
        if (textPage == null || textPage.isEmpty()) return null;

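        // Find the characters nearest to each selection point (tolerance of 5, in PDF units).
        int startCharIndex = textPage.getIndexAtPos(startPos.getX(), startPos.getY(), 5);
        int endCharIndex = textPage.getIndexAtPos(endPos.getX(), endPos.getY(), 5);
        // getIndexAtPos is expected to return -1 when no character lies within the tolerance.
        if (startCharIndex < 0 || endCharIndex < 0) return null;

        // getTextRectCount requires the start index to be lower than or equal to the end index.
        // Derive both bounds from the original values: overwriting startCharIndex first and then
        // comparing against it would lose the larger index when the selection runs backwards.
        int firstCharIndex = Math.min(startCharIndex, endCharIndex);
        int lastCharIndex = Math.max(startCharIndex, endCharIndex);

        // The second argument is a character count, so this range stops just before lastCharIndex;
        // pass lastCharIndex - firstCharIndex + 1 to include the character under the end point.
        int count = textPage.getTextRectCount(firstCharIndex, lastCharIndex - firstCharIndex);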
        if (count > 0) {
            ArrayList<RectF> array = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                RectF rectF = textPage.getTextRect(i);
                if (rectF == null || rectF.isEmpty()) continue;
                array.add(rectF);
            }
            // The returned rects are in PDF units.
            // To highlight them on screen, the caller must first convert them to device units.
            return array;
        }
    } catch (PDFException e) {
        e.printStackTrace();
    }
    return null;
}
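
The rectangles returned above are in PDF units, so they need converting before they can be drawn on screen. The following is a minimal sketch of that arithmetic, not the SDK's own conversion routine: it assumes an unrotated page whose lower-left corner sits at the PDF origin, a uniform scale factor, and a device origin at the top-left. `PdfToDeviceSketch`, its `Rect` holder, and the parameters `pageHeightPt` and `scale` are hypothetical names introduced for illustration. In a real app, prefer the conversion facilities the SDK or viewer provides, since they can account for rotation, page boxes, and scrolling.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these types are hypothetical stand-ins, not Foxit SDK classes.
final class PdfToDeviceSketch {

    /** Plain rectangle holder. In PDF space top > bottom; in device space top < bottom. */
    static final class Rect {
        final float left;
        final float top;
        final float right;
        final float bottom;

        Rect(float left, float top, float right, float bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }
    }

    /**
     * Maps one rectangle from PDF space (origin bottom-left, y pointing up)
     * to device space (origin top-left, y pointing down).
     *
     * @param pdf          rectangle in PDF units
     * @param pageHeightPt page height in PDF units (assumed, e.g. 792 for US Letter)
     * @param scale        device pixels per PDF unit (assumed uniform in x and y)
     */
    static Rect toDevice(Rect pdf, float pageHeightPt, float scale) {
        float left = pdf.left * scale;
        float right = pdf.right * scale;
        // Flip the y axis: distance from the top of the page becomes the device y.
        float top = (pageHeightPt - pdf.top) * scale;
        float bottom = (pageHeightPt - pdf.bottom) * scale;
        return new Rect(left, top, right, bottom);
    }

    /** Converts every rectangle in the list, preserving order. */
    static List<Rect> toDevice(List<Rect> pdfRects, float pageHeightPt, float scale) {
        List<Rect> deviceRects = new ArrayList<>(pdfRects.size());
        for (Rect rect : pdfRects) {
            deviceRects.add(toDevice(rect, pageHeightPt, scale));
        }
        return deviceRects;
    }
}
```

For example, on a page 792 units tall rendered at 2 pixels per unit, a PDF rectangle with left 72, top 720, right 144, bottom 700 maps to left 144, top 144, right 288, bottom 184 in device pixels.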

API Reference

For the complete TextPage interface, see the API reference manual.