[Share] After creating annotations, automatically remove spaces and garbled characters from the content #269

yzy1228682367 · 2024-02-20T12:16:38Z


Is there an existing issue for this?

I have searched the existing issues

Environment

OS: macOS Sonoma 14.1.2
Zotero Version: zotero 7 beta 60
Plugin Version: 1.0.0-beta.34

Describe the feature request

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
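Thanks for developing Actions & Tags! The automation saves a lot of effort. Would it be possible to use Actions & Tags to automatically remove spaces and garbled characters from an annotation's content after the annotation is created?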

Why do you need this feature?
A clear and concise description of why you need this feature.
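Chinese text in PDFs often contains many spaces; even when a line has no spaces, the line breaks introduce them. At the moment I can remove the spaces from selected text with tools such as Shortcuts or Quicker, but could Actions & Tags do this fully automatically? Thanks to the developer!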

Describe the solution you'd like

The solution you'd like
A clear and concise description of what you want to happen.

Alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Anything else?

wakewon · 2024-02-25T16:53:19Z


Introduction

This script removes extra spaces and line breaks from annotations, converts full-width letters and digits to half-width, and standardizes punctuation.

Usage

The script can be triggered automatically (Event: Create Annotation) or manually (Menu Label: In Annotation Menu).

Copy the following script in full into "Data":

Version 1: Rule-based Processing (No networking required)

/**
 * Format Chinese Annotations
 * @author wakewon
 * @usage Create Annotation & In Annotation Menu
 * @link https://github.com/windingwind/zotero-actions-tags/discussions/269
 * @see https://github.com/windingwind/zotero-actions-tags/discussions/269
*/

if (!item) return;

const topItem = Zotero.Items.getTopLevel([item])[0];
const formatLang = ["", "zh", "zh-CN", "zh_CN"];
const lang = topItem.getField("language");

if (!formatLang.includes(lang)) return "[Action: Format Chinese Annotations] Skip due to language";

return await editAnnotation(item);

async function editAnnotation(annotationItem) {
  if (!annotationItem.isAnnotation()) return "[Action: Format Chinese Annotations] Not an annotation item";
  if (!annotationItem.annotationText) return "[Action: Format Chinese Annotations] No text found in this annotation";
  
  annotationItem.annotationText = await formatText(annotationItem.annotationText);
  return;
}

async function formatText(text) {
  const punctuationMap = { '.': '。', ',': '，', '!': '！', '?': '？', ':': '：', ';': '；' };
  const fullWidthToHalfWidth = s => String.fromCharCode(s.charCodeAt(0) - 0xFEE0);

  return text
    .replace(/[\r\n]/g, '') // Remove all line breaks
    .replace(/[\uE5D2\uE5CF\uE5CE\uE5E5]/g, '') // Remove special characters
    .replace(/[Ａ-Ｚａ-ｚ０-９！＂＇（）［］｛｝＜＞，．：；－]/g, fullWidthToHalfWidth) // Full-width to half-width
    .replace(/\s+/g, ' ') // Replace consecutive spaces with a single space
    .replace(/(?<=\d)\s+|\s+(?=\d)/g, '') // Remove spaces around digits
    .replace(/\s*(?=[.,:;!?"()\[\]。？！，、；：“”‘’（）《》【】])|(?<=[.,:;!?"()\[\]。？！，、；：“”‘’（）《》【】])\s*/g, '') // Remove spaces around punctuation
    .replace(/(\S)\s+(?=[\u4e00-\u9fa5])|(?<=[\u4e00-\u9fa5])\s+(\S)/g, '$1$2') // Remove spaces directly before or after a Chinese character
    .replace(/([\u4e00-\u9fa5]+)([,.!?:;]+)/g, (m, c, p) => c + p.split('').map(p => punctuationMap[p]).join('')) // Replace English punctuation marks with Chinese ones
    .replace(/([,.!?:;]+)([\u4e00-\u9fa5]+)/g, (m, p, c) => p.split('').map(p => punctuationMap[p]).join('') + c) // Replace English punctuation marks with Chinese ones
    .replace(/\(([^()]*[\u4e00-\u9fa5][^()]*)\)|\[([^\[\]]*[\u4e00-\u9fa5][^\[\]]*)\]/g, (m, c1, c2) => c1 ? `（${c1}）` : `【${c2}】`) // Use full-width brackets when the bracketed content contains Chinese
    .replace(/([0-9a-zA-Z])（/g, "$1" + String.fromCharCode(0xFF08)) // Keep the full-width （ after a digit or letter (0xFF08 is already （, so the text is unchanged)
    .replace(/）([0-9a-zA-Z])/g, String.fromCharCode(0xFF09) + "$1") // Keep the full-width ） before a digit or letter (0xFF09 is already ）, so the text is unchanged)
    .replace(/([a-zA-Z]+)([,.!?:;]+)([a-zA-Z]+)/g, (m, w1, p, w2) => w1 + p + ' ' + w2) // Add space for English punctuations
    .replace(/(\S)\(/g, '$1 (') // Add space before parenthesis
    .replace(/\)([\u4e00-\u9fa5])/g, ') $1') // Add space after parenthesis
    .replace(/([,.!?:;)])(?!\s|(?<=\.)\d)/g, '$1 ') // Add a space after punctuation if not followed by a space or a digit after '.'
    .replace(/🔤(.*)/g, (match, p1) => p1.trim() ? `\n🔤${p1}` : '🔤'); // Add a newline before 🔤 if there is content after it
}
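
Two of the rules above can be tried in isolation. The following is a minimal standalone sketch, not part of the script: the function names toHalfWidth and stripCjkSpaces are illustrative, and toHalfWidth converts the whole full-width ASCII range rather than the narrower character list the script uses.

```javascript
// Illustrative helpers only; these names do not appear in the script above.

// Full-width ASCII variants (U+FF01 to U+FF5E) sit exactly 0xFEE0 above
// their half-width counterparts, so subtracting the offset converts them.
function toHalfWidth(text) {
  return text.replace(/[\uFF01-\uFF5E]/g, ch =>
    String.fromCharCode(ch.charCodeAt(0) - 0xFEE0));
}

// Drop line breaks, then remove whitespace that sits between two
// Chinese characters (same \u4e00-\u9fa5 range the script uses).
function stripCjkSpaces(text) {
  return text
    .replace(/[\r\n]/g, '')
    .replace(/(?<=[\u4e00-\u9fa5])\s+(?=[\u4e00-\u9fa5])/g, '');
}

console.log(toHalfWidth('ＡＢＣ１２３'));     // ABC123
console.log(stripCjkSpaces('中 文 文\n本')); // 中文文本
```

Unlike the rule in the script, stripCjkSpaces leaves a space between a Chinese character and a Latin word untouched, which may or may not be what you want.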

Version 2: Processing with AI (Requires networking and a valid OpenAI API)

Note that you need a valid OpenAI API key and a network that can reach the OpenAI service (or an alternative service URL that works for you). Before using the script, edit API_KEY in the script below and replace the text inside the double quotes with your OpenAI key.

API_URL must be the full request address, in the same form as in the script. If your service provider gives you only a short address (e.g. api.chatanywhere.com.cn), replace api.openai.com in the script with that domain and keep the rest of the path (i.e., change it to https://api.chatanywhere.com.cn/v1/chat/completions).

/**
 * AI Normalize Punctuation
 * This script standardizes punctuation in the selected text, handling both Chinese and English punctuation.
 * It uses the OpenAI API for text processing.
 * 
 * @usage In Annotation Menu
 * @link https://github.com/windingwind/zotero-actions-tags/discussions/269
 * @see https://github.com/windingwind/zotero-actions-tags/discussions/269
 */

/** { 👍 "openai" } service provider */
const SERVICE = "openai";

// OpenAI API configuration
const OPENAI = {
  API_KEY: "InputYourKeyHere", // Replace with your OpenAI API key.
  MODEL: "gpt-3.5-turbo", // Default model name; change as needed.
  API_URL: "https://api.openai.com/v1/chat/completions", // Full request address; change as needed.
};

if (!item) return;

const topItem = Zotero.Items.getTopLevel([item])[0];
const formatLang = ["", "zh", "zh-CN", "zh_CN"];
const lang = topItem.getField("language"); // Language field of the parent item

if (!formatLang.includes(lang)) return "[Action: Format Chinese Annotations] Skip due to language";

return await normalizePunctuation(item);

async function normalizePunctuation(annotationItem) {
  if (!annotationItem.isAnnotation()) return "[Action: AI Normalize Punctuation] Not an annotation item";
  if (!annotationItem.annotationText) return "[Action: AI Normalize Punctuation] No text found in this annotation";


  const selectedText = annotationItem.annotationText;
  let result;
  let success;

  switch (SERVICE) {
    case "openai":
      ({ result, success } = await callOpenAI(selectedText));
      break;
    default:
      result = "Service Not Found";
      success = false;
  }

  if (success) {
    annotationItem.annotationText = `${result}`;
    return `Formatted Text: ${result}`;
  } else {
    return `Error: ${result}`;
  }
}

async function callOpenAI(text) {
  const prompt = `
  Please standardize the punctuation in the following text, using Chinese punctuation for Chinese content. Return only the corrected text:
  ${text}
  `;

  const data = {
    model: OPENAI.MODEL,
    messages: [
      { role: "system", content: "You are a helpful language assistant." },
      { role: "user", content: prompt }
    ],
    max_tokens: 1000,
    temperature: 0.2,
  };

  try {
    const xhr = await Zotero.HTTP.request(
      "POST",
      OPENAI.API_URL,
      {
        headers: {
          'Authorization': `Bearer ${OPENAI.API_KEY}`,
          'Content-Type': 'application/json; charset=utf-8',
        },
        body: JSON.stringify(data),
        responseType: "json",
      }
    );

    if (xhr && xhr.status && xhr.status === 200 && xhr.response.choices && xhr.response.choices.length > 0) {
      return {
        success: true,
        result: xhr.response.choices[0].message.content.trim(),
      };
    } else {
      return {
        result: xhr.response.error ? xhr.response.error.message : 'Unknown error',
        success: false,
      };
    }
  } catch (error) {
    console.error('Error calling OpenAI API:', error);
    return {
      result: error.message,
      success: false,
    };
  }
}
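
The API_URL rule described before the script (swap the domain, keep the path) can be sketched as a small helper. This is only an illustration: buildApiUrl is a hypothetical name, and it assumes your provider serves the same /v1/chat/completions path as OpenAI.

```javascript
// Hypothetical helper: builds the full request address from a bare host name.
// Assumes the provider uses the same path as api.openai.com.
function buildApiUrl(host) {
  // Strip any scheme and trailing slashes that were pasted along with the host.
  const bare = host.trim().replace(/^https?:\/\//, '').replace(/\/+$/, '');
  return `https://${bare}/v1/chat/completions`;
}

console.log(buildApiUrl('api.chatanywhere.com.cn'));
// https://api.chatanywhere.com.cn/v1/chat/completions
```

The resulting string is what would go into API_URL in the OPENAI configuration object.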

Customized Usage

Skip documents in specific languages

This script only processes PDF documents under items whose language field is zh, zh-CN, or zh_CN, or whose language field is empty.

To process documents in more languages, add them to const formatLang = ["", "zh", "zh-CN", "zh_CN"];
To process documents in any language, add two slashes // before if (!formatLang.includes(lang)) return "[Action: Format Chinese Annotations] Skip due to language";
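
As a rough sketch of the language filter with one extra language added: the "en" entry and the shouldFormat wrapper below are assumptions for illustration only, since the script itself checks formatLang.includes(lang) inline.

```javascript
// Illustrative only: "en" is an example addition, and shouldFormat is a
// hypothetical wrapper around the script's inline includes() check.
const formatLang = ["", "zh", "zh-CN", "zh_CN", "en"];

function shouldFormat(lang, allowed = formatLang) {
  // An empty language field is represented by "" in the list.
  return allowed.includes(lang ?? "");
}

console.log(shouldFormat("en")); // true
console.log(shouldFormat("fr")); // false
console.log(shouldFormat(""));   // true
```

The match is exact, so a value such as en-US would still be skipped unless it is added to the list as well.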

Turn off alert pop-ups

To turn off a particular pop-up alert, delete the double quotes and the text inside them after the corresponding return in the code.
For example, to turn off the alert shown when an item is skipped because of its language, change return "[Action: Format Chinese Annotations] Skip due to language"; to return;.
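
If you would rather silence every alert at once than edit each return, one possible approach is a single switch. This is a sketch under assumptions: SHOW_ALERTS and notify are hypothetical names that do not exist in the script, and it relies on the behavior described above, where a returned string shows a pop-up and a bare return does not.

```javascript
// Hypothetical switch: set to true to show the messages again.
const SHOW_ALERTS = false;

// Returns the message when alerts are enabled, otherwise undefined,
// which is the same value a bare `return;` produces.
function notify(message) {
  return SHOW_ALERTS ? message : undefined;
}

console.log(notify("[Action: Format Chinese Annotations] Skip due to language")); // undefined
```

In the script, each return "..."; would then become return notify("...");.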

Acknowledgements

This script is mainly based on #107 and #220, and most of the code was written with the help of gpt-4o. Thanks again to the original script authors for their help, and to GPT for its strong support!


yzy1228682367 · 2024-02-26T06:27:49Z

Thanks a lot! This is amazing!


yslemmo · 2024-04-20T12:03:00Z

QingDi0817 · 2024-09-13T06:31:08Z

Sinoftj · 2024-12-09T15:16:38Z

I don't know why, but no matter which "Event" I choose, the script never runs automatically; I can only start it manually...
