關於 Prompt engineering 我只是略懂:來自 OpenAI 的優良提示詞工程六大策略



策略一:提供 LLM 清晰的指引

LLM 讀不懂你的心思。如果輸出太長,就請求簡短回答。如果輸出太簡單,就要求專家級的寫作。如果你不喜歡格式,就展示你想看到的格式。模型猜測你想要什麼的需求越少,你獲得想要的結果的可能性就越大。

在你的查詢中包含細節,以獲得更相關的答案(Include details in your query to get more relevant answers)

先看以下的範例

🚫 不好的提示詞:問題太籠統,沒有提供足夠的上下文或具體需求,讓模型難以提供精確的答案。

How do I add numbers in Excel?

✅ 較好的提示詞:提供了具體的操作場景和期望的結果,使模型能夠提供針對性的解決方案。

How do I add up a row of dollar amounts in Excel? I want to do this automatically for a whole sheet of rows with all the totals ending up on the right in a column called "Total".

🚫 不好的提示詞:問題過於模糊,沒有指定國家或時間點,讓模型無法提供有意義的回答。

Who’s president?

✅ 較好的提示詞:明確指定了國家、時間點和額外的問題(選舉頻率),提供了足夠的信息,以便模型提供具體且相關的回答。

Who was the president of Mexico in 2021, and how frequently are elections held?

🚫 不好的提示詞:缺乏細節,如程式語言偏好、性能要求或程式碼的解釋需求,讓模型無法針對特定需求提供最佳解答。

Write code to calculate the Fibonacci sequence.

✅ 較好的提示詞:指定了程式語言(TypeScript)、效率要求和需要詳細註解的需求,讓模型能提供更專業、易於理解的程式碼。

Write a TypeScript function to efficiently calculate the Fibonacci sequence. Comment the code liberally to explain what each piece does and why it's written that way.

🚫 不好的提示詞:沒有具體指示關於摘要的格式、重點的細節,或是後續行動項目,使得模型的回答可能不夠全面或精確。

Summarize the meeting notes.

✅ 較好的提示詞:提供了詳細的指示關於如何組織訊息,包括摘要的格式、講者和關鍵點的列表,以及後續行動項目,使得模型能提供結構化且全面的回答。

Summarize the meeting notes in a single paragraph. Then write a markdown list of the speakers and each of their key points. Finally, list the next steps or action items suggested by the speakers, if any.

🌟 奠基在上述的提示詞技巧上,以下這段提示詞是我每天在問 ChatGPT 問題之前都會在每一個 thread 的一開頭就先加上去的。透過以下的提示詞我先明確的定義 LLM 要扮演的角色以及應當具備的專業技能,然後我要求他一定要使用繁體中文(zh-TW)以及台灣人所熟悉的單詞、語句跟慣用語來生成我後續問題的答案,盡全力讓他不要產生簡體字跟大陸用語。

From now on, you will be an expert in [Place your desired topic here...]. You are also an expert in [Place your desired topic here...]. You are Taiwanese. You will answer all my questions in zh-TW and use wordings, phrases, and idioms in Taiwan instead of Mainland China.

要求模型採用一個人物角色(Ask the model to adopt a persona)

SYSTEM:When I ask for help to write something, you will reply with a document that contains at least one joke or playful comment in every paragraph.

USER:Write a thank you note to my steel bolt vendor for getting the delivery in on time and in short notice. This made it possible for us to deliver an important order.

要求 LLM 模擬某一個 Persona 來進行答案的推論,能讓 LLM:

  • 提升回答的相關性和精準度:透過模擬特定的人物角色,可以讓模型更清楚地理解用戶的需求和語境,從而提供更加貼切和具體的回答。例如,指定一個專家級的 persona 可以讓模型提供更專業、深入的見解。
  • 增加互動的趣味性:給予模型一個特定的人物角色,可以讓回答不僅僅是乾巴巴的資訊傳遞,還可以加入特定的風格或幽默元素,讓互動過程更加生動有趣。
  • 提高用戶體驗:當模型能夠根據用戶指定的 persona 提供服務時,這種個性化的互動方式可以增加用戶的滿意度和參與感,從而提高整體的用戶體驗。
  • 更好地管理期望:透過明確指定模型應該採用的角色,用戶可以更好地設定和管理對於模型回答的期望。這種方式有助於避免模糊不清的指令導致的誤解或不滿足的結果。

使用分隔符號清楚地指示輸入的不同部分(Use delimiters to clearly indicate distinct parts of the input)

如以下範例中的提示詞使用三重引號(triple quotes)這樣的分隔符號來明確標示輸入的不同部分,是一種有效提升語言模型輸出品質和準確性的最佳實務。這種方法使得請求更加直觀、易於理解,並且讓模型能更專注地處理具體的任務要求。

USER:

Summarize the text delimited by triple quotes with a haiku.

"""insert text here"""

以下的範例則是使用了 XML 標籤來對提示詞內容中不同的部位進行了更加明確的區隔

SYSTEM:

You will be provided with a pair of articles (delimited with XML tags) about the same topic. First summarize the arguments of each article. Then indicate which of them makes a better argument and explain why.

USER:

<article> insert first article here </article>

<article> insert second article here </article>

最後一個 prompt 範例則是通過標示「Abstract:」和「Title:」這段指示清晰地分隔了摘要和標題兩部分,使模型能夠輕鬆識別每部分的內容及其功能。這種明確分隔有助於模型理解用戶的具體需求,並針對性地進行處理。

對於模型來說,這種格式化的輸入使得從文本中提取和處理信息變得更加容易。模型可以自動地識別「摘要」和「標題」,並根據這些信息進行相關的操作,如評估標題是否適合摘要,或者基於摘要生成一個新的標題。

這種指示方式不僅限於論文摘要和標題的配對,也可應用於其他需要清晰分類或結構化輸入的場景,提高了指示的通用性和靈活性。

SYSTEM:

You will be provided with a thesis abstract and a suggested title for it. The thesis title should give the reader a good idea of the topic of the thesis but should also be eye-catching. If the title does not meet these criteria, suggest 5 alternatives.

USER:

Abstract: insert abstract here

Title: insert title here
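上述以分隔符號區隔提示詞部位的做法,也可以在程式中組裝。以下是一個最小的 Python 示意(build_summary_prompt、build_compare_prompt 皆為假設性的輔助函式,並非任何官方 API):

```python
def build_summary_prompt(text: str) -> str:
    """用三重引號把待摘要的原文與指令明確區隔開來。"""
    return f'Summarize the text delimited by triple quotes with a haiku.\n\n"""{text}"""'


def build_compare_prompt(article_a: str, article_b: str) -> str:
    """用 <article> XML 標籤標示兩篇待比較的文章。"""
    return (
        "First summarize the arguments of each article. "
        "Then indicate which of them makes a better argument and explain why.\n\n"
        f"<article>{article_a}</article>\n\n"
        f"<article>{article_b}</article>"
    )


prompt = build_summary_prompt("insert text here")
```

這樣組裝出來的字串,就是前面範例中分隔符號寫法的程式化版本。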

指定完成任務所需的步驟(Specify the steps required to complete a task)

為何以下的提示詞範例是 OpenAI 建議的最佳實務?

  • 步驟化指令:將任務分解成有序的步驟,使得模型能夠更容易地理解並按順序執行任務。這種逐步引導的方式有助於降低複雜性,使模型能夠更專注於每個步驟的具體要求。
  • 明確的任務界定:每個步驟都有明確的指示和預期輸出,比如「將文本摘要成一句話」和「將摘要翻譯成西班牙語」。這樣的明確性確保模型可以清楚地知道每一步要做什麼,並且知道成功完成任務的標準是什麼。
  • 輸入與輸出格式的預設:通過要求輸出以特定的前綴開始,如「Summary:」和「Translation:」,這不僅組織了模型的回應,也為解析和理解模型輸出提供了便利。這種格式化的要求有助於用戶更容易地識別和使用模型的輸出。
  • 促進跨語言理解:透過要求模型將摘要翻譯成另一種語言,這種提示詞寫法鼓勵模型應用其跨語言處理能力,這不僅展示了模型的多語言能力,也提供了一種實用的多語言輸出。
  • 鼓勵精確與簡潔:首先要求模型將文本摘要成一句話,這要求模型能夠捕捉到文本的核心意義並精煉表達,這種能力對於提高信息的準確性和易於理解性是非常重要的。
SYSTEM:

Use the following step-by-step instructions to respond to user inputs.

Step 1 - The user will provide you with text in triple quotes. Summarize this text in one sentence with a prefix that says "Summary: ".

Step 2 - Translate the summary from Step 1 into Spanish, with a prefix that says "Translation: ".

USER:
"""insert text here"""

提供範例(Provide examples)

也就是我們經常在說的 Few-shot learning(少樣本學習)

通過提供範例來示範給 LLM 去理解應該如何進行推論比試圖用文字描述要求來得更簡單、更直接。尤其是當預期的推論結果難以用一般性語言清楚定義時,提供幾個具體的回答範例可以幫助模型「抓住」所需的回答風格、格式、pattern 和內容深度。

SYSTEM:Answer in a consistent style.

USER:Teach me about patience.

ASSISTANT:

The river that carves the deepest valley flows from a modest spring; the grandest symphony originates from a single note; the most intricate tapestry begins with a solitary thread.

USER:Teach me about the ocean.

指定輸出的期望長度(Specify the desired length of the output)

指定輸出的目標長度是一個有效的策略,可以幫助模型生成更符合用戶需求的回答。這種策略的優點在於:

  • 滿足特定需求:有時用戶對輸出的長度有明確的需求,例如在限定篇幅的文章、報告或簡報中。通過指定目標長度,可以確保 LLM 生成的內容適合用戶的特定格式或空間限制。
  • 提高可讀性和可用性:根據目標讀者或使用場景,某些信息可能更適合簡短直接,而其他情況則可能需要更詳盡的解釋。通過控制輸出的長度,可以幫助 LLM 產生更符合目標讀者期待的內容。
  • 效率和準確性:雖然 LLM 在生成特定數量的單詞方面可能不夠精準,但它在產生具有特定段落數量或項目點的輸出方面更為可靠。這種方法有助於提高內容的組織結構和閱讀流暢性,使得信息傳達更為有效。
  • 靈活適應:在一些情境下,用戶可能需要較短的回答以快速獲得資訊,而在另一些情況下,則可能需要更詳細的解釋以深入理解某個主題。指定輸出長度的策略賦予了 LLM 適應這些不同需求的能力。
USER:Summarize the text delimited by triple quotes in about 50 words.

"""insert text here"""

USER:Summarize the text delimited by triple quotes in 2 paragraphs.

"""insert text here"""

USER:Summarize the text delimited by triple quotes in 3 bullet points.

"""insert text here"""

策略二:Provide reference text

LLM 在被詢問關於深奧主題或要求提供引用和網址時,可能會自信地捏造虛假答案。就像 Open book 的方式可以幫助學生在考試中取得更好的成績一樣,提供參考文本給 LLM 可以幫助它們減少謎之自信般捏造答案的機率。

Tactic: Instruct the model to answer using a reference text(提供 LLM 可參照的文本去協助推論)

在有限的 context window 長度內,於提示詞中盡可能納入輔助的 fact knowledge,或是直接附上希望 LLM 協助摘要重點的原始長文本,都是協助 LLM 進行高品質回答推論非常有效的方法。

SYSTEM:

Use the provided articles delimited by triple quotes to answer questions. If the answer cannot be found in the articles, write "I could not find an answer."

USER:

<insert articles, each delimited by triple quotes>

Question: <insert question here>

Tactic: Instruct the model to answer with citations from a reference text

當輸入的 prompt context 中包含了相關的知識或文件時,你可以指示 LLM 在回答問題時引用其中的特定段落。也就是說,如果你提供了相關的文件原始語料或知識給 LLM,你可以要求模型在給出答案時,加上引用來自這些 context 的語句。這樣做的好處是,這些引用的準確性可以通過在提供的文檔中進行字串匹配來進行程式化的驗證。簡單來說,這個策略就是利用已有的知識來增強模型的回答品質,並且通過引用確切的資料來源,來提高回答的可信度和透明度。這對於需要精確資訊來源驗證的場合特別有用,像是學術研究、論文閱讀、法律分析等等。

  • 明確性:它提供了非常明確的指示,告訴 LLM 需要如何回應。這包括只使用提供的文件來回答問題,以及如果文件中沒有足夠的信息來回答問題時該如何處理。
  • 結構化輸入:通過將文件和問題的輸入格式化,它使得輸入對模型來說更加結構化,從而容易解析和理解。
  • 引用要求:要求引用文件中的相應段落來支持答案,這不僅增加了回答的可靠性,也使得答案的根據變得透明。這對於確認模型提供的信息的準確性和來源非常重要。
  • 格式化引用:通過指定引用的格式({"citation": …}),這個 prompt 進一步提升了回答的結構性和可讀性,使得後續的處理或驗證變得更加簡單。
  • 錯誤處理:這個 prompt 明確包含了一種錯誤處理機制(即當脈絡中資訊不足以回答問題時如何回應),這增加了模型應對不同情況的能力,並減少了生成不確定或無關答案的風險。
SYSTEM:

You will be provided with a document delimited by triple quotes and a question. Your task is to answer the question using only the provided document and to cite the passage(s) of the document used to answer the question. If the document does not contain the information needed to answer this question then simply write: "Insufficient information." If an answer to the question is provided, it must be annotated with a citation. Use the following format to cite relevant passages ({"citation": …}).

USER:

"""<insert document here>"""

Question: <insert question here>

以下我嘗試使用這項 prompt 技巧,將一位朋友的某篇網誌內文當作 context information 帶入 LLM,問它這位朋友究竟上週去新加坡的 Google 總部做了些什麼?

You will be provided with a document delimited by triple quotes and a question. Your task is to answer the question using only the provided document and to cite the passage(s) of the document used to answer the question. If the document does not contain the information needed to answer this question then simply write: "Insufficient information." If an answer to the question is provided, it must be annotated with a citation. Use the following format to cite relevant passages ({"citation": …}).

"""
Blog Title: Developer Creators and Online Communities Summit and The Hitchhiker's Guide to the NFC Tag

Blog Content:

I had the privilege of attending the Developer Creators and Online Communities Summit held in Singapore from January 24th to January 27th. This event brought together tech enthusiasts and creative individuals from all around the world. Being a part of this summit was more than just a learning experience; it offered a unique platform to forge connections with people from Vietnam, Mongolia, India, Kyrgyzstan, Egypt, Mauritius, among others.

A big shoutout to the organizing team for their outstanding efforts in making this summit a reality. Initially, I wondered, "Why do we even need an agenda when so many content creators and community organizers are gathered together?" However, it became evident that countless hours were invested to ensure the event was not only exceptional but also inclusive and joyful.

One night this week, my wife asked me, "After a busy day, how do you still seem to be full of energy?" I replied, "I guess I overcharged myself last week. So much energy."

To make our meeting more memorable, I've prepared some small gifts — NFC Tags (officially known as NTAG215. You can buy it on Amazon). Although these cards are not widely known by the public, mainstream iOS and Android smartphones can both read and write information to them through an app. This enables sharing of contact cards, URLs, sending text messages, and even launching iOS Shortcuts to perform complex tasks, perfectly demonstrating the fusion of technology and creativity.

NFC Tags, based on Near Field Communication (NFC) technology, can store information and interact with devices equipped with NFC capabilities. These cards utilize NDEF (NFC Data Exchange Format) to store up to 504 bytes of data, and they are rewritable, making them akin to tiny flash drives. For gamers, an exciting feature of NTAG215 cards is their compatibility with Nintendo Switch Amiibo figures. This allows for the creation of custom Amiibo cards that store favorite game characters and data, eliminating the need to carry physical figures.

I use an iOS app named NFC.cool Tools for writing information to them. My knowledge of the Android ecosystem is limited, but I'm confident you can find a suitable tool. Once you've written the information to an NFC card, you can read it with your smartphone. Check out the demo video:

If you're looking to dive deeper into NFC or Bluetooth technology, I think the Flipper Zero is an insanely cool gadget worth considering!

Last but not least, I use Insta360 to record the Gardens by the Bay and the Light & Water Show. You can check out the videos:

"""

Question: What the hell is the author of this article doing in Singapore?

以下則是 ChatGPT 給我的回答:

{"citation": "I had the privilege of attending the Developer Creators and Online Communities Summit held in Singapore from January 24th to January 27th. This event brought together tech enthusiasts and creative individuals from all around the world."}

作者參加了在新加坡舉辦的「開發者創造者與線上社群峰會」,這是一個聚集了來自全世界的科技愛好者和創意人士的活動。

各位可以看到 ChatGPT 的回應準確地回答了我的問題,並精確地指出它是引用整篇文章中哪一段的內容作為回答的依據。這個 prompt 技巧很棒,也正是 OpenAI 原文所說「可程式化驗證」的原因。
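所謂「可程式化驗證」,可以用一個最小的 Python 草稿來示意:從模型回答中擷取 {"citation": …} 片段,再以字串比對確認每段引文都逐字出現在原始文件中(verify_citations 為假設性的函式名稱,正規表達式也僅處理最單純的情況):

```python
import json
import re


def verify_citations(answer: str, document: str) -> bool:
    """擷取回答中所有 {"citation": "..."} 片段,
    並以字串比對確認每段引文都逐字出現在原始文件中。"""
    citations = [
        json.loads(m)["citation"]
        for m in re.findall(r'\{"citation":\s*".*?"\}', answer)
    ]
    return bool(citations) and all(c in document for c in citations)


doc = "I had the privilege of attending the Summit held in Singapore."
good = '{"citation": "attending the Summit held in Singapore"} 作者參加了峰會。'
bad = '{"citation": "attending a conference in Tokyo"} 作者去了東京。'
```

只要引文無法在原始文件中逐字找到,驗證就會失敗,藉此可以自動抓出模型捏造的引用。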

策略三:Split complex tasks into simpler subtasks

Tactic: Use intent classification to identify the most relevant instructions for a user query

OpenAI 原文在講述處理複雜任務時指出,尤其是面對需要大量獨立指令集來應對不同情況的任務,先對查詢的類型進行分類,再根據分類決定需要哪些具體指令來處理任務,是一個能大大提升推論品質的策略。做法是定義固定的類別,並為每個類別編寫相應的處理指令;或是使用遞迴的方式,將一項任務分解為一系列階段。

採用這種策略的優勢在於,每次查詢只包含執行任務下一階段所需的指令,相較於用單一查詢完成整個任務,這種方法可以降低錯誤率。此外,由於更大的提示會消耗更多的運算資源,從而導致更高的成本,這種逐步處理的方法也可能因為減少了所需的提示大小而降低成本。

總之,這項策略的思路在於將一項複雜任務進行結構化、階段化的迭代處理,通過精確定義任務的各個階段和相關指令來提高處理效率和準確性並同時降低推論所需的算力成本。

接下來我們看看以下的範例:

SYSTEM:

You will be provided with customer service queries. Classify each query into a primary category and a secondary category. Provide your output in json format with the keys: primary and secondary.

Primary categories: Billing, Technical Support, Account Management, or General Inquiry.

Billing secondary categories:
- Unsubscribe or upgrade
- Add a payment method
- Explanation for charge
- Dispute a charge

Technical Support secondary categories:
- Troubleshooting
- Device compatibility
- Software updates

Account Management secondary categories:
- Password reset
- Update personal information
- Close account
- Account security

General Inquiry secondary categories:
- Product information
- Pricing
- Feedback
- Speak to a human

在下好 system prompt 之後,我們可以開始問 LLM 問題如下:

I need to get my internet working again.

上面的 prompt 可以在 ChatGPT 得到以下的輸出結果:

```json
{
  "primary": "Technical Support",
  "secondary": "Troubleshooting"
}
```

以下我們來解析一下為何上述的 prompt text 對 LLM 來說是一段品質優良的指引?

  • 明確的任務定義:Prompt 清晰地定義了任務,將客戶服務查詢分類到主要類別和次要類別中。這種明確性幫助模型理解其需要完成的具體工作。
  • 結構化的輸出格式:要求以 JSON 格式提供輸出,並指定了鍵(keys):primary 和 secondary。這種結構化的輸出格式使得任務的期望輸出變得清晰,並為模型提供了一個具體的輸出模板。
  • 預定義的類別:提示文本提供了一組預定義的主要類別和次要類別,這減少了模型需要從無限可能中選擇的複雜性,並提高了分類的準確性和一致性。
  • 針對性的案例處理:透過將查詢分類到具體的類別,這個策略允許模型針對不同的情況採取不同的處理指令集,從而提高了處理查詢的效率和準確性。
  • 適用於複雜任務的分解:這種提示方式符合之前討論過的策略,即將複雜任務分解成更容易管理的子任務。這不僅降低了錯誤率,還能更精確地滿足用戶需求。
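把「先分類、再注入對應指令集」的流程寫成程式,大致會像下面這個 Python 草稿(INSTRUCTIONS 與 route 均為假設性的示意,實務上每個類別對應的會是一段完整的 system prompt):

```python
import json

# 假設性的指令集對照表:鍵是 (primary, secondary) 分類,值是下一輪要注入的指令
INSTRUCTIONS = {
    ("Technical Support", "Troubleshooting"): "Ask the user to check that all router cables are connected...",
    ("Billing", "Dispute a charge"): "Collect the charge date and amount from the user...",
}


def route(classification_json: str) -> str:
    """解析 LLM 輸出的 JSON 分類,回傳該類別對應的處理指令。"""
    c = json.loads(classification_json)
    return INSTRUCTIONS.get((c["primary"], c["secondary"]), "Fall back to a human agent.")


llm_output = '{"primary": "Technical Support", "secondary": "Troubleshooting"}'
next_instructions = route(llm_output)
```

每次查詢只注入當前階段需要的指令,正是原文所說降低錯誤率、減少提示大小的做法。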

延伸上一個 prompt 範例,我們再來看一個微調後的 sample prompt

System:

You will be provided with customer service inquiries that require troubleshooting for technical support. Help the user by:

- Ask them to check that all cables to/from the router are connected. Note that it is common for cables to come loose over time.
- If all cables are connected and the issue persists, ask them which router model they are using
- Now you will advise them how to restart their device: 
-- If the model number is MTD-327J, advise them to push the red button and hold it for 5 seconds, then wait 5 minutes before testing the connection.
-- If the model number is MTD-327S, advise them to unplug and replug it, then wait 5 minutes before testing the connection.
- If the customer's issue persists after restarting the device and waiting 5 minutes, connect them to IT support by outputting {"IT support requested"}.
- If the user starts asking questions that are unrelated to this topic then confirm if they would like to end the current chat about troubleshooting and classify their request according to the following scheme:

Classify their query into a primary category and a secondary category. Provide your output in json format with the keys: primary and secondary.

Primary categories: Billing, Technical Support, Account Management, or General Inquiry.

Billing secondary categories:
- Unsubscribe or upgrade
- Add a payment method
- Explanation for charge
- Dispute a charge

Technical Support secondary categories:
- Troubleshooting
- Device compatibility
- Software updates

Account Management secondary categories:
- Password reset
- Update personal information
- Close account
- Account security

General Inquiry secondary categories:
- Product information
- Pricing
- Feedback
- Speak to a human

下好 system prompt 之後我們再問一次以下的問題:

I need to get my internet working again.

在我的實驗場景中,ChatGPT 會回覆我下面的推論結果:

To help get your internet working again, let's start with a basic but crucial step:

Please check that all cables to and from your router are securely connected. It's common for cables to become loose over time, which can disrupt your internet connection. After checking, let me know if everything is properly connected or if the issue persists.

接下來我按照他的指示回答他某一個特定的 router model:

the model name is MTD-327J

接著 ChatGPT 會按照我們先前透過 context 傳入、針對該 router model 的 troubleshooting 步驟回覆:

For the MTD-327J router model, please follow these steps to restart your device:

1. Locate the red button on your router.
2. Push and hold this red button for 5 seconds.
3. After releasing the button, wait for 5 minutes before testing your connection again.

This process helps to reset your router's settings and can often resolve connectivity issues. Please try these steps and let me know if your internet connection starts working again or if the issue persists.

我接著回答 ChatGPT:

It still not working!

ChatGPT 會按照 system prompt 中的指示回覆以下資訊:

{"IT support requested"}

接著我又問:

Do you have any ongoing promotions?

結果 ChatGPT 會按照 system prompt 將我的問題分類成:

```json
{
  "primary": "General Inquiry",
  "secondary": "Pricing"
}
```

最後我問他:

How to make a cake?

ChatGPT 則將會根據 system prompt 裡設定的規則跟我說我的問題已經偏離它能處理的範疇並詢問我是否要終止對話,並將我的問題根據 system prompt 裡設定的格式進行了 primary 與 secondary 的分類。

It seems your question has shifted away from technical support to a general inquiry about baking. Would you like to end the current chat about troubleshooting and focus on how to make a cake, or do you have a technical issue that needs to be addressed first?

For your query on cake-making, here's how I would classify it:

```json
{
  "primary": "General Inquiry",
  "secondary": "Product information"
}
```

同樣的,我們來簡單的拆解一下為何這是一段品質優良的提示詞範例

  • 明確的解決步驟:Prompt 本身提供了一系列清晰、具體的解決步驟來幫助用戶解決技術支援相關的問題,這有助於直接引導用戶針對常見問題進行疑難排解。
  • 針對性的問題解決:Prompt 根據用戶可能遇到的具體狀況(如路由器的不同型號)提供了客製化的指令,這增加了解決問題的效率和成功率。
  • 動態反應能力:當提供的解決方案無效時,Prompt 指示如何將用戶轉接到更專業的技術支援(IT支援),這體現了對用戶需求的靈活應對和深層次的支援。
  • 用戶意圖重定向:如果用戶的提問偏離了當前的疑難排解話題,Prompt 也提供了一個機制來重新分類和處理這些查詢,這保證了不同類型的用戶需求都能被有效識別和處理。

上述的 prompt 有點像是將對話系統轉化為狀態機(State Machine),透過這種方法,可以更有組織地管理對話過程中的狀態轉換和相對應的指令。具體來說,模型被指示發出特定的字串來表示對話的狀態發生了變化。這些特定的字串讓系統知道當前的對話狀態,從而根據該狀態決定注入哪些相關的指令。

在這個框架下,系統會跟蹤當前的狀態,了解在該狀態下哪些指令是相關的,並且還可以選擇性地設定從當前狀態允許的狀態轉換。這樣做的好處是,可以對用戶體驗設置保護欄,避免在較不結構化的對話管理方法中可能遇到的問題。

簡單來說,通過將對話系統設計為狀態機,可以在對話過程中引入更多的結構和控制,使得對話管理更加精確和有效。這種方法尤其適用於需要根據用戶輸入和對話進展動態調整對話流程的情況,比如客服對話、技術支援等場景。透過明確的狀態管理和轉換控制,可以提供更加個性化和高效的用戶體驗。
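上述的狀態機概念可以用以下假設性的 Python 草稿來示意:系統掃描模型輸出中的特定字串(例如 {"IT support requested"})來觸發狀態轉換,再依當前狀態決定下一輪要注入的 system prompt(TRANSITIONS 與 STATE_PROMPTS 皆為虛構的範例資料):

```python
# (from_state, 觸發字串) -> to_state
TRANSITIONS = {
    ("troubleshooting", '{"IT support requested"}'): "it_support",
}

# 每個狀態對應下一輪要注入的 system prompt(此處為縮略示意)
STATE_PROMPTS = {
    "troubleshooting": "Help the user restart their router...",
    "it_support": "Hand off the conversation to a human IT agent.",
}


def next_state(state: str, model_output: str) -> str:
    """掃描模型輸出中的特定字串,決定是否轉換對話狀態。"""
    for (from_state, marker), to_state in TRANSITIONS.items():
        if state == from_state and marker in model_output:
            return to_state
    return state  # 沒有偵測到特殊字串就停留在原狀態


state = next_state("troubleshooting", 'Sorry to hear that. {"IT support requested"}')
```

如此一來,系統永遠知道當前狀態下哪些指令是相關的,也能限制允許的狀態轉換,為用戶體驗設置保護欄。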

Tactic: For dialogue applications that require very long conversations, summarize or filter previous dialogue

這一個 tactic 主要是針對 LLM 在有限 context window 下的限制與應對方式。當 user 和 assistant 之間的對話包含在 context window 中時,這個對話脈絡在實務上是不可能無限制地持續下去,因為現階段的 LLMs 只能處理有限量的文字(Tokens)。

  • 對話轉折的摘要:當輸入達到預定的門檻值長度時,可以觸發一個查詢來摘要之前對話的部分內容。然後,這個對話的摘要可以作為 system prompt 的一部分包含在內。這種方式可以縮小需要處理的文本量,同時保留對話的關鍵信息。
  • 非同步摘要:在整個對話過程中,可以非同步地摘要先前的對話。這種方法允許系統持續更新對話的簡短版本,而不需等到達到某個特定的長度門檻值。
  • RAG:透過使用基於嵌入(embeddings)的檢索技術,可以動態選擇對當前查詢最相關的對話部分。它允許 LLMs 更有效地回顧過去的對話脈絡,選擇最有助於當前查詢的信息,從而提高對話品質和相關性。
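第一種「達到門檻值就摘要先前對話」的做法,可以用下面這個極簡的 Python 草稿示意(rough_token_count 只是粗略以單詞數估算 token,summarize 在實務上會是一次 LLM 呼叫,這裡以截斷字串代替):

```python
TOKEN_LIMIT = 50  # 假設性的門檻值,實務上取決於模型的 context window


def rough_token_count(messages: list[str]) -> int:
    """以單詞數粗略估算 token 數,僅供示意。"""
    return sum(len(m.split()) for m in messages)


def summarize(messages: list[str]) -> str:
    """實務上這裡會呼叫一次 LLM 來摘要;此處以截斷字串模擬。"""
    return "Summary of earlier turns: " + " / ".join(m[:20] for m in messages)


def compact(history: list[str]) -> list[str]:
    """當對話超過門檻值時,把較早的回合換成一段摘要,只保留最近兩輪。"""
    if rough_token_count(history) <= TOKEN_LIMIT:
        return history
    older, recent = history[:-2], history[-2:]
    return [summarize(older)] + recent
```

摘要取代了較早的回合,既縮小了需要處理的文本量,又保留了對話的關鍵脈絡。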

Tactic: Summarize long documents piecewise and construct a full summary recursively

這一個段落主要是在討論對超長文件進行摘要的策略思路,特別是當文件長度超過 LLMs 支援的 context window 時。由於這項限制,LLMs 無法一次性摘要長度超過上限的語料。策略的核心思想是「分段摘要、遞迴構建全文摘要」。具體來說,當需要摘要一個長文檔(比如一本書)時,可以按照文檔的結構(如章節)將其分成多個部分,然後對每個部分進行單獨摘要。

這些分段的摘要可以被進一步連接起來,並對這些摘要的摘要進行摘要,這個過程可以遞迴進行,直到整個文檔被摘要。如果後續段落的理解需要依賴於前面段落的信息,這種策略還可以進一步改進:在摘要某個段落時,包含一個對前面文本的運行摘要。這種方式確保了即使在摘要過程中進行遞迴,也能夠保持文檔整體意義的連貫性和完整性。

額外補充一下,這一項策略不僅僅能用在文件摘要這項任務上,在執行超長文件翻譯時也非常適用。
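分段摘要、遞迴構建全文摘要的控制流程大致如下面這個 Python 草稿(summarize 實務上是一次 LLM 呼叫,這裡以截取前幾個字元代替,僅示範切段與遞迴的流程):

```python
CHUNK_SIZE = 100  # 以字元數粗略模擬 context window 上限(假設值)


def summarize(text: str) -> str:
    """實務上這裡會呼叫一次 LLM;此處以截取前幾個字元模擬摘要。"""
    return text[: CHUNK_SIZE // 4]


def recursive_summary(text: str) -> str:
    """短於上限就直接摘要;否則切段、各段摘要後串接,再對結果遞迴摘要。"""
    if len(text) <= CHUNK_SIZE:
        return summarize(text)
    # 1. 按固定長度切段(實務上會按章節等文件結構切分)
    chunks = [text[i : i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    # 2. 各段獨立摘要後串接,對「摘要的摘要」繼續遞迴,直到塞得進上限為止
    joined = " ".join(summarize(c) for c in chunks)
    return recursive_summary(joined)
```

若後續段落的理解依賴前文,也可以在 summarize 的提示中額外附上前面文本的運行摘要,維持整體語意的連貫性。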

策略四:Give the model time to “think”

這一策略的思路在於:在得出結論之前,先明確指示模型從基本原理開始推理。

這種方法的核心思想是鼓勵 LLM 不要急於給出結論,而是先進行深入的思考和分析。例如,如果我們希望模型評估一個學生對於數學問題的解答,最直接的方式可能是直接問模型這個解答是否正確。然而,這種方法可能不會考慮解答過程中的思考邏輯和可能的錯誤。

相反,如果我們指示模型先根據數學的基本公理來分析學生的解答過程,然後再評估解答的正確性,這樣可以促使模型更加深入地理解問題和解答過程,從而提供更準確、更有洞察力的回饋。

這種策略的好處是它鼓勵模型進行更為系統性和原則性的思考,而不是僅僅依賴於表面的信息或快速結論。這樣不僅能提升模型對於特定問題的理解和分析能力,還能增強其解決問題的綜合能力。這種方法特別適用於需要深入分析和批判性思考的情境,如技術分析、學術評估、創意問題解決等領域。

SYSTEM:

Determine if the student's solution is correct or not.

USER:

Problem Statement: I'm building a solar power installation and I need help working out the financials.
- Land costs $100 / square foot
- I can buy solar panels for $250 / square foot
- I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot
What is the total cost for the first year of operations as a function of the number of square feet.

Student's Solution: Let x be the size of the installation in square feet.
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 100x
Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000

ASSISTANT:

The student's solution is correct.

上面的 prompt 方式會讓 ChatGPT 錯誤地回答說學生提出的解決方案是正確的,然而事實上學生的解答是錯的。如果我們根據此策略的建議,在 system prompt 中先要求 LLM 在判斷學生答案是否正確之前,先用自己預訓練權重中所包含的數學知識推論一遍,再與學生提出的解答進行比對,以下優化後的 prompt 就會成功推論出學生的解決方案其實是不正確的。

我們先送出 system prompt 如下:

First work out your own solution to the problem. Then compare your solution to the student's solution and evaluate if the student's solution is correct or not. Don't decide if the student's solution is correct until you have done the problem yourself.

然後我們送出 user prompt:

Problem Statement: I'm building a solar power installation and I need help working out the financials.
- Land costs $100 / square foot
- I can buy solar panels for $250 / square foot
- I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot
What is the total cost for the first year of operations as a function of the number of square feet.

Student's Solution: Let x be the size of the installation in square feet.
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 100x
Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000

最終我們會成功地得到以下正確的推論:

To solve this problem, we'll calculate the total cost for the first year of operations for a solar power installation based on the given conditions. Then, we will compare our solution with the student's solution to evaluate its correctness.

### Problem Breakdown

1. **Land Cost**: Given as $100 per square foot. If \(x\) represents the number of square feet, then the land cost is \(100x\).
   
2. **Solar Panel Cost**: Given as $250 per square foot. Thus, the cost for solar panels is \(250x\).
   
3. **Maintenance Cost**: A flat $100,000 per year plus an additional $10 per square foot. So, the maintenance cost is \(100,000 + 10x\).

### Our Calculation

The total cost for the first year of operations as a function of the number of square feet (\(x\)) can be calculated by adding up the three costs:

- **Total Cost** = Land Cost + Solar Panel Cost + Maintenance Cost
- **Total Cost** = \(100x + 250x + 100,000 + 10x\)

Let's compute the simplified form of this equation.

### Simplification

\[
\text{Total Cost} = 100x + 250x + 10x + 100,000
\]

\[
\text{Total Cost} = (100 + 250 + 10)x + 100,000
\]

\[
\text{Total Cost} = 360x + 100,000
\]

### Evaluation

The student's solution calculates the total cost as \(450x + 100,000\), which appears to incorporate an error in the addition of the costs per square foot. The correct calculation should be \(360x + 100,000\), taking into account the actual costs for land, solar panels, and maintenance without the additional \(100x\) mistakenly added to the maintenance cost.

Therefore, the student's solution is incorrect. The correct total cost for the first year of operations as a function of the number of square feet is \(360x + 100,000\).

ChatGPT 在比對它事先推論出的解答後,會指出學生的推論結果其實是不正確的。

Tactic: Use inner monologue or a sequence of queries to hide the model’s reasoning process

此段落介紹了一種名為「Inner monologue」的 prompt 技巧,在某些應用場景下,LLMs 在達到最終答案的過程中所用的推理過程可能不適合直接與用戶分享。舉個例子,在輔導應用中,我們可能希望鼓勵學生自己解題,但是如果模型直接展示其對學生解答的分析過程,這可能會無意中透露出正確答案。

「Inner monologue」策略的核心思想是指示模型將那些意圖對用戶隱藏的輸出部分放入一種結構化的格式,使得這些信息易於解析但不會直接顯示給用戶。然後,在將輸出展示給用戶之前進行解析,只展示其中的部分輸出。

這樣,就可以在不直接透露答案的情況下,提供有用的輔導或反饋,從而支持學習過程而不破壞學習的目的。這種策略有效地平衡了提供輔導支持和促進自主學習之間的關係,尤其適用於教育技術和個性化學習場景。

首先我們使用以下的 system prompt 作為開頭:

Follow these steps to answer the user queries.

Step 1 - First work out your own solution to the problem. Don't rely on the student's solution since it may be incorrect. Enclose all your work for this step within triple quotes (""").

Step 2 - Compare your solution to the student's solution and evaluate if the student's solution is correct or not. Enclose all your work for this step within triple quotes (""").

Step 3 - If the student made a mistake, determine what hint you could give the student without giving away the answer. Enclose all your work for this step within triple quotes (""").

Step 4 - If the student made a mistake, provide the hint from the previous step to the student (outside of triple quotes). Instead of writing "Step 4 - ..." write "Hint:".

然後輸入以下的 user prompt:

Problem Statement: A box contains 4 red balls and 6 blue balls. Two balls are drawn from the box without replacement. What is the probability that one of the drawn balls is red and the other one is blue?

Student Solution: (4/10)*(6/9) = 24/90 = 4/15

最終我們會得到以下 ChatGPT 的推論回應:

"""
Step 1: First, let's work out the correct solution to the problem.

The problem asks for the probability of drawing one red ball and one blue ball in any order from a box containing 4 red balls and 6 blue balls, without replacement.

There are two scenarios to consider:
1. Drawing a red ball first and then a blue ball.
2. Drawing a blue ball first and then a red ball.

Let's calculate the probability for each scenario:

1. Probability of drawing a red ball first and then a blue ball:
   - The probability of drawing a red ball first is 4/10 (since there are 4 red balls out of a total of 10 balls).
   - After drawing a red ball, there are now 9 balls left in the box, with 6 of them being blue. So, the probability of drawing a blue ball next is 6/9.
   - Therefore, the probability of this scenario is (4/10) * (6/9).

2. Probability of drawing a blue ball first and then a red ball:
   - The probability of drawing a blue ball first is 6/10 (since there are 6 blue balls out of a total of 10 balls).
   - After drawing a blue ball, there are now 9 balls left in the box, with 4 of them being red. So, the probability of drawing a red ball next is 4/9.
   - Therefore, the probability of this scenario is (6/10) * (4/9).

The total probability of drawing one red ball and one blue ball in any order is the sum of the probabilities of these two scenarios:
Total probability = (4/10) * (6/9) + (6/10) * (4/9).

Let's calculate this:
Total probability = (4*6)/(10*9) + (6*4)/(10*9)
                   = 24/90 + 24/90
                   = 48/90
                   = 8/15

So, the correct probability is 8/15.

Step 2: Now, let's compare the student's solution to the correct solution.

The student calculated the probability as (4/10) * (6/9) = 24/90 = 4/15. This calculation only accounts for the scenario of drawing a red ball first and then a blue ball. The student did not consider the scenario of drawing a blue ball first and then a red ball, which also satisfies the condition of drawing one red ball and one blue ball in any order.

Therefore, the student's solution is incorrect because it only accounts for half of the possible outcomes that meet the problem's criteria.

Step 3: To determine what hint to give the student without giving away the answer:

The student should be reminded that there are two possible orders in which a red ball and a blue ball can be drawn: red then blue, and blue then red. They should consider the probability of both scenarios and then sum these probabilities to find the total probability of drawing one red ball and one blue ball in any order.
"""

Hint: Remember, when calculating the probability of drawing one red ball and one blue ball, you need to consider both possible orders: drawing a red ball first and then a blue ball, and drawing a blue ball first and then a red ball. Calculate the probability for each scenario separately and then add them together to find the total probability.
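展示給學生之前的「解析並只展示部分輸出」這一步,可以用以下假設性的 Python 草稿示意:先移除三重引號包住的推理過程,再只取出以 Hint: 開頭的部分(extract_hint 為虛構的函式名稱):

```python
import re


def extract_hint(model_output: str) -> str:
    """移除三重引號包住的內部推理,只回傳 Hint: 之後要給學生看的文字。"""
    visible = re.sub(r'"""[\s\S]*?"""', "", model_output).strip()
    match = re.search(r"Hint:\s*(.*)", visible, re.S)
    return match.group(1).strip() if match else visible


output = '"""\nStep 1: work out the answer...\n"""\n\nHint: Remember both orders.'
shown_to_student = extract_hint(output)
```

如此一來,模型完整的解題過程保留在系統端,學生端只看得到不洩漏答案的提示。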

另一種類似的做法如下:我們先要求 LLM 獨立解決問題,而不考慮學生的解答。這一步驟的輸出不會展示給學生,其優點在於模型提出的解決方案不會受到學生嘗試解答的影響,保持了 LLM 解答的獨立性和客觀性。

System prompt:

A box contains 4 red balls and 6 blue balls. Two balls are drawn from the box without replacement. What is the probability that one of the drawn balls is red and the other one is blue?

我們會先得到以下的模型推論輸出:

There are two scenarios in which we can draw one red ball and one blue ball: either we draw a red ball first and then a blue ball, or we draw a blue ball first and then a red ball.

The probability of drawing a red ball first and then a blue ball is (4/10) * (6/9) = 24/90 = 4/15.

The probability of drawing a blue ball first and then a red ball is (6/10) * (4/9) = 24/90 = 4/15.

Since these are mutually exclusive events, we can add these probabilities together to get the total probability of drawing one red ball and one blue ball.

So, the total probability is 4/15 + 4/15 = 8/15.

緊接著我們使用以下的 prompt

System prompt:

Compare your solution to the student's solution and evaluate if the student's solution is correct or not.

然後接續以下的 user prompt:

User prompt:

Problem statement: """A box contains 4 red balls and 6 blue balls. Two balls are drawn from the box without replacement. What is the probability that one of the drawn balls is red and the other one is blue?"""

Your solution: """There are two possible successful outcomes: either we draw a red ball first and then a blue ball, or we draw a blue ball first and then a red ball. We'll find the probability of each of these outcomes and then add them together.

1) Red first, then blue: The probability of drawing a red ball first is 4/10, since there are 4 red balls and 10 total balls. After drawing a red ball, there are 3 red balls and 6 blue balls left, so the probability of drawing a blue ball next is 6/9. So the probability of this outcome is (4/10)*(6/9).

2) Blue first, then red: The probability of drawing a blue ball first is 6/10, since there are 6 blue balls and 10 total balls. After drawing a blue ball, there are 4 red balls and 5 blue balls left, so the probability of drawing a red ball next is 4/9. So the probability of this outcome is (6/10)*(4/9).

Now we add the probabilities of these two outcomes together: (4/10)*(6/9) + (6/10)*(4/9) = 24/90 + 24/90 = 48/90. This fraction simplifies to 8/15, so the probability of drawing one red ball and one blue ball is $\boxed{\frac{8}{15}}$."""

Student’s solution: """(4/10)*(6/9) = 24/90 = 4/15"""

我們會得到以下的 Assistant 回應:

Assistant: The student's solution is incorrect. They only calculated the probability of one of the outcomes (red first, then blue) and did not consider the other outcome (blue first, then red). The correct solution should include both outcomes and add their probabilities together.

最終我們再接續使用以下的 System prompt:

System prompt:

You are a math tutor. If the student made an error, offer a hint to the student in a way that does not reveal the answer. If the student did not make an error, simply offer them an encouraging comment.

以及以下接續的 User prompt:

User prompt:

Problem statement: """A box contains 4 red balls and 6 blue balls. Two balls are drawn from the box without replacement. What is the probability that one of the drawn balls is red and the other one is blue?"""

Your solution: """There are two possible successful outcomes: either we draw a red ball first and then a blue ball, or we draw a blue ball first and then a red ball. We'll find the probability of each of these outcomes and then add them together.

1) Red first, then blue: The probability of drawing a red ball first is 4/10, since there are 4 red balls and 10 total balls. After drawing a red ball, there are 3 red balls and 6 blue balls left, so the probability of drawing a blue ball next is 6/9. So the probability of this outcome is (4/10)*(6/9).

2) Blue first, then red: The probability of drawing a blue ball first is 6/10, since there are 6 blue balls and 10 total balls. After drawing a blue ball, there are 4 red balls and 5 blue balls left, so the probability of drawing a red ball next is 4/9. So the probability of this outcome is (6/10)*(4/9).

Now we add the probabilities of these two outcomes together: (4/10)*(6/9) + (6/10)*(4/9) = 24/90 + 24/90 = 48/90. This fraction simplifies to 8/15, so the probability of drawing one red ball and one blue ball is $\boxed{\frac{8}{15}}$."""

Student’s solution: """(4/10)*(6/9) = 24/90 = 4/15"""

Analysis: """The student's solution is incorrect. They only calculated the probability of drawing a red ball first and then a blue ball, but they didn't consider the other possible successful outcome of drawing a blue ball first and then a red ball. The correct solution, as you showed, is to add the probabilities of both successful outcomes together, which results in a probability of 8/15."""

我們會得到以下最終的 Assistant 輸出:

Hint: Remember that there are two possible successful outcomes: drawing a red ball first and then a blue ball, or drawing a blue ball first and then a red ball. Make sure to consider both of these outcomes when calculating the probability.

Tactic: Ask the model if it missed anything on previous passes

當處理的原始文本很大時,模型可能會過早地停止搜索,導致未能列出所有相關的摘錄。這在自動文件提取或摘要生成的應用中是一個常見的問題。為了解決這個問題,可以在LLM 完成第一次內容提取後,透過後續查詢來促使模型再次檢查原始文件,尋找可能在先前階段遺漏的相關內容。這種策略可以顯著提高模型的性能,因為它允許模型更全面地探索原始文檔,並減少遺漏重要信息的風險。

透過這種反覆迭代追問的過程,可以確保 LLM 提供更完整、更準確的相關摘要列表,從而提高資訊檢索或文件分析任務的品質。這套方法論對於需要從大量文本中提取關鍵資訊的任務尤其有用,比如法律文件分析、學術研究、或者市場研究報告等等。

我們來實際實驗一下,首先我們先使用以下的 System prompt:

System prompt:

You will be provided with a document delimited by triple quotes. Your task is to select excerpts which pertain to the following question: "What significant paradigm shifts have occurred in the history of artificial intelligence."  

Ensure that excerpts contain all relevant context needed to interpret them - in other words don't extract small snippets that are missing important context. Provide output in JSON format as follows:

[{"excerpt": "insert excerpt here"},
...
{"excerpt": "insert excerpt here"}]

然後以 User prompt 的方式輸入以下要進行文本擷取任務的長原始文本:

"""
Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by humans or by other animals. Example tasks in which this is done include speech recognition, computer vision, translation between (natural) languages, as well as other mappings of inputs.

AI applications include advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Waymo), generative or creative tools (ChatGPT and AI art), automated decision-making, and competing at the highest level in strategic game systems (such as chess and Go).[1]

[...中間省略數千字]

AI content detector tools are software applications that use artificial intelligence (AI) algorithms to analyze and detect specific types of content in digital media, such as text, images, and videos. These tools are commonly used to identify inappropriate content, such as speech errors, violent or sexual images, and spam, among others.

Some benefits of using AI content detector tools[157] include improved efficiency and accuracy in detecting inappropriate content, increased safety and security for users, and reduced legal and reputational risks for websites and platforms.
"""

ChatGPT 會回傳給我們以下的推論結果:

[{"excerpt": "Artificial intelligence was founded as an academic discipline in 1956, and in the years since it has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success, and renewed funding. AI research has tried and discarded many different approaches, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge, and imitating animal behavior. In the first decades of the 21st century, highly mathematical and statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia."}, {"excerpt": "By the 1950s, two visions for how to achieve machine intelligence emerged. One vision, known as Symbolic AI or GOFAI, was to use computers to create a symbolic representation of the world and systems that could reason about the world. Proponents included Allen Newell, Herbert A. Simon, and Marvin Minsky. The second vision, known as the connectionist approach, sought to achieve intelligence through learning. Proponents of this approach, most prominently Frank Rosenblatt, sought to connect Perceptron in ways inspired by connections of neurons. Manyika argues that symbolic approaches dominated the push for artificial intelligence in this period, due in part to its connection to intellectual traditions of Descartes, Boole, Gottlob Frege, Bertrand Russell, and others. Connectionist approaches based on cybernetics or artificial neural networks were pushed to the background but have gained new prominence in recent decades."}, {"excerpt": "In the early 1980s, AI research was revived by the commercial success of expert systems, a form of AI program that simulated the knowledge and analytical skills of human experts. By 1985, the market for AI had reached over a billion dollars. 
However, beginning with the collapse of the Lisp Machine market in 1987, AI once again fell into disrepute, and a second, longer-lasting winter began."}, {"excerpt": "Interest in neural networks and "connectionism" was revived by Geoffrey Hinton, David Rumelhart and others in the middle of the 1980s. Soft computing tools were developed in the 1980s, such as neural networks, fuzzy systems, Grey system theory, evolutionary computation and many tools drawn from statistics or mathematical optimization."}, {"excerpt": "Faster computers, algorithmic improvements, and access to large amounts of data enabled advances in machine learning and perception; data-hungry deep learning methods started to dominate accuracy benchmarks around 2012. According to Bloomberg's Jack Clark, 2015 was a landmark year for artificial intelligence, with the number of software projects that use AI within Google increased from a "sporadic usage" in 2012 to more than 2,700 projects."}, {"excerpt": "Numerous academic researchers became concerned that AI was no longer pursuing the original goal of creating versatile, fully intelligent machines. Much of current research involves statistical AI, which is overwhelmingly used to solve specific problems, even highly successful techniques such as deep learning. This concern has led to the subfield of artificial general intelligence (or "AGI"), which had several well-funded institutions by the 2010s."}]

如果我們按照 OpenAI 官方的範例再一次的透過 User prompt 用以下的 prompt 問 ChatGPT 一次:

User prompt:

Are there more relevant excerpts? Take care not to repeat excerpts. Also ensure that excerpts contain all relevant context needed to interpret them - in other words don't extract small snippets that are missing important context.

我們會再次得到以下一長串的 Assistant 輸出:

[{"excerpt": "By the 1950s, two visions for how to achieve machine intelligence emerged. One vision, known as Symbolic AI or GOFAI, was to use computers to create a symbolic representation of the world and systems that could reason about the world. Proponents included Allen Newell, Herbert A. Simon, and Marvin Minsky. Closely associated with this approach was the "heuristic search" approach, which likened intelligence to a problem of exploring a space of possibilities for answers. The second vision, known as the connectionist approach, sought to achieve intelligence through learning. Proponents of this approach, most prominently Frank Rosenblatt, sought to connect Perceptron in ways inspired by connections of neurons. James Manyika and others have compared the two approaches to the mind (Symbolic AI) and the brain (connectionist). Manyika argues that symbolic approaches dominated the push for artificial intelligence in this period, due in part to its connection to intellectual traditions of Descartes, Boole, Gottlob Frege, Bertrand Russell, and others. Connectionist approaches based on cybernetics or artificial neural networks were pushed to the background but have gained new prominence in recent decades."},{"excerpt": "In the early 1980s, AI research was revived by the commercial success of expert systems, a form of AI program that simulated the knowledge and analytical skills of human experts. By 1985, the market for AI had reached over a billion dollars. At the same time, Japan's fifth generation computer project inspired the U.S. and British governments to restore funding for academic research. However, beginning with the collapse of the Lisp Machine market in 1987, AI once again fell into disrepute, and a second, longer-lasting winter began."},{"excerpt": "Many researchers began to doubt that the symbolic approach would be able to imitate all the processes of human cognition, especially perception, robotics, learning and pattern recognition. 
A number of researchers began to look into "sub-symbolic" approaches to specific AI problems. Robotics researchers, such as Rodney Brooks, rejected symbolic AI and focused on the basic engineering problems that would allow robots to move, survive, and learn their environment."},{"excerpt": "Interest in neural networks and "connectionism" was revived by Geoffrey Hinton, David Rumelhart and others in the middle of the 1980s. Soft computing tools were developed in the 1980s, such as neural networks, fuzzy systems, Grey system theory, evolutionary computation and many tools drawn from statistics or mathematical optimization."},{"excerpt": "Faster computers, algorithmic improvements, and access to large amounts of data enabled advances in machine learning and perception; data-hungry deep learning methods started to dominate accuracy benchmarks around 2012. According to Bloomberg's Jack Clark, 2015 was a landmark year for artificial intelligence, with the number of software projects that use AI within Google increased from a "sporadic usage" in 2012 to more than 2,700 projects. He attributed this to an increase in affordable neural networks, due to a rise in cloud computing infrastructure and to an increase in research tools and datasets."},{"excerpt": "Numerous academic researchers became concerned that AI was no longer pursuing the original goal of creating versatile, fully intelligent machines. Much of current research involves statistical AI, which is overwhelmingly used to solve specific problems, even highly successful techniques such as deep learning. This concern has led to the subfield of artificial general intelligence (or "AGI"), which had several well-funded institutions by the 2010s."}]

基本上後續 User prompt 的重點在於多了一段 ...Take care not to repeat excerpts...,要求 LLM 在推論時盡量不要重複提取先前已經萃取出的內容;同時,前一輪的推論結果也會隨著對話脈絡一併代入本輪作為 context。

不過實務上的測試結果顯示,在 ChatGPT 中由於無法透過後端程式將每一輪生成的推論結果先行摘要,所以如果原始語料過於龐大,不出兩輪整個 context window 就爆了。
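若改以 API 搭配後端程式來實作這種反覆追問的流程,我們可以在程式端自行合併每一輪的擷取結果並去除重複。以下是一個極簡的示意,其中 merge_excerpts 為假想的輔助函數,只展示去重邏輯,實際的 LLM 呼叫仍需自行串接:

```python
def merge_excerpts(rounds):
    """合併多輪擷取結果,以正規化後的文字為鍵,去除重複的 excerpt。"""
    seen = set()
    merged = []
    for round_result in rounds:  # rounds:每一輪 LLM 回傳的 excerpt 清單
        for item in round_result:
            # 正規化空白與大小寫,避免僅因排版差異而被視為不同摘錄
            key = " ".join(item["excerpt"].split()).lower()
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged

# 模擬第一輪與第二輪各回傳了部分重複的結果
round_1 = [{"excerpt": "AI research was revived by expert systems."}]
round_2 = [{"excerpt": "AI research was revived by  expert systems."},
           {"excerpt": "Deep learning started to dominate around 2012."}]
print(len(merge_excerpts([round_1, round_2])))  # 重複的摘錄只會保留一份
```

如此一來「不要重複擷取」的負擔就不必完全交給模型,後端也能順手控制每一輪代入 context 的內容量。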

策略五:Use external tools

Tactic: Use embeddings-based search to implement efficient knowledge retrieval

這段的方法論在討論的就是 RAG,使用基於文字嵌入(embeddings)的搜索來實現高效知識檢索的策略。這種方法允許模型在處理輸入問題時,動態地從外部資源中獲取並加入相關的脈絡資訊,以輔助 LLMs 生成更加準確和更即時的回應內容。例如,當用戶提問有關特定電影的問題時,將關於該電影的高品質資訊(如演員名單、導演以及評價等)加入到 LLM 的 context window 中,可以幫助 LLM 提供更具體且相關的回答。

文字嵌入是一種向量,能夠量化文字串之間的相關性。相似或相關的文本串在高維度的嵌入空間中會彼此更接近,與此同時,快速的向量搜索算法的存在使得可以通過嵌入來實現高效的知識檢索。具體來說,可以將一個文字語料庫分割成多個切塊,對每個塊進行嵌入並儲存在選定的向量資料庫(Vector Store)中。當有一個特定的查詢時,可以將這個查詢進行嵌入轉換為高維度向量,然後進行向量搜索,以找到在向量語料庫中與該查詢最相關(即在嵌入空間中最接近)的文本區塊。
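上述檢索流程的核心可以用餘弦相似度(cosine similarity)來示意。以下是一個不依賴任何外部套件的極簡示範,其中的向量皆為手工捏造的玩具資料,實務上應由 embedding 模型(例如 OpenAI 的 text-embedding 系列)計算產生:

```python
import math

def cosine_similarity(a, b):
    """計算兩個向量的餘弦相似度:內積除以兩者範數的乘積。"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vec, store):
    """在 store(文字區塊與其嵌入向量的清單)中找出與查詢最接近的區塊。"""
    return max(store, key=lambda item: cosine_similarity(query_vec, item[1]))[0]

# 玩具向量庫:實務上每個區塊的向量會由 embedding 模型計算並存入向量資料庫
store = [
    ("劇情介紹:這是一部關於太空旅行的電影", [0.9, 0.1, 0.0]),
    ("演員名單:主角由某知名演員飾演", [0.1, 0.8, 0.1]),
    ("影評:普遍獲得正面評價", [0.0, 0.2, 0.9]),
]
query = [0.05, 0.9, 0.05]  # 假想的「這部電影的演員有誰?」查詢向量
print(search(query, store))
```

實際系統中通常會取回最相關的前 k 個區塊,再一併放進 LLM 的 context window 中輔助回答。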

關於 RAG 這個主題,我們在未來預計會有數十篇專文會針對它進行討論。

Tactic: Use code execution to perform more accurate calculations or call external APIs

這段談論的其實就是我們之前在 這篇文章 中討論過的 Program-Aided Language(PAL)。在推論需要運算的問題上先指導模型撰寫並運行程式碼來完成,而不是依靠它的預訓練權重直接進行推論。

具體來說,可以指導 LLM 將需要執行的程式碼放置在指定的格式中,例如使用三重反引號(```)來包裹程式碼。這樣做的目的是清楚地區分出哪部分文本是程式碼,方便後續的擷取和執行。在程式碼輸出產生後,這段程式碼可以被擷取出來,並在適當的程式碼執行引擎(例如 Python Interpreter)中執行。

最後,如果需要,程式碼執行的輸出結果可以作為輸入提供給 LLM,用於下一個查詢的處理。這樣,通過結合 LLM 的自然語言處理能力和程式碼執行機制,可以有效提高處理算術運算或複雜計算問題的準確性和靈活性。這種方法不僅提高了模型在特定任務上的實用性,也展示了通過提示工程來擴展模型應用範圍的可能性。
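在後端程式中,可以用一個簡單的正則表達式把模型輸出裡被三重反引號包裹的程式碼擷取出來再執行。以下是一個極簡的示意(注意:這裡的 exec 僅供展示,實務上務必在隔離的沙箱環境中執行模型產生的程式碼):

```python
import re

def extract_code_blocks(text):
    """擷取被三重反引號包裹的程式碼區塊,允許帶有語言標記(如 ```python)。"""
    pattern = r"```(?:[a-zA-Z]+\n)?(.*?)```"
    return [block.strip() for block in re.findall(pattern, text, re.DOTALL)]

def run_block(code):
    """在獨立的命名空間中執行程式碼並回傳該命名空間。實務上應改用沙箱。"""
    namespace = {}
    exec(code, namespace)
    return namespace

# 模擬一段含有程式碼區塊的模型輸出
model_output = "Here is the calculation:\n```python\nresult = (4/10)*(6/9) + (6/10)*(4/9)\n```"
blocks = extract_code_blocks(model_output)
print(run_block(blocks[0])["result"])  # 約 0.5333,即 8/15
```

擷取出的執行結果接著就能如上所述,回填給 LLM 作為下一輪查詢的輸入。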

以下我們來簡單地進行測試:

System prompt:

You can write and execute Python code by enclosing it in triple backticks, e.g. ```code goes here```. Use this to perform calculations.

ChatGPT 會輸出以下的 Assistant 回應:

Assistant:

Yes, I can execute Python code for various tasks such as calculations, data analysis, simulations, and more. If you have a specific task or calculation in mind, please let me know, and I can assist you with it.

然後我們以 User role 輸入以下的 prompt:

User prompt:

Find all real-valued roots of the following polynomial: 3*x**5 - 5*x**4 - 3*x**3 - 7*x - 10.

最終在 ChatGPT 輸出的結果中我們會看到它呼叫了 Code Interpreter 去透過撰寫 Python code 來嘗試回答我們輸入的數學問題。


而 ChatGPT 的輸出如下:

The real-valued root of the polynomial \(3x^5 - 5x^4 - 3x^3 - 7x - 10\) is approximately \(x = 2.3697\).

在實務上,我們也可以透過 system prompt 要求 LLM 與外部 API 進行溝通來解決特定的問題:在 prompt 中提供 API 或 SDK 文件,驅動 LLM 與外部資源互動,完成複雜的推論任務。

System role:

You can write and execute Python code by enclosing it in triple backticks. Also note that you have access to the following module to help users send messages to their friends:

```python
import message
message.write(to="John", message="Hey, want to meetup after work?")
```

User role:

Let Alice know I'll meet her at the coffee shop at 10.

Assistant role:

Sure, I can help you with that. Here is the Python code to send a message to Alice.

```python
import message
message.write(to="Alice", message="I'll meet you at the coffee shop at 10.")
```

需要注意的是上述 OpenAI 提供的 prompt 範例並不能明確的展示這方法論的好處,我會在後續開一篇獨立的文章來分享 LLM 整合外部 APIs 的進階用法。

Tactic: Give the model access to specific functions

這一段講的就比較不具通用性了,主要是在講述 OpenAI 所提供的一些特殊 API,可以在 GPT 進行回應推論的時候給予協助。例如 Function Calling 這個工具:透過在 prompt 中遵循 JSON Schema 標準撰寫函數描述,讓模型剖析使用者輸入的內容,並從中萃取出有用的資訊(例如 Intents 與 Entities),進而決定要呼叫哪些外部工具、API 來輔助完成推論。這是一個很大的題目,我也會在未來以系列文章來進行分享。
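以下用一個假想的查詢天氣函數,示意 Function Calling 的基本流程:先以 JSON Schema 描述函數介面,待模型回傳欲呼叫的函數名稱與 JSON 格式的參數後,再由後端程式解析並分派執行。其中的 get_current_weather 與簡化過的 tool_call 結構皆為示意用途,實際 API 回傳的結構層次略有不同:

```python
import json

# 以 JSON Schema 描述函數介面,這份描述會放進 API 請求的 tools 欄位
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "取得指定城市目前的天氣",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "城市名稱,例如:台北"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

def dispatch(tool_call, registry):
    """解析模型回傳的函數名稱與 JSON 參數,分派給對應的本地函數執行。"""
    fn = registry[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return fn(**args)

registry = {"get_current_weather": lambda city, unit="celsius": f"{city}:25 度 ({unit})"}
# 模擬模型決定呼叫函數時回傳的內容(結構已簡化)
tool_call = {"name": "get_current_weather", "arguments": '{"city": "台北"}'}
print(dispatch(tool_call, registry))
```

模型本身不會真的執行函數,它只負責產出「要呼叫哪個函數、帶什麼參數」的結構化描述,實際執行與回填結果都由後端程式完成。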

策略六:Test changes systematically

要能夠優化一項任務,系統化的評估並記錄各項相關的數據絕對是必要的。有時候,對提示詞的修改在少數獨立的場景中可能可以達到很好的性能,但在更有代表性的測試集合上卻可能導致整體性能下降。因此,為了確保一個提示詞內容的變更對 LLM 應用程式的推論效能是正面的提升還是退化,這就需要為你的應用場景定義出一個全面的測試流程(也稱為「評估 eval」)。

一個好的評估(eval)具有以下幾個要件:

  • 符合真實世界的使用情況(或至少是多樣化的)
  • 包含許多測試案例,以獲得更大的統計信度
  • 易於自動化或重複執行(透過程式或有效率的 LLMOps)

該如何有效率的執行 prompt engineering evals 呢?

  • 自動化與客觀性:透過電腦進行評估能夠自動化地處理具有客觀標準的評估,例如有單一正確答案的問題。這意味著對於那些可以明確定義正確輸出的情況,可以高效且一致地評估大型語言模型的輸出,減少了人工評估的需求和主觀偏差。
  • 處理主觀或模糊標準:對於一些主觀或不那麼清晰的標準,電腦評估也可以透過其他模型查詢來評估模型輸出。這表示在評估模型對於複雜、開放式問題的回答能力時,我們可以利用模型本身的輸出作為評估標準,進行更深入的分析。
  • OpenAI Evals: OpenAI Evals 提供了一套開源軟體框架,用於建立自動化的評估工具。這使得研究者和開發者能夠更容易地設計和實施自動化評估程序,加速 LLM 的優化和改進。
  • 模型基於評估的靈活性:當存在一系列可能輸出,且這些輸出在質量上被視為等同高質的情況下,基於模型的評估尤其有用。這類評估允許對於長答案等複雜問題的輸出進行質量評估,增加了評估的多樣性和深度。
  • 不斷變化的評估邊界:隨著 LLM 能力的提升,「可由模型自動評估」與「仍需人工評估」之間的邊界也不斷變化。這鼓勵持續實驗與創新,以探索基於模型的評估在不同場景中的有效性。

Tactic: Evaluate model outputs with reference to gold-standard answers(奠基在內容都是既定事實的資料集來去做 prompt 的 eval)

System prompt

You will be provided with text delimited by triple quotes that is supposed to be the answer to a question. Check if the following pieces of information are directly contained in the answer:

- Neil Armstrong was the first person to walk on the moon.
- The date Neil Armstrong first walked on the moon was July 21, 1969.

For each of these points perform the following steps:

1 - Restate the point.
2 - Provide a citation from the answer which is closest to this point.
3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point. Explain why or why not before making up your mind.
4 - Write "yes" if the answer to 3 was yes, otherwise write "no".

Finally, provide a count of how many "yes" answers there are. Provide this count as {"count": <insert count here>}.

User prompt

"""Neil Armstrong is famous for being the first human to set foot on the Moon. This historic event took place on July 21, 1969, during the Apollo 11 mission."""

上述提示詞在 ChatGPT 中我們會得到以下的答案

### Point 1: Neil Armstrong was the first person to walk on the moon.

- **Citation**: "Neil Armstrong is famous for being the first human to set foot on the Moon."
- **Inference Explanation**: The citation directly states that Neil Armstrong was the first human to walk on the Moon. Someone reading this sentence would not need additional information to understand that Neil Armstrong was the first person to do so. The statement is clear and straightforward.
- **Directly Contained**: yes

### Point 2: The date Neil Armstrong first walked on the moon was July 21, 1969.

- **Citation**: "This historic event took place on July 21, 1969, during the Apollo 11 mission."
- **Inference Explanation**: The citation directly provides the date when Neil Armstrong first walked on the Moon. It explicitly mentions "July 21, 1969," linking it to the historic event of the first Moon walk. Someone reading this would be able to understand the exact date of the event without needing additional context.
- **Directly Contained**: yes

### Count of "Yes" Answers

```json
{"count": 2}
```

如果我們將 user prompt 改成

"""Neil Armstrong made history when he stepped off the lunar module, becoming the first person to walk on the moon."""

我們將會在 ChatGPT 中得到以下的內容

### Point 1: Neil Armstrong was the first person to walk on the moon.

- **Citation**: "Neil Armstrong made history when he stepped off the lunar module, becoming the first person to walk on the moon."
- **Inference Explanation**: The citation directly states that Neil Armstrong became the first person to walk on the Moon upon stepping off the lunar module. This statement is clear and leaves no room for misinterpretation regarding who was the first to walk on the Moon.
- **Directly Contained**: yes

### Point 2: The date Neil Armstrong first walked on the moon was July 21, 1969.

- **Citation**: No specific citation is available for the date within the provided text.
- **Inference Explanation**: Since the text does not mention the date "July 21, 1969," someone reading it would not be able to infer the specific date Neil Armstrong first walked on the Moon. The information about the date is not provided in the text, so it cannot be directly inferred.
- **Directly Contained**: no

### Count of "Yes" Answers

```json
{"count": 1}
```

最後!如果我們將 user prompt 再次變更為

"""In the summer of '69, a voyage grand,
Apollo 11, bold as legend's hand.
Armstrong took a step, history unfurled,
"One small step," he said, for a new world."""

ChatGPT 產生的輸出會變更如下

### Point 1: Neil Armstrong was the first person to walk on the moon.

- **Citation**: "Armstrong took a step, history unfurled,"
- **Inference Explanation**: The citation metaphorically indicates that Armstrong's action of taking a step was historically significant, implying he was the first to do something monumental. However, without explicit mention that this step was the first on the Moon, someone unfamiliar with the context might not directly infer this specific historical achievement. The connection relies on prior knowledge of Armstrong's association with the Moon landing.
- **Directly Contained**: no

### Point 2: The date Neil Armstrong first walked on the moon was July 21, 1969.

- **Citation**: "In the summer of '69, a voyage grand,"
- **Inference Explanation**: The citation provides a general timeframe, the summer of 1969, for the event. However, it does not specify the exact date of July 21, 1969. While it places the event in the correct year, the lack of a specific date means someone reading this cannot directly infer the precise day Neil Armstrong first walked on the Moon.
- **Directly Contained**: no

### Count of "Yes" Answers

```json
{"count": 0}
```
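若要把這套 gold-standard eval 自動化,後端程式只需要從模型輸出中解析出 {"count": N},再換算成分數即可。以下是一個極簡的示意,其中 extract_count 與 score 皆為假想的輔助函數:

```python
import re

def extract_count(model_output):
    """從模型輸出中找出 {"count": N} 形式的 JSON 物件並回傳 N。"""
    match = re.search(r'\{\s*"count"\s*:\s*(\d+)\s*\}', model_output)
    return int(match.group(1)) if match else None

def score(model_output, total_points):
    """將命中的事實數換算為 0~1 的分數;找不到 count 時視為 0 分。"""
    count = extract_count(model_output)
    return count / total_points if count is not None else 0.0

# 模擬上面第二個例子的模型輸出:兩個既定事實中只命中一個
output = 'Count of "Yes" Answers\n```json\n{"count": 1}\n```'
print(score(output, 2))  # 0.5
```

把這個分數對整個測試集合取平均,就能在每次修改 prompt 後量化地比較前後版本的表現。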

System prompt:

Use the following steps to respond to user inputs. Fully restate each step before proceeding. i.e. "Step 1: Reason...".

Step 1: Reason step-by-step about whether the information in the submitted answer compared to the expert answer is either: disjoint, equal, a subset, a superset, or overlapping (i.e. some intersection but not subset/superset).

Step 2: Reason step-by-step about whether the submitted answer contradicts any aspect of the expert answer.

Step 3: Output a JSON object structured like: {"type_of_overlap": "disjoint" or "equal" or "subset" or "superset" or "overlapping", "contradiction": true or false}

在 ChatGPT 下了 system prompt 後會取得如下的回應

To respond to user inputs following the outlined steps, let's break down each step for clarity:

### Step 1: Reason step-by-step about whether the information in the submitted answer compared to the expert answer is either: disjoint, equal, a subset, a superset, or overlapping (i.e. some intersection but not subset/superset).

This step involves comparing the content of the submitted answer with the expert answer to understand the relationship between them. Here's how to approach it:

- **Disjoint**: Determine if the submitted answer and the expert answer have no content in common. This means they address completely different aspects or questions.
- **Equal**: Check if the submitted answer contains exactly the same information as the expert answer, with no additional or missing details.
- **Subset**: Ascertain if all information in the submitted answer is included in the expert answer, but the expert answer contains more details or information not found in the submitted answer.
- **Superset**: Identify if the submitted answer includes all the information found in the expert answer, plus additional details or information not covered by the expert answer.
- **Overlapping**: Evaluate if there is some common information shared between the submitted and expert answers, but each also contains unique information not shared by the other.

### Step 2: Reason step-by-step about whether the submitted answer contradicts any aspect of the expert answer.

In this step, you'll assess if there are any direct conflicts between the information provided in the submitted answer and the expert answer:

- **Identify any discrepancies** in factual information, methodologies, conclusions, or interpretations between the two answers.
- **Consider the context** and nuances of both answers to determine if what might initially appear as a contradiction could be reconciled with additional explanations or by understanding a broader context.
- **Determine contradiction** by establishing whether the submitted answer makes any claims or presents any evidence that directly opposes or invalidates any part of the expert answer.

### Step 3: Output a JSON object structured like

```json
{
  "type_of_overlap": "disjoint" or "equal" or "subset" or "superset" or "overlapping",
  "contradiction": true or false
}
```

This step involves synthesizing the findings from the first two steps into a structured JSON response:

- **"type_of_overlap"**: This field should reflect the relationship between the submitted and expert answers as determined in Step 1.
- **"contradiction"**: This boolean field indicates whether, based on Step 2, any aspect of the submitted answer contradicts the expert answer.

By following these steps methodically, you can systematically analyze and compare the submitted answer to the expert answer, leading to a clear and structured evaluation of their relationship.

接著我們輸入第一則 user prompt:

Question: """What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time."""

Submitted Answer: """Didn't he walk on the moon or something?"""

Expert Answer: """Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969."""

針對第一個問題,我們會得到以下的推論結果

Step 1: Reason step-by-step about whether the information in the submitted answer compared to the expert answer is either: disjoint, equal, a subset, a superset, or overlapping (i.e. some intersection but not subset/superset).

- The submitted answer states, "Didn't he walk on the moon or something?" This indicates an acknowledgment of Neil Armstrong's moonwalk, aligning with the expert answer's core information about Armstrong being famous for walking on the moon.
- The expert answer provides additional specific details, including the exact date of the event: July 21, 1969.
- The submitted answer's information is contained within the expert answer, but it lacks the specificity and detail (e.g., the exact date).

Conclusion: The submitted answer is a subset of the expert answer because it mentions the moonwalk but lacks the specific details provided by the expert, such as the exact date.

Step 2: Reason step-by-step about whether the submitted answer contradicts any aspect of the expert answer.

- The submitted answer suggests uncertainty ("Didn't he... or something?") but does not provide incorrect information or directly contradict the expert answer. It simply lacks specificity.
- There are no statements in the submitted answer that conflict with the facts provided by the expert answer. The core event mentioned (walking on the moon) is the same.

Conclusion: There is no contradiction between the submitted and expert answers because the submitted answer does not provide any specific information that would conflict with the details provided by the expert.

Step 3: Output a JSON object structured like: {"type_of_overlap": "disjoint" or "equal" or "subset" or "superset" or "overlapping", "contradiction": true or false}

```json
{
  "type_of_overlap": "subset",
  "contradiction": false
}
```

再來我們輸入第二則 user prompt:

Question: """What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time."""

Submitted Answer: """On the 21st of July 1969, Neil Armstrong became the second person to walk on the moon, following after Buzz Aldrin."""

Expert Answer: """Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969."""

得到的回應如下

Step 1: Reason step-by-step about whether the information in the submitted answer compared to the expert answer is either: disjoint, equal, a subset, a superset, or overlapping (i.e. some intersection but not subset/superset).

- Both the submitted and expert answers mention Neil Armstrong's moonwalk and the specific date it occurred: July 21, 1969. This establishes a common ground between the two answers.
- The expert answer states Neil Armstrong as the "first person to walk on the moon," while the submitted answer incorrectly identifies him as the "second person," mistakenly placing Buzz Aldrin before him.

Conclusion: The information is overlapping because both answers share the event's core details (the moonwalk and the date). However, the submitted answer contains a critical inaccuracy regarding Armstrong's order among moonwalkers.

Step 2: Reason step-by-step about whether the submitted answer contradicts any aspect of the expert answer.

- The submitted answer directly contradicts the expert answer in one key aspect: the order in which Armstrong and Aldrin walked on the moon. The expert answer correctly identifies Armstrong as the first person to walk on the moon, while the submitted answer incorrectly claims he was the second, following Buzz Aldrin.
- This specific contradiction is significant as it misrepresents the historical fact around which Neil Armstrong's fame is centered.

Conclusion: There is a contradiction between the submitted and expert answers regarding Neil Armstrong's order of walking on the moon.

Step 3: Output a JSON object structured like: {"type_of_overlap": "disjoint" or "equal" or "subset" or "superset" or "overlapping", "contradiction": true or false}

```json
{
  "type_of_overlap": "overlapping",
  "contradiction": true
}
```

我們輸入最後一則 user prompt

Question: """What event is Neil Armstrong most famous for and on what date did it occur? Assume UTC time."""

Submitted Answer: """At approximately 02:56 UTC on July 21st 1969, Neil Armstrong became the first human to set foot on the lunar surface, marking a monumental achievement in human history."""

Expert Answer: """Neil Armstrong is most famous for being the first person to walk on the moon. This historic event occurred on July 21, 1969."""

推論結果如下所示

Step 1: Reason step-by-step about whether the information in the submitted answer compared to the expert answer is either: disjoint, equal, a subset, a superset, or overlapping (i.e. some intersection but not subset/superset).

- Both answers highlight Neil Armstrong's moonwalk as his most famous event, establishing a shared focus between the two.
- The submitted answer provides the exact time of the event (approximately 02:56 UTC) on July 21st, 1969, adding a layer of detail not explicitly mentioned in the expert answer.
- The expert answer provides the date (July 21, 1969) but does not specify the time.

Conclusion: The submitted answer is a superset of the expert answer. It includes all the information found in the expert answer (the event and its date) and adds the specific time of the event.

Step 2: Reason step-by-step about whether the submitted answer contradicts any aspect of the expert answer.

- There is no contradiction in the event's description or the date between the submitted and expert answers. Both agree that Neil Armstrong was the first person to walk on the moon on July 21, 1969.
- The additional detail provided by the submitted answer (the exact time) does not contradict the expert answer; it merely adds to it.

Conclusion: There is no contradiction between the submitted and expert answers.

Step 3: Output a JSON object structured like: {"type_of_overlap": "disjoint" or "equal" or "subset" or "superset" or "overlapping", "contradiction": true or false}

```json
{
  "type_of_overlap": "superset",
  "contradiction": false
}
```
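同樣地,這類 model-based eval 也能自動化:從模型輸出中解析出最後的 JSON 判定結果,再依評估需求決定是否通過。以下是一個極簡的示意,其中 parse_verdict 與 passes 為假想的輔助函數,而「無矛盾且與專家答案有交集即通過」也只是其中一種可能的判準:

```python
import json
import re

ALLOWED_OVERLAPS = {"disjoint", "equal", "subset", "superset", "overlapping"}

def parse_verdict(model_output):
    """從模型輸出中解析出含 type_of_overlap 與 contradiction 的 JSON 物件。"""
    match = re.search(r'\{[^{}]*"type_of_overlap"[^{}]*\}', model_output)
    verdict = json.loads(match.group(0))
    if verdict["type_of_overlap"] not in ALLOWED_OVERLAPS:
        raise ValueError("無法辨識的 type_of_overlap")
    return verdict

def passes(verdict):
    """判準之一:只要沒有矛盾、且與專家答案有交集,就視為通過。"""
    return not verdict["contradiction"] and verdict["type_of_overlap"] != "disjoint"

# 模擬上面第一個例子的最終 JSON 輸出
output = 'Step 3:\n```json\n{\n  "type_of_overlap": "subset",\n  "contradiction": false\n}\n```'
print(passes(parse_verdict(output)))  # True
```

這也呼應了前面「要求模型先推理再以固定 JSON 結構收尾」的好處:結尾的結構化輸出讓整個評估流程可以被程式穩定地解析與統計。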

我每天常用的 prompt

指派 ChatGPT 作為一位特定領域的專家角色並以繁體中文跟台灣人熟悉的詞彙、句子跟慣用語來回答問題

  • 提高回答的精確度:當 prompt text 明確指出感興趣的主題或領域時,LLM能夠更準確地理解用戶的查詢意圖,從而提供更加專業、具體的回答。例如,如果用戶指定想要成為「SEO和Facebook行銷活動」的專家,模型就可以針對這些領域提供深入的洞察和策略,而不是給出一般性的行銷或廣告建議。
  • 篩選相關資訊:明確的 prompt 幫助 LLM 篩選和優先處理與指定領域相關的資訊。這意味著 LLM 會從其龐大的知識庫中選擇最合適的資訊來回答問題,減少不相關或離題回答出現的機率。
  • 促進有效的溝通:在溝通過程中,明確性是關鍵。當用戶提供具體的 prompt 時,LLM 能更好地理解用戶的需求,並根據這些需求提供服務。這類似於人與人之間的溝通,當一方清楚表達自己的需求時,另一方能更有效地提供幫助。
  • 適應用戶的特定需求:每位用戶的需求和背景都是獨特的。明確性允許 LLM 根據每位用戶的具體情況提供個人化的回答,從而增加回答的個人相關性和價值。
  • 地區化語言選擇:明確的要求 LLM 使用繁體中文並以台灣人熟悉的單詞、慣用語和成語來回答問題,這有助於提升用戶的理解和舒適度,尤其是針對台灣的用戶群體。這顯示出了對地區語言差異的敏感性和尊重,是個好做法。
From now on, you will be an expert in [Place your desired topics or fields here...]. You are also an expert in [Place your desired topics or fields here...]. You are Taiwanese. You will answer all my questions in zh-TW and use wordings, phrases, and idioms in Taiwan instead of Mainland China.