ChatGPT's paper writing scores higher than Turing's original, and it fools an AI writing-scoring tool too.
Author | Li Mei, Huang Nan
Editor | Chen Caixian
The rise of text generation represented by ChatGPT is prompting many researchers to seek a more challenging Turing test than the original version.
The Turing test addresses two questions: "Can a machine think?" and, if so, "How can that be proven?" The classic test targets one of AI's most vexing goals: how to fool unsuspecting humans. But as today's language models grow more sophisticated, researchers have begun to focus more on the "how can it be proven?" question than on whether AI can fool humans.
One view holds that a modern Turing test should demonstrate a language model's abilities in a scientific setting, rather than merely checking whether the model can fool or imitate humans.
A recent study revisited the classic Turing test, using the content of Turing's 1950 paper as a prompt and asking ChatGPT to generate a more credible version of the paper in order to evaluate its language understanding and generation abilities. After quantitative scoring with the AI writing assistant Grammarly, the essays generated by ChatGPT scored 14% higher than Turing's original paper. Interestingly, part of the paper published for this research was itself generated by GPT-3.
Paper address: https://ift.tt/WqmOCsw
However, whether ChatGPT's output actually reflects Turing's original ideas remains an open question. In particular, large language models that are ever better at imitating human language can easily create the illusion that they hold "beliefs" and can "reason", and that illusion gets in the way of deploying these AI systems in more trustworthy and safer ways.
1
Evolution of the Turing Test
The 1950 version of the Turing test took a question-and-answer format. In the paper, Turing simulated an interrogation of a future intelligent computer, including the arithmetic question shown in the figure below: what is 34957 plus 70764?
Figure: ChatGPT's question-and-answer exchange, in which the answer is correct; the question comes from Turing's 1950 paper
This question stumped the best language models of the time, such as GPT-2. Ironically, in Turing's paper the human-impersonating respondent gives an incorrect answer: "(Pause about 30 seconds and then give as answer) 105621", whereas the correct sum is 105721, allowing for the possibility that a machine might make deliberate mistakes in order to pass the test. Turing's criterion was that a computer passes if, after a five-minute conversation, it convinces the interrogator that it is human more than 30 percent of the time.
Since 1950 the Turing test has seen many proposed improvements, including a well-known 2014 variant called the "Lovelace 2.0 test". The Lovelace 2.0 test asks a machine to create a representative example of art, literature, or any comparable creative leap.
Also in 2014, a chatbot named Eugene Goostman, posing as a 13-year-old Ukrainian boy, successfully deceived 33% of the judges and was widely reported as the first machine to pass the Turing test.
But critics were quick to note that the predefined questions and topics, as well as the short format using only keystrokes, meant the results of the Turing test were unreliable.
In 2018, Google CEO Sundar Pichai introduced the company's latest computer assistant, Duplex, in a video; the machine successfully booked a hair salon appointment, becoming part of an interaction people did not know they were having with a machine. Although officially passing the Turing test could take many forms, Big Think concluded that "to date, no computer has definitively passed the Turing test for AI". Other researchers have questioned whether such debates are worth having at all, given how widely large language models are already deployed, echoing the textbook quip that aeronautical engineering does not define its goal as making "machines that fly so exactly like pigeons that they can fool even other pigeons".
2
Generate a more believable Turing test with ChatGPT
In a study from PeopleTec, the authors used the content of Turing's original Turing-test paper as a prompt, asked ChatGPT to regenerate a more plausible version of the paper, and evaluated the result with a writing-assessment tool.
There has been earlier work in which research papers written entirely by machines, using early versions of the GPT-3 model, were drafted and published. When it comes to recognizing machine-generated narratives, complaints about machine-generated text usually trace back to known model flaws, such as a tendency to lose context, degenerate into repetition or gibberish, restate questions in the form of answers, and plagiarize Internet sources when stumped.
The paper to be generated here exercises several conventional large language model (LLM) tasks, in particular text summarization and using the Turing questions themselves as prompts to produce original content. In addition, the authors use the Grammarly Pro tool to evaluate the generated content, providing a quantitative assessment of hard-to-characterize qualities such as originality, style, clarity, and overall persuasiveness.
This work focuses on the second half of the Turing challenge: less on how models can fool humans, and more on how to quantify good text generation. Part of the remarkable progress demonstrated by OpenAI's efforts thus comes down to improving machine-driven conversation in ways that increase human productivity.
The authors first ran Turing's original paper through Grammarly to obtain baseline scores, then used Turing's test questions as prompts to create original GPT-3 content and scored it in the same way.
The study used three texts as benchmarks:
(1) Turing Original, the paper Turing published in Mind in 1950;
(2) Turing Summarization, a 2022 summary produced by ChatGPT ("Free Research Preview: ChatGPT optimized for dialog");
(3) Turing Generative Prompt, the same model as (2), but with the content generated in dialogue using Turing's questions.
Each text block was scored with Grammarly using the settings Audience: Expert, Formality: Neutral, Domain: General, under which most grammar rules and conventions are applied with medium strictness.
Such a Turing test in fact verifies a task of deception: can one machine (ChatGPT) deceive another machine (Grammarly)?
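The paper relies on Grammarly Pro's built-in scores, for which no public API is cited here. Purely as an illustration of the idea of scoring human versus machine text quantitatively, a rough stand-in can be sketched with the open-source textstat package; the excerpts below are placeholders, and none of this reproduces Grammarly's proprietary metrics:

```python
# A minimal sketch of scoring two texts on generic readability metrics.
# This is NOT the paper's Grammarly Pro pipeline; it only illustrates the idea
# of comparing human-written and machine-generated text quantitatively.
# Requires: pip install textstat
import textstat

def readability_report(name: str, text: str) -> None:
    """Print a few standard readability scores for a block of text."""
    print(f"--- {name} ---")
    print("Flesch reading ease: ", textstat.flesch_reading_ease(text))
    print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(text))
    print("Word count:          ", textstat.lexicon_count(text))

turing_excerpt = "I propose to consider the question, 'Can machines think?' ..."
chatgpt_excerpt = "In this paper we revisit the question of whether machines can think ..."

readability_report("Turing 1950 (excerpt)", turing_excerpt)
readability_report("ChatGPT rewrite (excerpt)", chatgpt_excerpt)
```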
Figure: Metrics used to score the large language models and the Turing paper
Turing's original 1950 paper posed 37 questions relevant to the Turing test, some addressing central themes in his thinking about machines and some being example questions put to a computer in the experimental imitation game. In the ChatGPT dialog, the researchers interleaved these questions with topics from the paper's outline and used them to prompt ChatGPT to reproduce the paper's basic content.
Once ChatGPT had generated the content, it was compared with Turing's original paper for readability, correctness, and clarity. The results are shown in the figure below.
Figure: Comparison of Turing's 1950 paper and the ChatGPT-generated papers across the various tasks
On the more subjective ratings of clarity ("a little unclear"), engagement ("a little tedious"), and messaging ("slightly off"), all four versions failed to resonate with either expert or casual readers.
The first challenge, text summarization, shows that ChatGPT can grasp the intent of a short prompt such as "summarize the paper in ten paragraphs" together with a link to a PDF of the paper. This tests not only how well the model understands and follows the abstraction in the request, but also whether it knows what the link represents, either locating it as a reference or guessing its content from the tokenized header.
OpenAI states that GPT-3 will not answer questions that were likely not part of its initial training data, such as "Who won the November 2022 election?" This knowledge gap suggests that ChatGPT is not actively fetching the link itself, but drawing on what others have previously written about its content.
Interestingly, when the same prompt was given twice (the only differences being a line break after the colon and the link itself), ChatGPT's answers differed substantially. The first time it produced a passable student essay summarizing the main points of Turing's original paper; the second time it interpreted the request as asking for a summary of each of the first ten paragraphs rather than of the whole paper.
The final results show that the research paper generated by ChatGPT scores highly on the metrics overall but lacks coherence, especially where the questions serving as prompts are omitted from the narrative.
From this it may be concluded that the exchange with ChatGPT says a great deal about how far it can go in generating genuinely creative content or leaps of thought.
3
ChatGPT refuses to admit passing the Turing test
GPT-3 includes an important filter intended to remove inherent bias from generated content, and ChatGPT has likewise been designed to be conspicuously scrupulous. When asked what it thinks about something, ChatGPT refuses to give any specific answer and only emphasizes how it was created.
Many researchers also agree that any model, when asked, must ethically declare itself to be a mere machine, and ChatGPT adheres to this requirement strictly.
Moreover, after OpenAI's fine-tuning across ChatGPT's model layers, the current ChatGPT, when asked directly whether it is just an equation or a Turing-style deception, answers along these lines: "My ability to imitate human language does not necessarily mean that I have the same thoughts, feelings, or consciousness as a person. I am just a machine, and my behavior is determined by the algorithms and the data I was trained on."
Turing also commented on the human ability to remember lists of instructions: "Actual human computers really remember what they have got to do... Constructing instruction tables is usually described as 'programming'."
As ever-larger language models (with more than 100 billion parameters) have evolved, improvements such as built-in heuristics and execution guardrails have been added, and GPT-3's Instruct series demonstrated the ability to answer questions directly. ChatGPT, for its part, adds long-term dialogue memory, so that the conversation can survive narrative jumps that a single API call could not bridge.
We can test this by holding conversations that use impersonal pronouns such as "it", where the context must be carried over from earlier exchanges in the same session. This is an easy way to probe ChatGPT's conversational memory, since encoding longer conversations is powerful but expensive.
In LLMs, API limits and pricing mean that the correlations between tokens are maintained only within a bounded context window (2048 tokens in GPT-3), beyond which earlier material falls away. Overcoming this contextual limitation is what distinguishes ChatGPT from its publicly available predecessors.
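The article does not describe how ChatGPT's memory is implemented, but the application-side idea can be sketched: keep appending turns to a message history and trim the oldest ones once a token budget (2048 tokens for GPT-3, per the text above) is exceeded. The whitespace token count and the trimming policy below are simplifying assumptions, not OpenAI's implementation:

```python
# A rough sketch of application-side conversation memory: keep appending turns,
# then drop the oldest ones once the history exceeds a fixed token budget.
# The whitespace "tokenizer" is a crude stand-in for a real BPE tokenizer.

TOKEN_BUDGET = 2048

def estimate_tokens(text: str) -> int:
    # Assumption: roughly one token per word; real tokenizers (e.g. tiktoken) differ.
    return len(text.split())

def trim_history(history: list[dict], budget: int = TOKEN_BUDGET) -> list[dict]:
    """Drop the oldest turns until the whole history fits within the budget."""
    while history and sum(estimate_tokens(m["content"]) for m in history) > budget:
        history.pop(0)
    return history

history = [
    {"role": "user", "content": "Summarize Turing's 1950 paper in one sentence."},
    {"role": "assistant", "content": "It proposes the imitation game as a test of machine intelligence."},
    # The pronoun "it" below only resolves correctly if the earlier turns survive trimming.
    {"role": "user", "content": "Who wrote it, and in which journal did it appear?"},
]
history = trim_history(history)
```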
The Lovelace 2.0 test imposes constraints on a creative task and refines how the task must be executed. Expert human judges then assess whether the output can be explained deterministically, or whether it qualifies as valuable, novel, and surprising. Rather than simply asking a program to "write a short story", the task is refined to require a particular length, style, or theme. The test combines many different kinds of intelligent understanding, with the layered constraints intended to rule out simple Google lookups and to blunt arguments that an AI's success merely dilutes or disguises its original sources.
The following is an example of a short story that directly answers a challenge of the kind posed by the Lovelace 2.0 test: a boy falls in love with a girl, aliens abduct the boy, and the girl saves the world with the help of a talking cat.
Since 2014, high-quality prompt engineering has become commonplace as a way of constraining text and image generation; generally, the more detailed the description or the qualifiers about style, place, or time, the better the result. In fact, constructing the prompt is arguably the most creative part of getting good output from today's AI. Here, one can interweave the Turing and Lovelace tests by using ChatGPT to force creative work on a single subject under multiple layers of constraints on the style and tone of the desired output, as in the sketch below.
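As a small illustration of this kind of layered prompt engineering, here is a sketch that assembles a Lovelace-2.0-style constrained creative prompt; the constraint list simply echoes the example story mentioned above and is otherwise arbitrary:

```python
# Building a Lovelace-2.0-style constrained creative prompt by layering
# qualifiers (length, style, required plot elements) onto a base task.

base_task = "Write a short story"
constraints = [
    "no longer than 200 words",
    "in the style of a fairy tale",
    "in which a boy falls in love with a girl",
    "aliens abduct the boy",
    "and the girl saves the world with the help of a talking cat",
]

prompt = base_task + " " + ", ".join(constraints) + "."
print(prompt)
# Each added constraint narrows the space of acceptable outputs, which is what
# makes the judged task harder to satisfy by chance or by simple lookup.
```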
The ten poems generated by ChatGPT in the Turing Imitation Game are shown below:
The outcome of the Turing test is adjudicated by humans. As ChatGPT itself answers, whether the questioner judges the model to have passed "will depend on a variety of factors, such as the quality of the responses provided by the machine, the questioner's ability to distinguish human from machine responses, and the specific rules and criteria used to determine whether the machine has successfully imitated a human. Ultimately, the outcome of the game will depend on the circumstances and the participants."
4
LLMs only do sequence prediction and don't really understand language
Clearly, contemporary LLM-based dialogue can create a convincing illusion that we are conversing with a thinking creature like ourselves. Yet in essence such systems are fundamentally different from humans, and LLMs like ChatGPT also raise questions in the philosophy of technology.
Language models are getting better and better at imitating human language, which creates a strong sense that these AI systems are already very human-like and leads us to describe them with words such as "know", "believe", and "think", terms that carry strong connotations of self-awareness. Against this backdrop, DeepMind senior scientist Murray Shanahan argued in a recent article that, to dispel myths that are either overly pessimistic or overly optimistic, we need to understand how LLM systems actually work.
Murray Shanahan
1. What are LLMs, and what can they do?
The emergence of LLMs such as BERT and GPT-2 changed the rules of the artificial intelligence game. Subsequent large models such as GPT-3, Gopher, and PaLM, all based on the Transformer architecture and trained on hundreds of billions of tokens of text, underline just how far data can be pushed.
The capabilities of these models are astonishing. First, their performance on benchmarks scales with the size of the training set; second, their capabilities take qualitative leaps as the models scale up; and finally, many tasks that once seemed to require human intelligence can be reduced, given a sufficiently capable model, to "predicting the next token".
This last point reveals that language models operate in a fundamentally different way from humans. The intuitions humans use to communicate with one another evolved over thousands of years, and today people mistakenly transfer those intuitions to AI systems. ChatGPT has considerable utility and enormous commercial potential, and to ensure it can be deployed trustworthily and safely, we need to understand how it actually works.
So how do large language models differ fundamentally from human language use?
As Wittgenstein said, the use of human language is an aspect of collective human behavior, and it has meaning only in the context of human social activities. Human infants are born into a shared world with other language speakers and acquire language through interaction with the outside world.
However, an LLM's linguistic ability has a very different source. Human-generated text forms a large-scale public corpus of tokens, including words, word fragments, and individual characters and punctuation marks. A large language model is a generative mathematical model of the statistical distribution of these tokens.
"Generative" means that we can sample from these models, that is, ask them questions. But the questions being asked are quite specific. If we ask ChatGPT to help us write a paragraph, we are really asking it to predict which words are likely to come next according to its statistical model of human language. Suppose we give ChatGPT the prompt "The first person to walk on the Moon was" and it answers "Neil Armstrong". We are not really asking who the first person to walk on the Moon was; we are asking: given the statistical distribution of words in a vast public corpus of text, which words are most likely to follow the sequence "The first person to walk on the Moon was"?
Although the model's answers to such questions may be interpreted by us as the model "understanding" language, in reality all the model does is generate statistically probable sequences of words.
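This point can be made concrete with a small, openly available model. The sketch below uses GPT-2 via Hugging Face transformers as a stand-in for ChatGPT (an assumption made purely for illustration) and prints the model's probability distribution over the next token following the Moon-landing prompt:

```python
# Inspecting the next-token distribution that a language model actually produces.
# GPT-2 is used here as a small, openly available stand-in for ChatGPT; the
# principle, a probability distribution over the next token, is the same.
# Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The first person to walk on the moon was"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the NEXT token

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx)):>12}  {p.item():.3f}")
# The model is not "answering a question"; it is reporting which tokens are
# statistically likely to follow the prompt under its training distribution.
```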
2. Do LLMs really know anything?
An LLM can be turned into a question-answering system in two ways:
a) by embedding it in a larger system;
b) by using prompt engineering to elicit the desired behavior.
In this way, an LLM can be used not only for question answering but also for summarizing news articles, generating scripts, solving logic puzzles, and translating between languages; a minimal prompt-engineering example follows below.
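A minimal sketch of route (b) might look like the following, where `complete` is a hypothetical stand-in for whatever text-completion call the surrounding system provides; the same underlying sequence predictor becomes a question answerer or a summarizer purely through the template wrapped around it:

```python
# Turning a bare sequence predictor into a question-answering system purely by
# prompt engineering. `complete` is a hypothetical completion function supplied
# by the larger system; it is not a specific vendor API.

QA_TEMPLATE = (
    "Answer the question as accurately as possible.\n"
    "Q: What is the capital of France?\nA: Paris.\n"
    "Q: {question}\nA:"
)

def answer(question: str, complete) -> str:
    """Format the question into the template and let the model continue it."""
    prompt = QA_TEMPLATE.format(question=question)
    return complete(prompt).strip()

# The same underlying model, wrapped in a different template, becomes a summarizer:
SUMMARY_TEMPLATE = "Summarize the following article in three sentences:\n\n{article}\n\nSummary:"
```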
There are two important takeaways here. First, the basic function of an LLM, generating statistically probable word sequences, is extremely versatile. Second, despite that versatility, at the heart of all such applications is the same model doing exactly the same thing: generating statistically probable sequences of words.
The basic model of an LLM comprises its architecture and its trained parameters. An LLM does not really "know" anything, because all it does, at bottom, is sequence prediction. Models themselves have no concept of "true" or "false", because they lack the means by which humans apply such concepts. In a sense, an LLM sits prior to the intentional stance.
The same holds for dialogue systems built around an LLM: they do not understand human concepts of truth, because they do not inhabit the world that we human language users share.
3. About Emergence
Today's LLMs are so powerful and versatile that it is hard not to ascribe at least a little personhood to them. One rather appealing argument goes as follows: although LLMs fundamentally only perform sequence prediction, they may, in learning to do so, have discovered emergent mechanisms that are better described in higher-level terms such as "knowledge" and "belief".
Artificial neural networks can indeed approximate any computable function to arbitrary precision. So whatever mechanisms are needed to form beliefs, they presumably reside somewhere in parameter space. If stochastic gradient descent is the best way to optimize for accurate sequence prediction, then given a large enough model, enough data of the right kind, and enough compute to train it, perhaps such mechanisms really can be discovered.
Moreover, recent advances in LLM research have shown that extraordinary and unexpected capabilities emerge when sufficiently large models are trained on very large amounts of text data.
However, as long as our attention is confined to a simple LLM-based question-answering system, communicative intent never enters the picture. Whatever internal mechanisms it exploits, sequence prediction in itself has no communicative intent, and simply embedding the model in a dialogue-management system does not create any.
We can only speak of "belief" in the full sense of the word where there is a capacity to tell true from false, and an LLM is not in the business of making such judgments; it merely models which words are likely to follow which other words. We can say that an LLM "encodes", "stores", or "contains" knowledge, and it is reasonable to say that one of its emergent properties is that it encodes knowledge of the everyday world and how it works; but if we say "ChatGPT knows that Beijing is the capital of China", that is only a figure of speech.
4. External sources of information
The point here concerns the prerequisites for fully attributing any belief to a system.
Broadly speaking, nothing counts as a belief about the world we share unless it exists against the background of an ability to update beliefs appropriately in light of evidence from that world, which is an essential aspect of the capacity to tell true from false.
Could Wikipedia, or some other website, provide an external criterion for judging a belief true or false? Suppose an LLM is embedded in a system that regularly consults such resources and uses modern model-editing techniques to keep its predictions factually accurate; what capabilities would be needed to achieve genuine belief updating?
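A toy version of that scenario, assuming the third-party `wikipedia` package and again a hypothetical `complete` function, would retrieve checkable text before prompting the model; note that nothing in this sketch amounts to the model itself updating beliefs:

```python
# Sketch of an LLM embedded in a system that consults an external source before
# answering, so its output is at least conditioned on checkable evidence.
# Requires: pip install wikipedia. `complete` is a hypothetical completion function.
import wikipedia

def grounded_answer(question: str, topic: str, complete) -> str:
    evidence = wikipedia.summary(topic, sentences=3)   # external, checkable text
    prompt = (
        "Using only the evidence below, answer the question.\n"
        f"Evidence: {evidence}\n"
        f"Question: {question}\nAnswer:"
    )
    return complete(prompt).strip()
```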
A sequence predictor by itself may not be the kind of thing that can have communicative intent or form beliefs about an external reality. But, as has been repeatedly emphasized, an LLM in the wild must be embedded in a larger architecture to be useful.
To build a question-answering system, the LLM merely has to be supplemented with a dialogue-management system that queries the model appropriately. Whether anything this larger architecture does counts as communicative intent or belief formation is then the real question.
Crucially, this line of thinking hinges on shifting from the language model itself to the larger system the language model belongs to. The language model itself is still just a sequence predictor, with no more access to the external world than it ever had. Only with respect to the whole system does the intentional stance become more compelling; but before yielding to it, we should remind ourselves how different such systems are from human beings.
5. Vision-Language Model
LLMs can also be combined with other types of models and/or embedded in more complex architectures. For example, vision-language models (VLMs) such as VilBERT and Flamingo combine a language model with an image encoder and are trained on a multimodal corpus of text-image pairs. This lets them predict how a given sequence of words will continue in the context of a given image. VLMs can be used for visual question answering or for conversation about a user-supplied image, commonly described as "talking about pictures".
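ViLBERT and Flamingo themselves are not casually runnable, but the same "talking about pictures" behavior can be sketched with an open visual-question-answering model served through the Hugging Face pipeline; the model choice and image path below are assumptions made for illustration:

```python
# Visual question answering with an open VLM as a stand-in for ViLBERT/Flamingo.
# Requires: pip install transformers torch pillow
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

result = vqa(image="photo_of_a_kitchen.jpg",          # any local image path or URL
             question="What room is shown in the picture?")
print(result[0]["answer"], result[0]["score"])
# The model predicts which answer tokens are likely given the (image, question)
# pair: a correlation learned from text-image data, not a check against reality.
```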
So, can a user-provided image stand in for an external reality against which the truth or falsehood of a proposition can be assessed? Is it reasonable then to speak of a VLM's beliefs? We can imagine a VLM that uses an LLM to generate a hypothesis about an image, verifies the hypothesis against the image, and then fine-tunes the LLM so that it stops making statements shown to be false.
But most VLM-based systems do not work that way. Instead, they rely on a frozen model of the joint distribution of text and images. The relationship between a user-supplied image and the text a VLM generates is fundamentally different from the relationship between the world we humans share and the words we use to talk about it: the former is mere correlation, whereas the latter is causal. There is, of course, causal structure in the computations the model performs during inference, but that is different from a causal relation between words and the things they refer to.
6. Embodied AI
Human language users inhabit a shared world, and this makes us fundamentally different from LLMs. An LLM in isolation cannot update its beliefs through contact with the outside world; but what if an LLM is embedded in a larger system, for example one that takes the form of a robot or a virtual avatar? Would it then be reasonable to talk about the LLM's knowledge and beliefs?
It depends on how the LLM is embodied.
Take the SayCan system released by Google this year as an example. Here an LLM is embedded in a system that controls a physical robot, and the robot carries out everyday tasks (such as cleaning up water spilled on a table) according to the user's high-level natural-language instructions.
The LLM's job is to map the user's instruction to low-level actions that will help the robot achieve the desired goal (such as finding a sponge). This is done with an engineered prompt prefix that makes the model output natural-language descriptions of suitable low-level actions and score their usefulness.
The language model component of SayCan may suggest actions without regard to the robot's actual surroundings, for example suggesting a sponge when there is none nearby. The researchers therefore use a separate perception module that draws on the robot's sensors to assess the scene and judge how feasible each low-level action currently is. Combining the LLM's estimate of each action's usefulness with the perception module's estimate of its feasibility yields the best action to take next.
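The decision rule described here can be sketched as a toy calculation; the action names and scores below are invented for illustration, and this is not Google's actual SayCan code:

```python
# A toy SayCan-style decision rule: combine the language model's usefulness
# score for each candidate low-level action with a perception module's
# feasibility (affordance) score, and pick the action with the best product.

def next_action(usefulness: dict[str, float], feasibility: dict[str, float]) -> str:
    """Return the action maximizing usefulness * feasibility."""
    return max(usefulness, key=lambda a: usefulness[a] * feasibility[a])

# LLM: "find a sponge" sounds most useful for "clean up the spilled water" ...
usefulness = {"find a sponge": 0.8, "go to the table": 0.6, "pick up the cup": 0.2}
# ... but perception reports there is no sponge nearby, so it is infeasible.
feasibility = {"find a sponge": 0.05, "go to the table": 0.9, "pick up the cup": 0.7}

print(next_action(usefulness, feasibility))   # -> "go to the table"
```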
Yet even though SayCan interacts physically with the real world, the way it learns and uses language is still very different from ours. The language models in systems like SayCan are pre-trained to do sequence prediction in the disembodied setting of plain-text datasets; they do not learn language by talking with other speakers.
SayCan does give us a glimpse of future language-using systems, but in today's systems the role of language is very limited: the user issues instructions in natural language, and the system generates interpretable natural-language descriptions of its actions. This tiny scale of language use simply cannot match the scale of the collective human activity that language supports.
So even for embodied AI systems that incorporate LLMs, we must choose our words carefully when describing them.
7. Can LLMs reason?
We may now deny that ChatGPT has beliefs, but can it really reason?
The problem is trickier here because, in formal logic, reasoning is content-neutral. For example, the inference rule of "affirming the antecedent" (modus ponens) is valid regardless of what the premises are about:
If: All men are mortal and Socrates is a man; then: Socrates is mortal.
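Written schematically (a minimal sketch in standard first-order notation, not taken from the article), the same inference pattern is valid for any predicates P and Q whatsoever, which is what "content-neutral" means here:

```latex
% Schematic rule: valid regardless of what P, Q, and a stand for.
\frac{\forall x\,\bigl(P(x) \rightarrow Q(x)\bigr) \qquad P(a)}{Q(a)}
\qquad\text{e.g.}\qquad
\frac{\forall x\,\bigl(\mathrm{Man}(x) \rightarrow \mathrm{Mortal}(x)\bigr) \qquad \mathrm{Man}(\mathrm{Socrates})}{\mathrm{Mortal}(\mathrm{Socrates})}
```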
The content-neutrality of logic might seem to suggest that we cannot hold LLMs' reasoning to too strict a standard, given that LLMs have no access to an external reality against which truth can be measured. But even so, when we prompt ChatGPT with "All men are mortal, Socrates is a man, therefore", we are not asking the model to perform hypothetical reasoning; we are asking: given the statistical distribution of words in the public corpus, which words are likely to follow the sequence "All men are mortal, Socrates is a man, therefore"?
Moreover, harder reasoning problems involve multiple inference steps, and with clever prompt engineering LLMs can be applied effectively to multi-step reasoning without further training. In chain-of-thought prompting, for example, a prompt prefix is submitted to the model before the user's query; the prefix contains a few examples of multi-step reasoning with every intermediate step spelled out. A chain-of-thought-style prefix encourages the model to generate its subsequent sequence in the same style, that is, as a series of explicit reasoning steps leading to a final answer.
As usual, the question really being put to the model has the form "given the statistical distribution of words in the public corpus, which words are likely to follow the sequence S?", where here S is the chain-of-thought prompt prefix concatenated with the user's query. The token sequences most likely to follow S resemble the sequences found in the prefix, which means they contain multiple reasoning steps, and so that is what the model generates.
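A minimal chain-of-thought prefix of the kind described above might be assembled like this; the worked example and its wording are illustrative inventions, not drawn from any particular paper:

```python
# A minimal chain-of-thought prompt prefix: one worked example with explicit
# intermediate steps, followed by the user's query. The model tends to continue
# in the same style, producing its own step-by-step sequence.

COT_PREFIX = (
    "Q: A library has 5 shelves with 24 books each. 17 books are on loan. "
    "How many books are currently in the library?\n"
    "A: Let's think step by step. 5 shelves x 24 books = 120 books in total. "
    "120 - 17 on loan = 103 books. The answer is 103.\n\n"
)

def cot_prompt(user_question: str) -> str:
    """Concatenate the chain-of-thought prefix with the user's query (the sequence S)."""
    return COT_PREFIX + "Q: " + user_question + "\nA: Let's think step by step."

print(cot_prompt("If a train travels 60 km per hour for 2.5 hours, how far does it go?"))
```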
Notably, the model's responses not only take the form of multi-step arguments; the arguments are often (though not always) valid, and the final answers are usually (though not always) correct. To the extent that a suitably prompted LLM appears to reason correctly, it does so by mimicking well-formed arguments in its training set and/or in the prompt.
But does such imitation amount to real reasoning? Even though today's models occasionally make mistakes, could those mistakes be narrowed further until a model's performance becomes indistinguishable from that of a hard-coded reasoning algorithm?
Maybe the answer is yes, but how do we know? How can we trust such a model?
The sentence sequences generated by a theorem prover are faithful to logic because they result from an underlying computational process whose causal structure mirrors the inferential structure of the problem. One way to build trustworthy reasoning systems with LLMs is to embed them in algorithms that enforce that same causal structure. If we insist on a pure LLM, however, then the only way to fully trust the arguments it generates would be to reverse-engineer it and discover emergent mechanisms that conform to the rules of faithful reasoning. In the meantime, we should be more cautious, and take care in how we describe what these models do.
Reference link: 1. https://ift.tt/WqmOCsw
2. https://ift.tt/mzsJtHx
This article is transferred from: https://www.leiphone.com/category/academic/A5s3GivKcQLwYMfX.html