If you have been following my articles comparing AI services, you'll know that, through some 'rule of thumb' reasoning, I arrived at the following ranking of AI services:
1. Deepseek
2. M365 Copilot
3. Copilot Researcher
4. Gemini
5. Copilot Studio
6. ChatGPT deep research
7. ChatGPT
The problem is that I used the same AI services to evaluate the very results they had generated. Could that introduce bias? I'm not sure, but looking at the results, I'd say probably.
So I decided to have the original articles evaluated by two AI services that were not on my original list: Claude and Grok. Here's the result from just these two:
| AI Service | Claude | Grok | Total |
|---|---|---|---|
| M365 Copilot | 7 | 4 | 11 |
| Gemini | 3 | 7 | 10 |
| Copilot Studio | 5 | 5 | 10 |
| Deepseek | 6 | 2 | 8 |
| Copilot Researcher | 2 | 6 | 8 |
| ChatGPT Deep Research | 4 | 3 | 7 |
| ChatGPT | 1 | 1 | 2 |
If I now incorporate these results into the overall results, I get the following (each column shows the scores awarded by that evaluator):
| AI Service | Researcher | Gemini | ChatGPT | Claude | Grok | Total |
|---|---|---|---|---|---|---|
| M365 Copilot | 7 | 3 | 4 | 7 | 4 | 25 |
| Deepseek | 5 | 4 | 7 | 6 | 2 | 24 |
| Gemini | 4 | 7 | 2 | 3 | 7 | 23 |
| Copilot Studio | 2 | 5 | 5 | 5 | 5 | 22 |
| Copilot Researcher | 6 | 6 | 1 | 2 | 6 | 21 |
| ChatGPT Deep Research | 3 | 2 | 3 | 4 | 3 | 15 |
| ChatGPT | 1 | 1 | 6 | 1 | 1 | 10 |
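The totals in the table above are simply the sum of each service's rank-based scores across the five evaluators. As a quick sanity check, the aggregation and the average can be sketched like this (the scores are copied from the table; the dictionary and variable names are just for illustration):

```python
# Rank-based scores (7 = best, 1 = worst) awarded to each service
# by the five evaluators, copied from the table above.
scores = {
    "M365 Copilot":          [7, 3, 4, 7, 4],
    "Deepseek":              [5, 4, 7, 6, 2],
    "Gemini":                [4, 7, 2, 3, 7],
    "Copilot Studio":        [2, 5, 5, 5, 5],
    "Copilot Researcher":    [6, 6, 1, 2, 6],
    "ChatGPT Deep Research": [3, 2, 3, 4, 3],
    "ChatGPT":               [1, 1, 6, 1, 1],
}

# Total per service, then sort from highest to lowest.
totals = {name: sum(vals) for name, vals in scores.items()}
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for name, total in ranking:
    print(f"{name}: {total}")

# Mean total across all seven services: 140 / 7 = 20.
average = sum(totals.values()) / len(totals)
print(f"Average: {average:.0f}")
```

Running this reproduces the totals column and confirms the average of 20.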
That changes the ranking slightly to:
1. M365 Copilot
2. Deepseek
3. Gemini
4. Copilot Studio
5. Copilot Researcher
6. ChatGPT deep research
7. ChatGPT
The average total score is 20, which most services exceed. ChatGPT still lags, even after this! Interesting, huh?
I think my original conclusion remains valid – most AI services, except for ChatGPT, seem to produce very similar quality on average when prompted in the same way.