Evaluating LLMs has emerged as a pivotal challenge in advancing the reliability and utility of artificial intelligence across both research and industry settings. As the capabilities of these models expand, so too does the need for rigorous, reproducible, and multi-faceted evaluation methodologies. In this tutorial, we provide a comprehensive examination of one of the field's most critical frontiers: systematically evaluating the strengths and limitations of LLMs across various dimensions of performance. Using Google's cutting-edge Generative AI models as benchmarks and the LangChain library as our orchestration tool, we present a robust and modular evaluation pipeline tailored for implementation in Google Colab. This framework integrates criterion-based scoring, encompassing correctness, relevance, coherence, and conciseness, with pairwise model comparisons and rich visual analytics to deliver nuanced and actionable insights. Grounded in expert-validated question sets and objective ground-truth answers, this approach balances quantitative rigor with practical adaptability, offering researchers and developers a ready-to-use, extensible toolkit for high-fidelity LLM evaluation.
!pip install langchain langchain-google-genai ragas pandas matplotlib
We install the key Python libraries for building and running AI-powered workflows: LangChain for orchestrating LLM interactions (with the langchain-google-genai extension for Google's generative AI), Ragas for evaluating retrieval-augmented generation, and pandas plus matplotlib for data manipulation and visualization.
import os
import pandas as pd
import matplotlib.pyplot as plt
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.evaluation import load_evaluator
from langchain.schema import HumanMessage
We import core Python utilities, including os for environment management, pandas for handling DataFrames, and matplotlib.pyplot for plotting, alongside LangChain's Google Generative AI client, prompt templating, chain construction, evaluator loader, and the HumanMessage schema to build and evaluate conversational LLM pipelines.
os.environ["GOOGLE_API_KEY"] = "Use Your API Key"
Here, we configure the environment by storing your Google API key in the GOOGLE_API_KEY variable, allowing the LangChain Google Generative AI client to authenticate requests securely.
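As a small optional refinement (a sketch, not part of the original tutorial), you can prompt for the key at runtime instead of hard-coding it in the notebook; the environment variable name is the one the LangChain Google client reads.
# Optional: prompt for the key at runtime instead of hard-coding it.
import os
from getpass import getpass

if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")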
def create_evaluation_dataset():
    """Create a simple dataset for evaluation."""
    questions = [
        "Explain the concept of quantum computing in simple terms.",
        "How does a neural network learn?",
        "What are the main differences between SQL and NoSQL databases?",
        "Explain how blockchain technology works.",
        "What is the difference between supervised and unsupervised learning?"
    ]
    ground_truth = [
        "Quantum computing uses quantum bits or qubits that can be in multiple states simultaneously, unlike classical bits. This allows quantum computers to process certain types of information much faster than classical computers for specific problems.",
        "Neural networks learn through a process called backpropagation where they adjust the weights between neurons based on the error between predicted and actual outputs, gradually minimizing this error through many iterations of training data.",
        "SQL databases are relational with structured schemas, fixed tables, and use SQL for queries. NoSQL databases are non-relational, schema-flexible, and designed for specific data models like document, key-value, wide-column, or graph formats.",
        "Blockchain is a distributed ledger technology where data is stored in blocks that are linked cryptographically. Each block contains transaction data and a timestamp, creating an immutable chain. Consensus mechanisms verify transactions without central authority.",
        "Supervised learning uses labeled data where the algorithm learns to predict outputs based on input-output pairs. Unsupervised learning works with unlabeled data to find patterns or structures without predefined outputs."
    ]
    return pd.DataFrame({"question": questions, "ground_truth": ground_truth})
We construct a small evaluation DataFrame by pairing five example questions on AI and database concepts with their corresponding ground-truth answers, making it easy to benchmark an LLM's responses against predefined correct outputs.
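As a quick, purely illustrative sanity check (not part of the original pipeline), you can build the dataset and preview it before involving any models:
# Build the dataset and confirm it has five question/ground-truth pairs.
dataset = create_evaluation_dataset()
print(dataset.shape)    # expected: (5, 2)
print(dataset.head(2))  # preview the first two rows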
def setup_models():
    """Set up different Google Generative AI models for comparison."""
    models = {
        "gemini-2.0-flash-lite": ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0),
        "gemini-2.0-flash": ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
    }
    return models
Now, this function instantiates two zero-temperature ChatGoogleGenerativeAI clients, one using the lightweight "gemini-2.0-flash-lite" model and the other the full "gemini-2.0-flash" model, so you can easily compare their outputs side by side.
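Before running the full pipeline, a quick smoke test (an illustrative snippet, assuming your API key is already configured) confirms both clients respond:
# Illustrative smoke test: one short prompt per model.
models = setup_models()
for name, model in models.items():
    reply = model.invoke([HumanMessage(content="Reply with one word: ready")])
    print(f"{name}: {reply.content.strip()}")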
def generate_responses(models, dataset):
    """Generate responses from each model for the questions in the dataset."""
    responses = {}
    for model_name, model in models.items():
        model_responses = []
        for question in dataset["question"]:
            try:
                response = model.invoke([HumanMessage(content=question)])
                model_responses.append(response.content)
            except Exception as e:
                print(f"Error with model {model_name} on question: {question}")
                print(f"Error: {e}")
                model_responses.append("Error generating response")
        responses[model_name] = model_responses
    return responses
This function loops through each configured model and each question in the dataset, invokes the model to generate a response, catches any errors (logging them and inserting a placeholder), and returns a dictionary mapping each model's name to its list of generated answers.
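If you hit transient rate limits or network errors in Colab, a small retry wrapper can make generation more robust. The sketch below is an optional extension, not part of the original code, and the helper name invoke_with_retry is hypothetical.
import time

def invoke_with_retry(model, question, retries=3, delay=5):
    """Call the model, retrying a few times before falling back to the same
    placeholder string that generate_responses uses."""
    for attempt in range(retries):
        try:
            return model.invoke([HumanMessage(content=question)]).content
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(delay)
    return "Error generating response"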
def evaluate_responses(models, dataset, responses):
    """Evaluate model responses using different evaluation criteria."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
    reference_criteria = ["correctness"]
    reference_free_criteria = [
        "relevance",
        "coherence",
        "conciseness"
    ]
    results = {model_name: {criterion: [] for criterion in reference_criteria + reference_free_criteria}
               for model_name in models.keys()}
    for criterion in reference_criteria:
        evaluator = load_evaluator("labeled_criteria", criteria=criterion, llm=evaluator_model)
        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                ground_truth = dataset["ground_truth"][i]
                response = responses[model_name][i]
                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        reference=ground_truth,
                        input=question
                    )
                    normalized_score = float(eval_result.get('score', 0)) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)
    for criterion in reference_free_criteria:
        evaluator = load_evaluator("criteria", criteria=criterion, llm=evaluator_model)
        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                response = responses[model_name][i]
                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        input=question
                    )
                    normalized_score = float(eval_result.get('score', 0)) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)
    return results
This function leverages a "gemini-2.0-flash-lite" evaluator to score each model's answers on both reference-based correctness and reference-free metrics (relevance, coherence, conciseness), normalizes those scores, and returns a nested dictionary mapping each model and criterion to its list of evaluation results.
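To see what the scoring loop is working with, you can inspect a single raw evaluator output. This snippet is illustrative (it assumes dataset and responses already exist from the earlier steps), and the exact fields returned, typically score, value, and reasoning, may vary across LangChain versions.
# Inspect one raw criteria evaluation to see the fields used above.
evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
evaluator = load_evaluator("criteria", criteria="conciseness", llm=evaluator_model)
sample = evaluator.evaluate_strings(
    prediction=responses["gemini-2.0-flash"][0],
    input=dataset["question"][0],
)
print(sample)  # e.g. a dict with 'score', 'value', and 'reasoning'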
def calculate_average_scores(evaluation_results):
    """Calculate average scores for each model and criterion."""
    avg_scores = {}
    for model_name, criteria in evaluation_results.items():
        avg_scores[model_name] = {}
        for criterion, scores in criteria.items():
            if scores:
                avg_scores[model_name][criterion] = sum(scores) / len(scores)
            else:
                avg_scores[model_name][criterion] = 0
        all_scores = [score for criterion_scores in criteria.values() for score in criterion_scores if score is not None]
        if all_scores:
            avg_scores[model_name]["overall"] = sum(all_scores) / len(all_scores)
        else:
            avg_scores[model_name]["overall"] = 0
    return avg_scores
This function processes the nested evaluation results to compute the average score for each criterion across all questions for each model. It also calculates an overall average by pooling all individual metric scores. The returned dictionary maps each model to its per-criterion averages and an aggregated "overall" performance score.
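A tiny worked example with dummy numbers (purely illustrative) shows the shape of the output:
# Dummy scores for one hypothetical model across two criteria.
demo_results = {"model-a": {"correctness": [2, 0, 2], "relevance": [2, 2, 2]}}
print(calculate_average_scores(demo_results))
# {'model-a': {'correctness': 1.33..., 'relevance': 2.0, 'overall': 1.66...}}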
def visualize_results(avg_scores):
    """Visualize evaluation results with bar charts."""
    models = list(avg_scores.keys())
    criteria = list(avg_scores[models[0]].keys())
    plt.figure(figsize=(14, 8))
    bar_width = 0.8 / len(models)
    positions = range(len(criteria))
    for i, model in enumerate(models):
        model_scores = [avg_scores[model][criterion] for criterion in criteria]
        plt.bar([p + i * bar_width for p in positions], model_scores,
                width=bar_width, label=model)
    plt.xlabel('Evaluation Criteria', fontsize=12)
    plt.ylabel('Average Score (0-10)', fontsize=12)
    plt.title('LLM Model Comparison by Evaluation Criteria', fontsize=14)
    plt.xticks([p + bar_width * (len(models) - 1) / 2 for p in positions], criteria)
    plt.legend()
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()
    plt.figure(figsize=(10, 8))
    categories = [c for c in criteria if c != 'overall']
    N = len(categories)
    angles = [n / float(N) * 2 * 3.14159 for n in range(N)]
    angles += angles[:1]
    plt.polar(angles, [0] * (N + 1))
    plt.xticks(angles[:-1], categories)
    for model in models:
        values = [avg_scores[model][c] for c in categories]
        values += values[:1]
        plt.polar(angles, values, label=model)
    plt.legend(loc='upper right')
    plt.title('LLM Model Comparison - Radar Chart', fontsize=14)
    plt.tight_layout()
    plt.show()
This function creates grouped bar charts to compare each model's average scores across all evaluation criteria, then renders a radar chart to visualize their performance profiles, enabling quick identification of relative strengths and weaknesses.
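If you also want the same numbers as a table, a short companion snippet (illustrative, not part of the original tutorial) turns avg_scores into a ranked DataFrame:
# Rows = models, columns = criteria; sort by the pooled 'overall' average.
scores_table = pd.DataFrame(avg_scores).T
print(scores_table.sort_values("overall", ascending=False).round(2))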
def main():
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()
    print("Setting up models...")
    models = setup_models()
    print("Generating responses...")
    responses = generate_responses(models, dataset)
    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)
    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)
    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"\n{model}:")
        for criterion, score in scores.items():
            print(f"  {criterion}: {score:.2f}")
    print("\nVisualizing results...")
    visualize_results(avg_scores)
    print("Saving results to CSV...")
    results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
    for model, criteria in avg_scores.items():
        for criterion, score in criteria.items():
            results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                   ignore_index=True)
    results_df.to_csv("llm_evaluation_results.csv", index=False)
    print("Results saved to llm_evaluation_results.csv")
    detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }
        for model_name in models.keys():
            row[model_name] = responses[model_name][i]
        detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
    detailed_df.to_csv("llm_response_comparison.csv", index=False)
    print("Detailed responses saved to llm_response_comparison.csv")
The main function orchestrates the entire evaluation workflow end to end: it builds the dataset, initializes models, generates and scores responses, computes and displays average metrics, visualizes performance with charts, and finally exports both summary and detailed results as CSV files.
def pairwise_model_comparison(models, dataset, responses):
    """Compare two models side by side using an LLM as judge."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
    pairwise_template = """
    Question: {question}
    Response A: {response_a}
    Response B: {response_b}
    Which response better answers the user's question? Consider factors like accuracy,
    helpfulness, clarity, and completeness.
    First, analyze each response point by point. Then conclude with your choice of either:
    A is better, B is better, or They are equally good/bad.
    Your analysis:
    """
    pairwise_prompt = PromptTemplate(
        input_variables=["question", "response_a", "response_b"],
        template=pairwise_template
    )
    pairwise_chain = LLMChain(llm=evaluator_model, prompt=pairwise_prompt)
    model_names = list(models.keys())
    pairwise_results = {f"{model_a} vs {model_b}": [] for model_a in model_names for model_b in model_names if model_a != model_b}
    for i, question in enumerate(dataset["question"]):
        for j, model_a in enumerate(model_names):
            for model_b in model_names[j+1:]:
                response_a = responses[model_a][i]
                response_b = responses[model_b][i]
                if response_a != "Error generating response" and response_b != "Error generating response":
                    comparison_result = pairwise_chain.run(
                        question=question,
                        response_a=response_a,
                        response_b=response_b
                    )
                    key_ab = f"{model_a} vs {model_b}"
                    pairwise_results[key_ab].append({
                        "question": question,
                        "result": comparison_result
                    })
    return pairwise_results
This function runs head-to-head comparisons for each unique model pair by prompting a "gemini-2.0-flash-lite" judge to analyze and rank their responses on accuracy, clarity, and completeness, collecting per-question verdicts into a structured dictionary for side-by-side evaluation.
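To summarize the judge's free-text verdicts, you can add a rough tally. This sketch is an optional extension with a hypothetical helper name, and its string matching is deliberately naive: it simply counts which concluding phrase from the prompt appears in each analysis.
def summarize_pairwise(pairwise_results):
    """Naive tally of pairwise verdicts based on the phrases requested in the prompt."""
    for comparison, items in pairwise_results.items():
        a_wins = sum("A is better" in item["result"] for item in items)
        b_wins = sum("B is better" in item["result"] for item in items)
        ties = len(items) - a_wins - b_wins
        print(f"{comparison}: A wins {a_wins}, B wins {b_wins}, ties/unclear {ties}")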
def enhanced_main():
    """Enhanced main function with additional evaluations."""
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()
    print("Setting up models...")
    models = setup_models()
    print("Generating responses...")
    responses = generate_responses(models, dataset)
    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)
    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)
    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"\n{model}:")
        for criterion, score in scores.items():
            print(f"  {criterion}: {score:.2f}")
    print("\nVisualizing results...")
    visualize_results(avg_scores)
    print("\nPerforming pairwise model comparison...")
    pairwise_results = pairwise_model_comparison(models, dataset, responses)
    print("\nPairwise comparison results:")
    for comparison, results in pairwise_results.items():
        print(f"\n{comparison}:")
        for i, result in enumerate(results[:2]):
            print(f"  Question {i+1}: {result['question']}")
            print(f"  Analysis: {result['result'][:100]}...")
    print("\nSaving all results...")
    results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
    for model, criteria in avg_scores.items():
        for criterion, score in criteria.items():
            results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                   ignore_index=True)
    results_df.to_csv("llm_evaluation_results.csv", index=False)
    detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }
        for model_name in models.keys():
            row[model_name] = responses[model_name][i]
        detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
    detailed_df.to_csv("llm_response_comparison.csv", index=False)
    pairwise_df = pd.DataFrame(columns=["Comparison", "Question", "Analysis"])
    for comparison, results in pairwise_results.items():
        for result in results:
            pairwise_df = pd.concat([pairwise_df, pd.DataFrame([{
                "Comparison": comparison,
                "Question": result["question"],
                "Analysis": result["result"]
            }])], ignore_index=True)
    pairwise_df.to_csv("llm_pairwise_comparison.csv", index=False)
    print("All results saved to CSV files.")
The enhanced_main function extends the core evaluation pipeline by adding automated pairwise model comparisons, printing concise progress updates at each stage, and exporting three CSV files, covering summary scores, detailed responses, and the pairwise analysis, so you end up with a complete, side-by-side evaluation workspace.
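Once the CSVs are written, you can reload and reshape them for a compact summary; this follow-up snippet is illustrative and assumes the files were produced by the run above.
# Reload the summary CSV and pivot it into a model-by-criterion grid.
summary = pd.read_csv("llm_evaluation_results.csv")
print(summary.pivot(index="Model", columns="Criterion", values="Score").round(2))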
if __name__ == "__main__":
    enhanced_main()
Finally, this guard ensures that when the script is executed directly (not imported), it calls enhanced_main() to run the full evaluation and comparison pipeline end to end.
In conclusion, this tutorial has introduced a versatile, opinionated framework for evaluating and comparing the performance of LLMs, leveraging Google's Generative AI capabilities alongside the LangChain library for orchestration. Unlike simplistic accuracy-based metrics, the methodology presented here embraces the multidimensional nature of language understanding, combining granular criterion-based evaluation, structured model-to-model comparison, and intuitive visualizations. By capturing key attributes, including correctness, relevance, coherence, and conciseness, our evaluation pipeline enables practitioners to identify subtle yet significant performance differences that directly impact downstream applications. The outputs, including CSV-based reporting, radar plots, and bar charts, not only support transparent benchmarking but also guide data-driven decision-making in model selection and deployment.
Here is the Colab Notebook.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.