AI and Survey Sampling Problems

My previous post discussed the performance of the artificial intelligence (AI) interface Gemini on undergraduate statistics problems. Now let’s look at how Gemini answers some of the problems in my sampling textbook (Lohr, 2022), and talk about how Gemini could help students learn sampling.

Evaluate a Survey Question

Query: Evaluate the following survey question for the purpose of estimating the number of American adults who have seen the movie Citizen Kane: "According to the American Film Institute, Citizen Kane is the greatest American movie ever made. Have you seen this movie?"

This is Exercise 32(b) on page 27 of Lohr (2022). My solutions manual gives the following answer: “This is a leading question; after being told that it is the greatest American movie ever made, respondents will be loath to say they have not seen it.” But Gemini gives a much better answer, listing additional flaws with the question (identifying social desirability and prestige bias as the underlying motivators that lead the respondent to provide a “yes” answer) and providing an alternative wording for the question.

Gemini's answer

This survey question is highly flawed for the purpose of getting an accurate estimate. If you run this poll, you will almost certainly end up with an overestimate of how many American adults have actually seen Citizen Kane.

Here is a breakdown of why this question fails methodological standards and how it impacts your data.

The Core Problem: Prestige & Leading Bias

The phrase "According to the American Film Institute, Citizen Kane is the greatest American movie ever made" is a textbook example of leading text (specifically, prestige bias).

By introducing an authority figure (the American Film Institute) and a massive superlative ("greatest American movie ever made"), you are priming the respondent before they even get to the actual question.

Why This Ruins Your Data

Social Desirability Bias: People generally want to appear cultured, educated, and "in the loop." When you label a movie as the greatest of all time, respondents who haven't seen it may feel a subconscious pressure to answer "Yes" to avoid feeling culturally out of touch.

Acquiescence Bias: Humans have a natural tendency to agree with statements or validate the context given by the interviewer.

Measuring the Wrong Thing: Instead of purely measuring behavior (whether they watched the film), you are now measuring a mix of behavior, cultural guilt, and agreement with the American Film Institute.

The Verdict & The Fix

The Verdict: Scrap this wording. It compromises the objectivity of the survey and will yield inflated, inaccurate data.

To get a true, unbiased estimate of the number of American adults who have seen the movie, you need to strip away all context and prestige. Keep it completely neutral.

Recommended Revision:

"Have you seen the movie Citizen Kane?"

Yes

No

Not sure / Don't recall

Tip: Adding a "Not sure" option helps capture people who might confuse it with another classic film or who genuinely can't remember, further increasing the accuracy of your "Yes" data.

Calculating Inclusion Probabilities for an Unequal-Probability Sample

Query: Rao (1963) discussed the following rejective method for selecting a probability-proportional-to-size sample without replacement: Select n units with probabilities y_i and with replacement. If any unit appears more than once in the sample, reject the whole sample and select another n units with replacement. Repeat until you obtain a sample of n units with no duplicates.

Gemini says: “This problem from Sharon Lohr's Sampling: Design and Analysis (Chapter 6, on unequal probability sampling) explores Rao's rejective method for n=2,” and proceeds to give a flawless derivation of the two inclusion probabilities. Gemini, although recognizing that this problem came from my book, clearly did not copy the proof I wrote for the solutions manual; in fact, Gemini’s proof is a little more elegant than mine. Gemini explains its reasoning at every step, making the solution easy for a student to follow.
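For readers who want to check a derivation like this numerically, here is a minimal Python sketch. It is not Gemini's proof or the solutions-manual proof. It assumes n = 2 and writes the draw probabilities (y_i in the query) as `psi`; the function names and the example probabilities are made up for illustration. For n = 2, an accepted sample {i, j} arises from the ordered draws (i, j) or (j, i), so its probability is 2 psi_i psi_j / (1 - sum_k psi_k^2), and summing over j gives pi_i = 2 psi_i (1 - psi_i) / (1 - sum_k psi_k^2).

```python
import itertools
import random


def exact_inclusion_probs(psi):
    """Exact inclusion probabilities for the rejective method with n = 2."""
    # Probability that two with-replacement draws contain no duplicate
    accept = 1.0 - sum(p * p for p in psi)
    N = len(psi)
    # Joint inclusion probability of the pair {i, j}, i < j
    joint = {
        (i, j): 2 * psi[i] * psi[j] / accept
        for i, j in itertools.combinations(range(N), 2)
    }
    # First-order inclusion probability of unit i
    pi = [2 * psi[i] * (1 - psi[i]) / accept for i in range(N)]
    return pi, joint


def simulate_inclusion_probs(psi, reps, seed=0):
    """Estimate the first-order inclusion probabilities by simulation."""
    rng = random.Random(seed)
    units = range(len(psi))
    counts = [0] * len(psi)
    for _ in range(reps):
        # Draw 2 units with replacement; reject and redraw if they coincide
        while True:
            draw = rng.choices(units, weights=psi, k=2)
            if draw[0] != draw[1]:
                break
        for i in draw:
            counts[i] += 1
    return [c / reps for c in counts]


# Hypothetical draw probabilities for a population of 4 units
psi = [0.1, 0.2, 0.3, 0.4]
pi, joint = exact_inclusion_probs(psi)
sim = simulate_inclusion_probs(psi, reps=20000)
for i, (exact, approx) in enumerate(zip(pi, sim)):
    print(f"unit {i}: exact pi = {exact:.4f}, simulated = {approx:.4f}")
```

The first-order probabilities should sum to n = 2 and the joint probabilities to 1, which gives a quick sanity check on any derivation a student or an AI tool produces.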

Gemini was able to work out every theory problem I submitted to it. Of course, all of these problems are known to have solutions, and solutions to the problems can be found at various places online. I have not tried using Gemini to solve open research problems in sampling.

Recently, however, OpenAI was able to solve a problem that had been posed in 1946 by the great mathematician Paul Erdős and studied by many, many mathematicians since then. Suppose there are n points in a plane, and let v(n) equal the maximum number of pairs of points that are exactly one distance unit apart. Erdős conjectured an upper bound for v(n), and the AI was able to disprove his conjecture by showing that there are infinitely many values of n for which the postulated upper bound is exceeded. If an AI can solve an 80-year-old open research problem, it is not surprising that one would be able to work through theoretical statistics problems that have known answers.

For both the Erdős conjecture and the sampling problems, however, the answers had to be verified by humans, since generative AI tools can produce false results that appear superficially to be correct.

Critiquing an Article

One of my favorite assignments in sampling has always been to ask students to critique a research article that uses a sample survey. In my previous post, Gemini did not do well at critiquing a paper describing a designed experiment. How does it do on survey articles?

I asked Gemini to find and critique two research articles that involve a survey sample, using the following queries taken from exercises in Lohr (2022).

Obtain a research article based on a survey employing a complex survey design, and write a short critique. Your critique should include:
(a) a brief summary of the design and analysis
(b) a discussion of the effectiveness of the design and the appropriateness of the analysis
(c) your recommendations for future studies of this type (Lohr, 2022, Chapter 7, exercise 33 on p. 309).

Find a recent journal article that uses a nonprobability sample. What method(s) did the authors use to try to obtain a more representative sample or to reduce the selection bias? How well do you think the methods worked? (Lohr, 2022, Chapter 15, exercise 33 on p. 555).

For the first query, Gemini chooses the article by Bhargava et al. (2025), which uses data from the National Health and Nutrition Examination Survey (NHANES) to measure “nutrition security” in the United States. For the second, Gemini chooses to critique the California Statewide Study of People Experiencing Homelessness (CASPEH; Wesson et al., 2025). Here is the Markdown file of Gemini’s full responses to these queries.

Gemini succinctly describes the stratified multistage NHANES design for the years studied in Bhargava et al. (2025) (to be fair, Chapter 7 of my book gives a much more detailed description, including the calculation of the measure of size for the unequal probability sampling and the weight construction). Gemini correctly praises the authors’ use of survey software to account for the unequal selection probabilities and clustering in the estimation.

Gemini also identifies what I view as the main limitation of Bhargava et al. (2025): relying solely on the answer to the question “In general, how healthy is your overall diet?” to categorize respondents as having high diet quality (those responding excellent, very good, or good to the question) or low diet quality (those responding fair or poor to the question). The authors also identify this as a study limitation, recognizing that it is imposed by the NHANES questionnaire structure. If they had conducted their own survey, they would have been able to design questions to measure diet quality more accurately, but then they would have had to conduct a very expensive survey instead of taking advantage of NHANES.

Despite having some good suggestions for improvement, Gemini has a few misstatements in its critique. Gemini suggests that the authors should “triangulate” the self-reported metrics with biomarkers from the medical examination (“subjective self-assessments of diet should be cross-verified against objective biometric data already collected in the survey”), but fails to notice that Table 1 of Bhargava et al. (2025) does this, relating their measure of nutrition quality to measured health conditions such as obesity, hypertension, high cholesterol, and heart disease. Gemini also says that the weights must be rescaled when combining multiple years. This is not necessary, as the concatenated dataset, with the weights, represents the concatenated populations (Centers for Disease Control and Prevention, 2026). Dividing each weight by the number of pooled cycles will not change estimates of percentages.
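The last point follows because a weighted percentage is a ratio, so a constant factor applied to every weight cancels: sum((w_i / k) y_i) / sum(w_i / k) = sum(w_i y_i) / sum(w_i). Here is a minimal Python sketch using made-up weights and responses (not NHANES data); the variable names and the choice of three cycles are assumptions for illustration only.

```python
import math


def weighted_total(indicator, weights):
    """Weighted estimate of a population total: sum of w_i * y_i."""
    return sum(w * y for y, w in zip(indicator, weights))


def weighted_proportion(indicator, weights):
    """Weighted estimate of a proportion: sum(w_i * y_i) / sum(w_i)."""
    return weighted_total(indicator, weights) / sum(weights)


# Hypothetical respondents pooled from 3 survey cycles (made-up values)
low_diet_quality = [1, 0, 0, 1, 0, 1, 0, 0]
weights = [1200.0, 800.0, 1500.0, 950.0, 1100.0, 700.0, 1300.0, 1000.0]
n_cycles = 3

# Divide every weight by the number of pooled cycles
rescaled = [w / n_cycles for w in weights]

p_original = weighted_proportion(low_diet_quality, weights)
p_rescaled = weighted_proportion(low_diet_quality, rescaled)
print("proportion, original weights:", p_original)
print("proportion, rescaled weights:", p_rescaled)

# Estimated totals, unlike proportions, shrink by the factor 1 / n_cycles
print("total, original weights:", weighted_total(low_diet_quality, weights))
print("total, rescaled weights:", weighted_total(low_diet_quality, rescaled))
```

The two proportions agree up to floating-point rounding, while the estimated totals differ by the factor 1/3. Rescaling therefore matters only when the target is a total or count for an average single-cycle population.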

For the second query, CASPEH is a good choice for a critique of a nonprobability sample. Estimating the number of persons experiencing homelessness (PEH), and their characteristics, is extremely challenging, and Gemini picked an article that many students would find interesting. The annual point-in-time count, in which volunteers in each community count the number of PEH during a specific week in January, is generally thought to be an underestimate (but see Example 3.13 of Lohr, 2022, for a description of how New York City tries to improve the accuracy of its count). CASPEH follows a different approach.

Wesson et al. (2025) describe the design of CASPEH, a statewide survey designed to provide information about the extent and characteristics of PEH in California. It would be easy for a student to mistakenly say that the primary sampling units form a probability sample, because Wesson et al. (2025, p. 1239) state: “We used a multistage sampling design with randomization at 3 levels: California counties, venues where PEH could be found within selected counties, and individuals attending selected venues.” Later on that page, though, they state that although they created 200 samples by randomly sampling one county from each of the eight regions, they then “chose the sample most representative of the state with respect to demographic and economic characteristics of both the general and PEH population” (p. 1239). Gemini correctly characterizes the sample as purposively chosen.

Gemini also correctly describes the selection of persons within counties through supplementing a sample of venues (e.g., shelters, free meal programs, free shower locations) with a respondent-driven sample in which PEH in an initial convenience sample recruit others to be in the sample.

Gemini identifies a success of the survey — “The methods worked exceptionally well at capturing a highly diverse and large-scale sample (over 3,200 participants) that far exceeds the scope of standard convenience web or phone surveys” — but does not seem to recognize that methods such as web or phone surveys are not feasible with the PEH population. Similarly, Gemini’s suggestion that the nonprobability sample could be stacked with a reference dataset such as the American Community Survey (ACS) to estimate inclusion probabilities would probably not work well because the ACS misses unsheltered persons (and probably fails to capture some of the PEH in sampled shelters too). Still, it’s an interesting suggestion and makes use of recent research on combining probability and nonprobability samples.

I found Gemini’s survey critiques to be pretty impressive, overall. Gemini did not come up with all of my suggestions for improvement (for example, the CASPEH survey could select more but smaller primary sampling units per stratum, use balanced sampling to select a probability sample of areas, or use a different weighting scheme to reduce the bias), but I would not expect any student to come up with my exact set of suggestions, either. But it identified the survey designs correctly and most of its suggestions were reasonable — its performance was what I would expect from a good student.

Using AI to Enhance Survey Sampling Knowledge

Gemini and other AI interfaces can now do many of the homework problems in sampling (and other math and statistics classes). I think it is still important for sampling students to learn how to design and analyze different types of surveys (and for graduate students in sampling to know how to prove results about different sampling designs and estimators), and it is good for students to work on problems on their own even if Gemini can answer the question in a fraction of the time.

The footnote at the bottom of https://gemini.google.com says “Gemini is AI and can make mistakes.” This is a big difference between Gemini and other technology used in the classroom. When you write code for SAS® software to design or analyze a survey, you can be confident that the software will do exactly the calculations you asked it to do. If you write correct code for calculating regression coefficients, the software will give you the correct answer. Gemini, however, might give you nonsense in response to a query. Students need to know how to select, analyze, and critique samples themselves to be able to distinguish Gemini’s helpful answers from the nonsense.

But Gemini could help develop students’ understanding of the subject during in-class activities. It can find examples of survey datasets with various characteristics, can write code for graphing or analyzing survey data, and its critiques can spur class discussion. To explore further, I asked Gemini how it thinks it could be most helpful for teaching sampling, with its answer in the next section. Note how Gemini built on the queries I gave it earlier to come up with excellent suggestions for integrating AI tools into a sampling classroom.

Gemini's answer to the query "How should a statistics teacher use Gemini when teaching sampling?"

Teaching sampling is notoriously tricky because it forces students to shift from the clean, theoretical world of pure math to the messy, unpredictable world of human behavior and logistics.

Think of Gemini not just as an answer engine, but as an interactive sandbox, a flawed data generator, and a teaching assistant wrapped into one. Here is a blueprint for integrating Gemini into your sampling curriculum to boost engagement and conceptual clarity.

1. The "Flawed Survey" Generator (Active Critiquing)

Students are naturally bad at spotting bias in their own questions, but they are great at spotting it in someone else's. Use Gemini to generate intentionally terrible survey questions to teach measurement and selection bias.

The Tactic: Ask Gemini to generate a 5-question survey on a topic (e.g., climate change or school dress codes) but explicitly tell it to bake in specific flaws: prestige bias, double-barreled questions, acquiescence bias, or unrepresentative sampling frames.
In Class: Have students dissect Gemini’s output, identify the specific biases, and rewrite the questions to be completely neutral.

2. Settling the Eternal Debate: Stratified vs. Cluster Sampling

The distinction between stratifying (homogeneity within, heterogeneity between) and clustering (heterogeneity within, homogeneity between) is one of the most common stumbling blocks for intro stats students.

The Tactic: Have Gemini generate vivid, non-mathematical analogies tailored to your students' interests.
Prompt Idea: "Give me an analogy comparing stratified sampling and cluster sampling using a box of mixed donuts, a high school campus, and a music festival."
In Class: Use these analogies to anchor the technical definitions before moving to the math of design effects.

3. Creating Instant "Sampling Sandboxes"

It is incredibly tedious for a teacher to manually invent a population dataset just to demonstrate how different sampling methods yield different statistics. Gemini can write code or generate data structures in seconds.

The Tactic: Ask Gemini to generate a mock population dataset of 1,000 individuals with known parameters (e.g., income, age, political affiliation) where certain groups are heavily clustered geographically.
In Class: Have Gemini write a quick R or Python script that pulls a Simple Random Sample, a Stratified Sample, and a Cluster Sample from that mock population. Display the sampling distributions side-by-side to visually prove to students how cluster sampling increases standard error while stratified sampling reduces it.
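A minimal sketch of what such a sandbox might look like follows. It was written for this post rather than produced by Gemini, and everything in it is an assumption for illustration: the mock population has 5 regions of 10 neighborhoods with 20 people each, "income" is generated with region and neighborhood effects, and all three designs select 100 people. In this constructed population, stratifying by region lowers the standard error of the sample mean and sampling whole neighborhoods raises it; other populations can behave differently.

```python
import random
import statistics


def make_population(seed=1):
    """Mock population: 5 regions x 10 neighborhoods x 20 people = 1,000."""
    rng = random.Random(seed)
    population = []
    for region in range(5):
        for k in range(10):
            neighborhood = region * 10 + k
            shared = rng.gauss(0, 5)  # effect shared within a neighborhood
            for _ in range(20):
                income = 40 + 10 * region + shared + rng.gauss(0, 10)
                population.append(
                    {"region": region, "neighborhood": neighborhood,
                     "income": income}
                )
    return population


def group_by(population, key):
    """Map each value of `key` to the list of incomes in that group."""
    groups = {}
    for person in population:
        groups.setdefault(person[key], []).append(person["income"])
    return groups


def simulate_standard_errors(population, reps=2000, seed=2):
    """Empirical SE of the sample mean under three designs, each with n = 100."""
    rng = random.Random(seed)
    incomes = [p["income"] for p in population]
    strata = group_by(population, "region")
    clusters = group_by(population, "neighborhood")
    srs, strat, clus = [], [], []
    for _ in range(reps):
        # Simple random sample of 100 people
        srs.append(statistics.fmean(rng.sample(incomes, 100)))
        # Stratified sample: 20 people from each of the 5 equal-sized regions
        picked = [y for ys in strata.values() for y in rng.sample(ys, 20)]
        strat.append(statistics.fmean(picked))
        # One-stage cluster sample: 5 whole neighborhoods of 20 people each
        chosen = rng.sample(sorted(clusters), 5)
        clus.append(statistics.fmean(y for c in chosen for y in clusters[c]))
    return {
        "srs": statistics.stdev(srs),
        "stratified": statistics.stdev(strat),
        "cluster": statistics.stdev(clus),
    }


population = make_population()
for design, se in simulate_standard_errors(population).items():
    print(f"{design:>10}: empirical SE of the sample mean = {se:.2f}")
```

Because the strata and clusters are equal in size, the plain sample mean is an appropriate estimator under all three designs, so the comparison isolates the effect of the design itself.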

4. Reverse-Engineering Real-World Complex Designs

Instead of analyzing dry textbook problems, have students dissect massive, high-stakes operations like the USDA June Area Survey, Pew Research panels, or Gallup polls.

The Tactic: Use Gemini to break down the methodology sections of complex, peer-reviewed articles.
Prompt Idea: "Summarize the sampling frame, stages of selection, and weighting methods used in the latest CDC NHANES report, written at an undergraduate statistics level."
In Class: Use this summary as a case study. Challenge students to explain why the researchers didn't just use a simple random sample (e.g., cost, geographical constraints, the need to oversample minority groups).

💡 A Quick Prompting Tip for Teachers

When using Gemini to prep for class, use the "Act as a..." framework. Try prompting it: "Act as a stubborn, skeptical intro-stats student who doesn't understand why a sample of 1,200 people can accurately predict a national election of 150 million voters. Argue with me so I can practice my explanations for tomorrow's lecture."

References

Bhargava, V., Lee, J. S., Smith, T. A., and Chakrovorty, S. (2025). A measure of nutrition security using the National Health and Nutrition Examination Survey dataset. JAMA Network Open, 8(2), e2462130.

Centers for Disease Control and Prevention (2026). National Health and Nutrition Examination Survey Datasets and Documentation. https://wwwn.cdc.gov/nchs/nhanes/tutorials/Datasets.aspx

Lohr, S. L. (2022). Sampling: Design and Analysis, 3rd edition. Boca Raton, FL: CRC Press.

Wesson, P., Graham-Squire, D., Perry, E., Assaf, R. D., and Kushel, M. (2025). Novel methods to construct a representative sample for surveying California’s unhoused population: the California Statewide Study of People Experiencing Homelessness. American Journal of Epidemiology, 194(5), 1238-1248.

teaching statistics, sample surveys
Sharon Lohr
June 12, 2026