
Satisficing関係の研究まとめ

川本達郎

ウェブ調査の結果はなぜ偏るのか, 吉村治正, 社会学評論 71巻 1号, (2020)

ウェブ調査の示す結果の偏りが,インターネットが使えない人が排除される過少網羅のもたらすバイアス・低い回答率のもたらすバイアス・調査対象者の自己選択によるバイアスのいずれに帰されるかを,2 つの実験的ウェブ調査を通じて検証した.その結果,過少網羅および低回答率は1 次集計結果の偏りにはほとんど影響を与えないこと,対照的にモニター登録という自己選択のもたらす影響は無視しえない深刻なものであることが明らかとなった.


オンライン調査モニタの Satisfice に関する実験的研究

著者:三浦 麻子(関西学院大学)、小林 哲郎(国立情報学研究所)

  • Satisfice(目的を達成するために必要最小限を満たす手順を決定し、追求する行動)がしばしば見られること、すなわち協力者が調査に際して応分の注意資源を割こうとしないこと
  • Satisfice 問題はむしろモニタを対象としたオンライン調査においてより深刻となる可能性がある
  • 山田・江利川(2014)では、調査に協力したモニタのうち約 7 割が平均して 1 日 1 件以上、3 分の 1 は 2 件以上回答していることが示されている。これは従来型の調査の協力者とは大きく異なる傾向であり、かれらが日常的に数多くの調査を「こなす」ことは Satisfice に結びつきやすい可能性がある
  • 調査研究で問題となるこうした Satisfice 行動について、Simon(1957)の Optimize(最適化)と Satisficeの議論を応用して体系的な分類を行ったのが Krosnick(1991)である。

2種類のSatisfice

Krosnick(1991)によれば、調査回答における Satisfice は 2 つに分類される

  • 弱い Satisfice: 調査項目の内容を理解した上で回答しようとしているが、選択可能な選択肢を部分的にしか検討しないといった回答行動
  • 強い Satisfice: 調査項目の内容を理解するための認知的コストを払わず、誰にでも選択可能な選択肢を選んだり、あてずっぽうに選択したりする回答行動

Satisfice が生じる主要な原因

Satisfice が生じる主要な原因としては、調査項目の難しさ、回答者の能力、回答者の動機づけが挙げられる(Krosnick, 1991)。これらのうち本研究では回答者の動機づけに着目する。回答者の動機づけは、回答者の認知欲求(need for cognition)(Cacioppo, Petty, Feinstein, & Jarvis, 1996)、調査テーマの個人的重要性、調査への回答が何かの役に立つという信念、回答による疲れ、調査主体の姿勢(注意深い回答を要求するか否かなど)に影響される(Krosnick & Presser, 2010)。

House effects

  • 調査会社により結果が異なる現象
  • House effects検討のために2つの調査会社間の比較も行う。
  • 本研究ではスクリーニング調査の段階で複数の調査会社に登録しているモニタを排除した
  • 結論:Satisfice率、特にIMC項目の遵守/違反率と「より強い Satisfice」傾向に調査会社間で差が見られた

問題設定

本研究では日本のオンライン調査パネルを対象として、二つの方法を用いてモニタの回答行動に関するデータを収集し、Satisfice の程度を測定する。

測定1

一つ目の測定方法では、回答者が調査票の教示文を十分に読み、指示どおりの回答行動を行う程度に注目する。具体的には、まずスクリーニング調査において Oppenheimer, Meyvis, & Davidenko(2009)が開発した IMC(Instructional manipulation check; 図 1)を踏襲した設問を用いて、ベースラインとなるモニタの Satisfice 傾向を測定する。

測定2

二つ目の測定方法では、三浦(2014)と同様に、多数の項目からなるリッカートタイプの尺度項目を十分に読む程度に注目する。リッカートタイプの尺度は心理学でよく用いられるが、望ましくない回答行動(例えば特定選択肢への過度の集中や相互矛盾する回答など)が見られやすいことがしばしば問題となる。

スクリーニング調査

  • A 社と B 社の登録モニタのうち、成人男女(A 社 36125 名、B 社 81900 名)
  • 個人属性(性別・年齢・居住地域/婚姻状況・子ども有無/職業; 全 3 ページ)
  • IMC 項目
  • 社会経済地位(Socioeconomic Status; SES)に関する主観的評価(1(もっとも低い)~10(もっとも高い)と「答えたくない」から 1 肢選択)
  • 他社へのモニタ登録有無(他社モニタ登録を「なし」と回答した協力者のみを本調査の対象とした)
  • 調査実施を打診したうち 3 社が不可とした理由は、いずれもこの IMC 項目が「モニタに不快感を与える」と判断したためである
  • (種々の基準でフィルタリングをした上で)本調査の対象とする 1800 名をそれぞれランダムに抽出した。
  • また、抽出モニタにおける調査会社間の違いはほとんどなく、やや異なるのは性別構成のみで、A 社には男性(特に正社員)が、B 社には女性(特に専業主婦やパート・アルバイト)が多い傾向があった。

結果

  • 教示の遵守に影響をもつ変数を把握するため、協力者による教示の遵守/違反を従属変数とし、個人属性(性・年齢・未既婚・子ども有無)とSES、調査会社を独立変数とする名義ロジスティック重回帰分析を行ったところ、B社モニタ、男性、高年齢、既婚、子どもありの協力者において違反率が高かった(ps<.05; Cox–Snell’s R2=0.13)。
  • まず IMC 項目に対する回答所要時間には遵守/違反群による主効果が見られ、遵守群の方が違反群よりも長かった(p<.001)。
  • また、IMC 項目前後の項目(群)に関する回答所要時間を比較したところ、IMC 項目より前に配置された個人属性項目には遵守/違反群と調査会社の主効果は有意ではなかった一方で、後に配置された SES と他社モニタ登録の有無については、SES 項目で遵守/違反群の主効果が有意(遵守群>違反群)であった(p<.001)。この結果は、IMC 遵守群は IMC 項目により「教示を精読すべき」という規範を獲得し、画像刺激(ラダースケールの図)を含み、また教示文も比較的長い SES に関する項目にも、内容を精読してから回答するよう動機づけられたことを示唆している。
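このタイプの分析(遵守/違反を従属変数、属性と調査会社を独立変数とするロジスティック回帰)のイメージを、完全に架空のデータと仮置きのカラム名でスケッチすると次のようになる。論文の実データ・実変数とは無関係。

```python
# 架空データによる最小スケッチ(カラム名はすべて仮置き)
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "violation": rng.integers(0, 2, n),   # IMC教示の遵守(0)/違反(1)
    "male": rng.integers(0, 2, n),        # 性別ダミー
    "age": rng.integers(20, 70, n),       # 年齢
    "married": rng.integers(0, 2, n),     # 既婚ダミー
    "has_child": rng.integers(0, 2, n),   # 子どもありダミー
    "ses": rng.integers(1, 11, n),        # 主観的SES(1~10)
    "company_B": rng.integers(0, 2, n),   # 調査会社ダミー(B社=1)
})
X = df.drop(columns="violation")
model = LogisticRegression(max_iter=1000).fit(X, df["violation"])
# 係数の符号から「違反率を高める方向の属性」を読む、という使い方
print(dict(zip(X.columns, model.coef_[0].round(3))))
```

実際の論文は Cox–Snell の擬似 R2 を報告しているが、ここでは手順の骨格だけを示している。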

本調査

  • リッカート尺度項目に含めた「この項目は一番左(右)を選択してください」に対する回答について、「左」については「1.全くちがう」以外、「右」については「5.全くそうだ」以外を選択した協力者においてSatisficeが生じているとみなす。少なくとも 1 項目において協力者が項目文の指示とは異なる選択肢を回答した場合を「Satisfice」、2 項目とも指示どおりの選択肢を回答した場合を「not Satisfice」とすると、協力者全体(2872名)のうち Satisfice が生じていたのは 383 名(13.3%)であった。
  • 無作為配置した総項目数条件間で Satisfice 率を比較する。まず A 社では、IMC 遵守群と違反群ともに 10 項目条件と 30 項目条件では Satisfice 率にほとんど差が見られない一方、50項目条件では 4~6%程度高くなっている。このことから、A 社のモニタは少なくとも30項目程度のリッカート尺度項目は内容を読んだ上で回答しているが、50 項目を超える尺度の場合は回答疲れによる動機づけの低下によって Satisfice 率が高まっていることがうかがえる。
  • 総じて、本調査において項目を精読しなかった協力者には、元来調査協力への動機づけが低い確信犯的な Satisficer に、そこまで動機づけが低いわけではないが項目数の多さに「厭気」がさしたことによって一時的に生じた Satisficer が上乗せされていると考えられる。
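この「指示付き項目(一番左/一番右を選べ)への違反で Satisfice をフラグする」判定は、仮のカラム名(ir_left / ir_right)を使うと次のように書ける。データは玩具データで、論文の集計値とは無関係。

```python
import pandas as pd

# 玩具データ:カラム名 ir_left / ir_right は仮置き
# ir_left : 「この項目は一番左を選択してください」への回答(期待値 1. 全くちがう)
# ir_right: 「この項目は一番右を選択してください」への回答(期待値 5. 全くそうだ)
df = pd.DataFrame({
    "ir_left":  [1, 1, 3, 1, 5],
    "ir_right": [5, 5, 5, 2, 5],
})
# 少なくとも1項目で指示と異なる選択肢を選んだ協力者を Satisfice とみなす
df["satisfice"] = (df["ir_left"] != 1) | (df["ir_right"] != 5)
rate = df["satisfice"].mean()
print(df["satisfice"].sum(), f"{rate:.1%}")  # → 3 60.0%
```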

結論

  • あらかじめスクリーニング調査を実施することは、それ自体が事後の調査における協力率を高め、またそこに Satisfice 傾向をもつ回答者を特定できる強めの設問をおくことによって、Satisfice の含まれにくいデータを得やすくなる可能性が示された
  • IMC のような「正しく」答えないことを求める設問を許容しない調査会社もある。
  • Mechanical Turk の普及を背景に、Satisfice を防ぐさまざまなスクリーニング方法が開発されつつあることは前述したとおりである。ただし、Satisfice 傾向をもつ回答者を特定することとそれを排除することは、完全に同義というわけではない。

調査回答の質の向上のための方法の比較

増田 真也 坂上 貴之 森井 真広 (慶應義塾大学)

やったこと

真面目に回答するという宣誓を回答者に求めた。法廷で被告人等が真実を述べることを誓うのになぞらえ,この手続きを**冒頭宣誓(taking an oath to answer seriously: 以下,TO とする)**と呼ぶ。

イントロ・背景

  • 中間選択(midpoint response)
  • 5 件法の評定尺度で,不良回答者による同じ回答カテゴリの選択が続く同一回答(straight line response)
  • Vannette & Krosnick(2014)は,多段階評定での中間選択傾向や同一回答などの,項目内容と無関係に生じる反応バイアス(response bias)が,最小限化によるものと考えている
  • Web 調査では最小限化が促進されやすいという(Heerwegh & Loosveldt, 2008)
  • 同一回答や中間選択等は,真面目に回答しても起こりうるので,それだけで良否を判ずるのは難しい。
  • 最小限化は,項目文が難しいとか長いといった理由でも生じる(Krosnick, 1991; Velez & Ashworth, 2007)。
  • 先行研究での IMC はかなりの長文であり,むしろ不良回答を増大させる可能性もある
  • 良質な回答を誤って不良と見なしてしまったときに重大な問題となる(Curran, 2016)

チェック方法

  • 回答自体や,それに付随する回答時間のようなパラデータ(paradata)からではなく,別の手続きを加えることで不良回答者を検出しようとすることもある。Aust, Diedenhofen, Ullrich, & Musch(2012)は,選挙に関する調査で,真面目に回答したかどうかを単刀直入に尋ねた(seriousness check: 以下,SC とする)。すると,回答者の 3.2% が真面目に答えなかったと報告した。そして真面目に答えたと述べた人の方が,矛盾のない回答をしていた他,回答と実際の選挙結果の一致度が高かった。しかしこの方法では,真面目に回答したという自己報告が,そもそも本当なのかどうかわからない。
  • Oppenheimer, Meyvis, & Davidenko(2009)は,Web 調査でのある質問で,通常の選択肢ではなく,画面上の他の箇所をクリックするよう指示して,教示文をきちんと読んでいるかどうかを確かめた(instructional manipulation check: 以下,IMC とする)。
  • 三浦・小林(2015)では,回答の仕方に関する教示のある質問に加えて,「一番右を選択してください」といった特定の回答を指示する項目を設けた。すなわち教示文だけでなく,項目文を読んでいるかどうかも調べた。結果は,教示文の指示に従っていない人の方が,項目文の回答の指示にも従っていないことが多かった。

不良回答を見つけようとするよりも,できるだけ多くの人が真面目に回答するよう,調査のやり方を工夫する研究

  • Huang, Curran, Keeney, Poposki, & DeShon(2012)で,真面目に回答しないと報酬が得られないと回答者に警告したところ,不注意回答が減った
  • Ward & Pond(2015)では,回答の質が低いと警告するという教示と,Web 調査の画面上の動く人の画像(virtual presence)の有無の効果を検討した。すると画像だけでは効果が無かったものの,教示との有意な交互作用が見られた。

回答率の向上

  • 社会調査法の領域においても,ただ調査票を手渡して回答と返送を依頼するよりも,短時間で済む調査への回答を先に依頼し,次により長い調査への回答か,後で調査票を返送するよう依頼する方が回答率が高かった(Dillman, Dolsen, & Machlis, 1995)。
  • Reingen & Kernan(1977)では,初回に調査を依頼する際の質問数の多少(5 問 /35 問)と,報酬の有無(無し /5 ドルの商品券)を組み合わせた 4 条件を設け,7─9 日後に 2 回目の調査を実施した。すると,初回の質問数が少なく,かつ報酬のために依頼に応じたという正当化のできない無報酬群で,2 回目の調査の回答率が最も高くなった。
  • 立場を表明したり,小さな依頼に承諾したりしてコミットメントすると,態度や行動を一貫させようとする力が働く

方法

対象者

  • 2 つの調査会社 A,B のそれぞれで,モニタ登録されている 20─69 歳の成人に協力依頼をし,同意した人は指定されたサイトにアクセスして回答した

質問票(論文から抜粋)

各条件における質問内容

・IMC 条件 インターネットを用いた調査においては,うそをついたり,質問を読まないで,いい加減な回答をしたりする方がいることが問題となっています。つきましては大変失礼なお願いですが,あなたがこの文章をきちんと読んでいるかどうかを確認させてください。あなたがこの文章をお読みになったら,以下の質問には回答せずに(つまり,どの選択肢もクリックせずに),次のページに進んでください。
□ 1. そう思う
□ 2. どちらかといえばそう思う
□ 3. どちらともいえない
□ 4. どちらかといえばそう思わない
□ 5. そう思わない

・SC 条件 インターネットを用いた調査においては,うそをついたり,質問を読まないで,いい加減な回答をしたりする方がいることが問題となっています。あなたは以上の質問に,真面目に答えてくださいましたか? もしも真面目に回答していなかったのであれば,そのことをお知らせいただけると,データを分析する上で大変にありがたく存じます。お手数ですが,以下のどちらかにチェックを入れてください。
〇私は以上の質問に真面目に回答しました
〇私は以上の質問に真面目に回答しませんでした

・TO 条件 インターネットを用いた調査においては,うそをついたり,質問を読まないで,いい加減な回答をしたりする方がいることが問題となっています。あなたは質問をきちんと読んで,真面目に答えていただけますか? 真面目に答えていただけるのであれば,以下のボックスをチェックしてください。
□ 私は以下の質問をきちんと読んで,真面目に回答します


(調査詳細と結果については、今後整理予定)


IMCの原論文

Instructional manipulation checks: Detecting satisficing to increase statistical power
Daniel M. Oppenheimer, Tom Meyvis, Nicolas Davidenko

ポイント

  • Satisficingとは何か
  • IMCとは何か?
  • どんな調査をしたか
  • 検出力が上がることをどう示すか?
  • IMCの注意点
  • 他の人からの批判:調査内容が長くなる-> 余計に面倒くさくなる

概要

  • This paper presents and validates a new tool for detecting participants who are not following instructions – the Instructional manipulation check (IMC).

Satisficingとは何か

IMC以外のsatisficing回避方法

IMCとは何か?

a question embedded within the experimental material that is similar to the other questions in length and response format (e.g., Likert scale, check boxes, etc.). However, unlike the other questions, the [Instructional Manipulation Check] asks participants to ignore the standard response format and instead provide a confirmation that they have read the instructions. (p. 867)

この質問は、あなたがこのインストラクションをちゃんと読んでいるかを確認するためのものです。もしちゃんと読んでいなければ、我々がインストラクションを微妙に変えて実験をしたとしても、意味のないものになってしまうでしょう。ですから、あなたがちゃんとこのインストラクションを読んでいることを示すため、以下の質問において何も選択せず、「次へ」ボタンも押さないでください。その代わりに、画面トップにある「スポーツについての調査」というタイトルをクリックしてください。

「あなたが普段行っているスポーツを選択してください(該当するものすべてを選択して下さい)」

  • スキー
  • サッカー
  • スノーボード
  • ランニング
  • ホッケー
  • アメフト
  • 水泳
  • テニス
  • バスケットボール
  • サイクリング

どんな調査をしたか

メインの質問内容

  • two classic paradigms from the judgment and decision making literature were replicated.
    1. Thaler’s (1985) beer pricing task that demonstrates how different expectations can change people’s willingness to pay for identical experiences.

You are on the beach on a hot day. For the last hour you have been thinking about how much you would enjoy an ice cold can of soda. Your companion needs to make a phone call and offers to bring back a soda from the only nearby place where drinks are sold, which happens to be a [run-down grocery store] [fancy resort]. Your companion asks how much you are willing to pay for the soda and will only buy it if it is below the price you state. How much are you willing to pay?

  • 問い:[run-down grocery store] vs [fancy resort]で有意な差があるか?

  • 先行研究:Thaler (1985) found that participants were willing to pay substantially more for a beer from a fancy resort than from a run-down grocery store, even though the experience of drinking the beer would be identical regardless of the source.

    2. The second paradigm was a sunk cost question, also adapted from Thaler (1985).

    Imagine that your favorite football team is playing an important game. You have a ticket to the game that you [have paid handsomely for] [have received for free from a friend]. However, on the day of the game, it happens to be freezing cold. What do you do?

  • 問い:[have paid handsomely for] vs [have received for free from a friend]で有意な差があるか?

  • 先行研究:Previous research has found that people are less likely to skip the game if they have paid for the tickets (Arkes & Blumer, 1985; Thaler, 1980).

実施した2種類のIMC

  • Study1: 一通り質問に回答をした後に、IMCを行う -> IMCで失格となった回答者は除外
  • Study2: 最初にIMCを実施。IMCをクリアできなければ、できるまでIMCを繰り返し、合格したらメインの質問に回答

IMCの注意点

  • IMC assumes that people who fail the IMC will also fail to read other instructions.
  • To make sure this assumption is satisfied, the format of the IMC should be adjusted to match the format of questions in the rest of the survey.
  • Some surveys do not require participants to read the directions in order to successfully answer the questions.
  • There is the concern that if an IMC is used to eliminate participants from the sample then the external validity of the study could be harmed. If the population that failed the IMC differed substantively from those who passed the IMC it could lead to issues regarding generalizability of the findings.
  • Satisficing participants may be embarrassed at having failed the IMC and may seek retribution for their embarrassment by trying to foil the study.
    • However, in the reported studies, we found no evidence for such a backlash.

著者のおすすめ

  • We recommend using IMCs early in a study to convert satisficing participants into diligent participants, as in Study 2.
  • an IMC in conjunction with other methods for increasing participant diligence can help identify non-diligent participants if those other methods are not 100% effective.

J. A. Krosnick (1991)

J. A. Krosnick, "Response strategies for coping with the cognitive demands of attitude measures in surveys," Applied cognitive psychology 5.3 (1991): 213-236.

Satisficingの種類

  • 弱いsatisficing
    • Primacy効果:当てはまる回答のうち、最適なものではなく、一番最初のものを選択する
    • インタビュアーが提案するものに賛成する
  • 強いsatisficing
    • Endorsing "status quo": 現状維持を選択
    • Non-differentiation: レーティング回答において、同じ回答を繰り返す
    • 「わからない」を繰り返す
    • Mental coin-flipping: ランダムに回答する

Satisficingを生み出す要因

  • タスクの難しさ
  • 回答者の能力
    • 回答者によっては、判断が難しい
    • 質問に対して経験が豊富な人は、正確に答えられる
    • preconsolidated attitude?
  • 回答者のモチベーション
  • Need for cognition
  • インタビュアーの態度
  • Accountability

Seriousness Check

Aust, Frederik, et al. "Seriousness checks are useful to improve data validity in online research." Behavior research methods 45 (2013): 527-535.

概要

  • The primary goal of the present study was to investigate the extent to which the data of self-reported nonserious participants differ from serious submissions and whether their exclusion has the potential to increase the validity of data collected online.

Satisficingのチェック方法

  • 整合性チェック
    • checking for implausible or impossible combinations of age and education level or income may reveal low-quality data sets.
  • IPアドレスチェック
  • 所要時間チェック
    • Ihme et al., 2009; Keller, Gunasekharan, Mayo, & Corley, 2009; Lahl, Göritz, Pietrowsky, & Rosenberg, 2009; Malhotra, 2008; Musch & Klauer, 2002
    • However, appropriate thresholds necessary to identify speeding are difficult to determine and depend on the distribution of response times (Ratcliff, 1993)
  • IMC
  • Person fit indices
    • Rasch person fit indices
    • a methodologically advanced approach to detecting aberrant responses (Li & Olejnik, 1997).
    • identify atypical response patterns that may occur as a result of cheating (Madsen, 1987) or socially desirable answering behavior (Schmitt, Cortina, & Whitney, 1993).
    • the applicability of such tests is limited. They can be employed only if a test was constructed to fit an item response model (Li & Olejnik, 1997; van den Wittenboer, Hox, & de Leeuw, 1997)
  • Seriousness check
    • improve data quality in an effortless and economic way.
    • A very economic measure for identifying nonserious participants by Musch and Klauer (2002)
    • They directly asked the respondents to indicate the seriousness of their responses.
    • (この時点では)a direct investigation of data quality improvement has not been reported yet
    • Nonserious participantであることを表明する人は5%~6%程度に過ぎない

研究でクオリティチェックが全然行われていない...!(2009~2010の時点で)

an analysis of articles published between 2009 and 2010 in three major journals reporting online studies (Behavior Research Methods, International Journal of Internet Science, and International Journal of Human–Computer Studies) suggests that seriousness checks are used rarely (Table 1). Out of 32 studies recruiting participants online, only 6 (18.8 %) reported the use of one or more measures to ensure or improve data quality. In three cases (9.4 %), multiple participations were detected by logging the respondents’ IP addresses. Four studies (12.5 %) checked for inconsistent answers, and three studies (9.4 %) considered the survey completion time. Egermann et al. (2009) was the only study (3.1 %) employing a seriousness check.

なぜわざわざnonserious participantであることを表明する人がいるのか

(この研究の段階では解明されていない)
It might not be prudent to screen out a participant simply because he or she does not feel competent enough to answer all questions.

Satisficingのフィルタリングを多用することの欠点

They increase the “researcher degrees of freedom” (Simmons, Nelson, & Simonsohn, 2011). Investigators may be tempted to try different combinations of methods and to report only the most favorable results. For example, a completion time cutoff may be chosen in view of the outcome of a subsequent statistical test, rather than on the basis of a solid rationale.

Using seriousness checks may thus have a positive effect on the researchers’ degrees of freedom

Satisficingのチェックは最初にするべきか、最後にするべきか

  • この研究の調査では、最後にseriousness checkを実施した
  • 最後に実施した方が、不誠実な回答者を減らせると期待
  • Reipsの意見:asking a seriousness question already in the early stage of a survey may serve the additional purpose of increasing motivation and reducing the subsequent dropout rate.
  • Oppenheimer et al. のIMCでは、最初にやるか最後にやるかでの差は見られなかったと報告しているが、本研究の結果とは矛盾している様子(Aust et al.は、比較実験はしていないのでは?)

調査内容と結果

(スキップ)

  • their motivation and interest in the survey topic was likely to be above average, and there was no apparent reason for them to be dishonest about the nonseriousness of their participation.

Reips (2009)

Seriousness check

なぜSeriousness checkに通らない人がいるのか?

  • 「単に(まずは)閲覧したいだけ」という人がいる:Of those answering “I would like to look at the pages only” around 75% will drop, while of those answering “I would like to seriously participate now” only ca. 10-15% will drop. Overall, about 30-50% of visitors will fail the seriousness check, i.e. answer “I would like to look at the pages only”.

Dropout(を検知する)

The dropout is sensitive to variations in motivation, and so it can be used to detect motivational confounding [6][7][8].

Warm-up

メインの質問に入る前に、最初に簡単な質問を入れておくことで離脱率を減らせる:As the technique’s name suggests, a period of “warming-up” the participants to the experimental task is inserted before the actual experimental manipulation takes place. Thus, during the actual experiment there is not much dropout. The technique has been shown to work [10]. By applying the warm-up technique Reips, Morger, and Meier were able to reduce dropout to less than 2% during the actual experiment.


Too Fast, too Straight, too Weird: Non-Reactive Indicators for Meaningless Data in Internet Surveys

Leiner, Dominik Johannes. "Too fast, too straight, too weird: Non-reactive indicators for meaningless data in internet surveys." Survey Research Methods. Vol. 13. No. 3. 2019.

目的

  • 生成モデルとその推論のような実験の論文
  • Low quality (LQ)の集団には「いい加減な回答」をしてもらい "High Quality"(HQ)の集団には「まともな回答」をしてもらう。non-reactive indicatorsが、ちゃんとLQとHQを識別できるかを検証する(こういう方法は、Burns and Christiansen (2011) (論文が読めない)で提案されているらしい。also see Allen, 1966; Azfar & Murrell, 2009; Lim & Butcher, 1996; Pine, 1995とあるので、他にもあるみたい)

This study employs an experimental-like design to empirically test the ability of non-reactive indicators to identify records with low data quality

  • Research Question 1 (RQ1): Which non-reactive data quality indicators are the most efficient ones in identifying records of meaningless data in an Internet survey?
  • Research Question 2 (RQ2): Which are the most efficient quality indicators to identify specific types of meaningless data?

Limitation

Respondents from the LQ groups often “failed” to respond carelessly. At the same time, there may be some careless responding in the HQ groups as well

イントロ

  • [無効データの原因は回答者だけではない。質問文の質も重要な要素] The respondent, of course, is only one possible source of invalid data. A fit between the research questions and the employed measures, the wording of questions (Converse & Presser, 2003; Payne, 1980) are necessary prerequisites for useful survey data;
  • [無効データは通常ランダムデータではない (primacy effectなど)] Given that respondents will usually not give statistically random responses, the best case is unlikely to occur. More often, we have to assume that meaningless data is systematically different from valid data regarding response distributions.
  • But when data cleaning is based on untested assumptions, removing data may render new biases (Bauermeister et al., 2012; Harzing, Brown, Köster, & Zhao, 2012).

5 classes of non-reactive data quality indicators

  1. percentage of missing data
  • One strategy is to exclude such variables when computing indices for missing data (??)
  • another strategy is to weight each “miss” with the probability that the variable is answered by the overall sample (??)
  • The percentage of DK responses was then used as third indicator for data quality
  2. patterns in matrix-style questions
  • The number of straightlined (short-)scales, i.e., scales where each item received the same response.
  • More differentiation is provided by the “longest string” (Johnson, 2005, p. 109), which is the length of the longest sequence of the same answer within an item battery.
  • mathematical functions and algorithms were employed to compute indices (Baumgartner & Steenkamp, 2001; Jong, Steenkamp, Fox, & Baumgartner, 2008; Van Vaerenbergh & Thomas, 2012)
  • Indicator 1: standard deviation (SD)
  • Indicator 2: パターン検知 (Total variationや2ステップ前との比較から変化のパターンを見て加算する)
  • pretests with manufactured patterns show that the absolute second derivation (d) of response values (ri) is sensitive to straight, diagonal, and zigzag lines (response valuesはこちらでstraight, diagonal, and zigzag linesなどを想定してデザインし、それがabsolute second derivationによって検出可能か見てみる??straight lineの長さやジグザグのパターンが事例によって異なるから、dのsensitivityを予め見ておく必要があるということか?)
  3. distance from the sample means
  • absolute z-scored response per item, averaged over all scale items.
  • Mahalanobis distance (Johnson,2005; Mahalanobis, 1936), which is a multivariate measure.
  4. correlation structure within the answers
  • records that influence correlations between variables in an atypical way
  • The even-odd consistency (Johnson, 2005; Meade & Craig, 2012), as the first indicator, requires the scale batteries being half-split into even and odd items
  • Another measure for intra-scale consistency is inspired by the idea of using regressions (Burns & Christiansen, 2011; Jandura et al., 2012):
  5. completion time
  • the per-page medians serve as typical completion time (as the distribution of completion times is heavily skewed)
  • outlier is defined as taking 3/1.34 times the interquartile range (IQR) longer than the median completion time (this would be 3 SD, if the distribution was normally distributed).
  • index of relative completion speed: For each page, the sample’s median page completion time is divided by the individual completion time, resulting in a speed factor. この値は、3でカットオフを入れる:This avoids disqualifying respondents who incidentally skip a single page.
  6. (meaning and plausibility)
  • 年齢などで、明らかにおかしい値を検出
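上の指標のうち、回答の標準偏差・longest string・相対回答速度は玩具データで簡単に試せる。以下は完全に仮のデータによるスケッチで、論文のアルゴリズム実装そのものではない。

```python
import numpy as np
import pandas as pd

# 玩具データ:行=回答者、列=同一尺度のリッカート項目(1~5)
resp = pd.DataFrame([
    [3, 3, 3, 3, 3, 3],   # straightline
    [1, 5, 1, 5, 1, 5],   # ジグザグ
    [2, 3, 3, 4, 2, 3],   # ふつうの回答
])

# パターン指標: 回答の標準偏差(0 なら完全な straightline)
sd = resp.std(axis=1, ddof=0)

# パターン指標: longest string = 同一回答の最長連続長(Johnson, 2005)
def longest_string(row):
    best = cur = 1
    for prev, nxt in zip(row, row[1:]):
        cur = cur + 1 if prev == nxt else 1
        best = max(best, cur)
    return best

ls = resp.apply(lambda r: longest_string(list(r)), axis=1)

# 回答時間指標: 相対回答速度 = ページ回答時間の中央値 / 個人の回答時間(3 でカットオフ)
times = pd.Series([12.0, 3.0, 30.0])          # ページ回答時間(秒、仮データ)
speed = (times.median() / times).clip(upper=3)

print(sd.round(2).tolist())   # [0.0, 2.0, 0.69]
print(ls.tolist())            # [6, 1, 2]
print(speed.tolist())         # [1.0, 3.0, 0.4]
```

straightline 行は SD=0・longest string=6 で一目で浮き、速すぎる回答者は speed factor の上限 3 に張り付く、という使い方。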

Studies

  • Study1: Pilot
  • Study2: Heterogeneous Data
  • Study3: Replication I
  • Study4: Replication II

study1 & 2: substantial variance in age and location, but not in education

Study 1

  • questionnaire about “public opinion”
    • starts with polling opinions on public issues (allowing “don’t know” but no missing data)
    • then asks detailed questions on one of these issues (attitudes, relation to values, ambivalence, elaboration, uncertainty) mostly by means of five short multi-item scales, presented in a matrix layout / about the same political issue for all respondents.
    • A significant amount of formally missing data was generated by open-ended questions asking for arguments pro and contra the issue (Cappella, Price, & Nir, 2002).
  • 全部で6パターンのインストラクション
    • The LQ group was subdivided to cover three possible origins of meaningless survey data: (1) rushing, only in study 1, (2) careless responding, and (3) intended faking (Nichols et al., 1989)
    • [インストラクション文章を各2パターン用意している] with two possible treatment instructions for each sub-condition (Appendix A.2).

Study 2

  • Study1と同じ質問形式、ほぼ同じ内容
  • respondents were randomly assigned to one of 17 different political issues in the second questionnaire part.
  • The HQ group in study 2 is much larger (nHQ = 10,580) than in study 1.

Study 3

  • a questionnaire on stress perception, physical condition, and stress-related behaviors (E. E. Schneider et al., 2017).
  • The questionnaire includes scales with considerably more items than studies 1 and 2. The longest scale consisted of 30 items that were presented on two pages but analyzed as one scale to obtain the pattern and consistency indicators
  • DK options are not offered by this questionnaire
  • IMCを入れている(the original IMC, and a simplified version)
  • インストラクション
    • The LQ group (nLQ = 368) was asked to imagine that they had no interest in the questionnaire
    • LQの人が慎重に答えてしまわないように、「あとで慎重に回答できます」という指示もいれた:In response to concerns that respondents may answer “too careful” in the LQ group, an announcement was included that one would afterward have the option to complete the questionnaire carefully. (ただし、この回答はHQに入れなかった: These records did not become part of the HQ group)
    • HQの人には、ちゃんと回答するように指示したものと、何も指示しないものの2種類を割り振った:They were either instructed to attentively read the questions and complete the questionnaire very carefully (nH1 = 413) or were not given such an instruction (nH2 = 438).
    • LQもHQも、インストラクションはハイライトした:The instructions in the LQ and HQ groups, if given, were labeled “important advice” and highlighted visually.

Study 4

  • another questionnaire, researching interpersonal communication in organizations (Beckert et al., 2018; Breitsohl & Steidelmüller, 2018).
  • Only employees having a supervisor were allowed for this study
  • Easier IMC: bogus items like “I am currently filling out a questionnaire” (Hargittai, 2009; Meade & Craig, 2012), or instructed response items like “please select «fully disagree» in this line” (DeSimone et al., 2015).
  • IMCをいれた:simplified IMC, or three bogus items (items with only one possible response option), or three instructed response items (items saying which response option to check) spread throughout three of six scales (5 to 24 items).
  • インストラクション
    • The instruction for LQ group respondents was to imagine that they had no interest in the questionnaire, but only wanted to attend the lottery.
    • They were explicitly asked not to complete the questionnaire carefully.
    • [Study3と同様に、「あとで慎重に回答できます」という指示もいれた] an option was announced to complete the questionnaire carefully, afterwards

結果

Study1

  • The resulting random chance to correctly identify a record from the LQ group is 0.43

    • After removing records from the LQ group that have failed the manipulation check, 475 records with meaningless data had to be distinguished from 621 mostly meaningful records
  • [Good] completion times are most successful in identifying records from the LQ group. These indicators identify about 66% of the LQ records

    • Nagelkerke’s pseudo R2 of .26 when understood as binomial regression (completion time with outliers replaced).
  • [Bad] The scales’ even-odd consistency and the weighted non-response are considerably less efficient.

    • indicators based on item non-response, a good part of the drop can be attributed to a lack of differentiation: Near the cut-off value, many records share the same non-response rate, so far too many records exceed the cut-off value (overidentification).
  • [Bad] The other indicators barely exceed random chance or indicate the LQ records even worse than chance, such as the number of DK responses and the simple distance from the sample mean

  • LQのインストラクションの違い

    • [Good] careless responding sub-condition reflects the indicators’ performance observed for the overall sample
    • [Bad] In the rushing sub-condition, completion time is the only relevant indicator. This suggests that the manipulation failed: Doing a questionnaire as fast as possible does not necessarily provoke meaningless responses(なので、rushingの指示は今後の調査では使わない:the rushing sub-condition is not used for the subsequent studies.)
    • [Bad] In the faking sub-condition, completion time performs much worse than in the other two sub-conditions. [結局、何が言えているか微妙な結果になったということ?] The weighted non-response is the only indicator to identify faked records. Yet, its fair performance may be an artifact

Study2

  • Not a single indicator can identify a substantial part of the faked records in the heterogeneous data from study 2, including the non-response rate that had shown a fair performance in study 1
  • We also find differences regarding careless responding: While completion time is, again, the most efficient indicator, (in)consistent responding cannot identify LQ records in study 2.
  • [??] On the other hand, effortless response patterns can play their strengths in study 2, nearly closing up to the sensitivity of completion time.
  • The fact that this respectable sensitivity is not accompanied by a similarly convincing AUC suggests that the strength of effortless patterns lies in identifying a rather specific part of careless responding

Study3

  • [IMCでLQを正しく検出できる] correctly identify 365 out of 368 LQ records (99%).
  • [IMCではHQをかなり落としてしまう] 347 of 852 HQ records (41%) as meaningless data(5%以上実際にmeaningless dataがHQに入っているとは考えにくい: unlikely that more than 5% of the HQ records actually contain meaningless data.)
  • [IMCでover-sensitiveなのはよくあること] Such an over-sensitive indication is not untypical for the IMC (Revilla & Ochoa, 2014).
  • [the simplified IMCの性能] misidentified 30% HQ records(つまり70%は通過した), while 87% of LQ records were identified correctly
  • [Good] The completion time that showed above-average performance in studies 1 and 2, and also performs best in study 3
  • [??] while losing much fewer records from the LQ group (6%).
  • [ちゃんと答えるようにというインストラクションによる影響は見られなかった] Records from the LQ group are no easier to identify if the HQ group was instructed to complete the questionnaire carefully, than if they were not.

Study4

  • overall indicator performance is similar to the previous studies (table 5).
  • [Good] Again, completion time is the most effective non-reactive indicator.
  • [Bad] (通常の)The IMC, a reactive indicator, again is disproportionately strict
  • [Good] Instructed response items perform similarly good as completion time
  • [Good] bogus items can outperform any other indicator (reactive indicatorの中では一番): Only three bogus items are sufficient to correctly identify 92% of the meaningless records.

Discussion

  • a substantial share of meaningless records can be identified by completion time.
  • lack of response variation (near-straightlining) also achieves a respectable identification rate for careless responses in studies 1 and 2.
  • The other classes of non-reactive indicators (item non-response, untypical responses, and inconsistent answering within scales) are of little help in identifying meaningless data.
  • Bogus items outperform completion time, although this paper cannot address the question, whether this outstanding performance can be generalized
  • Even the best available non-reactive indicators identify only a fraction of the problematic records

Recommendations

  • Suggestion 1: completion timeは使える
  • suggestion 2: The second recommendation is to reduce reliance on post-hoc analyses by, for example, sprinkling a few bogus items throughout different scale batteries of the questionnaire.
    • Negative effects of such items have been discussed (Curran, 2016; Goldsmith, 1989),
    • Breitsohl and Steidelmüller (2018) present empirical evidence that these bogus items have little effect on response behavior
  • This cut-off was calculated to estimate the potential of the indicator. A pragmatic recommendation is to use a much more lenient cut-off of 2.0 to identify particularly suspicious records.
  • If the questionnaire asks for facts or knowledge instead of opinions, then completion time is not a valid indicator for meaningless data. Experts are obviously faster in giving facts than nonexperts who must select the information from the filing cabinet.

Indicators

  • sensitivity (true positive rate, or hit rate): The primary performance criterion then is the percentage of LQ records that have been correctly identified by the chosen cut-off value.
  • the area under the curve is calculated as a secondary criterion
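sensitivity(カットオフにおける hit rate)と AUC(副次基準)の計算自体は単純で、仮データで書くと次のようになる。y や score の値はこの例のための作り物。

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# 仮データ:y=1 が LQ 群、score は指標値(大きいほど LQ らしいと想定)
y = np.array([1, 1, 1, 0, 0, 0, 0])
score = np.array([0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.05])

cutoff = 0.5
# sensitivity = カットオフを超えて正しく LQ と判定された LQ レコードの割合
sensitivity = ((score >= cutoff) & (y == 1)).sum() / (y == 1).sum()
# AUC = カットオフに依存しない識別力の副次基準
auc = roc_auc_score(y, score)
print(round(float(sensitivity), 3), round(float(auc), 3))  # → 0.667 0.917
```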

The fact that there is no manipulation check for the HQ groups in studies 1 and 2 causes us to underestimate the sensitivity and AUC but does not change the indicators’ relative ranking.
(IMCがないと、HQに採用しがちになるので、LQの割合(=sensitivity)をunderestimateする。AUCもunderestimateする(測るareaによるのでは?)。しかし、相対的なランキングには影響しない(どういうこと?))

  • IMCは、non-reactive indicatorではなくreactive indicator

Speeding in Web Surveys: The tendency to answer very fast and its association with straightlining

Zhang, Chan, and Frederick Conrad. "Speeding in web surveys: The tendency to answer very fast and its association with straightlining." Survey research methods. Vol. 8. No. 2. 2014.

Speeding = 回答時間が短すぎること

分かったこと

  • Speedingを行う人は、(speedingを行う人に条件づけした場合、属性と無関係に)straightlineも行う傾向がある(respondents who are prone to speed are also prone to straightline regardless of their demographics.)
  • 教育レベルが低いと、speedingやstraightlineをする傾向がある
    • (これは、Malhotra (2008)でもすでに指摘されているのでは?)
    • (交絡については議論されていない。これは因果というより相関の話。)
    • (respondents’ education matters: for the low-education respondents, speeding is associated with a drastic increase in straightlining, while the increase is more modest for the highly educated.)
  • The speeding–straightlining relationship is also correlational (We are not suggesting that one behavior causes the other.)
  • Speeding is thought to be a characteristic of the respondent rather than something dependent on the survey content (is variation in question difficulty being ignored?)
    • at least in this study, speeding is not a variable or intermittent behavior but likely a stable characteristic of respondents.
    • We refer to these respondents as “persistent speeders.”
    • Doesn't this contradict the Yan and Tourangeau (2008) results cited in Malhotra (2008)?

How speeding is defined

Threshold = number of words × 300 msec/word

The current study employs a simple measure of speeding. Specifically, we set the speeding threshold as 300 milliseconds (msec) per word, a rough estimate of reading speed, multiplied by the number of words in the question.
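
This threshold rule can be sketched directly; the question text and timing below are made-up illustrations, not items from the study.

```python
# Sketch of the speeding rule above: flag a response as "speeding" when its
# response time falls below 300 msec per word of question text.

SPEED_MS_PER_WORD = 300  # rough per-word reading-speed estimate

def speeding_threshold_ms(question_text: str) -> int:
    """Minimum plausible response time for this question, in milliseconds."""
    return SPEED_MS_PER_WORD * len(question_text.split())

def is_speeding(question_text: str, response_ms: int) -> bool:
    return response_ms < speeding_threshold_ms(question_text)

q = "How satisfied are you with your current job overall"  # 9 words
print(speeding_threshold_ms(q))  # 2700 ms
print(is_speeding(q, 1500))      # True: well under the threshold
```

A per-respondent speeding count (and hence the "persistent speeder" classification) follows by applying `is_speeding` across all questions.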

Analysis method

Analysis restricted to persistent speeders

  • Logistic regression
  • Explanatory variables: age, gender, education level, origin (Dutch, first-generation immigrant, or second-generation immigrant), tenure as a panel member, early vs. late responding, and whether the household received any device (computer, Internet connection, or both) from the panel to complete surveys
  • Gender, education level, and origin were not selected as significant variables, so they are ignored

Relationship between Speeding and Straightlining

  • (Examined question by question) As can be seen in Table 4, across all 8 grid questions examined, respondents who sped on the question were substantially more likely to straightline than respondents who did not speed.
  • we used negative binomial regressions to model the number of grid questions on which respondents straightlined. The explanatory variables include respondents’ speeding tendency (persistent speeder vs. not) as well as the demographic variables we have used in the speeding model (i.e., age, gender, education, origin, tenure, early vs. late respondents, and whether they received any device from the panel).
  • we also tested interaction effects between speeding tendency and the demographic variables

Interesting remarks

  • Response time is also used as an indicator of question difficulty: Response times have also been used to identify difficult questions. For example, Draisma and Dijkstra (2004) found that longer response times were associated with incorrect answers in a validation study with a series of binary (yes/no) factual questions. In addition, response times have been proposed as a way to identify badly worded questions (e.g., Bassili, 1996).
  • Earlier papers have studied response times: a few studies (Couper & Kreuter, 2013; Yan & Tourangeau, 2008) have analyzed the influences of question-level and respondent-level characteristics
    on response times, the response times examined in their studies were the actual time respondents spent answering questions, not the optimal time required to answer the questions accurately.
  • Two-option questions (e.g., yes/no) are excluded from the straightlining count: We excluded one grid question consisting of only two statements.

How can speeding be discouraged?

  • [Effective] reminding respondents they were answering very fast and asking them to slow down. This approach
    has been tested in a series of experiments by Conrad and his colleagues (i.e., Conrad et al., 2011; Zhang, 2013).
  • Another approach is to include the speeding information (e.g., the total incidences of speeding and the speeding status on a particular question) with final survey datasets.

Malhotra 2008

Malhotra, Neil. "Completion time and response order effects in web surveys." Public opinion quarterly 72.5 (2008): 914-934.

Research question

Do respondents who complete Web surveys more quickly also produce data of lower quality?

Background

  • Yan and Tourangeau (2008) have found that response time to individual items presented over the Web is related to both respondent characteristics (e.g., age, education, experience with Internet-based questionnaires) and item characteristics (e.g., the number and type of response categories, question location).
  • knowledgeable respondents with strong attitudes may also report their responses more quickly (e.g., Krosnick 1989; Bassili 1993).

Response order effects

These are mediated by three factors:

  • satisficing
  • memory limitation
    • [If the list cannot be referenced visually, it must be held in memory] respondents are not able to remember full lists of items and are prone to recency effects in orally presented items because they are more likely to remember the last item they heard (Smyth et al. 1987).
  • cognitive elaboration
    • [Positively framed options are more likely to be selected (on an ordinal scale the options are listed in order of endorsement)] each response alternative induces argumentation within the respondent’s mind, meaning that responses listed higher up on visually presented scales that elicit positive, agreeing thoughts are more likely to be selected (Krosnick and Alwin 1987; Schwarz, Hippler, and Noelle-Neumann 1992).

[However, primacy effects in visually presented, verbally labeled rating scales can only be a consequence of satisficing] primacy effects in visually presented, verbally labeled rating scales likely are only the consequence of satisficing, and not of memory or elaboration effects.

Survey sample

  • attitudes toward the government response to Hurricane Katrina in the city of New Orleans.
  • The survey was administered by KN over the Internet between May 26 and May 31, 2006
  • 397 American adults recruited via random digit dialing (RDD).
  • All data are weighted by demographics

Survey method

Skipped

Analysis method

  • To assess whether completion time moderated order effects in the rating scales, I estimated the following structural component of a Poisson regression model

E[Y_i | x_i] = exp{β_0 + β_1 R_i + β_2 T_i + β_3 (R_i × T_i) + β_4 A_i + β_5 (R_i × A_i)}

  • Y_i = the number of times respondent i selected one of the “top two” response options as they were presented in the control group (i.e., “top two” is defined by the control-group ordering)
  • R_i = dummy for whether the response order was reversed

  • T_i = log of completion time (the natural log of the time taken to complete the survey in minutes)

  • A_i = respondent's age

  • low cognitive skills, which I proxy using level of education (really...?)

  • The treatment effect is defined as the difference between the control and reverse order conditions in the expected count of selecting one of the top two response options.

  • Confidence intervals of the treatment effects are calculated via the delta method (Xu and
    Long 2005).
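
Given the structural model above, the treatment effect at fixed T and A is just a difference of two exponentials. A minimal sketch, with made-up coefficients (the paper's actual estimates are not reproduced here):

```python
import math

# Expected count under the Poisson specification above.
# R = reversed-order dummy, T = log completion time, A = age.
def expected_count(beta, R, T, A):
    b0, b1, b2, b3, b4, b5 = beta
    return math.exp(b0 + b1 * R + b2 * T + b3 * (R * T) + b4 * A + b5 * (R * A))

beta = (0.5, 0.8, -0.2, -0.3, 0.01, -0.01)  # hypothetical estimates
T = math.log(10.0)  # survey completed in 10 minutes
A = 40              # respondent age

# Treatment effect: difference in expected "top two" counts between the
# reverse-order and control conditions, at fixed T and A.
effect = expected_count(beta, 1, T, A) - expected_count(beta, 0, T, A)
print(effect)
```

With these hypothetical coefficients (negative β_3), the order effect shrinks as completion time grows, which is the moderation pattern the regression is set up to test; confidence intervals for `effect` would come from the delta method, as in the paper.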


Fast Times and Easy Questions: The Effects of Age, Experience and Question Complexity on Web Survey Response Times

Yan, Ting, and Roger Tourangeau. "Fast times and easy questions: The effects of age, experience and question complexity on web survey response times." Applied Cognitive Psychology: The Official Journal of the Society for Applied Research in Memory and Cognition 22.1 (2008): 51-68.

  • Cites a large body of earlier psychology research
  • Rather than satisficing per se, the paper models what generally makes survey responses take longer
  • One point of the paper is that response time depends on question characteristics. Discussions of satisficing, however, usually condition on question content, so when other papers conclude that speeding is a respondent trait, that may simply be because question effects are held fixed by conditioning.

Claim

Response times are affected by both question characteristics and respondent characteristics:
The results from the cross-classified models indicate that response times are affected by question characteristics such as the total number of clauses and the number of words per clause that probably reflect reading times. In addition, response times are also affected by the number and type of answer categories, and the location of the question within the questionnaire, as well as respondent characteristics such as age, education and experience with the Internet and with completing web surveys.

Introduction

  • Since the middle of the 19th century when the Dutch psychologist F. C. Donders conducted his
    pioneering work on response times (Donders, 1868), experimental and cognitive psychologists have routinely collected response time data.
  • Response-time distributions are skewed, so the mean is a poor summary statistic. Zandt and Ratcliff use trimmed means, harmonic means and medians.

Dataset

Skipped

Results

Model 1:

A fully unconditional model (which is equivalent to a two-way ANOVA model with random respondent and question effects)

Y_ijk = u_0 + b_00j + c_00k + e_ijk

Model 2: Simplest conditional model:

the main effects of the question-level predictors are assumed to be constant over respondents and the main effects of the respondent-level predictors are assumed to be constant over questions.

Model 3:

allows the main effects of certain selected respondent-level characteristics to vary randomly across question items and the main effects of certain selected question-level characteristics to vary randomly across survey respondents.

Conclusions

  • Education, as expected, has a significant effect on response times.
  • [Older respondents answering more slowly has also been confirmed in telephone surveys] Older respondents on an average are slower than younger respondents. This result is also consistent with findings by Fricker, Galesic, Tourangeau, and Yan (2005), who compared web and telephone versions of the same questions. They showed that the time needed to complete the questions increased with age for both web and telephone respondents, but the relation between age and completion times was much steeper for those who completed the web version of the questions.
  • Unsurprising results are also reported, e.g., longer question texts take longer to answer.
  • [The response scale does not affect response time, except for fully labelled rating scales] Neither does it make a difference in response times whether the response scale is a frequency scale or a rating scale and whether every point of the frequency scale is labelled or just the end points. However, there is a marginally significant effect of a fully labelled rating scale on response times.

Identifying Careless Responses in Survey Data

Meade, Adam W., and S. Bartholomew Craig. "Identifying careless responses in survey data." Psychological methods 17.3 (2012): 437.

Background

  • While historically the base rate of careless or inattentive responding has been assumed to be low
    (Johnson, 2005), there is reason to believe that careless responding may be of concern in contemporary Internet-based survey research, particularly with student samples.
  • indices of consistency calculated by comparing responses from items in different locations in the survey. → However, few people respond randomly throughout the whole survey (so might respondents who answer properly end up being dropped?)
  • In sum, it appears that relatively few respondents provide truly random
    responses across the entire length of a survey. However, it would seem that occasional careless response to a limited number of items may be quite common.

Factors Affecting Careless Responding

  • Respondent Interest
  • Survey Length
    • Indeed, respondents are more likely to self-report responding randomly toward the middle or end of long
      survey measures (Baer et al., 1997; Berry et al., 1992). Even highly motivated samples such as job applicants may feel fatigue effects in long surveys (Berry et al., 1992).
  • Social Contact
    • [Online anonymity reduces personal accountability and can lead to negative behavior such as posting negative comments] Online anonymity has been shown to reduce personal accountability leading to a greater potential for negative behavior, such as posting negative comments in a discussion group (Douglas & McGarty, 2001; Lee, 2006).
  • Environmental Distraction

Detection: Methods for Identifying Careless Responders

  1. [Designs that check whether instructions are actually read] The first type requires special items or scales to be inserted into the survey prior to administration
  • Check response patterns for dishonesty: for example, under social desirability bias, reporting that one practices virtuous behaviors that are normally difficult may indicate lying (look up the MMPI-2 Lie scale).
  • Check whether respondents read carefully (IMC is not cited): instructed response items (e.g., “To monitor quality, please respond with a two for this item”).
  2. [Post hoc evaluation methods]
  • One class is consistency indices
    • [Example 1 | this is straightlining] Variants of the consistency approach, which we term response pattern indices, are intended to identify persons responding too consistently to items measuring theoretically distinct constructs. These indices are typically computed by examining the number of consecutive items for which a respondent has indicated the same response option.
    • [Example 2 | outlier detection] A second general class of indices are outlier indices. While univariate outlier analysis may have some utility for identifying extreme cases, multivariate approaches such as Mahalanobis distance are much more appropriate, as they consider the pattern of responses across a series of items
  • Response time
  • [Surveying in an identified (non-anonymous) form reduces careless responding] it is possible that forcing respondents to respond in an identified manner would lead to fewer careless responses.

Study 1

Study 2

Recommendations

  • self-report measure (i.e., “In your honest opinion, should we use your data?”)
  • If only post hoc methods are available, then inspection of response time and computation of the Even-Odd
    Consistency measure are suggested as minimums.
  • We found very little difference in the performance of the Maximum and Average LongString indices
  • We also believe the Mahalanobis D measure to be very useful.

Curran (2016)

Invariability: Long-String Analysis

  • known in the literature as ‘long-string analysis’ or ‘response pattern indices’ (Huang et al., 2012; Meade & Craig, 2012).
  • This technique seems to have formally begun with Johnson’s (2005) use of a borrowed technique
  • [A scale with items of varying intensity should keep careful respondents from repeating the same option; conversely, formats such as yes/no produce long strings easily] In addition, a scale that has questions of varying intensity (e.g. ‘I have a lot of friends’ vs ‘I make friends with every person I meet’) should drive careful responders to change their response option more frequently than on a scale that has questions all geared at roughly the same level.
  • [When some option such as “Agree” is habitually popular, identical responses naturally run long, which makes cut-offs hard to set] long string analysis can be difficult to compare across different data collections without engaging in some degree of scaling.

Outlier Analysis: Mahalanobis Distance

  • Drawback 1: First, as Meade & Craig (2012) note, Mahalanobis D is a computationally intensive procedure
  • Drawback 2: These metrics rely on a certain degree of normality in the data, and can be influenced by deviations from normality in items, as well as from too much normality in C/IE responders (Meade & Craig, 2012).
  • the practical application of Mahalanobis D in C/IE responding has to deal with the fact that responses are on a scale with constrained limits.
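
A minimal two-item sketch of the Mahalanobis screen, with a hand-rolled 2×2 covariance inverse so it stays self-contained; the response data are toy values, not from any of the papers above.

```python
# Sketch: Mahalanobis-distance outlier screening on two items (toy data).

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def mahalanobis_sq(point, data):
    """Squared Mahalanobis distance of `point` from the centroid of `data`."""
    xs, ys = [p[0] for p in data], [p[1] for p in data]
    mx, my = mean(xs), mean(ys)
    sxx, syy, sxy = cov(xs, xs), cov(ys, ys), cov(xs, ys)
    det = sxx * syy - sxy * sxy  # covariance matrix must be non-singular
    dx, dy = point[0] - mx, point[1] - my
    # (dx, dy) @ inv(S) @ (dx, dy)^T for the 2x2 sample covariance S
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

responses = [(4, 4), (5, 4), (4, 5), (3, 3), (5, 5), (1, 5)]
d2 = [mahalanobis_sq(p, responses) for p in responses]
print(d2)  # the pattern-breaking (1, 5) record gets the largest distance
```

In practice one would use more items and a library routine, flagging respondents whose distance exceeds a chi-square-based cut-off; this toy version only illustrates that the index reacts to the *pattern* of responses, not to any single extreme value.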

Individual Consistency

  • Odd-Even Consistency
    • The score for this person in terms of odd-even consistency is then simply the correlation between the vectors (Aodd, Bodd, Codd) and (Aeven, Beven, Ceven), corrected with the Spearman-Brown prophecy formula, if desired.
    • variance must be present at this halved-scale level
  • Resampled Individual Reliability
    • Odd-even consistency dates from an era of limited computing power. A modern alternative is to generate many random split-halves by resampling, correlate each pair of halves, and average the correlations to estimate reliability.
  • Semantic and Psychometric Antonyms/Synonyms
    • (1) psychometric antonyms, (2) psychometric synonyms, (3) semantic antonyms, and (4) semantic synonyms.
    • ‘I am happy right now,’ vs ‘I am sad right now.’
    • There is no concrete value of correlation magnitude that is considered large enough
  • Inter-item Standard Deviation
    • this is nothing more than a within-individual standard deviation
    • It is proposed by Marjanovic, et al. (2015) that higher values on ISD are more indicative
      of random responding
    • However, scores on this technique do not appear to be linearly related to randomness, per the example above.
  • Polytomous Guttman Errors
    • [The Guttman-scale idea] individuals should get easy items correct up to a point, then get all remaining, and more difficult, items wrong.
    • [A Guttman error is a response that violates this difficulty ordering]
    • [Explanation of polytomous Guttman errors (quoted from Curran & Denison (2016))] The extension into polytomous data follows this same basic concept, in that an individual who agrees more strongly to an item with a lower probability of agreement after agreeing less strongly to an item with higher probability of agreement.
  • Person Total Correlation
    • this person-total correlation is a measure of how consistently each person is acting like all other persons.
    • The core of this idea of person-total correlations, as stated above, is an extension of work on tests done by Donlon and Fischer (1968).
    • [The scale score (total) is computed excluding the item in question] A corrected or adjusted item-total correlation removes the value of the item in question from the calculation of the scale score. [Judging from Fig. 2, the person-total correlation likewise excludes the respondent in question when computing the total score, as with the item-total correlation]
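
The odd-even index above can be sketched for a single respondent; the subscales and responses below are hypothetical, and the Spearman-Brown "prophecy" correction is applied as described.

```python
# Sketch: odd-even consistency for one respondent. Each subscale's items are
# split into odd- and even-positioned halves; half-scale means are correlated
# across subscales, then Spearman-Brown corrected. Toy data.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def odd_even_consistency(subscales):
    """subscales: list of per-subscale response lists for one respondent."""
    odd = [sum(s[0::2]) / len(s[0::2]) for s in subscales]    # 1st, 3rd, ... items
    even = [sum(s[1::2]) / len(s[1::2]) for s in subscales]   # 2nd, 4th, ... items
    r = pearson(odd, even)
    return 2 * r / (1 + r)  # Spearman-Brown correction

resp = [[4, 5, 4, 4], [2, 1, 2, 2], [5, 5, 4, 5]]  # three hypothetical subscales
print(odd_even_consistency(resp))  # near 1 for a consistent respondent
```

Note the caveat from the list above: the index is undefined without variance at the half-scale level, so in real use degenerate respondents (e.g., pure straightliners) must be handled before computing the correlation.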

Johnson (2005)

Johnson, John A. "Ascertaining the validity of individual protocols from web-based personality inventories." Journal of research in personality 39.1 (2005): 103-129.

  • Researchers have identified three major threats to the validity of individual protocols.

Misrepresentations

  • (Psychological assessment) To “fake good,” means to claim to be more competent, well-adjusted, or attractive than one actually appears to be in everyday life. The CPI Good Impression (Gi) scale (Gough & Bradley, 1996) was designed to detect this kind of misrepresentation.

  • (Psychological assessment) To “fake bad” means to seem more incompetent or maladjusted than one normally appears to be in everyday life. The CPI Well-Being (Wb) scale is an example of such a “fake bad” protocol validity scale (Gough & Bradley, 1996).

  • 4.1 Non-native speakers may have difficulty with both the literal meanings of items and the more subtle sociolinguistic trait implications of items (Johnson, 1997a).

  • 4.2 (Reduced sense of accountability) This distance may give participants a sense of reduced accountability for their actions (although anonymity may sometimes encourage participants to be more open and genuine; see Gosling et al., 2004).

  • 4.3 The problem of duplicate submissions: Repeat participation

    • Protocols that appear consecutively in time or share the same nickname and contain identical responses to every item can be confidently classified as duplicates.
    • The strategy of the present study was to examine the frequency curve of identical responses between adjacent protocols (sorted by time and nickname)
  • 4.4 [About straightlining] using the same response category repeatedly and leaving items unanswered. Misreading, misplacing responses, and responding randomly can only be estimated by measuring internal consistency

  • 4.7 Item response theory models

    • [Odd-Even Consistency] In Jackson’s method, items within each of the standard scales are numbered sequentially in the order in which they appear in the inventory and then divided into odd-numbered and even-numbered subsets.
    • determining cut points that identified protocols as too inconsistent was based on the frequency curves for the individual reliability coefficients and psychometric antonym scores
    • the two consistency indices were entered into a principal components factor analysis with the standard scales of the personality inventory used in the study.

5. SUMMARY OF THE PRESENT RESEARCH PLAN

  • No previous study had examined the relation between the measures, their relation to structural validity, or
    their relation to the five-factor model

Results

  • Goldberg coefficient scores were nearly normally distributed. The skewed distribution for the Jackson coefficient is what one would expect if most participants were responding with appropriate consistency.
  • The Jackson and Goldberg consistency indices correlated moderately (r = .49) with each other, although the magnitude of the correlation is probably attenuated by the skew of the Jackson measure.

Huang et al. 2012

  • comprehensive empirical evaluation of the methods
  • the extent to which inclusion of unmotivated responses can affect the psychometric properties of measures
  • the extent to which different indices tap into an underlying construct

Inconsistency Approach

  • Item pairs are created in three ways, including (i) direct item repetition, (ii) rational selection, and (iii) empirical selection.
  • Lucas and Baird (2005) also recommend survey researchers to design very similar questions in different places of a questionnaire to check against IER.
  • Schinka et al. (1997) selected ten-item pairs into an Inconsistency Scale (INC) for the NEO Personality Inventory-Revised
  • Johnson (2005) applied two variants of the inconsistency approach to detect careless respondents to online surveys, including (i) Goldberg’s psychometric antonyms and (ii) Jackson’s (1976) individual reliability.
    • The effectiveness of these two methods has yet to be evaluated.

Study 1

Constructing the psychometric antonyms

inter-item correlations were first computed for the entire sample, and 30 unique pairs of items with the highest negative correlation were selected. The correlation between the 30 pairs of items for a normal respondent was expected to be highly negative. The correlation was then reversed, with a lower score indicating a higher probability of IER.


Straightlining summary

Niessen 2016

  • maximum longstring
  • The maximum longstring equals the number of times a respondent chooses the same response option consecutively.
  • Although this seems to be a very useful measure to detect this particular type of careless responding, a cutoff score is difficult to establish (Johnson, 2005).
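The maximum-longstring index described above reduces to a run-length count; a minimal sketch on toy response vectors:

```python
# Sketch: maximum longstring = the longest run of identical consecutive
# responses in one respondent's answer vector (toy data).

def max_longstring(responses):
    best = run = 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

print(max_longstring([3, 3, 3, 3, 2, 5, 5, 1]))  # 4
print(max_longstring([1, 2, 3, 4, 5]))           # 1
```

Per-page variants (Meade & Craig's Avg/Max LongString) follow by applying this per webpage and then averaging or taking the maximum; as the notes stress, the hard part is not the computation but choosing a defensible cut-off.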

Huang et al. 2012

  • long string index
  • Concrete cut-off values have been proposed for personality inventories such as the NEO-PI-R (Costa & McCrae, 2008):
    • none of their 983 cooperative participants selected the same response option more than 6, 9, 10, 14, and 9 times for the response options from strongly disagree to strongly agree, respectively.

Costa and McCrae (2008)

Costa, Paul T., and Robert R. McCrae. "The revised neo personality inventory (neo-pi-r)." The SAGE handbook of personality theory and assessment 2.2 (2008): 179-198.

Meade & Craig (2012)

  • we simply computed the maximum number of items with consecutive response, regardless of the value
    of that response.
  • an overall measure (Avg LongString) was computed as the average of the LongString variable, averaged
    across the nine webpages that included 50 items. A second overall index (Max LongString) was computed as the maximum LongString variable found on any of the webpages.

Curran (2016)

  • known in the literature as ‘long-string analysis’ or ‘response pattern indices’ (Huang et al., 2012; Meade & Craig, 2012).
  • This technique seems to have formally begun with Johnson’s (2005)
  • Because values are simple counts of sequential matches, values will always take on integers bounded by 1 and the length of the assessment
  • (The finer-grained the response scale, the less likely the same score runs consecutively) In addition, a scale that has questions of varying intensity (e.g. ‘I have a lot of friends’ vs ‘I make friends with every person I meet’) should drive careful responders to change their response option more frequently than on a scale that has questions all geared at roughly the same level.
  • the typical frequency of a long-string on certain response options tends to be higher than on others; ‘Agree’ is a very popular choice on a typical scale (Curran, Kotrba, & Denison, 2010; Johnson, 2005).
  • long string analysis can be difficult to compare across different data collections without engaging in some degree of scaling.
    • there are no established global cut scores in place for it.
  • Long-string analysis has the potential to remove some of the worst of the worst responders, but may have difficulty doing much more.
    • Despite some limitations, the removal of these worst responders is better than nothing at all.

DeSimone et al. (2016)

  • the use of multidimensional scales or scales with both positively and negatively worded items is recommended when employing the longstring screening technique.
  • questionnaires containing all positively worded items should be more lenient on the longstring
    index than those containing both positively and negatively worded items

Detection via synonym/antonym pairs

Few papers state explicitly whether “correlation” means the Pearson correlation coefficient, but presumably it does.

DeSimone et al. (2016)

  • Semantic synonyms: “I enjoy my job” may be deemed semantically synonymous with “I like my current occupation.”
    • Researchers should note that it is not advisable to require perfect response consistency because minor differences in responses may reflect subtle differences in item content.
  • Semantic antonyms: “I intend to leave my current organization soon” and “I plan to work here for at least the next ten years,” the participant is demonstrating response inconsistency.
    • It is recommended to limit the survey to two to four semantically synonymous or antonymous item pairs (depending on original survey length).
  • Psychometric synonyms: Whereas the semantic synonym method relies on content experts to identify similar items, the psychometric synonym approach identifies similar items by the magnitude of inter-item correlations. Item pairs with the highest inter-item correlations are defined as psychometrically synonymous.
    • To use this technique effectively, researchers should set a minimum value for the identification of synonymous item pairs before examining the inter-item correlation matrix.
    • Meade and Craig (2012) used an inter-item correlation cutoff of .60 to identify synonymous item pairs.
    • There is no guarantee that any synonymous item pairs will be identified.
  • Psychometric antonyms: item pairs with the largest negative inter-item correlations (e.g. item pairs correlated below -.60).
    • This technique is most effective when both positively and negatively worded items are included in the
      survey instrument.

Procedure

  1. identify item pairs (semantically or psychometrically).
  2. the researcher must correlate the vector of responses to the first items in the set with the vector of responses to the second items in the set.
  • For example, if a researcher identifies three item pairs (questions 1 and 4, questions 3 and 9, questions 12 and 13), then the index would be computed by correlating responses to items 1, 3, and 12 with responses to items 4, 9, and 13. This correlation can be computed for each respondent and serves as the screening index.
  • Meade and Craig (2012) eliminated respondents with psychometric synonym coefficients below .22.
  • Both Huang et al. (2012) and Johnson (2005) screened respondents with psychometric antonym coefficients greater than −.03.
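
The two-step procedure above can be sketched per respondent; the item pairs reuse the illustrative numbering from DeSimone et al.'s example, and the response values are hypothetical.

```python
# Sketch: within-person synonym/antonym index. Correlate a respondent's
# answers to the first items of each pair with their answers to the second
# items. Toy responses; pairs follow the (1,4), (3,9), (12,13) example.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def pair_index(responses, pairs):
    """responses: dict item_id -> answer; pairs: list of (item_a, item_b)."""
    first = [responses[a] for a, b in pairs]
    second = [responses[b] for a, b in pairs]
    return pearson(first, second)

pairs = [(1, 4), (3, 9), (12, 13)]  # synonym pairs (illustrative numbering)
careful = {1: 4, 4: 5, 3: 2, 9: 1, 12: 5, 13: 5}
print(pair_index(careful, pairs))  # strongly positive for a consistent respondent
```

A respondent would then be screened when the index falls below the chosen cut-off (e.g., .22 for psychometric synonyms in Meade and Craig (2012)); note the index is undefined if either half has zero variance, so very few pairs make it fragile.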

Huang et al. (2012)

  • psychometric antonyms: inter-item correlations were first computed for the entire sample, and 30 unique pairs of items with the highest negative correlation were selected. The correlation between the 30 pairs of items for a normal respondent was expected to be highly negative.
  • A psychometric antonym score was computed for each protocol by correlating the two sets of items within person.

Curran (2016)

  • History: The emergence of this family of techniques for C/IE detection appears to have been with semantic antonyms, that is, pairs of items that are semantically opposite (Goldberg & Kilkowski, 1985).
  • semantic antonyms: These pairs are determined to be opposite simply on their content, and such pairs can be (and should be) created in the absence of data.
  • psychometric antonyms: The natural extension from a priori pairings to data-driven pairings led to the examination of psychometric antonyms, pairs of opposite items that are found by searching for large negative correlations between individual items in the full sample or in a secondary dataset (Johnson, 2005)
  • There is no concrete value of correlation magnitude that is considered large enough

Meade and Craig (2012)

  • Synonym and antonym pairs are selected by correlation value rather than by a fixed number of pairs:
    • Antonyms: We used a similar index in the current study. However, rather than using 30 item pairs, we sought to ensure item pairs were truly opposite in meaning and only retained item pairs with a negative correlation stronger than –.60. As a result, this index included only five item pairs.
    • Synonyms: As before, within-person correlations were computed across item pairs exceeding this .60 threshold. There were 27 such pairs.

Goldberg&Kilkowski (1985)

The problem of response bias

  • Under social desirability bias, synonym pairs tend to show strong positive correlations and antonym pairs strong negative correlations. In that case it makes no difference whether synonym or antonym pairs are measured.
  • Under acquiescence bias, synonym pairs still tend to show strong positive correlations, but the negative correlation for antonym pairs is expected to weaken. Since the two kinds of pairs are affected differently, which one is measured should then make a difference.

Word difficulty

  • Respondents' verbal ability may affect how consistent their responses are
  • Including dictionary-style definitions of words might improve consistency
  • Draisma, Stasja, and Wil Dijkstra. "Response latency and (para) linguistic expressions as indicators of response error." Methods for testing and evaluating survey questionnaires (2004): 131-147.

  • Bassili, John N. "The how and why of response latency measurement in telephone surveys." (1996).

  • Reips, Ulf-Dietrich. "Internet experiments: Methods, guidelines, metadata." Human vision and electronic imaging XIV. Vol. 7240. SPIE, 2009.

  • Leiner, Dominik Johannes. "Too fast, too straight, too weird: Non-reactive indicators for meaningless data in internet surveys." Survey Research Methods. Vol. 13. No. 3. 2019.

  • Bauermeister, Jose A., et al. "Data quality in HIV/AIDS web-based surveys: Handling invalid and suspicious data." Field methods 24.3 (2012): 272-291.

  • Harzing, Anne-Wil, et al. "Response style differences in cross-national research: dispositional and situational determinants." Management International Review 52 (2012): 341-363.