Commit
Eval llm (OpenNMT#2410)
* add eval LLM script along MMLU dataset
* add benchmarks
vince62s authored Jun 15, 2023
1 parent 7ab633e commit 2f06387
Showing 181 changed files with 21,971 additions and 1 deletion.
1 change: 0 additions & 1 deletion .gitignore
@@ -6,7 +6,6 @@ multi-bleu.perl
.idea
*.sublime-*
.DS_Store
data/

# Byte-compiled / optimized / DLL files
__pycache__/
22 changes: 22 additions & 0 deletions eval_llm/MMLU/data/README.txt
@@ -0,0 +1,22 @@
This file contains the dev, val, and test data for our multitask test.
The dev dataset is for few-shot learning to prime the model, and the test set the source of evaluation questions.
The auxiliary_training data could be used for fine-tuning, something important for models without few-shot capabilities. This auxiliary training data comes from other NLP multiple choice datasets such as MCTest (Richardson et al., 2013), RACE (Lai et al., 2017), ARC (Clark et al., 2018, 2016), and OBQA (Mihaylov et al., 2018).
Unless otherwise specified, the questions are in reference to human knowledge as of January 1st, 2020. In the far future, it may be useful to add to the prompt that the question is written for 2020 audiences.

--

If you find this useful in your research, please consider citing the test and also the ETHICS dataset it draws from:

@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}

@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
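The README above says the dev split is meant to prime the model with few-shot examples. A minimal sketch of that idea follows, assuming the headerless column layout seen in the dev files of this commit (question, four options, answer letter). The prompt layout and the helper names `format_example` and `build_prompt` are illustrative assumptions, not necessarily what the evaluation script in this commit does.

```python
import csv
import io

# Two rows copied from abstract_algebra_dev.csv in this commit.
# Assumed layout: question, option A, option B, option C, option D, answer letter.
DEV_CSV = (
    "Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.,0,1,2,3,B\n"
    "Find the characteristic of the ring 2Z.,0,3,12,30,A\n"
)

CHOICES = ["A", "B", "C", "D"]


def format_example(row, include_answer=True):
    """Render one CSV row as a multiple-choice block (hypothetical prompt layout)."""
    question, *options, answer = row
    lines = [question]
    for label, option in zip(CHOICES, options):
        lines.append(f"{label}. {option}")
    # Leave the answer blank for the question the model is asked to complete.
    lines.append("Answer:" + (f" {answer}" if include_answer else ""))
    return "\n".join(lines)


def build_prompt(dev_rows, test_row):
    """Concatenate solved dev examples, then the unsolved test question."""
    blocks = [format_example(row) for row in dev_rows]
    blocks.append(format_example(test_row, include_answer=False))
    return "\n\n".join(blocks)


dev_rows = list(csv.reader(io.StringIO(DEV_CSV)))
# For illustration only, the second dev row stands in for a test question.
prompt = build_prompt(dev_rows[:1], dev_rows[1])
print(prompt)
```

In a real run the test question would come from the test split rather than from the dev file, and the number of dev examples would be chosen to fit the model's context window.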
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/abstract_algebra_dev.csv
@@ -0,0 +1,5 @@
Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.,0,1,2,3,B
"Statement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H and K are subgroups of G then HK is a subgroup of G.","True, True","False, False","True, False","False, True",B
Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.,"True, True","False, False","True, False","False, True",C
Statement 1| Every function from a finite set onto itself must be one to one. Statement 2 | Every subgroup of an abelian group is abelian.,"True, True","False, False","True, False","False, True",A
Find the characteristic of the ring 2Z.,0,3,12,30,A
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/anatomy_dev.csv
@@ -0,0 +1,5 @@
What is the embryological origin of the hyoid bone?,The first pharyngeal arch,The first and second pharyngeal arches,The second pharyngeal arch,The second and third pharyngeal arches,D
Which of these branches of the trigeminal nerve contain somatic motor processes?,The supraorbital nerve,The infraorbital nerve,The mental nerve,None of the above,D
The pleura,have no sensory innervation.,are separated by a 2 mm space.,extend into the neck.,are composed of respiratory epithelium.,C
In Angle's Class II Div 2 occlusion there is,excess overbite of the upper lateral incisors.,negative overjet of the upper central incisors.,excess overjet of the upper lateral incisors.,excess overjet of the upper central incisors.,C
Which of the following is the body cavity that contains the pituitary gland?,Abdominal,Cranial,Pleural,Spinal,B
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/astronomy_dev.csv
@@ -0,0 +1,5 @@
You are pushing a truck along a road. Would it be easier to accelerate this truck on Mars? Why? (Assume there is no friction),It would be harder since the truck is heavier on Mars.,It would be easier since the truck is lighter on Mars.,It would be harder since the truck is lighter on Mars.,It would be the same no matter where you are.,D
Where do most short-period comets come from and how do we know?,The Kuiper belt; short period comets tend to be in the plane of the solar system just like the Kuiper belt.,The Kuiper belt; short period comets tend to come from random directions indicating a spherical distribution of comets called the Kuiper belt.,The asteroid belt; short period comets have orbital periods similar to asteroids like Vesta and are found in the plane of the solar system just like the asteroid belt.,The Oort cloud; short period comets tend to be in the plane of the solar system just like the Oort cloud.,A
Say the pupil of your eye has a diameter of 5 mm and you have a telescope with an aperture of 50 cm. How much more light can the telescope gather than your eye?,10000 times more,100 times more,1000 times more,10 times more,A
Why isn't there a planet where the asteroid belt is located?,A planet once formed here but it was broken apart by a catastrophic collision.,There was not enough material in this part of the solar nebula to form a planet.,There was too much rocky material to form a terrestrial planet but not enough gaseous material to form a jovian planet.,Resonance with Jupiter prevented material from collecting together to form a planet.,D
Why is Mars red?,"Because the surface is covered with heavily oxidized (""rusted"") minerals.",Because the atmosphere scatters more light at bluer wavelengths transmitting mostly red light.,Because Mars is covered with ancient lava flows which are red in color.,Because flowing water on Mars's surface altered the surface minerals several billion years ago.,A
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/business_ethics_dev.csv
@@ -0,0 +1,5 @@
"Beyond the business case for engaging in CSR there are a number of moral arguments relating to: negative _______, the _______that corporations possess and the ________ of business and society.","Externalities, Power, Independence","Publicity, Insubstantial resources, Mutual dependence","Publicity, Power, Independence","Externalities, Power, Mutual dependence",D
"_______ is the direct attempt to formally or informally manage ethical issues or problems, through specific policies, practices and programmes.",Corporate social responsibility,Business ethics management,Sustainability,Environmental management,B
"To ensure the independence of the non-executive board members, they are a number of steps which can be taken, which include non-executives being drawn from _______ the company, being appointed for a _________ time period as well as being appointed _________.","Outside, Limited, Independently","Inside, Limited, Intermittently","Outside, Unlimited, Intermittently","Inside, Unlimited, Independently",A
"Three contrasting tactics that CSO's can engage in to meet their aims are ________ which typically involves research and communication, ________, which may involve physically attacking a company's operations or ________, often involving some form of _______.","Non-violent direct action, Violent direct action, Indirect action, Boycott","Indirect action, Instrumental action, Non-violent direct action, Information campaign","Indirect action, Violent direct action, Non-violent direct-action Boycott","Non-violent direct action, Instrumental action, Indirect action, Information campaign",C
"In contrast to _______, _______ aim to reward favourable behaviour by companies. The success of such campaigns have been heightened through the use of ___________, which allow campaigns to facilitate the company in achieving _________ .","Buycotts, Boycotts, Blockchain technology, Charitable donations","Buycotts, Boycotts, Digital technology, Increased Sales","Boycotts, Buyalls, Blockchain technology, Charitable donations","Boycotts, Buycotts, Digital technology, Increased Sales",D
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/clinical_knowledge_dev.csv
@@ -0,0 +1,5 @@
The energy for all forms of muscle contraction is provided by:,ATP.,ADP.,phosphocreatine.,oxidative phosphorylation.,A
What is the difference between a male and a female catheter?,Male and female catheters are different colours.,Male catheters are longer than female catheters.,Male catheters are bigger than female catheters.,Female catheters are longer than male catheters.,B
In the assessment of the hand function which of the following is true?,Abduction of the thumb is supplied by spinal root T2,Opposition of the thumb by opponens policis is supplied by spinal root T1,Finger adduction is supplied by the median nerve,Finger abduction is mediated by the palmar interossei,B
"How many attempts should you make to cannulate a patient before passing the job on to a senior colleague, according to the medical knowledge of 2020?",4,3,2,1,C
Glycolysis is the name given to the pathway involving the conversion of:,glycogen to glucose-1-phosphate.,glycogen or glucose to fructose.,glycogen or glucose to pyruvate or lactate.,glycogen or glucose to pyruvate or acetyl CoA.,C
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/college_biology_dev.csv
@@ -0,0 +1,5 @@
Which of the following represents an accurate statement concerning arthropods?,They possess an exoskeleton composed primarily of peptidoglycan.,They possess an open circulatory system with a dorsal heart.,They are members of a biologically unsuccessful phylum incapable of exploiting diverse habitats and nutrition sources.,"They lack paired, jointed appendages.",B
"In a given population, 1 out of every 400 people has a cancer caused by a completely recessive allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is the expected proportion of individuals who carry the b allele but are not expected to develop the cancer?",1/400,19/400,20/400,38/400,D
"The presence of homologous structures in two different organisms, such as the humerus in the front limb of a human and a bird, indicates that",the human and bird are polyphyletic species,a human's and bird's evolution is convergent,the human and bird belong to a clade,the human and bird developed by analogy,C
"According to the pressure-flow model of movement of phloem contents, photosynthate movement from source to sink is driven by",an ATP-dependent pressure-flow pump,a water-pressure potential gradient,transpiration,apoplastic diffusion,B
Which of the following contain DNA sequences required for the segregation of chromosomes in mitosis and meiosis?,Telomeres,Centromeres,Nucleosomes,Spliceosomes,B
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/college_chemistry_dev.csv
@@ -0,0 +1,5 @@
Which of the following statements about the lanthanide elements is NOT true?,The most common oxidation state for the lanthanide elements is +3.,Lanthanide complexes often have high coordination numbers (> 6).,All of the lanthanide elements react with aqueous acid to liberate hydrogen.,The atomic radii of the lanthanide elements increase across the period from La to Lu.,D
A 0.217 g sample of HgO (molar mass = 217 g) reacts with excess iodide ions according to the reaction shown above. Titration of the resulting solution requires how many mL of 0.10 M HCl to reach equivalence point?,1.0 mL,10 mL,20 mL,50 mL,C
"Predict the number of lines in the EPR spectrum of a solution of 13C-labelled methyl radical (13CH3•), assuming the lines do not overlap.",4,3,6,24,A
"3 Cl−(aq) + 4 CrO_4^2−(aq) + 23 H+(aq) → 3 HClO2(aq) + 4 Cr3+(aq) + 10 H2O(l). In the reaction shown above, Cl−(aq) behaves as",an acid,a base,a catalyst,a reducing agent,D
"Which of the following lists the hydrides of group-14 elements in order of thermal stability, from lowest to highest?",PbH4 < SnH4 < GeH4 < SiH4 < CH4,PbH4 < SnH4 < CH4 < GeH4 < SiH4,CH4 < SiH4 < GeH4 < SnH4 < PbH4,CH4 < PbH4 < GeH4 < SnH4 < SiH4,A
13 changes: 13 additions & 0 deletions eval_llm/MMLU/data/dev/college_computer_science_dev.csv
@@ -0,0 +1,13 @@
Which of the following regular expressions is equivalent to (describes the same set of strings as) (a* + b)*(c + d)?,a*(c + d)+ b(c + d),a*(c + d)* + b(c + d)*,a*(c + d)+ b*(c + d),(a + b)*c +(a + b)*d,D
"A certain pipelined RISC machine has 8 general-purpose registers R0, R1, . . . , R7 and supports the following operations.
ADD Rs1, Rs2, Rd Add Rs1 to Rs2 and put the sum in Rd
MUL Rs1, Rs2, Rd Multiply Rs1 by Rs2 and put the product in Rd
An operation normally takes one cycle; however, an operation takes two cycles if it produces a result required by the immediately following operation in an operation sequence. Consider the expression AB + ABC + BC, where variables A, B, C are located in registers R0, R1, R2. If the contents of these three registers must not be modified, what is the minimum number of clock cycles required for an operation sequence that computes the value of AB + ABC + BC?",5,6,7,8,B
"The Singleton design pattern is used to guarantee that only a single instance of a class may be instantiated. Which of the following is (are) true of this design pattern?
I. The Singleton class has a static factory method to provide its instance.
II. The Singleton class can be a subclass of another class.
III. The Singleton class has a private constructor.",I only,II only,III only,"I, II, and III",D
"A compiler generates code for the following assignment statement.
G := (A + B) * C - (D + E) * F
The target machine has a single accumulator and a single-address instruction set consisting of instructions load, store, add, subtract, and multiply. For the arithmetic operations, the left operand is taken from the accumulator and the result appears in the accumulator. The smallest possible number of instructions in the resulting code is",5,6,7,9,D
"Consider a computer design in which multiple processors, each with a private cache memory, share global memory using a single bus. This bus is the critical system resource. Each processor can execute one instruction every 500 nanoseconds as long as memory references are satisfied by its local cache. When a cache miss occurs, the processor is delayed for an additional 2,000 nanoseconds. During half of this additional delay, the bus is dedicated to serving the cache miss. During the other half, the processor cannot continue, but the bus is free to service requests from other processors. On average, each instruction requires 2 memory references. On average, cache misses occur on 1 percent of references. What proportion of the capacity of the bus would a single processor consume, ignoring delays due to competition from other processors?",1/50,1/27,1/25,2/27,B
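The hunk above adds 13 physical lines but only 5 records, because several questions contain line breaks inside quoted fields. Splitting the file on newlines would therefore break rows apart. A minimal sketch, assuming standard double-quote CSV quoting, shows Python's `csv` module reading one such record intact. The `RAW` string is copied from the Singleton question in this file.

```python
import csv
import io

# One record from college_computer_science_dev.csv; the question field is
# quoted and spans four physical lines.
RAW = (
    '"The Singleton design pattern is used to guarantee that only a single '
    'instance of a class may be instantiated. Which of the following is (are) '
    'true of this design pattern?\n'
    'I. The Singleton class has a static factory method to provide its instance.\n'
    'II. The Singleton class can be a subclass of another class.\n'
    'III. The Singleton class has a private constructor.",'
    'I only,II only,III only,"I, II, and III",D\n'
)

# Naive line splitting sees four fragments, none of them a complete row.
physical_lines = RAW.splitlines()

# csv.reader honours the quotes, so embedded newlines and commas stay in-field.
# newline="" keeps line endings untouched, as the csv docs recommend for files.
records = list(csv.reader(io.StringIO(RAW, newline="")))

question, *options, answer = records[0]
print(len(physical_lines), "physical lines,", len(records), "record")
print("options:", options, "answer:", answer)
```

The same applies when reading from disk: opening the file with `newline=""` and passing the handle to `csv.reader` keeps each question and its options together.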
8 changes: 8 additions & 0 deletions eval_llm/MMLU/data/dev/college_mathematics_dev.csv
@@ -0,0 +1,8 @@
"Let V be the set of all real polynomials p(x). Let transformations T, S be defined on V by T:p(x) -> xp(x) and S:p(x) -> p'(x) = d/dx p(x), and interpret (ST)(p(x)) as S(T(p(x))). Which of the following is true?",ST = 0,ST = T,ST = TS,ST - TS is the identity map of V onto itself.,D
"A tank initially contains a salt solution of 3 grams of salt dissolved in 100 liters of water. A salt solution containing 0.02 grams of salt per liter of water is sprayed into the tank at a rate of 4 liters per minute. The sprayed solution is continually mixed with the salt solution in the tank, and the mixture flows out of the tank at a rate of 4 liters per minute. If the mixing is instantaneous, how many grams of salt are in the tank after 100 minutes have elapsed?",2,2 - e^-2,2 + e^-2,2 + e^-4,D
"Let A be a real 2x2 matrix. Which of the following statements must be true?
I. All of the entries of A^2 are nonnegative.
II. The determinant of A^2 is nonnegative.
III. If A has two distinct eigenvalues, then A^2 has two distinct eigenvalues.",I only,II only,III only,II and III only,B
"Suppose that f(1 + x) = f(x) for all real x. If f is a polynomial and f(5) = 11, then f(15/2)",-11,0,11,33/2,C
"Let A be the set of all ordered pairs of integers (m, n) such that 7m + 12n = 22. What is the greatest negative number in the set B = {m + n : (m, n) \in A}?",-5,-4,-3,-2,B
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/college_medicine_dev.csv
@@ -0,0 +1,5 @@
Glucose is transported into the muscle cell:,via protein transporters called GLUT4.,only in the presence of insulin.,via hexokinase.,via monocarbylic acid transporters.,A
Which of the following is not a true statement?,Muscle glycogen is broken down enzymatically to glucose-1-phosphate,Elite endurance runners have a high proportion of Type I fibres in their leg muscles,Liver glycogen is important in the maintenance of the blood glucose concentration,Insulin promotes glucose uptake by all tissues in the body,D
"In a genetic test of a newborn, a rare genetic disorder is found that has X-linked recessive transmission. Which of the following statements is likely true regarding the pedigree of this disorder?",All descendants on the maternal side will have the disorder.,Females will be approximately twice as affected as males in this family.,All daughters of an affected male will be affected.,There will be equal distribution of males and females affected.,C
"A high school science teacher fills a 1 liter bottle with pure nitrogen and seals the lid. The pressure is 1.70 atm, and the room temperature is 25°C. Which two variables will both increase the pressure of the system, if all other variables are held constant?","Increasing temperature, increasing moles of gas","Increasing temperature, increasing volume","Decreasing volume, decreasing temperature","Decreasing moles of gas, increasing volume",A
An expected side effect of creatine supplementation is:,muscle weakness.,gain in body mass.,muscle cramps.,loss of electrolytes.,B
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/college_physics_dev.csv
@@ -0,0 +1,5 @@
A refracting telescope consists of two converging lenses separated by 100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification of the telescope is,4,5,6,20,A
For which of the following thermodynamic processes is the increase in the internal energy of an ideal gas equal to the heat added to the gas?,Constant temperature,Constant volume,Constant pressure,Adiabatic,B
"One end of a Nichrome wire of length 2L and cross-sectional area A is attached to an end of another Nichrome wire of length L and cross- sectional area 2A. If the free end of the longer wire is at an electric potential of 8.0 volts, and the free end of the shorter wire is at an electric potential of 1.0 volt, the potential at the junction of the two wires is most nearly equal to",2.4 V,3.3 V,4.5 V,5.7 V,A
A refracting telescope consists of two converging lenses separated by 100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification of the telescope is,4,5,6,20,A
"The muon decays with a characteristic lifetime of about 10^-6 second into an electron, a muon neutrino, and an electron antineutrino. The muon is forbidden from decaying into an electron and just a single neutrino by the law of conservation of",charge,mass,energy and momentum,lepton number,D
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/computer_security_dev.csv
@@ -0,0 +1,5 @@
SHA-1 has a message digest of,160 bits,512 bits,628 bits,820 bits,A
"_____________ can modify data on your system – so that your system doesn’t run correctly or you can no longer access specific data, or it may even ask for ransom in order to give your access.",IM – Trojans,Backdoor Trojans,Trojan-Downloader,Ransom Trojan,D
What is ethical hacking?,"""Hacking"" ethics so they justify unintended selfish behavior","Hacking systems (e.g., during penetration testing) to expose vulnerabilities so they can be fixed, rather than exploited",Hacking into systems run by those whose ethics you disagree with,"A slang term for rapid software development, e.g., as part of hackathons",B
Exploitation of the Heartbleed bug permits,overwriting cryptographic keys in memory,a kind of code injection,a read outside bounds of a buffer,a format string attack,C
The ____________ is anything which your search engine cannot search.,Haunted web,World Wide Web,Surface web,Deep Web,D
5 changes: 5 additions & 0 deletions eval_llm/MMLU/data/dev/conceptual_physics_dev.csv
@@ -0,0 +1,5 @@
"Compared with the mass of a uranium atom undergoing fission, the combined masses of the products after fission are",less,more,the same,zero,A
Things that are equivalent according to the equivalence principle are,space and time.,a traveling twin and a stay-at-home twin.,gravity and acceleration.,mass and energy.,C
Colors in a soap bubble result from light,converted to a different frequency,deflection,interference,polarization,C
A model airplane flies slower when flying into the wind and faster with wind at its back. When launched at right angles to the wind a cross wind its groundspeed compared with flying in still air is,the same,greater,less,either greater or less depending on wind speed,B
Which of these three elements has the most mass per nucleon?,Hydrogen,Iron,Uranium,Same in each,A