GEM (Gene-Environment interaction analysis for Millions of samples) is a software program for large-scale gene-environment interaction testing in cross-sectional and longitudinal data from unrelated and related individuals. It enables genome-wide association studies in up to millions of samples while allowing for multiple exposures and controlling for genotype-covariate interactions.
Current version: 2.2
Additional documentation:
https://large-scale-gxe-methods.github.io/GEMShowcaseWorkspace
Contents
- Quick Installation
- Dependencies
- Usage
- Recent Updates
- Contact
- References
- License
Quick Installation
Option 1: Use the binary executable file for Linux
After downloading, make the file executable (for example, with chmod +x):
Option 2: Build GEM
Library Dependencies
- C++17 compiler or later
- BLAS/LAPACK. For Intel processors, we recommend compiling GEM with an optimized math routine library such as the Intel oneAPI Math Kernel Library in place of BLAS/LAPACK for optimal performance.
- Boost C++ libraries. GEM links to the following Boost libraries:
boost_program_options, boost_thread, boost_system, and boost_filesystem
- SuiteSparse library. Download and compile it using CMake.
To install GEM, run the following lines of code:
git clone https://github.com/large-scale-gxe-methods/GEM
cd GEM/
cmake -B build
cmake --build build
Dependencies
C/C++ Compiler
- A compiler with C++17 (or later) support is required.
LAPACK and BLAS
- The LAPACK (Linear Algebra PACKage) and BLAS (Basic Linear Algebra Subprograms) libraries are used for matrix operations in GEM.
Intel processors:
- We recommend linking GEM to the Intel oneAPI Math Kernel Library (oneMKL) instead of classical BLAS/LAPACK for better performance. This can be done by replacing -llapack and -lblas in the makefile with -lmkl_gf_lp64 -lmkl_sequential -lmkl_core before compiling.
- It is important to compile with -lmkl_sequential since GEM already does multi-threading across SNPs.
AMD processors:
- For AMD processors, OpenBLAS (-lopenblas) may be a better alternative.
Boost C++ Libraries
- The Boost C++ libraries are used for command-line parsing, file management, and multi-threading.
- The following Boost libraries are required:
- libboost_system
- libboost_program_options
- libboost_filesystem
- libboost_thread
Eigen Library
- The Eigen library is used for linear algebra of dense and sparse matrices.
SuiteSparse Library
- The SuiteSparse library is used for linear algebra operations on dense and sparse matrices. In particular, the GLMM uses it to fit the null model during association testing.
Armadillo Library
- The Armadillo library is used for linear algebra of dense and sparse matrices. In particular, it is used in the G × E (gene-environment) interaction test.
Usage
Running GEM
- Command Line Options
- Input Files
- Output File Format
- Examples
Command Line Options
Once GEM is installed, the executable ./GEM_2.2 can be used to run the program.
For a list of options, use ./GEM_2.2 --help.
List of Options
General Options:
--help
Prints the available options of GEM and exits.
--version
Prints the version of GEM and exits.
Input/Output File Options:
--pheno-file
Path to the phenotype file.
--kin-file
Path to the kinship file.
--bgen
Path to the BGEN file.
--sample
Path to the sample file.
Required when the BGEN file does not contain sample identifiers.
--pfile
Path and prefix to the .pgen, .pvar, and .psam files.
If this flag is used, then --pgen/--pvar/--psam don't need to be specified.
--pgen
Path to the pgen file.
--pvar
Path to the pvar file.
--psam
Path to the psam file.
--bfile
Path and prefix to the .bed, .bim and .fam files.
If this flag is used, then --bed/--bim/--fam don't need to be specified.
--bed
Path to the bed file.
--bim
Path to the bim file.
--fam
Path to the fam file.
--out
Full path, including the file extension, of the file to which GEM writes its results.
Default: gem.out
--output-style
Modifies the output of GEM. Must be one of the following:
minimum: Output the summary statistics for only the GxE and marginal G terms.
meta: 'minimum' output plus additional fields for the main G and any GxCovariate terms. For a robust analysis, additional columns for the model-based summary statistics will be included.
full: 'meta' output plus additional fields needed for re-analyses of a subset of interactions.
Default: meta
Phenotype File Options:
--sampleid-name
Column name in the phenotype file that contains sample identifiers.
--pheno-name
Column name in the phenotype file that contains the phenotype of interest.
If the number of levels (unique observations) is 2, the phenotype is treated as binary; otherwise it is assumed to be continuous.
--exposure-names
One or more column names in the phenotype file naming the exposure(s) to be included in interaction tests.
If no exposures are included, GEM will only perform the marginal test.
--int-covar-names
Any column names in the phenotype file naming the covariate(s) for which interactions should be included for adjustment (do not include with --exposure-names).
--covar-names
Any column names in the phenotype file naming the covariates for which only main effects should be included for adjustment (do not include with --exposure-names or --int-covar-names).
--random-slope-name
Column name in the phenotype file that contains the random slope variable.
--group-name
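Column name in the phenotype file that contains the group variable, which allows heteroscedastic residual variance across sample groups in null model fitting.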
--robust
0 for model-based standard errors and 1 for robust standard errors.
Default: 0
--tol
Convergence tolerance for logistic regression.
Default: 0.0000001
--delim
Delimiter separating values in the phenotype file. Tab delimiter should be represented as \t and space delimiter as \0.
Default: , (comma-separated)
--missing-value
Indicates how missing values in the phenotype file are stored.
Default: NA
--center
0 for no centering to be done, 1 to center ALL exposures and covariates, and 2 to center all the interaction covariates only.
Default: 2
--scale
0 for no scaling to be done and 1 to scale ALL exposures and covariates by the standard deviation.
Default: 0
--categorical-names
Names of the exposures or interaction covariates that should be treated as categorical.
Default: None
--cat-threshold
A cut-off to determine which exposures or interaction covariates not specified using --categorical-names
should be automatically treated as categorical based on the number of levels (unique observations).
Default: 2
Kinship File Options:
--kin-delim
Delimiter separating values in the kinship file. Tab delimiter should be represented as \t and space delimiter as \0.
Default: , (comma-separated)
--kin-diag
Diagonal value of the kinship matrix, not accounting for inbreeding.
Default: 1.0 (for 2x the kinship matrix).
Filtering Options:
--maf
Variants with a minor allele frequency less than this threshold value (range: [0, 0.5]) will be excluded.
Default: 0.001
--miss-geno-cutoff
Variants with a missing genotype rate greater than this threshold value (range: [0, 1.0]) will be excluded.
Default: 0.05
--include-snp-file
Path to file containing a subset of variants in the specified genotype file to be used for analysis.
The first line in this file is the header that specifies which variant identifier in the genotype file
is used for ID matching. This must be 'snpid' (PLINK or BGEN) or 'rsid' (BGEN only). There should be one variant
identifier per line after the header.
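The filtering rules and the include-file layout described above can be sketched in a few lines of Python. This is only an illustration of the documented thresholds and file layout, not GEM's internal code: the function names `passes_filters` and `include_snp_file_text` are hypothetical, and the minor allele frequency is assumed to be min(AF, 1 - AF).

```python
def passes_filters(af, missing_rate, maf=0.001, miss_geno_cutoff=0.05):
    """Return True if a variant would be kept under the documented thresholds.

    Assumption: the minor allele frequency is min(AF, 1 - AF).
    """
    minor_af = min(af, 1.0 - af)
    if minor_af < maf:
        return False  # excluded: MAF less than --maf
    if missing_rate > miss_geno_cutoff:
        return False  # excluded: missing rate greater than --miss-geno-cutoff
    return True


def include_snp_file_text(variant_ids, id_type="snpid"):
    """Build the text of a file in the layout described for --include-snp-file."""
    if id_type not in ("snpid", "rsid"):
        raise ValueError("header must be 'snpid' or 'rsid'")
    # First line is the header naming the identifier type,
    # followed by one variant identifier per line.
    return "\n".join([id_type, *variant_ids]) + "\n"
```

The returned text can be written to disk with an ordinary `open(path, "w")` call and the resulting path passed to --include-snp-file.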
Performance Options:
--threads
Set number of compute threads.
Default: ceiling(detected threads / 2)
--stream-snps
Number of SNPs to analyze in a batch. Memory consumption will increase for larger values of stream-snps.
Default: 1
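As a small worked example of the --threads default, the sketch below evaluates ceiling(detected threads / 2) in Python. The helper name `default_threads` is hypothetical, and `os.cpu_count()` is only a stand-in for whatever thread detection GEM performs internally.

```python
import math
import os


def default_threads(detected=None):
    """Documented default for --threads: ceiling(detected threads / 2)."""
    if detected is None:
        # Assumption: os.cpu_count() approximates GEM's own thread detection.
        detected = os.cpu_count() or 1
    return math.ceil(detected / 2)
```

For example, a machine reporting 7 threads would default to 4, and a machine reporting 1 thread would default to 1.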
Input Files
Phenotype File
A file that contains a sample identifier column and columns for the phenotypes, exposures, and covariates. The ordering of the columns does not matter. All inputs should be coded numerically (e.g., males/females as 0/1).
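A minimal Python sketch of preparing such a file is shown here, using the documented defaults (comma delimiter, NA for missing values). The data, column names, and helper names (`phenotype_csv`, `count_levels`) are hypothetical. The level count illustrates the rule that a phenotype with two unique observations is treated as binary; excluding missing values from that count is an assumption.

```python
import csv
import io


def phenotype_csv(rows, columns, missing_value="NA"):
    """Render rows (dicts) as comma-delimited text, writing None as the missing-value code."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([missing_value if row.get(c) is None else row[c] for c in columns])
    return buffer.getvalue()


def count_levels(text, column, missing_value="NA"):
    """Count unique non-missing values in one column of the rendered file."""
    reader = csv.DictReader(io.StringIO(text))
    return len({r[column] for r in reader if r[column] not in ("", missing_value)})


# Hypothetical example data: sex coded 0/1, one missing covariate value.
rows = [
    {"sampleid": "s1", "pheno": 0, "sex": 0, "age": 51},
    {"sampleid": "s2", "pheno": 1, "sex": 1, "age": None},
    {"sampleid": "s3", "pheno": 1, "sex": 0, "age": 47},
]
text = phenotype_csv(rows, ["sampleid", "pheno", "sex", "age"])
is_binary = count_levels(text, "pheno") == 2  # two levels: treated as binary
```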
Kinship File
A file that lists the nonzero values of the kinship matrix in a three-column format, where the first two columns contain sample identifiers and the third column contains the kinship coefficient or genetic relatedness between the two individuals, or the inbreeding coefficient when both identifiers refer to the same individual. The scale of the kinship matrix does not matter (e.g., you can use two times the kinship matrix, or identity-by-descent information), as long as the values are on the same scale as the diagonal value given by --kin-diag, to which any diagonal entries in the file are added. For genetic relationship matrices computed from genotypes that have different values on the diagonals, we recommend specifying all diagonal values explicitly in the kinship file, along with --kin-diag 0.
Example 1 of Kinship Matrix (with an inbreeding coefficient of 0.05 for the fifth individual):
⎡ 0.5 0 0.25 0.25 0 ⎤
⎢ 0 0.5 0.25 0.25 0 ⎥
⎢ 0.25 0.25 0.5 0.25 0 ⎥
⎢ 0.25 0.25 0.25 0.5 0 ⎥
⎣ 0 0 0 0 0.55 ⎦
The Kinship File Corresponding to Example 1 (along with --kin-diag 0.5):
| ID1 | ID2 | Kinship|
|-----|-----|--------|
| 1 | 3 | 0.25 |
| 1 | 4 | 0.25 |
| 2 | 3 | 0.25 |
| 2 | 4 | 0.25 |
| 3 | 4 | 0.25 |
| 5 | 5 | 0.05 |
Example 2 of Kinship Matrix (two times the kinship matrix in Example 1):
⎡ 1 0 0.5 0.5 0 ⎤
⎢ 0 1 0.5 0.5 0 ⎥
⎢ 0.5 0.5 1 0.5 0 ⎥
⎢ 0.5 0.5 0.5 1 0 ⎥
⎣ 0 0 0 0 1.1 ⎦
The Kinship File Corresponding to Example 2 (along with --kin-diag 1):
| ID1 | ID2 | Kinship|
|-----|-----|--------|
| 1 | 3 | 0.5 |
| 1 | 4 | 0.5 |
| 2 | 3 | 0.5 |
| 2 | 4 | 0.5 |
| 3 | 4 | 0.5 |
| 5 | 5 | 0.1 |
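The correspondence between each matrix and its three-column file can be reproduced with a short Python sketch. The function name `kinship_rows` is hypothetical; the logic simply follows the two examples above, listing each nonzero off-diagonal pair once and storing each diagonal entry as its excess over the --kin-diag value.

```python
def kinship_rows(matrix, sample_ids, kin_diag):
    """Convert a dense symmetric kinship matrix to (ID1, ID2, value) rows.

    Off-diagonal zeros are skipped. Each diagonal entry is stored as its
    excess over kin_diag and skipped when that excess is zero.
    """
    rows = []
    n = len(sample_ids)
    for i in range(n):
        for j in range(i, n):  # upper triangle, including the diagonal
            value = matrix[i][j]
            if i == j:
                # Rounding guards against floating-point noise in the subtraction.
                value = round(value - kin_diag, 12)
            if value != 0:
                rows.append((sample_ids[i], sample_ids[j], value))
    return rows


# Example 1 from the text, used with --kin-diag 0.5.
example1 = [
    [0.5, 0, 0.25, 0.25, 0],
    [0, 0.5, 0.25, 0.25, 0],
    [0.25, 0.25, 0.5, 0.25, 0],
    [0.25, 0.25, 0.25, 0.5, 0],
    [0, 0, 0, 0, 0.55],
]
rows1 = kinship_rows(example1, [1, 2, 3, 4, 5], kin_diag=0.5)
```

Doubling every entry of `example1` and calling the function with `kin_diag=1` gives the rows of the second table.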
Genotype Files
BGEN
Variants that are non-biallelic should be filtered from the BGEN file. Note that since there is no indication of a REF/ALT allele in the BGEN file, the second allele is the effect allele counted in association testing.
A .sample file is required as input when the .bgen file does not contain a sample identifier block.
Plink BED
**.fam** - The .fam file can be space or tab-delimited and must contain at least 2 columns where the first column is the family ID (FID) and the second column is the individual ID (IID). GEM will use the IID column for sample identifier matching with the phenotype file.
**.bim** - The .bim file can also be space or tab-delimited and should be in the following order: the chromosome, variant id, cM (optional), base-pair coordinate, ALT allele, and REF allele.
**.bed** - A bed file must be stored in variant-major form. The ALT allele specified in the .bim file is the effect allele counted in association testing.
Plink 2.0 PGEN
**.psam** - The .psam file is a tab-delimited text file containing the sample information. If header lines are present, the last header line should contain a column with the name #IID (if the first column is not #FID) or IID (if the first column is #FID) that holds the individual ID for sample identifier matching with the phenotype file. All previous header lines will be ignored. If no header line beginning with #IID or #FID is present, then the columns are assumed to be in .fam file order.
**.pvar** - The .pvar file is a tab-delimited text file containing the variant information. If header lines are present, the last header line should start with #CHROM. If #CHROM is present, then the columns POS, ID, REF, and ALT must also be present. All previous header lines will be ignored. If the .pvar file contains no header line beginning with #CHROM, it is assumed that the columns are in .bim file order.
**.pgen** - Non-biallelic variants should be filtered from the .pgen file. The ALT allele specified in the .pvar file is the effect allele counted in association testing.
Output File Format
GEM will write results to the output file specified with the --out parameter (or 'gem.out' if no output file is specified).
Below are details of the possible column headers in the output file.
SNPID - The SNP identifier as retrieved from the genotype file.
RSID - The reference SNP ID number. (BGEN only)
CHR - The chromosome of the SNP.
POS - The physical position of the SNP.
Non_Effect_Allele - The allele not counted in association testing.
Effect_Allele - The allele that is counted in association testing.
N_Samples - The number of samples without missing genotypes.
AF - The allele frequency of the effect allele.
GV - The variance of the effect allele.
N_catE_* - The number of non-missing samples in each combination of strata for all of the categorical exposures and interaction covariates.
AF_catE_* - The allele frequency of the effect allele for each combination of strata for all of the categorical exposures and interaction covariates.
GV_catE_* - The variance of the effect allele for each combination of strata for all of the categorical exposures and interaction covariates.
Beta_Marginal - The coefficient estimate for the marginal genetic effect (i.e., from a model with no interaction terms).
SE_Beta_Marginal - The model-based SE associated with the marginal genetic effect estimate.
robust_SE_Beta_Marginal - The robust SE associated with the marginal genetic effect estimate.
Beta_G - The coefficient estimate for the genetic main effect (G) (i.e., from a model with interaction terms).
Beta_G-* - The coefficient estimate for any GxE or interaction covariate terms.
SE_Beta_G - Model-based SE associated with the genetic main effect (G).
SE_Beta_G-* - Model-based SE associated with any GxE or interaction covariate terms.
robust_SE_Beta_G - Robust SE associated with the genetic main effect (G).
robust_SE_Beta_G-* - Robust SE associated with any GxE or interaction covariate terms.
Cov_Beta_G_G-* - Model-based covariance between the genetic main effect (G) and any GxE or interaction covariate terms.
Cov_Beta_G-*_G-* - Model-based covariance between any GxE or interaction covariate terms.
robust_Cov_Beta_G_G-* - Robust covariance between the genetic main effect (G) and any GxE or interaction covariate terms.
robust_Cov_Beta_G-*_G-* - Robust covariance between any GxE or interaction covariate terms.
P_Value_Marginal - Marginal genetic effect p-value from model-based SE.
P_Value_Interaction - Interaction effect p-value (K degrees of freedom test of interaction effect) from model-based SE, where K is the number of exposures specified with --exposure-names.
P_Value_Joint - Joint test p-value (K+1 degrees of freedom test of genetic and interaction effect) from model-based SE.
robust_P_Value_Marginal - Marginal genetic effect p-value from robust SE.
robust_P_Value_Interaction - Interaction effect p-value from robust SE.
robust_P_Value_Joint - Joint test p-value (K+1 degrees of freedom test of genetic and interaction effect) from robust SE.
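To show how the p-value columns relate to the coefficient, SE, and covariance columns, here is a minimal Python sketch for the single-exposure case (K = 1). It assumes Wald-type chi-square tests, which is consistent with the degrees of freedom listed above but is not spelled out in this document, so treat it as an illustration and not as GEM's exact computation. The function names are hypothetical.

```python
import math


def interaction_p_value(beta_gxe, se_gxe):
    """1-df Wald test: stat = (beta / SE)^2, p = P(chi2 with 1 df > stat)."""
    z = beta_gxe / se_gxe
    return math.erfc(abs(z) / math.sqrt(2.0))


def joint_p_value(beta_g, beta_gxe, se_g, se_gxe, cov_g_gxe):
    """2-df Wald test on (G, GxE): stat = b' V^{-1} b, p = exp(-stat / 2)."""
    var_g, var_gxe = se_g ** 2, se_gxe ** 2
    det = var_g * var_gxe - cov_g_gxe ** 2  # determinant of the 2x2 covariance matrix
    stat = (var_gxe * beta_g ** 2
            - 2.0 * cov_g_gxe * beta_g * beta_gxe
            + var_g * beta_gxe ** 2) / det
    return math.exp(-stat / 2.0)  # survival function of chi-square with 2 df
```

Under these assumptions, the inputs correspond to Beta_G, Beta_G-*, SE_Beta_G, SE_Beta_G-*, and Cov_Beta_G_G-* (or their robust_ counterparts).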
The --output-style flag can be used to specify which columns should be included in the output file:
minimum:
Includes the variant information, Beta_Marginal, SE_Beta_Marginal, coefficient estimates for only the GxE terms, and depending on the --robust option, SE and covariance for only the GxE terms.
meta:
Includes each of the possible outputs listed above when applicable. For a model-based analysis (--robust 0), the columns containing the "robust" prefix (robust_*) are excluded in the output file.
full:
Includes, in addition to "meta", an initial header line with the residual variance estimate necessary for re-analysis of a subset of interactions using only summary statistics (for example, switching an exposure and interaction covariate).
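As a sketch of working with the resulting file, the Python snippet below scans an output table for variants with a small P_Value_Interaction. It assumes a tab-delimited file with a single header row (the delimiter is not stated in this document, so it is a parameter), and the threshold of 5e-8 is just the conventional genome-wide significance level. For the "full" style, the initial header line would need to be skipped first. The function name is hypothetical.

```python
import csv
import io


def significant_interactions(handle, threshold=5e-8, delimiter="\t",
                             p_column="P_Value_Interaction"):
    """Return (SNPID, p-value) pairs whose interaction p-value is below threshold."""
    hits = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        try:
            p_value = float(row.get(p_column))
        except (TypeError, ValueError):
            continue  # skip rows without a numeric p-value
        if p_value < threshold:
            hits.append((row["SNPID"], p_value))
    return hits


# Hypothetical two-variant output, reduced to a few columns for illustration.
example = io.StringIO(
    "SNPID\tCHR\tPOS\tP_Value_Interaction\n"
    "snp1\t1\t1000\t3e-9\n"
    "snp2\t1\t2000\t0.2\n"
)
hits = significant_interactions(example)
```

With a real file, pass an open file handle (for example, `open("gem.out", newline="")`) in place of the in-memory example.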
Examples
To run GEM using the example data, execute GEM with the following code.
Example for cross-sectional data without kinship:
./GEM --bgen example.bgen --sample example.sample --pheno-file example.pheno --sampleid-name sampleid --pheno-name pheno2 --covar-names cov3 --exposure-names cov1 --out cross_sectional_without_kinship.out
The results should look like the following output file cross_sectional_without_kinship.out.
Example for cross-sectional data with kinship:
./GEM --kin-file example.kinship --kin-diag 0.5 --pheno-file example.pheno --pheno-name pheno2 --sampleid-name sampleid --exposure-names cov1 --covar-names cov3 --random-slope-name cov3 --bgen example.bgen --sample example.sample --output-style meta --out cross_sectional_with_kinship.out
The results should look like the following output file cross_sectional_with_kinship.out.
Example for longitudinal data without kinship:
./GEM --pheno-file example.pheno2 --pheno-name pheno2 --sampleid-name sampleid --exposure-names cov1 --covar-names cov3 --random-slope-name cov3 --bgen example.bgen --sample example.sample --output-style meta --out longitudinal_without_kinship.out
The results should look like the following output file longitudinal_without_kinship.out.
Example for longitudinal data with kinship:
./GEM --kin-file example.kinship --kin-diag 0.5 --pheno-file example.pheno2 --pheno-name pheno2 --sampleid-name sampleid --exposure-names cov1 --covar-names cov3 --random-slope-name cov3 --bgen example.bgen --sample example.sample --output-style meta --out longitudinal_with_kinship.out
The results should look like the following output file longitudinal_with_kinship.out.
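When many runs are scripted, the command lines above can be assembled programmatically. The following Python sketch builds an argument list from a dictionary of the documented flags. The helper `build_gem_command` is hypothetical, and passing several names to one flag as separate space-delimited arguments is an assumption.

```python
def build_gem_command(executable, options):
    """Turn a dict of documented flags into an argument list for subprocess.run."""
    command = [executable]
    for flag, value in options.items():
        command.append("--" + flag)
        values = value if isinstance(value, (list, tuple)) else [value]
        # Assumption: multi-valued flags take their names as separate arguments.
        command.extend(str(v) for v in values)
    return command


# Rebuilds the first example (cross-sectional data without kinship).
command = build_gem_command("./GEM", {
    "bgen": "example.bgen",
    "sample": "example.sample",
    "pheno-file": "example.pheno",
    "sampleid-name": "sampleid",
    "pheno-name": "pheno2",
    "covar-names": "cov3",
    "exposure-names": "cov1",
    "out": "cross_sectional_without_kinship.out",
})
# To launch the run: import subprocess; subprocess.run(command, check=True)
```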
Recent Updates
Version 2.2 - March 4, 2026:
- Added convergence check for logistic regression. The program now reports detailed coefficient estimates when the model fails to converge and exits safely after 500 iterations.
- Added support for a group variable in null model fitting to allow heteroscedastic residual variance across sample groups.
- Improved memory handling and internal limits to support larger numbers of observations (larger sample sizes)
Version 2.1.3 - June 14, 2025:
- Added log file support
- Updated headers in the output file
Version 2.1.2 - May 27, 2025:
- Added output columns GV and GV_catE_*
Version 2.1.1 - April 18, 2025:
- Logged the SNP ID when a singular matrix is detected
Version 2.1 - March 10, 2025:
- Added support for GLMM on PGEN and BED files
- Added support for GEI test on PGEN and BED files
Version 2.0 - February 14, 2025:
- Added generalized linear mixed model (GLMM)
- Added gene-environment interaction (GEI) test
Version 1.5.3 - May 20, 2024:
- Included stratified values for binary outcomes
Version 1.5.2 - August 16, 2023:
- Fixed the output when there is no exposure
Version 1.5.1 - April 20, 2023:
- Treated empty strings as missing values
- Fixed a bug for empty strings at the end of each line
- Minor changes to messages printed to stdout
- Error out if the sample size is not greater than the number of predictors (intercept, exposures, interaction covariates, and covariates) in the null model fitting
Version 1.5 - March 9, 2023:
- Changed the default of the --center flag to 2 to center all the interaction covariates only
Version 1.4.5 - November 11, 2022:
- Added collinearity check of the covariates before fitting the null model
Version 1.4.4 - October 5, 2022:
- Fixed bugs in --include-snp-file handling
- Removed the default value of flag "--center"
Version 1.4.3 - March 23, 2022:
- Sorted the output headers of categorical variables
Version 1.4.2 - November 22, 2021:
- Add math.h library to install GEM through Docker desktop
- Added a binary executable file
Version 1.4.1 - September 14, 2021:
- Added support for reading phenotype files created on Windows systems
Version 1.4 - July 2, 2021:
- Remove --pheno-type flag. If the number of levels (unique observations) is 2, the phenotype is treated as binary; otherwise it is assumed to be continuous
- Check for categorical exposures and interaction covariates
- Output number of non-missing samples (N) and allele frequency (AF) for effect allele for each combination of strata for all exposures and interaction covariates
- Add two additional flags --categorical-names and --cat-threshold for user definition of categorical variables
- Output the SE instead of variance for the coefficient estimates
- Output only the lower triangle of the covariance matrix instead of the full matrix
- For robust analysis and "meta"/"full" output style, include model-based summary statistics in the output file
- Column names for the robust summary statistics will include the prefix "robust_"
- For "full" output style, an initial header line with the dispersion is included in the output file
- The V matrix is no longer included in the output file for "full" output style
Version 1.3 - April 7, 2021:
- Add a new flag (--output-style) to modify which summary statistics should be included in the output file. Column names now include the exposure and interaction covariate names instead of numbers.
- The --exposure-names flag is now optional. If no exposures are specified, GEM will run a G-only model. Covariates (not of interest) can still be adjusted for using the --covar-names flag.
Version 1.2 - January 22, 2021:
- Fix issue to allow for space and tab delimited phenotype files.
- Allow for centering and scaling of exposures and covariates.
- Update calculations for model-based joint test
- Update calculations for robust joint test
- Output covariance, coefficients and standard errors to the log.
- Change Allele1 and Allele2 in the output file to Non_Effect_Allele and Effect_Allele.
- Fix bug when phenotype is binary and there are missing genotypes.
- Support PGEN/BED files.
Version 1.1 - July 21, 2020:
- Allow GEM to subset the BGEN file based on a list of variants to include for analysis (--include-snp-file).
- Use matrix operations to adjust for covariates instead of a for-loop.
- Use the libdeflate package for faster zlib decompression of BGEN genotype blocks.
- Compile GEM with -O2 (optimizer flag).
- Prioritize the BGEN sample file over the BGEN sample identifier block.
- Error if the phenotype (--pheno-name) is also included as an exposure or covariate.
- Support BGEN v1.1, v1.2 and v1.3 uncompressed genotype blocks.
- Fix major printing bug.
- Handle missing genotypes in BGEN files.
Contact
For comments, suggestions, bug reports and questions, please contact Han Chen (hanchenphd@gmail.com), Alisa Manning (AKMANNING@mgh.harvard.edu), Kenny Westerman (KEWESTERMAN@mgh.harvard.edu) or Samaneh Salehi Nasab (Samaneh.SalehiNasab@uth.tmc.edu). For bug reports, please include an example that reproduces the problem without requiring access to your confidential data.
References
If you use GEM in your analysis, please cite
- Westerman KE, Pham DT, Hong L, Chen Y, Sevilla-González M, Sung YJ, Sun YV, Morrison AC, Chen H, Manning AK. (2021) GEM: scalable and flexible gene-environment interaction analysis in millions of samples. Bioinformatics 37(20):3514-3520. PubMed PMID: 34695175. PMCID: PMC8545347. DOI: 10.1093/bioinformatics/btab223.
License
GEM : Gene-Environment interaction analysis for Millions of samples
Copyright (C) 2018-2026 Liang Hong, Han Chen, Duy Pham, Cong Pan, Samaneh Salehi Nasab
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
The GEM package is distributed under GPL (>= 3). It includes source code from open source third-party software:
- libdeflate: MIT
- Plink: LGPLv3+
- Zstandard (zstd): BSD_3_clause | GPL-2
The binary release of GEM also links to third-party libraries:
- Boost: Boost Software License, Version 1.0
- Eigen: Mozilla Public License, Version 2.0
- Intel oneAPI Math Kernel Library (oneMKL): Intel Simplified Software License (Version October 2022 or later)
- Armadillo: Apache License 2.0
- SuiteSparse:
- libcholmod: GNU Lesser General Public License (LGPL), version 2.1 or later
- libcxsparse: GNU Lesser General Public License (LGPL), version 2.1 or later
- libspqr: GNU General Public License (GPL), version 2 or later
- libumfpack: GNU General Public License (GPL), version 2 or later
- libcamd: BSD 3-Clause License
- libccolamd: BSD 3-Clause License
- libcolamd: BSD 3-Clause License
- libamd: BSD 3-Clause License
- libsuitesparseconfig: BSD 3-Clause License
- fmt: MIT License
Full copies of the license agreements for GEM, third-party source code, and linked libraries can be found here.