The Synthetic Data Generator is a lightweight tool that creates realistic, customizable datasets for training, testing, and demonstration purposes without exposing sensitive information. By simulating real-world data structures, the tool helps researchers, trainers, and developers practice analysis workflows, prototype dashboards, or showcase tools while protecting respondent confidentiality.
This project reduces risks around handling personal data while increasing efficiency in research support, training, and capacity-building activities.
- Customizable Data Generation: Control the number of rows and columns
- Statistical Control: Choose distributions (Normal, Uniform, Exponential, Lognormal) for numeric variables
- Correlation Management: Enable and control correlation between numeric variables
- Missing Data: Adjust missing data percentage for realistic datasets
- Personal Information: Include realistic fake data (names, emails, addresses, phone numbers, etc.)
- Multiple Export Formats: Download data as CSV, Excel, or Stata DTA files
- Data Preview: Visualize correlations and data quality metrics
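The distribution and correlation options listed above could be implemented in several ways. The sketch below is one possibility, not the code in `fake_data_generator.py`: it draws correlated standard normal values and then transforms each column to the requested marginal distribution (a Gaussian copula). The function name `make_numeric_columns`, its parameters, and the fixed distribution parameters (mean 0, rate 1, and so on) are assumptions made for illustration.

```python
import math

import numpy as np

# Vectorized error function; otypes keeps it safe for empty inputs.
_erf = np.vectorize(math.erf, otypes=[float])


def make_numeric_columns(n_rows, distributions, correlation=0.0, seed=None):
    """Return an (n_rows, len(distributions)) array of correlated columns.

    `correlation` is the common pairwise correlation of the underlying
    normal draws. After transformation to non-normal marginals the
    observed correlation will be close to, but not exactly, this value.
    It must be greater than -1 / (k - 1) for the matrix to be valid.
    """
    rng = np.random.default_rng(seed)
    k = len(distributions)

    # Equicorrelation matrix: 1 on the diagonal, `correlation` elsewhere.
    cov = np.full((k, k), float(correlation))
    np.fill_diagonal(cov, 1.0)

    # Correlated standard normal draws, one column per variable.
    z = rng.multivariate_normal(np.zeros(k), cov, size=n_rows)

    # Standard normal CDF maps each column to (0, 1); clip to avoid log(0).
    u = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    u = np.clip(u, 1e-12, 1.0 - 1e-12)

    out = np.empty_like(z)
    for j, name in enumerate(distributions):
        if name == "normal":
            out[:, j] = z[:, j]
        elif name == "uniform":
            out[:, j] = u[:, j]
        elif name == "exponential":
            out[:, j] = -np.log1p(-u[:, j])  # inverse CDF, rate 1
        elif name == "lognormal":
            out[:, j] = np.exp(z[:, j])
        else:
            raise ValueError(f"unknown distribution: {name}")
    return out


if __name__ == "__main__":
    sample = make_numeric_columns(
        1000, ["normal", "uniform", "exponential", "lognormal"],
        correlation=0.6, seed=42,
    )
    # Pairwise correlations of the generated columns.
    print(np.corrcoef(sample, rowvar=False).round(2))
```

In a project that already depends on SciPy, `scipy.stats.norm.cdf` would be the more usual way to compute the normal CDF; `math.erf` is used here only to keep the sketch dependent on NumPy alone.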
Clone the repository:

    git clone <your-repo-url>
    cd Synthetic-Data-Generator

Install required packages:

    pip install -r requirements.txt

Run the application:

    streamlit run fake_data_generator.py

The application requires the following Python packages:
- streamlit – Web application framework
- faker – Fake data generation
- pandas – Data manipulation and analysis
- numpy – Numerical computing
- scipy – Scientific computing
- openpyxl – Excel file support
- matplotlib – Data visualization
- Configure Parameters: Use the sidebar to set:
  - Number of rows and columns
  - Personal information fields to include
  - Missing data percentage
  - Variable distributions and correlations
- Generate Data: Click the "Generate Data" button
- Preview: Review the data in the "Preview Data" tab
- Export: Download your dataset in CSV, Excel, or Stata format
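The missing-data and export steps above can be sketched with pandas as follows. This is an illustrative outline under stated assumptions rather than the application's actual code: `add_missing` and `export_bytes` are hypothetical helper names, and missing values are placed completely at random across all cells.

```python
import io

import numpy as np
import pandas as pd


def add_missing(df, fraction, seed=None):
    """Return a copy of df with roughly `fraction` of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    # True marks a cell that will be blanked out.
    holes = rng.random(df.shape) < fraction
    return df.mask(holes)


def export_bytes(df, fmt):
    """Serialize df to bytes as CSV, Excel, or Stata DTA."""
    if fmt == "csv":
        return df.to_csv(index=False).encode("utf-8")
    buffer = io.BytesIO()
    if fmt == "excel":
        df.to_excel(buffer, index=False)  # requires openpyxl
    elif fmt == "stata":
        df.to_stata(buffer, write_index=False)
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return buffer.getvalue()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
    with_gaps = add_missing(df, fraction=0.1, seed=0)
    print(with_gaps.isna().mean())  # share of missing values per column
    payload = export_bytes(with_gaps, "csv")
    print(len(payload), "bytes of CSV")
```

Returning bytes rather than writing to disk fits a download workflow, since Streamlit's `st.download_button` accepts bytes as its `data` argument. Note that Stata export places restrictions on column names and on missing values in text columns, so the sketch uses numeric columns only.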
Synthetic-Data-Generator/
├── fake_data_generator.py # Main application file
├── requirements.txt # Python dependencies
├── README.md # Project documentation
└── run.bat # Windows batch file for easy execution
For easy execution on Windows, use the provided run.bat file:
- Double-click run.bat to start the application
- The application will open in your default web browser
You can easily modify the application to:
- Add new distribution types
- Include additional personal information fields
- Change correlation methods
- Add new export formats
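As an example of the first customization, one way to make distribution types easy to extend is a small registry that maps a name to a sampling function. The sketch below is hypothetical: `SAMPLERS`, `register_sampler`, and `sample` are not names taken from `fake_data_generator.py`, and the parameter values are arbitrary defaults chosen for illustration.

```python
import numpy as np

# Hypothetical registry: distribution name -> function(rng, n) returning n samples.
SAMPLERS = {
    "Normal": lambda rng, n: rng.normal(0.0, 1.0, n),
    "Uniform": lambda rng, n: rng.uniform(0.0, 1.0, n),
    "Exponential": lambda rng, n: rng.exponential(1.0, n),
    "Lognormal": lambda rng, n: rng.lognormal(0.0, 1.0, n),
}


def register_sampler(name, sampler):
    """Add a new distribution type; refuse to overwrite an existing one."""
    if name in SAMPLERS:
        raise ValueError(f"distribution already registered: {name}")
    SAMPLERS[name] = sampler


def sample(name, n, seed=None):
    """Draw n values from the named distribution."""
    rng = np.random.default_rng(seed)
    return SAMPLERS[name](rng, n)


# Adding a new distribution type is then a single call.
register_sampler("Beta", lambda rng, n: rng.beta(2.0, 5.0, n))

if __name__ == "__main__":
    print(sorted(SAMPLERS))
    print(sample("Beta", 5, seed=0))
```

With this layout, the list of names offered in a sidebar selector could be built from the registry's keys, so a newly registered distribution would appear without further changes.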
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is open source and available under the MIT License.
If you encounter any issues:
- Check that all dependencies are installed
- Ensure your Python environment is properly configured
- Verify that the directory containing the streamlit executable (typically your Python environment's Scripts or bin directory) is on your system PATH
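The checks above can be partly automated. The following minimal sketch uses only the Python standard library to report which required packages cannot be imported and whether a `streamlit` command is visible on the PATH. The function names are illustrative, and the package list is copied from the requirements given earlier in this README.

```python
import importlib.util
import shutil

# Import names of the packages this README lists as requirements.
REQUIRED = ["streamlit", "faker", "pandas", "numpy", "scipy", "openpyxl", "matplotlib"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def streamlit_on_path():
    """Return True if a `streamlit` executable is found on the PATH."""
    return shutil.which("streamlit") is not None


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Try: pip install -r requirements.txt")
    else:
        print("All required packages are importable.")
    if not streamlit_on_path():
        print("The streamlit command was not found on your PATH.")
```

Run it with the same Python interpreter you use to start the application, since a different interpreter may have a different set of installed packages.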
- Built with Streamlit
- Uses Faker for realistic fake data generation
- Uses pandas for data manipulation and export
Note: This tool generates synthetic data for testing and development purposes only. Always ensure compliance with data protection regulations when working with personal information.