How to profile Datasets of Amorphic?
info
- Follow the steps mentioned below.
- Total time taken for this task: 20 Minutes.
- Pre-requisites: User registration is completed, logged in to Amorphic and role switched
Create Dataset with data profiling option
- Click on 'DATASETS' --> 'Datasets' from left navigation-bar.
- Click on ➕ icon at the top right corner.
- Enter the following information.
{
  "Dataset Name": "covid19_daily_us_profile_<your_userid>"
  "Description": "Covid-19 daily report of US as of 5/31/2021. The target location is Redshift."
  "Domain": "workshop(workshop)"
  "Data Classifications":
  "Keywords": "Retail"
  "Connection Type": "API (default)"
  "File Type": "csv"
  "Target Location": "Redshift"
  "Update Method": "Append"
   "My Data Files Have Headers": "Yes"
  "Custom Delimiter": ","
  "Enable Malware Detection": "No"
  "Enable Data Profiling": "Yes"
}
- Click on 'Register' button at the bottom to move to the next step.
- Click on the following CSV file to download it to your computer.
- Click on 'Click to upload' to upload the covid-19 csv file.
- Click on 'Extract Schema' as shown below.
- You will get a message 'File uploaded successfully'. Click OK.
- A new screen will appear with the schema extracted as shown below.
- Verify the columns and data types.
- Change the 'Sort Key Type' to None.
- Click on 'Publish Dataset'. You will get a 'completed the registration process successfully' message. Click OK.
- Click on 'Files' tab. Upload the same covid-19 csv file again.
- Data profiling jobs run at 12 AM UTC everyday.
- Under the 'Profile' tab, you will see 'Schema & Data Profile'. If the data profiling option is disabled, you'd see 'Schema' only. Data profile is updated in this section as shown below.
Enable data profiling for existing datasets.
- You may enable data profiling for existing datasets with target as Redshift or S3Athena. S3 datasets cannot be profiled.
- Data profiling is enabled by clicking 'edit' ✏️ icon and change 'Enable Data Profiling' as 'Yes'.
tip
- How long does it take for all data profiles to get updated?- Lets say there are 100 datasets to be profiled, and each dataset takes about 3 minutes (depends on the dataset size) for the data profile to be updated. With a concurrency factor of 5, the total time taken will be 20*3min = 60min.
 
- What happens in case of failures?- In cases where a data profile fails to be extracted the error is displayed on the profile tab, and an email alert is sent to the subscribed user.