-
Notifications
You must be signed in to change notification settings - Fork 0
/
devices_multiple.py
169 lines (111 loc) · 6.34 KB
/
devices_multiple.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
The strong disparity in multiple_devices and multiple_IPs between fraud and non-fraud cases is a valuable insight, and you can do more than just compare proportions. Here’s a deeper analysis and modeling approach to make the most of this insight:
1. Analyze the Interaction Between multiple_devices and multiple_IPs
Examine whether the combination of multiple_devices and multiple_IPs creates patterns that differentiate fraud cases more effectively than either feature alone.
a. Cross-Tabulation
Create a cross-tabulation to see how the combination of multiple_devices and multiple_IPs aligns with fraud.
# Create a new feature combining multiple_devices and multiple_IPs
df['multi_device_ip'] = df['multiple_devices'].astype(str) + "_" + df['multiple_IPs'].astype(str)
# Cross-tabulation
cross_tab = pd.crosstab(df['multi_device_ip'], df['target'])
cross_tab['fraud_ratio'] = cross_tab[1] / (cross_tab[0] + cross_tab[1])
print(cross_tab)
# Visualize the fraud ratios for each combination
cross_tab['fraud_ratio'].plot(kind='bar', figsize=(12, 6), title='Fraud Ratio by Multiple Device-IP Combination')
plt.xlabel('Multiple Devices - Multiple IPs')
plt.ylabel('Fraud Ratio')
plt.show()
Interpretation:
• Combinations with high fraud ratios may indicate high-risk behavior.
• Use these combinations as new categorical features for models.
2. Threshold-Based Analysis
Quantify how the number of devices or IPs affects the likelihood of fraud.
a. Threshold-Based Binning
Bin the number of unique devices and IPs into thresholds and calculate the fraud ratio for each bin.
# Create bins for unique devices and IPs
df['device_bins'] = pd.cut(df['unique_devices'], bins=[0, 1, 2, 5, 10, np.inf], labels=['1', '2', '3-5', '6-10', '10+'])
df['ip_bins'] = pd.cut(df['unique_IPs'], bins=[0, 1, 2, 5, 10, np.inf], labels=['1', '2', '3-5', '6-10', '10+'])
# Calculate fraud ratios for each bin
device_fraud_ratio = df.groupby('device_bins')['target'].mean()
ip_fraud_ratio = df.groupby('ip_bins')['target'].mean()
# Plot fraud ratios
device_fraud_ratio.plot(kind='bar', title='Fraud Ratio by Device Count Bins', figsize=(10, 5))
plt.xlabel('Device Count Bins')
plt.ylabel('Fraud Ratio')
plt.show()
ip_fraud_ratio.plot(kind='bar', title='Fraud Ratio by IP Count Bins', figsize=(10, 5))
plt.xlabel('IP Count Bins')
plt.ylabel('Fraud Ratio')
plt.show()
Interpretation:
• Identify thresholds where the likelihood of fraud sharply increases.
• Use these thresholds to create new features (e.g., high_risk_device for customers with >5 devices).
3. Temporal Analysis of multiple_devices and multiple_IPs
Investigate whether changes in devices or IPs over time correlate with fraud.
a. Rate of Change
Calculate how frequently devices or IPs change over time.
# Calculate time differences for each customer
df['time_diff'] = df.groupby('customer')['_time'].diff().dt.total_seconds()
# Calculate rates of device/IP changes
df['device_change_rate'] = df.groupby('customer')['device'].transform('nunique') / df.groupby('customer')['time_diff'].transform('sum')
df['ip_change_rate'] = df.groupby('customer')['IP'].transform('nunique') / df.groupby('customer')['time_diff'].transform('sum')
# Compare rates for fraud vs. non-fraud
rate_comparison = df.groupby('target')[['device_change_rate', 'ip_change_rate']].mean()
print(rate_comparison)
Interpretation:
• Higher rates of device or IP changes may indicate fraud.
• Use these rates as additional features in a model.
4. Network Analysis
Treat device and IP as nodes in a network to uncover relationships between entities.
a. Bipartite Graph
Create a bipartite graph with customers, devices, and IPs to identify fraud-prone connections.
import networkx as nx
# Create a bipartite graph
B = nx.Graph()
# Add nodes for customers, devices, and IPs
B.add_nodes_from(df['customer'].unique(), bipartite=0) # Customers
B.add_nodes_from(df['device'].unique(), bipartite=1) # Devices
B.add_nodes_from(df['IP'].unique(), bipartite=1) # IPs
# Add edges
for _, row in df.iterrows():
B.add_edge(row['customer'], row['device'])
B.add_edge(row['customer'], row['IP'])
# Project to customer space
customer_graph = nx.bipartite.projected_graph(B, df['customer'].unique())
# Analyze centrality (fraud-prone customers may have higher centrality)
centrality = nx.degree_centrality(customer_graph)
df['centrality'] = df['customer'].map(centrality)
print(df.groupby('target')['centrality'].mean())
Interpretation:
• Customers with higher centrality may have connections to many devices or IPs, indicating potential fraud.
• Use centrality as a new feature for fraud detection.
5. Machine Learning with Interaction Terms
Combine these insights into a predictive model by creating interaction terms and engineered features.
a. Create New Features
# Interaction terms
df['devices_ips_interaction'] = df['unique_devices'] * df['unique_IPs']
df['high_risk_combination'] = ((df['unique_devices'] > 3) & (df['unique_IPs'] > 3)).astype(int)
# Train a Random Forest model
X = df[['unique_devices', 'unique_IPs', 'devices_ips_interaction', 'high_risk_combination']]
y = df['target']
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Feature Importance
importances = model.feature_importances_
print(pd.DataFrame({'Feature': X.columns, 'Importance': importances}).sort_values(by='Importance', ascending=False))
6. Anomaly Detection with High-Risk Behavior
Focus on outliers within multiple_devices and multiple_IPs.
a. Isolation Forest
from sklearn.ensemble import IsolationForest
# Fit Isolation Forest
iso = IsolationForest(contamination=0.1, random_state=42)
df['anomaly_score'] = iso.fit_predict(X)
# Compare fraud vs. non-fraud among anomalies
print(df.groupby(['target', 'anomaly_score']).size())
Interpretation:
• Fraud cases classified as anomalies confirm that behavior involving multiple devices or IPs is unusual and high-risk.
By exploring these deeper analyses:
• You’ll uncover high-risk behaviors in how devices and IPs interact with fraud.
• Insights like interaction terms, change rates, and network centrality will help you build more precise models and rules to detect fraud effectively. Let me know if you’d like a specific implementation!