2016-03-06 9 views
2

Text analysis and clustering for small text

I have a set of programming skills that I would like to preprocess/clean and then group into a few more general categories.

  • What can I do to clean the text? Examples from the dataset below: Visual C++ and C++ should count as the same, and so should Yii and Yii Framework.
  • Is there a lexicon or ontology for programming/software engineering and project management that could help me classify the items below into more abstract categories?

Here is my dataset:

C++ 
C 
CAE 
Programming 
Matlab 
Simulations 
Finite Element Analysis 
Software Engineering 
Algorithms 
Linux 
Software Development 
Python 
Engineering 
CAD 
Numerical Analysis 
Fortran 
Java 
Mechanical Engineering 
C/C++ STL 
CFD 
Optimization 
ANSYS 
AutoCAD 
LaTeX 
ANSA 
Eclipse 
HTML 
Machine Learning 
Software Design 
SQL 
UML 
Abaqus 
C# 
MySQL 
Aerodynamics 
Catia 
JavaScript 
PHP 
Microsoft Office 
Nastran 
OpenGL 
Stress Analysis 
CSS 
Qt 
Modeling 
Structural Analysis 
Computational Geometry 
Fluid Mechanics 
Mathematica 
Parallel Computing 
Visual Studio 
XML 
CATIA 
Computational Mechanics 
D 
Fluid Dynamics 
LS-DYNA 
NetBeans 
Object Oriented Design 
Objective-C 
R&amp 
Windows 
Composites 
Computer Science 
Customer Service 
Inventor 
Manufacturing 
Operating Systems 
Parallel Programming 
Pro Engineer 
Research 
Solidworks 
Business Strategy 
Crashworthiness 
jQuery 
Management 
Microsoft Excel 
OpenFOAM 
Pattern Recognition 
Shell Scripting 
TCP/IP 
Vim 
?TA 
Android Development 
Autodesk Inventor 
Automotive Engineering 
Blender 
CFX 
Databases 
Git 
Joomla 
Mathematics 
Microsoft Visual Studio... 
NVH 
Optimizations 
Photoshop 
PostgreSQL 
Product Design 
Product Development 
Scripting 
Solid Mechanics 
Subversion 
Unix 
Web Development 
Analysis 
Artificial Intelligence 
Automotive 
Business Development 
C 
CAD/CAM 
CUDA 
Data Analysis 
Data Mining 
Electrical Engineering 
Engineering Design 
Engineering Management 
Heat Transfer 
High Performance... 
Hypermesh 
Image Processing 
Java Enterprise Edition 
Mercurial 
mETA 
Microsoft SQL Server 
Microsoft Word 
MPI 
Multithreading 
Negotiation 
New Business Development 
OpenMP 
Perl 
PowerPoint 
Project Engineering 
Project Management 
Prolog 
Pthreads 
Robotics 
Simulation 
SolidWorks 
Thermodynamics 
Visual Basic 
3D Studio Max 
Accounting 
Agile Methodologies 
ANSA/mETA 
Ant 
Apache 
AutoCAD Mechanical 
Biomechanics 
Biomedical Engineering 
Budgets 
Business Analysis 
Computer Vision 
Corporate Communications 
CVS 
Delphi 
Design Patterns 
Dynamics 
EJB 
Embedded C 
Embedded Systems 
Energy 
English 
Financial Analysis 
Fortran 95 
Genetic Algorithms 
Haskell 
Hibernate 
HTML 5 
iOS development 
JPA 
JSP 
JUnit 
Marketing Strategy 
Materials Science 
Meshing 
Meta 
MongoDB 
Multithreaded... 
Network Programming 
Neural Networks 
Numerical Simulation 
OOP 
Parallel Algorithms 
Parallel Processing 
Piping 
Post Processing 
Powertrain 
Presentations 
Public Relations 
Radioss 
Sales 
Scientific Computing 
Scrum 
SOAP 
Software Project... 
Solid Edge 
Spring 
Star-CCM+ 
Strategic Planning 
Teaching 
Team Leadership 
Template Metaprogramming 
Test Driven Development 
Ubuntu 
Unigraphics 
Unit Testing 
Vehicles 
Visual C++ 
Web Applications 
Web Services 
Weblogic 
Wireshark 
WordPress 
.NET 
?TA Post-processor 
?TA Post Processor 
?ヤチ 
3D 
3D Modeling 
Account Management 
Account Reconciliation 
Accounts Payable 
Acoustics 
Active Directory 
Adjoint Optimization 
Aerospace 
Agile Project Management 
Algorithm Development 
Analog Photography 
Android 
AngularJS 
ANSA Pre-processor 
ANSA/META 
ANSY CFX 
ANSYS FLUENT 
Apache HTTP Server 
Applied Mathematics 
Approximation Algorithms 
Architecture 
ARM Assembly 
Artificial Neural... 
ASME 
Assembly Language 
Astrophysics 
Automotive Design 
AVL Boost 
B2B 
Balanced Scorecard 
BEM 
Benchmarking 
Bind 
Biomaterials 
boost 
Boost C++ 
Business Coaching 
C/C++ 
C++ Builder 
C++ Language 
CAD/CAM Software 
CAE Process Automation 
Carbon Fiber 
Casting 
CATIA V5 
CATIA, CFD, ANSA, ?TA 
Channel Partners 
Characterization 
Cilk 
Civil Engineering 
ClearCase 
ClearQuest 
Cluster 
Cluster Development 
CNC 
Coaching 
Cocoa 
Combustion 
Company Presentations 
Competitive Analysis 
Compiler Construction 
Compression Algorithms 
Computation Geometry 
Computational Physics 
Computer Graphics 
Computer Repair 
Consecutive... 
Constitutive Modeling 
Corel 
Corporate Identity 
Corporate Sales... 
Crash 
Crisis Communications 
CRM 
CSS3 
Data Acquisition 
Data Exchange 
Data Management 
Data Privacy 
Database Administration 
Database Design 
Decision Support 
Digital Photography 
Direct Sales 
DirectX 
Discrete Mathematics 
Distributed Systems 
Domain Specific... 
Driving License 
Dynamic Programming 
Dynamical Systems 
Economics 
ECU manager- MoTeC 
Editing 
Education 
Electronic Engineering 
Electronics 
Emacs 
Embedded Software 
Employee Training 
Energy Derivatives 
Engine bench data... 
Engine calibration 
Engine Modelling 
Engine Performance 
Engineering Analysis 
Entrepreneurship 
Ergonomics 
ERP 
Event Management 
Event Planning 
Evolutionary Algorithms 
Evolutionary Computation 
Experimentation 
Fatigue Analysis 
FEM analysis 
Financial Reporting 
Finite Cell Method 
Fixed Assets 
Fluid-Structure... 
Functional Programming 
General Ledger 
Generative Programming 
Glade 
Glassfish 
GLSL 
GNU Make 
GPU Computing 
GPU Programming 
Graphics 
Grid Generation 
GT-Power 
GUI development 
Hadoop 
Human-computer... 
Human Factors 
Human Factors... 
Illustrator 
Image Segmentation 
Information Architecture 
Information Systems 
Informix 
Integration 
Internal Communications 
International Business 
International Sales 
Interpreting 
Isogeometric analysis 
Italian languages 
J2EE 
Java RMI 
JavaSE 
JBoss 
JBoss Application Server 
JDBC 
JMS 
jQuery Mobile 
JSON 
JT Open Toolkit 
Kanban 
KDevelop 
Key Account Management 
Kinematics 
Language Services 
Latex 
Lex 
Lightroom 
Linear Algebra 
Linguistics 
Linux server... 
Linux System... 
Localization 
Machine Embroidery 
Machining 
Management Consulting 
MapReduce 
Market Analysis 
Market Research 
Marketing Communications 
Materials Testing 
Mathematical Modeling 
Mathematical Programming 
MATLAB 
Maven 
Mechanical Behavior of... 
Mechanical Testing 
Mechanism Design 
Media Relations 
Medical Devices 
Medical Translation 
Mesh Generation 
MetaPost 
Microcontrollers 
Microscopy 
Microsoft Windows 
Microstructure 
Mobile Application... 
ModeFrontier 
Monte Carlo 
Moodle 
Morphing 
Motion Analysis 
MSC.Patran 
Nanoindentation 
NASTRAN 
Network Administration 
Network Simulator 
Node.js 
NoSQL 
OGRE 
Online Gaming 
Open Source 
openACC 
openCL 
OpenCL 
OpenCV 
Optical Microscopy 
Outlook 
Pamcrash 
Paraview 
pascal 
Pascal 
Patents 
Pedestrian Safety 
Performance Management 
pFEM 
phpMyAdmin 
Physical Modeling 
Physics 
PL/SQL 
Plasticity 
Plex 
Polymers 
POSIX Threads 
Pre-sales 
Presentation Skills 
Press Releases 
Pressure Vessels 
PRINCE2 
Problem Solving 
Process Improvement 
Product Management 
Product Marketing 
Program Management 
Programming Languages 
Project Planning 
Prototyping 
PTC Creo 
Public Speaking 
qt 
Qt Creator 
Quantum Mechanics 
Quartz 
RadTherm 
Requirements Analysis 
REST 
Revenue Recognition 
Reverse Engineering 
RPAS 
Safety 
Sales Management 
SAP2000 
Scanning Electron... 
Scheme 
Science 
Scientific Visualization 
SDL Trados 
SEM 
Sensitivity Analysis 
Servlets 
Shape Analysis 
Shape Recognition 
Shape Registration 
Shared Memory 
Signal Processing 
Simulation Software 
SImulations 
Simulink 
SIP 
Skilled Negotiator 
SNMP 
Social Media 
Social Networking 
Socket Programming 
Sockets 
Software Architectural... 
Software Documentation 
Software Quality... 
Software Testing 
Solaris 
SpaceClaim 
Spectroscopy 
Spring Data 
Spring Framework 
Spring MVC 
SQL Server 
Squeak and Rattle 
STAR-CD 
Star CCM+ 
STAR CCM+ 
Start-ups 
Statistics 
Steel Design 
Steel Structures 
STEP ISO 10303 
stl 
STL 
Strategic Alliances 
Strategy 
Structural Optimization 
Struts 
Struts2 
Subtitling 
Swing 
System Administration 
Tax Advisory 
TCL 
Tcl-Tk 
Team Building 
Team Management 
Teamwork 
Technical Translation 
Technical Writing 
Telecommunications 
Tenrox 
Testing 
Time Series Analysis 
Tomcat 
Tortoise SVN 
TR-069 
Track testing 
Trados 
Translation 
Tribology 
Turbomachinery 
Turbulence 
Turbulence Modeling 
Typo3 
Unity3D 
Unix Shell Scripting 
User Experience 
User Interface Design 
Vaadin 
VBA 
Vehicle Dynamics 
VHDL 
Virtual Reality 
Visual C# 
VTK 
Web Design 
Website Localization 
Websphere 
WebSphere 
WebSphere Application... 
WebSphere MQ 
Weka 
Widgets 
win32 
Windows 7 
Windows Azure 
Windows Server 
Wordfast 
Wordpress 
Workflow Reference Model 
wxWidgets 
XQuery 
XSLT 
Yacc 
Yii 
Yii Framework 
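
Many entries in the list above differ only in case, spacing, or hyphenation (for example Star-CCM+ / Star CCM+ / STAR CCM+, stl / STL, Latex / LaTeX). A minimal sketch of that first cleaning pass, assuming a lowercase key with spaces and hyphens removed is good enough to detect such variants; the function names `normalize` and `group_variants` are illustrative only:

```python
import re
from collections import defaultdict

def normalize(skill: str) -> str:
    """Build a comparison key: lowercase, with whitespace and hyphens removed."""
    key = skill.strip().lower()
    # '+' and '#' are kept on purpose so that C, C++ and C# stay distinct.
    return re.sub(r"[\s\-]+", "", key)

def group_variants(skills):
    """Group raw skill strings that share the same normalized key."""
    groups = defaultdict(list)
    for skill in skills:
        groups[normalize(skill)].append(skill.strip())
    return dict(groups)

sample = ["Star-CCM+ ", "Star CCM+ ", "STAR CCM+ ", "stl ", "STL ",
          "Latex ", "LaTeX ", "C++ ", "C "]
variants = group_variants(sample)
for key, members in variants.items():
    print(key, "<-", members)
```

This only merges spelling variants. Treating Visual C++ as C++ needs either an alias table or a containment rule, since the two strings are genuinely different.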
+1

Requests for off-site resources are off-topic here. – jonrsharpe

+0

How about a simple script that clusters on substrings? – ritesht93
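
A rough sketch of what such a substring script could look like, assuming matching is done on whole words rather than raw characters (so that C does not swallow CAD or CATIA). The helper names are hypothetical, and each skill is attached to the shortest already-seen skill whose words it contains:

```python
def tokens(skill):
    """Lowercased words of a skill, as a tuple."""
    return tuple(skill.lower().split())

def contains(long, short):
    """True if the word sequence `short` occurs contiguously inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def cluster_by_substring(skills):
    """Attach each skill to the shortest skill whose words it contains."""
    # Process shorter skills first so they become the cluster heads.
    unique = sorted({tokens(s) for s in skills if s.strip()},
                    key=lambda t: (len(t), t))
    clusters = {}
    for skill in unique:
        head = next((h for h in clusters if contains(skill, h)), None)
        if head is None:
            clusters[skill] = [skill]
        else:
            clusters[head].append(skill)
    return {" ".join(h): [" ".join(m) for m in members]
            for h, members in clusters.items()}

sample = ["Yii", "Yii Framework", "C++", "Visual C++", "Boost C++", "C",
          "CAD", "Spring", "Spring MVC", "Spring Framework"]
clusters = cluster_by_substring(sample)
for head, members in clusters.items():
    print(head, "<-", members)
```

On the full list this would over-merge around generic one-word skills such as Analysis or Engineering, so a stop list for those words would probably be needed.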

+1

What have you tried? The simplest bag of words with 2-gram support and a limit on the number of terms might work. word2vec may also be useful. – Lol4t0
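
To make the bag-of-words suggestion concrete, here is a small standard-library sketch: each skill becomes a count of its words plus its word 2-grams, and two skills are compared by cosine similarity. The function names are illustrative, there is no term limit or weighting, and word2vec is not covered:

```python
import math
from collections import Counter

def features(skill):
    """Bag of words plus word 2-grams for one skill string."""
    words = skill.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return Counter(words + bigrams)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(skill, candidates):
    """Return the candidate with the highest cosine similarity to `skill`."""
    target = features(skill)
    return max(candidates, key=lambda c: cosine(target, features(c)))

pool = ["Parallel Programming", "Python", "Stress Analysis"]
print(most_similar("Parallel Computing", pool))
```

Because most skills are only one to three words long, the vectors are very sparse. Single-word items such as Python or Fortran share no words with anything else, so they would need the lexicon approach or embeddings instead.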

Answer

1

There are two ways to clean and categorize the dataset:

  1. Manually.
  2. Using a text-extraction API that gives you some idea of the hierarchy. You can use AlchemyAPI, TextMiner, etc. to see which terms get grouped together. This will not give you an exact classification, but it will give you a general picture of the categories.
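
For the manual route, one possible shape is a small hand-built lexicon that maps known skills to broader categories. Everything in this sketch is an assumption for illustration: the category names and their members are hypothetical and would have to be curated by hand for the real dataset.

```python
# Hypothetical hand-built lexicon; category names and members are examples only.
CATEGORIES = {
    "Programming languages": {"c", "c++", "python", "java", "fortran", "haskell"},
    "CAE/simulation tools": {"ansys", "abaqus", "nastran", "ls-dyna", "openfoam"},
    "Project management": {"scrum", "kanban", "prince2", "project management"},
}

def categorize(skill):
    """Look up a skill in the lexicon, ignoring case and surrounding spaces."""
    key = skill.strip().lower()
    for category, members in CATEGORIES.items():
        if key in members:
            return category
    return "Uncategorized"

for skill in ["NASTRAN ", "Scrum", "Fortran", "Machine Embroidery"]:
    print(skill.strip(), "->", categorize(skill))
```

Anything that falls into "Uncategorized" can then be reviewed by hand and added to the lexicon, which is usually feasible for a list of a few hundred items.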
